Add initial Prometheus metrics server to runner manager
What does this MR do?
This adds a new metrics_server
config option. If set to a listening address, a Prometheus HTTP server is started with the following metrics:
- runner business logic metrics (currently only number of currently running builds)
- build version information
- Go process metrics (GC stats, goroutines, memstats, ...)
- general process metrics (FDs, memory, cpu usage, ...)
A full example /metrics
output can be found here: https://gist.github.com/juliusv/5a1972875e9d80654c63d222c480f87e
This depends on the data race fix in https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/merge_requests/352, so it's based ontop of that.
The approach to integrating the metrics_server
configuration option was taken from https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/merge_requests/219.
Why was this MR needed?
We need better insight into the entire CI system, and currently we have no white-box metrics.
Are there points in the code the reviewer needs to double check?
General metrics approach
There were multiple different ways of approaching this. The more "traditional" Prometheus style is to just register all metrics from anywhere in the codebase in the global default Prometheus registry (similar to how logging is often treated as a global concern without dependency-injecting a logger; and analogous to using http.Handle(...)
to register a handler with the global HTTP mux).
However, from reading the rest of this codebase, it seemed to me like an approach with explicit dependencies and less global state would be preferred by the authors. So I'm instantiating a custom Prometheus metrics registry and registering the Go and process metrics collectors with it manually (those are included automatically in the default metrics registry). I also am using the prometheus.Collector
interface to encapsulate the metrics of a type, instead of the simpler approach of just defining global metrics. For an example of the latter, see the example at the top of https://godoc.org/github.com/prometheus/client_golang/prometheus. Generally, this all leads to somewhat more code, but more explicit dependencies and encapsulation.
Configuration reloading
Like in https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/merge_requests/219, the metrics HTTP server does not allow for reconfiguration. In Go, it's also not trivial to stop an already running HTTP server. So right now, only whatever is configured at process startup takes effect. Is that ok? Should it be a command-line flag instead?
Please let me know if this approach is generally going in the right direction. I know there's many other metrics we will want to add over time, but maybe let's start out simple?
Does this MR meet the acceptance criteria?
-
Documentation created/updated - Tests
-
Added for this feature/bug -
All builds are passing
-
-
Branch has no merge conflicts with master
(if you do - rebase it please)
What are the relevant issue numbers?
Merge request reports
Activity
Added 2 commits:
- 2e75f620...4362d0a0 - 2 commits from branch
gitlab-org:master
- 2e75f620...4362d0a0 - 2 commits from branch
Added 4 commits:
- 98968b3b - Fix data races around runner health and build stats
- 226739f1 - Update Prometheus client library vendoring
- adc63d4a - Add Prometheus metrics HTTP server
- 1050bb66 - Add version info Prometheus metric
Toggle commit listAdded 28 commits:
- 1050bb66...c73f6889 - 25 commits from branch
gitlab-org:master
- 56b621f4 - Update Prometheus client library vendoring
- b690b13e - Add Prometheus metrics HTTP server
- 011f756e - Add version info Prometheus metric
Toggle commit list- 1050bb66...c73f6889 - 25 commits from branch
Mentioned in issue gitlab-com/infrastructure#543 (closed)
@ayufan can you or someone in the team comment?
Could you review that?
@juliusv Generally it looks good.
So right now, only whatever is configured at process startup takes effect. Is that ok? Should it be a command-line flag instead?
I think it'll be OK for now. We can think how to make HTTP server able to reload in the future.
Generally, this all leads to somewhat more code, but more explicit dependencies and encapsulation.
I agree with such approach :)
@tmaczukin Excellent, thanks. Anything else to change, or should I remove the WIP and get it merged?
@juliusv We should update configuration docs and add a basic documentation for metrics server and the I think we can merge this. After this we would be able to prepare a RC for 1.8, deploy it to our internal runners and add the runners to our Prometheus instance to check how it's working.
@tmaczukin Thanks, I'm adding documentation now, inspired again by https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/merge_requests/219.
Added 1 commit:
- 24aabf0b - Add documentation about Prometheus HTTP server
Added 1 commit:
- 44d6646c - Add documentation about Prometheus HTTP server
@tmaczukin I added documentation and removed the WIP marker!
Added 1 commit:
- ad7f3f0f - Add documentation about Prometheus HTTP server
@tmaczukin Hm, the build pipeline has been stuck in "Pending" state after the successful Prebuild step for about an hour now: https://gitlab.com/juliusv/gitlab-ci-multi-runner/pipelines/4775447 - did I manage to break it by force-pushing too much or something?
@juliusv no it's most likely the problem with sidekiq builds queuing. Can you restart the
unit tests
build (https://gitlab.com/juliusv/gitlab-ci-multi-runner/builds/5602039)? It fails randomly and this one doesn't seems to be related with your changes😉