Add initial Prometheus metrics server to runner manager

What does this MR do?

This adds a new metrics_server config option. If set to a listening address, a Prometheus HTTP server is started with the following metrics:

  • runner business logic metrics (currently only number of currently running builds)
  • build version information
  • Go process metrics (GC stats, goroutines, memstats, ...)
  • general process metrics (FDs, memory, cpu usage, ...)

A full example /metrics output can be found here:

This depends on the data race fix in, so it's based ontop of that.

The approach to integrating the metrics_server configuration option was taken from

Why was this MR needed?

We need better insight into the entire CI system, and currently we have no white-box metrics.

Are there points in the code the reviewer needs to double check?

General metrics approach

There were multiple different ways of approaching this. The more "traditional" Prometheus style is to just register all metrics from anywhere in the codebase in the global default Prometheus registry (similar to how logging is often treated as a global concern without dependency-injecting a logger; and analogous to using http.Handle(...) to register a handler with the global HTTP mux).

However, from reading the rest of this codebase, it seemed to me like an approach with explicit dependencies and less global state would be preferred by the authors. So I'm instantiating a custom Prometheus metrics registry and registering the Go and process metrics collectors with it manually (those are included automatically in the default metrics registry). I also am using the prometheus.Collector interface to encapsulate the metrics of a type, instead of the simpler approach of just defining global metrics. For an example of the latter, see the example at the top of Generally, this all leads to somewhat more code, but more explicit dependencies and encapsulation.

Configuration reloading

Like in, the metrics HTTP server does not allow for reconfiguration. In Go, it's also not trivial to stop an already running HTTP server. So right now, only whatever is configured at process startup takes effect. Is that ok? Should it be a command-line flag instead?

Please let me know if this approach is generally going in the right direction. I know there's many other metrics we will want to add over time, but maybe let's start out simple?

Does this MR meet the acceptance criteria?

  • Documentation created/updated
  • Tests
    • Added for this feature/bug
    • All builds are passing
  • Branch has no merge conflicts with master (if you do - rebase it please)

What are the relevant issue numbers?

Merge request reports