Prometheus metamonitoring rules
There are still two concerns here before removing the WIP:

- The `PrometheusManyRestarts` alert would actually be firing right now because, for some weird reason, the GitLab Prometheus servers record slight temporal deviations of their own startup time, without actually restarting or showing a different value on their `/metrics` page. See https://prometheus.gitlab.com/graph?g0.range_input=1h&g0.expr=process_start_time_seconds%7Bjob%3D%22prometheus%22%7D&g0.tab=0 and hover over the series over time. I'm not sure what causes this, and I'm even wondering whether it's a chunk delta encoding bug. That's unlikely, though, given the amount of fuzz tests we have for that, but I'll have to look into it more deeply (see the query sketch after this list).
- The current Alertmanager config cannot cope nicely when there is more than one alert output vector element with a different title or description. That is sad, because that is part of what makes Prometheus alerting so powerful. We'll either have to change the Alertmanager templates to deal correctly with that, or downgrade the alerting rules here to meet those expectations.
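To make the first concern concrete, here is one way to check from the expression browser whether the recorded start time is actually moving. This is only a sketch: the `job="prometheus"` selector matches the link above, and the one-hour window is illustrative.

```
# Number of times the reported start time changed per Prometheus server over
# the last hour; anything above 0 without a real restart is the jitter
# described above.
changes(process_start_time_seconds{job="prometheus"}[1h])
```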
@juliusv Processes lying about their age already?
@pcarranza :D Yeah, it seems that we can't read `/proc/<pid>/stat` live on every scrape on some kernels / machines, but should cache the result so that we don't pick up the ever-so-slight changes. That'll require Prometheus client library changes, but I think I can work around it for now with an extra recording rule that rounds the last digits of the start time away (one possible form is sketched below).

I've now worked around the incorrectly reported `process_start_time_seconds` by reformulating the crashloop alert, while still roughly keeping its meaning.

@bjk-gitlab The remaining question is what we should do about the notification grouping: put a new receiver / notification template into the Alertmanager config that deals correctly with multiple alert elements and different annotations (my preference), or aggregate the alert elements here enough to lose any distinguishing information and use the existing templates.
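For reference, the rounding workaround mentioned above could look roughly like this in the pre-2.0 rule syntax. Rule names and thresholds are illustrative sketches, not the rules that actually landed in this MR.

```
# Hypothetical rule names and thresholds, shown only to illustrate the idea.

# Round the reported start time to the nearest 10 seconds so that tiny jitter
# in the /proc/<pid>/stat reading does not look like a restart.
instance:process_start_time_seconds:round10 = round(process_start_time_seconds{job="prometheus"}, 10)

# Crashloop alert evaluated on the rounded series.
ALERT PrometheusManyRestarts
  IF changes(instance:process_start_time_seconds:round10[30m]) > 2
  FOR 10m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    title = "Prometheus {{ $labels.instance }} is restarting repeatedly",
    description = "Prometheus on {{ $labels.instance }} restarted {{ $value }} times in the last 30 minutes."
  }
```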
Again, the problem with the current notification templates is here: https://gitlab.com/gitlab-cookbooks/gitlab-prometheus/blob/master/templates/default/alertmanager.yml.erb#L95-98 (they rely on the `title` and `description` annotations being the same for all alerts in the same routing group).

@juliusv I would opt for plan A: deal correctly with multiple alert elements. I'd rather not push the complexity into the alerts themselves.
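For context, the pattern @juliusv points to above boils down to something like the following (paraphrased, not the exact contents of alertmanager.yml.erb): the Slack receiver builds the whole notification from annotations that are assumed to be identical across the group.

```
# Paraphrased illustration of the problematic pattern, not the actual cookbook template.
receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        # CommonAnnotations only keeps annotations whose values are identical across
        # ALL alerts in the group; title/description silently disappear from it as
        # soon as two grouped alerts differ.
        title: '{{ .CommonAnnotations.title }}'
        text: '{{ .CommonAnnotations.description }}'
```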
@juliusv @bjk-gitlab @pcarranza Same situation here: https://gitlab.com/gitlab-com/infrastructure/issues/1436. Annotations do not expand correctly because of multiple alerts.

I don't know whether it is possible, but can we just iterate over all alerts in the template? In the particular case of low disk space, it could be something like this:

```
{{ range query "alerts" }} {{ .Alert.Label.instance }} - {{ .Alert.Value }} {{ end }}
```
Then preparing a correct and working `title` and `description` becomes entirely the problem of the person creating the alert. When I created this mechanism for alerts, my aim was to make creating alerts as simple as possible and to avoid having to prepare MRs for several projects. Now we only change the `runbooks` project and don't touch the `gitlab-prometheus` cookbook.

@maratkalibek That sounds like the best idea IMO! Should I do an MR for the Alertmanager side as well?
Yes, please. Actually, I wanted to do it that way before, but I had little experience with Go templates at the time.

But if I understand correctly, the changes are only on the alert side: the alert side prepares an annotation whose template iterates over all alerts. When I was creating that mechanism, my problem was how to iterate over the values; I could only get a single value per alert, `.Value`.

@maratkalibek Hmmm, that's not exactly how it's meant to work, but now you've made me think it does require some rethinking on our side to squeeze it all into a single notification. Normally you would not do a separate query from the alerting rule template to look up the other values. Instead, it goes like this:
- You define `LABELS` and `ANNOTATIONS` templates in an alerting rule that give information about exactly one alert vector element.
- Alertmanager will group a whole list of related alerts into a single notification.
- While doing so, the Alertmanager notification template has access to N `title` and `description` annotations, one for each concrete alert. (It also has access to common annotations, but those wouldn't contain `title` or `description` anymore if they differ between alerts in the same group.)
- You would then iterate over each of these in the Alertmanager Slack notification template and show each of them somehow.

The last step wouldn't work well for us at the moment because our `title` annotations are quite long, and we probably wouldn't want to include each of them in full in a single Slack notification title. We may want to show only some of the common labels / reformulated common annotations there and then have fully expanded descriptions of all alerts in the message body. I'll play around with that a bit (a first sketch below).
added 27 commits

- c5c634b6...5ccb3a4e - 26 commits from branch `master`
- 727d2a4d - Prometheus metamonitoring rules
So! While experimenting with different Alertmanager Slack notification templates, I noticed that we are actually currently grouping by `['alertname', 'job', 'instance']`: https://gitlab.com/gitlab-cookbooks/gitlab-prometheus/blob/master/templates/default/alertmanager.yml.erb#L5 (see the route sketch below).

We'll probably want to get rid of the per-instance grouping eventually, but for now it means that all the alerting rules in this MR will be totally fine with the existing notification templates, since their annotations don't fan out by anything more than those grouping labels. OK, there's one minor exception with the `interval` label in the `PrometheusScrapingSlowly` alert, but that's only in the description and unlikely to cause problems (at worst an empty body if two target pools in the same Prometheus have problems at the same time).

Also, since @bjk-gitlab has now fixed the clock issues on the Azure machines (right? The metrics look like it, at least), I reverted the crashloop alert to its original form.
Should be good to review / merge?
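For reference, the grouping in question corresponds to a route stanza along these lines (paraphrased from the linked cookbook template; everything other than the `group_by` list is illustrative):

```
route:
  # Grouping by instance as well means a notification group never mixes alerts
  # from different instances, so per-alert annotations rarely diverge today.
  group_by: ['alertname', 'job', 'instance']
  receiver: 'slack'
```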
@bjk-gitlab Oh, now a Prometheus node's boot time started drifting again - I assume you haven't applied the timer fix to all machines yet. So this PR is gated on https://gitlab.com/gitlab-com/infrastructure/issues/1531 being resolved.
assigned to @pcarranza
@bjk-gitlab I don't seem to have merge permission (I only have Comment and Close buttons), do you?
added 8 commits

- a2ac3f24...07f7cc57 - 7 commits from branch `master`
- 9d624aa8 - Prometheus metamonitoring rules
- Resolved by Zeger-Jan van de Weg