Prometheus metamonitoring rules
There are still two concerns here before removing the WIP:

- The `PrometheusManyRestarts` alert would actually be firing right now because, for some weird reason, the GitLab Prometheus servers record slight temporal deviations of their own startup time, without actually restarting or showing a different value on their `/metrics` page. See https://prometheus.gitlab.com/graph?g0.range_input=1h&g0.expr=process_start_time_seconds%7Bjob%3D%22prometheus%22%7D&g0.tab=0 and hover over the series over time. I'm not sure what causes this, and I'm even wondering whether it's a chunk delta encoding bug. That's unlikely, though, given the amount of fuzz tests we have for that, but I'll have to look into it more deeply (see the query sketch after this list).
- The current Alertmanager config cannot cope nicely when there is more than one alert output vector element with a different title or description. That is sad, because that is part of what makes Prometheus alerting so powerful. We'll either have to change the Alertmanager templates to deal correctly with that, or downgrade the alerting rules here to meet those expectations.
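To make the first concern concrete, here is one way to check from the expression browser whether the recorded start time is actually moving. This is only a sketch: the `job="prometheus"` selector matches the link above, and the one-hour window is illustrative.

```
# Number of times the reported start time changed per Prometheus server over
# the last hour; anything above 0 without a real restart is the jitter
# described above.
changes(process_start_time_seconds{job="prometheus"}[1h])
```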
@juliusv Processes lying about their age already?
@pcarranza :D Yeah, it seems that we can't read `/proc/<pid>/stat` live on every scrape on some kernels / machines, but should cache the result so that we don't pick up the ever-so-slight changes. That'll require Prometheus client library changes, but I think I can work around it for now with an extra recording rule that rounds the last digits of the start time away (one possible form is sketched below).

I've now worked around the incorrectly reported `process_start_time_seconds` by reformulating the crashloop alert, while still roughly keeping its meaning.

@bjk-gitlab The remaining question is what we should do about the notification grouping: put a new receiver / notification template into the Alertmanager config that deals correctly with multiple alert elements and different annotations (my preference), or aggregate the alert elements here enough to lose any distinguishing information and use the existing templates.
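For reference, the rounding workaround mentioned above could look roughly like this in the pre-2.0 rule syntax. Rule names and thresholds are illustrative sketches, not the rules that actually landed in this MR.

```
# Hypothetical rule names and thresholds, shown only to illustrate the idea.

# Round the reported start time to the nearest 10 seconds so that tiny jitter
# in the /proc/<pid>/stat reading does not look like a restart.
instance:process_start_time_seconds:round10 = round(process_start_time_seconds{job="prometheus"}, 10)

# Crashloop alert evaluated on the rounded series.
ALERT PrometheusManyRestarts
  IF changes(instance:process_start_time_seconds:round10[30m]) > 2
  FOR 10m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    title = "Prometheus {{ $labels.instance }} is restarting repeatedly",
    description = "Prometheus on {{ $labels.instance }} restarted {{ $value }} times in the last 30 minutes."
  }
```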
Again, the problem with the current notification templates is here: https://gitlab.com/gitlab-cookbooks/gitlab-prometheus/blob/master/templates/default/alertmanager.yml.erb#L95-98 (they rely on the `title` and `description` annotations being the same for all alerts in the same routing group).

@juliusv I would opt for plan A: deal correctly with multiple alert elements. I'd rather not push the complexity into the alerts themselves.
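For context, the pattern @juliusv points to above boils down to something like the following (paraphrased, not the exact contents of alertmanager.yml.erb): the Slack receiver builds the whole notification from annotations that are assumed to be identical across the group.

```
# Paraphrased illustration of the problematic pattern, not the actual cookbook template.
receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        # CommonAnnotations only keeps annotations whose values are identical across
        # ALL alerts in the group; title/description silently disappear from it as
        # soon as two grouped alerts differ.
        title: '{{ .CommonAnnotations.title }}'
        text: '{{ .CommonAnnotations.description }}'
```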
@juliusv @bjk-gitlab @pcarranza Same situation here: https://gitlab.com/gitlab-com/infrastructure/issues/1436. Annotations do not expand correctly because of multiple alerts.

I don't know whether it is possible, but can we just iterate over all alerts in the template? In the particular case of low disk space, it could be something like this:

```
{{ range query "alerts" }} {{ .Alert.Label.instance }} - {{ .Alert.Value }} {{ end }}
```
Then preparing a correct and working `title` and `description` becomes entirely the problem of the person creating the alert. When I created this mechanism for alerts, my aim was to make creating alerts as simple as possible and to avoid having to prepare MRs for several projects. Now we only change the `runbooks` project and don't touch the `gitlab-prometheus` cookbook.

@maratkalibek That sounds like the best idea IMO! Should I do an MR for the Alertmanager side as well?
Yes, please. Actually, I wanted to do it that way before, but I had little experience with Go templates at the time.

But if I understand correctly, the changes are only on the alert side: the alert side prepares an annotation whose template iterates over all alerts. When I was creating that mechanism, my problem was how to iterate over the values; I could only get a single value per alert, `.Value`.

@maratkalibek Hmmm, that's not exactly how it's meant to work, but now you've made me think it does require some rethinking on our side to squeeze it all into a single notification. Normally you would not do a separate query from the alerting rule template to look up the other values. Instead, it goes like this:
- You define `LABELS` and `ANNOTATIONS` templates in an alerting rule that give information about exactly one alert vector element.
- Alertmanager will group a whole list of related alerts into a single notification.
- While doing so, the Alertmanager notification template has access to N `title` and `description` annotations, one for each concrete alert. (It also has access to common annotations, but those wouldn't contain `title` or `description` anymore if they differ between alerts in the same group.)
- You would then iterate over each of these in the Alertmanager Slack notification template and show each of them somehow.

The last step wouldn't work well for us at the moment because our `title` annotations are quite long, and we probably wouldn't want to include each of them in full in a single Slack notification title. We may want to show only some of the common labels / reformulated common annotations there and then have fully expanded descriptions of all alerts in the message body. I'll play around with that a bit (a first sketch below).
added 27 commits

- c5c634b6...5ccb3a4e - 26 commits from branch `master`
- 727d2a4d - Prometheus metamonitoring rules
So! While experimenting with different Alertmanager Slack notification templates, I noticed that we are actually currently grouping by `['alertname', 'job', 'instance']`: https://gitlab.com/gitlab-cookbooks/gitlab-prometheus/blob/master/templates/default/alertmanager.yml.erb#L5 (see the route sketch below).

We'll probably want to get rid of the per-instance grouping eventually, but for now it means that all the alerting rules in this MR will be totally fine with the existing notification templates, since their annotations don't fan out by anything more than those grouping labels. OK, there's one minor exception with the `interval` label in the `PrometheusScrapingSlowly` alert, but that's only in the description and unlikely to cause problems (at worst an empty body if two target pools in the same Prometheus have problems at the same time).

Also, since @bjk-gitlab has now fixed the clock issues on the Azure machines (right? The metrics look like it, at least), I reverted the crashloop alert to its original form.
Should be good to review / merge?
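For reference, the grouping in question corresponds to a route stanza along these lines (paraphrased from the linked cookbook template; everything other than the `group_by` list is illustrative):

```
route:
  # Grouping by instance as well means a notification group never mixes alerts
  # from different instances, so per-alert annotations rarely diverge today.
  group_by: ['alertname', 'job', 'instance']
  receiver: 'slack'
```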
@bjk-gitlab Oh, now a Prometheus node's boot time started drifting again - I assume you haven't applied the timer fix to all machines yet. So this PR is gated on https://gitlab.com/gitlab-com/infrastructure/issues/1531 being resolved.
assigned to @pcarranza
@bjk-gitlab I don't seem to have merge permission (I only have Comment and Close buttons), do you?
added 8 commits

- a2ac3f24...07f7cc57 - 7 commits from branch `master`
- 9d624aa8 - Prometheus metamonitoring rules
- Resolved by Zeger-Jan van de Weg