Gitaly error rate anomaly detection
This MR is experimental.
It will monitor all Gitaly endpoints and raise a low priority alert if any endpoint's error rate exceeds its 12-hour norm by more than two standard deviations (2 sigma, or roughly 95% confidence).
For example, it will raise an alert if PostUploadPack normally generates an average of 5 errors/s with a standard deviation of 1 error/s, but suddenly starts generating more than 7 errors/s (the mean plus two standard deviations) for a period of 5 minutes.
Additionally, since each endpoint is measured separately, if CommitDiff normally generates an average of 0.5 errors/s with a standard deviation of 0.1 errors/s, it only needs to exceed 0.7 errors/s to raise an alert.
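To make the threshold arithmetic concrete, here is a minimal Python sketch of the per-endpoint check. It is only an illustration of the 2-sigma idea, not the Prometheus rule in this MR: the function names and the sample baselines are hypothetical, and it ignores the 5-minute "for" duration.

```python
import statistics


def alert_threshold(baseline_rates, sigmas=2.0):
    """Return mean + sigmas * stddev for one endpoint's baseline window.

    baseline_rates stands in for the endpoint's error rates (errors/s)
    sampled over the 12-hour window; the name is hypothetical.
    """
    mean = statistics.mean(baseline_rates)
    stddev = statistics.pstdev(baseline_rates)  # population standard deviation
    return mean + sigmas * stddev


def is_anomalous(current_rate, baseline_rates, sigmas=2.0):
    """True when the current error rate exceeds the endpoint's own threshold."""
    return current_rate > alert_threshold(baseline_rates, sigmas)


# Made-up baselines chosen to match the examples above:
# mean 5 with stddev 1, and mean 0.5 with stddev 0.1.
post_upload_pack = [4, 6, 4, 6]
commit_diff = [0.4, 0.6, 0.4, 0.6]

print(is_anomalous(7.5, post_upload_pack))  # above the threshold of 7 errors/s
print(is_anomalous(0.75, commit_diff))      # above the threshold of ~0.7 errors/s
print(is_anomalous(0.75, post_upload_pack)) # well within normal for this endpoint
```

Because each endpoint is compared against its own baseline, the same rate of 0.75 errors/s is anomalous for CommitDiff but unremarkable for PostUploadPack.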
Inspired by https://prometheus.io/blog/2015/06/18/practical-anomaly-detection/
https://gitlab.com/gitlab-cookbooks/gitlab-prometheus/merge_requests/311 ensures that low priority Gitaly alerts, like those generated by this MR, will only raise issues in the #gitaly-alerts channel and not in the main #gitaly or #prometheus-alert channels.
I fully expect that it will need tuning, but I would rather iterate rapidly than spend extra time guessing what the right values are.
If this experiment is successful, I'd like to add more alerts using the same technique but will focus on this one first.