Gitaly error rate anomaly detection

Andrew Newdigate requested to merge gitaly_error_rate_anomaly_detection into master

This MR is experimental.

It will monitor all Gitaly endpoints and raise a low priority alert if any endpoint is raising errors at a rate that exceeds its 12-hour average by more than 2 standard deviations (2 sigma, or ~95% confidence).

For example, it will raise an alert if PostUploadPack normally generates an average of 5 errors/s, with a standard deviation of 1 error/s but suddenly starts generating errors at 7 errors/s for a period of 5 minutes.

Additionally, since each endpoint is measured separately, if CommitDiff normally generates an average of 0.5 errors/s with a standard deviation of 0.1 errors/s, it will need to generate errors at 0.7 errors/s in order to raise an alert.
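
The arithmetic behind these two examples can be sketched in a few lines of Python. This is only an illustration of the 2-sigma rule, not the alerting rule in this MR: the function names `alert_threshold` and `is_anomalous` are hypothetical, the strict `>` comparison is an assumption, and the 5-minute duration requirement is not modelled.

```python
def alert_threshold(avg, stddev, sigmas=2.0):
    """Error rate above which an endpoint is considered anomalous.

    avg and stddev are the endpoint's own 12-hour average and standard
    deviation, in errors/s. The names here are hypothetical.
    """
    return avg + sigmas * stddev


def is_anomalous(current, avg, stddev, sigmas=2.0):
    """True if the current error rate exceeds the endpoint's threshold.

    Assumption: a strict comparison is used, so a rate sitting exactly
    on the threshold does not count as anomalous.
    """
    return current > alert_threshold(avg, stddev, sigmas)


# Each endpoint is judged against its own baseline (figures from the
# examples above).
baselines = {
    "PostUploadPack": (5.0, 1.0),  # threshold: 5 + 2 * 1 = 7 errors/s
    "CommitDiff": (0.5, 0.1),      # threshold: 0.5 + 2 * 0.1 = 0.7 errors/s
}

for endpoint, (avg, stddev) in baselines.items():
    print(endpoint, alert_threshold(avg, stddev))
```

Because the threshold scales with each endpoint's own deviation, a quiet endpoint like CommitDiff can alert at a rate that would be unremarkable for a busy one like PostUploadPack.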

Inspired by https://prometheus.io/blog/2015/06/18/practical-anomaly-detection/

https://gitlab.com/gitlab-cookbooks/gitlab-prometheus/merge_requests/311 ensures that low priority Gitaly alerts, like the one generated by this MR, will only raise issues in the #gitaly-alerts channel and not in the main #gitaly or #prometheus-alert channels.

I fully expect that it will need tuning, but I would rather iterate rapidly than spend extra time guessing what the right values are.

If this experiment is successful, I'd like to add more alerts using the same technique, but will focus on this one first.

Edited by Andrew Newdigate