Automated anomaly detection and alerting
Description
As part of a larger alerting solution for our customers, it's important that we are able to detect anomalous behavior of a specific metric or node.
The benefits of this automatic detection are twofold:
- It does not require specific knowledge of typical or expected behavior. Instead, it alerts when current behavior differs significantly from past performance. This means it does not require specific configuration, although it may require some tuning.
- It can detect issues that a standard threshold-based alert may miss.
Proposed Solution
Prometheus has built-in support for comparing current behavior against a past trend. For example, the 5-minute moving average can be compared against the 1-week moving average, and if it deviates by more than X standard deviations, an alert can be generated.
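The comparison described above can be sketched outside Prometheus to make the arithmetic concrete. This is a minimal illustration, not the proposed implementation: the function names and the default threshold of 3.0 (standing in for X) are assumptions. Inside Prometheus, the equivalent would likely be built from PromQL functions such as `avg_over_time` and `stddev_over_time`.

```python
from statistics import fmean, pstdev


def deviation_in_sigmas(recent, baseline):
    """Distance of the recent mean from the baseline mean, in baseline standard deviations.

    `recent` stands in for the 5-minute window and `baseline` for the
    1-week window; both are plain lists of samples (hypothetical inputs).
    """
    baseline_mean = fmean(baseline)
    baseline_std = pstdev(baseline)  # population standard deviation
    recent_mean = fmean(recent)
    if baseline_std == 0:
        # A perfectly flat baseline gives no scale to measure against.
        return 0.0 if recent_mean == baseline_mean else float("inf")
    return abs(recent_mean - baseline_mean) / baseline_std


def is_anomalous(recent, baseline, threshold=3.0):
    """True when the recent mean is more than `threshold` sigmas from the baseline.

    The default of 3.0 is an assumed value for X, not one taken from the proposal.
    """
    return deviation_in_sigmas(recent, baseline) > threshold
```

For example, with a baseline latency hovering around 100 ms, a recent window averaging 155 ms would be flagged, while a window averaging 100 ms would not.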
It is also trivial to compare the behavior of one node against the general population. If a node is more than X standard deviations from the 5-minute moving average of the rest of the nodes, an alert can be generated. (We should add a grace period for startup/shutdown.)
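A rough sketch of the per-node comparison follows. It assumes each node's 5-minute moving average has already been computed and is passed in as a dictionary; the function name, the input shape, and the default threshold are all hypothetical. The startup/shutdown grace period is not modeled here.

```python
from statistics import fmean, pstdev


def outlier_nodes(node_averages, threshold=3.0):
    """Return the nodes whose average is far from that of the rest of the nodes.

    `node_averages` maps a node name to its 5-minute moving average
    (hypothetical input shape). Each node is compared against all the
    other nodes, so a single outlier does not skew its own baseline.
    """
    outliers = []
    for node, value in node_averages.items():
        rest = [v for name, v in node_averages.items() if name != node]
        if len(rest) < 2:
            # Too few peers to form a meaningful population.
            continue
        mean = fmean(rest)
        std = pstdev(rest)
        if std == 0:
            # All peers agree exactly; any difference counts as an outlier.
            if value != mean:
                outliers.append(node)
            continue
        if abs(value - mean) / std > threshold:
            outliers.append(node)
    return outliers
```

Comparing each node against the rest, rather than against the whole population including itself, is a design choice made for this sketch; it keeps one badly behaving node from inflating the standard deviation it is measured against.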
While this is not as sophisticated as statistical models that factor in the peaks and troughs that typically occur throughout a given day/week/year, it is a great start and can provide significant value without any configuration.
The primary drawback is that this requires recording rules and Alertmanager configuration within Prometheus, which is done by editing files on disk. This will be problematic to automate for the Omnibus-packaged and external Prometheus servers, but is notably not a problem for the managed deployment of Prometheus we are adding.
To achieve this, we should:
- Add support to the managed deployment of Prometheus to configure Alerts and Recording Rules.
- Define a recording rule for a weekly moving average of key metrics: Latency, Error Rate, CPU, Memory.
- Define a recording rule for a 5-minute moving average of all nodes for each metric.
- Create an alert for each metric when the 5-minute moving average is X standard deviations beyond the weekly moving average.
- Create an alert for each metric when a node's 5-minute moving average is X standard deviations beyond the population's average.
- Create a webhook for Prometheus to notify GitLab that an alert has been triggered.
- When receiving an alert, GitLab should notify based on the alerting configuration.
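To illustrate the receiving side of the webhook, here is a minimal sketch of extracting firing alerts from a notification body. It assumes the JSON shape sent by Alertmanager's webhook receiver (a top-level `alerts` list whose entries carry `status`, `labels`, and `startsAt`); the function name and the returned dictionary keys are hypothetical, and the actual GitLab endpoint would be implemented within GitLab itself.

```python
import json


def firing_alerts(body):
    """Extract the firing alerts from a webhook request body.

    `body` is the raw JSON string of the notification. The field names
    read here assume the Alertmanager webhook payload format.
    """
    payload = json.loads(body)
    results = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            # Resolved alerts are skipped in this sketch.
            continue
        labels = alert.get("labels", {})
        results.append({
            "name": labels.get("alertname", "unknown"),
            "instance": labels.get("instance"),
            "started_at": alert.get("startsAt"),
        })
    return results
```

Each returned entry could then be matched against the alerting configuration to decide who should be notified.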