Inform end-users and administrators of performance problems
The embedded Prometheus metrics give us an opportunity to proactively inform administrators, and potentially end users, of performance issues.
These could be relatively straightforward rules, such as Sidekiq problems or running low on disk space, but we could also consider general performance problems.
For example, if certain user-facing metrics cross a given threshold, we could alert:
- If the average latency for MR/Issue/etc. is above X
- If the average SSH commit time is above X
- If the average CI queue wait is above X minutes
This could be a way of informing administrators and users that there is an issue on their server that needs attention. The fix could be as simple as tuning a few parameters, adding a few more CI runners, or tweaking the concurrency, or it could mean switching to a larger server with more resources.
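The threshold checks listed above could be sketched roughly as follows. This is only an illustration of the idea, not a proposed implementation: the metric names, the threshold values standing in for "X", and the `ThresholdRule` and `evaluate` helpers are all hypothetical, and the samples are assumed to have already been read from Prometheus.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ThresholdRule:
    metric: str       # hypothetical metric name, not an actual exported series
    threshold: float  # the "X" from the list above; placeholder value
    hint: str         # suggested course of action to show with the alert


# All names and numbers below are illustrative assumptions.
RULES: List[ThresholdRule] = [
    ThresholdRule("web_latency_seconds_avg", 2.0,
                  "Tune a few parameters or move to a larger server."),
    ThresholdRule("ssh_commit_seconds_avg", 5.0,
                  "Check server load and tweak the concurrency."),
    ThresholdRule("ci_queue_wait_minutes_avg", 10.0,
                  "Add a few more CI runners."),
]


def evaluate(rules: List[ThresholdRule],
             samples: Dict[str, float]) -> List[ThresholdRule]:
    """Return the rules whose current sample is above its threshold.

    Metrics with no sample are skipped rather than treated as firing.
    """
    firing = []
    for rule in rules:
        value = samples.get(rule.metric)
        if value is not None and value > rule.threshold:
            firing.append(rule)
    return firing


if __name__ == "__main__":
    current = {"web_latency_seconds_avg": 3.5, "ci_queue_wait_minutes_avg": 4.0}
    for rule in evaluate(RULES, current):
        print(f"{rule.metric} is above {rule.threshold}: {rule.hint}")
```

Attaching a hint to each rule is one way to pair an alert with the course of action that might resolve it.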
As for notifications, we could warn administrators first and then, if no action is taken to either address the issue or "silence" the alert, warn end users as well. This would help drive awareness within the company, both of the performance issues themselves and of the fact that they could be addressed by a certain course of action.
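The escalation described above (administrators first, then end users unless the alert is addressed or silenced) might look something like this minimal sketch. The `Alert` fields, the `audience` function, and the three-day grace period are assumptions made for illustration; the proposal does not specify how long to wait before escalating.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    name: str
    first_fired_at: datetime
    silenced: bool = False  # an administrator chose to "silence" the alert
    resolved: bool = False  # the underlying issue was addressed


# Assumed grace period before end users are also warned.
GRACE_PERIOD = timedelta(days=3)


def audience(alert: Alert, now: datetime,
             grace: timedelta = GRACE_PERIOD) -> str:
    """Decide who should currently see a warning for this alert."""
    if alert.resolved or alert.silenced:
        return "none"
    if now - alert.first_fired_at >= grace:
        return "admins_and_users"
    return "admins"


if __name__ == "__main__":
    fired = datetime(2017, 7, 24, 9, 0)
    alert = Alert("ci_queue_wait", first_fired_at=fired)
    print(audience(alert, fired + timedelta(hours=1)))
    print(audience(alert, fired + timedelta(days=4)))
```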
This idea was also proposed on HN: https://news.ycombinator.com/item?id=14833137