Improve monitoring
We are getting to the point where we are not really happy with CheckMK.
The main reasons are:
- it's not trivial to add metrics, and the plugins on offer do not give enough control.
- on a day of bad network weather we get a lot of false positives.
- the alerting capabilities seem to be limited, too binary.
- the UI is too complex and it's hard (or impossible) to build dashboards to check the system health in one view.
- we are maintaining 2 monitoring tools: InfluxDB for performance and CheckMK for host monitoring.
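
To make the "false positives" and "too binary" points concrete, here is a minimal sketch of the kind of alerting behaviour we would want from a replacement: fire only when a check has failed several times in a row, so a single dropped probe on a bad network day does not page anyone. The class name `SustainedFailureAlert` and the `window` parameter are hypothetical, made up for illustration; this is not the API of CheckMK, Sensu, or Prometheus.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class SustainedFailureAlert:
    """Hypothetical helper: fire only when the last `window` checks all failed."""

    window: int = 3  # assumed number of consecutive failures required
    _recent: Deque[bool] = field(default_factory=deque, init=False, repr=False)

    def observe(self, check_ok: bool) -> bool:
        """Record one check result; return True if the alert should fire."""
        self._recent.append(check_ok)
        # Keep only the most recent `window` results.
        if len(self._recent) > self.window:
            self._recent.popleft()
        # Fire only with a full window in which every check failed.
        return len(self._recent) == self.window and not any(self._recent)


if __name__ == "__main__":
    alert = SustainedFailureAlert(window=3)
    # Two isolated failures, then a sustained outage of three failed checks.
    for ok in [True, False, True, False, False, False]:
        print(ok, alert.observe(ok))
```

With `window=3`, the isolated failures in the sample sequence stay quiet and only the third consecutive failure fires. Whatever tool we pick should let us express something along these lines without writing a custom plugin.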
It would be interesting to try something else and see what the capabilities are. Initial conversations leaned towards hiring a service; @stanhu is handling this possibility.
The next possibility was to reconsider Prometheus, since it is a time series database that also includes alerting.
Nothing decided here, just gathering thoughts.
Possible systems to try (please gather pros and cons of each):
- Sensu - comments on Sensu: https://gitlab.com/gitlab-com/operations/issues/249#note_11673223
- Prometheus - comments on Prometheus: https://gitlab.com/gitlab-com/operations/issues/249#note_12477542