Geo: Monitoring
To build a robust monitoring solution for Geo, we have to answer the following questions, each with one or more solutions:
- How to make sure replication is working?
  - Database replication
  - REST API based replication
  - Git repositories
- How to detect delays in database replication?
- How to detect delays in Sidekiq-dependent replication? (Git repositories, wikis, SSH keys)
- How to track failures?
- How to track reschedules?
- Everything is asynchronous, so how do we make sure we are not losing important data?
- How to detect if the primary can communicate with the secondary?
  - Via HTTP / HTTPS (check certificates)
- How to detect if the secondary can communicate with the primary?
  - Via HTTP / HTTPS (check certificates)
  - Via SSH
    - Check if the SSH key is correctly enabled
    - Check if we have the primary in known_hosts
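As an illustration of the last check, here is a minimal sketch of testing whether the primary's hostname appears in a known_hosts file. The function names `host_in_known_hosts` and `_hashed_entry_matches` are hypothetical, not existing Geo code. The sketch assumes the standard OpenSSH file format, matches exact hostnames only (no wildcard patterns or `[host]:port` forms), and skips marker lines such as `@revoked`.

```python
import base64
import hashlib
import hmac


def _hashed_entry_matches(entry: str, host: str) -> bool:
    """Match a hashed entry of the form |1|<salt>|<hash> (base64, HMAC-SHA1)."""
    parts = entry.split("|")
    if len(parts) != 4 or parts[1] != "1":
        return False
    try:
        salt = base64.b64decode(parts[2])
        expected = base64.b64decode(parts[3])
    except ValueError:
        # Malformed base64: treat the entry as not matching.
        return False
    actual = hmac.new(salt, host.encode("utf-8"), hashlib.sha1).digest()
    return hmac.compare_digest(actual, expected)


def host_in_known_hosts(known_hosts_text: str, host: str) -> bool:
    """Return True if `host` has a plain or hashed entry in the given text."""
    for line in known_hosts_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and marker lines (@cert-authority, @revoked).
        if not line or line.startswith("#") or line.startswith("@"):
            continue
        fields = line.split()
        if len(fields) < 3:
            # A valid entry has at least: hostnames, key type, key.
            continue
        for entry in fields[0].split(","):
            if entry == host or _hashed_entry_matches(entry, host):
                return True
    return False
```

A setup-time check could read the known_hosts file of the user that runs Git operations on the secondary and pass its contents to `host_in_known_hosts` together with the primary's hostname. Which file and which user apply depends on the installation.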
Proposal
There are two kinds of checks. Some things only need to be checked once, during setup, because they should not change at run time. Others are state and failures that can occur at run time.
The first set should be covered by either a rake task or a configuration check page in the Admin screen.
For the second set, we should add something to the Health Check API endpoint, or add a similar endpoint with more verbose details.
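To make the second set more concrete, here is a rough sketch of what a verbose health report could compute. Everything in it is an assumption for illustration: the `GeoMetrics` fields, the `Thresholds` defaults, and the `health_report` function are hypothetical and do not describe the existing Health Check endpoint. How the metrics themselves are collected is out of scope here.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class GeoMetrics:
    """Hypothetical snapshot of run-time state on a secondary node."""
    db_replication_lag_seconds: float
    sidekiq_queue_latency_seconds: float
    failed_jobs: int
    rescheduled_jobs: int


@dataclass(frozen=True)
class Thresholds:
    """Illustrative limits; real values would need tuning per installation."""
    max_db_lag_seconds: float = 60.0
    max_queue_latency_seconds: float = 300.0
    max_failed_jobs: int = 0


def health_report(metrics: GeoMetrics,
                  thresholds: Optional[Thresholds] = None) -> dict:
    """Compare metrics against thresholds and list every problem found."""
    limits = thresholds or Thresholds()
    problems = []
    if metrics.db_replication_lag_seconds > limits.max_db_lag_seconds:
        problems.append("database replication is lagging")
    if metrics.sidekiq_queue_latency_seconds > limits.max_queue_latency_seconds:
        problems.append("Sidekiq replication queue is backed up")
    if metrics.failed_jobs > limits.max_failed_jobs:
        problems.append("replication jobs have failed")
    # Reschedules are reported in the metrics but not treated as a failure.
    return {
        "healthy": not problems,
        "problems": problems,
        "metrics": asdict(metrics),
    }
```

Returning the raw metrics alongside the list of problems keeps the response useful for both a simple up/down probe (the `healthy` flag) and a more detailed dashboard.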
The Sidekiq monitoring documentation has some interesting resources to explore:
- https://github.com/mperham/sidekiq/wiki/Monitoring#monitoring-queue-backlog
- https://github.com/mperham/sidekiq/wiki/API
Related issues:
- #1611 (closed)
- #1255
- #1664 (closed)
- #1751 (closed)
- gitlab-org/gitlab-ce#28080