Customer requested that we add the ability to use HTTP for health checks, where the response to users says 'out of service' while they work on the box. => leave this for later
Sytse's note: use health_check gem, override with .outofservice file => leave this for later
Another customer requested a health check endpoint so they can configure their load balancer to take the server out of the mix if it's not healthy. This would be useful for monitoring systems like Sensu/Nagios, too. I found a couple of other interesting gems we could potentially use:
I'm not sure about the 'out of service' use case. It's an interesting one, but people usually just disable the particular server in the load balancer during maintenance. If it's doable, it's worth considering, but a general health check is probably more useful.
The description also says "where a response to users". How do users see this response? A health check endpoint usually returns JSON that is consumed by automated systems and load balancers rather than by users. We have a deploy page that can be raised to convey this to users.
We just encountered a situation where the backup cron job was incorrectly configured and did not run properly. Our fault, obviously, but it would be nice if the health check could be configured to verify this. Over at https://gitlab.com/gitlab-org/gitlab-ce/issues/3883#note_5069011 I suggested a health-check-like endpoint:
Second, a suggestion: create a health endpoint so that we could monitor a URL like "/backup_check?max-age=1440". This URL would verify whether there is a backup less than 1440 minutes old and return 200 OK if so; otherwise it would return a 500 status. This would integrate nicely with our monitoring system, and probably most others, since they are all equipped to verify that a web endpoint is up and running.
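To make the suggestion concrete, here is a minimal Ruby sketch of the decision such an endpoint would make. It is only an illustration under assumptions: the helper name `backup_check_status`, the backup directory argument, and the "*.tar" file pattern are all hypothetical, and wiring it to an actual route is left out.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper (not an existing GitLab or health_check API).
# Returns 200 if the newest "*.tar" file in backup_dir is less than
# max_age_minutes old, and 500 otherwise (including when no backup exists).
def backup_check_status(backup_dir, max_age_minutes, now: Time.now)
  newest = Dir.glob(File.join(backup_dir, "*.tar")).map { |path| File.mtime(path) }.max
  return 500 if newest.nil?

  age_minutes = (now - newest) / 60.0 # Time - Time yields seconds as a Float
  age_minutes < max_age_minutes ? 200 : 500
end

# Usage example against a throwaway directory.
Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, "example_backup.tar"))
  puts backup_check_status(dir, 1440)                                # fresh backup
  puts backup_check_status(dir, 1440, now: Time.now + 2 * 24 * 3600) # two days later
end
```

The `now:` keyword exists only so the check can be exercised without waiting; a real endpoint would parse `max-age` from the query string and pass it in as `max_age_minutes`.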
Or, if included in the health check system, perhaps a configuration option ("max allowed backup age") that is then handled automatically by /health_check, or manually by /health_check/backup (inspired by https://github.com/ianheggie/health_check).
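A rough Ruby sketch of how a configured maximum backup age could feed both an aggregate /health_check and an individual /health_check/backup. This is not the health_check gem's actual API; `CONFIG`, `build_checks`, and `health_status` are hypothetical names used only for illustration, and the newest backup time is passed in rather than read from disk.

```ruby
# Hypothetical configuration; the key name is an assumption.
CONFIG = { max_backup_age_minutes: 1440 }

# Builds a registry of named checks. Each check is a lambda that returns
# true when healthy. newest_backup_at is a Time, or nil if no backup exists.
def build_checks(config, newest_backup_at, now)
  {
    "backup" => lambda {
      !newest_backup_at.nil? &&
        (now - newest_backup_at) / 60.0 < config[:max_backup_age_minutes]
    }
  }
end

# Maps a request path to a status: "/health_check" runs every check,
# "/health_check/<name>" runs one, and anything else yields 404.
def health_status(path, checks)
  name = path.sub(%r{\A/health_check/?}, "")
  selected = name.empty? ? checks.values : [checks[name]].compact
  return 404 if selected.empty?

  selected.all?(&:call) ? 200 : 500
end

now = Time.now
checks = build_checks(CONFIG, now - 3600, now) # newest backup is one hour old
puts health_status("/health_check", checks)
puts health_status("/health_check/backup", checks)
```

With this shape, a monitoring system only needs the aggregate URL, while the per-check URL helps pinpoint which check is failing.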
Discovering uptime/downtime is important, but downtime will mostly be discovered anyway (your users will complain, and if they don't, you have bigger problems). A missing backup may not be discovered until it's too late, and at that point there is no bigger problem.