GitLab health check

@JobV please schedule this

Milestone changed to 8.7

Another customer requested a health check endpoint so they can configure their load balancer to take a server out of rotation when it's not healthy. This would also be useful for monitoring systems like Sensu/Nagios. I found a couple of other interesting gems we could potentially use:

I'm not sure about the 'out of service' use case. It's an interesting one, but people usually just disable the particular server in the load balancer during maintenance. If it's doable it would be nice to have, but a general health check is probably more useful.

The description also mentions a response to users. How would users see this response? A health check endpoint usually returns JSON that is consumed by automated systems and load balancers, not by users. We already have a deploy page that can be put up to convey this kind of issue to users.
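To make the "JSON consumed by automated systems" point concrete, here is a minimal sketch in plain Ruby. It is not how GitLab or any of the gems mentioned in this thread implement it; `health_response` and the check names are hypothetical, made up for illustration only.

```ruby
require "json"

# Hypothetical helper: runs named checks (callables returning true/false)
# and builds an HTTP status code plus a JSON body that a load balancer
# or monitoring system could consume.
def health_response(checks)
  results = checks.transform_values do |check|
    begin
      check.call ? "ok" : "failed"
    rescue StandardError => e
      # A check that raises is reported as unhealthy, not as a crash.
      "error: #{e.message}"
    end
  end

  healthy = results.values.all? { |result| result == "ok" }
  status = healthy ? 200 : 500
  [status, JSON.generate("healthy" => healthy, "checks" => results)]
end

# Example with made-up checks; a real setup would probe the database, cache, etc.
status, body = health_response({ "database" => -> { true }, "cache" => -> { false } })
puts status
puts body
```

A load balancer only needs the status code, while the JSON body gives a monitoring system enough detail to say which check failed.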

@dblessing I think we should just implement the health check now and forget about the override/out-of-service/etc.

https://github.com/lbeder/health-monitor-rails looks pretty sweet

mentioned in issue omnibus-gitlab#839 (closed)

A big customer indicated they really need this during the conference yesterday.

cc @DouweM

Milestone changed to 8.8

Added ~122770 label

https://www.ruby-toolbox.com/projects/health_check looks more popular than https://www.ruby-toolbox.com/projects/health-monitor-rails

Please use health_check with its default settings and URL.

mentioned in issue omnibus-gitlab#1233 (closed)

mentioned in merge request !3888 (merged)

Reassigned to @twk3

mentioned in issue #3883 (moved)

We just encountered a situation where the backup cron job was incorrectly configured and did not run properly. Our fault, obviously, but it would be nice if the health check could be configured to verify this. Over at https://gitlab.com/gitlab-org/gitlab-ce/issues/3883#note_5069011 I suggested a health-check-like endpoint:

Second, a suggestion: Create a health endpoint, where we could monitor a URL like "/backup_check?max-age=1440". This URL would verify whether there is a backup less than 1440 minutes old, and return 200 OK if so. If not, it would return a 500 status. This would integrate nicely with our monitoring system, and probably most others, since they are all equipped to verify that a web endpoint is up and running.

Or, if included in the health check system, perhaps a configuration option ("max allowed backup age") which is then handled automatically by /health_check, or manually by /health_check/backup (inspired by https://github.com/ianheggie/health_check).
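A rough sketch of the max-age logic described above, in plain Ruby with no framework. `backup_check_status`, the backup directory argument, and the `*.tar` file pattern are assumptions made for illustration; this is not an existing GitLab or health_check API.

```ruby
# Hypothetical helper: returns 200 if the newest "*.tar" file in backup_dir
# is at most max_age_minutes old, and 500 otherwise (including when no
# backup file exists at all).
def backup_check_status(backup_dir, max_age_minutes, now: Time.now)
  newest = Dir.glob(File.join(backup_dir, "*.tar"))
              .map { |path| File.mtime(path) }
              .max
  return 500 if newest.nil?

  age_minutes = (now - newest) / 60.0
  age_minutes <= max_age_minutes ? 200 : 500
end
```

An endpoint such as "/backup_check?max-age=1440" would parse `max-age` from the query string and return the resulting status code, so any monitor that can watch a URL could alert on a stale or missing backup.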

Discovering uptime/downtime is important, but this will mostly be discovered anyway (your users will complain, and if they don't, you have bigger problems). A missing backup may not be discovered until it's too late, and then there are no bigger problems.

Status changed to closed by merge request !3888 (merged)

mentioned in merge request !14785
