Need a mechanism to discover backup problems
Description
Purely by accident, I discovered that the GitLab backup process ("gitlab-rake backup") was failing on our on-premises installation, and that we do not have a recent backup.
The full backup/recovery solution is of course bigger than the GitLab application itself, but the "gitlab-rake backup" command is an important part of it. We need good mechanisms for discovering when it fails. Log files, whether stored on a server or sent by email, can themselves fail in many ways, so we need easier access to the backup status:
- We should have a way to manually ask for the current status, e.g. in a console
- We should have a way to automatically discover the status, e.g. using a monitoring tool
Proposal
I first suggest the following:
- At the end of the backup, when all is fine, the timestamp of the last backup file is written to the database, along with the file size
- Somewhere in the admin page (preferably at least on the /admin summary page) the time of the last completed backup is shown
Next, for better usability and automation:
- Make it possible to configure "max backup age", for example 86400 seconds (1 day)
- If the last backup timestamp is older than max age, show the "last backup time" in a "warning" color, to show that this is unhealthy
- Integrate this check into the /admin/health_check page, so that an unhealthy backup can be discovered by automation
- Possibly create a separate /admin/health_check/backup page, with status code indicating status, and content showing date and size of latest backup
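To make the "max backup age" check concrete, here is a minimal sketch in Python of the decision such a health check could make. It is not GitLab code: the function name `backup_health`, the constant `MAX_BACKUP_AGE`, and the choice of 200/503 as status codes are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Hypothetical default: the 86400-second (1 day) example from the proposal.
MAX_BACKUP_AGE = timedelta(seconds=86400)


def backup_health(
    last_success: Optional[datetime],
    now: datetime,
    max_age: timedelta = MAX_BACKUP_AGE,
) -> Tuple[int, str]:
    """Return an (http_status, message) pair for a backup health endpoint.

    200 means the last completed backup is recent enough; 503 means it is
    missing or older than max_age. The status codes are an assumption.
    """
    if last_success is None:
        return 503, "no completed backup recorded"
    age = now - last_success
    if age > max_age:
        return 503, (
            f"last backup is {int(age.total_seconds())}s old "
            f"(max {int(max_age.total_seconds())}s)"
        )
    return 200, f"last backup completed at {last_success.isoformat()}"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(backup_health(now - timedelta(hours=3), now))
    print(backup_health(now - timedelta(days=2), now))
```

The same comparison could drive the "warning" color on the admin page, so the page and the health check never disagree.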
Another automation scenario is to export some Prometheus metrics, suitable for alerting or for graphing in a dashboard:
- gitlab_backup_last_success_seconds: Time since epoch for last successful backup. For example, alert if this is too long ago.
- gitlab_backup_last_success_size_bytes: Size of last completed backup set, in bytes. For example, alert if this is less than XXX, or drops by more than 1%, or just show a graph.
- gitlab_backup_last_run_seconds: Time since epoch for last backup
- gitlab_backup_last_run_elapsed_seconds: Number of seconds of last backup run
- gitlab_backup_last_run_status: Indicates the backup status (success vs error)
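As a rough illustration of how these metrics might look to a scraper, the sketch below renders them in the Prometheus text exposition format. The metric names come from the list above; the function `render_backup_metrics`, the help strings, and the encoding of the status as 1 (success) or 0 (error) are assumptions, not an existing GitLab API.

```python
from typing import Dict, Union

Number = Union[int, float]

# Help strings paraphrase the proposal; the 1/0 status encoding is an assumption.
METRIC_HELP = {
    "gitlab_backup_last_success_seconds": "Time since epoch of the last successful backup",
    "gitlab_backup_last_success_size_bytes": "Size of the last completed backup set in bytes",
    "gitlab_backup_last_run_seconds": "Time since epoch of the last backup run",
    "gitlab_backup_last_run_elapsed_seconds": "Duration of the last backup run in seconds",
    "gitlab_backup_last_run_status": "Status of the last backup run (1 = success, 0 = error)",
}


def render_backup_metrics(values: Dict[str, Number]) -> str:
    """Render the known backup metrics as Prometheus text exposition lines.

    Metrics missing from `values` are skipped, so a fresh installation
    without any backup run simply exports nothing.
    """
    lines = []
    for name, help_text in METRIC_HELP.items():
        if name not in values:
            continue
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {values[name]}")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(render_backup_metrics({
        "gitlab_backup_last_success_seconds": 1486000000,
        "gitlab_backup_last_success_size_bytes": 52428800,
        "gitlab_backup_last_run_status": 1,
    }))
```

With metrics like these, an alert on backup age could be written as `time() - gitlab_backup_last_success_seconds > 86400`, using the one-day threshold from the proposal as an example.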
Links / references
Waddayknow, gitlab-com/githost#74 indicates that this is also a problem in your own operations. The solution above would help you ensure that "gitlab-rake backup" works as expected (though you obviously also need additional backup monitoring, as well as full restore verification every now and then).
(Edited 2017-02-02)