If we can somehow create a one-at-a-time Sidekiq job queue, we could have a continuous job that runs fsck on the repo whose previous fsck is oldest. We would then need to keep track of something like the last successful and failed fsck on each repo.
Doing things in checkmk adds little value for other deployments.
I think it is better to aim for zero repositories with errors. 'git fsck' errors should not be something administrators only worry about when a trend goes up; they are rare and important, and should be dealt with individually.
I thought a little about how to build this. I think we need two new columns in the projects table: last_fsck_at and last_fsck_state. Possible states are ok, repo_failed, wiki_failed, both_failed.
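As a sketch of how those two fsck outcomes would map onto the four proposed states (the helper name and keyword arguments are my own, not anything in GitLab):

```ruby
# Hypothetical helper: derive last_fsck_state from the outcome of
# running 'git fsck' on the main repo and on the wiki repo.
# The return values match the four proposed states.
def fsck_state(repo_ok:, wiki_ok:)
  if repo_ok && wiki_ok
    'ok'
  elsif !repo_ok && !wiki_ok
    'both_failed'
  elsif repo_ok
    'wiki_failed'
  else
    'repo_failed'
  end
end
```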
To prevent lots of 'git fsck' processes at the same time I think we could have an hourly sidekiq-cron job that does the following:
```ruby
start = Time.now
while Time.now - start < 1.hour
  project = Project.where('last_fsck_at < ?', 1.week.ago).order(:last_fsck_at).first
  break if project.nil?
  # run git fsck, update state of project, update last_fsck_at
end
```
I am not sure if a sidekiq-cron job will run only once per GitLab cluster or once per GitLab host; either of those should be fine load-wise.
In addition to this we should have a daily sidekiq cron job that emails all GitLab admins if there are any projects in the DB with last_fsck_state != ok.
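In plain Ruby (the project data and field names here are hypothetical), the daily job would boil down to selecting every project whose last check did not end in the 'ok' state:

```ruby
# Hypothetical sketch of the daily admin report: collect every project
# whose last repository check did not end in the 'ok' state.
projects = [
  { path: 'group/healthy', last_fsck_state: 'ok' },
  { path: 'group/broken',  last_fsck_state: 'repo_failed' },
  { path: 'group/wiki',    last_fsck_state: 'wiki_failed' },
]

failing = projects.reject { |p| p[:last_fsck_state] == 'ok' }

# In GitLab this would be a Project.where.not(...) query feeding an
# admin notification email; here we just list the failing paths.
failing.each { |p| puts p[:path] }
```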
I think we should not automatically notify project owners because:
- they cannot go onto the GitLab server to fix things in a shell
- there is a good chance of false alarms (e.g. NFS share not mounted), which could end up emailing tons of project owners for no good reason
We can use ExclusiveLease to prevent running git fsck on the same repo twice at the same time.
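GitLab's ExclusiveLease is Redis-backed; the principle, sketched here with a toy in-memory store (all names below are my own, not the real API), is "obtain the lock only if nobody holds an unexpired one":

```ruby
require 'securerandom'

# Toy in-memory lease illustrating the idea behind ExclusiveLease:
# try_obtain returns a token when no unexpired lease exists for the
# key, and false otherwise. The real implementation keeps the lease
# in Redis so it works across Sidekiq processes and hosts.
class ToyLease
  @leases = {}

  def self.try_obtain(key, timeout:)
    expiry = @leases[key]
    return false if expiry && expiry > Time.now

    @leases[key] = Time.now + timeout
    SecureRandom.uuid
  end
end
```

A second `try_obtain` for the same repo key within the timeout returns false, so a worker that loses the race simply skips that repo.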
When a new 1-hour job starts the old one may not have finished yet. But momentarily running just two git fsck processes at the same time instead of one should not be a problem.
I do not like how much complexity a state machine adds to the code base just to track what happened during the last 'git fsck'. Maybe we should just have a 'last_repo_check_failed' flag and use a dedicated log file to report what went wrong, if anything.
This was merged into 8.7 but it still gives too many false alarms, so we did not turn it on for general use yet. Still I think we can close this issue.
@jacobvosmaer I know that seems weird. I just migrated to gitlab-ce omnibus managed with puppet and migrated all repos from our gitlab-6.5 machine. One repo where we have the rails code has a broken commit (this one) with a wrong timezone.
```
$ git cat-file -p 4cf94979c9f4d6683c9338d694d5eb3106a4e734
tree 7989dfb2ec2f41914611a22fb30bbc2b3849df9a
parent 8845ae683e2688bc619baade49510c17e978518f
author Vijay Dev <vijaydev.cse@gmail.com> 1312735823 +051800
committer Vijay Dev <vijaydev.cse@gmail.com> 1312735823 +051800
```
So I could not push this repository to the new gitlab:
```
$ git push --mirror git@gitlab.local:proj/rails.git
Counting objects: 322680, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (71292/71292), done.
remote: error: object 4cf94979c9f4d6683c9338d694d5eb3106a4e734: badTimezone: invalid author/committer line - bad time zone
remote: fatal: Error in object
error: pack-objects died of signal 13
error: failed to push some refs to 'git@gitlab.local:proj/rails.git'
```
Quite a few people use this repository, and rewriting the commit would probably lead to many hours of support work.
My current workaround was to set a local option to that repository like this:
```
$ git config receive.fsckObjects false
```
Also, I really like the idea of the auto-fsck, but I would like to exclude this repository from it and from any git fsck.
I am open to better solutions, though.
@Wayneoween thanks, I had never heard of this. So if I understand correctly that repo is working fine in day-to-day use but git fsck considers that one commit broken? That sucks. :(
Thinking about it some more, I expect we will want some sort of 'ignore this repo' option on gitlab.com ourselves. Users occasionally abandon broken projects (broken due to failed imports etc.) and we will probably want to leave those be rather than 'fix' them without asking the user, or asking the user to fix them.
@Wayneoween so even though I wrote 'we were not planning it' we (people caring for gitlab.com) may learn we need it anyway.
@jacobvosmaer you are right, everything else works fine, but this one commit throws an error and the push stops.
Since this is only the one repository, my workaround works for us. I am glad that I could raise awareness of this.
Thanks for your time!