If we can somehow create a one-at-a-time Sidekiq job queue, we could have a continuous job that runs fsck on the repo whose previous fsck is oldest. We would then need to keep track of something like the last successful and failed fsck on each repo.
Doing things in checkmk adds little value for other deployments.
I think it is better to aim for zero repositories with errors. 'git fsck' errors should not be something administrators only worry about when a trend goes up; they are rare and important, and should be dealt with individually.
I thought a little about how to build this. I think we need two new columns in the projects table: last_fsck_at and last_fsck_state. Possible states are ok, repo_failed, wiki_failed, both_failed.
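As a sketch of how those two fsck outcomes would map onto the four proposed states (the helper name and keyword arguments are my own, not anything in GitLab):

```ruby
# Hypothetical helper: derive last_fsck_state from the outcome of
# running 'git fsck' on the main repo and on the wiki repo.
# The return values match the four proposed states.
def fsck_state(repo_ok:, wiki_ok:)
  if repo_ok && wiki_ok
    'ok'
  elsif !repo_ok && !wiki_ok
    'both_failed'
  elsif repo_ok
    'wiki_failed'
  else
    'repo_failed'
  end
end
```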
To prevent lots of 'git fsck' processes at the same time I think we could have an hourly sidekiq-cron job that does the following:
```ruby
start = Time.now
while Time.now - start < 1.hour
  project = Project.where('last_fsck_at < ?', 1.week.ago).order(:last_fsck_at).first
  break if project.nil?
  # run git fsck, update state of project, update last_fsck_at
end
```
I am not sure if a sidekiq-cron job will run only once per GitLab cluster or once per GitLab host; either of those should be fine load-wise.
In addition to this we should have a daily sidekiq cron job that emails all GitLab admins if there are any projects in the DB with last_fsck_state != ok.
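In plain Ruby (the project data and field names here are hypothetical), the daily job would boil down to selecting every project whose last check did not end in the 'ok' state:

```ruby
# Hypothetical sketch of the daily admin report: collect every project
# whose last repository check did not end in the 'ok' state.
projects = [
  { path: 'group/healthy', last_fsck_state: 'ok' },
  { path: 'group/broken',  last_fsck_state: 'repo_failed' },
  { path: 'group/wiki',    last_fsck_state: 'wiki_failed' },
]

failing = projects.reject { |p| p[:last_fsck_state] == 'ok' }

# In GitLab this would be a Project.where.not(...) query feeding an
# admin notification email; here we just list the failing paths.
failing.each { |p| puts p[:path] }
```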
I think we should not automatically notify project owners because:
- they cannot go onto the GitLab server to fix things in a shell
- there is a good chance of false alarms (e.g. NFS share not mounted), which could end up emailing tons of project owners for no good reason
We can use ExclusiveLease to prevent running git fsck on the same repo twice at the same time.
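GitLab's ExclusiveLease is Redis-backed; the principle, sketched here with a toy in-memory store (all names below are my own, not the real API), is "obtain the lock only if nobody holds an unexpired one":

```ruby
require 'securerandom'

# Toy in-memory lease illustrating the idea behind ExclusiveLease:
# try_obtain returns a token when no unexpired lease exists for the
# key, and false otherwise. The real implementation keeps the lease
# in Redis so it works across Sidekiq processes and hosts.
class ToyLease
  @leases = {}

  def self.try_obtain(key, timeout:)
    expiry = @leases[key]
    return false if expiry && expiry > Time.now

    @leases[key] = Time.now + timeout
    SecureRandom.uuid
  end
end
```

A second `try_obtain` for the same repo key within the timeout returns false, so a worker that loses the race simply skips that repo.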
When a new 1-hour job starts the old one may not have finished yet. But momentarily running just two git fsck processes at the same time instead of one should not be a problem.
I do not like how much complexity a state machine adds to the code base just to track what happened during the last 'git fsck'. Maybe we should just have a 'last_repo_check_failed' flag and use a dedicated log file to report what went wrong, if anything.
This was merged into 8.7 but it still gives too many false alarms, so we did not turn it on for general use yet. Still I think we can close this issue.
@jacobvosmaer I know that seems weird. I just migrated to gitlab-ce omnibus managed with puppet and migrated all repos from our gitlab-6.5 machine. One repo where we have the rails code has a broken commit (this one) with a wrong timezone.
```
$ git cat-file -p 4cf94979c9f4d6683c9338d694d5eb3106a4e734
tree 7989dfb2ec2f41914611a22fb30bbc2b3849df9a
parent 8845ae683e2688bc619baade49510c17e978518f
author Vijay Dev <vijaydev.cse@gmail.com> 1312735823 +051800
committer Vijay Dev <vijaydev.cse@gmail.com> 1312735823 +051800
```
So I could not push this repository to the new gitlab:
```
$ git push --mirror git@gitlab.local:proj/rails.git
Counting objects: 322680, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (71292/71292), done.
remote: error: object 4cf94979c9f4d6683c9338d694d5eb3106a4e734: badTimezone: invalid author/committer line - bad time zone
remote: fatal: Error in object
error: pack-objects died of signal 13
error: failed to push some refs to 'git@gitlab.local:proj/rails.git'
```
Quite a few people use this repository, and rewriting the commit would probably lead to many hours of support work.
My current workaround was to set a local option to that repository like this:
```
$ git config receive.fsckObjects false
```
Also, I really like the idea of the auto-fsck, but I would like to exclude this repository from it and from any git fsck.
I am open to better solutions, though.
@Wayneoween thanks, I had never heard of this. So if I understand correctly that repo is working fine in day-to-day use but git fsck considers that one commit broken? That sucks. :(
Thinking about it some more, I expect we will want some sort of 'ignore this repo' option on gitlab.com ourselves. Users occasionally abandon broken projects (broken due to failed imports etc.) and we will probably want to leave those be rather than 'fix' them without asking the user, or asking the user to fix them.
@Wayneoween so even though I wrote 'we were not planning it' we (people caring for gitlab.com) may learn we need it anyway.
@jacobvosmaer you are right, everything else works fine, but this one commit throws an error and the push stops.
Since this is only the one repository, my workaround works for us. I am glad that I could raise awareness of this.
Thanks for your time!