During a merge, the MergeService acquires a lock on the merge request, and the lock is released at the end of the block. However, if the merge fails partway through (for example, because Sidekiq kills jobs and re-enqueues them), the lock is never released. The re-enqueued job runs on the next worker that picks it up, but because the lock is still held it silently finishes without merging.
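Simplified, the pattern looks something like this (a sketch only; the method names are placeholders, not the actual MergeService code):

```ruby
# Sketch of the current flow: the lock is only released at the very end,
# so a job killed mid-merge leaves the MR locked.
class MergeService
  def execute(merge_request)
    merge_request.lock_mr          # assumed state transition: opened -> locked

    merge_request.merge!           # <- if Sidekiq kills the job here...
    merge_request.mark_as_merged

    merge_request.unlock_mr        # ...this never runs and the lock dangles
  end
end
```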
cc/ @jameslopez This one might be difficult. Can you think of anything we can do to properly release the lock if the job gets re-enqueued?
cc/ @jacobvosmaer @stanhu Related to recent Sidekiq issues with the customer. FYI.
If I understood this correctly, what comes to mind - and this is without checking the code - is that this is normally handled with an `ensure` block in Ruby that does the cleanup if there is a failure, so we could release the lock at that point.
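Roughly what I have in mind, without having looked at the real code (names are placeholders):

```ruby
def execute(merge_request)
  merge_request.lock_mr
  merge_request.merge!
ensure
  # Runs on success, on raised errors, and on most graceful shutdowns,
  # but not on a hard kill (SIGKILL) -- which is the remaining gap
  # discussed below.
  merge_request.unlock_mr if merge_request.locked?
end
```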
If that is not possible, i.e. we couldn't capture the kill signal from Sidekiq or whatever else killed the job, we could ignore those jobs in Sidekiq first and create a cleanup job that releases the lock for failed jobs so they can be picked up by a worker. This can also get clever: for example, we may not want to retry a job that keeps failing indefinitely (i.e. we could set max retries, etc.), otherwise the job will keep running and using resources forever. (This is most likely handled by Sidekiq anyway.)
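A rough sketch of what such a cleanup job could look like (the class name, the `locked_at` column, the `with_state` scope, and the two-hour threshold are all assumptions for illustration; it presumes a Rails + Sidekiq setup):

```ruby
class StuckMergeCleanupWorker
  include Sidekiq::Worker

  STALE_AFTER = 2.hours

  def perform
    # Find MRs that have been stuck in the locked state for too long
    # and release them so a regular worker can pick them up again.
    MergeRequest.with_state(:locked)
                .where('locked_at < ?', STALE_AFTER.ago)
                .find_each do |merge_request|
      merge_request.unlock_mr
      # A retry of the merge could be enqueued here as well; Sidekiq's own
      # max-retries setting keeps a permanently failing job from looping forever.
    end
  end
end
```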
What if we had a time-bounded lock? For example, if after 24 hours the lock is still there, we automatically invalidate it. This would at least make the system recover over time.
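For illustration, the check could live on the model and be consulted wherever we currently refuse to start a merge because of the lock (the `locked_at` column and both method names are assumptions):

```ruby
class MergeRequest < ActiveRecord::Base
  LOCK_TTL = 24.hours

  # Trust the lock only for LOCK_TTL; after that, treat it as stale.
  def merge_lock_expired?
    locked? && locked_at && locked_at < LOCK_TTL.ago
  end

  # A new MergeWorker may proceed if there is no lock or the lock is stale.
  def lock_allows_merge?
    !locked? || merge_lock_expired?
  end
end
```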
Oh, I guess there is a check for an outdated lock. It sounds like the goal would be to make the worker pick up a failed merge request attempt. There will always be a case where the ensure block won't run, and we'll be stuck with a dangling lock. However, if Sidekiq requeues the task, that must mean it detected some error in the worker. We can hook an error handler to that to at least clear the state. Would that work?
Ideally, I think we would want the killed job to have a chance to clean up (e.g. unlock the MR in this case). Let's assume that doesn't run for some reason. What else can we do?
When the next retried MergeWorker job runs, there should be a way to verify that the lock is no longer valid and that it should go ahead and run. How do we do that? Do we need some sort of heartbeat so that the second MergeWorker can wait for X amount of time before declaring the previous job dead?
@stanhu We could have the MergeWorker acquire a lock for 15 seconds and re-acquire it before it expires, say every 10 seconds, for as long as the merge needs. The next MergeWorker would then get a chance at most 15 seconds after the first MergeWorker has died.
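As a sketch of that heartbeat lock, using a plain Redis key with a TTL (the class, key name, and client wiring are illustrative, not existing code):

```ruby
require 'redis'

class MergeLease
  TTL            = 15 # seconds the lock stays valid without a renewal
  RENEW_INTERVAL = 10 # seconds between renewals while the merge runs

  def initialize(merge_request_id, redis: Redis.new)
    @key   = "merge_request:#{merge_request_id}:merge_lease"
    @redis = redis
  end

  # Only the first caller gets the lease; SET with NX + EX is atomic.
  def acquire(jid)
    @redis.set(@key, jid, nx: true, ex: TTL)
  end

  # Call this every RENEW_INTERVAL seconds while merging. If the worker dies,
  # renewals stop and the key expires at most TTL seconds later.
  def renew
    @redis.expire(@key, TTL)
  end

  def release
    @redis.del(@key)
  end
end
```

The MergeWorker would acquire the lease before merging, renew it on a timer while the merge runs, and release it in an `ensure`; a hard-killed worker simply stops renewing and the lease expires on its own.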
@DouweM How is the merge request subsequently unlocked if a worker fails? Another worker cannot be triggered if the merge request is locked (the UI won't allow it).
@dblessing I think there are two cases to consider:
1. When Sidekiq retries a MergeWorker due to a graceful shutdown
2. When the first MergeWorker fails completely (e.g. the system shuts down while the merge is in progress) and no second worker runs
We see the first case quite often in the logs, and the retry fails as you described since the MR is locked. @DouweM's suggestion would fix this by having the second MergeWorker take over the MR if the "heartbeat" lock hasn't been touched for some time.
Right now the second case is recovered "automatically" by MergeRequestController if the MR has been locked for more than 24 hours.
Another idea: we could also save the Sidekiq JID in the MergeRequest model. If an MR has been locked for more than 10 minutes, for example, we could have a task that searches the Sidekiq queue for that JID. If none exists, we could attempt to retry.
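A sketch of that JID check using Sidekiq's public API (the wrapping method is hypothetical, and the worker payload format differs slightly between Sidekiq versions, hence the defensive parse):

```ruby
require 'json'
require 'sidekiq/api'

# Returns true if Sidekiq still knows about the job: queued, scheduled,
# waiting to retry, or currently being worked on. If this returns false for
# a locked MR, the merge job is presumed dead and the lock can be released.
def merge_job_alive?(jid, queue: 'default')
  return false if jid.nil? || jid.empty?

  queued    = Sidekiq::Queue.new(queue).any? { |job| job.jid == jid }
  retrying  = Sidekiq::RetrySet.new.any?     { |job| job.jid == jid }
  scheduled = Sidekiq::ScheduledSet.new.any? { |job| job.jid == jid }
  running   = Sidekiq::Workers.new.any? do |_process, _thread, work|
    payload = work['payload']
    payload = JSON.parse(payload) if payload.is_a?(String)
    payload['jid'] == jid
  end

  queued || retrying || scheduled || running
end
```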
@stanhu That's an interesting idea, although 10 minutes is quite a long time before we realize a merge didn't go through. Doing it any more often would increase the load, though.
@stanhu Why not make that part of the retry mechanism instead of having a separate task? If I got it right, it looks like a slightly different lock to me - if the MergeRequest has no JID, or it has a JID that's not in the queue, then retry/merge... Not even sure we need the locked transition with that column 🤔