During a merge, the MergeService acquires a lock on the merge request, and the lock is released at the end of the block. However, if the merge fails partway through (for example, because Sidekiq kills jobs and re-enqueues them), the lock is never released. The re-enqueued job runs on the next worker that picks it up, but because the lock is still held it silently finishes without merging.
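Simplified, the pattern looks something like this (a sketch only; the method names are placeholders, not the actual MergeService code):

```ruby
# Sketch of the current flow: the lock is only released at the very end,
# so a job killed mid-merge leaves the MR locked.
class MergeService
  def execute(merge_request)
    merge_request.lock_mr          # assumed state transition: opened -> locked

    merge_request.merge!           # <- if Sidekiq kills the job here...
    merge_request.mark_as_merged

    merge_request.unlock_mr        # ...this never runs and the lock dangles
  end
end
```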
cc/ @jameslopez This one might be difficult. Can you think of anything we can do to properly release the lock if the job gets re-enqueued?
cc/ @jacobvosmaer @stanhu Related to recent Sidekiq issues with the customer. FYI.
If I understood this correctly, what comes to mind - and this is without checking the code - is that this is normally handled with an `ensure` block in Ruby that does the cleanup if there is a failure, so we could release the lock at that point.
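Roughly what I have in mind, without having looked at the real code (names are placeholders):

```ruby
def execute(merge_request)
  merge_request.lock_mr
  merge_request.merge!
ensure
  # Runs on success, on raised errors, and on most graceful shutdowns,
  # but not on a hard kill (SIGKILL) -- which is the remaining gap
  # discussed below.
  merge_request.unlock_mr if merge_request.locked?
end
```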
If that is not possible, i.e. we couldn't capture the kill signal from Sidekiq or whatever else killed the job, we could ignore those jobs in Sidekiq first and create a cleanup job that releases the lock for failed jobs so they can be picked up by a worker. This can also get clever: for example, we may not want to retry a job that keeps failing indefinitely (i.e. we could set max retries, etc.), otherwise the job will keep running and using resources forever. (This is most likely handled by Sidekiq anyway.)
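A rough sketch of what such a cleanup job could look like (the class name, the `locked_at` column, the `with_state` scope, and the two-hour threshold are all assumptions for illustration; it presumes a Rails + Sidekiq setup):

```ruby
class StuckMergeCleanupWorker
  include Sidekiq::Worker

  STALE_AFTER = 2.hours

  def perform
    # Find MRs that have been stuck in the locked state for too long
    # and release them so a regular worker can pick them up again.
    MergeRequest.with_state(:locked)
                .where('locked_at < ?', STALE_AFTER.ago)
                .find_each do |merge_request|
      merge_request.unlock_mr
      # A retry of the merge could be enqueued here as well; Sidekiq's own
      # max-retries setting keeps a permanently failing job from looping forever.
    end
  end
end
```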
What if we had a time-bounded lock? For example, if after 24 hours the lock is still there, we automatically invalidate it. This would at least make the system recover over time.
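For illustration, the check could live on the model and be consulted wherever we currently refuse to start a merge because of the lock (the `locked_at` column and both method names are assumptions):

```ruby
class MergeRequest < ActiveRecord::Base
  LOCK_TTL = 24.hours

  # Trust the lock only for LOCK_TTL; after that, treat it as stale.
  def merge_lock_expired?
    locked? && locked_at && locked_at < LOCK_TTL.ago
  end

  # A new MergeWorker may proceed if there is no lock or the lock is stale.
  def lock_allows_merge?
    !locked? || merge_lock_expired?
  end
end
```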
Oh, I guess there is a check for an outdated lock. It sounds like the goal would be to make the worker pick up a failed merge request attempt. There will always be a case where the ensure block won't run, and we'll be stuck with a dangling lock. However, if Sidekiq requeues the task, that must mean it detected some error in the worker. We can hook an error handler to that to at least clear the state. Would that work?
Ideally, I think we would want the killed job to have a chance to clean up (e.g. unlock the MR in this case). Let's assume that doesn't run for some reason. What else can we do?
When the next retried MergeWorker job runs, there should be a way to verify that the lock is no longer valid and that it should go ahead and run. How do we do that? Do we need some sort of heartbeat so that the second MergeWorker can wait for X amount of time before declaring the previous job dead?
@stanhu We could have the MergeWorker acquire a lock for 15 seconds and re-acquire it before it expires, say every 10 seconds, for as long as the merge needs. The next MergeWorker would then get a chance at most 15 seconds after the first MergeWorker has died.
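As a sketch of that heartbeat lock, using a plain Redis key with a TTL (the class, key name, and client wiring are illustrative, not existing code):

```ruby
require 'redis'

class MergeLease
  TTL            = 15 # seconds the lock stays valid without a renewal
  RENEW_INTERVAL = 10 # seconds between renewals while the merge runs

  def initialize(merge_request_id, redis: Redis.new)
    @key   = "merge_request:#{merge_request_id}:merge_lease"
    @redis = redis
  end

  # Only the first caller gets the lease; SET with NX + EX is atomic.
  def acquire(jid)
    @redis.set(@key, jid, nx: true, ex: TTL)
  end

  # Call this every RENEW_INTERVAL seconds while merging. If the worker dies,
  # renewals stop and the key expires at most TTL seconds later.
  def renew
    @redis.expire(@key, TTL)
  end

  def release
    @redis.del(@key)
  end
end
```

The MergeWorker would acquire the lease before merging, renew it on a timer while the merge runs, and release it in an `ensure`; a hard-killed worker simply stops renewing and the lease expires on its own.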
@DouweM How is the merge request subsequently unlocked if a worker fails? Another worker cannot be triggered if the merge request is locked (the UI won't allow it).
@dblessing I think there are two cases to consider:
1. When Sidekiq retries a MergeWorker due to a graceful shutdown
2. When the first MergeWorker fails completely (e.g. the system shuts down while the merge is in progress) and no second worker runs
We see the first case quite often in the logs, and the retry fails as you described since the MR is locked. @DouweM's suggestion would fix this by having the second MergeWorker take over the MR if the "heartbeat" lock hasn't been touched for some time.
Right now the second case is recovered "automatically" by MergeRequestController if the MR has been locked for more than 24 hours.
Another idea: we could also save the Sidekiq JID in the MergeRequest model. If an MR has been locked for more than 10 minutes, for example, we could have a task that searches the Sidekiq queue for that JID. If none exists, we could attempt to retry.
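A sketch of that JID check using Sidekiq's public API (the wrapping method is hypothetical, and the worker payload format differs slightly between Sidekiq versions, hence the defensive parse):

```ruby
require 'json'
require 'sidekiq/api'

# Returns true if Sidekiq still knows about the job: queued, scheduled,
# waiting to retry, or currently being worked on. If this returns false for
# a locked MR, the merge job is presumed dead and the lock can be released.
def merge_job_alive?(jid, queue: 'default')
  return false if jid.nil? || jid.empty?

  queued    = Sidekiq::Queue.new(queue).any? { |job| job.jid == jid }
  retrying  = Sidekiq::RetrySet.new.any?     { |job| job.jid == jid }
  scheduled = Sidekiq::ScheduledSet.new.any? { |job| job.jid == jid }
  running   = Sidekiq::Workers.new.any? do |_process, _thread, work|
    payload = work['payload']
    payload = JSON.parse(payload) if payload.is_a?(String)
    payload['jid'] == jid
  end

  queued || retrying || scheduled || running
end
```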
@stanhu That's an interesting idea, although 10 minutes is quite a long time before we realize a merge didn't go through. Doing it any more often would increase the load, though.
@stanhu Why not make that part of the retry mechanism instead of having a separate task? If I got it right, it looks like a slightly different lock to me - if the MergeRequest has no JID, or it has a JID that's not in the queue, then retry/merge... Not even sure we need the locked transition with that column 🤔