Improve handling of stuck mirrors
Extracted from https://gitlab.com/gitlab-org/gitlab-ee/issues/3021#note_36460320
We treat any mirror that spends more than 20 minutes on the Sidekiq queue (`scheduled`) as stuck, mark it as failed, and set it to be executed again ASAP, even if the job is still on the queue and just a little delayed. That can happen at any time depending on load, or when we deploy and pause Sidekiq. This cleanup of "stuck" mirrors can happen multiple times if we're processing fewer than 150 mirror workers per 20 minutes, potentially resulting in hundreds of projects being marked as failed while their workers are still on the queue.

These "stuck" workers get their next execution timestamp set to "right now", which means they skip the line and are rescheduled ahead of any mirrors that were already in the queue, blocking them from running. Since the workers are still on the queue and don't check whether their mirror is still in the `scheduled` state when they run, all of them happily update their project, which of course results in even longer delays, and even more projects getting marked as failed and skipping the line, exacerbating the problem...

Earlier today we had 19696 mirrors, out of about 24k, that had their `import_error` set to `The mirror update took too long to complete.`, which means that at some point they were marked as "stuck". If we hadn't marked these mirrors that were still on the queue as failed, the Sidekiq queue wouldn't have grown beyond 150, these mirrors would still eventually have run, and so would the projects that were in line behind them.
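To make the false positive concrete, here is a minimal sketch of a time-only stuck check. It is not GitLab's actual code: `Mirror`, `STUCK_AFTER`, `time_based_stuck?`, and the `queue` array are hypothetical stand-ins, and the queue is simulated with a plain array of job IDs.

```ruby
# Hypothetical model, not GitLab's actual classes.
Mirror = Struct.new(:id, :state, :scheduled_at, :jid, keyword_init: true)

STUCK_AFTER = 20 * 60 # seconds; the 20-minute threshold described above

# Time-only check: looks at how long the mirror has been scheduled,
# never at whether its job is still waiting on the queue.
def time_based_stuck?(mirror, now)
  mirror.state == :scheduled && (now - mirror.scheduled_at) > STUCK_AFTER
end

now = Time.at(1_000_000)
queue = ['jid-1'] # job IDs still waiting on the simulated queue

# A mirror whose job is merely delayed by 25 minutes.
delayed = Mirror.new(id: 1, state: :scheduled, scheduled_at: now - 25 * 60, jid: 'jid-1')

flagged = time_based_stuck?(delayed, now)
still_queued = queue.include?(delayed.jid)
puts "flagged as stuck: #{flagged}, job still queued: #{still_queued}"
```

In this sketch the delayed mirror is flagged even though its job ID is still on the queue, which is the situation that leads to the mirror being marked as failed and rescheduled while the first worker is still waiting to run.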
We should:
- Detect stuck mirrors using the same mechanism we use to detect stuck initial imports, which actually checks whether the job ID is still on the Sidekiq queue. A mirror that is still queued is never stuck. We need to set the `SidekiqStatus` timeout appropriately, of course.
- Store that JID in the DB at schedule time, not at start time (like initial imports do right now)
- Use that mechanism for forks too, for consistency
- Treat projects that are in state `started` but don't have a JID as stuck
- Fail early from import/fork/mirror workers when state is not actually `scheduled` (or `none`?)
- Reset `import_error` at finish, NOT at start, since we still want to display it during an update
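
The proposals above could fit together roughly as in the following sketch. It is an illustration under stated assumptions, not the real implementation: `Project`, `JobTracker`, `schedule`, `stuck?`, and `run_worker` are hypothetical names, the `import_status` and `import_jid` fields are assumed, and `JobTracker` is an in-memory stand-in for whatever job-ID lookup `SidekiqStatus` provides.

```ruby
require 'set'

# Hypothetical stand-ins, not GitLab's actual models or its SidekiqStatus API.
Project = Struct.new(:id, :import_status, :import_jid, :import_error, keyword_init: true)

# Simulated record of job IDs that are still queued or running.
class JobTracker
  def initialize
    @alive = Set.new
  end

  def track(jid)
    @alive << jid
  end

  def finish(jid)
    @alive.delete(jid)
  end

  def alive?(jid)
    @alive.include?(jid)
  end
end

# Store the JID at schedule time so the stuck check can look it up
# while the job is still waiting on the queue.
def schedule(project, jid, tracker)
  project.import_status = :scheduled
  project.import_jid = jid
  tracker.track(jid)
end

# A project is stuck only if its job is no longer queued or running.
# A started project without a JID is also treated as stuck.
def stuck?(project, tracker)
  return false unless %i[scheduled started].include?(project.import_status)
  return project.import_status == :started if project.import_jid.nil?

  !tracker.alive?(project.import_jid)
end

# Worker body: fail early unless the project is still scheduled, and
# clear the previous error only after the update has finished.
def run_worker(project, tracker)
  return :skipped unless project.import_status == :scheduled

  project.import_status = :started
  yield project if block_given? # the actual update would happen here
  project.import_status = :finished
  project.import_error = nil
  tracker.finish(project.import_jid)
  :done
end

tracker = JobTracker.new
project = Project.new(id: 1, import_status: :none,
                      import_error: 'The mirror update took too long to complete.')
schedule(project, 'jid-1', tracker)
puts stuck?(project, tracker)      # false: the job is still queued
puts run_worker(project, tracker)  # done
```

With the JID stored at schedule time, a delayed but still-queued job is never reported as stuck, and a worker that finds its project in any state other than `scheduled` returns without touching it. Whether `none` should also be accepted by the early-exit check is left open here, as it is in the list above.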