Stop using sidekiq-cron to schedule all project mirror pulls and move to adaptive scheduling

What does this MR do?

Are there points in the code the reviewer needs to double check?

Why was this MR needed?

Screenshots (if relevant)

Checklist

  • Repurpose UpdateAllMirrorsWorker to start every minute and obtain an exclusive lease.

  • Add a separate table to track mirror state so we can remove it from projects. Fields:

    • project_id

    • next_execution_ts: initially set to now()

    • last_update_started_at: the time the last update started. This will be used, along with last_updated_at, to calculate the cost.

    • retry: how many times the update has failed, used to back off failing projects so they are spaced out from successful ones

  • Fill this table with the projects that have mirroring enabled. Later on, as a sanity check, we will need to verify that all of them are there by selecting the projects that have mirroring enabled but are not in that table.

  • Add an integer record in redis to track capacity (MIRROR_PULL_CAPACITY), then define how many jobs we want to have scheduled at any given time.

  • Run a scheduler process, separate from rails or sidekiq (we may be using sidekiq here, but we need to find a way of having only one of these running at any time, not ruled by cron).
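The table row and the capacity counter from the checklist could be modelled roughly as follows. This is a minimal sketch in plain Ruby, not the actual implementation: MirrorState, CapacityCounter, new_mirror_state and the status field are hypothetical names, and the in-memory counter only stands in for the redis integer MIRROR_PULL_CAPACITY.

```ruby
# Hypothetical in-memory model of one row of the mirror state table.
# Field names follow the checklist; `status` is an assumption taken
# from the scheduler steps further down.
MirrorState = Struct.new(
  :project_id,
  :next_execution_ts,
  :last_update_started_at,
  :retry,
  :status,
  keyword_init: true
)

# Stand-in for the redis integer MIRROR_PULL_CAPACITY. A real version
# would use redis INCR/DECR so the count is shared across processes.
class CapacityCounter
  attr_reader :value

  def initialize(max)
    @max = max
    @value = 0
    @lock = Mutex.new
  end

  # Bumps the counter and returns true if there is room, false otherwise.
  def try_incr
    @lock.synchronize do
      return false if @value >= @max

      @value += 1
      true
    end
  end

  # Releases one unit of capacity; never drops below zero.
  def decr
    @lock.synchronize { @value = [@value - 1, 0].max }
  end
end

# Builds the initial row for a project: due immediately, no failures yet.
def new_mirror_state(project_id, now: Time.now)
  MirrorState.new(
    project_id: project_id,
    next_execution_ts: now,
    last_update_started_at: nil,
    retry: 0,
    status: nil
  )
end
```

Capping the counter inside try_incr keeps the "until we hit the max mirror pull capacity" check in one place, so callers only need to look at the return value.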

Scheduler Process:

  • On start:

    • Set MIRROR_PULL_CAPACITY to 0.

    • Perform a sanity check inserting all the mirror processes that are missing. (I do not think this is needed)

    • Start execution.

    • Set up ruby to catch failing exceptions and signals and log them.

  • On execution:

    • Select jobs ordered by next_execution_ts where that time is earlier than now() and the status is not 'STARTED' (I do not think we need the scheduled state), limited to the remaining capacity (max - MIRROR_PULL_CAPACITY) plus a bit more in case room frees up.

    • For each job in the list, while there is MIRROR_PULL_CAPACITY left:

      • Add it to the sidekiq queue so it gets executed.

      • INCR MIRROR_PULL_CAPACITY by one until we hit the max mirror pull capacity.

      • Set the status of the job to 'SCHEDULED' (I do not think this will be needed).

      • Add one to the scheduled jobs counter for prometheus.

    • Select jobs, ordered by next_execution_ts, whose state is 'ERROR' or 'RUNNING' and whose next_execution_ts is older than an arbitrary cutoff (whatever grace period we give the jobs to complete, maybe an hour?). (I think the current implementation of finding stuck mirrors already does this well, but we can reduce the time from 2 hours to 1.)

    • For each failed job in the list:

      • Set the next_execution_ts as follows (we treat ERROR and RUNNING as the same thing):

      • Increase retry by 1 and set next_execution_ts to roughly (backoff_period + jitter) * cost * retry.

      • Add one to the rescheduled jobs counter for prometheus.

    • Sleep for an arbitrary time, or for the max job cost we found while scheduling. If we found nothing to schedule, peek at the next scheduled job and wait until that time + cost + some jitter. (If we are still using crontab we won't need this.)

  • Process stopping on exception:

    • Add one to the process aborted counter for prometheus.

    • Supervisor (runit?) starts the process again. (If using sidekiq crontab we won't need this.)

  • Running job in sidekiq:

    • Pick the job and start counting time.

    • Add one to the processes being executed counter for prometheus.

    • Process the pull.

    • On success:

      • Add one to the process successfully finished counter for prometheus.

      • DECR MIRROR_PULL_CAPACITY

      • Update the row, setting the cost to how long it took to run and the next_execution_ts to something like (backoff_period + jitter) * cost.

    • On failure:

      • DECR MIRROR_PULL_CAPACITY

      • [x] Add one to the process error counter for prometheus.
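One scheduling pass and the capacity release at the end of a job could look roughly like the sketch below. It is plain Ruby with in-memory stand-ins, written under several assumptions: Job, MAX_CAPACITY, schedule_pass and finish_job are hypothetical names, an array stands in for the sidekiq queue, an integer stands in for MIRROR_PULL_CAPACITY, and the 'FINISHED' status is invented for illustration. Jobs are marked 'STARTED' as soon as they are enqueued, which skips the optional 'SCHEDULED' state.

```ruby
# Hypothetical row; not GitLab's actual model.
Job = Struct.new(:project_id, :next_execution_ts, :status, keyword_init: true)

MAX_CAPACITY = 2 # assumed value; the proposal makes this configurable

# One scheduling pass: pick due jobs, oldest first, and enqueue them
# while capacity remains. `used` plays the role of MIRROR_PULL_CAPACITY.
# Returns the updated capacity count.
def schedule_pass(jobs, queue, used, now)
  due = jobs
    .select { |j| j.next_execution_ts < now && j.status != 'STARTED' }
    .sort_by(&:next_execution_ts)

  due.each do |job|
    break if used >= MAX_CAPACITY

    queue << job.project_id # stands in for pushing to sidekiq
    used += 1               # stands in for INCR MIRROR_PULL_CAPACITY
    job.status = 'STARTED'  # simplification: no separate 'SCHEDULED' state
  end
  used
end

# What the worker does when a pull ends, successfully or not:
# record the outcome, set the next run, and release one unit of
# capacity (DECR MIRROR_PULL_CAPACITY). The proposal leaves failure
# rescheduling to the scheduler; it is folded in here for brevity.
def finish_job(job, used, success, next_ts)
  job.status = success ? 'FINISHED' : 'ERROR' # 'FINISHED' is hypothetical
  job.next_execution_ts = next_ts
  [used - 1, 0].max
end
```

Because capacity is released on both success and failure, a later pass can pick up the next due job as soon as any running pull ends.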

Other observations:

  • Set a max cost + jitter so we don't punish highly active projects too much.

  • Add a metric for the age of the first project that gets scheduled, to check how far behind we are.

  • Max mirror pull capacity can be configured in the web UI.

  • [x] Backoff period needs to be larger than the worker's default lease expiration, so that one project isn't at the front of the queue in two subsequent scheduled runs.

  • Punish a project for an empty pull, just like a failed pull, since it probably won't have anything next time either. Use a max punish time, of course.
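The rough formula (backoff_period + jitter) * cost * retry, together with the cap suggested above, can be sketched as follows. The constants, the next_delay name and the jitter range are assumptions for illustration only. The cost is taken as the duration of the last pull in seconds and used as a plain multiplier, as in the formula.

```ruby
BACKOFF_PERIOD = 60.0 # seconds; assumed value
MAX_DELAY = 3600.0    # seconds; assumed cap (the "max punish time")

# Delay before the next pull, following the rough formula
# (backoff_period + jitter) * cost * retry, capped at MAX_DELAY.
# A retry count of 0 (a successful pull) is treated as a multiplier
# of 1, which reduces to (backoff_period + jitter) * cost.
def next_delay(cost:, retry_count: 0, jitter: rand * 10.0)
  multiplier = [retry_count, 1].max
  delay = (BACKOFF_PERIOD + jitter) * cost * multiplier
  [delay, MAX_DELAY].min
end
```

A caller would add the returned delay to the current time to get next_execution_ts. To punish an empty pull like a failed one, the caller would pass an incremented retry_count in both cases.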

Does this MR meet the acceptance criteria?

What are the relevant issue numbers?

Closes https://gitlab.com/gitlab-org/gitlab-ce/issues/29218

Edited by username-removed-117638