Stop using sidekiq-cron for scheduling all project mirror pulls and move to adaptive scheduling
Description
Our current implementation of mirror pulls is far too naive and is torturing the systems: it piles up a lot of jobs at once and uses all the available capacity until it flushes the queue, and then the clock ticks again and we push the hamster back into the wheel, hour after hour after hour.
This has a big impact on the performance of the application, creates huge spikes of work, and sometimes we even miss performing the task entirely.
You can even see how the end of the sidekiq job execution impacts the whole fleet. My theory is that the most costly processes fall to the end of the run because they take too long and our current execution orders by last execution time, so all the load naturally piles up at the tail.
On top of this, a feature was added that allows setting up a mirror to run every 15 minutes in an effort to make it more responsive, which would just take GitLab.com down or block the queues completely.
Additionally, the way we manage the mirrors is generating a lot of table bloat, because every time we test a mirror we update it, and the projects table is one of the tables that shows up when we hit database contention, because this is happening all the time.
Proposal
Stop using crontab for jobs like this and move to a continuous scheduling process that measures the cost of execution and turns the work into a constant stream instead of a spike, in line with the initially proposed idea of moving to an event sourcing approach instead of constant polling.
In general I was imagining that something like this would make sense for this particular case (the git pull) where we do not have an event at all:
- We add a separate table to track mirror state so we can remove it from projects (see the migration sketch after this list). Fields are:
  - `project_id`
  - `next_execution_ts`: originally set to `now()`
  - `cost`: the time it took last time to perform the mirror operation
  - `retry`: to signal how many times it failed, used to back off and space failing projects out from the successful ones
  - `status`: the scheduling state used below ('SCHEDULED', 'RUNNING', 'ERROR')
- We fill this table with the projects that have mirroring enabled. Later on we will need to check, just for sanity, that we have all of them there by selecting the projects that have mirroring enabled but are not in that table.
- We add an integer record in Redis (`MIRROR_PULL_CAPACITY`) to track the capacity in use, and we define how many jobs we want to have scheduled at any given time.
- We run a scheduler process, separate from Rails or Sidekiq (we may end up using Sidekiq here, but we need to find a way of having only one of these running at any time, not ruled by cron).
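To make the shape of this concrete, here is a minimal sketch of the migration and the backfill/sanity check. All names are assumptions (`project_mirror_schedules`, `ProjectMirrorSchedule`, a `mirror` boolean flag on projects), and the proposal's `retry` field is called `retry_count` in code because `retry` is a Ruby keyword:

```ruby
# Hypothetical migration; table and column names are placeholders.
class CreateProjectMirrorSchedules < ActiveRecord::Migration
  def change
    create_table :project_mirror_schedules do |t|
      t.references :project, null: false, index: { unique: true }
      t.datetime :next_execution_ts, null: false # originally set to now()
      t.integer :cost, null: false, default: 0 # seconds the last pull took
      t.integer :retry_count, null: false, default: 0 # consecutive failures
      t.string :status # nil, 'SCHEDULED', 'RUNNING' or 'ERROR'
    end
    add_index :project_mirror_schedules, :next_execution_ts
  end
end

# Backfill, doubling as the later sanity check: insert a row for every
# mirrored project that is not in the table yet.
Project.where(mirror: true)
       .where('projects.id NOT IN (SELECT project_id FROM project_mirror_schedules)')
       .find_each do |project|
  ProjectMirrorSchedule.create!(project: project, next_execution_ts: Time.now.utc)
end
```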
Scheduler Process (a code sketch follows this section):
- On start:
- We set MIRROR_PULL_CAPACITY to 0.
- We perform a sanity check, inserting rows for all the mirrored projects that are missing.
- We start execution.
- We set up Ruby to catch any failing exception or signal, and to log it.
- On execution:
- we select jobs ordered by `next_execution_ts`, where this time is less than `now()` and the status is not 'SCHEDULED', limited by the remaining capacity (max - MIRROR_PULL_CAPACITY) plus a bit more just in case there is clear room.
- for each job in the list, and while there is MIRROR_PULL_CAPACITY left:
  - we add it to the sidekiq queue so it gets executed.
  - we INCR MIRROR_PULL_CAPACITY by one, until we hit the max mirror pull capacity.
  - we set the status of the job to 'SCHEDULED'.
  - we add one to the scheduled jobs counter for prometheus.
- we select jobs ordered by `next_execution_ts` whose state is 'ERROR' or 'RUNNING' and whose `next_execution_ts` is more than a grace period in the past (whatever grace period we give the jobs to be completed, maybe an hour?)
- for each failed job in the list (we consider both ERROR and RUNNING as the same thing):
  - we increase the retry by 1 and set the `next_execution_ts` to roughly now() + (backoff_period + jitter) * cost * retry.
  - we add one to the rescheduled jobs counter for prometheus.
- then we sleep, either for an arbitrary time or for the max job `cost` we found while scheduling; or, if we found nothing to schedule, we peek at the next scheduled job and wait until that time + cost + some jitter.
- On process stopping on exception:
- We add one to the process aborted counter for prometheus.
- The supervisor (runit?) starts the process again.
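A rough sketch of that loop, under the same assumptions as the migration above. `MirrorPullWorker`, the capacity maximum, the grace period and the backoff period are all made-up names and values, and the Prometheus counters are stubbed with a plain hash:

```ruby
# Hypothetical standalone scheduler loop; a sketch of the steps above,
# not production code.
require 'redis'

MAX_CAPACITY   = 100          # how many jobs we want in flight at once
CAPACITY_KEY   = 'MIRROR_PULL_CAPACITY'.freeze
GRACE_PERIOD   = 3600         # seconds a job may stay RUNNING/ERROR ("maybe an hour?")
BACKOFF_PERIOD = 5            # base backoff multiplier, arbitrary
COUNTERS       = Hash.new(0)  # stand-in for real Prometheus counters

redis = Redis.new
redis.set(CAPACITY_KEY, 0) # on start: nothing is in flight

begin
  loop do
    # Schedule due jobs, limited by remaining capacity plus a bit of slack.
    remaining = MAX_CAPACITY - redis.get(CAPACITY_KEY).to_i
    due = ProjectMirrorSchedule
          .where("next_execution_ts < now() AND (status IS NULL OR status <> 'SCHEDULED')")
          .order(:next_execution_ts)
          .limit([remaining, 0].max + 10)
          .to_a

    due.each do |job|
      break if redis.get(CAPACITY_KEY).to_i >= MAX_CAPACITY
      MirrorPullWorker.perform_async(job.project_id) # add it to the Sidekiq queue
      redis.incr(CAPACITY_KEY)
      job.update!(status: 'SCHEDULED')
      COUNTERS[:scheduled] += 1
    end

    # Reschedule jobs stuck in ERROR or RUNNING past the grace period,
    # backing off by cost and retry count.
    ProjectMirrorSchedule
      .where(status: %w[ERROR RUNNING])
      .where('next_execution_ts < ?', Time.now.utc - GRACE_PERIOD)
      .order(:next_execution_ts)
      .find_each do |job|
        job.retry_count += 1
        backoff = (BACKOFF_PERIOD + rand(0..30)) * job.cost * job.retry_count
        job.update!(status: nil, next_execution_ts: Time.now.utc + backoff)
        COUNTERS[:rescheduled] += 1
      end

    # Sleep for the max cost we just scheduled, or until the next job is due.
    if due.any?
      sleep([due.map(&:cost).max, 10].max)
    else
      nxt = ProjectMirrorSchedule.order(:next_execution_ts).first
      wake_in = nxt ? (nxt.next_execution_ts - Time.now.utc) + nxt.cost + rand(0..10) : 10
      sleep([wake_in, 1].max)
    end
  end
rescue StandardError, SignalException => e
  COUNTERS[:aborted] += 1
  warn "mirror scheduler aborted: #{e.class}: #{e.message}" # log it and die
  raise # the supervisor (runit?) restarts the process
end
```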
Running the job in Sidekiq (a worker sketch follows this list):
- We pick the job and start counting time.
- We add one to the processes being executed counter for prometheus.
- We process the pull.
- On success:
- We add one to the process successfully finished counter for prometheus.
- We DECR MIRROR_PULL_CAPACITY
- We update the row, setting the `cost` to how long the run took, and the `next_execution_ts` to something like now() + (backoff_period + jitter) * cost.
- On failure:
- We DECR MIRROR_PULL_CAPACITY
- We add one to the process error counter for prometheus.
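And a sketch of the Sidekiq side, with the same caveats; `perform_mirror_pull` is a placeholder for whatever service actually does the pull, and the constants are reused from the scheduler sketch for brevity:

```ruby
# Hypothetical Sidekiq worker for one mirror pull; the DECR lives in an
# ensure block so capacity is released on both success and failure.
class MirrorPullWorker
  include Sidekiq::Worker

  def perform(project_id)
    job = ProjectMirrorSchedule.find_by!(project_id: project_id)
    job.update!(status: 'RUNNING')
    COUNTERS[:executing] += 1
    started = Time.now.utc

    perform_mirror_pull(project_id) # placeholder for the actual git pull

    cost = (Time.now.utc - started).ceil
    backoff = (BACKOFF_PERIOD + rand(0..30)) * cost
    job.update!(status: nil, cost: cost,
                next_execution_ts: Time.now.utc + backoff)
    COUNTERS[:success] += 1
  rescue StandardError
    job&.update(status: 'ERROR')
    COUNTERS[:error] += 1
    raise
  ensure
    Sidekiq.redis { |conn| conn.decr('MIRROR_PULL_CAPACITY') }
  end
end
```

Putting the DECR in `ensure` releases capacity on both success and failure, as the steps above require; a worker that dies before reaching `ensure` would leak capacity, which is what the scheduler's grace-period resweep of 'RUNNING' jobs is for.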
cc/ @DouweM @tiagonbotelho for feedback on whether this makes sense.