I checked the Sidekiq queue as well and there are indeed a lot of UpdatePipelineWorker jobs, but also ProjectCacheWorker jobs. The latter worker had a 45-minute outage starting at 13:00 UTC today:
So apparently a job got queued that starved all the Sidekiq workers for 45 minutes (it was probably computationally heavy). This led to a huge backlog of jobs building up, which ended up blocking everything for a couple of hours.
The resolution was twofold:
- The job finished, so queued jobs started being processed again.
- We doubled the number of Sidekiq threads, which gave us much more capacity to work through the delayed jobs quickly.
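For reference, on an Omnibus install the thread count is what `sidekiq['concurrency']` controls (the same setting mentioned later in this thread); a minimal sketch, with an illustrative value rather than what we actually ran:

```ruby
# /etc/gitlab/gitlab.rb (Omnibus configuration, written in Ruby)
# Illustrative value only; pick a number that matches your hardware.
sidekiq['concurrency'] = 50
```

Apply it with `sudo gitlab-ctl reconfigure` and a Sidekiq restart.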
To figure out which workers were to blame for the outage I built a list of the workers that use the "default" queue (the one being blocked), and then wrote the following Ruby script to see which of those workers were processing data during the outage:
```ruby
require 'influxdb'
require 'time'

client = InfluxDB::Client.new(
  'gitlab',
  host:     'performance.gitlab.net',
  user:     'root',
  password: 'hunter2'
)

workers = [
  'AdminEmailWorker',
  'BuildCoverageWorker',
  'BuildEmailWorker',
  'BuildFinishedWorker',
  'BuildHooksWorker',
  'BuildSuccessWorker',
  'ClearDatabaseCacheWorker',
  'DeleteUserWorker',
  'ExpireBuildArtifactsWorker',
  'ExpireBuildInstanceArtifactsWorker',
  'GroupDestroyWorker',
  'ImportExportProjectCleanupWorker',
  'IrkerWorker',
  'MergeWorker',
  'NewNoteWorker',
  'PipelineHooksWorker',
  'PipelineProcessWorker',
  'PipelineSuccessWorker',
  'PipelineUpdateWorker',
  'ProjectCacheWorker',
  'ProjectDestroyWorker',
  'PruneOldEventsWorker',
  'RemoveExpiredGroupLinksWorker',
  'RemoveExpiredMembersWorker',
  'RepositoryArchiveCacheWorker',
  'RepositoryCheck::BatchWorker',
  'RepositoryCheck::ClearWorker',
  'RepositoryCheck::SingleRepositoryWorker',
  'RequestsProfilesWorker',
  'StuckCiBuildsWorker',
  'UpdateMergeRequestsWorker'
]

condition = workers.map { |worker| "action = '#{worker}#perform'" }.join(' OR ')

rows = client.query <<-EOF
SELECT SUM("count") AS amount
FROM downsampled.sidekiq_transaction_counts_per_action
WHERE time >= '2016-10-14 13:17:00'
AND time <= '2016-10-14 14:00:00'
AND (#{condition})
GROUP BY action;
EOF

rows.each do |row|
  puts row['tags']['action']
end
```
ProjectCacheWorker does a whole bunch of Git-related operations, so it's not unlikely for it to take quite some time. We also happen to schedule this worker after every push. We might need to break this worker up into smaller workers using separate queues.
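For illustration, Sidekiq lets a worker declare its own queue via `sidekiq_options`, so a split along these lines might look roughly like this (class, queue, and method names are hypothetical, not our actual code):

```ruby
# Hypothetical sketch of splitting the cache refresh into a smaller worker
# that runs on its own queue, so slow Git operations don't occupy threads
# meant for jobs on the "default" queue.
class ProjectCacheSizeWorker
  include Sidekiq::Worker

  # Dedicated queue for long-running repository size calculations.
  sidekiq_options queue: :project_cache

  def perform(project_id)
    project = Project.find(project_id)
    project.update_repository_size # assumed method, for illustration only
  end
end
```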
RC2 contains part of the CI changes for asynchronous processing. Ideally UpdatePipelineWorker should execute very quickly, but some operations that access the filesystem still happen while the lock is held, and I believe this is the main cause of the increased pressure on Sidekiq.
Looking at Grafana I see that in a lot of cases UpdatePipelineWorker is waiting on the FS while holding the lock, and other workers are waiting on the lock itself.
Since this is also not the first time we've done heavy-duty operations on the filesystem, I made an MR (still a proof of concept) to prevent such things from happening in the future, as it is tempting to put everything in the state machine transaction: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/6894 :)
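To give a rough idea of the pattern (illustrative code only, not the actual diff in that MR): keep FS work out of the transition itself and enqueue it once the transition has committed.

```ruby
# Illustrative only. The idea is to avoid touching the filesystem while a
# state_machine transition (and the DB transaction/lock around it) is in
# progress, and to defer that work to a background job instead.
class Pipeline < ActiveRecord::Base
  state_machine :status, initial: :pending do
    event :succeed do
      transition running: :success
    end

    # FS/Git work done inside the transition would hold the lock while
    # waiting on the filesystem; enqueueing it afterwards keeps the
    # transition fast.
    after_transition running: :success do |pipeline|
      PipelineSuccessWorker.perform_async(pipeline.id)
    end
  end
end
```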
Guys, I'm loving the "Everything in the open" mode of work you're employing here; particularly that you've maintained this mode of working even during an outage which led to a partial loss of service. That the subsequent response time was speedy, the analysis was excellent, and that issues to prevent it happening again in the future were immediately created, is just more props to all of you. Great work 💯👍
@dchambers @sytses Agree! And there's a second benefit to the "Everything in the open" approach: I'm building a big Ruby project, and I'm learning a lot from the GitLab project. This kind of issue helps me learn a lot about Sidekiq (which we are using for the first time), and about Ruby as well. Thank you so much!
Killed the query two more times with postgres insisting upon starting it back up every time. I'm currently letting it run with an eye on the database and queues.
@northrup It seems like this is happening again. Pushes are taking over 30 minutes to be reflected in MRs and have builds start for them. Same with pipeline stages starting after the previous stage finishes.
@rabbitfang we are having some massive problems with the file system at the moment, so any activity that touches the FS on GitLab.com right now is being delayed. I apologize. We've got the status and issue for it up on the GitLab Status Page.
We just had another occurrence of this when we spiked up to 30k queued jobs.
The resolution was to kill Sidekiq on all the workers with `sudo gitlab-ctl kill sidekiq`; this unlocked the jobs and allowed the rest of the queue to be slowly processed.
My gut feeling is that we have some process that just hangs the threads forever and they lock up, so the only resolution is to drop those jobs.
A valid solution to think about here would be some form of timeout while we are processing a job. I don't know how, or whether it is possible in Sidekiq at all, but we should gracefully unlock long-running threads that are not allowing anything else to run.
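Something along these lines might work at the worker level, assuming we accept the well-known caveats of Ruby's `Timeout` (it can interrupt code at unsafe points, so it's a last resort rather than a clean solution); names below are illustrative:

```ruby
require 'timeout'

# Hypothetical worker that aborts itself if a single job runs too long,
# freeing the thread for other jobs. Not a built-in Sidekiq feature.
class LongRunningWorker
  include Sidekiq::Worker

  JOB_TIMEOUT = 15 * 60 # seconds

  def perform(project_id)
    Timeout.timeout(JOB_TIMEOUT) do
      do_the_actual_work(project_id) # placeholder for the real job body
    end
  rescue Timeout::Error
    Sidekiq.logger.warn("Job timed out for project #{project_id}")
  end
end
```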
It looks like the Sidekiq problem is just a result of the FS problems, since both ProjectCacheWorker and UpdateMergeRequestsWorker actively use repositories.
Having separate queues on separate servers would make the situation a bit better. The key here is to have a separate process for every queue, so it could even be the same server; that option is easier to implement, of course. The queues could also be split into groups.
Current status is that we have a runbook for this and we narrowed the scope to the gitlab_shell queue.
The path forward seems to be that we need the ability to pick specific queues and spawn them in isolated processes, while evicting those queues from the main Sidekiq process. We are doing this manually for now and it has been working quite well.
So the solution affects both packaging and development.
FYI, we recently had the enqueued count stuck at 90K. Running `sudo gitlab-ctl kill sidekiq` and later upping `sidekiq['concurrency']` to 200 got it cleared out in a few hours. Is there a troubleshooting guide we can add these details to?
We don't use concurrency to solve this problem. Concurrency in Sidekiq means threads, and that only works well up to a certain point.
Given our scale we have a cluster of Sidekiq workers split by priority (https://gitlab.com/gitlab-com/infrastructure/issues/2070). For this we use sidekiq-cluster, a feature that ships with GitLab, which allows us to assign specific queues to specific workers. This way we separate what needs to be acted on quickly (pushes, for example) from what can happen down the line.
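For anyone self-hosting who wants to try a similar split, a rough sketch of the Omnibus side, assuming your GitLab version ships sidekiq-cluster and the `sidekiq_cluster[...]` settings (queue names are examples only, not our production layout):

```ruby
# /etc/gitlab/gitlab.rb -- illustrative sketch, not GitLab.com's actual config.
# Each string in queue_groups becomes its own Sidekiq process handling only
# the listed queues, so slow queues cannot starve latency-sensitive ones.
sidekiq_cluster['enable'] = true
sidekiq_cluster['queue_groups'] = [
  'post_receive,process_commit', # push processing that needs to stay fast
  'gitlab_shell'                 # the queue this issue narrowed in on
]
```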
For clarification, I was not offering this as a solution for GitLab.com but rather for others who may run across this in the future and not be running a Sidekiq cluster. Furthermore, it would be really beneficial if there were a better guide on how to fix a large Sidekiq queue.