I checked the Sidekiq queue as well and there are indeed a lot of UpdatePipelineWorker jobs, but also ProjectCacheWorker jobs. The latter worker had a 45-minute outage starting at 13:00 UTC today:
So apparently a job got queued that starved all the Sidekiq workers for 45 minutes (it was probably computationally heavy). This led to a huge backlog of jobs building up, which ended up blocking everything for a couple of hours.
The resolution was twofold:
- The job finished, so queued jobs started being processed again.
- We doubled the number of Sidekiq threads, which gave us much more capacity to work through the delayed jobs quickly.
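For reference, on an Omnibus install the thread count is what `sidekiq['concurrency']` controls (the same setting mentioned later in this thread); a minimal sketch, with an illustrative value rather than what we actually ran:

```ruby
# /etc/gitlab/gitlab.rb (Omnibus configuration, written in Ruby)
# Illustrative value only; pick a number that matches your hardware.
sidekiq['concurrency'] = 50
```

Apply it with `sudo gitlab-ctl reconfigure` and a Sidekiq restart.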
To figure out which workers were to blame for the outage I built a list of the workers that use the "default" queue (the one being blocked), and then wrote the following Ruby script to see which of those workers were processing data during the outage:
```ruby
require 'influxdb'
require 'time'

client = InfluxDB::Client.new(
  'gitlab',
  host:     'performance.gitlab.net',
  user:     'root',
  password: 'hunter2'
)

workers = [
  'AdminEmailWorker',
  'BuildCoverageWorker',
  'BuildEmailWorker',
  'BuildFinishedWorker',
  'BuildHooksWorker',
  'BuildSuccessWorker',
  'ClearDatabaseCacheWorker',
  'DeleteUserWorker',
  'ExpireBuildArtifactsWorker',
  'ExpireBuildInstanceArtifactsWorker',
  'GroupDestroyWorker',
  'ImportExportProjectCleanupWorker',
  'IrkerWorker',
  'MergeWorker',
  'NewNoteWorker',
  'PipelineHooksWorker',
  'PipelineProcessWorker',
  'PipelineSuccessWorker',
  'PipelineUpdateWorker',
  'ProjectCacheWorker',
  'ProjectDestroyWorker',
  'PruneOldEventsWorker',
  'RemoveExpiredGroupLinksWorker',
  'RemoveExpiredMembersWorker',
  'RepositoryArchiveCacheWorker',
  'RepositoryCheck::BatchWorker',
  'RepositoryCheck::ClearWorker',
  'RepositoryCheck::SingleRepositoryWorker',
  'RequestsProfilesWorker',
  'StuckCiBuildsWorker',
  'UpdateMergeRequestsWorker'
]

condition = workers.map { |worker| "action = '#{worker}#perform'" }.join(' OR ')

rows = client.query <<-EOF
SELECT SUM("count") AS amount
FROM downsampled.sidekiq_transaction_counts_per_action
WHERE time >= '2016-10-14 13:17:00'
AND time <= '2016-10-14 14:00:00'
AND (#{condition})
GROUP BY action;
EOF

rows.each do |row|
  puts row['tags']['action']
end
```
ProjectCacheWorker does a whole bunch of Git-related operations, so it's not unlikely for it to take quite some time. We also happen to schedule this worker after every push. We might need to break this worker up into smaller workers using separate queues.
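For illustration, Sidekiq lets a worker declare its own queue via `sidekiq_options`, so a split along these lines might look roughly like this (class, queue, and method names are hypothetical, not our actual code):

```ruby
# Hypothetical sketch of splitting the cache refresh into a smaller worker
# that runs on its own queue, so slow Git operations don't occupy threads
# meant for jobs on the "default" queue.
class ProjectCacheSizeWorker
  include Sidekiq::Worker

  # Dedicated queue for long-running repository size calculations.
  sidekiq_options queue: :project_cache

  def perform(project_id)
    project = Project.find(project_id)
    project.update_repository_size # assumed method, for illustration only
  end
end
```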
RC2 contains part of the CI changes for asynchronous processing. Ideally UpdatePipelineWorker should execute very quickly, but some operations that access the filesystem still happen while the lock is held, and I believe this is the main cause of the increased pressure on Sidekiq.
Looking at Grafana I see that in a lot of cases UpdatePipelineWorker is waiting on the FS while holding the lock, and other workers are waiting on the lock itself.
Since this is also not the first time we've done heavy-duty operations on the filesystem, I made an MR (still a proof of concept) to prevent such things from happening in the future, as it is tempting to put everything in the state machine transaction: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/6894 :)
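To give a rough idea of the pattern (illustrative code only, not the actual diff in that MR): keep FS work out of the transition itself and enqueue it once the transition has committed.

```ruby
# Illustrative only. The idea is to avoid touching the filesystem while a
# state_machine transition (and the DB transaction/lock around it) is in
# progress, and to defer that work to a background job instead.
class Pipeline < ActiveRecord::Base
  state_machine :status, initial: :pending do
    event :succeed do
      transition running: :success
    end

    # FS/Git work done inside the transition would hold the lock while
    # waiting on the filesystem; enqueueing it afterwards keeps the
    # transition fast.
    after_transition running: :success do |pipeline|
      PipelineSuccessWorker.perform_async(pipeline.id)
    end
  end
end
```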
Guys, I'm loving the "Everything in the open" mode of work you're employing here; particularly that you've maintained this mode of working even during an outage which led to a partial loss of service. That the subsequent response time was speedy, the analysis was excellent, and that issues to prevent it happening again in the future were immediately created, is just more props to all of you. Great work 💯👍
@dchambers @sytses Agree! And there's a second benefit to the "Everything in the open" approach: I'm building a big Ruby project, and I'm learning a lot from the GitLab project. This kind of issue helps me learn a lot about Sidekiq (which we are using for the first time), and about Ruby as well. Thank you so much!
Killed the query two more times with postgres insisting upon starting it back up every time. I'm currently letting it run with an eye on the database and queues.
@northrup It seems like this is happening again. Pushes are taking over 30 minutes to be reflected in MRs and have builds start for them. Same with pipeline stages starting after the previous stage finishes.
@rabbitfang we are having some massive problems with the file system at the moment, so any activity that touches the FS on GitLab.com right now is being delayed. I apologize. We've got the status and issue for it up on the GitLab Status Page.
We just had another occurrence of this when we spiked up to 30k queued jobs.
The resolution was to kill Sidekiq on all the workers with `sudo gitlab-ctl kill sidekiq`; this unlocked the jobs and allowed the rest of the queue to be slowly processed.
My gut feeling is that we have some process that just hangs the threads forever and they lock up, so the only resolution is to drop those jobs.
A valid solution to think about here would be some form of timeout while we are processing a job. I don't know how, or whether it is possible in Sidekiq at all, but we should gracefully unlock long-running threads that are not allowing anything else to run.
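Something along these lines might work at the worker level, assuming we accept the well-known caveats of Ruby's `Timeout` (it can interrupt code at unsafe points, so it's a last resort rather than a clean solution); names below are illustrative:

```ruby
require 'timeout'

# Hypothetical worker that aborts itself if a single job runs too long,
# freeing the thread for other jobs. Not a built-in Sidekiq feature.
class LongRunningWorker
  include Sidekiq::Worker

  JOB_TIMEOUT = 15 * 60 # seconds

  def perform(project_id)
    Timeout.timeout(JOB_TIMEOUT) do
      do_the_actual_work(project_id) # placeholder for the real job body
    end
  rescue Timeout::Error
    Sidekiq.logger.warn("Job timed out for project #{project_id}")
  end
end
```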
It looks like the Sidekiq problem is just a result of the FS problems, since both ProjectCacheWorker and UpdateMergeRequestsWorker actively use repositories.
Having separate queues on separate servers would make the situation a bit better. The key here is to have a separate process for every queue, so it could even be the same server; that option is easier to implement, of course. The queues could also be split into groups.
Current status is that we have a runbook for this and we narrowed the scope to the gitlab_shell queue.
The path forward seems to be that we need the ability to pick specific queues and spawn them in isolated processes, while evicting those queues from the main Sidekiq process. We are doing this manually for now and it has been working quite well.
So the solution affects both packaging and development.
FYI, we recently had the enqueued count stuck at 90K. Running `sudo gitlab-ctl kill sidekiq` and later upping `sidekiq['concurrency']` to 200 got it cleared out in a few hours. Is there a troubleshooting guide we can add these details to?
We don't use concurrency to solve this problem. Concurrency in Sidekiq means threads, and that only works well up to a certain point.
Given our scale we have a cluster of Sidekiq workers split by priority (https://gitlab.com/gitlab-com/infrastructure/issues/2070). For this we use sidekiq-cluster, a feature that ships with GitLab, which allows us to assign specific queues to specific workers. This way we separate what needs to be acted on quickly (pushes, for example) from what can happen down the line.
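For anyone self-hosting who wants to try a similar split, a rough sketch of the Omnibus side, assuming your GitLab version ships sidekiq-cluster and the `sidekiq_cluster[...]` settings (queue names are examples only, not our production layout):

```ruby
# /etc/gitlab/gitlab.rb -- illustrative sketch, not GitLab.com's actual config.
# Each string in queue_groups becomes its own Sidekiq process handling only
# the listed queues, so slow queues cannot starve latency-sensitive ones.
sidekiq_cluster['enable'] = true
sidekiq_cluster['queue_groups'] = [
  'post_receive,process_commit', # push processing that needs to stay fast
  'gitlab_shell'                 # the queue this issue narrowed in on
]
```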
For clarification, I was not offering this as a solution for GitLab.com but rather for others who may run across this in the future and not be running a Sidekiq cluster. Furthermore, it would be really beneficial if there were a better guide on how to fix a large Sidekiq queue.