High number of Sidekiq jobs in use_key and git_garbage_collect workers

changed the description

Why is use_key growing? Is it because it's low-priority, so it pretty much never gets executed? This implementation hasn't really changed in months.

I think my real question here is maybe for @pcarranza and team: what can we do to validate the priorities we've set, and that the lowest-priority queues are getting worked on at least as fast as they are being added to? Is there an existing metric we should use, or can we add a new one?

This is a slight tangent, because they are scheduled in bulk, but: a background migration may create a million jobs. They should be lower priority than any other job, but we still expect them to be executed within a reasonable time frame (say, a month).

getting worked on at least as fast as they are being added to

I'm not talking about over a small sample period here, but more like a day+.

@smcgivern sidekiq execution at this stage is a massive mess where one queue can block the other, it's not built with this scale in mind at all and it keeps showing.

I would wait for https://gitlab.com/gitlab-com/infrastructure/issues/1945 to then review which queues are best-effort and which are the ones that actually need attention.

I think that the challenge then will be on the side of: how can we monitor the execution of real-time and best-effort queues in a way that makes sense without having to reverse engineer Redis to get visibility. Is there an issue to add sidekiq metrics now that we are starting to use Prometheus metrics in the application? Can we have metrics such as execution time in buckets per queue, the number of processed jobs per queue and the number of inserted jobs per queue?

Additionally, @smcgivern, is there a way of having sidekiq not executing a given queue?

I would like to be able of keep the best effort sidekiq fleet without running the queues that have specific workers, but I'm not sure if the application is capable of this.

Thanks @pcarranza. These are all definitely worth thinking about.

Additionally, @smcgivern, is there a way of having sidekiq not executing a given queue?

It can only execute a specific set of queues, but I take it you want the inverse? (--except queue1,queue2 or whatever.)

Is there an issue to add sidekiq metrics now that we are starting to use Prometheus metrics in the application? Can we have metrics such as execution time in buckets per queue, the number of processed jobs per queue and the number of inserted jobs per queue?

Not that I know of - @mydigitalself might have an idea.

Worth noting that after @stanhu dropped those queues, https://performance.gitlab.net/dashboard/db/sidekiq-stats?refresh=5m&orgId=1 got so much lighter that we thought it was broken.

@smcgivern

It can only execute a specific set of queues, but I take it you want the inverse? (--except queue1,queue2 or whatever.)

Correct, I want to exclude execution of the queues that have specific workers for them as a second step.

@smcgivern

Worth noting that after @stanhu dropped those queues, https://performance.gitlab.net/dashboard/db/sidekiq-stats?refresh=5m&orgId=1 got so much lighter that we thought it was broken.

That's exactly right, all the metrics we have is what is enqueued. When it's working perfectly fine nothing is enqueued. It's like measuring the space left on a wall instead of the size of a frame, it's indirect.

That's why I would like to have direct metrics on sidekiq jobs execution, to see direct metrics of how is it behaving, instead of the remaining load left to process.

I believe @bjk-gitlab is working on getting Sidekiq metrics into Prometheus: https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2535

Correct, I want to exclude execution of the queues that have specific workers for them as a second step.

@pcarranza at the moment, how are you planning to implement https://gitlab.com/gitlab-com/infrastructure/issues/1945? Is it by passing a single group of queues to bin/sidekiq-cluster, dependent on the host? And then you'd have another host (or set of hosts) which do 'everything else'?

I think we could implement an option to do what you want by parsing our sidekiq_queues.yml, and sending all of those excluding the ones passed to the native Sidekiq CLI (as our sidekiq-cluster script is a wrapper around that anyway).

I also wonder if we should define these groups of queues in the application itself, or would you like to be able to tweak them initially? I'm thinking about the case where we add a new queue which needs to be treated the same as an existing queue, for instance.

@smcgivern

Is it by passing a single group of queues to bin/sidekiq-cluster, dependent on the host? And then you'd have another host (or set of hosts) which do 'everything else'?

Yes, that's basically the idea, but the "everything else" is actually everything.

What we are doing is have a set of nodes for pure sidekiq, and sidekiq-cluster in a different set of nodes where we control what queues are realtime.

or would you like to be able to tweak them initially?

They change a lot from deployment to deployment, so I rather keep the control of the queues.

I created https://gitlab.com/gitlab-org/gitlab-ee/issues/2893!

So I can use 2 thumbs

I think we can close this?

Reopen if you disagree

closed

High number of Sidekiq jobs in use_key and git_garbage_collect workers

Designs

Child items ...

Activity

Admin message

Admin message

High number of Sidekiq jobs in use_key and git_garbage_collect workers

Activity