Yorick: @#$(*&#$(* somehow I overlooked that when reviewing the MR. To be fair, I think that `needed?` check really should be embedded in `execute`, otherwise we will keep making the same mistake.
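A minimal sketch of what "embed the check in `execute`" could look like; `needed?` and `execute` come from the MR under discussion, but the service class shape and method bodies here are hypothetical:

```ruby
# Hypothetical service: the guard lives inside `execute` itself, so a caller
# that forgets to check `needed?` can't re-introduce the same bug.
class ScheduleSomethingService
  def execute
    return unless needed?

    perform_work
  end

  private

  def needed?
    # whatever condition previously lived at the call site
    true
  end

  def perform_work
    # actual work goes here
  end
end
```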
I think my real question here is maybe for @pcarranza and team: what can we do to validate the priorities we've set, and that the lowest-priority queues are getting worked on at least as fast as they are being added to? Is there an existing metric we should use, or can we add a new one?
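For reference, one signal that already exists: Sidekiq's queue API exposes per-queue backlog size and latency (the age in seconds of the oldest waiting job), so a steadily growing latency on a low-priority queue would mean it is being filled faster than it is drained. The loop below is just an illustrative dump of those numbers, not a proposed metric:

```ruby
require 'sidekiq/api'

# Print backlog size and latency for every known queue.
Sidekiq::Queue.all.each do |queue|
  puts format('%-30s size=%-6d latency=%.1fs', queue.name, queue.size, queue.latency)
end
```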
This is a slight tangent, because they are scheduled in bulk, but: a background migration may create a million jobs. They should be lower priority than any other job, but we still expect them to be executed within a reasonable time frame (say, a month).
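To make the scale concrete, here is a hedged sketch of how such a migration might push a large batch of jobs onto a dedicated low-priority queue. The worker name, queue name, and batching are illustrative, not the actual GitLab implementation; `push_bulk` is Sidekiq's standard bulk-enqueue API:

```ruby
require 'sidekiq'

# Illustrative worker pinned to a dedicated low-priority queue.
class BackgroundMigrationWorker
  include Sidekiq::Worker
  sidekiq_options queue: :background_migration

  def perform(migration_class, start_id, end_id)
    # run one slice of the migration
  end
end

# Bulk scheduling: one push_bulk call enqueues many jobs without a Redis
# round trip per job, which is why a single migration can create ~1M jobs.
slices = (1..1_000_000).each_slice(10_000).map { |ids| ['MyMigration', ids.first, ids.last] }
slices.each_slice(1_000) do |batch|
  Sidekiq::Client.push_bulk('class' => BackgroundMigrationWorker, 'args' => batch)
end
```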
@smcgivern sidekiq execution at this stage is a massive mess where one queue can block another; it's not built with this scale in mind at all, and it keeps showing.
I think that the challenge then will be on the side of: how can we monitor the execution of real-time and best-effort queues in a way that makes sense, without having to reverse engineer Redis to get visibility? Is there an issue to add sidekiq metrics now that we are starting to use Prometheus metrics in the application? Can we have metrics such as execution time in buckets per queue, the number of processed jobs per queue, and the number of inserted jobs per queue?
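One possible way to get the processed-jobs and execution-time metrics (not necessarily what the issue would implement) is Sidekiq server middleware reporting to the prometheus-client gem. The metric names, label set, and buckets below are made up, and the snippet assumes the gem's current keyword-argument API; counting inserted jobs per queue would need a similar piece of client middleware on the enqueue side:

```ruby
require 'sidekiq'
require 'prometheus/client'

# Server middleware that records per-queue job counts and execution time.
class SidekiqPrometheusMiddleware
  def initialize
    registry = Prometheus::Client.registry
    @jobs_total = registry.counter(
      :sidekiq_jobs_processed_total,
      docstring: 'Processed Sidekiq jobs',
      labels: [:queue]
    )
    @job_duration = registry.histogram(
      :sidekiq_job_duration_seconds,
      docstring: 'Sidekiq job execution time',
      labels: [:queue],
      buckets: [0.1, 0.5, 1, 5, 30, 120, 600]
    )
  end

  def call(_worker, _job, queue)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    @jobs_total.increment(labels: { queue: queue })
    @job_duration.observe(elapsed, labels: { queue: queue })
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add SidekiqPrometheusMiddleware
  end
end
```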
Additionally, @smcgivern, is there a way of having sidekiq not execute a given queue?
I would like to be able to keep the best-effort sidekiq fleet without running the queues that have specific workers, but I'm not sure if the application is capable of this.
Thanks @pcarranza. These are all definitely worth thinking about.
> Additionally, @smcgivern, is there a way of having sidekiq not execute a given queue?
It can only execute a specific set of queues, but I take it you want the inverse? (`--except queue1,queue2` or whatever.)
> Is there an issue to add sidekiq metrics now that we are starting to use Prometheus metrics in the application? Can we have metrics such as execution time in buckets per queue, the number of processed jobs per queue, and the number of inserted jobs per queue?
That's exactly right: all the metrics we have are about what is enqueued. When things are working perfectly fine, nothing is enqueued. It's like measuring the space left on a wall instead of the size of the frame; it's indirect.
That's why I would like to have direct metrics on sidekiq job execution, to see directly how it is behaving, instead of only the remaining load left to process.
Correct, as a second step I want to exclude execution of the queues that have specific workers dedicated to them.
@pcarranza at the moment, how are you planning to implement https://gitlab.com/gitlab-com/infrastructure/issues/1945? Is it by passing a single group of queues to bin/sidekiq-cluster, dependent on the host? And then you'd have another host (or set of hosts) which do 'everything else'?
I think we could implement an option to do what you want by parsing our sidekiq_queues.yml and passing all of those queues, minus the excluded ones, to the native Sidekiq CLI (our sidekiq-cluster script is a wrapper around that anyway).
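Roughly, the wrapper could do something like the sketch below. The option handling is hypothetical, and it assumes a GitLab-style sidekiq_queues.yml with a `:queues:` list of `[name, weight]` pairs; the real sidekiq-cluster script builds its command differently:

```ruby
require 'yaml'

# Hypothetical "--except" handling: read every known queue from
# sidekiq_queues.yml, drop the excluded ones, and hand the rest to the
# plain Sidekiq CLI via repeated -q flags.
excluded = ARGV # e.g. ['post_receive', 'merge']

config = YAML.load_file('config/sidekiq_queues.yml')
all_queues = config[:queues].map { |name, _weight| name.to_s }

remaining  = all_queues - excluded
queue_args = remaining.flat_map { |queue| ['-q', queue] }

exec('bundle', 'exec', 'sidekiq', *queue_args)
```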
I also wonder if we should define these groups of queues in the application itself, or would you like to be able to tweak them initially? I'm thinking about the case where we add a new queue which needs to be treated the same as an existing queue, for instance.
> Is it by passing a single group of queues to bin/sidekiq-cluster, dependent on the host? And then you'd have another host (or set of hosts) which do 'everything else'?
Yes, that's basically the idea, but the "everything else" is actually everything.
What we are doing is having a set of nodes for pure sidekiq, and sidekiq-cluster on a different set of nodes where we control which queues are realtime.
> or would you like to be able to tweak them initially?
They change a lot from deployment to deployment, so I'd rather keep control of the queues.