More reliable Sidekiq queue
Sidekiq uses BRPOP
command to pop a job off the queue in Redis. It means that job is removed from the queue, usually, that's not a problem because if our job raises an error the Sidekiq puts it back with "failed" status(to another queue) and it will be retried later. But it become a problem when we kill sidekiq or when it segfaults or crashes. In this case we lose that job forever.
This made me to take a look at how we kill our sidekiq processes. We use https://gitlab.com/gitlab-org/gitlab-ce/blob/master/lib/gitlab/sidekiq_middleware/memory_killer.rb for that. There is a variable SIDEKIQ_MEMORY_KILLER_SHUTDOWN_WAIT
which means how long we wait from stopping accepting new jobs until we kill process. So we give a job at maximum 30 seconds to finish its stuff. After this we will kill that job and we'll never retry it (see above the reason). I think there is no reason to be so aggressive because sometimes our jobs exceed that time http://performance.gitlab.net/dashboard/db/sidekiq-workers. I propose to set it to 90 seconds.
The second optimization would be increase the value of SIDEKIQ_MEMORY_KILLER_GRACE_TIME
, which now is 15 minutes. That means that we wait 15 minutes after we write a warning to the log this thread will shut down PID #{Process.pid} - Worker #{worker.class} - JID-# in #{GRACE_TIME} seconds
. Basically, with current implementation, I don't see any reason to keep this fat process running. For now I propose to set it to several seconds. Later, we can improve it by adding one more get_rss
call after GRACE_TIME to make sure that memory was flooded by memory leaks, not by one fat job that can free resources afterwords. That will make our shots more accurate. But I don't think the last one will have practical benefits in real life.
Summary
Now we have following picture: process exceeds memory limit -> we write warning to the log -> wait 15 minutes -> stop accepting new jobs -> wait 30 seconds -> shutting everything down.
I propose to set SHUTDOWN_WAIT to 90 seconds (default 30) and GRACE_TIME to 10 seconds(now 15 minutes). What to expect?
- Decreasing memory consumption by not keeping fat processes 15 minutes before killing.
- More reliable queue by letting job more time to finish its stuff.
It should close issue https://gitlab.com/gitlab-org/gitlab-ce/issues/23646. And I'm also sure that everyone saw lost jobs, at least I saw many times.
/cc @jacobvosmaer-gitlab @pcarranza @jnijhof