More reliable Sidekiq queue

We're killing Sidekiq processes approximately 500 times per day.

It's an interesting one to try

@pcarranza can we schedule an experiment?

@yorickpeterse would you like to take a byte on this one?

removed assignee

I am going to unassign this as I won't be able to work on this for a while, and I think it's better for another developer to take care of this.

The system we should probably go with is the one used by AWS SQS, and probably Sidekiq Enterprise:

Whenever you "pop" a job, you actually move it to another queue atomically. SQS calls these "in-flight" queues. For example, there's the "foo" queue and the "foo in-flight" queue. Redis supports this using https://redis.io/commands/rpoplpush. Once a job has been pushed to the in-flight queue you don't pop it; instead you somehow "check it in". Once you're done you check it out, then pop it.

A Sidekiq worker will run periodically and move any jobs that have been in-flight for a certain period back to their regular queue. In this setup care has to be taken to ensure that the in-flight timeout is greater than the longest job processing time, otherwise a job that is still running gets requeued and runs twice. An easy approach is to run the sweep once an hour, and try to ensure jobs never take longer than an hour.
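To make the flow above concrete, here is a minimal in-memory sketch. It is not an implementation proposal: plain Ruby arrays stand in for Redis lists, the class and method names (`ReliableQueue`, `reserve`, `ack`, `requeue_stale`) are hypothetical, and in Redis the `reserve` step would be a single atomic RPOPLPUSH rather than two array operations.

```ruby
# In-memory model of the "in-flight queue" pattern. Arrays stand in for
# Redis lists; all names here are hypothetical, chosen for illustration.
class ReliableQueue
  attr_reader :pending, :in_flight

  def initialize(timeout:)
    @timeout = timeout   # seconds a job may stay in flight
    @pending = []        # the regular queue, e.g. "foo"
    @in_flight = []      # the in-flight queue, as [job, reserved_at] pairs
  end

  def push(job)
    @pending.unshift(job) # like LPUSH: newest job at the head
  end

  # Move the oldest pending job to the in-flight queue instead of dropping it.
  def reserve(now)
    return nil if @pending.empty?

    job = @pending.pop    # like RPOP: oldest job at the tail
    @in_flight.unshift([job, now])
    job
  end

  # Called by the worker once the job has finished successfully.
  def ack(job)
    position = @in_flight.index { |(candidate, _)| candidate == job }
    @in_flight.delete_at(position) if position
  end

  # Periodic sweep: requeue anything in flight for longer than the timeout.
  def requeue_stale(now)
    stale, fresh = @in_flight.partition { |(_, at)| now - at > @timeout }
    @in_flight = fresh
    stale.each { |(job, _)| @pending.push(job) } # tail, so it is taken next
    stale.map(&:first)
  end
end

queue = ReliableQueue.new(timeout: 3600)
queue.push("job-1")
queue.push("job-2")

first = queue.reserve(0)    # worker A takes job-1 and finishes it
queue.ack(first)
queue.reserve(10)           # worker B takes job-2, then gets SIGKILLed

queue.requeue_stale(1800)   # too early: job-2 stays in flight
recovered = queue.requeue_stale(3700)
puts "recovered: #{recovered.inspect}, pending: #{queue.pending.inspect}"
```

The job taken by the killed worker is never acknowledged, so the sweep puts it back on the regular queue once the one-hour timeout has passed. A job that legitimately runs longer than the timeout would be requeued the same way, which is why the timeout has to exceed the longest processing time.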

Care should be taken to ensure that this setup still supports the use of other Sidekiq extensions such as limit-fetch. I also propose adding this to CE instead of EE, so all our users can benefit from an actually reliable Sidekiq.

Once this has been implemented we can simply start using SIGKILL to terminate Sidekiq, instead of having to rely on the unreliable SIGTERM.

added availability experiment labels

It's still a little early, as a full day hasn't passed yet, but after #2070 (closed) the rate seems to have dropped to ~200/day.

@vsizov can we review this in light of the fact that we split our sidekiq fleet by priority? Perhaps we could fiddle with settings to better fit each type.

changed the description

@omame Yes, I think we can adjust SHUTDOWN_WAIT and GRACE_TIME for every type separately, but I think it will make things complex with almost no benefit.

@vsizov Are you saying that we can close this issue for now?

@omame Nope, I didn't say that, I only repeated what I said in the description! We can only close it if we use Sidekiq Enterprise or any other reliable queue.

"We can only close it if we use Sidekiq Enterprise or any other reliable queue"

This sounds like a won't-fix to me.

Now I'm really tempted to just close this issue.

@vsizov what would be the actual action here? To use Sidekiq Enterprise?

I think the action would be to use, or implement something similar to, the sidekiq-reliable-fetch gem: https://github.com/TEA-ebook/sidekiq-reliable-fetch

I suspect we would have to eliminate Sidekiq Limit Fetch since only one middleware can be used.

@stanhu thanks, that makes sense.

Should we move this to the gitlab-ce issue tracker then?

Sorry, by "We can only" I meant the only alternative solution I know of. The main solution is in the description:

"I propose to set SHUTDOWN_WAIT to 90 seconds (default 30) and GRACE_TIME to 10 seconds (now 15 minutes)."

Ah, that makes more sense.

Should we have this setup in omnibus?

@pcarranza I created this task in "gitlab-com/infrastructure" because I would like to set these values as ENV variables to prove that it works better, and then we could change that at the code level. Sorry, I don't know how difficult that is to do on our infrastructure, but this was the point. See our previous discussion: https://gitlab.com/gitlab-com/infrastructure/issues/943#note_20739219

@omame

"Yes, I think we can adjust SHUTDOWN_WAIT and GRACE_TIME for every type separately, but I think it will make things complex with almost no benefit."

I assume this part confused you, @omame. Clarification: I think we don't have to have different values for every job type; it's enough to set them globally.

Ok, this makes more sense.

So what we would need to do is configure environment variables for the sidekiq executors such that:

SHUTDOWN_WAIT=90
GRACE_TIME=10

What we expect to see is sidekiq shutting down sooner when it gets the signal from the OOM killer.
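To put a rough number on "sooner", here is a small calculation. It assumes the memory killer first waits GRACE_TIME after a process goes over its memory limit and then waits SHUTDOWN_WAIT before the final signal; that ordering is an assumption about the memory killer, not something confirmed in this thread. The helper name `seconds_until_kill` is hypothetical, and the current values (900 and 30 seconds) are the ones quoted from the description.

```ruby
# Worst-case seconds between exceeding the memory limit and the final kill,
# assuming the killer waits GRACE_TIME and then SHUTDOWN_WAIT (assumed order).
def seconds_until_kill(grace_time:, shutdown_wait:)
  grace_time + shutdown_wait
end

current  = seconds_until_kill(grace_time: 900, shutdown_wait: 30)
proposed = seconds_until_kill(grace_time: 10, shutdown_wait: 90)
puts "current: #{current}s, proposed: #{proposed}s"
```

Under that assumption the proposed values shorten the window from 930 seconds to 100 seconds, while each running job gets 90 instead of 30 seconds to finish once shutdown starts.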

The only thing missing here is how we monitor this to see that it's actually working as expected.

Then the next step would be to make this the default in Omnibus.

Is this a correct statement/summary @vsizov?

@pcarranza Thanks for the good question. To monitor the changes I propose to use the link I provided earlier: https://gitlab.com/gitlab-com/infrastructure/issues/943#note_20721533. It shows the total OOM killer rate.

It would also be good to see the total or average memory consumption of the Sidekiq workers. I found the latter in fleet-overview, but it's empty; I'm not sure why.

The third parameter is the "amount of lost jobs", which we can't monitor at the moment, as it would require a specially crafted service that we don't have.

So we actually have two different experiments:

SHUTDOWN_WAIT

Set 90 seconds (currently 30)

SIDEKIQ_MEMORY_KILLER_SHUTDOWN_WAIT=90

No monitoring right now, but we have jobs that run for more than 30 seconds, so let's give them more time. (We had more of them at the time this issue was created; now this is a lot less important.)

GRACE_TIME

SIDEKIQ_MEMORY_KILLER_GRACE_TIME=10 (currently 900)

I propose to set it to 10 (maybe gradually, say 60 seconds as an intermediate value)

Monitoring: It should slightly increase the OOM killer rate, but memory consumption should decrease a lot. At the moment I don't see any working monitoring for Sidekiq memory, so we can't run this experiment.

See https://docs.gitlab.com/ce/administration/operations/sidekiq_memory_killer.html for details.
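For the gradual rollout mentioned above, a sketch of how a process might read the two variables could look like this. The helper `memory_killer_settings` and the `DEFAULTS` constant are hypothetical and not taken from the GitLab code base; the variable names and the fallback values (900 and 30 seconds) are the ones quoted in this thread.

```ruby
# Hypothetical helper: reads the two memory-killer settings from an
# environment-like hash, falling back to the current values quoted above.
DEFAULTS = {
  "SIDEKIQ_MEMORY_KILLER_GRACE_TIME" => 900,   # seconds (15 minutes)
  "SIDEKIQ_MEMORY_KILLER_SHUTDOWN_WAIT" => 30  # seconds
}.freeze

def memory_killer_settings(env = ENV)
  DEFAULTS.each_with_object({}) do |(name, default), settings|
    # Integer() raises on values such as "10s" instead of silently using 0.
    settings[name] = Integer(env.fetch(name, default))
  end
end

experiment = {
  "SIDEKIQ_MEMORY_KILLER_GRACE_TIME" => "60",  # intermediate step before 10
  "SIDEKIQ_MEMORY_KILLER_SHUTDOWN_WAIT" => "90"
}
puts memory_killer_settings(experiment).inspect
puts memory_killer_settings({}).inspect
```

Passing a plain hash instead of `ENV` keeps the sketch easy to try out; the second call shows the fallback to the current values when neither variable is set.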

Should this issue be re-opened?

reopened

@vsizov @pcarranza I'm scared of changing the grace time to 10 seconds, because of issues like https://gitlab.com/gitlab-org/gitlab-ee/issues/3347. The LDAP sync workers can take up a lot of memory, which is a bug, but until we have fixed that, it's preferable for customers if they go over the memory limit and still have 15 minutes to try to finish, than if they are hard-killed after 10 seconds.

@ayufan This issue was moved to https://gitlab.com/gitlab-org/gitlab-ce/issues/36791.

closed

assigned to @omame

changed milestone to %WoW ending 2017-09-12

Please don't reopen issues, just create new ones, else we can't really track work with how we manage milestones.

reopened

closed
