Improve LDAP sync worker performance: memory usage and runtime
A customer with about 7500 LDAP users and about 1300 LDAP group links was seeing an issue similar to https://gitlab.com/gitlab-org/gitlab-ce/issues/35531: LdapAllGroupsSyncWorkers and LdapSyncWorkers would start, quickly hit the 1GB Sidekiq Memory Killer limit, still not be done after 15 minutes, get killed and requeued automatically, start again after Sidekiq restarted, hit the same memory limit, and so on, without ever being able to complete.
On top of that, new LdapSyncWorkers would be queued every night at 1:30am, and new LdapAllGroupsSyncWorkers as often as every hour, even though the earlier workers likely hadn't had a chance to finish yet. The customer in question ended up with 4 LdapAllGroupsSyncWorkers and 3 LdapSyncWorkers in this "infinite loop" situation.
The fact that these 7 workers were now concurrently hammering the LDAP server was of course not making the situation any better, because increased LDAP load means higher LDAP request times, which means slower workers, since the workers do a lot of LDAP requests.
All of the above meant that their Sidekiq workers were restarting every ~16 minutes with 7 out of the 25 available threads hogged by these LDAP workers, resulting in about 3500 jobs waiting in the queue.
We should:
- Make sure only one worker of each type is running at the same time, using a lease
- Get the memory usage of both workers down, so that the RSS limit isn't hit
- Get the runtime of both workers down, so that they finish within 15 minutes
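
To illustrate the lease idea from the first bullet, here is a minimal sketch. It is not the actual implementation: `InMemoryLease` and `LdapSyncWorkerSketch` are hypothetical names, and the lease is kept in process memory purely so the example runs on its own. A real version would presumably keep the lease in shared storage such as Redis so that it holds across Sidekiq processes. The 15 minute timeout is an assumption taken from the runtime goal above; it matters because a worker that gets killed by the memory killer may never release its lease explicitly.

```ruby
# Hypothetical in-memory stand-in for a shared (e.g. Redis-backed) lease.
class InMemoryLease
  @leases = {}        # key => expiry time
  @mutex = Mutex.new

  class << self
    # Returns true if the lease was obtained, false if it is still held.
    def try_obtain(key, timeout:, now: Time.now)
      @mutex.synchronize do
        expires_at = @leases[key]
        return false if expires_at && expires_at > now

        @leases[key] = now + timeout
        true
      end
    end

    # Gives the lease back early once the work is done.
    def release(key)
      @mutex.synchronize { @leases.delete(key) }
    end
  end
end

# Hypothetical worker: skips the run if another instance holds the lease.
class LdapSyncWorkerSketch
  LEASE_KEY = 'ldap_sync_worker'
  LEASE_TIMEOUT = 15 * 60 # seconds; assumed, so a killed worker cannot block forever

  def perform(now: Time.now)
    unless InMemoryLease.try_obtain(LEASE_KEY, timeout: LEASE_TIMEOUT, now: now)
      return :skipped
    end

    begin
      yield if block_given? # the actual sync work would go here
      :done
    ensure
      InMemoryLease.release(LEASE_KEY)
    end
  end
end
```

With something like this in place, a second worker of the same type that starts while the first is still running returns immediately instead of adding more load to the LDAP server.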
/cc @lbot @mydigitalself