Sidekiq sometimes goes a bit too crazy and can consume all available connections, blocking Unicorns from doing anything. To remedy this we should configure pgbouncer to use a custom limit of connections for both Unicorn and Sidekiq.
From what I can tell we can do this as following:
We lower the default pool size to 100
We configure Sidekiq to use database gitlabhq_production_sidekiq
In Pgbouncer we map this database to gitlabhq_production
Because the pool size is per user + database this should lead to both Unicorn and Sidekiq being able to use up to 100 connections. If we want to give Sidekiq more connections we can configure this in the database mapping entry.
This however depends on Sidekiq applying the limits to the "source" database, and not the target database.
Designs
Child items
...
Show closed items
Linked items
0
Link issues together to show that they're related or that one is blocking others.
Learn more.
The most difficult part here will be figuring out how many connections we need for both Unicorn and Sidekiq.
For Unicorn this is fairly easy: we simply count the number of Unicorn processes we have, and we give it an equal amount of maximum connections. This allows all workers to run in parallel (in theory).
For Sidekiq it's a bit more tricky. Since Sidekiq is multi-threaded the number of theoretical concurrent connections can easily be in the hundreds, far more than PostgreSQL would allow (300). To get a random sample I ran this:
select count(*) from pg_stat_activity where application_name ilike '%sidekiq%';
This outputs roughly 60 most of the time. This means we should in theory be able to work with e.g. 100 connections, or maybe a little less. Worst case scenario some queries take a second or two longer in the event of a spike.
Unfortunately Omnibus does not support setting dbname and pool_size per database entry, so we'd have to also add that if we want to use this setup with omnibus pgbouncer.
For Unicorn we have 210 processes at the moment. I however strongly doubt we need that many connections since we use statement pooling (meaning a connection is available again after a query finishes running). Looking at our secondaries (only used by Unicorn) we never peak over 30 active connections in the past 24 hours. As such I think it's fairly safe to set the default pool size to 100, then give Sidekiq 150 connections (if we even need that many). This yields a total of 250, leaving 50 remaining connections for everything else (our PostgreSQL limit is 300).
We add the entry gitlabhq_production_sidekiq = host=127.0.0.1 port=5432 auth_user=pgbouncer pool_size=150 dbname=gitlabhq_production to /etc/pgbouncer/pgbouncer.ini
We update config/database.yml for all Sidekiq nodes so they use the database gitlabhq_production_sidekiq, we then reload them
We adjust default_pool_size to 100 in /etc/pgbouncer/pgbouncer.ini, then run /etc/init.d/pgbouncer reload
We HUP Unicorn for good measurement
The pgbouncer settings would have to be applied manually on db1, db3 and db4. None of this requires any downtime. Unfortunately I can't SSH into the Sidekiq nodes (even with the VPN enabled), so somebody else would have to make those changes.