Configure repmgr, repmgrd, and pgbouncer from omnibus in prod
gitlab-ee 9.5 is the first version that ships the full failover stack we are already using in staging (https://gitlab.com/gitlab-com/infrastructure/issues/2422).
The majority of the changes (repmgr(d) settings, pgbouncer configuration) can be done during normal operation and need no downtime.
EXCEPTION:
In order for the replication side to work, we need to load a shared library at startup:
"shared_preload_libraries": "pg_stat_statements,repmgr_funcs"
This can only be done at server start, as per the documentation:
Those libraries must be loaded at server start through this parameter.
For the secondaries this has to be done ahead of time and can be done as follows without downtime:
- db4 (--> is being replaced by postgres-01 and postgres-02)
- db3 (will become new master)

- remove them from the load balancing configuration in chef: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gitlab-base.json#L256
- wait until the change has been propagated
- upgrade the package to the latest omnibus version: 9.5.0-ee.0
- run gitlab-ctl reconfigure
- add repmgr_funcs to shared_preload_libraries
- run gitlab-ctl reconfigure
- ensure that the secondary is syncing again
- add them to the load balancing configuration again
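
Before a secondary goes back into load balancing, it may be worth confirming that the library really was picked up. The following is a minimal bash sketch, not part of the rollout tooling: `preload_has_lib` is a hypothetical helper name, and the usage comment assumes the omnibus `gitlab-psql` wrapper and the default database name `gitlabhq_production`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) if library $2 is present in the
# comma-separated shared_preload_libraries value $1.
preload_has_lib() {
  local list="${1// /}"   # drop any spaces around the commas
  local lib="$2"
  [[ ",${list}," == *",${lib},"* ]]
}

# Usage on a secondary (assumes gitlab-psql and the default database name):
#   libs=$(gitlab-psql -d gitlabhq_production -At -c 'SHOW shared_preload_libraries;')
#   preload_has_lib "$libs" repmgr_funcs && echo "repmgr_funcs is loaded"
```

`SHOW shared_preload_libraries;` reports the value the running server was started with, so it reflects whether the library is actually loaded rather than what is written in the configuration file.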
Since we have to remove the primary because of its disk configuration (https://gitlab.com/gitlab-com/infrastructure/issues/2168), we will not activate this setting on the primary, but fail over to the secondary db3 instead. This will cause a very short downtime.
- stop all sidekiq/sidekiq-cluster nodes to minimize db load
- pause the connections on pgbouncer
- verify primary/secondary log positions
- stop the primary db
- promote db3 to primary
- (change db4 to follow db3) --> ensure that postgres-01 and postgres-02 are still following
- reconfigure pgbouncer on db1 to connect to db3
- resume connections
- start all sidekiq/sidekiq-cluster nodes
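
For the "verify primary/secondary log positions" step, the two positions can be compared numerically. A PostgreSQL log position has the form `HIGH/LOW` in hexadecimal, and the byte offset is `HIGH * 2^32 + LOW`. The following is a minimal bash sketch: `lsn_to_bytes` and `lsn_lag_bytes` are hypothetical helper names, and the usage comment assumes PostgreSQL 9.6 function names (renamed in PostgreSQL 10), the omnibus `gitlab-psql` wrapper, and the default database name `gitlabhq_production`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a log position such as "16/B374D848"
# into an absolute byte offset.
lsn_to_bytes() {
  local hi="${1%%/*}"   # hex digits before the slash
  local lo="${1##*/}"   # hex digits after the slash
  echo $(( (16#$hi << 32) + 16#$lo ))
}

# Hypothetical helper: print how many bytes position $2 (secondary)
# is behind position $1 (primary). 0 means both are at the same position.
lsn_lag_bytes() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Usage (assumptions: PostgreSQL 9.6 function names, gitlab-psql wrapper):
#   on the primary:
#     gitlab-psql -d gitlabhq_production -At -c 'SELECT pg_current_xlog_location();'
#   on db3:
#     gitlab-psql -d gitlabhq_production -At -c 'SELECT pg_last_xlog_replay_location();'
#   then:
#     lsn_lag_bytes "<primary position>" "<db3 position>"
```

With the connections paused and sidekiq stopped, the lag should settle at 0 before the primary is stopped and db3 is promoted.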