configure repmgr, repmgrd, pgbouncer from omnibus in prod

mentioned in issue #2171

changed the description

mentioned in issue #2583 (closed)

mentioned in issue #2422 (closed)

The change document: https://gitlab.com/gitlab-com/infrastructure/issues/2583

changed milestone to %WoW ending 2017-08-29

changed milestone to %WoW ending 2017-08-22

changed the description

Before "stop the primary db" I think you want to insert a step like:

On master: SELECT pg_current_xlog_location();
On slave: SELECT pg_last_xlog_replay_location();

They should be very close or even the same. You could probably get by with just running pg_last_xact_replay_timestamp() on the slave and comparing it with the current time.

Then after shutting down you should check pg_last_xlog_replay_location() on both slaves and check that they're either the same or that db3 (the new master) is ahead of db4.
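To make "very close or even the same" concrete, here is a minimal sketch of how the two values could be compared. It assumes the locations come back in the usual hex `high/low` text form (for example `16/B374D848`); the helper names are made up for illustration and are not part of repmgr or omnibus.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an xlog location such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.strip().split("/")
    # The part before the slash is the upper 32 bits, the part after it the lower 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def replay_lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby still has to replay (0 means fully caught up)."""
    return lsn_to_int(primary_lsn) - lsn_to_int(standby_lsn)


if __name__ == "__main__":
    # Hypothetical values, pasted in by hand from the two SELECTs above.
    print(replay_lag_bytes("16/B374D848", "16/B374D840"))
```

Comparing the raw strings is not reliable (`9/0` sorts after `10/0` as text), which is why the sketch converts to integers first.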

@_stark is this what we are looking for? https://prometheus.gitlab.com/graph?g0.range_input=1h&g0.expr=pg_replication_lag&g0.tab=1

or https://performance.gitlab.net/dashboard/db/postgres-stats?refresh=5m&panelId=11&fullscreen&orgId=1

or https://performance.gitlab.net/dashboard/db/postgres-stats?refresh=5m&panelId=16&fullscreen&orgId=1

changed the description

They are, and we should probably check those stats before initiating the entire process.

But I think we should go to the source at the time of the actual failover, especially for the second check before promoting db3.

If db3 is behind db4 then we could promote db4 instead, though that seems unlikely: both should have no trouble catching up completely with an idle primary.

repmgr does automatically use pg_rewind to resync the new standbys if the new primary is behind the other standbys, but that's an added complication and it implies some data loss, so it would be better to avoid it entirely.

@_stark we are not using repmgr for this failover since we need the shared_library to be loaded.

I added the step "verify primary/secondary log positions" after pausing pgbouncer, so no more writes should be made and pg_current_xlog_location and pg_last_xlog_replay_location should be the same. Correct?

There are two separate checks here.

One is that the lag is small after pausing pgbouncer. There could be things like autovacuum running, so it might not be 0 and it might even start going up, but as long as it is small enough that the replay is caught up to where you paused pgbouncer it should be fine. For this the prometheus stat might suffice.

The other is after shutting down the master: compare the two (or three) standbys to make sure they're either all the same or that we're failing over to the one that is furthest ahead. I would expect them to all be the same, but it doesn't matter as long as we're failing over to the one in the lead. This is the one where it's critical to check the exact number from pg_last_xlog_replay_location().
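For the second check, a rough sketch of the comparison could look like the following. The host names db3 and db4 are from this thread; the function names and the sample positions are made up, and the values are assumed to be collected by hand from pg_last_xlog_replay_location() on each standby.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an xlog location such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def standbys_in_lead(replay_positions: dict) -> list:
    """Return the standbys (sorted by name) that have replayed the most WAL."""
    parsed = {name: lsn_to_int(lsn) for name, lsn in replay_positions.items()}
    furthest = max(parsed.values())
    return sorted(name for name, pos in parsed.items() if pos == furthest)


def safe_to_promote(candidate: str, replay_positions: dict) -> bool:
    """True if promoting the candidate would not leave another standby ahead of it."""
    return candidate in standbys_in_lead(replay_positions)


if __name__ == "__main__":
    # Hypothetical positions read after the old master has been shut down.
    positions = {"db3": "16/B374D848", "db4": "16/B374D848"}
    print(standbys_in_lead(positions), safe_to_promote("db3", positions))
```

If `safe_to_promote("db3", ...)` came back false, that would be the case where promoting db4 instead avoids the pg_rewind path.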

/cc @ibaum

changed milestone to %WoW ending 2017-08-29

mentioned in issue #2597 (closed)

changed the description

caveats:

pgbouncer is not 100% correct after coming out of the package:

/var/opt/gitlab/pgbouncer/databases.ini needs to be changed so that the sidekiq entry also has the pool size constraint:

gitlabhq_production_sidekiq = host=127.0.0.1 port=5432 auth_user=XXX pool_size=150 dbname=gitlabhq_production

and it is still unclear if the pg_auth entries here: /var/opt/gitlab/pgbouncer/pg_auth need the md5 prefix or not
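A quick way to spot entries that lack the constraint might be something like this sketch. It assumes the simple `name = key=value key=value ...` layout shown above, with no quoted values containing spaces; the function name is hypothetical.

```python
def entries_missing_pool_size(ini_text: str) -> list:
    """Return the database entry names whose connection string has no pool_size."""
    missing = []
    for raw_line in ini_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and section headers such as [databases].
        if not line or line.startswith((";", "#", "[")):
            continue
        name, _, connstr = line.partition("=")
        # Naive split on whitespace: good enough for the unquoted entries shown here.
        options = dict(tok.split("=", 1) for tok in connstr.split() if "=" in tok)
        if "pool_size" not in options:
            missing.append(name.strip())
    return missing


if __name__ == "__main__":
    with open("/var/opt/gitlab/pgbouncer/databases.ini") as handle:
        print(entries_missing_pool_size(handle.read()))
```

An empty list would mean every entry, including the sidekiq one, carries a pool_size.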

pgbouncer is not 100% correct after coming out of the package

gitlab-org/omnibus-gitlab#2703 (closed) will resolve this

and it is still unclear if the pg_auth entries here: /var/opt/gitlab/pgbouncer/pg_auth need the md5 prefix or not

If they don't have the prefix, then they are passwords stored in plaintext. I think we should be using md5.
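For reference, a PostgreSQL-style md5 secret is the literal string `md5` followed by the hex MD5 digest of the password concatenated with the user name. A minimal sketch of producing one entry, assuming the usual pgbouncer auth file layout of two double-quoted fields per line (the function names are made up):

```python
import hashlib


def pg_md5_secret(username: str, password: str) -> str:
    """Build the 'md5' + md5(password + username) form that PostgreSQL stores."""
    digest = hashlib.md5((password + username).encode("utf-8")).hexdigest()
    return "md5" + digest


def auth_file_line(username: str, password: str) -> str:
    """Format one auth file line: "username" "md5<hash>"."""
    return '"{}" "{}"'.format(username, pg_md5_secret(username, password))


if __name__ == "__main__":
    # Placeholder credentials for illustration only.
    print(auth_file_line("some_user", "some_password"))
```

The resulting secret is always 35 characters long (3 for the prefix plus 32 hex digits), which makes it easy to tell apart from a plaintext entry.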

marked the checklist item remove them from the load balancing configuration in chef: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/gitlab-base.json#L256 as completed

marked the checklist item wait until the change has been propagated as completed

marked the checklist item upgrade the package to the latest omnibus version: 9.5.0-ee.0 as completed

marked the checklist item run gitlab-ctl reconfigure as completed

marked the checklist item add the repmgr_funcs to shared_preload_libraries as completed

marked the checklist item run gitlab-ctl reconfigure as completed

marked the checklist item ensure that the secondary is syncing again. as completed

marked the checklist item add to the load balancing configuration again as completed

Some findings from testing with replication slots:

omnibus restarts the master db when the pg_hba.conf is changed: https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2713
this is necessary regardless since the pg_hba config is needed for this trust:

     # repmgr 
    +local replication gitlab_repmgr  trust
    +host replication gitlab_repmgr 127.0.0.1/32 trust 
    +host replication gitlab_repmgr 10.129.1.0/24 trust
    +local gitlab_repmgr gitlab_repmgr  trust
    +host gitlab_repmgr gitlab_repmgr 127.0.0.1/32 trust
    +host gitlab_repmgr gitlab_repmgr 10.129.1.0/24 trust

slaves are restarted once they are set to follow the master (repmgr does this even if they already follow it)
we can get around this by removing the nodes from the read load balancer config and restarting them when they have no connections
but this is very time consuming

@ibaum will test out a password based approach for avoiding downtime on the master: this would involve setting up a pg_pass file for the repmgr user: https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2714
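For context, a password file entry has the form `hostname:port:database:username:password`, with any literal `:` or `\` inside a field escaped by a backslash. A small sketch of generating such a line follows; the host address and password are placeholders, and how omnibus would actually manage the file is left to the linked issue.

```python
def escape_pgpass_field(value: str) -> str:
    """Escape backslashes first, then colons, as the password file format requires."""
    return value.replace("\\", "\\\\").replace(":", "\\:")


def pgpass_line(host: str, port: int, database: str, user: str, password: str) -> str:
    """Build one hostname:port:database:username:password entry."""
    fields = [host, str(port), database, user, password]
    return ":".join(escape_pgpass_field(field) for field in fields)


if __name__ == "__main__":
    # Placeholder host inside the 10.129.1.0/24 range used above; placeholder password.
    print(pgpass_line("10.129.1.10", 5432, "gitlab_repmgr", "gitlab_repmgr", "example-password"))
```

libpq ignores the file unless group and world access is disabled (mode 0600), so whatever writes it would also need to set the permissions.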

added blocked label

removed critical label

removed requires downtime label

To enable this in prod without downtime, the previous issues that @ibaum is working on have to be solved.

changed milestone to %WoW ending 2017-09-19

added moved 1 label

mentioned in issue #2765

mentioned in issue #2694 (closed)

assigned to @_stark

changed milestone to %WoW

I doubt this will happen in this WoW.

mentioned in issue #2547

marked this issue as related to #2547

changed milestone to %WoW ending 2017-10-03

added moved 2 moved 3 moved 4 labels

changed milestone to %WoW
