As a trial by fire, lets try out the steps here: https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1807#note_31216296 on staging to get a first impression of what HA postgres could look like in the future. This is a WIP, so we will be doing it manually (as discussed with @pcarranza, @ibaum and @marin) just so we can get a feeling for our solution, its limitations and possible effort estimations.
Designs
Child items
...
Show closed items
Linked items
0
Link issues together to show that they're related or that one is blocking others.
Learn more.
The attributes for the standby servers should contain all the attributes that are needed. Currently, it just contains the bare minimum that postgresql will need to be able to communicate with the primary. But a gitlab-ctl reconfigure would blow that away and break the standby.
The build is supposed to symlink the repmgr binaries from the postgresql install directory, to /opt/gitlab/embedded/bin, but this didn't appear to work.
The repmgr command for registering the master generates a warning about transaction already in progress. This is only a warning, but it gives no other indication that the command worked. So the documentation should warn about this.
We found that if streaming replication is already up and running, you do not need to run the standby clone step. You should be able to perform the remaining steps and get a working repmgr cluster. This really simplifies adding repmgr to an existing cluster.
Further note, the pg_hba.conf of the node that is already synced will need to be updated. But if we have that automated like we should in omnibus-gitlab, that will not be an issue.
We were able to successfully add the 2 nodes via a complete sync and the thrid node by adapting its pg_hba.conf and otherwise treating it as we did the others.
we followed the above scenario but ran into some issues:
db stop was not an issue
promoting the secondary worked as well
Staging secondary-db root@db4.staging.gitlab.com:~# su - gitlab-psql -c '/opt/gitlab/embedded/bin/repmgr -f /etc/gitlab/repmgr.conf standby promote'ERROR: connection to database failed: could not connect to server: Connection refused Is the server running on host "10.123.2.4" and accepting TCP/IP connections on port 5432?NOTICE: promoting standbyNOTICE: promoting server using '/opt/gitlab/embedded/bin/pg_ctl -D /var/opt/gitlab/postgresql/data promote'server promotingNOTICE: STANDBY PROMOTE successful
when telling the secondaries to follow we had some issues:
the initial follow command:
su - gitlab-psql -c '/opt/gitlab/embedded/bin/repmgr -h 10.123.2.5 -U repmgr -d repmgr -D /var/opt/gitlab/postgresql/data/ -f /etc/gitlab/repmgr.conf standby follow'NOTICE: restarting server using '/opt/gitlab/embedded/bin/pg_ctl -w -D /var/opt/gitlab/postgresql/data/ -m fast restart'waiting for server to shut down............................................................... failedpg_ctl: server does not shut downERROR: unable to restart server
failed while trying to shut the db down.
we manually shutdown the db and executed the same command, which worked fine:
Staging PRIMARY-DB root@db3.staging.gitlab.com:~# su - gitlab-psql -c '/opt/gitlab/embedded/bin/repmgr -h 10.123.2.5 -U repmgr -d repmgr -D /var/opt/gitlab/postgresql/data/ -f /etc/gitlab/repmgr.conf standby follow'NOTICE: restarting server using '/opt/gitlab/embedded/bin/pg_ctl -w -D /var/opt/gitlab/postgresql/data/ -m fast restart'pg_ctl: PID file "/var/opt/gitlab/postgresql/data/postmaster.pid" does not existIs server running?starting server anywaywaiting for server to start....LOG: database system was shut down at 2017-06-07 12:49:49 GMTLOG: entering standby modeFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upFATAL: the database system is starting upLOG: consistent recovery state reached at 347/4C0000D0LOG: invalid record length at 347/4C0000D0: wanted 24, got 0FATAL: the database system is starting upLOG: database system is ready to accept read only connectionsLOG: fetching timeline history file for timeline 2 from primary serverLOG: started streaming WAL from primary at 347/4C000000 on timeline 1LOG: replication terminated by primary serverDETAIL: End of WAL reached on timeline 1 at 347/4C0000D0.LOG: new target timeline is 2LOG: restarted WAL streaming at 347/4C000000 on timeline 2LOG: redo starts at 347/4C0000D0 doneserver startedNOTICE: STANDBY FOLLOW successful
however we ran into errors popping up in the logs which @ibaum was taking a look at:
ERROR: relation "pg_stat_statements" does not exist at character 502STATEMENT: SELECT pg_authid.rolname as user, pg_database.datname as database, pg_stat_statements.queryid, left(pg_stat_statements.query,250) as query, pg_stat_statements.calls, pg_stat_statements.total_time as time_milliseconds, pg_stat_statements.rows, pg_stat_statements.shared_blks_hit, pg_stat_statements.shared_blks_read, pg_stat_statements.shared_blks_dirtied, pg_stat_statements.shared_blks_written, pg_stat_statements.temp_blks_read, pg_stat_statements.temp_blks_written FROM pg_stat_statements JOIN pg_authid ON pg_authid.oid = pg_stat_statements.userid JOIN pg_database ON pg_database.oid = pg_stat_statements.dbidERROR: permission denied for relation pg_stat_replicationSTATEMENT: SELECT *, pg_current_xlog_location(), pg_xlog_location_diff(pg_current_xlog_location(), replay_location)::float FROM pg_stat_replicationERROR: relation "pg_stat_statements" does not exist at character 502STATEMENT: SELECT pg_authid.rolname as user, pg_database.datname as database, pg_stat_statements.queryid, left(pg_stat_statements.query,250) as query, pg_stat_statements.calls, pg_stat_statements.total_time as time_milliseconds, pg_stat_statements.rows, pg_stat_statements.shared_blks_hit, pg_stat_statements.shared_blks_read, pg_stat_statements.shared_blks_dirtied, pg_stat_statements.shared_blks_written, pg_stat_statements.temp_blks_read, pg_stat_statements.temp_blks_written FROM pg_stat_statements JOIN pg_authid ON pg_authid.oid = pg_stat_statements.userid JOIN pg_database ON pg_database.oid = pg_stat_statements.dbidERROR: permission denied for relation pg_stat_replicationSTATEMENT: SELECT *, pg_current_xlog_location(), pg_xlog_location_diff(pg_current_xlog_location(), replay_location)::float FROM pg_stat_replicationERROR: permission denied for relation pg_stat_replicationSTATEMENT: SELECT *, pg_current_xlog_location(), pg_xlog_location_diff(pg_current_xlog_location(), replay_location)::float FROM pg_stat_replication
ERROR: relation "pg_stat_statements" does not exist at character 502
This seem to be due to the pg_stat_statements module not being installed. This is not a module we install in a stock GitLab install. The repmgr docs don't specify that it should be installed. I also don't see any code in repmgr referencing that, so we would need to trace down what actually is trying to see that.
ERROR: permission denied for relation pg_stat_replication
On a test instance, with a bare bones user, or the repmgr user, I was able to run this query. So I'm still unsure as to what is getting that error.
We would like to failover again, to verify that those errors are transient, after which we can close this particular issue.
Next step would be to test the ability of repmgrd (the automated failover mechanism) but since @ibaum is out till monday, I am working on some other issues.
@marin that's from the Build perspective, but I assume we can do something from the infrastructure perspective to get us to a better place, or I may be wrong and we need to wait on build