So I think the high level order should be something like:
- Enable repmgr, without repmgrd
- Enable consul with the postgresql service on the database nodes (service definition sketched below)
- Get the omnibus pgbouncer nodes working (config sketched below)
- Enable repmgrd
- Move the application to the new pgbouncer nodes
The last two steps should probably happen as close together as possible, and only the last one should carry any real potential for service interruption.
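For step 2, the consul side can start out as nothing more than a service definition with a health check on each database node. A minimal sketch, assuming a JSON service definition with a plain TCP check; the service name, port and interval are illustrative rather than what chef would actually render:

```json
{
  "service": {
    "name": "postgresql",
    "port": 5432,
    "check": {
      "tcp": "localhost:5432",
      "interval": "10s"
    }
  }
}
```

In practice the check would probably be a script that can tell master from standby, but that detail doesn't change the ordering above.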
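For step 3, "working" omnibus pgbouncer nodes amount to something equivalent to the pgbouncer.ini below. In omnibus this gets generated from gitlab.rb/chef rather than written by hand, and the database entry, paths and pool settings here are only illustrative:

```ini
; Rough equivalent of what the omnibus pgbouncer nodes would generate.
[databases]
; Points at whatever the current master is; on failover this entry is what changes.
gitlabhq_production = host=db1.example.com port=5432 auth_user=pgbouncer

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /var/opt/gitlab/pgbouncer/pg_auth
pool_mode = transaction
```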
Before that, there are a couple of things we should do in staging:

- Use md5 authentication for repmgr. @jtevnan had requested this, and I've tested it and it seems to work (#2714 (closed)). The alternative is trusting networks or addresses, which requires a full restart of postgresql to update. (Sketched below.)
- Switch to repmgr-managed replication slots. I believe these have been set up manually in the cluster; repmgr can handle creating them on the master and on failover. (Also sketched below.)
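To make both staging items concrete, the changes are roughly the pg_hba.conf entries and the repmgr.conf setting below. The user name, database name and address range are examples only, and in omnibus these lines would come out of gitlab.rb attributes rather than hand edits:

```conf
# pg_hba.conf -- md5 (password) auth for the repmgr user instead of trusting a network:
host    gitlab_repmgr   gitlab_repmgr   10.0.0.0/8    md5
host    replication     gitlab_repmgr   10.0.0.0/8    md5
# ...replacing entries of the form:
# host  gitlab_repmgr   gitlab_repmgr   10.0.0.0/8    trust

# repmgr.conf -- let repmgr create and manage the replication slots itself:
use_replication_slots=1
```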
@_stark or anyone else who's interested, let me know what you think of this. I'm free to chat to go over any questions/concerns/alternative plans.
Fleshing out the tail of the plan, with the application move as step 4 so that repmgrd only comes in once the old pgbouncer nodes are out of the picture:

5. Verify that the old pgbouncer nodes are idle and deactivate them.
6. Enable repmgrd with failover=manual (config sketched below).
7. After observing that repmgrd is stable in production, switch failover=manual to automatic failover.
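A minimal sketch of the repmgrd-relevant part of repmgr.conf for steps 6 and 7, assuming repmgr 3.x style configuration; node names, conninfo and paths are illustrative:

```ini
# repmgr.conf (illustrative) -- step 6: run repmgrd, but keep failover manual
cluster=gitlab_cluster
node=1
node_name=db1
conninfo='host=db1.example.com user=gitlab_repmgr dbname=gitlab_repmgr'
use_replication_slots=1

failover=manual                 # step 7 flips this to 'automatic'
promote_command='repmgr standby promote -f /etc/repmgr.conf'
follow_command='repmgr standby follow -f /etc/repmgr.conf'
```

With failover=manual, repmgrd monitors and records state but won't promote anything on its own, which is what makes step 6 safe to run well before step 7.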
During steps 1-3, a failover would be executed by running repmgr promote and repmgr follow, plus chef knife ssh commands (similar to how the current setup is arranged) to point clients at the new database address. The new pgbouncer nodes would automatically pick up the new database, but the clients would continue bypassing them.
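Roughly, that failover would look like the sketch below. The role names, knife search queries, client-side reconfigure step, and exact repmgr invocations (user, config path) are illustrative, not our actual chef roles or omnibus wrappers:

```shell
# 1. Promote the most up-to-date standby (run on that standby):
repmgr -f /etc/repmgr.conf standby promote

# 2. Re-point the remaining standbys at the new master:
knife ssh 'roles:db-standby' 'repmgr -f /etc/repmgr.conf standby follow'

# 3. After updating the database address in chef, push it to the clients that
#    still talk to postgres directly (during step 4, nodes already on the new
#    pgbouncers would be excluded from this query):
knife ssh 'roles:gitlab-app' 'sudo gitlab-ctl reconfigure'
```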
During step 4, a failover would work the same way, except that any clients already using the new pgbouncer nodes would presumably need to be excluded from the knife ssh commands.

From step 5 onwards, until repmgrd is activated, a failover would only require repmgr commands; no knife ssh commands should be needed.

After step 7 we might want to schedule a manual failover to be sure it actually works.
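If we want that scheduled failover to be a controlled switchover rather than pulling the plug on the master, newer repmgr versions can do it in one command. A sketch, assuming a repmgr version with switchover support; otherwise it's the same promote/follow sequence as above, just planned in advance:

```shell
# Controlled switchover: promote this standby and demote the old master.
# Run from the standby being promoted; the config path is illustrative.
repmgr -f /etc/repmgr.conf standby switchover
```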