So I think the high level order should be something like:
- Enable repmgr, without repmgrd
- Enable consul with the postgresql service on the database nodes (service definition sketched below)
- Get the omnibus pgbouncer nodes working (config sketched below)
- Enable repmgrd
- Move the application to the new pgbouncer nodes
The last two steps should probably happen as close together as possible, and only the last one should carry any real potential for service interruption.
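For step 2, the consul side can start out as nothing more than a service definition with a health check on each database node. A minimal sketch, assuming a JSON service definition with a plain TCP check; the service name, port and interval are illustrative rather than what chef would actually render:

```json
{
  "service": {
    "name": "postgresql",
    "port": 5432,
    "check": {
      "tcp": "localhost:5432",
      "interval": "10s"
    }
  }
}
```

In practice the check would probably be a script that can tell master from standby, but that detail doesn't change the ordering above.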
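For step 3, "working" omnibus pgbouncer nodes amount to something equivalent to the pgbouncer.ini below. In omnibus this gets generated from gitlab.rb/chef rather than written by hand, and the database entry, paths and pool settings here are only illustrative:

```ini
; Rough equivalent of what the omnibus pgbouncer nodes would generate.
[databases]
; Points at whatever the current master is; on failover this entry is what changes.
gitlabhq_production = host=db1.example.com port=5432 auth_user=pgbouncer

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /var/opt/gitlab/pgbouncer/pg_auth
pool_mode = transaction
```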
Before that, there are a couple of things we should do in staging:

- Use md5 authentication for repmgr. @jtevnan had requested this, and I've tested it and it seems to work (#2714 (closed)). The alternative is trusting networks or addresses, which requires a full restart of postgresql to update. (Sketched below.)
- Switch to repmgr-managed replication slots. I believe these have been set up manually in the cluster; repmgr can handle creating them on the master and on failover. (Also sketched below.)
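To make both staging items concrete, the changes are roughly the pg_hba.conf entries and the repmgr.conf setting below. The user name, database name and address range are examples only, and in omnibus these lines would come out of gitlab.rb attributes rather than hand edits:

```conf
# pg_hba.conf -- md5 (password) auth for the repmgr user instead of trusting a network:
host    gitlab_repmgr   gitlab_repmgr   10.0.0.0/8    md5
host    replication     gitlab_repmgr   10.0.0.0/8    md5
# ...replacing entries of the form:
# host  gitlab_repmgr   gitlab_repmgr   10.0.0.0/8    trust

# repmgr.conf -- let repmgr create and manage the replication slots itself:
use_replication_slots=1
```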
@_stark or anyone else who's interested, let me know what you think of this. I'm free to chat to go over any questions/concerns/alternative plans.
Fleshing out the tail of the plan, with the application move as step 4 so that repmgrd only comes in once the old pgbouncer nodes are out of the picture:

5. Verify that the old pgbouncer nodes are idle and deactivate them.
6. Enable repmgrd with failover=manual (config sketched below).
7. After observing that repmgrd is stable in production, switch failover=manual to automatic failover.
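A minimal sketch of the repmgrd-relevant part of repmgr.conf for steps 6 and 7, assuming repmgr 3.x style configuration; node names, conninfo and paths are illustrative:

```ini
# repmgr.conf (illustrative) -- step 6: run repmgrd, but keep failover manual
cluster=gitlab_cluster
node=1
node_name=db1
conninfo='host=db1.example.com user=gitlab_repmgr dbname=gitlab_repmgr'
use_replication_slots=1

failover=manual                 # step 7 flips this to 'automatic'
promote_command='repmgr standby promote -f /etc/repmgr.conf'
follow_command='repmgr standby follow -f /etc/repmgr.conf'
```

With failover=manual, repmgrd monitors and records state but won't promote anything on its own, which is what makes step 6 safe to run well before step 7.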
During steps 1-3, a failover would be executed by running repmgr promote and repmgr follow, plus chef knife ssh commands (similar to how the current setup is arranged) to point clients at the new database address. The new pgbouncer nodes would automatically pick up the new database, but the clients would continue bypassing them.
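Roughly, that failover would look like the sketch below. The role names, knife search queries, client-side reconfigure step, and exact repmgr invocations (user, config path) are illustrative, not our actual chef roles or omnibus wrappers:

```shell
# 1. Promote the most up-to-date standby (run on that standby):
repmgr -f /etc/repmgr.conf standby promote

# 2. Re-point the remaining standbys at the new master:
knife ssh 'roles:db-standby' 'repmgr -f /etc/repmgr.conf standby follow'

# 3. After updating the database address in chef, push it to the clients that
#    still talk to postgres directly (during step 4, nodes already on the new
#    pgbouncers would be excluded from this query):
knife ssh 'roles:gitlab-app' 'sudo gitlab-ctl reconfigure'
```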
During step 4, a failover would work the same way, except that any clients already using the new pgbouncer nodes would presumably need to be excluded from the knife ssh commands.

From step 5 onwards, until repmgrd is activated, a failover would only require repmgr commands; no knife ssh commands should be needed.

After step 7 we might want to schedule a manual failover to be sure it actually works.
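If we want that scheduled failover to be a controlled switchover rather than pulling the plug on the master, newer repmgr versions can do it in one command. A sketch, assuming a repmgr version with switchover support; otherwise it's the same promote/follow sequence as above, just planned in advance:

```shell
# Controlled switchover: promote this standby and demote the old master.
# Run from the standby being promoted; the config path is illustrative.
repmgr -f /etc/repmgr.conf standby switchover
```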