As a first step before we can execute a fire drill, we need to understand how the failover mechanism is set up in production and replicate it in staging.
For this we need:
- Two databases in staging, one acting as the primary and the other as a secondary following the primary (a sketch of this setup follows below).
- A corosync setup so we can trigger a failover.
- All of this properly documented.
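A minimal sketch of standing up the second staging database as a streaming replica, assuming an omnibus-style install; the hostnames, replication user, and paths below are placeholders, not our actual values:

```sh
# On the primary (staging-db1 is a placeholder name), enable streaming replication:
#   postgresql.conf:  wal_level = replica        # 'hot_standby' on 9.5 and older
#                     max_wal_senders = 5
#   pg_hba.conf:      host  replication  replicator  <secondary-ip>/32  md5

# On the secondary: take a base backup from the primary and let pg_basebackup
# write a recovery.conf pointing back at it (-R), streaming WAL while it copies.
sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_basebackup \
  -h staging-db1 -U replicator \
  -D /var/opt/gitlab/postgresql/data \
  -X stream -R -P

# Start postgres on the secondary and confirm it is actually in recovery,
# i.e. following the primary:
sudo gitlab-ctl start postgresql
sudo gitlab-psql -c "SELECT pg_is_in_recovery();"
```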
With this, we can perform drills in staging to get comfortable with the setup. Then we can talk about performing a failover in production.
Fire drill postgres failover and replication recovery
With all the changes that have happened to the staging and production clusters around load balancing, I think we need to fire drill performing a failover, first in staging and then in production once we are confident that we will survive it.
I think this is critical because:
- The infrastructure has changed a lot.
- Our runbooks for recovering replication are outdated.
- We have no idea if corosync is actually working as expected.
- I would not like to discover that this doesn't work in the middle of a production incident.
We can use this opportunity to:
- update our processes and try to automate recovery behind a manual trigger
- see how it behaves from the monitoring perspective, and run it in staging while running a siege so we can see what to expect for customers (see the sketch below)
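For the "running a siege" part, something as simple as a siege run against the staging URL while the failover is triggered would give us a baseline; the URL, concurrency, and duration below are placeholder values:

```sh
# Hypothetical siege run during the staging drill: 25 concurrent users for
# 10 minutes, with results logged so they can be compared against the
# monitoring graphs afterwards.
siege -c 25 -t 10M --log=/tmp/siege-failover-drill.log https://staging.gitlab.com/
```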
So, @yorickpeterse when can we schedule this drill to happen?
When we have the load balancers for the secondaries in place. Doing a failover before that is possible, but requires downtime and manual updating of the secondaries as configured in the application.
ah, so does this mean that right now we don't have a failover mechanism that actually works?
Technically it works, but the application will "lose" a secondary in the process because it will be promoted to the primary. As far as I can tell, outside of that it should still work.
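For reference, the promotion itself is PostgreSQL's standard mechanism, whatever ends up driving it (corosync or a human). A sketch of triggering it by hand, assuming an omnibus-style layout; the paths are assumptions:

```sh
# On the secondary that should become the new primary: pg_ctl promote tells a
# standby in recovery to stop replaying WAL and start accepting writes.
sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_ctl promote \
  -D /var/opt/gitlab/postgresql/data

# Verify it has left recovery mode (should now return 'f'):
sudo gitlab-psql -c "SELECT pg_is_in_recovery();"
```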
@yorickpeterse can we get to that meeting with a reverse-engineered explanation of what our setup is? I think it's buried somewhere in the cookbooks, but I'm not confident it is properly explained/documented anywhere.
@ernstvn this has to happen ASAP, I'm just not sure we have the team bandwidth to take it on this week. I would also like to start on the deprecation path for the old Runners ASAP.
@ernstvn #1330 (closed) requires that we add support for configuring extra details in the application. This in turn is a new feature, which we can't release until 9.1, so I don't see that happening this week.
> can we get to that meeting with a reverse-engineered explanation of what our setup is? I think it's buried somewhere in the cookbooks, but I'm not confident it is properly explained/documented anywhere.
@pcarranza reasonably pointed out that this issue should be resolved in the course of discussing HA / failover practices with @northrup and @sfrost. Yes?
@ernstvn Steps 1 to 3 would need to be done by a production engineer. Testing and such probably won't require a production engineer (at least not for staging).
@jtevnan let's reduce the scope of this issue to only replicate our failover mechanism in staging.
Does this make sense?
username-removed-274314 changed title from "Fire drill postgres failover and replication recovery" to "Reverse engineer database failover and reapply to staging"
The corosync setup is non-existent and has to be built from scratch; at the moment we don't really have a db cluster at all. I am creating a corosync+pacemaker cookbook for this.
- DBProdLB
  - IP: 10.46.2.19
  - Backend pools
    - db1 (10.46.2.4)
  - Health check
    - every 5 sec check tcp port 5432
  - accept ports:
    - 6432 (pgbouncer) and forward to 6432
    - 5432 (native postgres) and forward to 5432
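A quick way to sanity-check that the frontend actually forwards both ports the way the pool above describes, run from a host that can reach 10.46.2.19 (pg_isready and nc are assumed to be available there):

```sh
# TCP-level checks, roughly what the 5-second health probe does:
nc -z -w 3 10.46.2.19 5432 && echo "5432 reachable through DBProdLB"
nc -z -w 3 10.46.2.19 6432 && echo "6432 (pgbouncer) reachable through DBProdLB"

# Protocol-level checks: pg_isready speaks enough of the postgres wire
# protocol to confirm a backend answers, without needing credentials.
pg_isready -h 10.46.2.19 -p 5432
pg_isready -h 10.46.2.19 -p 6432
```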
Corosync side:
- start action:
  - gitlab_pgsql_monitor
  - gitlab-ctl start
  - open port 5432
- stop action:
  - close port 5432
- monitor action:
  - verify that port 5432 is open
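In pacemaker terms those three actions map onto a resource agent's start/stop/monitor operations. A hedged sketch of the crm configuration such an agent could hang off; the agent name ocf:gitlab:gitlab_pgsql is hypothetical, standing in for whatever the corosync+pacemaker cookbook ends up installing:

```sh
# Hypothetical pacemaker primitive wrapping the start/stop/monitor actions above.
sudo crm configure primitive p_gitlab_pgsql ocf:gitlab:gitlab_pgsql \
  op start   timeout=120s \
  op stop    timeout=120s \
  op monitor interval=15s timeout=30s

# One-shot view of what pacemaker currently thinks of the cluster and resource:
sudo crm_mon -1
```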
Things we learned from this issue: we know HOW the current failover works, and we have also cleaned up the corosync/pacemaker side of things. However, this stuff will never be packaged and shipped in omnibus, and because the current state of production is broken we will have to rebuild the HA setup anyway.
With these two givens and the knowledge that we should work closely with the build team on this, we are opting to close this issue and move forward on the other one.
@jtevnan So does that mean we currently can't do failovers in production? If so, how long do you estimate it will take to get a working failover solution on GitLab.com?
There is currently no great way to do failovers in production. Pacemaker/corosync do not work in production as intended because each db shows up as its own cluster and the Azure load balancer is only "aware" of one of them (db1). Actions: