As a first step before we can execute a fire drill, we need to understand how the failover mechanism is set up in production and replicate it in staging.
For this we need:
- Two databases in staging, one acting as the primary and the other as a secondary following the primary (a sketch of this setup follows below).
- A corosync setup so we can trigger a failover.
- All of this properly documented.
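A minimal sketch of standing up the second staging database as a streaming replica, assuming an omnibus-style install; the hostnames, replication user, and paths below are placeholders, not our actual values:

```sh
# On the primary (staging-db1 is a placeholder name), enable streaming replication:
#   postgresql.conf:  wal_level = replica        # 'hot_standby' on 9.5 and older
#                     max_wal_senders = 5
#   pg_hba.conf:      host  replication  replicator  <secondary-ip>/32  md5

# On the secondary: take a base backup from the primary and let pg_basebackup
# write a recovery.conf pointing back at it (-R), streaming WAL while it copies.
sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_basebackup \
  -h staging-db1 -U replicator \
  -D /var/opt/gitlab/postgresql/data \
  -X stream -R -P

# Start postgres on the secondary and confirm it is actually in recovery,
# i.e. following the primary:
sudo gitlab-ctl start postgresql
sudo gitlab-psql -c "SELECT pg_is_in_recovery();"
```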
With this, we can perform drills in staging to get comfortable with the setup. Then we can talk about performing a failover in production.
Fire drill postgres failover and replication recovery
With all the changes that have happened to the staging and production clusters around load balancing, I think we need to fire drill performing a failover, first in staging and then in production once we are confident that we will survive it.
I think this is critical because:
- The infrastructure has changed a lot.
- Our runbooks for recovering replication are outdated.
- We have no idea if corosync is actually working as expected.
- I would not like to discover that this doesn't work in the middle of a production incident.
We can use this opportunity to:
- update our processes and try to automate recovery behind a manual trigger
- see how it behaves from the monitoring perspective, and run it in staging while running a siege so we can see what to expect for customers (see the sketch below)
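For the "running a siege" part, something as simple as a siege run against the staging URL while the failover is triggered would give us a baseline; the URL, concurrency, and duration below are placeholder values:

```sh
# Hypothetical siege run during the staging drill: 25 concurrent users for
# 10 minutes, with results logged so they can be compared against the
# monitoring graphs afterwards.
siege -c 25 -t 10M --log=/tmp/siege-failover-drill.log https://staging.gitlab.com/
```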
So, @yorickpeterse when can we schedule this drill to happen?
When we have the load balancers for the secondaries in place. Doing a failover before that is possible, but requires downtime and manual updating of the secondaries as configured in the application.
ah, so does this mean that right now we don't have a failover mechanism that actually works?
Technically it works, but the application will "lose" a secondary in the process because it will be promoted to the primary. As far as I can tell, outside of that it should still work.
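For reference, the promotion itself is PostgreSQL's standard mechanism, whatever ends up driving it (corosync or a human). A sketch of triggering it by hand, assuming an omnibus-style layout; the paths are assumptions:

```sh
# On the secondary that should become the new primary: pg_ctl promote tells a
# standby in recovery to stop replaying WAL and start accepting writes.
sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_ctl promote \
  -D /var/opt/gitlab/postgresql/data

# Verify it has left recovery mode (should now return 'f'):
sudo gitlab-psql -c "SELECT pg_is_in_recovery();"
```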
@yorickpeterse can we get to that meeting with a reverse-engineered explanation of what our setup is? I think it's buried somewhere in the cookbooks, but I'm not confident it is properly explained/documented anywhere.
@ernstvn this has to happen ASAP, I'm just not sure we have the team bandwidth to take it on this week. I would also like to start on the deprecation path for the old Runners ASAP.
@ernstvn #1330 (closed) requires that we add support for configuring extra details in the application. This in turn is a new feature, which we can't release until 9.1, so I don't see that happening this week.
> can we get to that meeting with a reverse-engineered explanation of what our setup is? I think it's buried somewhere in the cookbooks, but I'm not confident it is properly explained/documented anywhere.
@pcarranza reasonably pointed out that this issue should be resolved in the course of discussing HA / failover practices with @northrup and @sfrost. Yes?
@ernstvn Steps 1 to 3 would need to be done by a production engineer. Testing and such probably won't require a production engineer (at least not for staging).
@jtevnan let's reduce the scope of this issue to only replicate our failover mechanism in staging.
Does this make sense?
username-removed-274314 changed title from "Fire drill postgres failover and replication recovery" to "Reverse engineer database failover and reapply to staging"
The corosync setup is non-existent and has to be built from scratch; at the moment we don't really have a db cluster at all. I am creating a corosync+pacemaker cookbook for this.
- DBProdLB
  - IP: 10.46.2.19
  - Backend pools
    - db1 (10.46.2.4)
  - Health check
    - every 5 sec check tcp port 5432
  - accept ports:
    - 6432 (pgbouncer) and forward to 6432
    - 5432 (native postgres) and forward to 5432
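A quick way to sanity-check that the frontend actually forwards both ports the way the pool above describes, run from a host that can reach 10.46.2.19 (pg_isready and nc are assumed to be available there):

```sh
# TCP-level checks, roughly what the 5-second health probe does:
nc -z -w 3 10.46.2.19 5432 && echo "5432 reachable through DBProdLB"
nc -z -w 3 10.46.2.19 6432 && echo "6432 (pgbouncer) reachable through DBProdLB"

# Protocol-level checks: pg_isready speaks enough of the postgres wire
# protocol to confirm a backend answers, without needing credentials.
pg_isready -h 10.46.2.19 -p 5432
pg_isready -h 10.46.2.19 -p 6432
```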
Corosync side:
- start action:
  - gitlab_pgsql_monitor
  - gitlab-ctl start
  - open port 5432
- stop action:
  - close port 5432
- monitor action:
  - verify that port 5432 is open
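In pacemaker terms those three actions map onto a resource agent's start/stop/monitor operations. A hedged sketch of the crm configuration such an agent could hang off; the agent name ocf:gitlab:gitlab_pgsql is hypothetical, standing in for whatever the corosync+pacemaker cookbook ends up installing:

```sh
# Hypothetical pacemaker primitive wrapping the start/stop/monitor actions above.
sudo crm configure primitive p_gitlab_pgsql ocf:gitlab:gitlab_pgsql \
  op start   timeout=120s \
  op stop    timeout=120s \
  op monitor interval=15s timeout=30s

# One-shot view of what pacemaker currently thinks of the cluster and resource:
sudo crm_mon -1
```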
Things we learned from this issue: we know HOW the current failover works, and we have also cleaned up the corosync/pacemaker side of things. However, this stuff will never be packaged and shipped in omnibus, and because the current state of production is broken we will have to rebuild the HA setup anyway.
With these two givens and the knowledge that we should work closely with the build team on this, we are opting to close this issue and move forward on the other one.
@jtevnan So does that mean we currently can't do failovers in production? If so, how long do you estimate it will take to get a working failover solution on GitLab.com?
There is currently no great way to do failovers in production. Pacemaker/corosync do not work in production as intended because each db shows up as its own cluster and the Azure load balancer is only "aware" of one of them (db1). Actions: