We can probably eliminate the Shared Disk Failover approach, as that would constitute a single point of failure and so would not really be HA.
I'm wary of Statement-Based Replication Middleware or Asynchronous Multimaster Replication as they may require conflict resolution. But that may be possible to automate, so I won't rule them out just yet.
https://github.com/nanopack/yoke was the original root of the HN discussion mentioned earlier. It has not been updated in over a year, so I'm a little hesitant.
I believe .com is using corosync/pacemaker.
https://bitbucket.org/openscg/pgha -- It's also not very active. It's written in Perl, so we'd need to start including Perl with the omnibus package, and consider the possibility that we may need to write some Perl ourselves if we need work done on it and upstream is not responding.
So as we discussed, it seems that a PostgreSQL active/standby setup with pacemaker, including a VIP (floating IP), is currently the best option. See the following for our implementation on gitlab.com:
Currently gitlab-pgsql-ha installs a custom pacemaker resource called gitlab_pgsql; we should not do that, and should instead use ocf::heartbeat:IPaddr2 for the floating IP.
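For illustration, this is roughly the kind of resource definition meant here, using pcs syntax (the resource name, IP, and netmask are placeholders; crmsh syntax would differ):

```shell
# Hypothetical floating IP managed by Pacemaker via the standard IPaddr2 resource agent.
pcs resource create gitlab-vip ocf:heartbeat:IPaddr2 \
  ip=192.0.2.10 cidr_netmask=24 \
  op monitor interval=10s
```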
So next steps should probably be (in no particular order):
Evaluate pacemaker's floating IP options. Since pacemaker is already being used, this should be easier to adopt, but it needs to be tested for stability and functionality.
If this isn't going to work, we need to identify other potential solutions.
Incorporate the gitlab.com HA setup within the omnibus package. Based on our conversation earlier, I don't see any reason to deviate from what they've done.
This is already automated in Chef, so we just need to pull in the relevant cookbooks from other repositories and test them together.
Come up with a sane default architecture.
All DB nodes running on dedicated instances? Master running with other services and slave off on its own? I'm leaning towards the former.
Automate administration. Users shouldn't have to interact with individual components directly, at least for basic operations. Either add gitlab-ctl commands, rake commands, or something else GitLab-specific. Should at least cover:
status
manual failover, and manual failback.
Adding/removing slaves (probably, but maybe not)
Test, test, test
Install from scratch, upgrade.
Convert to/from HA?
Adding and removing slaves. How long until they're ready?
Automatic failover.
Automatic failback. This would be nice and doable, but it should also be schedulable, and probably off by default.
So one concern I have with pacemaker, and floating IPs in general, is that the nodes need some sort of network proximity: the service IP address must be routable to any node in the cluster. This rules out geographic redundancy.
This may be a restriction we want to accept though.
@ibaum Another shortcoming of the floating IP model is that it cannot be used on some cloud infrastructures, such as AWS, right? That may be something we have to accept, but it's worth mentioning.
@ibaum This level of HA is not meant for geo-diversity / disaster recovery, it is meant for High Availability within a data center. For geo-diversity we would want to look to the GitLab Geo product.
@northrup That seems reasonable. For the initial version then I'll look into automating pacemaker's floating IPs.
@dblessing I'll do my best to leave the floating IP/load balancer functionality easily disabled and well documented so users can opt for their own solutions when needed.
@ibaum I see what @dblessing is saying, it does look like this is a pain in the arse as every provider does it differently and has a different API that you have to call in order to invoke the function.
@northrup @dblessing Right, I think initially my plan is to clearly document our expectations. For people who can't use pacemaker, it should be clear what they need to configure on their provider's load balancer. It may be worth automating the larger providers' (Amazon, Google) offerings down the road, but that might be a bit much for the initial versions.
I've been spending some time getting pacemaker and its dependencies built within omnibus. This has involved some (a lot of) back and forth with dependencies.
I've currently got corosync and its dependencies building and installed, which does add about 50MB worth of files to omnibus. I'll go back through everything and make sure we're only installing run-time dependencies, not build-time ones.
The next step is glib, which pacemaker depends on. The latest stable version has a build-time dependency on a newer version of pcre than most distributions have in their repositories, so that would require us to build an omnibus pcre, which in turn depends on ncurses, which is failing on CentOS due to wide-character issues. I'll spend some time tomorrow walking the glib version back to one that doesn't require such a new pcre.
If possible we should consider an alternative setup that can also help set up Geo (which is a special case of HA, much simpler without pacemaker, etc.):
Tore down my dev cluster today and attempted to spin it back up from scratch using the code from !1251 (closed). On the primary, I'm running into issues with the listen_address attribute. Once I change that, other calls start erroring because they can't reach the database over 127.0.0.1. I'll need to dig into that tomorrow. Ideally, local commands would still talk to the database over the socket file, and only the standby server(s) would use the network port.
Marin Jankovski changed title from Investigate automating PostgreSQL HA setup to PostgreSQL HA setup with omnibus-gitlab package
We expect that the MR gets merged in the current (alpha) form in 9.2. We will also work on the next iteration in the 9.2 cycle. The goal is to have the full simple PG HA setup shipped with 9.2.
@ibaum from reading through the database load balancing feature, it looks like this requires manual intervention to recover. Is that correct, or will pgbouncer avoid this?
@joshlambert First iteration will most likely require manual intervention for recovery. I hope to add more automation in the next pass, and this could include something to handle automatic failover/failback.
Put together a docker-compose file to make this slightly easier. In its current state, with some manual steps it can spin up a 3-node GitLab instance with:
two db nodes and one app server.
db nodes are primary and slave, with functional streaming replication
pgbouncer isn't yet enabled.
The image it is using is based on a package from this build
To use this:
```shell
docker-compose up -d db_primary
docker-compose exec db_primary bash
gitlab-psql -d template1 -c "ALTER USER gitlab_replication REPLICATION PASSWORD 'PASSWORD'" # Should probably be fixed
gitlab-psql -d template1 -c "ALTER USER gitlab SUPERUSER" # Should definitely be fixed
```
@yorickpeterse Could you talk with @ibaum about the finer details of failover and if it is possible to do this automatically, for the next iteration of automated PG HA?
So there are a couple of things worth mentioning regarding database failover. The first one is pretty straightforward: supporting automatic failover handling in a transparent and easy to use manner is very hard. It's hard enough that PostgreSQL doesn't provide anything for this out of the box, leaving you to come up with something on your own.
pgbouncer is not strictly required for HA; it does however make things easier. For example, if PostgreSQL is unavailable then pgbouncer will block connections until PostgreSQL is back online (if I'm not mistaken), instead of immediately returning a connection error. The same goes for database load balancing: it's not required, but again it makes your life easier.
I have thus far only briefly discussed HA with our database consultant (the plan is to look more into this in the next few weeks), but he mentioned that they usually recommend pacemaker + corosync, combined with streaming replication. I recall him advising against tools such as repmgr, though I don't fully recall what the exact reason for this was.
Something that's also hard to get right is to decide when you do a failover. If a database is unresponsive for a very brief period of time (e.g. 2 seconds), do you want to immediately perform a failover? Do you want to perform a failover when disk performance is degrading?
In short, I don't really see any clear expectation/request of how the system should behave other than that somehow it should magically work without causing downtime. Please correct me if I'm wrong here.
The way I see it, the following steps are involved (roughly):
Use more than 1 replica in hot-standby, each with its own replication slot
Use pgbouncer in front of all databases
Use database load balancing (this allows the application to deal a bit better with random connection failures)
Use something like pacemaker + corosync (or other proven technologies) for cluster management
At first, we only fail over if the primary database is unavailable for a certain period of time (e.g. 10 seconds). To prevent bouncing between servers we should have some form of "cooldown period". During this period we don't perform another failover.
We set clear expectations of additional rules that should trigger a failover, this should happen in co-operation with the package team, production, and product managers (mostly to coordinate this with requests from customers)
The first 3 items are already supported by GitLab, though I think setting up replicas still requires a bit of manual work (e.g. you have to start the base backup); but perhaps I'm mistaken.
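For reference, seeding a replica mostly comes down to a single pg_basebackup run. A minimal sketch, assuming the omnibus layout and a hypothetical primary host:

```shell
# Run on the new standby as the database user; binaries live under /opt/gitlab/embedded/bin.
sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_basebackup \
  -h primary.example.com -U gitlab_replication \
  -D /var/opt/gitlab/postgresql/data \
  -X stream -R -P   # stream WAL, write recovery.conf, show progress
```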
Items 4 to 6 are much harder. For example, we will be re-evaluating our pacemaker + corosync setup to make sure that:
It works
It's well understood
It's something we can ship
It's something we can use to automate failovers
This process however can easily take weeks. Then we need to package it, train our support engineers, document it, etc. Overall I think this process will take at least 6 months.
Stephen hasn’t really seen any easy/quick-to-set-up solution; it’s all complicated
What about using consul for service discovery and health checking?
Stephen hasn’t used consul much for health checks and such for PostgreSQL, wonders if it makes sense
Are there any tools that provide more advanced failover “rules” (e.g. fail over if the disk of a primary is too slow for N minutes)
What kind of health checks would you recommend besides just running “SELECT 1” periodically?
Depends on the failure cases. Examples:
database panics and stops (a simple port check catches this)
Userland stops accepting connections, but the port still remains open (a “SELECT 1” of sorts will catch this)
One option would be to force a write transaction, checking the whole system (e.g. a broken disk). Downside: burning through transaction IDs. Useful if not executed too many times.
Most people use a combination of a port check and a “SELECT 1” as this catches most errors (a sketch follows this list).
Using external metrics (e.g. transaction count per minute) and ensuring this doesn’t drop below a certain threshold
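A rough sketch of the port check plus “SELECT 1” combination (host, port, user, and database name are placeholders):

```shell
# 1. Is the port accepting connections at all?
pg_isready -h db1.example.com -p 5432 || exit 1
# 2. Is userland actually answering queries?
psql -h db1.example.com -p 5432 -U gitlab -d gitlabhq_production -tAc 'SELECT 1' | grep -q 1 || exit 1
```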
Split brains
Right now we immediately terminate the old master processes during a failover
This has to be handled very carefully as split brains are a nightmare
Can we perhaps compose a list of pros/cons of the various solutions, the complexity, risk, etc?
Good ones out there are corosync + pacemaker, and haproxy. They all do have their own quirks and challenges
repmgr: tries to be simple, Stephen’s concern is that it doesn’t cover all the bases because of it.
PAF: looks interesting, but Stephen hasn’t really played with it.
Pacemaker + corosync: generic solution that can be used for other systems, most used by Stephen
Consul: Stephen hasn’t really used it, but overall it looks like a good tool; doesn’t cover the same ground as corosync + pacemaker though
Zookeeper: Stephen heard of it, but never used it
To summarise: a combination of pacemaker + corosync is a commonly used and proven combination, albeit a bit of a complex one; then again HA is complex in general. Since we already use pacemaker + corosync for GitLab.com I think it makes sense to also try to ship this to our customers, instead of trying to cook something together ourselves or by relying on less frequently used pieces of software. I personally don't know enough about pacemaker + corosync to really judge this setup, but I'll take the word of our DB consultant seeing he has used this stack very often.
I recall someone (I think @ibaum?) stating in the past that shipping pacemaker + corosync is difficult, and I'm not sure whether the licenses of this software would allow it (I haven't checked what licenses are used). As such, we should investigate this first.
For whichever solution we want to go with, I think for the initial release we should rely on the versions in the distributions' repositories and have a recommended configuration for users. But we should probably choose something we want to incorporate, to make transitioning to a version included in omnibus easier.
On the first run of this ticket, I looked into incorporating pacemaker + corosync. Both projects use the BSD license, so licensing shouldn't be an issue. Preliminary testing did show they were going to add a decent amount of space to the omnibus package. I don't remember the number exactly, but it was enough to cause some alarm. I'll dig up the old work, rebuild, and see if I can get a more exact number.
@marin Already did some work getting HAproxy in omnibus in !1092.
For both options, the configuration is onerous, but once up and running they have pretty solid track records. I have experience with both, but more with HAproxy. Really the only issues I ever had with them were when I needed to make changes.
Consul is interesting. I believe the production team is using it for other purposes. @omame @northrup can you add some information on this?
@joshlambert Amazon doesn't really give us a whole lot we can do outside of Amazon. If customers are looking to run inside Amazon, there is nothing wrong with using the Amazon provided HA db instead of our bundled version.
@marin @yorickpeterse I did a quick test of a PostgreSQL restart behind pgbouncer, as this is something Marin and I had discussed. With a constant stream of requests, if you just restart the main PostgreSQL server, pgbouncer returns errors and GitLab 500s. Pgbouncer does provide pause and resume commands for this, but with a constant stream of requests, the pause never went through until I canceled the request. I'll do some more research, but it does look like the best approach for a smooth restart will involve manual failovers/failbacks.
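For reference, the pause/resume flow looks roughly like this, assuming pgbouncer's admin console on its default port 6432 with an admin user named pgbouncer:

```shell
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c 'PAUSE;'   # wait for active queries, then hold new connections
gitlab-ctl restart postgresql
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c 'RESUME;'  # release the held connections
```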
@ibaum I was just curious if we could get some takeaways from how Amazon is operating their HA Postgres service. They clearly are able to offer a relatively turnkey HA solution for Postgres.
It is expected that HA for the database is not simple, but I am a bit saddened that PG does seem to make this a bit more complicated.
repmgr: tries to be simple, Stephen’s concern is that it doesn’t cover all the bases because of it.
@yorickpeterse do we know any more details on repmgr and what bases it doesn't cover? We're obviously not going to make things better with a complicated corosync + pacemaker setup, so it might be worth considering a simpler tool with limits.
Consul is interesting. I believe production team is using it for other purposes.
I love Consul and I believe it does have a place in our HA stack. However, I am a bit wary of using it for this purpose. @ibaum I would rather know a bit more in depth about the 1st and 2nd options.
@joshlambert Amazon is not trying to ship this to the outside world so they only have one focus point :)
With a constant stream of requests, if you just restart the main postgresql server, pgbouncer returns errors and GitLab 500's.
That doesn't sound quite right, but perhaps we need to enable something for it to automatically block connections (or perhaps I misread).
Regarding haproxy, if we already deploy pgbouncer then I don't think haproxy is necessary, as you can use pgbouncer for it instead. For example, you can update connection details on the fly and it will reconnect (not sure if it first allows queries to finish though).
@marin Regarding repmgr, what I remember from looking into it is that it's mostly a monitoring solution and you still need to handle failovers yourself (that is, repmgr only triggers it). This isn't very different from pacemaker + corosync. Having said that, I do think we definitely need to investigate it just so we're aware of what it can do.
Had a call with @ibaum about the state of things and the plan. Based on this I believe we need to do the following:
Give repmgr some testing so we are better aware of its capabilities, since I don't want to dismiss using it just yet
We will use HAProxy in front of pgbouncer, then update HAProxy during a failover so it points to a different pgbouncer/database
We still need to walk through our current pacemaker + corosync setup, set it up in staging, evaluate it, etc. I doubt we will get this done before 9.2.
The dependencies of pacemaker/corosync are still a concern at this point. It will be EE only, but it will still be a noticeable growth in package size.
The concern with repmgr on its own is that there still isn't a load balancer in front. Repmgr can handle monitoring and automatic failover, but the app still needs to know who to talk to. There are some workarounds:
DNS updates at failover
On the fly update of pgbouncer's backend
...
With HAProxy on its own, it can detect and evict a failed node, but it can't really handle failover/failback.
So for simplicity, repmgr seems to be a good option. We can work around the features it doesn't provide. But a sweet spot would seem to be an haproxy/repmgr combo. Unfortunately, that requires shipping two more pieces of software with GitLab.
I'll spend some time today checking out repmgr and haproxy both on their own, and working together.
https://gitlab.com/ibaum/ha updated with an lb container for testing out haproxy. Base configuration in there works. The time it takes to consider a node down is long, but that can be adjusted. Other than that, shutting down the primary node causes haproxy to reroute traffic automatically to the secondary.
Will need to do some more in depth testing to see how it handles existing sessions.
Preliminary testing with repmgr did not go as well. I was not able to get the Ubuntu repository version of repmgr to work at all; I believe it's a path issue related to our PostgreSQL being in an unexpected location. Building from source was relatively easy (points for that), but following the instructions I was only able to get the first server initialized. Attempts to clone a standby would generate the error: [ERROR] parameter 'wal_level' must be set to 'hot_standby' or 'logical'. This setting was correct on the primary. Since the secondary hadn't been initialized yet, it didn't have any settings, as expected. I tried just creating a barebones postgresql.conf, but no luck. Spending some time tracing this has so far not met with any success.
If we just move forward with HAproxy alone, we have to contend with the following:
If the primary server goes down, HAproxy will just route traffic to the secondary. We still need a process to
alert the secondary that it is now the primary
restore the former-primary to the cluster
Also, with the way postgresql is set up, it does seem to assume that if you fail over, you will just use the new primary until it fails, rather than failing back once the busted server is returned. This will be a bit of an issue with HAProxy. The only balance mode which makes sense for us is first. However, if the primary fails and we return it to service as a standby, we'll have issues, as haproxy will see it as back online and try to route traffic to it. Adding repmgr alongside haproxy will not fix this.
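For context, the kind of haproxy section under discussion; the addresses and the check user are illustrative:

```shell
cat >> /etc/haproxy/haproxy.cfg <<'EOF'
listen postgresql
  bind *:5432
  mode tcp
  balance first                       # always prefer the first live server in the list
  option pgsql-check user gitlab
  server db1 10.0.0.10:5432 check     # original primary; once it checks "up" again it gets traffic
  server db2 10.0.0.11:5432 check
EOF
```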
We don't necessarily have this issue if we go with repmgr on its own, but that does still leave us with the issue of where the application should route database traffic.
I was able to get repmgr working by building from source, rather than using the debian repository version. Manual failover worked. Still need to come up with a decent process for returning a failed node to service.
Also need to check repmgrd to see how well that works for monitoring and automatically failing over.
Confirmed today that the requirements for adding pacemaker/corosync to the package are probably too much. This is largely due to the size it would add to the package. On top of that, however, it would add about 10 pieces of software to the package due to dependencies, which we would then be responsible for.
Haproxy is still a mostly viable candidate for the load balancing aspect (the syslog requirement is troubling). @marin also pointed out that nginx can do TCP load balancing now. As we already package that, it is probably worth doing some testing.
Did some research into Nginx as a TCP load balancer. None of the load balancing methods are really compatible with the active/passive model. However, we could probably overload the on the fly configuration method to allow for one master at a time.
This isn't my favorite option at the moment.
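For the record, the rough shape of the "one master at a time" idea: the stream upstream lists only the current primary, and a failover would rewrite this file and reload nginx. Paths and addresses are placeholders, and it requires an nginx built with the stream module:

```shell
# This file must be included from the top-level (main) context of nginx.conf, not inside http {}.
cat > /etc/nginx/stream-postgresql.conf <<'EOF'
stream {
  upstream postgresql_primary {
    server 10.0.0.10:5432;   # rewritten to point at the new primary on failover
  }
  server {
    listen 5432;
    proxy_pass postgresql_primary;
  }
}
EOF
nginx -s reload
```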
From today's call, Stephen brought up that we should make sure that when we do a failover we fail over to the most recent replica, and not just a random one. Say we have two replicas at these positions:
+1
+2
If the primary was at +3 and we fail over to +1, then the +2 replica will refuse to replicate from +1 because +2 is further ahead. This in turn means you'll have to rebuild some replicas depending on whether they're ahead/behind/etc.
This means that during a failover we essentially need to take these steps:
Block all queries to the primary (in case of a controlled failover)
For all secondaries get the WAL position
On a random secondary, compare the WAL pointers using PostgreSQL's built-in functions to get the greatest position (a sketch follows this list)
Promote whatever host is the furthest ahead as the new primary
Update any Azure load balancers/whatever
Stop the old primary (as in, physically stop PostgreSQL) to prevent split brains
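To make the WAL comparison step concrete, a rough sketch (the function names are the PostgreSQL 9.6 ones; on 10+ they are pg_last_wal_replay_lsn() and pg_wal_lsn_diff(), and the LSN values below are just examples):

```shell
# On each standby: report the last replayed WAL location.
gitlab-psql -tAc 'SELECT pg_last_xlog_replay_location();'

# Anywhere: compare two reported locations; the larger result is further ahead.
gitlab-psql -tAc "SELECT pg_xlog_location_diff('1/3000060', '1/3000000');"
```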
Testing of repmgr to stand up a standby server, and do a manual failover went well so far. See gitlab-com/infrastructure#1959 (closed) for the details. We'll do one more failover on staging to confirm that it works, then that should be checked off the list.
The next step is to evaluate repmgrd for orchestration.
I did a quick run through of the repmgr docs regarding repmgrd. It looks like it uses a table in the repmgr database we create to track the replication status of each standby node. Each node will periodically update the master node with its status. In the case that the master fails, it uses an algorithm to determine the best candidate for the new master, with each standby pulling the data out of their read only copy of the database. The nodes themselves then handle promoting themselves, or following the new master depending on what's appropriate.
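For reference, the repmgrd-relevant portion of repmgr.conf looks roughly like the following; the node numbers, names, connection details, and repmgr user/database are assumptions rather than our final layout:

```shell
cat > /var/opt/gitlab/postgresql/repmgr.conf <<'EOF'
cluster=gitlab_cluster
node=2
node_name=standby1
conninfo='host=10.0.0.11 user=gitlab_repmgr dbname=gitlab_repmgr'
failover=automatic
promote_command='repmgr standby promote -f /var/opt/gitlab/postgresql/repmgr.conf'
follow_command='repmgr standby follow -f /var/opt/gitlab/postgresql/repmgr.conf'
EOF
```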
Concerns:
Relying on the database we're managing for tracking node states.
There isn't really an election-type process where the remaining nodes make sure they can communicate and then vote for a new master. They do at least ensure the other nodes are reachable. After that they use an algorithm (plus the priority field) to calculate the appropriate master. We'll want to take advantage of a witness server to help reduce the risk of a split-brain scenario.
Check the log file and make sure there are no errors.
After that, shut down PostgreSQL on the primary with gitlab-ctl stop postgresql. Watch the log file on both standbys; they will count down from 60, then one of them will choose to become the primary.
Issues:
By default, repmgr uses pg_ctl -w -D /var/opt/gitlab/postgresql/data -m fast restart to try to restart PostgreSQL, which doesn't work. I had to manually restart with gitlab-ctl restart postgresql. That command requires root access due to permissions on /opt/gitlab/sv/postgresql/supervise, but repmgrd runs as gitlab-psql.
Did some tracing on the pg_ctl restart issue. The service is in fact restarting, pg_ctl just never registers the shutdown, so eventually exits non-zero.
So the automated repmgrd test was quite successful, and if we can connect it to pgbouncer (as mentioned above) it can greatly ease the failover process.
My only reservation is how the new primary is chosen.
Next steps: get the packaging side of things up and running, and plan out the automation and configuration bulkheads (e.g. what could be automated, like pg_hba.conf, vs. what should not be automated, like adding shared libraries).
So #2477 (closed) should also be fixed or worked around before we progress.
repmgrd runs as the gitlab-psql user, and just needs a method to restart the database. Currently, running gitlab-ctl restart postgresql requires root access.
I currently have a test build that includes repmgr as an EE dependency, the MR is !1638 (merged). It installs the software only, no configuration is yet done.
Most of the steps here can be done outside of the current automation, and we don't need to worry about gitlab-ctl reconfigure overwriting anything. The exception is the pg_hba.conf modifications. Those should be the next things to fix. From there, we could put together some documentation on how to get a GitLab database cluster managed by repmgr.
For the next level of automation we need to add a repmgr cookbook.
Could be in the repmgr cookbook or as a gitlab-ctl command:
Registering a node as a master: I think this really should go in the cookbook.
Registering a node as a standby (I think this should include cloning). I think this is fine in the cookbook, but should refuse to do anything if there is already data in the directory it is pointed to.
As a gitlab-ctl command:
Converting a failed master to a standby
Manual failover/switchover.
Disabled by default. This isn't really useful (and is potentially harmful) without some sort of load balancer or proxy in front.
For reference, #2488 (closed) was done to allow custom entries in pg_hba.conf, and was merged for 9.3. This can be used to manually get repmgr working with omnibus. We'll piggyback on this as well with the repmgr automation.
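For anyone wiring this up manually, these are the kinds of pg_hba.conf entries repmgr needs (the addresses and the repmgr user/database names are illustrative):

```shell
cat >> /var/opt/gitlab/postgresql/data/pg_hba.conf <<'EOF'
host  replication    gitlab_replication  10.0.0.0/24  md5
host  gitlab_repmgr  gitlab_repmgr       10.0.0.0/24  md5
EOF
gitlab-ctl restart postgresql   # a reload is actually enough for pg_hba.conf changes
```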
I spent some time with the repmgr doc on using repmgr and pgbouncer to isolate failed nodes. So the way it would work is:
pgbouncer would run on application nodes
the application nodes would be configured to talk to the local pgbouncer instance for its database connection
Each pgbouncer node would be configured with the master database as its backend
repmgrd would run on the database nodes
In the case of a failover, repmgrd would also need to
Edit an ini file available to each application node to point to the new master (I'm not a huge fan of this step)
Send a reload command to pgbouncer.
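A very rough sketch of what such a hook could look like; the paths, addresses, and the split-out databases.ini (pulled into pgbouncer.ini via %include) are hypothetical:

```shell
#!/bin/bash
# Hypothetical repmgrd promote hook: repoint pgbouncer at the new master and reload it.
NEW_MASTER=10.0.0.11
cat > /etc/pgbouncer/databases.ini <<EOF
[databases]
gitlabhq_production = host=${NEW_MASTER} port=5432
EOF
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c 'RELOAD;'
```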
I think I'd be happier if
The application servers saw the failure and updated themselves
or better yet
The failover was completely transparent to the application nodes
We could alleviate the ini file issue with a shared location (NFS, etcd, what have you).
So this probably brings a load balancer back into the conversation. Both Haproxy and Nginx support dynamic backends for TCP load balancing. The Nginx implementation was a little cleaner, and we already include Nginx in omnibus, so that should probably be the first one we test. That would bring us closer to transparent failover.
First pass of repmgr automation has been merged. There is a WIP MR to get the documentation updated as well.
Next steps (not necessarily in order):
Orchestration testing. Need to flesh out the different options more precisely, then do some testing on how well they actually work.
Repmgrd automation. This shouldn't be too complicated to add onto the existing repmgr automation. We also might consider adding this before, or at least independently of the orchestration. Users who already have some sort of load balancing in place could take advantage of this earlier.
repmgrd/pgbouncer:
Application communicates with pgbouncer for its database connection
pgbouncer communicates directly with master node
On master failure, repmgrd updates pgbouncer instances with new master
repmgrd/pgbouncer/{nginx,haproxy}:
Run pgbouncer on application nodes
Application communicates with pgbouncer for database connection
Pgbouncer communicates with load balancer
Load balancer communicates with master node
On master failure, repmgrd updates load balancer with new master
For both paths, the method for repmgrd to inform of updates is still a bit up in the air.
Using ssh keys to allow repmgrd to update other nodes is an option
Pros:
It should work with our existing package. No new software needs to be added.
We do not need to grant root access to repmgrd. Everything it needs can be done by the SQL user.
Cons:
We enter an unknown state if notification to a node fails.
Utilize a key-value store that repmgrd will update on new master promotion. Application or load balancer nodes will periodically check for a change and update their config accordingly (sketched after this list).
Pros:
Nodes are responsible for ensuring their own state is up to date
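A hand-wavy illustration of this option using Consul's KV store and watches (the key name and handler script are hypothetical; etcd would work much the same way):

```shell
# On promotion, repmgrd (or its promote hook) publishes the new master:
consul kv put gitlab/db/master 10.0.0.11

# On application/load balancer nodes, react to changes by rewriting local config:
consul watch -type=key -key=gitlab/db/master /usr/local/bin/update-db-backend.sh
```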
This issue has gotten a bit unwieldy at this point, and most of the work has been completed. I've opened issues for the remaining tasks and will close this in favor of those.
The current status is:
We have a documented process for creating a multi node PostgreSQL cluster using the standard omnibus tools.
The cluster is active/passive, with one master node
In case of master failure, manual failover is possible