@pcarranza I'm thinking of using PostgreSQL's built-in WAL replication (which can replicate DDL changes) combined with something like pgpool. Slony isn't something we'd use other than for the 9.2 to 9.5 upgrade.
@pcarranza Yes, this unfortunately won't be an easy process. Even with Slony there are numerous problems. For example, Slony requires two databases to be active: the source and the target database(s). If one uses a single server for their GitLab instance (e.g. most smaller organisations) they will only have a single database running. This means that when GitLab is upgraded some kind of arcane wizardry has to run in order to basically install two versions of GitLab, so we can run both the old and new database in parallel. We then have to start replicating data and ideally make the new database the master the moment it's in sync with the old database.
Upgrading will be a bit easier if one can afford 2 servers, as this allows them to install a new GitLab version on the new server and then replicate the data between the two servers. It might be easier to settle for this.
I discussed bits of this with @marin and in the worst case we can fall back to only providing pg_upgrade upgrades, but ideally we also ship Slony in such a way that it's at least somewhat usable (e.g. the switching of databases will probably require manual intervention). In the future we can then maybe swap Slony for something better.
In short, this is going to be quite the adventure.
Load balancing happens on a per-session basis. Since we re-use connections this means that a node is probably assigned the first time a connection is used, after which it's re-used. Personally I would prefer to balance per HTTP request/Sidekiq job or something like that, since it won't pin connections to the same server for too long. Having said that, per-connection balancing should be good enough as a start.
pgpool supports a few replication techniques, including PostgreSQL's streaming replication. To ensure consistency between primaries and secondaries we need to use a form of synchronous replication (https://www.postgresql.org/docs/current/static/logicaldecoding-synchronous.html). Nothing is more annoying than creating an issue, reloading a page, and seeing the issue is not there yet.
pgpool supports automatic failovers and the like, which may conflict with corosync. An alternative is to use corosync and the like to load balance entire clusters, while using pgpool to load balance between servers in said cluster.
The rationale for using pgpool is that we have a large number of reads, while having a relatively low number of writes:
Here we can see that the number of reads (the number of tuples being read, to be specific) significantly outweighs write operations. This means that we could spread database load around by using 1 primary and a number of secondaries. For example, we would have 1 primary and 4 read-only secondaries. SELECT queries are balanced between the 4 read-only secondaries, while writes always go to the master.
/cc @ibaum because of WIP Postgres HA in omnibus-gitlab packages. @yorickpeterse would be great to sync up with @ibaum on this so we avoid duplicating work.
The more I see people advocating for pgbouncer (and advocating against pgpool), the more I think we might need to take a different path. Instead of using pgpool (which is quite a beast configuration wise) I'm thinking of the following setup:
The application knows of two database hosts: a primary and a secondary. The primary is configured using the usual settings, while the secondary would be configured separately/optionally. The application looks at queries and uses the secondary connection when:
A query starts with SELECT
The query does not include FOR UPDATE
Everything else goes to the primary.
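A rough sketch of what this routing rule could look like in the Rails application (the module and method names here are made up for illustration, not existing GitLab code):

# Hypothetical sketch of the routing rule described above. Returns the
# pool a given SQL string should go to.
module LoadBalancing
  # A query may use a secondary when it starts with SELECT and does not
  # lock rows with FOR UPDATE; everything else goes to the primary.
  def self.pool_for(sql)
    stripped = sql.strip
    if stripped.match?(/\ASELECT/i) && !stripped.match?(/FOR\s+UPDATE/i)
      :secondary
    else
      :primary
    end
  end
end

LoadBalancing.pool_for('SELECT * FROM issues WHERE id = 1') # => :secondary
LoadBalancing.pool_for('SELECT * FROM issues FOR UPDATE')   # => :primary
LoadBalancing.pool_for('UPDATE issues SET title = $1')      # => :primary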
Under the hood, the primary host and secondary host used by the application are actually Azure/haproxy load balancers. There are thus two load balancers:
A primary balancer
A secondary balancer
Both forward requests to a pool of pgbouncer hosts/processes. The primary pgbouncer pool forwards all connections to the primary, while the secondary pool forwards all connections to the pool of all database hosts (so primary + secondaries). This leads to the following flow of traffic:
Here the secondary pool may use the primary, but that's fine.
pgbouncer in turn uses a DNS record to figure out what hosts it has to forward data to. pgbouncer can check for SOA records and the like to figure out when it has to update its list of IPs (as it caches DNS records). It's possible this may not be super reliable, so we might need another solution. However, the point is that during a failover we update whatever pgbouncer forwards to while leaving the rest untouched.
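For reference, a rough sketch of the pgbouncer settings involved here (host and database names are placeholders, and the values are illustrative):

; pgbouncer.ini fragment for DNS based re-resolution
[databases]
gitlabhq_production = host=db-primary.example.com port=5432 dbname=gitlabhq_production

[pgbouncer]
; re-resolve cached DNS lookups after this many seconds
dns_max_ttl = 15
; periodically check the zone's SOA serial and re-resolve hosts when it changes
dns_zone_check_period = 30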
One thing I'm not sure about just yet is if pgbouncer can somehow detect if a node goes down, and if it can do something in that case. My gut feeling is "no". In that case we'd need to probably stick an haproxy between pgbouncer and PostgreSQL, producing the following graph:
In theory it's possible for the application to connect directly to pgbouncer. This however means we can't have a cluster of pgbouncer processes running on multiple hosts, turning this into a single point of failure.
The benefits of this setup:
We can use pgbouncer, which according to others is more stable
We can re-use haproxy for load balancing, instead of having to learn how to do load balancing using pgpool
We have more control over load balancing as the logic resides in the application. This means that we could for example redirect queries based on the tables used.
We can use repmgr or other tools to handle failovers/replication
Diving through the code I think Makara might be doing a little bit too much. For example, it supports sticking queries to a master after a write, seems to have some form of failover support, and it also relies heavily on regular expressions for figuring out where to send a query (which is not cheap to run for every query). There also appear to be some issues when using Makara with Sidekiq: https://github.com/taskrabbit/makara/issues/133
Another option would be to have 2 pgbouncers running on every host: one for the primary, and one for the secondary. This means we can cut out haproxy at the start, creating the following flow:
Benefits include:
No need to run a separate pgbouncer cluster (on separate hosts) and load balance between these
Less connection overhead due to fewer haproxies involved, and pgbouncer runs on localhost
Each pgbouncer process only has to handle the traffic of the local host, spreading load/traffic better
Downsides:
Worker hosts now also have to run at least two pgbouncer processes on different ports/sockets
If a local pgbouncer goes down the application cannot connect
A greater number of connections is necessary on PostgreSQL's end as every local pgbouncer needs to have enough connections available
With all the setups proposed above, in the event of a failover we only need to update the primary haproxy's configuration. Because the secondary haproxy forwards connections to all hosts we don't need to change anything there.
The way it looks, we would need some form of service discovery to identify the primary and the secondaries. For something like this I can't help but think of DNS as the way to discover which IP I should connect to in order to get a writable connection. This way we could keep the Rails implementation quite dumb and just rely on the service announcing itself.
So this state sharing has to be quite dynamic, as things move around during a failover event, and the application needs a way of being notified so it can start finding the new primary.
I think we need to stop and think about how repmgr behaves, and include it in the planning to figure out how things will work as a whole.
Regarding the setups, I'd rather have fewer moving parts: let's try removing HAProxy from the equation.
Regarding pgpool, I finally got to do load testing after fighting our staging environment for days. Basically everything that could break either broke, or wasn't configured in the first place.
To test things I used siege running on db1.staging (not used for anything else at the moment). I gathered a list of URLs to test and saved these in urls.txt. I ran the test using the following command:
siege -c 16 -i -b -t 1h -f urls.txt
This will request the URLs randomly using 16 concurrent clients, for 1 hour.
pgpool Configuration
For these tests I ran pgpool-II version 3.4.3 on the staging primary directly. Version 3.4 is not the latest version available (3.6 is), but it was the one available in apt, saving me the time of building a new omnibus package. I intend to also test 9.6 later to see if there are any differences.
pgpool was configured to run with connection pooling and load balancing, in master-slave mode. No query caching was enabled. The pool settings were as follows:
num_init_children (the number of processes to run): 250
max_pool (the number of connections to cache per process): 8
child_life_time (idle time after which a pool process exits): 1800 sec
This can yield a maximum of 250 * 8 = 2000 connections, which equals staging PostgreSQL's max_connections setting.
Because this runs on the primary it will add some overhead/context switching, but it was the easiest way of setting things up. When testing connections directly to PostgreSQL I disabled pgpool.
pgpool was configured with 4 backends: 1 primary, and 3 secondaries (in hot standby mode). All backends were assigned an equal weight.
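For reference, a sketch of the relevant pgpool.conf fragment for the setup described above (the host names are placeholders, not the actual staging hosts):

# connection pooling
num_init_children = 250
max_pool = 8
child_life_time = 1800

# streaming replication based master/slave mode with load balancing
master_slave_mode = on
master_slave_sub_mode = 'stream'
load_balance_mode = on

# 1 primary and 3 hot-standby secondaries, all with equal weight
backend_hostname0 = 'db-primary.example.com'
backend_port0 = 5432
backend_weight0 = 1
backend_hostname1 = 'db-secondary1.example.com'
backend_port1 = 5432
backend_weight1 = 1
backend_hostname2 = 'db-secondary2.example.com'
backend_port2 = 5432
backend_weight2 = 1
backend_hostname3 = 'db-secondary3.example.com'
backend_port3 = 5432
backend_weight3 = 1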
Vanilla PostgreSQL
When connecting the workers directly to PostgreSQL the output of siege is as follows:
yorickpeterse@db1:~$ siege -c 16 -i -b -t 1h -f urls.txt
** SIEGE 3.0.8
** Preparing 16 concurrent users for battle.
The server is now under siege...
Lifting the server siege... done.
Transactions: 23796 hits
Availability: 100.00 %
Elapsed time: 3599.50 secs
Data transferred: 3594.87 MB
Response time: 2.42 secs
Transaction rate: 6.61 trans/sec
Throughput: 1.00 MB/sec
Concurrency: 15.99
Successful transactions: 23796
Failed transactions: 0
Longest transaction: 21.83
Shortest transaction: 0.39
Key metrics:
Average response time: 2.42 sec
100% availability
6.61 transactions per second
Graphs for transaction distribution, CPU usage, and database load (not reproduced here).
pgpool
When connecting the workers to pgpool the output of siege is as follows:
** SIEGE 3.0.8
** Preparing 16 concurrent users for battle.
The server is now under siege...
Lifting the server siege... done.
Transactions: 24722 hits
Availability: 100.00 %
Elapsed time: 3599.83 secs
Data transferred: 3712.65 MB
Response time: 2.32 secs
Transaction rate: 6.87 trans/sec
Throughput: 1.03 MB/sec
Concurrency: 15.97
Successful transactions: 24722
Failed transactions: 0
Longest transaction: 24.40
Shortest transaction: 0.38
Key metrics:
Average response time: 2.32 sec
100% availability
6.87 transactions per second
Graphs for transaction distribution, CPU usage, and database load (not reproduced here).
Here we can see that the load is spread amongst PostgreSQL instances fairly evenly. The load on the primary hovers between 2 and 5.
Conclusions Thus Far
Using pgpool (or any other kind of load balancing solution) we can reduce the load on the primary by a factor of 2 to 3. In the test case above using pgpool also led to a decrease in response times of about 100 milliseconds. When using pgpool each host also only uses about 20% of the CPU, meaning we could (if this pattern is similar for production) downsize the database.
I plan to do some extra load testing involving more writes, fewer hosts (e.g. only 2 secondaries), etc.
Production sits at a CPU usage of around 80% most of the time. Assuming an equal distribution of load amongst hosts we'd need at least 4 hosts (so each can do about 20% of the load). This would result in the primary's CPU usage being reduced by 75%.
Since we need some form of synchronous replication when load balancing (to ensure consistency) I did some basic testing. For this I used pgbench, running a simple UPDATE query (queries.sql) for 30 seconds. Without synchronous replication the output was:
transaction type: queries.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 30 s
number of transactions actually processed: 2394
latency average = 12.534 ms
tps = 79.785918 (including connections establishing)
tps = 79.792625 (excluding connections establishing)
I then configured PostgreSQL as follows:
synchronous_commit = 'remote_apply': this ensures a COMMIT does not return until the data is truly visible on the replicas
synchronous_standby_names = '3(*)': this tells PostgreSQL that at least 3 replicas need to acknowledge the COMMIT (all replicas in the case of staging)
Then I re-ran pgbench, which produced the following output:
transaction type: queries.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 30 s
number of transactions actually processed: 1463
latency average = 20.508 ms
tps = 48.761235 (including connections establishing)
tps = 48.766034 (excluding connections establishing)
Here we can see that our simple UPDATE took about 1.6 times longer when using a synchronous COMMIT for 3 replicas. The time will most likely increase depending on the data that has to be committed.
One thing to take into account here: when you set synchronous_standby_names to 3(*) it will block COMMITs if there are fewer than 3 replicas. In other words, if you ever want to remove a replica you first have to adjust this setting and then remove it. Fortunately PostgreSQL doesn't require a hard restart to apply these settings; a simple SIGHUP will do.
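For example, to drop the requirement from 3 standbys to 2 before removing a replica, something along these lines should do (a sketch; it assumes the setting is managed via ALTER SYSTEM rather than a hand-edited postgresql.conf):

-- Require 2 confirming standbys instead of 3.
ALTER SYSTEM SET synchronous_standby_names = '2(*)';
-- Reload the configuration (equivalent to a SIGHUP); no restart needed.
SELECT pg_reload_conf();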
We have been using pgbouncer + HAProxy for 2 years now and are happy with the results. Apart from what you have mentioned, one of the advantages is removing/adding new slave servers easily without worrying about the impact on users. The only downside for us is that we cannot split role access, since pgbouncer connects to Postgres with a single role/user.
@saifulmuhajir Are you also using secondaries in hot standby mode for this, or are you just bouncing all connections always to the master?
Considering we also want to shard in the next few months, I'm wondering if using load balancing makes any sense. When you shard tables you basically already balance the load, and you don't need any kind of synchronous replication for this. As such I'm thinking we can gain more by doing pgbouncer + citus (for sharding), and forget about pgpool and custom load balancing solutions; it would also save us code.
Set up synchronous replication in production to measure the overhead:
synchronous_standby_names = '2(*)'
synchronous_commit = 'remote_apply'
Measure the overhead this adds to writes
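A sketch of how the overhead could be measured, reusing pgbench as in the staging test (the script name write_test.sql and the database name are illustrative):

# Run a write-only script for 60 seconds; compare latency/tps with and
# without synchronous_commit = 'remote_apply' enabled.
pgbench -n -c 1 -T 60 -f write_test.sql gitlabhq_production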
Next we'd need to set up pgbouncer. We can either include this in omnibus, or just install it using apt-get for just GitLab.com. Initially I think it's better to do the latter as this saves us having to ship/support it officially.
In parallel to this we'd have to update the application to support redirecting queries. https://github.com/taskrabbit/makara/ could be used for this, but it relies on regular expressions and whatnot, which can add quite some overhead (as we need to run this logic for every query in the worst case). Writing our own logic is annoying but would let us optimise it as much as possible.
Once done we'd need to set up two pgbouncer hosts: a primary and a secondary. Both use DNS records to figure out what IP(s) to send connections to (see all the above schematics). We'd then need to update these records in the event of a failover, which means we need extra machinery for this. At the very least we need to start with a script that one can run manually to perform these steps.
tl;dr: This is going to be quite a bit of work. However, we can sort of count pgbouncer out of the time estimate as we need it anyway, regardless of whether we use sharding or load balancing.
Next we'd need to set up pgbouncer. We can either include this in omnibus, or just install it using apt-get for just GitLab.com. Initially I think it's better to do the latter as this saves us having to ship/support it officially.
@yorickpeterse If this is the direction in which we are going for PG HA then it makes sense to have it in the package. The PG HA offering for our customers depends on the decision we make for GitLab.com. If you mean only for testing purposes then it makes sense to first see how things will behave before we make the jump into shipping.
Having discussed this in the infra call today we'll go with load balancing first, since we're still a few months away from sharding. One solution we can apply (similar to Makara) is to set a cookie after a write. Once this cookie is set (and until it expires), all reads go to the master. The timeout for this should be set to something reasonable based on our replication lag. Synchronous replication would work and ensure data is more consistent, but it also means that a slow replica can have a really annoying impact on writes, potentially leading to timeouts.
Another problem with the cookie approach: say you create an issue and you want to send an Email to mentioned users. In this case a cookie won't help as that only works for Unicorn. This could lead to a Sidekiq job trying to find a row that's not present yet, thus failing (or just discarding the job). A workaround would be to have Sidekiq always use the master. However, a decent chunk of our load comes from Sidekiq so this would be less than ideal.
So to summarise, to deal with replication lag there are basically two solutions:
Store a cookie with a 30 second timeout (or something like that) after a write. Read from the primary while this cookie is valid.
Pros:
replication delays don't affect writes
a replica not responding won't result in writes being blocked (depending on the value of synchronous_standby_names)
Cons:
in the event of a failover it is possible for newly created data to be lost due to asynchronous replication
does not work for Sidekiq, and having Sidekiq always use the primary is not ideal. On the other hand, Sidekiq usually does mostly writes anyway so this might not be that bad
queries will use the primary until the timeout, even if the data is already available on the secondaries
sharing a link to a newly created resource (e.g. an issue) may result in others running into a 404 as they don't have the cookie set
Use synchronous replication
Pros:
works for both Unicorn and Sidekiq
no cookies or arbitrary timeouts required
less chance of data loss as COMMITs don't complete until a number of secondaries (depending on the configuration) have the data
Cons:
writes are slowed down depending on the number of secondaries that have to confirm a COMMIT
replicas not responding can lock up writes until they time out
I'm leaning towards the following:
Asynchronous replication, perhaps in the future with 1 replica having to acknowledge writes, mostly to reduce the chance of data loss (unrelated to load balancing)
Use a cookie for Unicorn after a write, cookie TTL somewhere between 10 and 30 seconds
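A minimal sketch of the cookie idea for Unicorn (the cookie name, TTL, and module below are made up for illustration; this is not existing code):

# Hypothetical Rails controller concern for sticking a user to the
# primary after a write. Assumes it is included in ApplicationController.
module StickToPrimary
  extend ActiveSupport::Concern

  STICKY_COOKIE = :gitlab_use_primary
  STICKY_TTL = 30.seconds

  # Call this after any request that performed a write.
  def stick_to_primary!
    cookies[STICKY_COOKIE] = { value: '1', expires: STICKY_TTL.from_now }
  end

  # While the cookie is present, route all queries to the primary;
  # otherwise reads may go to the secondaries.
  def use_primary?
    cookies[STICKY_COOKIE].present?
  end
end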
@yorickpeterse we use standby servers as well. On some databases we have two to three slaves and put them in the connection pooler along with the master to split the load. Actually, the initial purpose of using the pooler was to handle connections from apps, which can reach up to 3K per second. Yes, we have lots of apps. :)
Once done we'd need to set up two pgbouncer hosts: a primary and a secondary.
We actually put both primary and secondary on one machine by using different port numbers. Thankfully, so far so good.
@saifulmuhajir How do you handle load balancing in that case, is this done using custom application logic or are you using something that already existed?
Having discussed this in the production call, we'll go with the following setup for load balancing:
In the Rails app we define a list of hosts, of which one may be the primary
Upon boot we detect which host is the primary and mark it as such (see the sketch after this list)
SELECT queries are round-robin'd between all hosts
Writes go to the primary
If a write fails because the master is now a secondary, we figure out the new primary and retry (up to 3 times)
For GitLab.com the host names configured are actually pgbouncer hosts, one for every DB host
Adding or removing hosts requires updating the Rails configuration using Omnibus
This is PostgreSQL specific and disabled by default. MySQL will not be supported
A write will set a cookie/session variable that times out after a few seconds. While this cookie/variable is set all queries go to the primary
An extension to this would be to dynamically pull the secondaries from the primary (configured in database.yml). This means we don't need to update Rails configurations (as long as database.yml uses a floating IP for the primary), instead we just restart Unicorn/Sidekiq and everything will "just work"
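A sketch of how the boot-time primary detection could work (the helper below is hypothetical); PostgreSQL exposes whether a host is a standby via pg_is_in_recovery():

require 'pg'

# Hypothetical helper: given the configured host names, return the one
# that is currently the primary. A standby returns true for
# pg_is_in_recovery(), the primary returns false.
def find_primary(hosts, dbname:, user:)
  hosts.find do |host|
    conn = PG.connect(host: host, dbname: dbname, user: user)
    begin
      conn.exec('SELECT pg_is_in_recovery()').getvalue(0, 0) == 'f'
    ensure
      conn.close
    end
  end
end

# Example usage with made-up host names:
# find_primary(%w[db1.example.com db2.example.com db3.example.com],
#              dbname: 'gitlabhq_production', user: 'gitlab')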
Shipping wise this means we only need to ship pgbouncer, and maybe repmgr in the future. No haproxy is necessary.
One addition: the host defined in config/database.yml will be treated as the primary and is expected to always point to the primary, even after a failover. For GitLab.com we already use an Azure LB so this is taken care of. For the additional hosts configured it's possible one of them may also point to a primary (which one can change after a failover), but this is OK as all will be candidates for read queries.
@yorickpeterse in our case, instead of using a separate pgbouncer for each secondary, we put all secondaries in one pgbouncer and put HAProxy on top of them to manage the load for each. With this configuration, if one secondary is behaving badly or worse than the others, the load on that particular host is decreased, and so on.
But if GitLab's Rails app can balance the load and round-robin the queries, I guess that'll work just fine.
I did some basic testing using pgbouncer (installed via apt-get to keep things simple) in our staging environment. pgbouncer was configured to run on the primary directly (for the sake of simplicity) and connect to localhost. A full list of the settings can be found at pgbouncer.ini. Note that these settings are pretty basic and perhaps not the best; I mostly focused on making it work, not on making it work well.
Using pgbouncer in transaction mode seems to work fine, and the primary only has around 45 active backends instead of ±100. Weirdly enough, commands such as SET shouldn't play nicely with transaction pooling, but it seems to work fine thus far. Perhaps in my tests I was just assigned the same connection. Either way, transaction pooling is more beneficial performance wise but may result in us being unable to run SET statement_timeout during migrations. I guess what we could do is have a separate pgbouncer running that uses session pooling and use this when running migrations. This however is quite a hack, and I'm not sure we can introduce it in a transparent way.
Session pooling would work around the above problem. However, session pooling means we need the same number of PostgreSQL backends as we have now. As such this setup would not allow us to drop PostgreSQL's max_connections setting.
Similar to pgpool, pgbouncer also has its own authentication mechanism. Supposedly you can make it use PostgreSQL directly, but I couldn't get this to work; instead I had to store the username/password MD5 in a text file. Regardless of what setup we end up using, pgbouncer will require some extra configuration for credentials.
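For reference, a minimal sketch of what that could look like (paths and values are placeholders):

; pgbouncer.ini fragment
[databases]
gitlabhq_production = host=127.0.0.1 port=5432 dbname=gitlabhq_production

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
pool_mode = transaction
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

; /etc/pgbouncer/userlist.txt contains one line per user:
; "gitlab" "md5<md5 of password concatenated with username>"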