A few times I've hit a case where an old master returning from being down still gets marked as master. This leads to having multiple masters and causes issues.
The good thing is that we automatically stop pgbouncer from running when we see more than one master, so that rules out any damage.
The bad thing is that this basically causes downtime.
We should look into whether we can mark an old master as unhealthy when it returns to rotation.
So currently, the script to check if a node is the master runs the query SELECT name FROM repmgr_gitlab_cluster.repl_nodes WHERE type='master' AND active != 'f' against the gitlab_repmgr db.
It does look like, when a failed master is brought back online, its active column does not get set to t. So while gitlab-ctl repmgr cluster show will show two masters, the check query will in fact only return one.
I'll triple check to confirm that the script works as we expect, then add a note to the documentation that this is the behavior users can expect to see.
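For reference, running that check by hand looks roughly like this (a sketch only, assuming gitlab-psql passes the -d and -c flags straight through to psql; the query and database name are the ones quoted above):

```shell
# Sketch: run the existing master check against the local gitlab_repmgr db.
# -t/-A strip headers and alignment so the output is just the node name(s).
gitlab-psql -d gitlab_repmgr -t -A \
  -c "SELECT name FROM repmgr_gitlab_cluster.repl_nodes WHERE type='master' AND active != 'f'"
```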
So my goal is that each node is capable of independently determining whether it's the master or not.
Currently we're just basing that off of the local contents of repmgr_gitlab_cluster.repl_nodes.
I think if we add a check on the output from gitlab-ctl repmgr cluster show, that should do it. If it shows only one master, then good. If it shows two masters, and we see active = 't' for both, then the node running the check can't consider itself the master.
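A sketch of what that extended check could look like. It reads repl_nodes directly (the same data gitlab-ctl repmgr cluster show reports) rather than parsing the command output, and treating hostname -f as the repmgr node name is an assumption for illustration:

```shell
#!/bin/bash
# Sketch only: a node reports itself healthy as master only when it is the
# single active master visible in repl_nodes.
THIS_NODE=$(hostname -f)

ACTIVE_MASTERS=$(gitlab-psql -d gitlab_repmgr -t -A -c \
  "SELECT name FROM repmgr_gitlab_cluster.repl_nodes WHERE type='master' AND active != 'f'")

MASTER_COUNT=$(printf '%s\n' "$ACTIVE_MASTERS" | grep -c .)

if [ "$MASTER_COUNT" -eq 1 ] && [ "$ACTIVE_MASTERS" = "$THIS_NODE" ]; then
  exit 0  # exactly one active master, and it is this node
fi
exit 1    # either not the master, or more than one active master is visible
```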
So in thinking about this, it definitely isn't an easy problem. I still think event notification is the way to go, but it does leave us with a similar problem for failover: who is responsible for removing the failed node, and how do we ensure it happens?
I think adding another watcher to the db nodes' consul agents is the way to go. When the repmgr event occurs, we write an entry in the consul k/v store. The watcher on the remaining db nodes will see the new value, try for a lock, delete the node from the cluster (and the consul service), and then delete the value when successful. If it fails, it releases the lock but doesn't delete the value. All operations should be idempotent.
I think having all nodes try to write to the k/v is the one weak point here: it's tough to guarantee that at least one will succeed. But since they write to their local consul agent, I think it is an acceptable risk.
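A rough sketch of the event side, assuming repmgr's event_notification_command is used to call a small handler script. The handler path, the event list, and the key layout are illustrative, not the actual implementation:

```shell
#!/bin/bash
# Sketch of a repmgr event handler that turns a promotion event into a consul
# k/v entry for the failed master. Wiring it up in repmgr.conf on the db nodes
# would look something like:
#   event_notification_command = '/opt/gitlab/bin/repmgr-event-handler %n %e'
#   event_notifications = 'repmgrd_failover_promote,standby_promote'
EVENT_TYPE=$2

# Find masters that repmgr no longer considers active; those are the failed
# node(s) that need cleaning up.
FAILED_MASTERS=$(gitlab-psql -d gitlab_repmgr -t -A -c \
  "SELECT name FROM repmgr_gitlab_cluster.repl_nodes WHERE type='master' AND active = 'f'")

for NODE_NAME in $FAILED_MASTERS; do
  # Writing a key that already exists still succeeds, so every remaining node
  # can safely run this; loop briefly in case the local agent is unavailable.
  for attempt in 1 2 3; do
    consul kv put "gitlab/postgresql/${NODE_NAME}" "" && break
    sleep 5
  done
done
```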
Just remembered I mentioned this on the call and planned to drop this note in the issue but forgot: we currently cannot use pg_rewind in production because we don't have checksums enabled and don't have wal_log_hints enabled. To my surprise, checksums are actually disabled even on new installs (at least looking at the GDK database).
I would suggest we enable checksums on new databases. This will increase the volume of WAL generated, but to me it seems supporting many client installs on hardware we don't control is exactly the kind of situation where checksums are most valuable. The busiest instances, where the extra WAL volume would hurt the most, are also exactly the instances where the extra cost would be most easily justified.
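For the PostgreSQL versions we ship, checksums can only be chosen when the data directory is initialised, so this really is a decision for new installs. Roughly (the data directory shown is the omnibus default, for illustration only):

```shell
# Sketch: data checksums are selected at initdb time and cannot be switched on
# later for an existing data directory on these PostgreSQL versions.
initdb --data-checksums -D /var/opt/gitlab/postgresql/data
```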
And I would suggest we enable wal_log_hints at least on databases where HA is set up. The benefit of being able to use pg_rewind if we wish to -- which I believe repmgr is capable of doing automatically -- seems compelling even if we don't make it the normal path.
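As a sketch of what enabling it involves on an existing database (omnibus would presumably wire this through its own configuration; the raw PostgreSQL mechanism is shown for illustration, and a restart is required):

```shell
# Sketch: wal_log_hints is a server setting that only takes effect after a restart.
gitlab-psql -d postgres -c "ALTER SYSTEM SET wal_log_hints = on;"
gitlab-ctl restart postgresql
# pg_rewind requires either wal_log_hints = on or data checksums on the cluster.
```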
As an aside, I think Stephen Frost was suggesting relaxing our WAL settings to allow a longer time between checkpoints. If we did the two at the same time, that would probably more than make up the difference, leaving users with less WAL overall (though not as much savings as they would have had without wal_log_hints enabled).
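For concreteness, "relaxing" would touch settings along these lines (values are purely illustrative, not recommendations):

```shell
# Sketch: spacing checkpoints further apart means fewer full-page writes in
# WAL, which offsets some of the extra WAL from wal_log_hints. Both settings
# can be applied with a reload.
gitlab-psql -d postgres -c "ALTER SYSTEM SET checkpoint_timeout = '30min';"
gitlab-psql -d postgres -c "ALTER SYSTEM SET max_wal_size = '4GB';"
gitlab-psql -d postgres -c "SELECT pg_reload_conf();"
```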
@ibaum is it possible for the failed master to detect this itself? The failed master boots up, checks whether active is t next to its name, and if it is f, demotes itself? This way we limit the responsibility to the node that is in an inconsistent state and leave the rest of the cluster out of this decision.
@marin The data a failed master has when it comes back online is from when it was the active master, so its own copy of repl_nodes will still show it as the active master. Unfortunately the other nodes do need to be involved.
Played around a bit with the consul kv store today.
On a repmgr promotion event, the remaining nodes should put a key like gitlab/postgresql/$failed_node_name into the k/v store with no value. Consider looping until it succeeds. For a basic put, writing data which is already there is considered a success, so there is no harm in having all nodes try.
We can then use a keyprefix watch on gitlab/postgresql to pick those keys up.
Since there is no builtin repmgr command to remove a master (see !1980 (merged)), the actual DELETE query needs to happen on the new master node. I need to do some testing on this. When a watch picks up an event, the new master should be ready, and we can execute the removal there. But there should be some mechanism in place to retry on failure; it should not wait for a new event to occur.
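To make that concrete, a sketch of the watch plus handler, with a few assumptions: the handler path, lock prefix, and retry counts are made up, jq is assumed to be available for parsing the watch payload, and deregistering the failed node's consul service is left out:

```shell
#!/bin/bash
# /opt/gitlab/bin/failed-master-cleanup (sketch). Registered as a consul
# keyprefix watch handler on the db nodes, e.g. in the agent config:
#   { "type": "keyprefix", "prefix": "gitlab/postgresql/",
#     "handler": "/opt/gitlab/bin/failed-master-cleanup" }
# The watch passes the matching k/v entries as JSON on stdin.
set -u

KEYS=$(jq -r '.[].Key' 2>/dev/null)
[ -z "$KEYS" ] && exit 0

# Only the current master runs the DELETE, since there is no built-in repmgr
# command to remove a master; the other nodes simply exit.
AM_MASTER=$(gitlab-psql -d gitlab_repmgr -t -A -c "SELECT NOT pg_is_in_recovery()")
[ "$AM_MASTER" = "t" ] || exit 0

for KEY in $KEYS; do
  NODE_NAME=${KEY#gitlab/postgresql/}

  for attempt in 1 2 3; do
    # Run the removal under a consul lock so only one process does it at a time.
    consul lock "gitlab/postgresql-cleanup/${NODE_NAME}" \
      gitlab-psql -d gitlab_repmgr -c \
      "DELETE FROM repmgr_gitlab_cluster.repl_nodes WHERE name = '${NODE_NAME}'"

    # Verify the row is gone before deleting the k/v entry; otherwise retry
    # here instead of waiting for another watch event.
    REMAINING=$(gitlab-psql -d gitlab_repmgr -t -A -c \
      "SELECT count(*) FROM repmgr_gitlab_cluster.repl_nodes WHERE name = '${NODE_NAME}'")
    if [ "$REMAINING" = "0" ]; then
      consul kv delete "$KEY"
      break
    fi
    sleep 5
  done
done
```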