Adding consul to EE build
added HA postgresql labels
Currently, setting consul['enable'] = true will enable a consul agent on the node. Adding consul['configuration']['server'] = true will make consul run as a server. It will be standalone unless joined to a cluster.

No security has been done yet at all. I don't think SSL out of the box is feasible, but we should be able to take advantage of the built-in gossip encryption. We can set up attributes to allow SSL as well. ACLs also need to be investigated.
mentioned in issue gitlab-com/infrastructure#2352 (closed)
Added a first pass at adding service definitions and checks. Currently it does not work 100% in my dev environment.
With this set of changes, setting consul['configuration']['server'] = true ends up with an error: undefined method '[]=' for nil:NilClass. I'll need to try recreating my dev env tomorrow to see if there's something awry there.

But the goal is that consul['service_config'] contains a prebuilt list of services that we will handle, and consul['services'] will be an array of the services to enable.

Thanks to @twk3, we found the issue was that nested Mashes don't work out of the box. He put together !1824 (merged), which could simplify this. In the meantime I updated to use two separate hashes that get merged, so users will only need to specify a consul['configuration'] hash that contains the options they want to change.

With that, setting the following
consul['enable'] = true
consul['configuration'] = { 'server' => true }
results in 3 nodes running the consul server. They were able to form a cluster by running
consul join host_one
on host_two and host_three.

This still needs some more testing, and I need to spin up a new cluster using all of the parts working together. But this is real close to what I think we need. Please take a look if you've got a moment.
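To be explicit about the merge, the idea is roughly the following (a sketch of the approach, not the exact cookbook code; whether the merge is shallow or deep is still open, and the variable names are illustrative):

# Defaults we ship, merged with whatever the user puts in
# consul['configuration'] in /etc/gitlab/gitlab.rb.
default_configuration = {
  'datacenter' => 'gitlab_consul',
  'server' => false
}

user_configuration = node['consul']['configuration'].to_hash

# User-provided keys win over the defaults.
final_configuration = default_configuration.merge(user_configuration)
# final_configuration is what gets rendered into consul's config file.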
added 43 commits
- b4b0d435...80d738a3 - 42 commits from branch master
- 0d78a153 - Merge branch 'master' into add-consul-to-ee
Merged master as package builds were breaking due to signing key issues.
New package build running in https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/58012
This is not functional yet, but it's a start.
With the consul watcher, every time the postgresql service changes state, it will send a blob like the following to stdin of the handler
{"Node"=> {"ID"=>"8c286a79-3b76-c745-3900-a998fa9106d4", "Node"=>"db_two", "Address"=>"172.21.0.3", "Datacenter"=>"gitlab_consul", "TaggedAddresses"=>{"lan"=>"172.21.0.3", "wan"=>"172.21.0.3"}, "Meta"=>{}, "CreateIndex"=>16, "ModifyIndex"=>20}, "Service"=> {"ID"=>"postgresql", "Service"=>"postgresql", "Tags"=>[], "Port"=>5432, "Address"=>"", "EnableTagOverride"=>false, "CreateIndex"=>20, "ModifyIndex"=>20}, "Checks"=> [{"Node"=>"db_two", "CheckID"=>"serfHealth", "Name"=>"Serf Health Status", "Status"=>"passing", "Notes"=>"", "Output"=>"Agent alive and reachable", "ServiceID"=>"", "ServiceName"=>"", "ServiceTags"=>[]}, {"Node"=>"db_two", "CheckID"=>"service:postgresql", "Name"=>"Service 'postgresql' check", "Status"=>"critical", "Notes"=>"", "Output"=>"chpst: fatal: unable to setgroups: permission denied\n", "ServiceID"=>"postgresql", "ServiceName"=>"postgresql", "ServiceTags"=>[]}
We'll need to update the handler script to grab the node name (which should be the fqdn of the master node) that has a "passing" Status for the check with CheckID service:postgresql.
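A minimal sketch of that handler logic, assuming the watch payload on stdin is a JSON array whose entries look like the blob above (names here are illustrative, not the final script):

#!/opt/gitlab/embedded/bin/ruby
# Sketch: read the watch payload from stdin and print the node whose
# 'service:postgresql' check is passing.
require 'json'

entries = JSON.parse($stdin.read)

master_entry = entries.find do |entry|
  entry['Checks'].any? do |check|
    check['CheckID'] == 'service:postgresql' && check['Status'] == 'passing'
  end
end

puts master_entry['Node']['Node'] if master_entry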
Other TODO:
- We set up the repmgr service with the name repmgrd, so the new omnibus_helper.rb is looking for node['repmgrd']['enable'], which doesn't exist. I work around this in dev by setting node.default['repmgrd']['enable'] = true in repmgr::enable_daemon.
- There is an ordering issue where the consul::service_postgresql recipe fails if it is run before the repmgr::enable recipe. I should use a dependency there.
- The watcher recipe should not be postgresql specific. It should be easy to make it more generic, as service watchers just need to know the service name and the handler path (see the sketch below).
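For the generic watcher, the watch definition itself is small; something along these lines could presumably be fed through the configuration hash (a sketch only; the handler path is illustrative, not an agreed location):

consul['configuration'] = {
  watches: [
    {
      type: 'service',
      service: 'postgresql',
      # Illustrative handler path; the real script name/location is still TBD
      handler: '/var/opt/gitlab/consul/scripts/failover_pgbouncer'
    }
  ]
}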
Edited by Ian Baum
- Resolved by Ian Baum
We set up the repmgr service with the name repmgrd, so the new omnibus_helper.rb is looking for node['repmgrd']['enable'], which doesn't exist.

So repmgr is basically two parts:
- A repmgr binary to manage adding and removing nodes from a PostgreSQL cluster
- A daemon (repmgrd) to automatically promote a new master in case of failure

It's entirely possible to use the repmgr binary and not use repmgrd, so I purposely did not name the runit service repmgr. That's causing some issues with the new service definition setup.

So I think the right way going forward is to:
- rename the recipe repmgr::enable_daemon to repmgr::repmgrd
- rename the recipe repmgr::disable_daemon to repmgr::repmgrd_disable
- add a top level repmgrd Mash (sketch below)
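A top-level Mash presumably just means a new default attribute block, something along these lines (a sketch; only the enable flag is implied by the discussion above):

# In the cookbook attributes (sketch): a top level repmgrd Mash, so
# omnibus_helper.rb can find node['repmgrd']['enable'].
default['repmgrd']['enable'] = false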
I'm also open to other options.
Thoughts? @gitlab-build-team
This is mostly ready for testing. I just need to finish up the failover_pgbouncer script and I think everything will be ready.

To spin up a test consul cluster, disable everything else, then set
consul['enable'] = true
consul['configuration'] = { server: true, bootstrap_expect: 1 }
This will spin up a cluster that only expects one instance, so it will immediately act as a server. Normally it will wait for at least 3 nodes. For production, we will not run with bootstrap_expect: 1.
On the database and pgbouncer nodes, add the following attributes to the previous ones:
consul['enable'] = true
Once the nodes are online, run
/opt/gitlab/embedded/bin/consul join SERVER_NODE
to have them join as agents of the consul cluster.

If the failover_pgbouncer script were working, that would do it. To test without it, we could switch to using the consul DNS interface for service discovery.

There is the caveat that there is no security on this; it all assumes the network is secure. We may not want to do that moving forward. We can use a shared key for encrypting traffic and preventing nodes without the key from joining. There shouldn't be any code required on our part to allow this; it will just be a matter of documenting how to do it.
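For reference, the built-in gossip encryption is just a shared key in the agent configuration, so the documentation would presumably come down to something like this (a sketch; the value is a placeholder for a key generated with consul keygen):

consul['configuration'] = {
  # Same key on every node; nodes without it cannot join the gossip pool
  encrypt: 'KEY_GENERATED_WITH_consul_keygen'
}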
For the latest version of the package: https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/59703
I'm able to consistently spin up a consul server node using the following attributes:
postgresql['enable'] = false
nginx['enable'] = false
gitlab_rails['auto_migrate'] = false
prometheus_monitoring['enable'] = false
gitaly['enable'] = false
redis['enable'] = false
unicorn['enable'] = false
gitlab_workhorse['enable'] = false
sidekiq['enable'] = false
consul['enable'] = true
consul['configuration'] = { server: true, bootstrap_expect: 1 }
One thing worth noting is that consul does phone home on startup. If we don't disable this by default, documentation needs to mention that it does this, and how to disable it.
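If we end up documenting it rather than disabling it by default, the relevant consul option is disable_update_check, so the gitlab.rb side would look roughly like this (a sketch):

consul['configuration'] = {
  # Stops the agent from phoning home to check for newer versions on startup
  disable_update_check: true
}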
On a related note, 0.9.2 is out.
There were some hiccups with the database node that should be fixed. With the current status, you should now be able to reliably spin up a new database node running postgresql, repmgrd, and consul with
postgresql['enable'] = true
bootstrap['enable'] = false
nginx['enable'] = false
unicorn['enable'] = false
sidekiq['enable'] = false
redis['enable'] = false
prometheus_monitoring['enable'] = false
gitaly['enable'] = false
gitlab_workhorse['enable'] = false
mailroom['enable'] = false
postgresql['md5_auth_cidr_addresses'] = ['0.0.0.0/0']
postgresql['listen_address'] = '0.0.0.0'
postgresql['sql_user_password'] = 'ea91b906a0940e8a1f5444909afe88ba' # This is the hash generated in the previous step
postgresql['trust_auth_cidr_addresses'] = %w(127.0.0.0/24)
postgresql['hot_standby'] = 'on'
postgresql['wal_level'] = 'replica'
postgresql['max_wal_senders'] = 4
postgresql['shared_preload_libraries'] = 'repmgr_funcs'
gitlab_rails['auto_migrate'] = false
repmgr['enable'] = true
repmgr['trust_auth_cidr_addresses'] = %w(172.21.0.0/28)
consul['enable'] = true
consul['services'] = %w(postgresql)
Latest package build is at https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/60174
Testing is going well. I believe the final step is to give consul permission to reload pgbouncer. For this it needs to update the databases.ini file and reload the daemon.

I think the easy way to reload is to add the gitlab-consul user to the pgbouncer admin_users list, and give it an entry in pg_auth. This requires a password, but we can use a .pgpass file for that. From there, we'd need to update the gitlab-ctl pgb-notify script to connect to the pgbouncer console and reload, instead of sending a HUP like it does now.

We can consider skipping updating the databases.ini file, and just updating the in-memory backend. Otherwise, we would need to give the gitlab-consul user permission to write to the file.

This is all working except for one detail: overriding the attribute for node['gitlab']['pgbouncer']['database_ini'] isn't working. I'll need to fix that tomorrow. If I manually set it to what I expect, then consul informs pgbouncer of the new master as expected.
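For reference, the reload via the pgbouncer admin console is just a RELOAD issued against the pgbouncer pseudo-database, so pgb-notify would end up doing something along these lines (a sketch only; the host/port, and the gitlab-consul admin_users plus .pgpass setup, are the assumptions described above):

# Sketch: connect to the pgbouncer admin console as gitlab-consul and reload.
# The password comes from the user's .pgpass file.
system('/opt/gitlab/embedded/bin/psql',
       '-h', '127.0.0.1',
       '-p', '6432',
       '-U', 'gitlab-consul',
       '-d', 'pgbouncer',
       '-c', 'RELOAD;')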
I won't remove WIP as it's not 100% yet. But this is really close.

Latest package is here: https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/60884
https://gitlab.com/ibaum/ha/blob/master/docker-compose.yml is the docker-compose file I used. The image is using our docker build pointing to a dev package.
changed milestone to %9.5
assigned to @marin
@jtevnan Ping
mentioned in issue gitlab-com/infrastructure#2422 (closed)
- Resolved by Ian Baum
- Resolved by Ian Baum
- Resolved by Ian Baum
- Resolved by Ian Baum
@ibaum This MR also needs to add lines to the documentation for the default users we create, and also add them to the services list with expected port numbers.
We do, check the original issue: https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2571#note_35868348
mentioned in issue gitlab-com/infrastructure#2524 (closed)
mentioned in merge request !1858 (merged)
Latest package: https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/60950
- Resolved by Ian Baum
- Resolved by Marin Jankovski
@ibaum Two more things:
- I've hit an edge case where hostname -f didn't exist on one node, so I got an error:
 24: node.default['gitlab']['postgresql']['custom_pg_hba_entries']['repmgr'] = repmgr_helper.pg_hba_entries
 25:
 26: # node number needs to be unique (to the cluster) positive 32 bit integer.
 27: # If the user doesn't provide one, generate one ourselves.
 28: node_number = node['repmgr']['node_number'] ||
 29>> Digest::MD5.hexdigest(node['fqdn']).unpack('L').first
 30: template repmgr_conf do
 31:   source 'repmgr.conf.erb'
 32:   owner account_helper.postgresql_user
- Like with PG before, we have a non-idempotent action for consul, repmgrd, and pgbouncer:
* file[/opt/gitlab/sv/consul/supervise/ok] action touch
  - create new file /opt/gitlab/sv/consul/supervise/ok
  - change owner from '' to 'gitlab-consul'
  - change group from '' to 'gitlab-consul'
  - update utime on file /opt/gitlab/sv/consul/supervise/ok
* file[/opt/gitlab/sv/consul/log/supervise/ok] action touch
  - create new file /opt/gitlab/sv/consul/log/supervise/ok
  - change owner from '' to 'gitlab-consul'
  - change group from '' to 'gitlab-consul'
  - update utime on file /opt/gitlab/sv/consul/log/supervise/ok
* file[/opt/gitlab/sv/consul/supervise/control] action touch
  - create new file /opt/gitlab/sv/consul/supervise/control
  - change owner from '' to 'gitlab-consul'
  - change group from '' to 'gitlab-consul'
  - update utime on file /opt/gitlab/sv/consul/supervise/control
* file[/opt/gitlab/sv/consul/log/supervise/control] action touch
  - create new file /opt/gitlab/sv/consul/log/supervise/control
  - change owner from '' to 'gitlab-consul'
  - change group from '' to 'gitlab-consul'
  - update utime on file /opt/gitlab/sv/consul/log/supervise/control
With today's changes, to get the cluster up and running without working DNS:
- For each db node, set repmgr['host'] to the IP address of the host
- For each node's consul configuration, set:
consul['configuration'] = {
  retry_join: 'IP of consul servers to join',
  node_name: "Something unique that won't be resolved in DNS"
}
Edited by Ian Baum
When using https://gitlab.com/ibaum/ha/blob/master/docker-compose.yml, after bringing all of the nodes online, you should only need to:
- Run gitlab-ctl repmgr standby setup MASTER_NODE -w on each of the standby database nodes
- Run gitlab-ctl write-pgpass --hostuser gitlab-consul --user gitlab-consul --port 6432 --database pgbouncer on the pgbouncer node
Edited by Ian Baum
added 66 commits
- f2384d72...d6ddb028 - 65 commits from branch master
- 376fa64f - Merge branch 'master' into add-consul-to-ee
https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/61054 is the latest package build with master merged
One more fix: the check_postgresql.erb function has been brought into gitlab-ctl, but the template resource wasn't removed. That has been fixed. New packages are in https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/61062

Also, an earlier commit restructured the pgbouncer['users'] attribute into a hash. So now it should look like:
pgbouncer['users'] = {
  USERNAME: {
    password: PASSWORDHASH
  }
}
I've hit another edge case that we'll need to consider; it can go into a separate issue/MR:

If you don't have /etc/gitlab/skip-auto-migrations on your app node, the upgrade will fail because a password is expected:

Dumping PostgreSQL database gitlabhq_production ... Password:
pg_dump: [archiver (db)] connection to database "gitlabhq_production" failed: fe_sendauth: no password supplied
So for the first issue, do you still have the error handy? Was ohai just not setting that attribute?
TypeError
---------
no implicit conversion of nil into String

Cookbook Trace:
---------------
/opt/gitlab/embedded/cookbooks/cache/cookbooks/repmgr/recipes/enable.rb:29:in `digest'
/opt/gitlab/embedded/cookbooks/cache/cookbooks/repmgr/recipes/enable.rb:29:in `hexdigest'
/opt/gitlab/embedded/cookbooks/cache/cookbooks/repmgr/recipes/enable.rb:29:in `from_file'
/opt/gitlab/embedded/cookbooks/cache/cookbooks/gitlab-ee/recipes/default.rb:41:in `block in from_file'
/opt/gitlab/embedded/cookbooks/cache/cookbooks/gitlab-ee/recipes/default.rb:39:in `each'
/opt/gitlab/embedded/cookbooks/cache/cookbooks/gitlab-ee/recipes/default.rb:39:in `from_file'

Relevant File Content:
After some more testing with @marin we have the following issues:
Major
- Consul currently isn't marking a service as dead if the agent goes away. I believe this is just a configuration option in the check definition.
- The postgresql check script is not correctly detecting a situation with multiple masters
- The repmgr::enable recipe doesn't handle the situation where node['fqdn'] is nil. We use this for the node_number in repmgr. After some discussion, we're going to instead concatenate node['ipaddress'], node['macaddress'], and node['ip6address'] together.
I'll focus on fixing these today
Minor (or at least not as major)
- Database backup is failing on package upgrade => Issue https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2694
- repmgr log rotation isn't working. It doesn't look like repmgrd picks up the rotated log file; it's possibly still pointing to the old file handle. => Issue https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2695
- repmgrd was not working on an old master after we converted it to a standby. => Issue https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2696
Edited by Marin Jankovski

If we fix the first three, we will get this merged into the next RC.
The reason for this is that this is new functionality which should not touch existing behaviour, and we are fairly confident that things are operational in the majority of cases.
/cc @stanhu
Edited by Marin Jankovski

The first three issues have been fixed, and a package is being built in: https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/61139
So as it turns out, the Ubuntu 16.04 image that we base our docker image on is lacking the necessary packages that ohai uses to gather network information.
So ohai in our docker image gets nil for the ipaddress, macaddress, and ip6address attributes.

I'm updating to base off of fqdn if available, otherwise use the network info. We could also raise a fatal error if none of those are available, as this is going to be one of those really frustrating situations if it breaks.
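In other words, the node_number derivation in repmgr::enable becomes something like this (a sketch of the approach, mirroring the existing Digest::MD5 line quoted earlier):

require 'digest'

# Prefer the fqdn; otherwise fall back to whatever network identifiers
# ohai did manage to collect.
node_identifier = node['fqdn'] ||
  "#{node['ipaddress']}#{node['macaddress']}#{node['ip6address']}"

# node number needs to be a unique (to the cluster) positive 32 bit integer.
node_number = node['repmgr']['node_number'] ||
  Digest::MD5.hexdigest(node_identifier).unpack('L').first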
/cc @marin @WarheadsSE
The fix for checking for multiple masters didn't go according to plan either.
I've switched that, and now the failover script will take the agent health (the serfHealth check) into account as well when locating masters.

It doesn't look like they consider this situation a bug.
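Concretely, that just means a master candidate has to have both checks passing, roughly like this (an illustrative sketch, not the exact failover script code; entries is the parsed watch payload as in the earlier handler sketch):

# Treat a node as a master candidate only if both its postgresql service
# check and its serfHealth agent check are passing.
def passing?(entry, check_id)
  entry['Checks'].any? { |c| c['CheckID'] == check_id && c['Status'] == 'passing' }
end

masters = entries.select do |entry|
  passing?(entry, 'service:postgresql') && passing?(entry, 'serfHealth')
end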
Latest package build is now here https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/61140
I left out some options to the updated pgb-notify. New package is at https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/61141
I believe I've addressed the 3 issues we saw today.
mentioned in issue #2694
mentioned in issue #2695 (closed)
mentioned in issue #2696 (closed)