
Adding consul to EE build

Merged Ian Baum requested to merge add-consul-to-ee into master

Install and configure consul. Initially for use in the PG HA configuration

Related to #2571

Edited by Ian Baum


Activity

  • Ian Baum changed the description

  • added HA postgresql labels

  • Author Developer

    It looks like 0.9.1 is due for imminent release, so I didn't want to pass the MR along with 0.9.0. I'll develop alongside that, and evaluate what is available when I'm ready for this to not be WIP anymore.

  • Author Developer

    Currently, setting consul['enable'] = true will enable a consul agent on the node. Adding consul['configuration']['server'] = true will make consul run as a server. It will be standalone unless joined to a cluster.

    No security work has been done yet at all. I don't think SSL out of the box is feasible, but we should be able to take advantage of the built-in gossip encryption. We can set up attributes to allow SSL as well. ACLs also need to be investigated.

  • Author Developer

    Added a first pass at adding service definitions and checks. Currently it does not work 100% in my dev environment.

    With this set of changes, setting consul['configuration']['server'] = true ends up with an error: undefined method '[]=' for nil:NilClass. I'll need to try recreating my dev env tomorrow to see if there's something awry there.

    But the goal is that consul['service_config'] contains a prebuilt list of services that we will handle and consul['services'] will be an array of the services to enable.
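
    To illustrate the intended split, a minimal sketch (the exact shape of consul['service_config'] here is an assumption on my part, not the final structure):

    consul['service_config'] = {
      'postgresql' => {
        'service' => {
          'name' => 'postgresql',
          'port' => 5432
        }
      }
    }
    consul['services'] = %w(postgresql)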

  • Author Developer

    Thanks to @twk3, I found that the issue was that nested Mashes don't work out of the box. He put together !1824 (merged), which could simplify this. In the meantime, I updated the code to use two separate hashes that get merged, so users only need to specify a consul['configuration'] hash containing the options they want to change.

    With that, setting the following

    consul['enable'] = true
    consul['configuration'] = {
      'server' => true
    }

    This results in 3 nodes running the consul server. They were able to form a cluster by running consul join host_one on host_two and host_three.

  • Author Developer

    The plan for the next step is to get the agents on other nodes working and talking to the cluster, and to get the postgresql service definition and check working.

  • Author Developer

    @marin

    This still needs some more testing, and I need to spin up a new cluster using all of the parts working together. But this is really close to what I think we need. Please take a look if you've got a moment.

  • Ian Baum added 43 commits

    • b4b0d435...80d738a3 - 42 commits from branch master
    • 0d78a153 - Merge branch 'master' into add-consul-to-ee

  • Author Developer

    Merged master as package builds were breaking due to signing key issues.

    New package build running in https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/58012

  • Author Developer

    This is not functional yet, but it's a start.

    With the consul watcher, every time the postgresql service changes state, it will send a blob like the following to the handler's stdin:

    {"Node"=>
       {"ID"=>"8c286a79-3b76-c745-3900-a998fa9106d4",
        "Node"=>"db_two",
        "Address"=>"172.21.0.3",
        "Datacenter"=>"gitlab_consul",
        "TaggedAddresses"=>{"lan"=>"172.21.0.3", "wan"=>"172.21.0.3"},
        "Meta"=>{},
        "CreateIndex"=>16,
        "ModifyIndex"=>20},
      "Service"=>
       {"ID"=>"postgresql", "Service"=>"postgresql", "Tags"=>[], "Port"=>5432, "Address"=>"", "EnableTagOverride"=>false, "CreateIndex"=>20, "ModifyIndex"=>20},
      "Checks"=>
       [{"Node"=>"db_two",
         "CheckID"=>"serfHealth",
         "Name"=>"Serf Health Status",
         "Status"=>"passing",
         "Notes"=>"",
         "Output"=>"Agent alive and reachable",
         "ServiceID"=>"",
         "ServiceName"=>"",
         "ServiceTags"=>[]},
        {"Node"=>"db_two",
         "CheckID"=>"service:postgresql",
         "Name"=>"Service 'postgresql' check",
         "Status"=>"critical",
         "Notes"=>"",
         "Output"=>"chpst: fatal: unable to setgroups: permission denied\n",
         "ServiceID"=>"postgresql",
         "ServiceName"=>"postgresql",
         "ServiceTags"=>[]}

    We'll need to update the handler script to grab the node name (which should be the fqdn of the master node) that has a "passing" Status for the Check with CheckID service:postgresql.
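
    A minimal sketch of that handler logic, assuming the watch payload arrives as the usual JSON array of service entries on stdin (names and output format here are illustrative, not the final script):

    require 'json'

    # Each entry has 'Node', 'Service', and 'Checks' keys as in the blob above.
    entries = JSON.parse($stdin.read)
    master = entries.find do |entry|
      entry['Checks'].any? do |check|
        check['CheckID'] == 'service:postgresql' && check['Status'] == 'passing'
      end
    end
    puts master['Node']['Node'] if master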

  • Author Developer

    Other TODO:

    • We set up the repmgr service with the name repmgrd, so the new omnibus_helper.rb is looking for node['repmgrd']['enable'], which doesn't exist. I worked around this in dev by setting node.default['repmgrd']['enable'] = true in repmgr::enable_daemon
    • There is an ordering issue where the consul::service_postgresql recipe fails if it is run before the repmgr::enable recipe. I should use a dependency there.
    • The watcher recipe should not be postgresql specific. It should be easy to make it more generic, as service watchers just need to know the service name and the handler path (see the sketch below).
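
    For reference, a generic watch definition really only varies in two places, e.g. (a sketch; the handler path is hypothetical):

    {
      'watches' => [
        {
          'type' => 'service',
          'service' => 'postgresql',                                        # service name
          'handler' => '/var/opt/gitlab/consul/scripts/failover_pgbouncer'  # handler path
        }
      ]
    }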
    Edited by Ian Baum
  • Ian Baum
  • Author Developer

    We set up the repmgr service with the name repmgrd, so the new omnibus_helper.rb is looking for node['repmgrd']['enable'], which doesn't exist.

    So repmgr is basically two parts.

    1. A repmgr binary to manage adding and removing nodes from a PostgreSQL cluster
    2. A daemon (repmgrd) to automatically promote a new master in case of failure

    It's entirely possible to use the repmgr binary without using repmgrd, so I purposely did not name the runit service repmgr. That's causing some issues with the new service definition setup.

    So I think the right way going forward is to

    • rename the recipe repmgr::enable_daemon to repmgr::repmgrd
    • rename the recipe repmgr::disable_daemon to repmgr::repmgrd_disable
    • add a top level repmgrd Mash
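
    For the last point, the idea is roughly the following (a sketch in plain Chef attributes form; the exact place omnibus-gitlab registers new top-level keys may differ):

    # Top-level repmgrd attributes so omnibus_helper.rb can check
    # node['repmgrd']['enable'] (the default value here is an assumption).
    default['repmgrd']['enable'] = false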

    I'm also open to other options.

    Thoughts? @gitlab-build-team

  • Author Developer

    This is mostly ready for testing. I just need to finish up the failover_pgbouncer script and I think everything will be ready.

    To spin up a test consul cluster, disable everything else, then set

    consul['enable'] = true
    consul['configuration'] = { 
      server: true,
      bootstrap_expect: 1
    }

    This will spin up a cluster that only expects one instance, so it will immediately act as a server. Normally it will wait for at least 3 nodes. For production, we will not run with bootstrap_expect: 1.
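
    For comparison, a production-style server configuration would look more like the following (a sketch; the addresses are placeholders):

    consul['enable'] = true
    consul['configuration'] = {
      server: true,
      bootstrap_expect: 3,
      retry_join: %w(10.0.0.1 10.0.0.2 10.0.0.3) # placeholder consul server addresses
    }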

    On the database and pgbouncer nodes, add the following attributes to the previous ones:

    consul['enable'] = true

    Once the nodes are online, run /opt/gitlab/embedded/bin/consul join SERVER_NODE to have them join as agents of the consul cluster.

    If the failover_pgbouncer script were working, that would do it. To test without it, we could switch to using the consul DNS interface for service discovery.

    There is the caveat that there is no security on this; it all assumes the network is secure. We may not want to rely on that moving forward. We can use a shared key for encrypting traffic and preventing nodes without the key from joining. There shouldn't be any code required on our part to allow this; it will just be a matter of documenting how to do it.
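
    For reference, the shared gossip key would just be another entry in the configuration hash (a sketch; the key below is a placeholder generated with consul keygen):

    # Shared gossip encryption key; nodes without it cannot join the cluster.
    # Generate one with: /opt/gitlab/embedded/bin/consul keygen
    consul['configuration'] = {
      encrypt: 'cg8StVXbQJ0gPvMd9o7yrg=='
    }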

  • Author Developer

    I'm able to consistently spin up a consul server node using the following attributes:

    postgresql['enable'] = false
    nginx['enable'] = false
    gitlab_rails['auto_migrate'] = false
    prometheus_monitoring['enable'] = false
    gitaly['enable'] = false
    redis['enable'] = false
    unicorn['enable'] = false
    gitlab_workhorse['enable'] = false
    sidekiq['enable'] = false
    consul['enable'] = true
    consul['configuration'] = {
         server: true,
         bootstrap_expect: 1
    }

    One thing worth noting is that consul does phone home on startup. If we don't disable this by default, the documentation needs to mention that it does this and how to disable it.
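
    If we do decide to turn that off by default, consul exposes a flag for it (a sketch; whether we set this in the cookbook or just document it is still open):

    # Disable the update check that phones home on startup
    consul['configuration'] = {
      disable_update_check: true
    }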

    On a related note, 0.9.2 is out.

    There were some hiccups with the database node that should be fixed. With the current status, you should now be able to reliably spin up a new database node running postgresql, repmgrd, and consul with:

    postgresql['enable'] = true
    bootstrap['enable'] = false
    nginx['enable'] = false
    unicorn['enable'] = false
    sidekiq['enable'] = false
    redis['enable'] = false
    prometheus_monitoring['enable'] = false
    gitaly['enable'] = false
    gitlab_workhorse['enable'] = false
    mailroom['enable'] = false
    postgresql['md5_auth_cidr_addresses'] = ['0.0.0.0/0']
    postgresql['listen_address'] = '0.0.0.0'
    postgresql['sql_user_password'] = 'ea91b906a0940e8a1f5444909afe88ba' # This is the hash generated in the previous step
    postgresql['trust_auth_cidr_addresses'] = %w(127.0.0.0/24)
    postgresql['hot_standby'] = 'on'
    postgresql['wal_level'] = 'replica'
    postgresql['max_wal_senders'] = 4
    postgresql['shared_preload_libraries'] = 'repmgr_funcs' 
    gitlab_rails['auto_migrate'] = false
    repmgr['enable'] = true
    repmgr['trust_auth_cidr_addresses'] = %w(172.21.0.0/28)
    consul['enable'] = true
    consul['services'] = %w(postgresql)
  • Author Developer

    Testing is going well. I believe the final step is to give consul permission to reload pgbouncer. For this, it needs to update the databases.ini file and reload the daemon.

    I think the easy way to reload is to add the gitlab-consul user to the pgbouncer admin_users list, and give it an entry in pg_auth. This requires a password, but we can use a .pgpass file for that. From there, we'd need to update the gitlab-ctl pgb-notify script to connect to the pgbouncer console and reload, instead of sending a HUP like it does now.
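
    A minimal sketch of what the updated pgb-notify could do, assuming the bundled pg gem and a .pgpass entry for gitlab-consul (the host and connection details below are illustrative):

    require 'pg'

    # Connect to the pgbouncer admin console as gitlab-consul (password is read
    # from the .pgpass file) and ask pgbouncer to reload its configuration,
    # instead of sending the process a HUP.
    conn = PG.connect(host: '127.0.0.1', port: 6432, dbname: 'pgbouncer', user: 'gitlab-consul')
    conn.exec('RELOAD')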

    We can consider skipping updating the databases.ini file and just updating the in-memory backend. Otherwise, we would need to give the gitlab-consul user permission to write to the file.

  • Author Developer

    This is all working except for one detail. Overriding the attribute for node['gitlab']['pgbouncer']['database_ini'] isn't working. I'll need to fix that tomorrow.

    If I manually set it to what I expect, then consul informs pgbouncer of the new master as expected.

    I won't remove WIP as it's not 100% yet. But this is really close.

  • Author Developer

    This is working: I was able to spin up a new instance and fail over automatically without error. There is still some polishing that could be done, but I think it is definitely time for a review.

  • Author Developer

    Latest package is here: https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/60884

    https://gitlab.com/ibaum/ha/blob/master/docker-compose.yml is the docker-compose file I used. The image is using our docker build pointing to a dev package.

  • Ian Baum unmarked as a Work In Progress

  • Ian Baum changed milestone to %9.5

  • assigned to @marin

  • Marin Jankovski
  • @ibaum This MR also needs to add documentation for the default users we create, and to add them to the services list with the expected port numbers.

  • Do we have any observations on the increase in resource requirements (CPU/disk/RAM) when this is active?

  • Ian Baum mentioned in merge request !1858 (merged)

  • Ian Baum resolved all discussions

  • Marin Jankovski
  • @ibaum Two more things:

    • I've hit an edge case where hostname -f didn't exist on one node, so I got an error:
     24:  node.default['gitlab']['postgresql']['custom_pg_hba_entries']['repmgr'] = repmgr_helper.pg_hba_entries
       25:
       26:  # node number needs to be unique (to the cluster) positive 32 bit integer.
       27:  # If the user doesn't provide one, generate one ourselves.
       28:  node_number = node['repmgr']['node_number'] ||
       29>>   Digest::MD5.hexdigest(node['fqdn']).unpack('L').first
       30:  template repmgr_conf do
       31:    source 'repmgr.conf.erb'
       32:    owner account_helper.postgresql_user
    • Like with PG before, we have a non-idempotent action for consul, repmgrd and pgbouncer:
    * file[/opt/gitlab/sv/consul/supervise/ok] action touch
        - create new file /opt/gitlab/sv/consul/supervise/ok
        - change owner from '' to 'gitlab-consul'
        - change group from '' to 'gitlab-consul'
        - update utime on file /opt/gitlab/sv/consul/supervise/ok
      * file[/opt/gitlab/sv/consul/log/supervise/ok] action touch
        - create new file /opt/gitlab/sv/consul/log/supervise/ok
        - change owner from '' to 'gitlab-consul'
        - change group from '' to 'gitlab-consul'
        - update utime on file /opt/gitlab/sv/consul/log/supervise/ok
      * file[/opt/gitlab/sv/consul/supervise/control] action touch
        - create new file /opt/gitlab/sv/consul/supervise/control
        - change owner from '' to 'gitlab-consul'
        - change group from '' to 'gitlab-consul'
        - update utime on file /opt/gitlab/sv/consul/supervise/control
      * file[/opt/gitlab/sv/consul/log/supervise/control] action touch
        - create new file /opt/gitlab/sv/consul/log/supervise/control
        - change owner from '' to 'gitlab-consul'
        - change group from '' to 'gitlab-consul'
        - update utime on file /opt/gitlab/sv/consul/log/supervise/control
  • Author Developer

    So for the first issue, do you still have the error handy? Was ohai just not setting that attribute?

    This branch was cut before the runit fix was put in. I'll rebase against master and build a new package with tonight's changes.

  • Author Developer

    With today's changes, to get the cluster up and running without working DNS:

    1. For each db node, set repmgr['host'] to the IP address of the host
    2. For each node's consul configuration, set:
      consul['configuration'] = {
        retry_join: 'IP of consul servers to join',
        node_name: "Something unique that won't be resolved in DNS"
      }
    Edited by Ian Baum
  • Author Developer

    When using https://gitlab.com/ibaum/ha/blob/master/docker-compose.yml, after bringing all of the nodes online, you should only need to

    1. Run gitlab-ctl repmgr standby setup MASTER_NODE -w on each of the standby database nodes
    2. Run gitlab-ctl write-pgpass --hostuser gitlab-consul --user gitlab-consul --port 6432 --database pgbouncer on the pgbouncer node
    Edited by Ian Baum
  • Ian Baum added 66 commits

    • f2384d72...d6ddb028 - 65 commits from branch master
    • 376fa64f - Merge branch 'master' into add-consul-to-ee

  • Author Developer

    https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/61054 is the latest package build with master merged

  • Author Developer

    One more fix: the check_postgresql.erb function has been brought into gitlab-ctl, but the template resource wasn't removed. That has been fixed. New packages are in https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/61062

    Also, an earlier commit restructured the pgbouncer['users'] attribute into a hash, so now it should look like:

    pgbouncer['users'] = {
      USERNAME: { 
        password: PASSWORDHASH
      }
    }
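
    For example, with a concrete user (the value below is just a placeholder md5 password hash):

    pgbouncer['users'] = {
      'gitlab-consul': {
        password: '5e89df2f7b1e4d3a8f6c0a1b2c3d4e5f'
      }
    }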
  • I've hit another edge case that we'll need to consider; it can go into a separate issue/MR:

    If you don't have /etc/gitlab/skip-auto-migrations on your app node, the upgrade will fail because a password is expected:

    Dumping PostgreSQL database gitlabhq_production ... Password:
    pg_dump: [archiver (db)] connection to database "gitlabhq_production" failed: fe_sendauth: no password supplied
  • So for the first issue, do you still have the error handy? Was ohai just not setting that attribute?

     TypeError
      ---------
      no implicit conversion of nil into String
    
      Cookbook Trace:
      ---------------
        /opt/gitlab/embedded/cookbooks/cache/cookbooks/repmgr/recipes/enable.rb:29:in `digest'
        /opt/gitlab/embedded/cookbooks/cache/cookbooks/repmgr/recipes/enable.rb:29:in `hexdigest'
        /opt/gitlab/embedded/cookbooks/cache/cookbooks/repmgr/recipes/enable.rb:29:in `from_file'
        /opt/gitlab/embedded/cookbooks/cache/cookbooks/gitlab-ee/recipes/default.rb:41:in `block in from_file'
        /opt/gitlab/embedded/cookbooks/cache/cookbooks/gitlab-ee/recipes/default.rb:39:in `each'
        /opt/gitlab/embedded/cookbooks/cache/cookbooks/gitlab-ee/recipes/default.rb:39:in `from_file'
    
      Relevant File Content:
  • Author Developer

    After some more testing with @marin, we have the following issues:

    Major

    • Consul currently isn't marking a service as dead if the agent goes away. I believe this is just a configuration option in the check definition.
    • The postgresql check script is not correctly detecting a situation with multiple masters
    • The repmgr::enable recipe doesn't handle the situation where node['fqdn'] is nil. We use this for the node_number in repmgr. After some discussion, we're going to instead concatenate node['ipaddress'], node['macaddress'], and node['ip6address'] together (sketched below).
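
    A sketch of the proposed fallback for node_number (the exact precedence is still to be settled):

    require 'digest'

    # Build node_number from whatever network identifiers ohai provides,
    # instead of relying solely on node['fqdn'].
    identity = [node['ipaddress'], node['macaddress'], node['ip6address']].compact.join
    node_number = node['repmgr']['node_number'] ||
      Digest::MD5.hexdigest(identity).unpack('L').first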

    I'll focus on fixing these today.

    Minor (or at least not as major)

    Edited by Marin Jankovski
  • If we fix the first three, we will get this merged into the next RC.

    The reason is that this is new functionality that should not touch existing behaviour, and we are fairly confident that things are operational in the majority of cases.

    /cc @stanhu

    Edited by Marin Jankovski
  • Author Developer

    The first three issues have been fixed, and a package is being built in https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/61139

  • Author Developer

    So as it turns out, the Ubuntu 16.04 image that we base our docker image on is lacking the necessary packages that ohai uses to gather network information.

    So ohai in our docker image gets nil for the ipaddress, macaddress, and ip6address attributes.

    I'm updating it to base the value on fqdn if available, and otherwise use the network info. We could also raise a fatal error if none of those are available, as this is going to be one of those really frustrating situations if it breaks.

    /cc @marin @WarheadsSE

  • Author Developer

    The fix for checking for multiple masters didn't go according to plan either.

    I've switched that, and now the failover script will take the agent health (service check serfHealth) into account as well when locating masters.

    It doesn't look like they consider this situation a bug.

  • Author Developer

    I left out some options from the updated pgb-notify. A new package is at https://dev.gitlab.org/gitlab/omnibus-gitlab/pipelines/61141

    I believe I've addressed the 3 issues we saw today.

  • Marin Jankovski resolved all discussions

  • Marin Jankovski mentioned in issue #2694

  • mentioned in issue #2695 (closed)

  • mentioned in issue #2696 (closed)
