Geo: investigate alternative to geo_{primary|secondary}_role in gitlab.yml
Currently we are using Gitlab.config.geo_primary_role['enabled'] and Gitlab.config.geo_secondary_role['enabled'] to work around problems described in the following issues:
- https://gitlab.com/gitlab-org/gitlab-ee/issues/2289
- https://gitlab.com/gitlab-org/gitlab-ee/issues/2181
- https://gitlab.com/gitlab-org/gitlab-ee/issues/2049
Related MRs and Issues:
- https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/1987
- https://gitlab.com/gitlab-org/gitlab-ee/issues/2760
- https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/2099
- https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/2071
- https://gitlab.com/gitlab-org/omnibus-gitlab/merge_requests/1608
- gitlab-org/gitlab-ee!2304
But this new configuration prevents anyone from setting up HA, because we currently rely on the roles configured in Omnibus to activate the roles in gitlab.yml. This was not the intended behavior for Omnibus roles. They are supposed to be a shortcut that triggers existing configurations, as a way to simplify common setups.
So right now the decision to use the Omnibus roles is blocking https://gitlab.com/gitlab-org/gitlab-ee/issues/2825, and we have two alternatives:
- We implement yet another configuration flag in Omnibus, use it to write the roles to gitlab.yml, and then add this new flag to the Omnibus roles.
- We try to remove the roles from gitlab.yml and simplify the code again.
Proposal
The problematic parts of the code are the initializers and the ActiveRecord connection. For all the other pieces we can rely on the database and query it to see what is configured.
On the initializer side, we have sidekiq configuration enabling or disabling cronjobs.
On the ActiveRecord side we need to establish a connection to the DR database when running on a secondary node.
What we can do here is use the existence of the database_geo.yml file to determine whether we override the connection for the DR models, so we don't need the geo_secondary_role for that.
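A minimal sketch of that file-based check, to make the idea concrete. The module name GeoDatabaseCheck and its methods are hypothetical, and the root directory is passed in explicitly so the snippet runs outside of Rails; in the application it would presumably be the Rails root.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper (not existing GitLab code): the presence of
# config/database_geo.yml is the only signal used to decide whether
# the DR connection should be set up.
module GeoDatabaseCheck
  CONFIG_NAME = "database_geo.yml"

  # Builds the expected path of the Geo database config under +root+.
  def self.config_path(root)
    File.join(root, "config", CONFIG_NAME)
  end

  # True when the Geo database config file exists under +root+.
  def self.secondary_database_configured?(root)
    File.exist?(config_path(root))
  end
end

# Demonstration in a throwaway directory.
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, "config"))
  puts GeoDatabaseCheck.secondary_database_configured?(root)

  FileUtils.touch(GeoDatabaseCheck.config_path(root))
  puts GeoDatabaseCheck.secondary_database_configured?(root)
end
```

With a check like this, the code that overrides the connection for the DR models could be guarded by the file's presence instead of by geo_secondary_role.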
We can use the nonexistence of the database_geo.yml file to decide whether to try update_clone_url for the primary node. So if we are on a machine that doesn't have this file, we can always try to update (it will not do anything if there are no configured nodes). We already use a rescue there to fail gracefully in rake tasks when the database is unavailable, etc., so we are covered.
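The flow described above could look roughly like this. It is only a sketch: FakePrimaryNode, try_update_clone_url, and the returned symbols are hypothetical stand-ins, and the primary node lookup is passed in as a lambda so the example does not depend on the database.

```ruby
require "tmpdir"

# Hypothetical stand-in for the primary node record.
class FakePrimaryNode
  attr_reader :updated

  def initialize(error: nil)
    @error = error
    @updated = false
  end

  # Simulates the update; raises when constructed with an error.
  def update_clone_url!
    raise @error if @error

    @updated = true
  end
end

# Tries the update only when database_geo.yml is absent, and never lets
# a failure (for example an unavailable database) propagate.
def try_update_clone_url(geo_config_path, find_primary)
  return :skipped_secondary if File.exist?(geo_config_path)

  node = find_primary.call
  return :no_nodes_configured if node.nil?

  node.update_clone_url!
  :updated
rescue StandardError => e
  warn "Skipping clone URL update: #{e.message}"
  :failed_gracefully
end

# Demonstration: the file does not exist, so the update is attempted.
Dir.mktmpdir do |dir|
  path = File.join(dir, "database_geo.yml")
  puts try_update_clone_url(path, -> { FakePrimaryNode.new })
end
```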
The last part, and the one with the most unknowns, is the Sidekiq cronjobs configuration. Our current code requires a restart when you change from primary to secondary, or when an existing, newly configured node is configured as secondary after the replication started and the server is running.
We can fix this by introducing a check in every non-Geo entry to bypass the execution AND reconfigure the cronjobs. The same applies to the Geo cron jobs: in a non-Geo environment, they will skip the job and reconfigure the cron.
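A rough sketch of the skip-and-reconfigure check follows. Everything in it is hypothetical: CronTable is an in-memory stand-in for the real cron registry, the job names are made up, and the node state is collapsed into a single secondary flag as a simplifying assumption (the real check would query the database).

```ruby
# Hypothetical in-memory cron table; not the Sidekiq cron API.
class CronTable
  def initialize
    @enabled = {}
  end

  def set(name, enabled)
    @enabled[name] = enabled
  end

  def enabled?(name)
    @enabled.fetch(name, false)
  end
end

# Enables the job set that matches the current node state.
class CronReconfigurer
  GEO_JOBS = %w[geo_example_sync].freeze    # made-up job name
  NON_GEO_JOBS = %w[example_cleanup].freeze # made-up job name

  def initialize(table)
    @table = table
  end

  def reconfigure!(secondary)
    GEO_JOBS.each { |name| @table.set(name, secondary) }
    NON_GEO_JOBS.each { |name| @table.set(name, !secondary) }
  end
end

# Wraps a cron entry: when the job kind does not match the node state,
# the body is bypassed and the cron table is reconfigured instead.
def run_cron_entry(geo_job:, secondary:, reconfigurer:)
  if geo_job != secondary
    reconfigurer.reconfigure!(secondary)
    return :skipped
  end

  yield
  :executed
end

# Demonstration: a non-Geo job fires on a node that became a secondary.
table = CronTable.new
reconfigurer = CronReconfigurer.new(table)
reconfigurer.reconfigure!(false)
puts run_cron_entry(geo_job: false, secondary: true, reconfigurer: reconfigurer) { puts "work" }
puts table.enabled?("geo_example_sync")
```

Because the check runs inside the job itself, a role change would be picked up the next time any cron entry fires, without restarting the server.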
What do you all think?