Geo: make repository locations immutable
We have some challenges with https://gitlab.com/gitlab-org/gitlab-ee/issues/2828 and https://gitlab.com/gitlab-org/gitlab-ee/issues/2827. Based on earlier discussions around the new Geo replication architecture, making this type of change reliable would require a global lock mechanism, which can be slow, creates a single point of failure, and makes restoring from a backup harder because we have to synchronize state between the database and the filesystem.
Moving or renaming is problematic because operations that update a repository refer to its full path. If that path changes and the events are not synchronized, the result can be catastrophic: unintended data exposure (a secret repository becoming visible, or commits intended for repository A going to B) or data loss.
The same can be said of creating and removing repositories: if the events are not synchronized, replication will behave incorrectly.
But there is another way.
If we assume the repository path will never change, we will always be replicating the events to the correct location. Renaming and moving should be a "virtual" change in the database, and should not be reflected on disk.
This is already partially done by the introduction of multiple storages, which map repositories to different places on the local disk depending on which "storage" they belong to.
Removing a repository will always remove the "correct" one, no matter when the event happens. Likewise, creating a new repository at the same full path will never overwrite an existing one, because it will be stored in a different place (remember, the location is based on the ID, not on the name or the namespace).
Proposal (simple)
Repository UUID will be built as: MD5("project-#{project.id}")
We can store on the disk as: "#{uuid[0..1]}/#{uuid}.git"
and "#{uuid[0..1]}/#{uuid}.wiki.git"
Or we can even use more levels to spread repositories more evenly across directories on the filesystem:
"#{uuid[0]}/#{uuid[0..1]}/#{uuid}.git"
or more levels if we think this is necessary. Because the hash function is predictable, we can even work out where a repository is on disk without using the database, and a simple script can implement the same algorithm.
Building on top of the https://gitlab.com/gitlab-org/gitlab-ce/issues/28283 proposal, if security is a concern, we can add a "salt" to the hash function, which can be stored in the secrets file, so the UUID would become unique for every GitLab install, like: MD5("#{secret_salt}/project-#{project.id}").
MD5 is used here just for simplicity, but we can consider alternatives like SHA256, and/or use the database to detect a collision and refuse to create the repository, etc.
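The path scheme above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the `HashedRepoPath` module, its method names, and the `salt:`, `digest:` and `nested:` options are hypothetical and not part of any existing GitLab API.

```ruby
require 'digest'

# Hypothetical helper for illustration; not an existing GitLab API.
module HashedRepoPath
  # Hex digest used as the repository "UUID".
  # salt:   optional per-install secret (assumed to come from the secrets file)
  # digest: hash class, MD5 by default; Digest::SHA256 is one alternative
  def self.uuid(project_id, salt: nil, digest: Digest::MD5)
    key = "project-#{project_id}"
    key = "#{salt}/#{key}" if salt
    digest.hexdigest(key)
  end

  # Relative on-disk path. nested: true adds the extra one-character level.
  def self.disk_path(project_id, suffix: '.git', nested: false, **hash_opts)
    id = uuid(project_id, **hash_opts)
    prefix = nested ? File.join(id[0], id[0..1]) : id[0..1]
    File.join(prefix, "#{id}#{suffix}")
  end
end

puts HashedRepoPath.disk_path(42)
puts HashedRepoPath.disk_path(42, suffix: '.wiki.git', nested: true)
```

Since the path depends only on the project ID (and the optional salt), any script that knows the same inputs can compute it without a database lookup.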
Initial deployment
The "concern" of where to store and read the repository on disk should use a Bridge pattern and become switchable. That way existing instances keep reading from where they normally do, and a clean secondary Geo node can be bootstrapped with the new format.
This will allow fast bootstrapping of the solution, as we don't need to implement the migration of existing data, and we can use this to prototype/test the solution before going into the gitlab.com scale test.
This also provides a path to migrate existing data in the future (described below).
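As a rough sketch of what "switchable" could look like, the lookup can be delegated to an interchangeable implementation chosen by configuration. All names here (`LegacyStorage`, `HashedStorage`, `RepositoryLocator`, and the `Project` struct) are hypothetical and only stand in for whatever the real code would use.

```ruby
require 'digest'

# Minimal stand-in for a project record (hypothetical).
Project = Struct.new(:id, :full_path)

# Current behaviour: the path follows namespace and project name.
class LegacyStorage
  def disk_path(project)
    "#{project.full_path}.git"
  end
end

# Proposed behaviour: the path depends only on the project ID.
class HashedStorage
  def disk_path(project)
    id = Digest::MD5.hexdigest("project-#{project.id}")
    "#{id[0..1]}/#{id}.git"
  end
end

# Callers talk to the locator; the implementation is picked per node.
class RepositoryLocator
  STRATEGIES = { legacy: LegacyStorage, hashed: HashedStorage }.freeze

  def initialize(strategy)
    @impl = STRATEGIES.fetch(strategy).new
  end

  def disk_path(project)
    @impl.disk_path(project)
  end
end

project = Project.new(1, 'gitlab/gitlab_ce')
puts RepositoryLocator.new(:legacy).disk_path(project)
puts RepositoryLocator.new(:hashed).disk_path(project)
```

With this shape, renaming the project changes the legacy path but leaves the hashed path untouched, which is the property the proposal relies on.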
Migrating from old to new format
As we will have the two implementations, we can use symlinks or hardlinks on disk and start backfilling the required structure in the new format. We can then do a slow rollout, machine by machine, just changing the local configuration to use the new implementation; because of the symlinks/hardlinks, both implementations will still be able to read from the disk.
Ex:
1d/1d90357640b85550b35989198e93fee9.git -> gitlab/gitlab_ce.git
The old format will read from gitlab/gitlab_ce.git, and the new one will read from 1d/1d90357640b85550b35989198e93fee9.git.
After all machines are rolled out, we can mv gitlab/gitlab_ce.git 1d/1d90357640b85550b35989198e93fee9.git and have the repository moved to the new path.
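The two migration steps could look roughly like the following, assuming symlinks are used. The function names `backfill_link` and `finalize_move` are hypothetical. One assumption worth stating: the sketch removes the symlink before renaming, since moving a directory onto a symlink that points back at it would not give the intended result.

```ruby
require 'fileutils'

# Step 1 (backfill): create new_rel as a symlink pointing at the old location,
# so both implementations can read the same repository.
def backfill_link(root, old_rel, new_rel)
  old_abs = File.join(root, old_rel)
  new_abs = File.join(root, new_rel)
  FileUtils.mkdir_p(File.dirname(new_abs))
  return if File.symlink?(new_abs) || File.exist?(new_abs) # already backfilled
  File.symlink(old_abs, new_abs)
end

# Step 2 (after rollout): replace the symlink with the real directory.
def finalize_move(root, old_rel, new_rel)
  old_abs = File.join(root, old_rel)
  new_abs = File.join(root, new_rel)
  raise "#{new_rel} is not a symlink" unless File.symlink?(new_abs)
  File.unlink(new_abs)          # removes only the link, not the repository
  File.rename(old_abs, new_abs) # assumes both paths are on the same filesystem
end
```

A run over one repository would call `backfill_link(root, 'gitlab/gitlab_ce.git', '1d/1d90357640b85550b35989198e93fee9.git')` first, and `finalize_move` with the same arguments once every machine reads the new format.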
cc @stanhu @dbalexandre @rspeicher @DouweM @pcarranza @smcgivern @lbot