Having the possibility to specify where some of the repositories are stored is also beneficial: we can slowly migrate all repositories to a new storage, and eventually migrate away from the old one.
The idea would be to have multiple mount points (which could be NFS-based for now) and have the ability to pin every project to a specific share, move projects around using some rake task (or from the GUI), or even set the application to use a particular share for every new project, so we stop growing the same NFS server all the time.
This way we could avoid having just one SPOF (and have many, heh), but more importantly we could easily distribute the storage of repos while still running the application: splitting the heavy load we are pushing onto one server across, say, 4 of them, reducing this crazy FS load and improving availability.
This is not a final solution for the storage problem, but certainly it looks like a low hanging fruit that will buy us a lot of time.
After a quick dive in the code by @ayufan it looks like it would not be extremely hard to do it.
Can we spend a bit of time poking holes in this idea to see where it will fail?
We would still need to look at a lot of edge cases, but since repository management is not spread across the application, it seems that this could be done.
The interesting thing is that this would allow us to do a smooth transition from one storage to another.
+1, complexity should be modest and it would buy us room to breathe in the short term. I think it is suitable for 2x-4x growth (i.e. 2-4 NFS servers), not for 10x. To make it sustainable in the medium term we would need a solution that can keep going when one storage server is out (i.e. with partial availability of repos).
I was thinking that we can do a really small experiment once we have this: actually test stuff in production.
We could have our GitLab organization repos re-mounted into a ceph or lustre partition and check how it works while we work with it.
We will need to be able to separate which metrics belong to which mount point, because I would love to be able to compare apples to apples at the application level on performance.gitlab.net.
@ayufan what if we put the 'storage shards' in the DB only? Projects get a 'storage_shard_id' column. The 'storage_shards' table has id, path. To find a repo do:
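A minimal sketch of that lookup, in plain Ruby (the Struct stand-ins and the SHARDS hash are my illustration; the real code would use ActiveRecord models):

```ruby
# Toy stand-ins for the proposed tables: storage_shards(id, path) and
# projects.storage_shard_id pointing at one of them.
StorageShard = Struct.new(:id, :path)
Project = Struct.new(:storage_shard_id, :path_with_namespace)

SHARDS = {
  1 => StorageShard.new(1, "/home/git/repositories"),
  2 => StorageShard.new(2, "/home/git/repositories/-shard1"),
}

# Resolve a project's on-disk repository path via its shard.
def repository_path(project)
  shard = SHARDS.fetch(project.storage_shard_id)
  File.join(shard.path, "#{project.path_with_namespace}.git")
end

repository_path(Project.new(2, "gitlab-org/gitlab-ce"))
# => "/home/git/repositories/-shard1/gitlab-org/gitlab-ce.git"
```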
I don't see why we need to couple this to backups. To GitLab, a storage shard would just be a directory. What is mounted at that directory is the problem of the OS.
And as @jacobvosmaer-gitlab said, this helps us scale 3x and move to new storage. We'll still need a properly distributed solution that includes redundancy and rebalancing in the end.
I foresee problems with rolling this on our own but it looks like we have no choice due to the NFS server already redlining.
@jacobvosmaer-gitlab Seems nice, but I'm not a big fan of storing the shard paths in the DB. To me it feels like system-level configuration rather than application-level configuration :)
Let's imagine that you have this configured in Omnibus, which can:
1. Pass it to the GitLab Rails app,
2. Make sure that all permissions are correct,
3. Make sure that the mountpoints are correct.
If we put this in the database it will be harder to fulfil requirement 1 :)
@ayufan I propose to only allow relative subdirectories for storage shards. For example /home/git/repositories/-shard1 (I am hoping -shard1 is an illegal namespace path in GitLab so we cannot have clashes). That way we only need /home/git/repositories to guard access to the repositories, which is what we already do. I think that addresses your point 2.
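A sketch of the guard that idea implies (the valid_shard? helper is hypothetical): only accept shard paths that resolve to somewhere under the repositories root.

```ruby
# Repositories root that already guards access today.
ROOT = "/home/git/repositories"

# Hypothetical check: a shard path is only valid if it expands to a
# location under ROOT, so relative names like "-shard1" pass while
# escapes like "../outside" or absolute paths are rejected.
def valid_shard?(path)
  expanded = File.expand_path(path, ROOT)
  expanded == ROOT || expanded.start_with?(ROOT + "/")
end

valid_shard?("-shard1")    # => true
valid_shard?("../outside") # => false
```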
I think perhaps mount points (point 3) should be none of GitLab's business. The OS can handle this transparently.
If I recall correctly @northrup said he would work on this during yesterday's '8.9 Kickoff meeting'.
One thing to look for is where the string repos_path is used in the gitlab-ce code base. This will guide you to code that wants to know the on-disk path to an actual repository.
As @ayufan pointed out yesterday, gitlab-workhorse should require zero changes. Gitlab-shell would have to be told by /api/v3/internal/allowed what the exact path to a project is when handling Git-over-SSH.
Then there is the question how to migrate repositories between storage shards. In the first version this can probably be a manual process (no automatic rebalancing etc.).
Another thing we need to think about early is where to implement all this. On the one hand, supporting large amounts of storage sounds like an EE feature. On the other hand, at least some of this has to be handled in CE already, because otherwise we would have a constant source of 'repository not found' bugs in EE.
I think we will be at a constant risk of contributed code ignoring the storage sharding and breaking EE. The less of the functionality is 'real' in CE, the higher the risk.
I agree with @jacobvosmaer-gitlab, it will be easier to maintain the code if this lives (for the most part) in CE. Maybe this should be a hidden feature in CE (with really ugly configs), but fully surfaced in EE (with nice configuration, health checks of storages, easy-to-use migration between storages, etc.)?
CE randomly puts new projects in the null shard (/home/git/repositories) or a non-null shard (/home/git/repositories/-shard1). This forces the CE code to correctly handle 'legacy projects' (null shard) and projects stored in a shard.
Then EE adds:
tooling to move projects between shards
smart (non-random) shard assignment for new projects
I'm not sure about putting everything in /home/git/repositories, because it creates a lot of mess. We will have repos in groups lying in the top-level directory, but we will also have shard directories which in turn contain repos in groups.
This also introduces some unpleasant mounting issues:
You always have to mount /home/git/repositories first.
Only then can you mount /home/git/repositories/-shard.
Your externally hosted /home/git/repositories has to contain a -shard directory.
I agree with @ayufan, hosting git repos in the same directory as shard directories for mounting is a recipe for a hot mess, can we have /home/git/repositories/-local and /home/git/repositories/-shard[1-n] ?
What worries me is how to transition existing installations to a sharded layout. We cannot force everybody to change their mountpoints from e.g. /home/git/repositories to /home/git/repositories/-local . We also cannot quickly move all repositories into a subdirectory like that.
It would be bad if 'legacy' GitLab installations have everything under /home/git/repositories, and a handful of 'sharded' installs have /home/git/repositories/-shard1 etc. Then we will constantly see code contributions coming in that only works for the 'legacy' layout, which is a maintenance nightmare.
Had a call with @pcarranza and @ayufan today. We think the best way forward is something like @ayufan's initial proposal (storage shards can be anywhere, gitlab.yml maps shard names to paths on disk) and monkey patching the Gitlab.config.gitlab_shell.repos_path method to raise an exception when developers try to write code that looks up repos on disk without going through the new 'shard' API.
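A minimal sketch of that monkey patch idea (the class and the message are illustrative, not the real GitLab code): the legacy accessor raises so any code path that bypasses the shard API fails loudly.

```ruby
# Stand-in for Gitlab.config.gitlab_shell in this sketch.
class ShellConfig
  # After the patch, direct lookups blow up instead of silently
  # returning the single legacy path.
  def repos_path
    raise NotImplementedError,
          "repos_path is gone; resolve repository paths through the shard API"
  end
end

begin
  ShellConfig.new.repos_path
rescue NotImplementedError => e
  puts e.message
end
```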
My first question is regarding the interaction between gitlab-shell and ce/ee. As I currently understand it (bear with me as I get familiar with the infrastructure ;) there are two main flows:
GitLab CE/EE calls gitlab-shell (mainly using the GitLab::ShellAdapter) in several steps of its workflow. In these cases I think what we have to do is modify our calls to the shell to give it the real path accounting for shards, and modify the shell to use this full path.
gitlab-shell gets called from git over ssh. This, if I understand correctly, is what @jacobvosmaer-gitlab was talking about, where shell must call /api/v3/internals to know what's the shard of the project.
Am I seeing this correctly? I think this would mean that gitlab-shell would lose the repos_path key in its config.yml (and not have the new shard configuration replicated in its config), since it will now always get it from CE.
@eReGeBe Yes, I believe that is basically right. Right now gitlab-shell sends a POST /api/v3/internal/allowed to check for permissions. It looks like other Rake tasks use repos_path (e.g. backup/restore/import), so we would need to figure out how to deal with those cases too.
So when we use gitlab-shell over SSH, we call the allowed API endpoint and ask for the project for the given key. The path of the project should be returned from this endpoint.
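For illustration, the response might gain a field like this (the repository_path key is my assumption, not the actual API contract):

```ruby
require 'json'

# Hypothetical /api/v3/internal/allowed response once it also carries
# the resolved repository path for gitlab-shell to use.
payload = JSON.parse('{"status": true, "repository_path": "/storage/shard1/group/project.git"}')
payload["repository_path"] # => "/storage/shard1/group/project.git"
```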
Now a question about namespaces: we had discussed adding the new field indicating the shard (I'm calling it repository_storage for the moment, but we can change it) on projects, but namespaces also interact with gitlab-shell regarding the file system, through ensure_dir_exist, move_dir and rm_dir.
I can think of two options:
Add the repository_storage field to namespaces instead. This would require all projects in a namespace to be in the same shard.
Move the directory logic to projects. This means:
The project calls ensure_dir_exist on before_create.
Moving/deleting a namespace implies iterating over each project of that namespace and having it do the mv, checking whether another project on the same shard has already done it.
Difficult decision. I think setting shards at the project level is better (not forcing an entire namespace to live on the same shard). But it makes the change we are doing here harder because we have to move the directory logic to the project.
I think putting directory logic in the project is the 'right' thing to do but it is hard to oversee the impact.
The main problem with namespaces and projects is that in a few places we work at the namespace level, where we actually move a group of projects. It shouldn't be that hard, since this is done exclusively in GitLab Rails. The methods in question are implemented in Gitlab::Shell (which calls gitlab-shell for some operations, but not for these):
```ruby
# Add empty directory for storing repositories
#
# Ex.
#   add_namespace("gitlab")
#
def add_namespace(name)
  FileUtils.mkdir(full_path(name), mode: 0770) unless exists?(name)
end

# Remove directory from repositories storage
# Every repository inside this directory will be removed too
#
# Ex.
#   rm_namespace("gitlab")
#
def rm_namespace(name)
  FileUtils.rm_r(full_path(name), force: true)
end

# Move namespace directory inside repositories storage
#
# Ex.
#   mv_namespace("gitlab", "gitlabhq")
#
def mv_namespace(old_name, new_name)
  return false if exists?(new_name) || !exists?(old_name)
  FileUtils.mv(full_path(old_name), full_path(new_name))
end
```
It should be fine to replay these operations on all shards, rather than working on a specific one.
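Replaying mv_namespace on every shard could look roughly like this (shard paths here are plain directories, as in the discussion; the helper name is made up):

```ruby
require 'fileutils'

# Apply the namespace move on each shard root where the old directory
# exists and the new one does not, mirroring mv_namespace's guard.
def mv_namespace_on_all_shards(shard_paths, old_name, new_name)
  shard_paths.each do |root|
    old_path = File.join(root, old_name)
    new_path = File.join(root, new_name)
    next if !Dir.exist?(old_path) || Dir.exist?(new_path)
    FileUtils.mv(old_path, new_path)
  end
end
```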
I had a call with @pcarranza about this, then read through the issue to see the proposed solutions. I think @jacobvosmaer-gitlab's idea of storing things on DB level is the best approach as it allows easy shard management without having to do the whole Chef/Omnibus dance (also saves us from adding code to Omnibus to manage config files, etc).
It may have been suggested already but the easiest setup I can think of is as follows:
The table repository_shards contains data about each shard, in particular:
The ID
The name: a simple string (e.g. shard1), mostly for debugging purposes
The root path containing the directories (e.g. /opt/gitlab/shards/shard1), in here a set of repository directories can be stored
Path wise this means you'll end up with something like:
Note that it should be possible for repositories (in the same group) to span multiple shards. This makes moving repositories much easier as you can move them individually instead of having to move an entire group together.
Structure wise this would be (code is from the top of my head, the syntax may be invalid):
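Since the snippet itself is missing from the comment, here is my guess at the structure being described, as a Rails-style migration sketch (table, column, and path names are illustrative, and per the comment above the syntax is not guaranteed):

```ruby
class CreateRepositoryShards < ActiveRecord::Migration
  def change
    create_table :repository_shards do |t|
      t.string :name, null: false # e.g. "shard1", mostly for debugging
      t.string :path, null: false # e.g. "/opt/gitlab/shards/shard1"
    end

    # Each project points at its shard; NULL could mean the legacy
    # default path from gitlab.yml.
    add_column :projects, :shard_id, :integer
  end
end
```

Path-wise that would give a layout along the lines of /opt/gitlab/shards/shard1/some-group/some-project.git.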
For the projects table we then add a shard_id column which just points to the shards table. In the Project model we then make sure there's a method that returns the full path to the repository (either using the shard or a default path from gitlab.yml). This path can then be used by methods such as Project#repository to initialize any Rugged objects or work with gitlab-shell.
gitlab-shell in turn would have to get the shard path (or the full repository path, even easier) from the Grape API. @pcarranza mentioned adding this to the authorized keys API but this API only exists in EE so I think we'd have to either move this to CE or use a separate API. Either way it gets it from the API instead of using a path hard-coded in the config.yml file.
@pcarranza also mentioned that shards should be checked upon boot and that a boot should fail if a shard doesn't exist. I'm not sure if we should do this right from the start, but it's worth keeping in mind. If we do this we have to make sure the code in question works when the DB is disabled (e.g. when the USE_DB environment variable is set to false).
We will also need a way to lock repositories for writes. This allows us to migrate repositories between shards without having to worry about any Git changes. This can be something as simple as adding an allow_repository_writes column (boolean, default true) to projects and setting it to false when migrating a project. The Grape API call used to get a repository path can then raise some kind of error saying "Sorry, this repo is in read-only mode". Note that this would apply to both regular and wiki repositories, unless we want separate columns for each.
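The migration flow that column enables could be sketched like this (ProjectRecord and migrate_project are made-up names for illustration):

```ruby
# Toy model: a project row with its shard pointer and write flag.
ProjectRecord = Struct.new(:shard_id, :allow_repository_writes)

def migrate_project(project, new_shard_id)
  project.allow_repository_writes = false # block pushes during the move
  # ... copy the repository data to the new shard here (e.g. rsync) ...
  project.shard_id = new_shard_id         # point the app at the new copy
  project.allow_repository_writes = true  # re-enable writes
end
```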
Finally, I discussed with @pcarranza that it's probably best for me (unless somebody else wants to jump in) to take care of gitlab-shell and tying this to Rails so @eReGeBe can focus on the Rails side of things.
Any thoughts?
p.s. I may have overlooked some suggestions. Keep in mind that I'm not dismissing any suggestions, I may have honestly just been dumb and missed them :)
@yorickpeterse: I'm with @ayufan in that this feels more like system-level configuration than application-level configuration, and I prefer to have the shard paths in a config file. This would make it more straightforward to do the boot-up checks that @pcarranza talks about (it would be a small change to what lib/tasks/gitlab/check.rake currently does). I also think it's weird that depending on the case you'd get the shard path from the DB or from gitlab.yml (if you need the default one); it would be nicer to have it all in one place. However, I am not aware of the weight and pain of implementing such config changes in Omnibus, so I can't weigh it against that, and it might tip the balance.
I already have some changes in gitlab-shell that I've made to get a feeling for how the new setup would look, and I hope to have some WIP merge requests up before I finish my day so that you (or someone else) can weigh in, which I'm totally for.
@eReGeBe I think the argument of it being a system vs application level configuration is a moot point. The only time a configuration file is really needed is when you need the settings before connecting to some kind of service (e.g. database credentials). Storing these settings in a configuration file is problematic because:
Omnibus needs to manage this file which means:
We have to add changes to Omnibus
Omnibus will effectively mirror certain settings (e.g. defaults)
Whenever we want to change a setting we have to edit some Chef JSON file, commit it, have it applied to all workers, reload the workers, etc. This can take considerable amounts of time
If average Joe wants to change the shard settings they have to start mucking in some YAML file and it's all too easy to mess things up (e.g. syntax wise)
If I want to quickly see what our shard configuration is I have to somehow pull the data from a worker or from Chef.
On the other hand when using a database whenever we want to change the settings all we'd need to do is:
Create a new shard in the shards table
Run UPDATE projects SET shard_id = X WHERE ...
This could be simplified by adding a UI to GitLab, potentially even with the ability to move projects between shards without having to run any manual commands.
If I want to see our current shards I just run:

```sql
SELECT * FROM shards;
```

If I want to see what the shard path is for gitlab-ce I just run:

```sql
SELECT shards.path
FROM projects
JOIN shards ON shards.id = projects.shard_id
WHERE projects.id = ID_OF_GITLAB_CE; -- or use something like the path, you get the point
```
Instead if we're using a config file I have to either:
Note down the shard name from the projects table
Find the shard path in some random config file
or alternatively:
Start a Rails console
Get the project
Call some method to get the shard path
In general, using a DB to me is the most straightforward and user friendly approach (remember we're shipping this to everybody using GitLab, not just us) and doesn't require mucking with Omnibus, config files, Chef, etc.
Having sharded paths is a very unusual case. If we store them in the DB we are effectively moving the permission management and all mountpoint checks from Omnibus to GitLab. Changing paths should not be taken lightly, and adding shards is not something that happens often. This is something we will need to explain to our users and guide them through. I think it's best for everyone to have a single place that does that automatically.
This also seems twisted from a security perspective: if someone finds a vulnerability that allows them to change the shard paths in the DB, they can start trying to exploit it. I know this is a moot point, but then the database is your only point of truth, and it's basically impossible to sanitize your paths and what you are accessing on your server.
As for the Omnibus changes, as @yorickpeterse said, this is mostly a rewrite of the gitlab.rb settings, and implementing it is something like 10-20% of the time compared to what needs to be done on the GitLab Rails side.
Which is less work regarding shards: using a table, or Omnibus?
I'm OK with both approaches. I kind of like decoupling the path from the database, because that would let me move things around without having to update records at all, which means using configuration files; but I am open to whatever the majority decides.
I've created WIP merge requests https://gitlab.com/gitlab-org/gitlab-shell/merge_requests/61 and https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/4578. These use the config approach, but please don't take that as me trying to push that option; it's just the code I started with, and I wanted to put it out there to get feedback as soon as possible and see if I'm heading in the right direction. It's relatively easy to swap between both approaches in the current code, whichever we choose.
I'm close to finishing the refactoring part, I think. The only bump I found was that gitlab-shell wanted to know the repository path before calling /internals/allowed, so getting the path there as we had planned was not going to work. I added a new endpoint for that.
Tomorrow I'll do the shard assignment/reassignment. Some recap and questions on that:
CE:
Random assignment (so, to clarify: if shards is an array with all the paths, a shards.sample kind of logic?). Do we add a read-only field somewhere (where?) to show a project's shard?
EE:
Ability to move projects from one shard to another. I think this would work as a new section called "Move repository" in the "edit project" page with a dropdown listing all the shards.
Smart assignment: So what do we actually mean by that? Check storage usage stats on all shards to choose? Round-robin?
Also, how to work on the EE stuff if this hasn't been merged into CE?
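The CE random assignment mentioned above could indeed be as small as this (the paths are examples):

```ruby
# All configured shard paths (example values).
shards = [
  "/home/git/repositories",
  "/home/git/repositories/-shard1",
  "/home/git/repositories/-shard2",
]

# Pick one at random for a new project.
new_project_storage = shards.sample
```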
@eReGeBe I agree with @ayufan . Could there be a way around this? Ping me on Slack if you want to look at the gitlab-shell code together. I looked for 10 minutes but I don't see where we need the full path early on yet.
@eReGeBe Regarding random/smart assignment, at this point in time I'm OK with just having the ability to declare "use this shard from now on" at the application level.
What I mean by that is that I want to stop creating repos in the same old storage as we have been all this time, and start using a new one.
@eReGeBe I understand that it should be an EE feature. But it may make sense to keep some of the plumbing in CE to make it easier to maintain the feature over time.
I made https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/4657 for the application settings bit, as a separate MR for ease of review. In that branch you can actually use the feature with several repository paths. I've been testing it myself and it seems to work great.
We should decide which way we're going to go with where to save the shard paths (.yml or database model) to update the docs and omnibus if necessary.
As discussed in the last infrastructure call, https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/4578 is now updated with a DB approach. There are a couple of things I'm not quite happy about, specifically the migration and testing (you'll notice I had to add stub_default_repository_storage to many tests for them to work), so any input there is appreciated. I integrated the changes of https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/4657 into the former MR, so I closed that last one.
@sytses I would argue it should be "storages" as in the configuration key (repositories.storages) to indicate the plurality of the storage paths. But I can see the confusion between reading "storage" as "the action of storing" vs "the places to store". Let me know if my linguistic argument swayed you or if you'd still like it to be "storage".