Having the possibility to specify where some of the repositories are stored is also beneficial: we can slowly migrate all repositories to a new storage, and eventually migrate away from the old one.
The idea would be to have multiple mount points (which could be NFS-based for now) and have the ability to pin every project to a specific share, move projects around using some rake task (or from the GUI), or even set the application to use a particular share for every new project, so we stop growing the same NFS server all the time.
This way we could avoid having just one SPOF (and have many, heh), but more importantly we could easily distribute the storage of repos while still running the application: splitting the heavy load we are pushing onto one server across, say, 4 of them, reducing this crazy FS load and improving availability.
This is not a final solution for the storage problem, but certainly it looks like a low hanging fruit that will buy us a lot of time.
After a quick dive in the code by @ayufan it looks like it would not be extremely hard to do it.
Can we spend a bit of time poking holes in this idea to see where it will fail?
We would still need to look at a lot of edge cases, but since repository management is not spread across the application, it seems that this could be done.
The interesting thing is that this would allow us to do a smooth transition from one storage to another.
+1, complexity should be modest and it would buy us room to breathe in the short term. I think it is suitable for 2x-4x growth (i.e. 2-4 NFS servers), not for 10x. To make it sustainable in the medium term we would need a solution that can keep going when one storage server is out (i.e. with partial availability of repos).
I was thinking that we can do a really small experiment once we have this: actually test stuff in production.
We could have our GitLab organization repos re-mounted into a ceph or lustre partition and check how it works while we work with it.
We will need to be able to separate which metrics belong to which mount point, because I would love to be able to compare apples to apples at the application level on performance.gitlab.net.
@ayufan what if we put the 'storage shards' in the DB only? Projects get a 'storage_shard_id' column. The 'storage_shards' table has id, path. To find a repo do:
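A minimal sketch of that lookup, in plain Ruby (the Struct stand-ins and the SHARDS hash are my illustration; the real code would use ActiveRecord models):

```ruby
# Toy stand-ins for the proposed tables: storage_shards(id, path) and
# projects.storage_shard_id pointing at one of them.
StorageShard = Struct.new(:id, :path)
Project = Struct.new(:storage_shard_id, :path_with_namespace)

SHARDS = {
  1 => StorageShard.new(1, "/home/git/repositories"),
  2 => StorageShard.new(2, "/home/git/repositories/-shard1"),
}

# Resolve a project's on-disk repository path via its shard.
def repository_path(project)
  shard = SHARDS.fetch(project.storage_shard_id)
  File.join(shard.path, "#{project.path_with_namespace}.git")
end

repository_path(Project.new(2, "gitlab-org/gitlab-ce"))
# => "/home/git/repositories/-shard1/gitlab-org/gitlab-ce.git"
```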
I don't see why we need to couple this to backups. To GitLab, a storage shard would just be a directory. What is mounted at that directory is the problem of the OS.
And as @jacobvosmaer-gitlab said, this helps us scale 3x and move to new storage. We'll still need a properly distributed solution that includes redundancy and rebalancing in the end.
I foresee problems with rolling this on our own but it looks like we have no choice due to the NFS server already redlining.
@jacobvosmaer-gitlab Seems nice, but I'm not a big fan of storing the shard paths in the DB. To me it feels like system-level configuration rather than application-level configuration :)
Let's imagine that you have this configured in Omnibus, which can:
1. Pass it to the GitLab Rails app,
2. Make sure that all permissions are correct,
3. Make sure that the mountpoints are correct.
If we put this in the database it will be harder to fulfil requirement 1 :)
@ayufan I propose to only allow relative subdirectories for storage shards. For example /home/git/repositories/-shard1 (I am hoping -shard1 is an illegal namespace path in GitLab so we cannot have clashes). That way we only need /home/git/repositories to guard access to the repositories, which is what we already do. I think that addresses your point 2.
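A sketch of the guard that idea implies (the valid_shard? helper is hypothetical): only accept shard paths that resolve to somewhere under the repositories root.

```ruby
# Repositories root that already guards access today.
ROOT = "/home/git/repositories"

# Hypothetical check: a shard path is only valid if it expands to a
# location under ROOT, so relative names like "-shard1" pass while
# escapes like "../outside" or absolute paths are rejected.
def valid_shard?(path)
  expanded = File.expand_path(path, ROOT)
  expanded == ROOT || expanded.start_with?(ROOT + "/")
end

valid_shard?("-shard1")    # => true
valid_shard?("../outside") # => false
```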
I think perhaps mount points (point 3) should be none of GitLab's business. The OS can handle this transparently.
If I recall correctly @northrup said he would work on this during yesterday's '8.9 Kickoff meeting'.
One thing to look for is where the string repos_path is used in the gitlab-ce code base. This will guide you to code that wants to know the on-disk path to an actual repository.
As @ayufan pointed out yesterday, gitlab-workhorse should require zero changes. Gitlab-shell would have to be told by /api/v3/internal/allowed what the exact path to a project is when handling Git-over-SSH.
Then there is the question how to migrate repositories between storage shards. In the first version this can probably be a manual process (no automatic rebalancing etc.).
Another thing we need to think about early is where to implement all this. On the one hand, supporting large amounts of storage sounds like an EE feature. On the other hand, at least some of this has to be handled in CE already, because otherwise we would have a constant source of 'repository not found' bugs in EE.
I think we will be at a constant risk of contributed code ignoring the storage sharding and breaking EE. The less of the functionality is 'real' in CE, the higher the risk.
I agree with @jacobvosmaer-gitlab, it will be easier to maintain the code if this lives (for the most part) in CE. Maybe this should be a hidden feature in CE (with really ugly configs), but fully surfaced in EE (with nice configuration, health checks of storages, easy-to-use migration between storages, etc.)?
CE randomly puts new projects in the null shard (/home/git/repositories) or a non-null shard (/home/git/repositories/-shard1). This forces the CE code to correctly handle 'legacy projects' (null shard) and projects stored in a shard.
Then EE adds:
tooling to move projects between shards
smart (non-random) shard assignment for new projects
I'm not sure about putting everything in /home/git/repositories, because it creates a lot of mess. We will have repos in groups lying in the top-level directory, but we will also have shard directories which in turn contain repos in groups.
This also introduces some unpleasant mounting issues:
You always have to mount /home/git/repositories first.
Only then can you mount /home/git/repositories/-shard.
Your externally hosted /home/git/repositories has to contain a -shard directory.
I agree with @ayufan, hosting git repos in the same directory as shard directories for mounting is a recipe for a hot mess, can we have /home/git/repositories/-local and /home/git/repositories/-shard[1-n] ?
What worries me is how to transition existing installations to a sharded layout. We cannot force everybody to change their mountpoints from e.g. /home/git/repositories to /home/git/repositories/-local . We also cannot quickly move all repositories into a subdirectory like that.
It would be bad if 'legacy' GitLab installations have everything under /home/git/repositories, and a handful of 'sharded' installs have /home/git/repositories/-shard1 etc. Then we will constantly see code contributions coming in that only works for the 'legacy' layout, which is a maintenance nightmare.
Had a call with @pcarranza and @ayufan today. We think the best way forward is something like @ayufan's initial proposal (storage shards can be anywhere, gitlab.yml maps shard names to paths on disk) and monkey patching the Gitlab.config.gitlab_shell.repos_path method to raise an exception when developers try to write code that looks up repos on disk without going through the new 'shard' API.
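A minimal sketch of that monkey patch idea (the class and the message are illustrative, not the real GitLab code): the legacy accessor raises so any code path that bypasses the shard API fails loudly.

```ruby
# Stand-in for Gitlab.config.gitlab_shell in this sketch.
class ShellConfig
  # After the patch, direct lookups blow up instead of silently
  # returning the single legacy path.
  def repos_path
    raise NotImplementedError,
          "repos_path is gone; resolve repository paths through the shard API"
  end
end

begin
  ShellConfig.new.repos_path
rescue NotImplementedError => e
  puts e.message
end
```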
My first question is regarding the interaction between gitlab-shell and ce/ee. As I currently understand it (bear with me as I get familiar with the infrastructure ;) there are two main flows:
GitLab CE/EE calls gitlab-shell (mainly using the GitLab::ShellAdapter) in several steps of its workflow. In these cases I think what we have to do is modify our calls to the shell to give it the real path accounting for shards, and modify the shell to use this full path.
gitlab-shell gets called from git over ssh. This, if I understand correctly, is what @jacobvosmaer-gitlab was talking about, where shell must call /api/v3/internals to know what's the shard of the project.
Am I seeing this correctly? I think this would mean that gitlab-shell would lose the repos_path key in its config.yml (and not have the new shard configuration replicated in its config), since it will now always get it from CE.
@eReGeBe Yes, I believe that is basically right. Right now gitlab-shell sends a POST /api/v3/internal/allowed to check for permissions. It looks like other Rake tasks use repos_path (e.g. backup/restore/import), so we would need to figure out how to deal with those cases too.
So when we use gitlab-shell over SSH, we call the allowed API endpoint and ask for the project for the given key. The path of the project should be returned from this endpoint.
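For illustration, the response might gain a field like this (the repository_path key is my assumption, not the actual API contract):

```ruby
require 'json'

# Hypothetical /api/v3/internal/allowed response once it also carries
# the resolved repository path for gitlab-shell to use.
payload = JSON.parse('{"status": true, "repository_path": "/storage/shard1/group/project.git"}')
payload["repository_path"] # => "/storage/shard1/group/project.git"
```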
Now a question about namespaces: we had discussed adding the new field indicating the shard (I'm calling it repository_storage for the moment, but we can change it) on projects, but namespaces also interact with gitlab-shell regarding the file system, through ensure_dir_exist, move_dir and rm_dir.
I can think of two options:
Add the repository_storage field to namespaces instead. This would require all projects in a namespace to be in the same shard.
Move the directory logic to projects. This means:
The project calls ensure_dir_exist on before_create.
Moving/deleting a namespace implies iterating over each project of that namespace and having it do the mv, checking whether another project on the same shard has already done it.
Difficult decision. I think setting shards at the project level is better (not forcing an entire namespace to live on the same shard). But it makes the change we are doing here harder because we have to move the directory logic to the project.
I think putting directory logic in the project is the 'right' thing to do but it is hard to oversee the impact.
The main problem with namespaces and projects is that in a few places we work at the namespace level, where we actually move a group of projects. It shouldn't be that hard, since this is done exclusively in GitLab Rails. The methods in question are implemented in Gitlab::Shell (which calls gitlab-shell for some operations, but not for these):
```ruby
# Add empty directory for storing repositories
#
# Ex.
#   add_namespace("gitlab")
#
def add_namespace(name)
  FileUtils.mkdir(full_path(name), mode: 0770) unless exists?(name)
end

# Remove directory from repositories storage
# Every repository inside this directory will be removed too
#
# Ex.
#   rm_namespace("gitlab")
#
def rm_namespace(name)
  FileUtils.rm_r(full_path(name), force: true)
end

# Move namespace directory inside repositories storage
#
# Ex.
#   mv_namespace("gitlab", "gitlabhq")
#
def mv_namespace(old_name, new_name)
  return false if exists?(new_name) || !exists?(old_name)
  FileUtils.mv(full_path(old_name), full_path(new_name))
end
```
It should be fine to replay these operations on all shards, rather than working on a specific one.
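Replaying mv_namespace on every shard could look roughly like this (shard paths here are plain directories, as in the discussion; the helper name is made up):

```ruby
require 'fileutils'

# Apply the namespace move on each shard root where the old directory
# exists and the new one does not, mirroring mv_namespace's guard.
def mv_namespace_on_all_shards(shard_paths, old_name, new_name)
  shard_paths.each do |root|
    old_path = File.join(root, old_name)
    new_path = File.join(root, new_name)
    next if !Dir.exist?(old_path) || Dir.exist?(new_path)
    FileUtils.mv(old_path, new_path)
  end
end
```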
I had a call with @pcarranza about this, then read through the issue to see the proposed solutions. I think @jacobvosmaer-gitlab's idea of storing things on DB level is the best approach as it allows easy shard management without having to do the whole Chef/Omnibus dance (also saves us from adding code to Omnibus to manage config files, etc).
It may have been suggested already but the easiest setup I can think of is as follows:
The table repository_shards contains data about each shard, in particular:
The ID
The name: a simple string (e.g. shard1), mostly for debugging purposes
The root path containing the directories (e.g. /opt/gitlab/shards/shard1), in here a set of repository directories can be stored
Path wise this means you'll end up with something like:
Note that it should be possible for repositories (in the same group) to span multiple shards. This makes moving repositories much easier as you can move them individually instead of having to move an entire group together.
Structure wise this would be (code is from the top of my head, the syntax may be invalid):
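Since the snippet itself is missing from the comment, here is my guess at the structure being described, as a Rails-style migration sketch (table, column, and path names are illustrative, and per the comment above the syntax is not guaranteed):

```ruby
class CreateRepositoryShards < ActiveRecord::Migration
  def change
    create_table :repository_shards do |t|
      t.string :name, null: false # e.g. "shard1", mostly for debugging
      t.string :path, null: false # e.g. "/opt/gitlab/shards/shard1"
    end

    # Each project points at its shard; NULL could mean the legacy
    # default path from gitlab.yml.
    add_column :projects, :shard_id, :integer
  end
end
```

Path-wise that would give a layout along the lines of /opt/gitlab/shards/shard1/some-group/some-project.git.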
For the projects table we then add a shard_id column which just points to the shards table. In the Project model we then make sure there's a method that returns the full path to the repository (either using the shard or a default path from gitlab.yml). This path can then be used by methods such as Project#repository to initialize any Rugged objects or work with gitlab-shell.
gitlab-shell in turn would have to get the shard path (or the full repository path, even easier) from the Grape API. @pcarranza mentioned adding this to the authorized keys API but this API only exists in EE so I think we'd have to either move this to CE or use a separate API. Either way it gets it from the API instead of using a path hard-coded in the config.yml file.
@pcarranza also mentioned that shards should be checked upon boot and that a boot should fail if a shard doesn't exist. I'm not sure if we should do this right from the start, but it's worth keeping in mind. If we do this we have to make sure the code in question works when the DB is disabled (e.g. when the USE_DB environment variable is set to false).
We will also need a way to lock repositories for writes. This allows us to migrate repositories between shards without having to worry about any Git changes. This can be something as simple as adding an allow_repository_writes column (boolean, default true) to projects and setting it to false when migrating a project. The Grape API call used to get a repository path can then raise some kind of error saying "Sorry, this repo is in read-only mode". Note that this would apply to both regular and wiki repositories, unless we want separate columns for each.
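The migration flow that column enables could be sketched like this (ProjectRecord and migrate_project are made-up names for illustration):

```ruby
# Toy model: a project row with its shard pointer and write flag.
ProjectRecord = Struct.new(:shard_id, :allow_repository_writes)

def migrate_project(project, new_shard_id)
  project.allow_repository_writes = false # block pushes during the move
  # ... copy the repository data to the new shard here (e.g. rsync) ...
  project.shard_id = new_shard_id         # point the app at the new copy
  project.allow_repository_writes = true  # re-enable writes
end
```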
Finally, I discussed with @pcarranza that it's probably best for me (unless somebody else wants to jump in) to take care of gitlab-shell and tying this to Rails so @eReGeBe can focus on the Rails side of things.
Any thoughts?
p.s. I may have overlooked some suggestions. Keep in mind that I'm not dismissing any suggestions, I may have honestly just been dumb and missed them :)
@yorickpeterse: I'm with @ayufan in that this feels more like system-level configuration than application-level configuration, and I prefer to have the shard paths in a config file. This would make it more straightforward to do the boot-up checks that @pcarranza talks about (it would be a small change to what lib/tasks/gitlab/check.rake currently does). I also think it's weird that depending on the case you'd get the shard path from the DB or from gitlab.yml (if you need the default one); it would be nicer to have it all in one place. However, I am not aware of the weight and pain of implementing such config changes in Omnibus, so I can't weigh it against that, and it might tip the balance.
I already have some changes in gitlab-shell that I've made to get a feeling for how the new setup would look, and I hope to have some WIP merge requests up before I finish my day so that you (or someone else) can weigh in, which I'm totally for.
@eReGeBe I think the argument of it being a system vs application level configuration is a moot point. The only time a configuration file is really needed is when you need the settings before connecting to some kind of service (e.g. database credentials). Storing these settings in a configuration file is problematic because:
Omnibus needs to manage this file which means:
We have to add changes to Omnibus
Omnibus will effectively mirror certain settings (e.g. defaults)
Whenever we want to change a setting we have to edit some Chef JSON file, commit it, have it applied to all workers, reload the workers, etc. This can take considerable amounts of time
If average Joe wants to change the shard settings they have to start mucking in some YAML file and it's all too easy to mess things up (e.g. syntax wise)
If I want to quickly see what our shard configuration is I have to somehow pull the data from a worker or from Chef.
On the other hand when using a database whenever we want to change the settings all we'd need to do is:
Create a new shard in the shards table
Run UPDATE projects SET shard_id = X WHERE ...
This could be simplified by adding a UI to GitLab, potentially even with the ability to move projects between shards without having to run any manual commands.
If I want to see our current shards I just run:

```sql
SELECT * FROM shards;
```

If I want to see what the shard path is for gitlab-ce I just run:

```sql
SELECT shards.path
FROM projects
JOIN shards ON shards.id = projects.shard_id
WHERE projects.id = ID_OF_GITLAB_CE; -- or use something like the path, you get the point
```
Instead if we're using a config file I have to either:
Note down the shard name from the projects table
Find the shard path in some random config file
or alternatively:
Start a Rails console
Get the project
Call some method to get the shard path
In general, using a DB to me is the most straightforward and user friendly approach (remember we're shipping this to everybody using GitLab, not just us) and doesn't require mucking with Omnibus, config files, Chef, etc.
Having sharded paths is a very unusual case. If we store them in the DB we are effectively moving the permission management and all mountpoint checks from Omnibus to GitLab. Changing paths should not be taken lightly, and adding shards is not something that happens often. This is something we will need to explain to our users and guide them through. I think it's best for everyone to have a single place that does that automatically.
This also seems twisted from a security perspective: if someone finds a vulnerability that allows them to change the shard paths in the DB, they can start trying to exploit it. I know this is a moot point, but then the database is your only point of truth, and it's basically impossible to sanitize your paths and what you are accessing on your server.
As for the Omnibus changes, as @yorickpeterse said, this is mostly a rewrite of the gitlab.rb settings, and implementing it is something like 10-20% of the time compared to what needs to be done on the GitLab Rails side.
Which is less work regarding shards: using a table, or Omnibus?
I'm OK with both approaches. I kind of like decoupling the path from the database, because that would let me move things around without having to update records at all, which means using configuration files; but I am open to whatever the majority decides.
I've created WIP merge requests https://gitlab.com/gitlab-org/gitlab-shell/merge_requests/61 and https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/4578. These use the config approach, but please don't take that as me trying to push that option; it's just the code I started with, and I wanted to put it out there to get feedback as soon as possible and see if I'm heading in the right direction. It's relatively easy to swap between both approaches in the current code, whichever we choose.
I'm close to finishing the refactoring part, I think. The only bump I found was that gitlab-shell wanted to know the repository path before calling /internals/allowed, so getting the path there as we had planned was not going to work. I added a new endpoint for that.
Tomorrow I'll do the shard assignment/reassignment. Some recap and questions on that:
CE:
Random assignment (so, to clarify: if shards is an array with all the paths, a shards.sample kind of logic?). Do we add a read-only field somewhere (where?) to show a project's shard?
EE:
Ability to move projects from one shard to another. I think this would work as a new section called "Move repository" in the "edit project" page with a dropdown listing all the shards.
Smart assignment: So what do we actually mean by that? Check storage usage stats on all shards to choose? Round-robin?
Also, how to work on the EE stuff if this hasn't been merged into CE?
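The CE random assignment mentioned above could indeed be as small as this (the paths are examples):

```ruby
# All configured shard paths (example values).
shards = [
  "/home/git/repositories",
  "/home/git/repositories/-shard1",
  "/home/git/repositories/-shard2",
]

# Pick one at random for a new project.
new_project_storage = shards.sample
```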
@eReGeBe I agree with @ayufan . Could there be a way around this? Ping me on Slack if you want to look at the gitlab-shell code together. I looked for 10 minutes but I don't see where we need the full path early on yet.
@eReGeBe Regarding random/smart assignment, at this point in time I'm OK with just having the ability to declare "use this shard from now on" at the application level.
What I mean by that is that I want to stop creating repos in the same old storage as we have been all this time, and start using a new one.
@eReGeBe I understand that it should be an EE feature. But it may make sense to keep some of the plumbing in CE to make it easier to maintain the feature over time.
I made https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/4657 for the application settings bit, as a separate MR for ease of review. In that branch you can actually use the feature with several repository paths. I've been testing it myself and it seems to work great.
We should decide which way we're going to go with where to save the shard paths (.yml or database model) to update the docs and omnibus if necessary.
As discussed in the last infrastructure call, https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/4578 is now updated with a DB approach. There are a couple of things I'm not quite happy about, specifically the migration and testing (you'll notice I had to add stub_default_repository_storage to many tests for them to work), so any input there is appreciated. I integrated the changes of https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/4657 into the former MR, so I closed that last one.
@sytses I would argue it should be "storages" as in the configuration key (repositories.storages) to indicate the plurality of the storage paths. But I can see the confusion between reading "storage" as "the action of storing" vs "the places to store". Let me know if my linguistic argument swayed you or if you'd still like it to be "storage".