Description including problem, use cases, benefits, and/or goals
Projects, especially those using LFS, can grow very large. Storage space is limited, and on a large instance growth can be hard to control.
Proposal
We will have a limit per project, per group and globally.
This should count the total data of the project, including LFS storage.
Specification
Users can set a limit to project size at a project level, group level and global level.
By default it's unlimited at every level.
When setting a limit at either the group or global level, we check if the group or the instance has projects that are above this limit.
If there are, we warn the user about it but they can still save this limit.
This is a hard limit. That means we will block git push to projects that are over the limit.
Administrators (and only they) can override both group and global limits by manually entering a value at the group level or at the project level.
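To make the precedence concrete, here is a minimal sketch of how the effective limit could be resolved, assuming nullable repository_size_limit attributes on the project, its group, and the global application settings; the names are illustrative, not an existing API.

```ruby
# Illustrative only: attribute and accessor names are assumptions, not an existing API.
# nil is treated as "unlimited" at every level, matching the defaults above.
def effective_size_limit(project)
  project.repository_size_limit ||
    project.group&.repository_size_limit ||
    ApplicationSetting.current.repository_size_limit
end

def over_limit?(project)
  limit = effective_size_limit(project)
  !limit.nil? && project.repository_size > limit
end
```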
UI changes
admin/application_settings > Account and Limit settings: Add a project size limit field.
Add/Edit group: add ability to override the limit through the UI
admin/groups/<group-name>/edit: ability to override the global "per project limit" for this group
Error messages in the UI when there is no available space anymore:
Unable to commit through the UI
Unable to proceed with merging
List of projects admin/projects: if a global or group limit is set, indicate the size of the project and the max allowed size.
Backend changes
Obviously, we can't allow negative values for the project size limit.
By default, project size is set to unlimited.
Forks will not inherit the project limit if one is set in the original project.
New API calls to be able to override the default project size limit on a group level, and at the project level.
Error messages when doing git operations from the CLI if there is no available space anymore
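As a rough illustration of the push-blocking behaviour described above, a check along these lines could run before a push is accepted; the class name, methods, and error message are assumptions, not GitLab's actual access-check API.

```ruby
# Sketch only: assumes the project exposes repository_size and an
# effective size limit in MB, where nil means "unlimited".
class RepositorySizeCheck
  SizeLimitError = Class.new(StandardError)

  def initialize(project)
    @project = project
  end

  def validate!
    limit = @project.effective_size_limit
    return if limit.nil?

    if @project.repository_size > limit
      raise SizeLimitError,
            "This repository has exceeded its size limit of #{limit} MB. " \
            "Pushes are blocked until the size is reduced or the limit is raised."
    end
  end
end
```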
Wireframes
Administration panel
Edit a group in the administration area (admin/groups/<group-name>/edit)
Creating/editing a group (groups/new)
List of projects
Design needed
List of projects: as shown in the wireframe above, next to each project in the admin view, indicate the space it takes as well as the available space (like 2.32 MB out of 10 GB).
When setting a project limit globally in the admin panel, what would the warning message look like ("You have 3 projects over the limit you are trying to set")?
Visual of a message that appears when we can't commit through the UI or proceed with a merge ("Your repository exceeds the limit of XXX and as a result you will be unable to push to this repository"). I don't know if we have something for that already.
I think it does not make much sense to set one limit for all repositories in total; you can enforce such a limit via a partition or some other means. I think we are talking about a limit per project, which will include LFS. So, for example, a project can't be bigger than 20 GB.
Do we already monitor the available space when doing anything through the UI and display error messages accordingly? Or is it something that we need to take into consideration and carefully implement with this release?
Do we have a minimum storage size (I mean, it can't be 0 MB)?
I suppose it's 1 MB.
It would be nice to indicate a default value for this field - basically, our recommendation.
I think default should be infinite.
@regisF I think what is lacking is the admin => groups view, there should be an option to change the size as well (and report what the current global max is). Other than that, it looks good.
As an admin of a GitLab instance with a lot of students on it, this is a really important setting for us and I'd like to see activity on this. Just some minor ideas from my viewpoint to make this really meaningful:
Admin should be able to set global per project size limit (default: unlimited)
This means that if nothing else is set, this size limit is enforced for each individual project. (So this is not the sum of all projects but each project's max size. As you can set the max # of projects per user, you can have an upper bound.)
What about existing projects? (It would be cool if that setting applied automatically so I don't have to go through 100s of repos and set it manually.)
Each repo / group owner should be able to opt for even lower limits (less important, so additional feature for later maybe?)
admin / group owner might change settings further "up" in the hierarchy, so this lower bound would need to be re-checked live or any changes further up in the hierarchy should re-bound these user settings
Admin should be able to override global defaults for certain projects / groups
Precedence should be like this IMO (first that exists is enforced):
1. User project specific setting: remember this and min(this, eval_first_match(2-6)) (additional feature?)
2. Admin project specific setting
3. User group specific setting (EE, low prio): remember that and min(that, eval_first_match(4-6)) (additional feature?)
Thank you @joernhees for your analysis and questions. Again, we want to keep things as simple as possible. We like to introduce features in their Minimum Viable state as defined in our Product handbook, and iterate on those features as we see the community use them more and more. I'll take note of the point about existing projects, though, and will update the issue's body accordingly.
@regisF fully understood... mainly wanted to clarify the different options and show how obvious (optional) extensions could be integrated.
Minimum viable for me would include (in the precedence numbers from above) 5 and 2:
5 (admin's global limit), as it's crucial for us to be able to limit somehow
2 (admin's project specific limit), as otherwise one can't set reasonably low default limits in reality without hindering some exceptionally large repos
Example: student projects should by default be < 100 MB (code only), but the LaTeX lecture with a lot of slides and pictures should be allowed to grow to 1 GB... If there is only 5, I'd need to grant all the students the same 1 GB?!
The remaining 1, 3, and 4 would be nice, but they're not really crucial for effective size limiting. Without 1 and 3, the precedence becomes a lot easier to implement and only needs an int per project, per group, and globally.
@DouweM could we try to let the admin set a default minimum and set it in every project, and if he needs less/more memory in a specific project he would set it manually in each project? E.g.: default 100 MB and 2 projects manually changed to 4 GB. Also, if the admin already has an existing project with 3 GB, should we not let the max memory go below that, or should we set a custom max memory for that project only and set the requested admin default memory limit for the remaining projects?
Also, if the admin already has an existing project with 3 GB, should we not let the max memory go below that, or should we set a custom max memory for that project only and set the requested admin default memory limit for the remaining projects?
@tiagonbotelho as an admin I'd actually be totally fine with all further repo actions being blocked on repos which are larger than my selected limit. I can quickly find them in the project overview by sorting by size.
could we try to let the admin set a default minimum and set it in every project
I think it would be cleaner to implement this by actually spending 1 int per project, 1 per group and 1 globally.
could we try to let the admin set a default minimum and set it in every project, and if he needs less/more memory in a specific project he would set it manually in each project?
We don't set a minimum. By default it's infinite. This feature will set a global maximum per project.
However, you raise a good point: if we define a limit globally or per group, and an existing project is bigger than the limit we are trying to set, we should warn the user that they must enter a bigger limit. I'll update the issue body with this use case.
Unless we also implement the ability to override the limit per project, but I'm not sure if we should do that at this time @JobV.
An instance limit is also implicitly given by # of users * # of repos per user * repo size limit.
@elidevco If you want to restrict the total size the GitLab instance can use, put it on a dedicated mount point or a VM? It's clearly an OS task to me, as I wouldn't trust any application to actually enforce that limit. It's a different question how gracefully GitLab handles surpassing the free disk space, though.
@regisF the per-project limit we can save for a future iteration, unless we manage to ship it in this release. But it makes sense. Only admins should be able to change it.
@job the biggest problem to me for CE is checking that every project does not surpass that memory threshold when the admin sets that value, because that would mean we would need to check the entire database to see the memory of each project, no? @yorickpeterse what do you think?
When setting a max size globally or at the group level, if an existing project has a bigger size than the limit we are trying to set up, we should warn the user about that and prevent her from setting that max size limit.
This will be tricky. Assuming projects.repository_size is accurate, this could be used for a query such as:
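(Illustrative ActiveRecord sketch only; the exact query is an assumption, with the limit expressed in MB like repository_size.)

```ruby
# Counts projects whose recorded repository size exceeds a proposed limit (in MB).
proposed_limit_mb = 1024

Project.where('repository_size > ?', proposed_limit_mb).count
```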
However, chances are this query will take quite some time to run as the number of projects goes up. As such I think it's better to validate only upon pushing to a repository.
Part of this is also discussed in https://gitlab.com/gitlab-com/infrastructure/issues/302. We really need to start enforcing this in the coming months as we have a few too many repositories taking up a lot of data (see said issue for examples).
@yorickpeterse imagine that the 124 GB repository stated in the issue you linked is inactive; it would not receive any warning whatsoever, right? Can't we make a side job to work with this query, and we would email those projects regarding this issue?
@yorickpeterse The feature we are building here could be used for the infrastructure issue you mentioned. However that raises a new question: should this limit be a hard limit? I think it should.
@tiagonbotelho I still think that regardless of the time it takes to make the query, we should do it, because that will let admins know if the limit they are trying to set is realistic. Otherwise, how would we deal with an account having a limit and some projects being over the limit? With this hard limit, how administrators deal with repositories that are beyond the limit they want to set is up to them.
So the question for you @yorickpeterse is: does making the query really take much time? If it does, we might have to change the flow a little bit in order to make sure the UI still works while we have to wait for the data to be displayed.
@tiagonbotelho Yes, we can still do something in the background (periodically or not). The key is to not delay the user submitting a form because the code is validating millions of projects.
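For example, the validation could run in a background job so the settings form saves immediately; the worker name is hypothetical and this assumes Sidekiq-style workers.

```ruby
# Hypothetical worker; walks the projects table in batches instead of
# loading millions of rows while the admin waits on a form submit.
class ProjectsOverLimitWorker
  include Sidekiq::Worker

  def perform(limit_mb)
    Project.where('repository_size > ?', limit_mb)
           .find_each(batch_size: 1_000) do |project|
      # Record or surface the offender; what happens here is up to the admin's workflow.
      Rails.logger.info("Project #{project.id} exceeds the proposed #{limit_mb} MB limit")
    end
  end
end

# Enqueued after the settings form is saved:
# ProjectsOverLimitWorker.perform_async(new_limit_mb)
```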
The feature we are building here could be used for the infrastructure issue you mentioned. However that raises a new question: should this limit be a hard limit? I think it should.
Yes. Once you cross the limit you will not be able to push any new changes. However, I think you should be able to remove changes. This may be tricky to implement and it may be possible to abuse it somehow. As such, as a start, we can just implement a hard limit that blocks any pushes upon exceeding the limit.
So the question for you @yorickpeterse is: does making the query really take much time?
Getting a list of repositories larger than 1 GB takes around half a second:
```
gitlabhq_production=# explain analyze select count(*) from projects where repository_size >= 1024;
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=106323.85..106323.85 rows=1 width=0) (actual time=467.497..467.498 rows=1 loops=1)
   ->  Seq Scan on projects  (cost=0.00..106307.10 rows=6698 width=0) (actual time=0.052..464.325 rows=5140 loops=1)
         Filter: (repository_size >= 1024::double precision)
         Rows Removed by Filter: 1157022
 Total runtime: 467.525 ms
```
Note that this just counts the list of projects. If you want anything more (e.g. the columns) it may take a longer time to query the data and load it into Ruby.
Personally I think one should be able to increase the limit without being blocked by any existing repository sizes. Imagine the situation of GitLab.com: we deploy this feature and as such no limit is in place yet. Now we want to start enforcing a limit for any newly created repositories (e.g. 10 GB).
In the setup proposed we will not be able to do so until we first somehow remove or reduce the size of existing repositories. This in turn introduces a race condition: once we have addressed the existing repositories we may still not be able to adjust the limit, as repositories may have grown larger than our intended limit in the meantime. Given a popular enough GitLab instance, this effectively prevents you from ever enforcing a limit.
In other words, the only way to prevent administrators from running in circles is to:
Just set the limit, ignoring any existing repository sizes
Enforce the limit upon a push
Periodically Email users that have repositories exceeding the repository size limit
What you could do is show a warning that still lets the user proceed. Having said that I don't think the warning "There are 120381902382093 projects exceeding this limit" is going to be very useful as it provides no information about which projects. Actually getting the project data in turn will not scale given a list of enough projects.
Thanks for the input, everyone. I am thinking of doing it the way @yorickpeterse suggested, and for the emails I think it's possible to make a job for finding the respective projects and emailing them, no?
And we should probably make a blog post about this, explaining to the general audience that we now impose a size limit on each project. What do you think, @JobV?
@tiagonbotelho I'm not so keen on the idea of having GitLab instances sending emails on behalf of administrators, especially since they won't have any control over it.
I have the feeling that allowing projects to override the global/group limit would help solve exceptions. Can we do it for this release?
We should probably provide the list of projects that are over limit the moment we display the warning message (when we are trying to set the limit). That would let administrators warn the users themselves, with their own internal processes.
I'm not so keen on the idea of having GitLab instances sending emails on behalf of administrators, especially since they won't have any control over it.
We'll need some automated way of sending Emails as we will need it for GitLab.com. We shouldn't have to piece together a one time shell script of sorts for this.
I have the feeling that allowing projects to override the global/group limit would help solve exceptions. Can we do it for this release?
This wouldn't solve the loop problem. You may set a higher limit for existing projects, but there's nothing preventing new repositories being created before applying the global limit (and with those repositories exceeding the limit).
We should probably provide the list of projects that are over limit the moment we display the warning message (when we are trying to set the limit). That would let administrators warn the users themselves, with their own internal processes.
I don't think displaying this in the same message is a good idea as the list can be very large (thousands, if not tens of thousands). It's better to display these on a dedicated page that paginates the projects. Here a user could then take actions such as notifying the users, increasing a limit, whatever.
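A dedicated page like that could be a simple paginated index; the controller, setting, and attribute names below are assumptions (Kaminari-style pagination), not an existing GitLab route.

```ruby
# Hypothetical admin controller; attribute and setting names are assumed.
class Admin::ProjectsOverLimitController < Admin::ApplicationController
  def index
    limit = ApplicationSetting.current.repository_size_limit

    @projects = Project.where('repository_size > ?', limit)
                       .order(repository_size: :desc)
                       .page(params[:page]) # Kaminari pagination
  end
end
```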
@yorickpeterse do we need emails in this context? I don't think so. Because how administrators will deal with projects that are over limit will be different for every company, depending on their internal policy. In that regard, what do you plan the email to say?
You may set a higher limit for existing projects, but there's nothing preventing new repositories being created before applying the global limit (and with those repositories exceeding the limit).
Ok what about this then? This feature:
will prevent new repositories from being over the limit
will warn administrators if existing projects are over limit (allowing them to either set a higher limit, or to take actions to address repositories that cause problems before enforcing the limit)
if we allow projects to override the global limit, the warning is just about:
the fact that projects won't be able to accept pushes if they are over the limit
the fact that they have projects over the limit - not something that would block the limit from being set.
Then, administrators will have to go through the list of projects in the admin area to see which projects are over the limit, and change this limit accordingly (if they want to).
@tauriedavis thanks for asking. Yes, it does, for two small but important things. Perhaps they don't need a specific visual treatment but I'd like your input.
List of projects: as shown in the wireframe above, next to each project in the admin view, indicate the space it takes as well as the available space (like 2.32 MB out of 10 GB).
When setting a project limit globally, what would the warning message look like ("You have 3 projects over the limit you are trying to set")?
do we need emails in this context? I don't think so. Because how administrators will deal with projects that are over limit will be different for every company, depending on their internal policy. In that regard, what do you plan the email to say?
I was thinking of something simple along the lines of "Your repository exceeds the limit of XXX and as a result you will be unable to push to this repository. Please contact your GitLab administrators for more information".
will prevent new repositories from being over the limit
Every repository, existing or newly created, should honour the limit.
Should we also block all git operations from the UI when we are over limit (merge, branch, edit a file, commit)? I think we should, just want to confirm.
What happens when we fork a project? Do we keep the limit if it is set at the project level? I'd say limit should be applied to the fork as well. Confirm?
About the emails that are sent:
does it require a new port to be opened in instances in order to send the emails?
who should the From email address be?
can we let the administrators configure the default message?
Should we also block all git operations from the UI when we are over limit
Yes.
What happens when we fork a project? Do we keep the limit if it is set at the project level?
I think forks by default shouldn't inherit the limit. If we inherit the limit one could abuse the service by forking a very large project (with a custom limit) a lot of times.
@yorickpeterse so far that is what I did :) I am now blocking issues/MRs/files. If anyone has any suggestions on how to refactor my code, help will be appreciated!
Are we limiting the whole project's size? Like, do we also count its size in the DB (issues, MRs, etc.)? Do we count the wikis as well (they are Git repos, after all)? If this is only about the repo.git size on disk, then the feature should be renamed from "Project size limit" to "Repository size limit" to better reflect our terminology.
can we let the administrators configure the default message?
I don't really see why this would be necessary. A stock text should suit most users.
One other thing worth considering, though: do we want to send thousands of emails when we have many hundreds of thousands of projects? This is a way to email bomb an instance. I think we should consider focusing on: 1. printing a message on git push and 2. having a sort of banner that the user can click X on to acknowledge. Only think about emails if really necessary at some point.
@marin good point about the email bomb. One more argument against it, yes! :-)
The need for emails emerged at one point in this issue, but we need to take a step back here. Why do we want to send an email in the first place? To warn project owners. To do that, we will have git push error messages, so users will know when they are over the limit.
A banner to warn users of a repo that is over limit is another great idea, @marin.
To sum up, here is what we will have for users to be aware of the limit of a repository:
@regisF I gave the UX a go and I came up with these:
Project list
For the project list we already have a badge with the project size, so I took those and gave them a different style. If a new limit is set and there are projects that exceed that limit, they are shown with a red badge in the list. We can also show projects that are close to reaching the limit in orange.
We could include a new option to the sorting dropdown that sorts projects according to how much of their allotted space they have used. That way you could see all the projects that exceed their limit at the top.
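A sorting option like that could be backed by a scope that orders projects by the fraction of their allotted space they have used; the column names here are assumptions, not existing schema.

```ruby
# Hypothetical scope; assumes repository_size and a non-zero repository_size_limit
# stored in the same unit (MB).
class Project < ActiveRecord::Base
  scope :order_by_space_used, -> {
    where.not(repository_size_limit: [nil, 0])
         .order(Arel.sql('repository_size / repository_size_limit DESC'))
  }
end
```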
Edit project
I added a dropdown to the size field so users don't have to calculate the size they want in MB.
Edit project warning
I thought the style of the warning box at the top of the page was a good way to show an important message, but having two orange boxes on the same page was going to defeat the purpose of even having them. So I changed the original box and gave it the style of the warning/tip box we have on other settings pages. If it's important to keep the original style, I can look for a different approach.
I included the size of the projects in the warning message, so the admin can see that by adding another 100 or 200 MB they could accommodate all projects, for example.
Project warning
When your project exceeds the limit you see a flash message that lets you know what happened and how to solve it. Maybe the suggestions I included aren't accurate but I think we should let users know how they can get out of this situation.
We could also show a message when you are close to filling up your space so you can do something about it before you go over the limit.
CE would have a global limit, whereas EE would allow group level limit as well as per project limit.
The concern is about providing only a global limit (on CE):
projects that are over limit can't be pushed to
users can't reduce the repository size with Git itself and push the new, reduced code, because they won't have the right to push or force push at all. Git is different from a traditional file system where we can delete files and regain free space.
we can't know in advance if the code that is pushed will make the project become over limit.
therefore once a project is over limit, there is no way to recover from this situation, unless the administrators decide to raise the limit for every project in the instance.
a project over the limit can't accept new code. If users can't push new code to a repository, that can be detrimental for them (a lot of work lost, etc.).
if the project owner moves their repository to another Git host, they will lose all the issues and commits related to the code.
that will lead to frustration for the users and therefore for the administrators, because of the pressure from these users.
These are pretty important concerns.
One could say that the benefit of providing a global quota per project outweighs the concerns mentioned above, as long as we provide information on the project itself about its size (which we don't in the current state of this issue), so the user is aware of the space available. I think many organizations (especially schools and universities) would like to be able to have project quotas at an instance level.
@regisF I prefer to prioritize the EE-only side of this and leave this as is or accepting MRs. I don't really see the point of shoehorning a CE feature when it's not necessary, yet creates potential problems.
Have you considered implementing "soft" and "hard" quotas, roughly similar to the way in which they've been implemented on multi-user UNIX systems? In summary, this means you have a threshold beyond which users are warned, "Hey, you're approaching your quota limit of X. Please archive or delete some files. If you hit the limit, you will not be able to push any new changes to the repo." I am pretty sure my user base would respond warmly to this, and I'd especially like to be able to adjust the warning threshold at a group or repo level.
As you are no doubt aware, removing files from a Git repository (or LFS endpoint, for that matter) isn't especially straightforward with current tooling. I look forward to seeing how GitLab handles this, either within the app itself or via a really well-written help page that quota offenders get referred to as they approach their limit.
Another benefit of implementing a warning threshold is it gives you another set of events that can trigger logic (adding a "Shrink this repository!" todo event to devs / maintainers? If you wanted to regulate pushes this would also be a great time to add/change/remove a Git repo hook that can intelligently handle "near-quota" or "at-quota" behavior).
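A minimal sketch of that soft/hard split, with assumed names and thresholds rather than an existing GitLab mechanism:

```ruby
# Hypothetical: classify a repository against a hard limit plus a soft
# warning threshold (here 80% of the hard limit). nil means "unlimited".
def quota_status(repository_size, hard_limit, soft_ratio: 0.8)
  return :unlimited if hard_limit.nil?

  if repository_size >= hard_limit
    :over_hard_limit   # block pushes
  elsif repository_size >= hard_limit * soft_ratio
    :near_limit        # warn, add a todo, adjust hooks, etc.
  else
    :ok
  end
end
```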
Sorry to butt in on this but I thought I'd offer my two cents as an EE admin dealing with large LFS and non-LFS repos. I love this feature, I want this feature, and I will use the heck out of this feature however you end up implementing it.
@regisF I'm not sure if I understand your comment correctly... as MVP was mentioned before: as an admin of a self-hosted CE instance, the following would work for me "for now":
global limit
if surpassed, users can't help themselves anymore, which forces them to come to me for a chat
per project limit:
allows me to increase their project's limit a bit, so they can for example force push old stuff away
Obviously, long term it's desirable to have soft and hard quotas (as mentioned by @leftathome), and a way for users to help themselves (shrink their repo sizes) once they surpass any of the limits, but I guess that can wait.
Also note that it might be pretty difficult to reduce repo sizes later on, as currently your merge request references might effectively hinder that (even for admins, even with direct gitlab instance disk access): https://gitlab.com/gitlab-org/gitlab-ce/issues/19376
That's pretty unfortunate for all universities... any way for you to reconsider moving this into CE? Without this, every hosted CE instance is basically vulnerable to a DoS from every single user who can create or push to a repository.