Following the discussion here: https://gitlab.com/gitlab-org/gitlab-ce/issues/26897 it seems that artifacts are growing significantly and are becoming harder and harder to manage. We should use CarrierWave and give it the ability to point to external storage (S3/AWS).
Documentation blurb
As companies continue to embrace CI/CD across the organization, their artifact storage needs naturally increase as well. With GitLab 9.3 we are proud to announce that CI artifacts can now be saved to object storage, like Amazon S3 or Google Cloud Storage. Leveraging these cloud services enables artifacts to be saved cost effectively, reliably, and with nearly infinite scalability.
@pcarranza @markpundsack
I expected this issue to be a one day fix, however, things are a little more complicated than I hoped.
Right now, we can pick one of two routes:

1. Move all files to object storage
2. Archive artifacts to object storage after 30 days
The downside of the first option is that the 'Browse' feature for artifacts would no longer work, because the metadata is needed locally to determine the tree to browse. The simple solution seemed to be keeping the metadata.gz locally, putting the artifacts.zip on S3, and being done with it. However, this is not an option either, as you then can't download individual files: Workhorse needs access to the .zip for that. We could download the artifacts.zip in every case, but doing this for each request doesn't seem rugged, and thus not viable.
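To see why 'Browse' depends on local metadata: the tree view can be rebuilt from the metadata alone, without touching artifacts.zip. A minimal sketch, assuming a simplified gzipped path list rather than GitLab's actual metadata format:

```ruby
require 'zlib'
require 'stringio'

# Simplified stand-in for artifacts metadata: a gzipped list of entry paths.
# GitLab's real metadata.gz is richer; this only shows the principle that a
# browsable directory tree can be rebuilt from metadata alone.
def write_metadata(paths)
  buf = StringIO.new
  gz = Zlib::GzipWriter.new(buf)
  paths.each { |p| gz.puts(p) }
  gz.close
  buf.string
end

# Build a nested Hash representing the browsable tree from the metadata blob.
def build_tree(metadata_blob)
  tree = {}
  Zlib::GzipReader.new(StringIO.new(metadata_blob)).each_line do |line|
    node = tree
    line.chomp.split('/').each { |part| node = (node[part] ||= {}) }
  end
  tree
end

blob = write_metadata(['bin/app', 'bin/tool', 'doc/readme.md'])
tree = build_tree(blob)
# tree => {"bin"=>{"app"=>{}, "tool"=>{}}, "doc"=>{"readme.md"=>{}}}
```

Downloading an individual file, by contrast, requires reading inside the .zip itself, which is exactly the part Workhorse can no longer do once the archive lives on S3.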
Archiving the files after X days would be a logical second option, but it only partially solves the issue, as we would still need NFS storage for 30 days of artifacts. Also, this flow is not a common one, so while CarrierWave nominally supports it, the gem might break compatibility for our flow: we would need tricks to work around its limited API (for this use case) or contribute the missing pieces upstream.
I already tried the first option, hence it is described in more detail; the second one might have problems I don't know about yet. But before spending too much time on this, in case someone wants to go another route, I'd like your opinions.
I think that browsing artifacts would still be useful, even if we won't be able to download a single file. Would it make sense to persist the metadata objects, which are really small, on NFS and move only the artifacts to S3?
Ok, to give some sense of the urgency here, I'll share what's going on from the production side of things.
We currently have 1 server that we are trying to slice down
This server has approx 27TB stored, is receiving a bit over 1TB per week, and has up to 33TB total of space, mapping to the 6 weeks that @ernstvn talks about.
This means that it is 82% filled.
The main load on this server comes from artifacts. In the effort to take it down and slice it up, we copied them all to a new server (artifacts01) and discovered that as of last week we had almost 24TB of them, so our assumption is that the main load is coming from here.
We can still attach some more drives to this host (I think up to 48, but don't quote me on this; we are at 32 already), but after that we have a hard stop for artifacts growth. Given that adoption is in fact accelerating, I think we seriously need to take care of evicting old artifacts and allow infrastructure to move off the NFS servers onto object storage that we can actually scale, or even outsource to Azure object storage or S3 (removing the problem completely and making it even cheaper).
Next problem in line
In addition to this, another thing we need to review: copying the pages folder has been taking 18 days so far (the joys of small files), and I expect it to take at least another 2 weeks (we are currently copying files from 201607, and as we move forward in time there are just more files).
For this problem we need to evict old stuff - a cleanup process.
I will come back with more data as soon as we can.
The underlying issue here is that we seriously need to consider what we are doing with files, because this is reaching an unmanageable size.
I did come up with https://gitlab.com/gitlab-org/gitlab-workhorse/merge_requests/148, but it is not working as expected. It will probably work, but the trickier part is the code on the Rails side that has to interact with CarrierWave. CarrierWave is not designed to have the cache stored locally with the data sent asynchronously. The current implementation makes the upload to object storage happen in Unicorn.
@ernstvn @ahanselka This is not going into this release, as we don't yet have a solution where we can offload the heavy lifting to anything other than Unicorn.
We've investigated both changing Workhorse and changing the runner. Sadly, neither seems viable at the moment, but I'll discuss it with Kamil sometime next week.
OK, please keep chipping away at it, since moving to object based storage is something we will need from the Production infrastructure side of things / to control costs.
Just to clarify, is the "heavy lifting" a one-time thing as we do the move? Or is it an ongoing heavy lift afterwards? Are we OK with unicorn handling the "steady state" flow after the move to OBS is complete? If so... can we consider beefing up unicorn for the transition?
Just to clarify, is the "heavy lifting" a one-time thing as we do the move?
No, that would be the actual uploading of each artifact.
The reason I wanted to perform this with Unicorn is that it seemed the simplest and even the most boring solution. But let's consider the following case:
- The runner is on DO
- The server is on Azure
- S3 is the object store
Within 60 seconds we have to download the full file, do some minor computation, and re-upload it to the object store. If we time out because the file is xGB, the runner will actually retry this twice more. As an added bonus, this all happens over TCP between different datacentres. Given that we could have multiple runners doing this at the same time, this could kill .com. We briefly looked at a gem which downloads only in Unicorn and then pushes the upload as a job to Sidekiq. Even though the gem seemed to be of decent quality, it looked unmaintained, and I don't think we should 'adopt' it only for this problem.
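The pattern that gem implemented, accept the file in Unicorn and hand the slow object-store transfer to a background job, can be sketched with an in-process queue standing in for Sidekiq and a local directory standing in for the bucket (all names here are illustrative):

```ruby
require 'fileutils'
require 'tmpdir'

# Sketch of the deferred-upload pattern: the web process only persists the
# upload to fast local storage and enqueues a job; a worker later pushes the
# file to object storage, so the request cycle never blocks on the network.
class ArtifactOffloader
  def initialize(remote_dir)
    @remote_dir = remote_dir
    @queue = Queue.new
    @worker = Thread.new do
      while (path = @queue.pop) # a nil sentinel stops the worker
        FileUtils.cp(path, File.join(@remote_dir, File.basename(path)))
      end
    end
  end

  # Fast path: safe to run inside the request cycle.
  def enqueue(local_path)
    @queue << local_path
  end

  # Flush the queue and stop the worker (for this demo only).
  def shutdown
    @queue << nil
    @worker.join
  end
end

remote = Dir.mktmpdir # stands in for the S3 bucket
local  = Dir.mktmpdir
file = File.join(local, 'artifacts.zip')
File.write(file, 'zipdata')

offloader = ArtifactOffloader.new(remote)
offloader.enqueue(file)  # returns immediately
offloader.shutdown       # in production Sidekiq would own this lifecycle

uploaded = File.read(File.join(remote, 'artifacts.zip'))
```

The catch described above remains: something still has to hold the file locally until the background transfer finishes, which is why the unmaintained gem was not a free win.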
What do people think about reserving this as a gitlab-ce~2278657 feature? Since storage is strongly correlated with project and user counts, it aligns nicely with our pricing plans.
SMBs and trial users would likely want to stay with local storage anyway, since that is the boring and simple solution, and their storage needs are easily met.
Then the companies with storage demands large enough to make the extra complexity worth it would naturally be our target EEP base. Plus I believe this has nice synergies with GitLab Geo, and that is EEP already anyway!
@joshlambert That depends on the changes we need for the runner and Workhorse, I believe; we don't release EES or EEP versions of those yet. Maybe we should first come up with a plan to implement this, and then get back to this question?
Thanks for the quick response @zj. One aspect I didn't see is where this is configured. Is it in the admin UI, gitlab.rb, or somewhere else? If it is in the UI, we could then just not expose it and leave all the other functionality available.
I don't think that would be wise, because changing the values might break all artifact downloading and uploading, potentially breaking all new pipelines as well. Also, we don't provide a way to move all the files either, so the settings shouldn't be too easy to set imo.
Thanks @zj. There is certainly still the open question of whether this belongs in EEP, regardless of technical challenges. I personally think it should, as this is a feature that very large companies, resellers, or other multi-tenant-like systems will want/need. SMBs can likely get by with the boring solution of local storage.
Continuing to discuss the technical side though:
Is Omnibus the same between EE and CE? Could we just add a validation rule there?
If we try to keep that the same, what code actually reads this config from gitlab.yml? Is it Workhorse or some other component which doesn't currently have an EE flavor?
In 9.2, for Disaster Recovery, we'd like to replicate artifact data (https://gitlab.com/gitlab-org/gitlab-ee/issues/2134). We need to coordinate to make sure we can actually do it, even with this current issue.
@joshlambert @zj I just read your comment about Omnibus. We currently have no way of checking whether a feature in Omnibus is EE or CE, but in this issue https://gitlab.com/gitlab-org/gitlab-ee/issues/654 we are trying to find a way to do this. Please ping us as you move forward, so we can align our efforts, especially if you want to have this for 9.2 - something we are not yet ready to do.
Hi @regisF, I think moving artifacts into object storage actually makes DR much easier... S3 for example is designed to deliver 99.999999999% durability. (Basically nearly zero chance of losing data.) We'd need to work through the technical details of course, but we should no longer have to worry about data replication and synchronization, etc. Customers could also leverage cross region replication in S3 if they were worried about availability, as well.
For where the EEP license validation goes, @zj can you comment on where we are making changes to support this?
Our current implementation uses CarrierWave to upload artifacts. It will consist of these changes:
Current status
1. We have the Rails part (CarrierWave) done,
2. We have a preliminary implementation for Workhorse,
3. We are missing the Rails implementation that adds asynchronous upload: artifacts/authorize returns a pre-signed PUT URL, and artifacts moves the file to its final location and removes the old one.
Point 3 is tricky, as this is not the normal way of interacting with CarrierWave, and it is complicated to achieve with a minimal amount of hacks while making sure it doesn't break over time.
Rails
Make CarrierWave use external storage (S3). The configuration will be read from gitlab.yml:
```yaml
artifacts:
  enabled: true
  object_store:
    enabled: false
    provider: AWS # Only AWS supported at the moment
    access_key_id: XYZ
    secret_access_key: XYZ
    bucket: docker
    region: eu-central-1
```
When artifacts are accessed, we generate a pre-signed public URL that allows downloading the file for a limited time.
Without performance improvements, this makes artifact uploading happen during file saving, which is a no-go for large files.
Workhorse & Rails
The second part offloads the upload to Workhorse. When uploading artifacts, Workhorse would receive a pre-signed PUT URL that is used to store the file externally. Since this happens in Workhorse, the cost of storing is negligible. The file is stored in a temporary location, which is then sent to Rails to "finalize the upload": Rails executes an extra operation to copy the file from the temporary path to the final one, and the temporary file is then removed.
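A minimal sketch of that finalize step, with local paths standing in for object-store keys (the names are illustrative, not the actual API; with S3 this would be a server-side COPY followed by a DELETE):

```ruby
require 'fileutils'
require 'tmpdir'

# Rails-side "finalize upload" sketch: copy the object Workhorse left at a
# temporary key to the final key, then delete the temporary one.
def finalize_upload(tmp_key, final_key)
  FileUtils.mkdir_p(File.dirname(final_key))
  FileUtils.cp(tmp_key, final_key) # copy to the final location...
  FileUtils.rm(tmp_key)            # ...then remove the temporary object
  final_key
end

root = Dir.mktmpdir
tmp_key   = File.join(root, 'tmp', 'upload-123.zip')
final_key = File.join(root, 'artifacts', '42', 'artifacts.zip')
FileUtils.mkdir_p(File.dirname(tmp_key))
File.write(tmp_key, 'zipdata') # pretend Workhorse stored the upload here

finalize_upload(tmp_key, final_key)
```

Keeping the copy-then-delete in Rails means the record of where the artifact finally lives stays under the application's control, even though the heavy byte transfer happened in Workhorse.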
Usage with other storages
Since this is a generic mechanism, it allows us to later store LFS objects/uploads in the same way. Build logs are a different beast and will require a custom implementation anyway.
Limitations
Since the runner does not send Content-Length with the file, and the S3 API requires it, we have to send the file only after receiving it completely,
Downloading a single artifact file will not work, as this expects the file to be stored internally.
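A sketch of working around the first limitation: spool the streamed body to disk to learn its exact size before issuing the PUT (the HTTP pieces are stand-ins):

```ruby
require 'tempfile'
require 'stringio'

# Because the runner streams the artifact without Content-Length and S3's PUT
# requires one, spool the whole body to a tempfile first to learn its size.
# `body_io` stands in for the runner's chunked request body.
def spool_body(body_io)
  spool = Tempfile.new('artifact-upload')
  spool.binmode
  IO.copy_stream(body_io, spool)
  spool.rewind
  [spool, spool.size] # spool.size becomes the Content-Length of the S3 PUT
end

stream = StringIO.new('x' * 4096) # a body whose length we don't know upfront
spool, content_length = spool_body(stream)
```

This is exactly the post-upload behaviour noted above; the multipart-upload improvement in the next section would remove the need to spool the full file first.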
Future improvements
Use multipart upload to S3, allowing us to upload in-line instead of post-upload,
In the future, the pre-signed URL could be returned to the runner, and the runner could be responsible for uploading the data. However, we would then not have a source of truth for the generated metadata, which is currently produced during artifact upload in Workhorse; this would make it possible to craft a .zip bomb that could be used for exploiting gitlab-pages,
Define S3 credentials for artifact storage on a per-project/group basis; this is easily achievable with the current CarrierWave implementation and will work with all mechanisms described here,
Reading metadata from the end of the archive: between 2 and 4 requests, plus the transfer needed to read the metadata (both depending on the number of files),
Reading the actual compressed file: one request, and egress traffic to transfer the compressed data (we read exactly the amount that is needed).
Thus, this removes the limitations on having object storage for artifacts.
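For context on the range reads mentioned above: the zip format places its central directory at the end of the archive, so a ranged GET of the tail is enough to start reading metadata. A minimal sketch of locating the End of Central Directory record, using a hand-built empty archive:

```ruby
# The zip End of Central Directory (EOCD) record sits at the end of the
# archive: 22 fixed bytes plus an optional comment of up to 65,535 bytes.
# One ranged GET of the last 22 + 65_535 bytes is therefore always enough
# to find it and learn where the central directory (the metadata) lives.
EOCD_SIGNATURE = "PK\x05\x06".b
MAX_TAIL = 22 + 65_535

# Scan the tail (as fetched with `Range: bytes=-N`) for the EOCD signature;
# the last occurrence wins, since the comment could embed a fake one.
def locate_eocd(tail)
  tail.rindex(EOCD_SIGNATURE)
end

# A valid empty zip is exactly one 22-byte EOCD record with zeroed fields.
empty_zip = EOCD_SIGNATURE + ("\x00".b * 18)
tail = empty_zip[-[empty_zip.bytesize, MAX_TAIL].min..]
offset = locate_eocd(tail)
# offset => 0 for the empty archive
```

From the EOCD record a real implementation would read the central directory's offset and size, then issue one or two more ranged requests, matching the 2-to-4 request count estimated above.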