GitLab records where each artifact is stored: local or remote storage. By default all artifacts are stored locally, and enabling object storage doesn't change that. Artifacts continue to be written to local storage, but we gain the ability to migrate existing artifacts to object storage and serve them from there.
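For context, enabling this in /etc/gitlab/gitlab.rb looks roughly like the sketch below. Only the artifacts_object_store_enabled key comes up elsewhere in this thread; the other keys and all of the values are assumptions to verify against the docs for the version we run.

```ruby
# Sketch only: new artifacts still land on local disk first, this just makes
# the remote store available for migration. The bucket name matches the one
# mentioned later in this thread; the connection keys and values below are
# placeholders to check against the docs for the running version.
gitlab_rails['artifacts_object_store_enabled'] = true
gitlab_rails['artifacts_object_store_remote_directory'] = 'gitlab-artifacts'
gitlab_rails['artifacts_object_store_connection'] = {
  'provider' => 'AWS',
  'region' => 'us-east-1',
  'aws_access_key_id' => 'AWS_ACCESS_KEY_ID',
  'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
}
```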
How to migrate artifacts:
Migrate all artifacts, starting from the oldest one; this is currently a synchronous, single-threaded process:
```
gitlab-rake gitlab:artifacts:migrate
```
Migrate a specific artifact for a specific build (from the Rails console):
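A minimal sketch, reusing the same migrate! calls as the project-level snippet further down; the build id is a hypothetical placeholder.

```ruby
# The build id below is a hypothetical placeholder; pick the build whose
# artifacts should move to the remote store.
build = Ci::Build.find(12345)
build.artifacts_file.migrate!(ArtifactUploader::REMOTE_STORE)
build.artifacts_metadata.migrate!(ArtifactUploader::REMOTE_STORE)
```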
@pcarranza I would be happy to. I would like to note that I am out next week and ZJ is out tomorrow, however. Thus we would not really be able to work together until the week of the 14th.
If we have a bucket, we should first move only those artifacts by doing:
```ruby
project = Project.find_by_full_path('zj/artifacts-moving-temp-project') # only one right now

project.builds.each do |build|
  build.artifacts_file.migrate!(ArtifactUploader::REMOTE_STORE)
  build.artifacts_metadata.migrate!(ArtifactUploader::REMOTE_STORE)
end
```
I doubt timing this will be useful, so this should only serve to check whether it works. Then some manual checks on the project.
@zj so just to be clear, right now GitLab only supports moving old artifacts to object storage for archival. New artifacts are still saved to the "local" disk and can be archived later, correct?
@zj Do we have to have gitlab_rails['artifacts_object_store_enabled'] = true in order to test the uploading? I ask because the docs say that enabling this will prevent browsing of artifacts via web, and that seems like something we wouldn't want to do. Relevant WIP MR is chef-repo!1014
Also, what version are we targeting for being able to use solely object storage for artifacts? With the current implementation we will be able to archive all of our old artifacts and save lots of money by deleting the old artifact storage server, however ideally we will be able to bypass local storage altogether fairly soon.
@pcarranza @sitschner From the CI team's side I believe that it will happen with 10.2. In 10.1 we will make archiving happen automatically. 10.2 should bring built-in support in the Runner to connect directly to object storage and perform the upload.
We're waiting on 9.5.1 to continue the next round of testing.
The next target would be the gitlab-org group. This migration will run much longer, allowing us to determine how fast we're migrating the data to S3. Given this is a larger-scale test, we can then also determine the ingress/egress ratio. This might be OK now, or indicate we need to optimise our uploader.
Not sure if I can comment on this in the open, but our ingress can be viewed here
We have begun moving old gitlab-org/gitlab-ce artifacts to S3. This is running on the deploy node in a tmux session. I have no idea how long it will take, but at an estimated 100 per minute it will take roughly 48 hours to finish.
At the time of writing we're transferring the www-gitlab-org artifacts, which is actually a tiny amount. After that we'll push this a bit more aggressively:
First we transfer everything created over 1 year ago. Once those are done, every artifact created before May of this year.
So far our findings:
- We try to move files that Sidekiq has already expired.
- The sequential rate, over the hours we ran this, is about 80 an hour.
- So far no one has complained about artifacts acting weird.
```ruby
require 'logger'

logger = Logger.new(STDOUT)
transferred = 0

# Mutes the AR queries being run
ActiveRecord::Base.logger = nil

Ci::Build.joins(:project).with_artifacts
  .where('created_at < ?', 1.year.ago)
  .where(artifacts_file_store: [nil, ArtifactUploader::LOCAL_STORE])
  .find_each(batch_size: 100) do |build|
  begin
    build.artifacts_file.migrate!(ArtifactUploader::REMOTE_STORE)
    build.artifacts_metadata.migrate!(ArtifactUploader::REMOTE_STORE)

    transferred += 1
    logger.info "Migrated build ##{transferred} of about 300000" if transferred % 100 == 0
  rescue => e
    logger.error("Failed transfering #{build.id}: #{e.message}")
  end
end
```
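Once this run finishes, the second pass described above (everything created before May) should only need a different cutoff. A sketch, with the exact date written out as an assumption:

```ruby
# Sketch of the second pass: the same filters as the script above with a
# different cutoff. The 2017-05-01 date is an assumption standing in for
# "before May of this year".
Ci::Build.joins(:project).with_artifacts
  .where('created_at < ?', Date.new(2017, 5, 1))
  .where(artifacts_file_store: [nil, ArtifactUploader::LOCAL_STORE])
  .find_each(batch_size: 100) do |build|
  # same migrate! and logging body as the script above
end
```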
Given the limits and bounds we put on which Ci::Builds to migrate, we can't use the rake task yet.
```
I, [2017-08-30T23:22:55.004969 #3158] INFO -- : Migrated build #92900 of about 300000
I, [2017-08-30T23:23:25.987685 #3158] INFO -- : Migrated build #93000 of about 300000
I, [2017-08-30T23:25:04.392878 #3158] INFO -- : Migrated build #93100 of about 300000
```
@zj Things are moving along, but I checked just now and there are a few errors such as the following:
```
E, [2017-08-31T15:07:59.653785 #3158] ERROR -- : Failed transfering 2286926: getaddrinfo: Name or service not known (SocketError)
E, [2017-08-31T15:07:59.763166 #3158] ERROR -- : Failed transfering 2286930: getaddrinfo: Name or service not known (SocketError)
E, [2017-08-31T15:08:00.763608 #3158] ERROR -- : Failed transfering 2286949: getaddrinfo: Name or service not known (SocketError)
E, [2017-08-31T15:08:00.781340 #3158] ERROR -- : Failed transfering 2286953: getaddrinfo: Name or service not known (SocketError)
E, [2017-08-31T15:08:01.057677 #3158] ERROR -- : Failed transfering 2286966: getaddrinfo: Name or service not known (SocketError)
```
Is this something we need to do something about? The script itself is continuing and is currently at
```
I, [2017-08-31T15:24:29.652392 #3158] INFO -- : Migrated build #172900 of about 300000
```
We had a few more errors, a sample of which is below:
```
E, [2017-09-01T15:52:13.446584 #3158] ERROR -- : Failed transfering 3055577: getaddrinfo: Name or service not known (SocketError)
E, [2017-09-01T15:52:13.465610 #3158] ERROR -- : Failed transfering 3055579: getaddrinfo: Name or service not known (SocketError)
E, [2017-09-01T15:52:13.498819 #3158] ERROR -- : Failed transfering 3055586: getaddrinfo: Name or service not known (SocketError)
E, [2017-09-01T15:52:17.568352 #3158] ERROR -- : Failed transfering 3055597: getaddrinfo: Name or service not known (SocketError)
E, [2017-09-01T15:52:17.621078 #3158] ERROR -- : Failed transfering 3055621: getaddrinfo: Name or service not known (SocketError)
```
And we are nearing the end of this run:
```
I, [2017-09-01T16:49:20.883265 #3158] INFO -- : Migrated build #284600 of about 300000
I, [2017-09-01T16:50:53.376675 #3158] INFO -- : Migrated build #284700 of about 300000
I, [2017-09-01T16:52:00.526847 #3158] INFO -- : Migrated build #284800 of about 300000
```
@zj and I decided to restart the script using require 'resolv-replace', which should fix the resolution errors. Since Monday is a US holiday, this will give it plenty of time to transfer!
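For reference, the change amounts to loading this at the very top of the script (a sketch):

```ruby
# Requiring resolv-replace makes TCPSocket and friends resolve hostnames via
# Ruby's Resolv library instead of the system getaddrinfo call that was
# raising the SocketError above.
require 'resolv-replace'
```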
@zj I've just checked so that we can begin the next batch. We did still have a few name errors, even after adding the require 'resolv-replace'.
```
E, [2017-09-02T07:38:40.762392 #3158] ERROR -- : Failed transfering 3656704: getaddrinfo: Name or service not known (SocketError)
E, [2017-09-02T07:38:40.781723 #3158] ERROR -- : Failed transfering 3656705: getaddrinfo: Name or service not known (SocketError)
E, [2017-09-02T07:38:40.802153 #3158] ERROR -- : Failed transfering 3656706: getaddrinfo: Name or service not known (SocketError)
E, [2017-09-02T07:38:50.793500 #3158] ERROR -- : Failed transfering 3656711: getaddrinfo: Name or service not known (SocketError)
```
@zj, @ayufan, and I did some troubleshooting on this today. We discovered that from time to time Amazon will return a SERVFAIL for the gitlab-artifacts.s3.amazonaws.com record. This causes the failure and is cached for about 60 seconds. We are discussing a solution, but this is definitely a problem on Amazon's side that we will need to work around.
@ayufan definitely Amazon. We do not use Azure's resolvers, we use our own. @northrup can probably give more details, but we don't think our server is the problem here.
I have just checked up on the transfer. It is still going acceptably. There have been a few name resolution errors, however. I think it may be our server since we are doing so many queries.
@northrup is there rate limiting on our DNS servers?
The transfer is still going. We have transferred a total of 7.7 TB; that's 1.4 TB since my last update. Not bad!
We are still getting the name errors, even after #2824 (closed) was resolved. As such, perhaps @northrup and I can meet about this on Monday. I tried to debug it myself, but I must be missing something.
@zj for these failures, should we be retrying? Independent of these lookup failures, the availability of the S3 API is not going to be high enough to avoid occasional 500s on PUTs anyway.
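If we do end up retrying, a minimal sketch of what that could look like around the migrate! calls; this is not something we are running, and the attempt count is arbitrary.

```ruby
# Hypothetical helper, not part of the running script: retry a build's
# migration a few times before giving up, to paper over transient DNS
# lookup failures or S3 500s.
def migrate_with_retry(build, attempts: 3)
  build.artifacts_file.migrate!(ArtifactUploader::REMOTE_STORE)
  build.artifacts_metadata.migrate!(ArtifactUploader::REMOTE_STORE)
rescue => e
  attempts -= 1
  retry if attempts > 0
  raise e
end
```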
@zj So what is our expectation on migrating these artifacts? Am I clear in thinking your comment means that we won't be able to be confident that all artifacts have been migrated until 10.1 or later?
@northrup and I were unable to meet about DNS due to the .com issues this morning and afternoon. We will be meeting tomorrow. As for the transfer, it is going on with a few name errors as expected. We have currently moved 11.1 TB into S3 (wow!).
> @zj So what is our expectation on migrating these artifacts? Am I clear in thinking your comment means that we won't be able to be confident that all artifacts have been migrated until 10.1 or later?
Re-run the migration. It will pick up whatever is left; the process is idempotent.
The migration is ongoing. We have moved a total of 13.3 TB into S3. As an interesting side note, we have 1857359 objects in the artifacts bucket so far!
An update for you on this lovely Friday afternoon. The process is still moving as expected. Due to other fires and some time off, we still haven't uncovered the DNS issues fully. However, as @ayufan said, it is idempotent so we will be able to re-run this and catch all the failures once we fix DNS.
We have moved a total of 15.6 TB of artifacts into S3 as of now, 2016661 objects in all.
The transfer finished. Since obviously there were some failures, I'm going to restart the process. We still need to fix the DNS issues though so we can be certain everything is migrated.
We have 21.8 TB and 2411148 objects there so far.
@zj is there a way to see how many artifacts are left?
@ahanselka We could write a query, but please let's talk about this a bit more before restarting. The next release will most probably include a migration that automatically migrates all the artifacts; that way we don't have to monitor it anymore.
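For reference, a sketch of the kind of query that could answer the question above, mirroring the store filter used by the migration script earlier in the thread:

```ruby
# Sketch: count builds whose artifacts are still on local (or unset) storage.
remaining = Ci::Build.with_artifacts
  .where(artifacts_file_store: [nil, ArtifactUploader::LOCAL_STORE])
  .count
puts "#{remaining} builds still have local artifacts"
```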