@ayufan it just started happening on some runners I'm running on a Rancher cluster. Not all the time, but once these errors appear I have to kill the runner containers (not the job containers, the actual runners) so that new ones get registered.
The thing is that if you run them as separate processes, the concurrency handling will not work :( The concurrency index is tracked internally by the Runner, so if a single Runner executes multiple builds it is fine.
Ideally you should use Docker-in-Docker in such a case, to make sure that the runners don't conflict with each other.
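For illustration, a rough sketch of such a setup (image names, the link alias and the unencrypted 2375 port are assumptions here, adjust to your environment): each Runner gets its own Docker engine, so job containers created by different Runners can never collide by name.

```sh
# Hypothetical sketch: give a Runner its own Docker engine via docker:dind
# instead of sharing the host's /var/run/docker.sock.
docker run -d --name runner-dind --privileged docker:dind

# Point the Runner at that private engine (assumes the Runner picks up
# DOCKER_HOST; you can also set the docker host in config.toml instead).
docker run -d --name gitlab-runner \
  --link runner-dind:docker \
  -e DOCKER_HOST=tcp://docker:2375 \
  gitlab/gitlab-ci-multi-runner:latest
```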
I'm planning to prepare documentation on how to run GitLab Runner as a service on a Swarm/Kubernetes cluster, and also on how to run builds on Swarm/Kubernetes (which is not the same thing). Here, you are using GitLab Runner as a service, which also means you should not use the host Docker Engine, because cluster managers can affect the stability of builds.
Not sure if I follow, I don't understand what you mean by "separate processes"?
My setup is this: I have 2 RancherOS hosts, and on each host there's a container running gitlab-ci-multi-runner. They have the Docker binary and socket mounted as volumes, so they start the job containers as their siblings on the RancherOS machine.
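Roughly what that looks like as a `docker run` command (paths and image tag are illustrative, not the exact Rancher service definition):

```sh
# The Runner container reuses the host's Docker engine, so job containers are
# started as siblings of this container rather than as children.
docker run -d --name gitlab-runner \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /usr/bin/docker:/usr/bin/docker \
  -v /srv/gitlab-runner/config:/etc/gitlab-runner \
  gitlab/gitlab-ci-multi-runner:latest
```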
I'd love to use the autoscaling, but I'm not sure whether that works with Rancher (which is basically CoreOS)?
If two Runner instances end up running on a single host, they have a concurrency limit set to something other than 1, and they use the same Docker host, you will see the above error. It happens because the container name is built from the concurrency index, which for the same project will be 0 in both instances.
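To illustrate the collision (the names below are made up but follow the same pattern as in the error messages later in this thread, and assume both Runner processes share the same registration, so the same token prefix in the name):

```sh
# Runner A creates its build container for concurrent index 0:
docker run -d --name runner-aaaaaaaa-project-479-concurrent-0-build alpine sleep 600

# Runner B, sharing the same Docker host and also at index 0 for the same
# project, tries to create a container with the same name and fails:
docker run -d --name runner-aaaaaaaa-project-479-concurrent-0-build alpine sleep 600
# -> Conflict. The name "/runner-aaaaaaaa-project-479-concurrent-0-build" is
#    already in use by another container. You have to remove (or rename) that
#    container to be able to reuse that name.
```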
Auto-scaling can be used to provision infrastructure; RancherOS provides CaaS. These are different concepts. You can use auto-scaling with any of the Docker Machine drivers: https://docs.docker.com/machine/drivers/.
Oh, I haven't been following all the fancy new Docker-something terms recently; looks like I'll be able to use the autoscaling anyway, via SSH!
But back to the original issue: Rancher is taking care of making sure that there's only 1 Runner per RancherOS host, and that's working fine! I even went so far as to pause all other Runners in GitLab, so only that single Runner was being used for jobs, and it still kept failing.
When I paused that failing Runner and unpaused the other (so far unused) one, the jobs ran fine on it. Then I removed the failing Runner, killed its container, a new one started up, and that's now handling jobs as expected - until the error happens again, at which point I'll do the switcheroo again. =(
I'm having the same issue too. There are no multiple Runner processes running on my runner hosts, and the hosts are not using Docker-in-Docker; it's just a runner on a host that runs the Docker commands.
I suspect this is a bug with the 1.1.x series of the runner, as this did not happen until I upgraded. Downgrading isn't possible due to hitting the "too many open files" bug from previous versions.
I've tried to reproduce it with: 1.0.1, 1.0.4, 1.1.0, 1.1.1, 1.1.2, 1.2.0beta and I wasn't able to. Builds were running in parallel without any errors.
Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:17:17 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:17:17 2016
 OS/Arch:      linux/amd64
@SharpEdgeMarshall @naotaco @jangrewe @Cidan: Do you see any important differences (like Docker version or the storage driver used) between your Docker configuration and mine? Is my Runner's configuration similar to yours?
This issue actually goes away for me if I use something other than btrfs for my docker host filesystem. I'll chalk this one up to btrfs being somewhat of a mess and not production ready.
Bumped into this too, with gitlab-runner 1.3.2 (0323456); concurrency was set to 1 on both runner systems. But how do I get out of this 'state'? Restarting the runners didn't help. I checked for a cache, but there is no cache in the working directory (and I'm not running in Docker). I'm now trying to tune the cache folder since I moved my runners, but I see that it didn't store the cache anywhere either; perhaps that might be related.
Update: I changed the paths to a clean directory and watched stuff being placed in that folder, but I still get the same error message. Restarting / changing / etc. didn't fix the issue; I am now stuck with runners not able to run properly. Any hints/tips to kill this state are appreciated.
INFO[0183] 37606 Appending trace to coordinator... ok RemoteRange=0-402406 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=396248-402406 runner=800b09a2
INFO[0186] 37606 Appending trace to coordinator... ok RemoteRange=0-405682 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=402406-405682 runner=800b09a2
INFO[0189] 37606 Appending trace to coordinator... ok RemoteRange=0-406050 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=405682-406050 runner=800b09a2
INFO[0219] 37606 Appending trace to coordinator... ok RemoteRange=0-406050 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=406050-406050 runner=800b09a2
WARN[0237] 37606 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=406050-427375 runner=800b09a2
WARN[0240] 37606 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=406050-427598 runner=800b09a2
WARN[0243] 37606 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=406050-432187 runner=800b09a2
WARN[0246] 37606 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=406050-436776 runner=800b09a2
@mcfedr part of my issue (I still need to test further) was that the cache folder was not correct and the filesystem filled up, which created strange issues. Adding more space for the cache folder resolved some of my problems. I don't know if you use the cache, just throwing it out there.
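For anyone wanting to rule that out quickly, something along these lines (the path is whatever you configured as the cache location):

```sh
# Check free space (and inodes) on the volume holding the runner's cache dir;
# a full filesystem there produced the strange failures described above.
df -h /path/to/your/cache-dir
df -i /path/to/your/cache-dir
```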
Here is some more output from my issue. It starts off fine, doing its work in the container with id 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b, and then at some point it starts getting 500 errors back. From that moment on, the container is 'lost' and the build will fail.
Starting container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b ... build=39919 runner=800b09a2
Attaching to container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b ... build=39919 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-690 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=308-690 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-828 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=690-828 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-2062 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=828-2062 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-2181 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=2062-2181 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-2313 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=2181-2313 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-2547 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=2313-2547 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-2638 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=2547-2638 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-2868 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=2638-2868 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-3412 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=2868-3412 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-3898 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=3412-3898 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-3898 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=3898-3898 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-4007 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=3898-4007 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-4064 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=4007-4064 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-4186 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=4064-4186 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-4243 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=4186-4243 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-4243 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=4243-4243 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-4243 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=4243-4243 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-4243 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=4243-4243 runner=800b09a2
WARNING: 39919 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=4243-25686 runner=800b09a2
WARNING: 39919 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=4243-25882 runner=800b09a2
WARNING: 39919 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=4243-25882 runner=800b09a2
*message repeated more times*
WARNING: 39919 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=4243-101224 runner=800b09a2
Waiting for container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b ... build=39919 runner=800b09a2
Container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b finished with <nil> build=39919 runner=800b09a2
Executing on runner-800b09a2-project-1290-concurrent-0-build the set -eo pipefail set +o noclobber : | eval '' build=39919 runner=800b09a2
Starting container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b ... build=39919 runner=800b09a2
WARNING: 39919 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=4243-105311 runner=800b09a2
Attaching to container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b ... build=39919 runner=800b09a2
Waiting for container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b ... build=39919 runner=800b09a2
Container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b finished with <nil> build=39919 runner=800b09a2
Executing on runner-800b09a2-project-1290-concurrent-0-predefined the set -eo pipefail *executes its stuff* build=39919 runner=800b09a2
Starting container 9f4a8f4e484ea01a2fe6a482d341b51a4710f3532ecc17185b1721ee2c696060 ... build=39919 runner=800b09a2
ERROR: Build failed: API error (404): No such container: 9f4a8f4e484ea01a2fe6a482d341b51a4710f3532ecc17185b1721ee2c696060 build=39919 runner=800b09a2
Removed container 9f4a8f4e484ea01a2fe6a482d341b51a4710f3532ecc17185b1721ee2c696060 with No such container: 9f4a8f4e484ea01a2fe6a482d341b51a4710f3532ecc17185b1721ee2c696060 build=39919 runner=800b09a2
WARNING: 39919 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=4243-105447 runner=800b09a2
Removed container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b with <nil> build=39919 runner=800b09a2
Closed all idle connections for docker.Client: &{false 0xc8204388a0 <nil> 0x36e9060 unix:///var/run/docker.sock 0xc82017a300 0xc8203b8360 [1 18] [1 24] [1 18] 0xc8204388d0}
39919 Submitting build to coordinator... ok runner=800b09a2
Same here, could anyone find a workaround for this? Using `gitlab-runner==1.4.1`, `gitlab==8.10.2`, `docker==1.11.2, build b9f10c9` on CentOS 7.2.
I've set the Docker daemon options to `--storage-opt dm.basesize=20G --storage-driver=devicemapper --storage-opt=dm.thinpooldev=/dev/mapper/centos-thinpool --storage-opt dm.use_deferred_removal=false`, btw. Maybe we could find a workaround by using a different/non-conflicting storage driver.
I was able to trigger the "container already exists" error multiple times, reproducibly.
In journalctl I always got these two messages:
Jul 28 10:32:31 ci-linux.itiso.net docker[1191]: time="2016-07-28T10:32:31.428850856+02:00" level=error msg="Handler for DELETE /v1.18/containers/runner-03e2d752-project-479-concurrent-3-predefined returned error: No such container: runner-03e2d752-project-479-concurrent-3-predefined"
Jul 28 10:32:31 ci-linux.itiso.net docker[1191]: time="2016-07-28T10:32:31.430649681+02:00" level=error msg="Handler for POST /v1.18/containers/create returned error: Conflict. The name "/runner-03e2d752-project-479-concurrent-3-predefined" is already in use by container 11e1546aa8545d23a1e1836f6d2e916008cfcde415320e2007d3cb281417050c. You have to remove (or rename) that container to be able to reuse that name."
The container name was always the same, and the container id was always the same.
I am using Docker 1.11.2, and I tried to find the container with `docker ps -a`.
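Roughly the commands I used (container name and id taken from the journalctl output above):

```sh
# Look for the container that Docker claims is holding the name:
docker ps -a --filter "name=runner-03e2d752-project-479-concurrent-3-predefined"

# And query the id from the "already in use" message directly:
docker inspect 11e1546aa8545d23a1e1836f6d2e916008cfcde415320e2007d3cb281417050c
```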
After I finally got a second person to look at my findings, the build suddenly worked, I guess because another build had finished.
I suspect this is a docker issue.
If I have a runner on Docker 1.12 it will fail with this API container error. If I run on a 1.9.x version of Docker, I have no problems. I'll try some more combinations.
Update: scrap that, I'm seeing it on 1.9.x now too. Just thinking out loud here: I cleaned up a lot of stuff from Docker, could it be that the runner still keeps the IDs of the cache containers that were used in the past and tries to find them even though they are gone (when it should just rebuild them)?
I managed to solve my "Build failed with: API error (404): no such id:" problem. The thing is that the cache can live either on the file system or in Docker cache containers. I had it in Docker, and the 'admins' of the server had installed a crontab that would remove old/unused Docker images and containers. Since my job took around 7 minutes, it would break almost every time because that crontab was throwing away the cache containers. You can see them with `docker ps -a`; basically, if you run the job and see that they are missing, then 'something' is removing those entries. This at least solves that issue and my builds can continue, but I still see the 500 errors in the logs where the coordinator fails. When that happens I don't get any updates on the screen until the job is finished. I was thinking that maybe it hits the limit of the log output, but the last time I hit that limit I got an error message saying so, which is not the case now. Hope this helps someone too.
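A quick way to check this, e.g. before and after such a cleanup job runs (the name filter is just a guess at matching the runner-created containers):

```sh
# List runner-created containers (build, predefined and cache containers all
# carry the "runner-" name prefix); if the cache ones disappear between two
# jobs, something external is deleting them.
docker ps -a --filter "name=runner-"
```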
A customer also has concurrency set to 1 but is experiencing this issue. Could this be due to the mentioned Docker issue where a container with the same name cannot be recreated later? (https://github.com/docker/docker/issues/24706)
I can confirm the same fix @riemers mentioned; we were running docker-gc on a 2hr cadence, and confirmed that it was removing the containers named like `runner-*-concurrent-*` created by the specific build.
I was able to reproduce it by launching a build and running docker-gc during that build to create a failure.
Our specific fix was to add a container exclusion pattern to docker-gc.
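For reference, with spotify/docker-gc that's done via its container-exclusion file (the path and pattern below are from memory of our setup, so treat them as an assumption and check the docker-gc README):

```sh
# Tell docker-gc to leave the GitLab Runner containers (including the cache
# containers) alone; docker-gc reads exclusion patterns from this file.
echo 'runner-' | sudo tee -a /etc/docker-gc-exclude-containers
```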
@mente the problem was actually a garbage-collection conflict: GitLab Runner was attempting to clean up after itself and failed when trying to delete containers that were no longer present (having been removed by docker-gc). After excluding them from docker-gc, the containers are cleaned up normally by GitLab Runner.
@dblessing I was able to reproduce the issue initially by just manually deleting a particular container, so it's not specific to docker-gc...
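i.e. something as simple as this, done while a build is still using the container, is enough to trigger it (the name here is only an example of the runner's naming pattern):

```sh
# Remove a container the runner still believes exists; the runner's next
# Docker API call for that container then fails with "API error (404)".
docker rm -f runner-aaaaaaaa-project-1-concurrent-0-build
```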
`Build failed with: container already exists` can happen because sometimes containers fail and then we are unable to remove them due to zombie processes. It mostly happens when using AUFS (I happen to see that quite often). So yes, it's possible, as asked in this question: https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/issues/1038#note_13920385. We have very good experience using OverlayFS on newer kernel versions (4.2 and later).
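A minimal sketch of checking/switching that (the exact steps depend on your distro and Docker packaging, and switching the storage driver makes existing images and containers invisible until you switch back):

```sh
# Check the kernel version (OverlayFS works well from 4.2) and the storage
# driver currently in use:
uname -r
docker info | grep -i 'storage driver'

# Then restart the daemon with OverlayFS, e.g. (Docker 1.11-era syntax;
# newer versions use dockerd):
# docker daemon --storage-driver=overlay
```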
`Build failed with: API error (404): no such id:` usually happens when there's an application that is cleaning up and removing old containers and Docker images. To mitigate this kind of problem we added https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/merge_requests/244, which retries a build if the failure happened during the executor preparation phase.
Please attach `gitlab-runner --debug run` logs. That will make it easier for me to debug why this happens and possibly propose a workaround / fix for the issue.
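Something like this, for a Runner installed as a system service (the service handling is an example; adjust it to how you run the Runner):

```sh
# Stop the service, then run the Runner in the foreground with debug logging,
# capturing the output to a file you can attach here.
sudo gitlab-runner stop
sudo gitlab-runner --debug run 2>&1 | tee gitlab-runner-debug.log
```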