@ayufan it just started happening on some runners I'm running on a Rancher cluster. Not all the time, but once these errors appear I have to kill the runner containers (not the job containers, the actual runners) so that new ones get registered.
The thing is that if you run them as separate processes, the concurrency handling will not work :( The concurrency index is tracked internally by the Runner, so if a single Runner executes multiple builds it is fine.
Ideally you should use Docker-in-Docker in such a case, to make sure that the runners don't conflict with each other.
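For illustration, a rough sketch of such a setup (image names, the link alias and the unencrypted 2375 port are assumptions here, adjust to your environment): each Runner gets its own Docker engine, so job containers created by different Runners can never collide by name.

```sh
# Hypothetical sketch: give a Runner its own Docker engine via docker:dind
# instead of sharing the host's /var/run/docker.sock.
docker run -d --name runner-dind --privileged docker:dind

# Point the Runner at that private engine (assumes the Runner picks up
# DOCKER_HOST; you can also set the docker host in config.toml instead).
docker run -d --name gitlab-runner \
  --link runner-dind:docker \
  -e DOCKER_HOST=tcp://docker:2375 \
  gitlab/gitlab-ci-multi-runner:latest
```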
I'm planning to prepare documentation on how to run GitLab Runner as a service on a Swarm/Kubernetes cluster, and also on how to run builds on Swarm/Kubernetes (which is not the same thing). Here, you are using GitLab Runner as a service, which also means you should not use the host Docker Engine, because cluster managers can affect the stability of builds.
Not sure if I follow, I don't understand what you mean by "separate processes"?
My setup is this: I have 2 RancherOS hosts, and on each host there's a container running gitlab-ci-multi-runner. They have the Docker binary and socket mounted as volumes, so they start the job containers as their siblings on the RancherOS machine.
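Roughly what that looks like as a `docker run` command (paths and image tag are illustrative, not the exact Rancher service definition):

```sh
# The Runner container reuses the host's Docker engine, so job containers are
# started as siblings of this container rather than as children.
docker run -d --name gitlab-runner \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /usr/bin/docker:/usr/bin/docker \
  -v /srv/gitlab-runner/config:/etc/gitlab-runner \
  gitlab/gitlab-ci-multi-runner:latest
```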
I'd love to use the autoscaling, but I'm not sure whether that works with Rancher (which is basically CoreOS)?
If two Runner instances end up running on a single host, they have a concurrency limit set to something other than 1, and they use the same Docker host, you will see the above error. It happens because the container name is built from the concurrency index, which for the same project will be 0 in both instances.
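To illustrate the collision (the names below are made up but follow the same pattern as in the error messages later in this thread, and assume both Runner processes share the same registration, so the same token prefix in the name):

```sh
# Runner A creates its build container for concurrent index 0:
docker run -d --name runner-aaaaaaaa-project-479-concurrent-0-build alpine sleep 600

# Runner B, sharing the same Docker host and also at index 0 for the same
# project, tries to create a container with the same name and fails:
docker run -d --name runner-aaaaaaaa-project-479-concurrent-0-build alpine sleep 600
# -> Conflict. The name "/runner-aaaaaaaa-project-479-concurrent-0-build" is
#    already in use by another container. You have to remove (or rename) that
#    container to be able to reuse that name.
```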
Auto-scaling can be used to provision infrastructure; RancherOS provides CaaS. These are different concepts. You can use auto-scaling with any of the Docker Machine drivers: https://docs.docker.com/machine/drivers/.
Oh, I haven't been following all the fancy new Docker-something terms recently; looks like I'll be able to use the autoscaling anyway, via SSH!
But back to the original issue: Rancher is taking care of making sure that there's only 1 Runner per RancherOS host, and that's working fine! I even went so far as to pause all other Runners in GitLab, so only that single Runner was being used for jobs, and it still kept failing.
When I paused that failing Runner and unpaused the other (so far unused) one, the jobs ran fine on it. Then I removed the failing Runner, killed its container, a new one started up, and that's now handling jobs as expected - until the error happens again, at which point I'll do the switcheroo again. =(
I'm having the same issue too. There are no multiple Runner processes running on my runner hosts, and the hosts are not using Docker-in-Docker; it's just a runner on a host that runs the Docker commands.
I suspect this is a bug with the 1.1.x series of the runner, as this did not happen until I upgraded. Downgrading isn't possible due to hitting the "too many open files" bug from previous versions.
I've tried to reproduce it with: 1.0.1, 1.0.4, 1.1.0, 1.1.1, 1.1.2, 1.2.0beta and I wasn't able to. Builds were running in parallel without any errors.
Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:17:17 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:17:17 2016
 OS/Arch:      linux/amd64
@SharpEdgeMarshall @naotaco @jangrewe @Cidan: Do you see any important differences (like Docker version or the storage driver used) between your Docker configuration and mine? Is my Runner's configuration similar to yours?
This issue actually goes away for me if I use something other than btrfs for my docker host filesystem. I'll chalk this one up to btrfs being somewhat of a mess and not production ready.
Bumped into this too, with gitlab-runner 1.3.2 (0323456); concurrency was set to 1 on both runner systems. But how do I get out of this 'state'? Restarting the runners didn't help. I checked for a cache, but there is no cache in the working directory (and I'm not running in Docker). I'm now trying to tune the cache folder since I moved my runners, but I see that it didn't store the cache anywhere either; perhaps that might be related.
Update: I changed the paths to a clean directory and watched stuff being placed in that folder, but I still get the same error message. Restarting / changing / etc. didn't fix the issue; I am now stuck with runners not able to run properly. Any hints/tips to kill this state are appreciated.
INFO[0183] 37606 Appending trace to coordinator... ok RemoteRange=0-402406 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=396248-402406 runner=800b09a2
INFO[0186] 37606 Appending trace to coordinator... ok RemoteRange=0-405682 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=402406-405682 runner=800b09a2
INFO[0189] 37606 Appending trace to coordinator... ok RemoteRange=0-406050 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=405682-406050 runner=800b09a2
INFO[0219] 37606 Appending trace to coordinator... ok RemoteRange=0-406050 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=406050-406050 runner=800b09a2
WARN[0237] 37606 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=406050-427375 runner=800b09a2
WARN[0240] 37606 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=406050-427598 runner=800b09a2
WARN[0243] 37606 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=406050-432187 runner=800b09a2
WARN[0246] 37606 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=406050-436776 runner=800b09a2
@mcfedr part of my issue (I still need to test further) was that the cache folder was not correct and the filesystem filled up, which created strange issues. Adding more space for the cache folder resolved some of my problems. I don't know if you use the cache, just throwing it out there.
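For anyone wanting to rule that out quickly, something along these lines (the path is whatever you configured as the cache location):

```sh
# Check free space (and inodes) on the volume holding the runner's cache dir;
# a full filesystem there produced the strange failures described above.
df -h /path/to/your/cache-dir
df -i /path/to/your/cache-dir
```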
Here is some more output from my issue. It starts off fine, doing its work in the container with id 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b, and then at some point it starts getting 500 errors back. From that moment on, the container is 'lost' and the build will fail.
Starting container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b ... build=39919 runner=800b09a2
Attaching to container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b ... build=39919 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-690 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=308-690 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-828 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=690-828 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-2062 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=828-2062 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-2181 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=2062-2181 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-2313 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=2181-2313 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-2547 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=2313-2547 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-2638 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=2547-2638 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-2868 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=2638-2868 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-3412 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=2868-3412 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-3898 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=3412-3898 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-3898 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=3898-3898 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-4007 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=3898-4007 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-4064 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=4007-4064 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-4186 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=4064-4186 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-4243 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=4186-4243 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-4243 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=4243-4243 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-4243 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=4243-4243 runner=800b09a2
39919 Appending trace to coordinator... ok RemoteRange=0-4243 RemoteState=running ResponseMessage=202 Accepted ResponseStatusCode=202 SentRange=4243-4243 runner=800b09a2
WARNING: 39919 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=4243-25686 runner=800b09a2
WARNING: 39919 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=4243-25882 runner=800b09a2
WARNING: 39919 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=4243-25882 runner=800b09a2
*message repeated more times*
WARNING: 39919 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=4243-101224 runner=800b09a2
Waiting for container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b ... build=39919 runner=800b09a2
Container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b finished with <nil> build=39919 runner=800b09a2
Executing on runner-800b09a2-project-1290-concurrent-0-build the set -eo pipefail set +o noclobber : | eval '' build=39919 runner=800b09a2
Starting container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b ... build=39919 runner=800b09a2
WARNING: 39919 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=4243-105311 runner=800b09a2
Attaching to container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b ... build=39919 runner=800b09a2
Waiting for container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b ... build=39919 runner=800b09a2
Container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b finished with <nil> build=39919 runner=800b09a2
Executing on runner-800b09a2-project-1290-concurrent-0-predefined the set -eo pipefail *executes its stuff* build=39919 runner=800b09a2
Starting container 9f4a8f4e484ea01a2fe6a482d341b51a4710f3532ecc17185b1721ee2c696060 ... build=39919 runner=800b09a2
ERROR: Build failed: API error (404): No such container: 9f4a8f4e484ea01a2fe6a482d341b51a4710f3532ecc17185b1721ee2c696060 build=39919 runner=800b09a2
Removed container 9f4a8f4e484ea01a2fe6a482d341b51a4710f3532ecc17185b1721ee2c696060 with No such container: 9f4a8f4e484ea01a2fe6a482d341b51a4710f3532ecc17185b1721ee2c696060 build=39919 runner=800b09a2
WARNING: 39919 Appending trace to coordinator... failed RemoteRange= RemoteState= ResponseMessage=500 Internal Server Error ResponseStatusCode=500 SentRange=4243-105447 runner=800b09a2
Removed container 8e5330484b64bc324bc0e7750c777f1ef7d4390310a22f3e1ae30d5e87532d1b with <nil> build=39919 runner=800b09a2
Closed all idle connections for docker.Client: &{false 0xc8204388a0 <nil> 0x36e9060 unix:///var/run/docker.sock 0xc82017a300 0xc8203b8360 [1 18] [1 24] [1 18] 0xc8204388d0}
39919 Submitting build to coordinator... ok runner=800b09a2
Same here, could anyone find a workaround for this? Using `gitlab-runner==1.4.1`, `gitlab==8.10.2`, `docker==1.11.2, build b9f10c9` on CentOS 7.2.
I've set the Docker daemon options to `--storage-opt dm.basesize=20G --storage-driver=devicemapper --storage-opt=dm.thinpooldev=/dev/mapper/centos-thinpool --storage-opt dm.use_deferred_removal=false`, btw. Maybe we could find a workaround by using a different/non-conflicting storage driver.
I was able to trigger the "container already exists" error multiple times, reproducibly.
In journalctl I always got these two messages:
Jul 28 10:32:31 ci-linux.itiso.net docker[1191]: time="2016-07-28T10:32:31.428850856+02:00" level=error msg="Handler for DELETE /v1.18/containers/runner-03e2d752-project-479-concurrent-3-predefined returned error: No such container: runner-03e2d752-project-479-concurrent-3-predefined"
Jul 28 10:32:31 ci-linux.itiso.net docker[1191]: time="2016-07-28T10:32:31.430649681+02:00" level=error msg="Handler for POST /v1.18/containers/create returned error: Conflict. The name "/runner-03e2d752-project-479-concurrent-3-predefined" is already in use by container 11e1546aa8545d23a1e1836f6d2e916008cfcde415320e2007d3cb281417050c. You have to remove (or rename) that container to be able to reuse that name."
The container name was always the same, and the container id was always the same.
I am using Docker 1.11.2, and I tried to find the container with `docker ps -a`.
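Roughly the commands I used (container name and id taken from the journalctl output above):

```sh
# Look for the container that Docker claims is holding the name:
docker ps -a --filter "name=runner-03e2d752-project-479-concurrent-3-predefined"

# And query the id from the "already in use" message directly:
docker inspect 11e1546aa8545d23a1e1836f6d2e916008cfcde415320e2007d3cb281417050c
```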
After I finally got a second person to look at my findings, the build suddenly worked, I guess because another build had finished.
I suspect this is a docker issue.
If I have a runner on Docker 1.12 it will fail with this API container error. If I run on a 1.9.x version of Docker, I have no problems. I'll try some more combinations.
Update: scrap that, I'm seeing it on 1.9.x now too. Just thinking out loud here: I cleaned up a lot of stuff from Docker, could it be that the runner still keeps the IDs of the cache containers that were used in the past and tries to find them even though they are gone (when it should just rebuild them)?
I managed to solve my "Build failed with: API error (404): no such id:" problem. The thing is that the cache can live either on the file system or in Docker cache containers. I had it in Docker, and the 'admins' of the server had installed a crontab that would remove old/unused Docker images and containers. Since my job took around 7 minutes, it would break almost every time because that crontab was throwing away the cache containers. You can see them with `docker ps -a`; basically, if you run the job and see that they are missing, then 'something' is removing those entries. This at least solves that issue and my builds can continue, but I still see the 500 errors in the logs where the coordinator fails. When that happens I don't get any updates on the screen until the job is finished. I was thinking that maybe it hits the limit of the log output, but the last time I hit that limit I got an error message saying so, which is not the case now. Hope this helps someone too.
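A quick way to check this, e.g. before and after such a cleanup job runs (the name filter is just a guess at matching the runner-created containers):

```sh
# List runner-created containers (build, predefined and cache containers all
# carry the "runner-" name prefix); if the cache ones disappear between two
# jobs, something external is deleting them.
docker ps -a --filter "name=runner-"
```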
A customer also has concurrency set to 1 but is experiencing this issue. Could this be due to the mentioned Docker issue where a container with the same name cannot be recreated later? (https://github.com/docker/docker/issues/24706)
I can confirm the same fix @riemers mentioned; we were running docker-gc on a 2hr cadence, and confirmed that it was removing the containers named like `runner-*-concurrent-*` created by the specific build.
I was able to reproduce it by launching a build and running docker-gc during that build to create a failure.
Our specific fix was to add a container exclusion pattern to docker-gc.
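For reference, with spotify/docker-gc that's done via its container-exclusion file (the path and pattern below are from memory of our setup, so treat them as an assumption and check the docker-gc README):

```sh
# Tell docker-gc to leave the GitLab Runner containers (including the cache
# containers) alone; docker-gc reads exclusion patterns from this file.
echo 'runner-' | sudo tee -a /etc/docker-gc-exclude-containers
```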
@mente the problem was actually a garbage-collection conflict: GitLab Runner was attempting to clean up after itself and failed when trying to delete containers that were no longer present (having been removed by docker-gc). After excluding them from docker-gc, the containers are cleaned up normally by GitLab Runner.
@dblessing I was able to reproduce the issue initially by just manually deleting a particular container, so it's not specific to docker-gc...
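i.e. something as simple as this, done while a build is still using the container, is enough to trigger it (the name here is only an example of the runner's naming pattern):

```sh
# Remove a container the runner still believes exists; the runner's next
# Docker API call for that container then fails with "API error (404)".
docker rm -f runner-aaaaaaaa-project-1-concurrent-0-build
```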
`Build failed with: container already exists` can happen because sometimes containers fail and then we are unable to remove them due to zombie processes. It mostly happens when using AUFS (I happen to see that quite often). So yes, it's possible, as asked in this question: https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/issues/1038#note_13920385. We have very good experience using OverlayFS on newer kernel versions (4.2 and later).
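A minimal sketch of checking/switching that (the exact steps depend on your distro and Docker packaging, and switching the storage driver makes existing images and containers invisible until you switch back):

```sh
# Check the kernel version (OverlayFS works well from 4.2) and the storage
# driver currently in use:
uname -r
docker info | grep -i 'storage driver'

# Then restart the daemon with OverlayFS, e.g. (Docker 1.11-era syntax;
# newer versions use dockerd):
# docker daemon --storage-driver=overlay
```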
`Build failed with: API error (404): no such id:` usually happens when there's an application that is cleaning up and removing old containers and Docker images. To mitigate this kind of problem we added https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/merge_requests/244, which retries a build if the failure happened during the executor preparation phase.
Please attach `gitlab-runner --debug run` logs. That will make it easier for me to debug why this happens and possibly propose a workaround / fix for the issue.
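Something like this, for a Runner installed as a system service (the service handling is an example; adjust it to how you run the Runner):

```sh
# Stop the service, then run the Runner in the foreground with debug logging,
# capturing the output to a file you can attach here.
sudo gitlab-runner stop
sudo gitlab-runner --debug run 2>&1 | tee gitlab-runner-debug.log
```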