With GitLab 8.11.3 and runner 1.5.2, we are facing the following issue when starting a Docker-based build (registry name removed / XXXed):
Running with gitlab-ci-multi-runner 1.5.2 (76fdacd)
Using Docker executor with image XXX:build ...
Pulling docker image XXX:build ...
ERROR: Build failed (system failure): API error (500): service endpoint with name runner-5bc3b764-project-112-concurrent-1-predefined already exists
The only thing that helps right now is restarting the runner and the GitLab server. Is this a known issue, or a configuration problem on our side? After several retries the build eventually starts, but subsequent builds run into the same issue.
We are experiencing the same issue with a similar configuration and Docker v1.12.1. It seems to be a Docker problem.
You can manually remove the stale container from the network with docker network disconnect -f. But the bug will appear again within a few runs.
@stanhu I saw the merge conflicts and I've asked the MR author for a rebase. If he is not able to work on this, I'll finish it myself :). But first let's test whether this patch works. I've never seen this error and it's hard to reproduce.
@tmaczukin Customer has gitlab-runner 1.7.1 installed. Can you send them a binary with the changes (https://gitlab.zendesk.com/agent/tickets/48842) and instructions on how to replace the existing one (and back up the old one)? I assume we can just apply the patch on top of 1.7.1 and ship them a binary, but maybe there's a systematic way to do this with a branch and build artifact.
@stanhu It can also be found here: https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/pipelines/5016706 (artifacts for the binaries and packages builds), or downloaded from S3 using the URLs in the development job. I've just checked that the build finished successfully. If the binary works, I will open an MR and get this into v1.8.0 :)
@tmaczukin The customer sent back some updates reporting that the same issue is reappearing, along with some more random errors:
Reported Sunday 23:
Running with gitlab-ci-multi-runner 1.7.1 (f896af7)
Using Docker executor with image containers.xxxxx.net/containers/composer:php7.0 ...
Pulling docker image containers.xxxxx.net/containers/composer:php7.0 ...
ERROR: Build failed (system failure): API error (500): service endpoint with name runner-43f97f81-project-573-concurrent-0-predefined already exists
Reported Friday 21:
We are still getting failures…
Running with gitlab-ci-multi-runner 1.7.1 (f896af7)
Using Docker executor with image docker:1 ...
Pulling docker image docker:1-dind ...
Starting service docker:1-dind ...
Waiting for services to be up and running...
*** WARNING: Service runner-ed25fbf0-project-563-concurrent-1-docker probably didn't start properly.
Also this one which seems like a network timeout but adding it for context.
Thanks @tmaczukin. Could it be that the patch was applied to the binary but is not reflected in the git revision? Let's try the bleeding edge version in any case.
The revision is set through compilation parameters, either by make or by passing them yourself. If you added the patch and compiled with go build ./..., then yes, the revision may not reflect it. I just assumed that you used make, as in our build process.
@balameb The Bleeding Edge release is built from master after each push/merge into it. Tomorrow it will be tagged as v1.8 and released as a Stable release :)
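To illustrate the point about compilation parameters, here is a minimal Go sketch of the general technique: a package-level string variable that the linker overwrites via -ldflags "-X ...". The variable name REVISION, its default value, and the versionLine helper are assumptions made for this example and are not taken from the runner's source.

```go
package main

import "fmt"

// REVISION is a placeholder that can be overwritten at link time, e.g.:
//
//	go build -ldflags "-X main.REVISION=$(git rev-parse --short HEAD)"
//
// A plain `go build` without that flag leaves the default value in place,
// so the printed revision says nothing about which patches were compiled in.
var REVISION = "HEAD"

// versionLine formats a banner similar to the first line of the job logs.
func versionLine(name, version string) string {
	return fmt.Sprintf("Running with %s %s (%s)", name, version, REVISION)
}

func main() {
	fmt.Println(versionLine("gitlab-ci-multi-runner", "1.7.1"))
}
```

Either way, the revision string is just whatever was passed at build time, so a patched working tree built from the same commit would still print the old revision.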
And the patch for unplugging containers from docker networks is merged into master.
Running with gitlab-ci-multi-runner 1.9.0~beta.4.g81a23ec (81a23ec)
Using Docker executor with image containers.spinen.net/containers/shellcheck:master ...
WARNING: Can't disconnect possibly zombie container runner-43f97f81-project-573-concurrent-3-predefined from network bridge -> No such network (2315430c102c90dee4a2adf89aedd614cf9ab6aa5f4aa0a0330c1e2b2a9b056a) or container (runner-43f97f81-project-573-concurrent-3-predefined)
Pulling docker image containers.spinen.net/containers/shellcheck:master ...
ERROR: Build failed (system failure): API error (500): service endpoint with name runner-43f97f81-project-573-concurrent-3-predefined already exists
We are getting tons of these still on several of the jobs.
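If you want to find the affected endpoints in a pile of job logs, a small sketch like the following can pull the name out of the error text. It assumes the message always has the exact shape shown in the logs in this thread; staleEndpoint is a hypothetical helper, not something the runner provides.

```go
package main

import (
	"fmt"
	"regexp"
)

// endpointExists matches the Docker error seen in this thread and captures
// the endpoint (container) name. The message format is taken from the logs
// above and may differ between Docker versions.
var endpointExists = regexp.MustCompile(`service endpoint with name (\S+) already exists`)

// staleEndpoint returns the endpoint name from an error message,
// or false if the message is some other error.
func staleEndpoint(msg string) (string, bool) {
	m := endpointExists.FindStringSubmatch(msg)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	msg := "ERROR: Build failed (system failure): API error (500): service endpoint with name runner-43f97f81-project-573-concurrent-3-predefined already exists"
	if name, ok := staleEndpoint(msg); ok {
		fmt.Println(name)
	}
}
```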
The patch from !390 (merged) is probably not working because it doesn't use the --force flag. In the mentioned Docker issue, it was noted several times that docker network disconnect works in such a case only with the --force flag.
I've prepared !432 (closed), which adds the missing flag. However, for now this MR should be treated as a proof of concept. It changes a vendored library, which means it will not be merged to master in this form. But we can use the binary to test whether this resolves the problem.
!432 (closed) will be rebased on top of !301 (merged) (which replaces the Go Docker library in use with the "official" one from github.com/docker/docker/client) once Docker 1.13 is released and !301 (merged) is ready to merge. Then we can replace the calls updated in !432 (closed) and bring the patch into the stable version.
Builds for the MR are in progress. I'll post download links for the built version here once they finish.
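For context, at the Docker Engine API level the flag corresponds to a Force field in the JSON body of POST /networks/{id}/disconnect. The sketch below only builds that request path and body with the standard library. It is not the code from !390 or !432, and the helper names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// disconnectRequest mirrors the JSON body of the Docker Engine API call
// POST /networks/{id}/disconnect.
type disconnectRequest struct {
	Container string `json:"Container"`
	Force     bool   `json:"Force"`
}

// disconnectPath returns the API path for a given network name or ID.
func disconnectPath(network string) string {
	return "/networks/" + url.PathEscape(network) + "/disconnect"
}

// disconnectBody serializes the request; force=true is the API-level
// equivalent of the CLI's --force flag.
func disconnectBody(container string, force bool) (string, error) {
	b, err := json.Marshal(disconnectRequest{Container: container, Force: force})
	return string(b), err
}

func main() {
	// Example values taken from the logs in this thread.
	body, err := disconnectBody("runner-43f97f81-project-573-concurrent-3-predefined", true)
	if err != nil {
		panic(err)
	}
	fmt.Println("POST", disconnectPath("bridge"))
	fmt.Println(body)
}
```

Without Force set to true, the request is the same one that the unforced docker network disconnect would send.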
Is the real problem a service container name collision?
My project has six builds, all using a mysql service, and this problem happens often but randomly.
Is there any timeline for fixing this bug? We experienced this issue in our production environment and have to restart all runner instances occasionally.
You can already test this with v9.0.0-rc.2. This coming Wednesday it will be released as a stable release, and we will backport it to v1.10.x and v1.11.x after that.
I'm closing this issue since !301 (merged) should resolve it. If the problem is still present after upgrading to v9.0.0 (or to a v1.10.x/v1.11.x release that includes the patch), feel free to reopen it with new reports :).