We have a customer who would like some assistance with GitLab CI. They're unsure whether their current setup provides optimal performance. They're also encountering delays or unprocessed requests when cancelling builds.
This issue will serve as a way to communicate any potential causes of these problems or alternative configurations with support, CI team and customer.
We have been struggling with stability issues with GitLab CI for the past few weeks. For some context, we use GitLab CI to interact with Docker containers, which do all of the specialized work. There are a few specific issues we are experiencing:
Our builds will fairly often hang in that they are running, but nothing is happening.
When we cancel the build, the GitLab UI shows the build as cancelled. However, the CI runner still continues. In addition, pending jobs are not picked up to be run. Our administrators need to manually kill the Docker process, after which the pending jobs will be run.
It would be very helpful to have our issues answered. In addition, best practices for running GitLab CI with Docker builds would be great. The documentation offers options; however, just telling us how to do it the right way would be more valuable.
* GitLab version: GitLab Enterprise Edition 8.9.4-ee ea8a665
* gitlab-ci-runner version: 1.3.2
* Docker version:
  Client:
   Version:      1.11.2
   API version:  1.23
   Go version:   go1.5.4
   Git commit:   b9f10c9
   Built:        Wed Jun 1 21:47:50 2016
   OS/Arch:      linux/amd64
  Server:
   Version:      1.11.2
   API version:  1.23
   Go version:   go1.5.4
   Git commit:   b9f10c9
   Built:        Wed Jun 1 21:47:50 2016
   OS/Arch:      linux/amd64
* OS for GitLab CI Runner:
  Distributor ID: Ubuntu
  Description:    Ubuntu 14.04.2 LTS
  Release:        14.04
  Codename:       trusty
* OS for GitLab: we are running the Docker image gitlab/gitlab-ee:8.9.4-ee.1
@MrChrisW I see that the user uses shell executors, but the description talks about interacting with Docker containers that need to be killed manually after a build is cancelled. The hanging builds are also strange.
Can you get from the user:
an example .gitlab-ci.yml file of a project with those docker-related builds,
an example .gitlab-ci.yml file of a project with those hanging builds (it can be the same as the above one if those are the same builds),
a log file of GitLab Runner's process (as it is; if this is not enough, we will need to restart Runner with --debug to get more data)?
Having this data we can try to investigate what is happening :)
We had a build that had been hanging for 42 hours. I cancelled it in the UI; the docker process continued. I ended it with "docker stop CONTAINER_ID". Relevant logs are attached.
@MrChrisW Did you restart the service after adding -l "debug"? I can't see any debug information in the syslog files. Can we also get log files /var/log/upstart/gitlab-runner.log* from the period when the above problem occurred?
We can see that the watch -n 1 date command has PID=13483 and PGID=13480, which is also the PGID and PID of the /bin/bash --login ... command.
Let's execute a command: kill -9 13480 (notice, that there is no - sign before PID) which should kill the /bin/bash ... process (and which I assume should also kill all children/related processes)
In the first terminal we should see that the process was terminated, but - at least on my computer - I still can see the numbers updated each second on top of the screen!
Let's execute a command:
$ ps jx | grep -e watch -e PGID | grep -v grep
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    1 13483 13480 12374 pts/0    12374 S     1000   0:00 watch -n 1 date
We can see that the /bin/bash ... command was terminated, but the watch -n 1 date command still exists! This is confirmed by the time still updating in my first terminal.
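For anyone who wants to reproduce the manual test above without two terminals, here is a minimal Python sketch. It is not Runner code. The function names are made up for illustration, and it uses sleep 30 as a stand-in for watch -n 1 date. It assumes a POSIX system with /bin/sh. The child is considered alive for as long as it keeps the shell's stdout pipe open.

```python
import os
import select
import signal
import subprocess
from typing import Tuple


def start_shell_with_child() -> subprocess.Popen:
    """Start a shell in its own process group that forks a long-running child."""
    proc = subprocess.Popen(
        ["/bin/sh", "-c", "sleep 30 & echo ready; wait"],
        stdout=subprocess.PIPE,
        bufsize=0,
        start_new_session=True,  # the shell becomes a process group leader
    )
    proc.stdout.readline()  # "ready" is printed after the child was forked
    return proc


def tree_is_gone(proc: subprocess.Popen, timeout: float = 1.0) -> bool:
    """True once no process holds the write end of the shell's stdout pipe.

    The forked `sleep` inherits that pipe, so end-of-file on the read end
    means the shell and its child have both exited.
    """
    readable, _, _ = select.select([proc.stdout], [], [], timeout)
    return bool(readable) and proc.stdout.read(1) == b""


def demo() -> Tuple[bool, bool]:
    proc = start_shell_with_child()
    pgid = os.getpgid(proc.pid)

    os.kill(proc.pid, signal.SIGKILL)  # like `kill -9 PID`: only the shell dies
    proc.wait()
    child_survived = not tree_is_gone(proc)

    os.killpg(pgid, signal.SIGKILL)  # like `kill -9 -PGID`: the whole group
    child_gone = tree_is_gone(proc)
    proc.stdout.close()
    return child_survived, child_gone


if __name__ == "__main__":
    print(demo())
```

The first value reports whether the child outlived a plain kill of the shell, the second whether the process group kill removed it.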
Test using runner
Let's use a simple project with such .gitlab-ci.yml:
test:
  script:
    - echo test
    - watch -n 1 date
Using exec command
Let's execute a command in the project's directory in the first terminal: gitlab-runner exec shell --timeout 240 test
We should see a date and time string updated each second
Let's execute a command in second terminal:
$ ps jx | grep -e watch -e "bash --login" -e PGID | grep -v grep
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
16315 16340 16340 12374 pts/0    16315 S     1000   0:00 bash --login
16340 16345 16340 12374 pts/0    16315 S     1000   0:00 bash --login
16345 16346 16340 12374 pts/0    16315 S     1000   0:00 watch -n 1 date
Let's execute a command: kill -9 16340 (normal kill)
Let's execute a command:
$ ps jx | grep -e watch -e "bash --login" -e PGID | grep -v grep
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    1 16345 16340 12374 pts/0    16315 S     1000   0:00 bash --login
16345 16346 16340 12374 pts/0    16315 S     1000   0:00 watch -n 1 date
We can see that watch -n 1 date command is still running.
Let's kill those processes (kill -9 16345; kill -9 16346) and repeat all steps, but using a process group kill (with - just before the PGID, e.g. kill -9 -16340) in the 3rd step:
$ ps jx | grep -e watch -e "bash --login" -e PGID | grep -v grep
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
17686 17770 17770 17573 pts/3    17686 S     1000   0:00 bash --login
17770 17774 17770 17573 pts/3    17686 S     1000   0:00 bash --login
17774 17775 17770 17573 pts/3    17686 S     1000   0:00 watch -n 1 date
Let's trigger the cancel button:
Appending trace to coordinator... ok build=3657756 build-log=0-2666 build-status=running code=202 runner=375b9fd1 sent-log=2570-2666 status=202 Accepted
WARNING: Appending trace to coordinator aborted build=3657756 build-log=0-2738 build-status=canceled code=202 runner=375b9fd1 sent-log=2666-2738 status=202 Accepted
Waiting for build to finish... build=3657756 error=canceled project=1608547 runner=375b9fd1
Aborting command... build=3657756 project=1608547 runner=375b9fd1
WARNING: Build failed: canceled build=3657756 project=1608547 runner=375b9fd1
WARNING: Submitting build to coordinator... aborted build=3657756 runner=375b9fd1
Let's execute a command:
$ ps jx | grep -e watch -e "bash --login" -e PGID | grep -v grep
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
As we can see, all commands are terminated.
I've repeated these tests with Runner versions v1.3.0, v1.4.1, v1.4.2 and v1.5.2. Each time all commands were terminated.
Test on production environment
While reproducing this on the production environment, we found that although the /bin/bash ... script is terminated, a sleep 300 command keeps running. We did the test using the cancel button in the GitLab UI. Both GitLab and Runner were installed on production infrastructure.
Conclusions
The error seems to be related to the environment. Locally I'm using a different operating system than the one the production environment used for the test. It looks like the production OS is somehow not doing a "process group" kill and is killing only the main process.
I'll try to repeat the runner-based test with a configuration similar to the production one.
Once we find out why Runner does not do a process group kill on the production environment, we should consider whether we can handle this in Runner itself or whether we should update the documentation to describe the required OS configuration.
After @ayufan's suggestions I've made a test with a Runner executed as a system service. In such configuration Runner is mostly running with a root UID, but it uses another UID (eg. gitlab-runner) to execute build scripts.
Test
Let's repeat the Using normal workflow with Runner <-> GitLab communication test but with Runner executed as a system service instead of gitlab-runner run ... command. In such test - after git push - we should see an output:
Runner, when configured to execute builds as other UID than used for Runner execution itself, is using su command to change user rights. As we can see in first ps ... output, su ...'s PGID is different than bash --login's PGID:
When Runner is triggered to abort a build (by a timeout or by the cancel button in GitLab's UI), it sends a SIGKILL signal to the process's PGID. In the previous tests, where Runner was executed as gitlab-runner run without a different UID for build scripts, the bash --login command had the same PGID as the script executed in that shell. But when su ... was used to change the UID of a build executed by Runner as a system service, Runner sent SIGKILL to the PGID of the su ... command, leaving the rest of the command tree running.
I've reproduced this on two different Linux distributions, so this isn't an environment-related bug as I thought earlier.
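The effect can be imitated outside of Runner with a small Python sketch. This is only a simulation under stated assumptions: a tiny Python wrapper plays the role of su ... and deliberately starts the "build script" (sleep 30) in a new session, so the two end up with different PGIDs as in the ps output described above. WRAPPER, pipe_holders_gone and demo are hypothetical names, and nothing here calls the real su or Runner.

```python
import os
import select
import signal
import subprocess
import sys

# Hypothetical stand-in for `su`: it starts the "build script" in a new
# session, so the script's PGID differs from the wrapper's PGID.
WRAPPER = """
import subprocess
child = subprocess.Popen(["sleep", "30"], start_new_session=True)
print(child.pid, flush=True)
child.wait()
"""


def pipe_holders_gone(stream, timeout: float = 1.0) -> bool:
    """True once no process keeps the write end of the pipe open."""
    readable, _, _ = select.select([stream], [], [], timeout)
    return bool(readable) and stream.read(1) == b""


def demo():
    wrapper = subprocess.Popen(
        [sys.executable, "-c", WRAPPER],
        stdout=subprocess.PIPE,
        bufsize=0,
        start_new_session=True,
    )
    script_pid = int(wrapper.stdout.readline().decode())
    wrapper_pgid = os.getpgid(wrapper.pid)
    script_pgid = os.getpgid(script_pid)

    # What the abort does in this scenario: SIGKILL to the wrapper's PGID only.
    os.killpg(wrapper_pgid, signal.SIGKILL)
    wrapper.wait()
    script_survived = not pipe_holders_gone(wrapper.stdout)

    # Clean up the leftover script through its own process group.
    os.killpg(script_pgid, signal.SIGKILL)
    script_gone = pipe_holders_gone(wrapper.stdout)
    wrapper.stdout.close()
    return wrapper_pgid != script_pgid, script_survived, script_gone


if __name__ == "__main__":
    print(demo())
```

The three returned values say whether the PGIDs differ, whether the script outlived the group kill aimed at the wrapper, and whether a kill aimed at the script's own PGID finally removed it.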
Conclusions
The problem exists because su ... command doesn't have the same PGID as shell and build script executed through this su ... command.
What could we do to fix this? Currently I see two options:
1. Try to find the PIDs of the children of the su ... command and then send SIGKILL to the PID of su ... and to the children's PGIDs.
2. Send a SIGTERM or SIGINT signal to the su ... command. This will allow su to send the same signal to all child process groups before exiting itself.
Fix 2 is easier to implement, but it may not work if the build process catches the SIGTERM/SIGINT signal and handles it in some way other than terminating. Fix 1 would kill all processes, but it's harder to implement.
We could also try to mix both solutions: send SIGTERM/SIGINT to the su ... command, wait a moment, and if the process is not terminated, then find all children and send SIGKILL to all PGIDs.
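To make the mixed approach a bit more concrete, here is a rough Python sketch of the idea. It is not what !336 implements. It assumes Linux, because it reads /proc/<pid>/stat to find descendants, and all function names are hypothetical. Two choices in it are assumptions of this sketch rather than part of the proposal: it snapshots the tree's PGIDs before sending SIGTERM (children are reparented once their parent exits, which makes them hard to find afterwards), and it sends SIGKILL to those groups even if the top process exited in time. The test wrapper ignores SIGTERM to force the escalation path.

```python
import os
import select
import signal
import subprocess
import sys
from typing import Dict, List, Set, Tuple


def read_process_table() -> Dict[int, Tuple[int, int]]:
    """Map pid -> (ppid, pgid) by parsing /proc/<pid>/stat (Linux only)."""
    table: Dict[int, Tuple[int, int]] = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat", errors="replace") as stat_file:
                stat = stat_file.read()
        except OSError:
            continue  # the process exited while we were scanning
        # After the ")" that closes the command name: state, ppid, pgrp, ...
        fields = stat.rsplit(")", 1)[1].split()
        table[int(entry)] = (int(fields[1]), int(fields[2]))
    return table


def tree_pgids(root_pid: int) -> Set[int]:
    """Collect the PGIDs of root_pid and all of its descendants."""
    table = read_process_table()
    children: Dict[int, List[int]] = {}
    for pid, (ppid, _pgid) in table.items():
        children.setdefault(ppid, []).append(pid)
    pgids: Set[int] = set()
    stack = [root_pid]
    while stack:
        pid = stack.pop()
        if pid in table:
            pgids.add(table[pid][1])
        stack.extend(children.get(pid, []))
    return pgids


def stop_tree(proc: subprocess.Popen, grace: float = 2.0) -> None:
    """Send SIGTERM first, then SIGKILL every process group seen in the tree."""
    pgids = tree_pgids(proc.pid)  # snapshot before orphans get reparented
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        pass  # it ignored SIGTERM, escalate below
    own_pgid = os.getpgrp()
    for pgid in pgids:
        if pgid in (0, own_pgid):
            continue  # never signal our own process group
        try:
            os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # that group already exited
    proc.kill()  # no-op if the process was already reaped
    proc.wait()


# Hypothetical wrapper: ignores SIGTERM and runs the script in another session.
STUBBORN_WRAPPER = """
import signal, subprocess
signal.signal(signal.SIGTERM, signal.SIG_IGN)
child = subprocess.Popen(["sleep", "30"], start_new_session=True)
print("ready", flush=True)
child.wait()
"""


def demo() -> bool:
    proc = subprocess.Popen(
        [sys.executable, "-c", STUBBORN_WRAPPER],
        stdout=subprocess.PIPE, bufsize=0, start_new_session=True,
    )
    proc.stdout.readline()  # wait until the script has been spawned
    stop_tree(proc, grace=0.5)
    # EOF on the pipe means no process from the tree is left holding it.
    readable, _, _ = select.select([proc.stdout], [], [], 2.0)
    gone = bool(readable) and proc.stdout.read(1) == b""
    proc.stdout.close()
    return gone


if __name__ == "__main__":
    print(demo())
```

A real implementation would also have to think about PGID reuse during the grace period and about processes that detach from the tree completely, neither of which this sketch covers.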
@dblessing The fix is ready (but not reviewed nor merged yet). Can we ask customers to test this version? It can be downloaded from links in this MR: !336.
Think I'm hitting this issue -- I have a shell executor that is running a docker run command, and I had a container hanging around that the runner failed to terminate. Manually running docker rm -f on the container unstuck the runner.
Does that sound like this issue (in which case I'll follow here waiting for a resolution), or should I open another issue to track it? Thanks.
I'm also hitting this issue. I'm using shell executor and it's running docker-compose. When a build is cancelled via the GitLab web interface, the build still continues running on the CI server.
@MrChrisW sounds like maybe the docker 1.12.4 issue (think the docker bug is https://github.com/docker/docker/issues/29421) might not be the whole story here, if we're seeing problems on docker 1.12.3 and 1.12.5 as well?
Same issue. I'm using the shell executor and it's running docker-compose. When a build is cancelled via the GitLab web interface, the next build stays pending indefinitely. I need to run gitlab-runner restart on the server, and then the build runs correctly.
@cliffwoolley While we can assume that su will be present on almost every *nix system, gosu is another requirement that would need to be installed on the host to make Runner work. We already have some requirements and I would like to avoid adding others unless they are 100% necessary. If we need to replace su with something else, I prefer to implement this inside Runner, since it should not be hard to do (and that's what !336 is doing).
@cliffwoolley Thanks for mentioning gosu, though. Even if I don't like the idea of relying on it, using github.com/opencontainers/runc/libcontainer/user (used there) instead of our own regexp-based solution may be worth considering. I will look into this :)
Same issue here. Occasionally when we are merging a bunch of MR's to master, a whole bunch of pipelines start at the same time. Not wanting all of these pipelines to run, we cancel a few but the jobs still persist on the runners, slowing things down.
It seems that if there is a long-running process underway in the script (building a Node.js app or compiling an Android app, one single command), the process is not killed. Any other commands in the script and any jobs scheduled to run after the current job are cancelled; it's just the current process(es) that are not killed. Is this by design?
Is this still not fixed? This is a huge issue for us, and it's depressing there's been no progress in months. It really makes git's pipelines unusable for critical deployments.