GitLab runners (shared and internal) problem at 2017-01-04 - postmortem
This is a postmortem of what happened with GitLab runners (shared and internal) at 2017-01-04.
As I'm not sure if this description contains any security-sensitive data (like the fact that we're using our own version of docker-machine), I'm opening this as a confidential issue. This can be changed later.
First reports and first checks
On the evening of 2017-01-04 (~21:42 UTC) users started to notify us that builds at GitLab.com weren't starting - https://gitlab.slack.com/archives/infrastructure/p1483566167005496:
connor [21:42]
@northrup I think there are problems with the runners
a user in IRC and a few on Twitter are mentioning problems
Because I had just noticed the same thing (the release build for GitLab Runner 1.9.2 had been pending for more than an hour), I jumped in to see what was happening. After looking into the logs on shared-runners-manager-1.gitlab.com and shared-runners-manager-2.gitlab.com I noticed that:
- there are a lot of lines recording that machine creation is failing:
Jan 4 22:02:19 shared-runners-manager-1.gitlab.com gitlab-ci-multi-runner[30042]: time="2017-01-04T22:02:19Z" level=error msg="Unable to verify the Docker daemon is listening: Maximum number of retries (10) exceeded" name=runner-XYZ operation=provision
Jan 4 22:02:19 shared-runners-manager-1.gitlab.com gitlab-ci-multi-runner[30042]: time="2017-01-04T22:02:19Z" level=error msg="Machine creation failed" error="exit status 1" fields.time=3m18.414396122s name=runner-XYZ time=3m18.414396122s
Jan 4 22:02:19 shared-runners-manager-1.gitlab.com gitlab-ci-multi-runner[30042]: time="2017-01-04T22:02:19Z" level=warning msg="Removing machine" created=3m18.415029648s name=runner-XYZ now=2017-01-04 22:02:19.448757963 +0000 UTC reason="Failed to create" used=3m18.415033487s
- the last build on a docker-machine host was started at 20:51 UTC:
Jan 4 20:51:10 shared-runners-manager-1.gitlab.com gitlab-ci-multi-runner[30042]: time="2017-01-04T20:51:10Z" level=info msg="Starting docker-machine build..." build=8196083 created=2017-01-04 20:49:12.075580173 +0000 UTC docker="REDACTED" name=runner-ZYX now=2017-01-04 20:51:10.269379183 +0000 UTC project=1959952 runner=REDACTED usedcount=1
Graphs in Grafana also showed that something was wrong with shared runners:
I have a feeling, but it needs to be confirmed
I assumed that something was wrong with the host creation done by docker-machine. To confirm this I made a manual test:
root@shared-runners-manager-1:~# docker-machine create --driver=digitalocean [DO related options from config.toml] test-machine-1
(... normal logs for docker-machine, like "waiting for SSH" ...)
Error creating machine: Error running provisioning: Unable to verify the Docker daemon is listening: Maximum number of retries (10) exceeded
root@shared-runners-manager-1:~#
So the machine was provisioned, but there was an error with starting the Docker daemon. To check this I SSH-ed into the machine:
root@shared-runners-manager-1:~# docker-machine ssh test-machine-1
Last login: Wed Jan 4 22:49:55 UTC 2017 from REDACTED on ssh
Container Linux by CoreOS stable (1235.4.0)
Failed Units: 1
docker.service
CoreOS reported that the docker.service systemd unit didn't start properly. Let's check why:
core@test-machine-1 ~ $ systemctl status docker.service
● docker.service - Docker Socket for the API
Loaded: loaded (/etc/systemd/system/docker.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2017-01-04 22:49:27 UTC; 1min 8s ago
Process: 1583 ExecStart=/usr/lib/coreos/dockerd daemon --host=unix:///var/run/docker.sock --host=tcp://0.0.0.0:2376 --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/do...
Main PID: 1583 (code=exited, status=1/FAILURE)
Jan 04 22:49:25 test-machine-1 systemd[1]: Started Docker Socket for the API.
Jan 04 22:49:27 test-machine-1 dockerd[1583]: dockerd: "dockerd" requires 0 arguments.
Jan 04 22:49:27 test-machine-1 dockerd[1583]: Usage: dockerd [OPTIONS]
Jan 04 22:49:27 test-machine-1 systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Jan 04 22:49:27 test-machine-1 systemd[1]: docker.service: Unit entered failed state.
Jan 04 22:49:27 test-machine-1 systemd[1]: docker.service: Failed with result 'exit-code'.
So the service isn't starting because of an unexpected daemon argument passed to the /usr/lib/coreos/dockerd command - and that's not good news, because this configuration is generated automatically.
I remembered that a few days ago we had switched from the coreos-beta image to the coreos-stable image for our auto-scaled DO droplets. So the next step was to repeat the manual test, but using the coreos-beta image. The test ended with the same problem: the machine is provisioned, the Docker daemon is not responding; SSH in, check the service status: not starting because of an unexpected argument for the command.
So the image change is not the cause of the problem. This is also confirmed by the pending builds graph - the coreos-stable image had been in use for the last few days, while the problem started to be visible at ~13:00 UTC yesterday and escalated at ~21:30 UTC.
We're almost there, but still need to check a few things
First idea: if we're using the same image but the problem came out of nowhere, then maybe it's a problem with the images? Maybe they were updated?
I sent a question to DO. But I also decided to make one more test: create a clean droplet based on the coreos-stable image, without docker-machine's help, to check if the unit definition is invalid. And now a surprise - the clean droplet doesn't contain Docker at all! OK, so Docker has to be installed in an invalid way by docker-machine, and the image used has nothing to do with it. (edited: please read https://gitlab.com/gitlab-com/infrastructure/issues/957#note_20955480; Docker was probably present, but there was no systemd configuration for the docker service. The assumption that it's docker-machine that was installing this file in an invalid way is still right.)
I called off the alarm with DigitalOcean and opened the docker-machine sources. The cause had to be there!
Got you!
Before we go further, one note. For some reasons we're using our own build of docker-machine, which is based on the official v0.8.0 version. Why it's done like this is out of the scope of this problem (just like the question "can we switch back to the official version?"). But it's good to know that we have a version compiled from our own sources. And, as the future will show, this will help us a lot in a few moments.
After a few minutes I found that the configuration file is created by libmachine/provision/coreos.go. I'm linking the official sources, but in our fork this file is exactly the same. And here we have the culprit: dockerd daemon is hardcoded into the unit file. OK, so let's check how it looks in the master branch. Way better! The daemon argument is added only for Docker versions older than 1.12. And yes, if you're using Docker you're starting to see that something strange is happening here, but I'll get back to this later.
OK, so our modified version (v0.8.0) is bad, and current master is good. But current master is the development version - let's see how it looks in the current stable version. A quick look at the releases shows it's v0.8.2. The same as in v0.8.0!
So the current stable version of docker-machine contains the same "bug". And what is strange: the command should be different for Docker versions >= 1.12.0, Docker 1.12 was released about five months ago (we have 1.12.5 currently), and yet docker-machine was working properly with the dockerd daemon command until today, when it started to complain about an unexpected argument for dockerd. Huh!
We know where we are. Now, let's get out of here!
Because we currently use our own compiled version of docker-machine, the problem seemed relatively easy to fix. The libmachine/provision/coreos.go file was updated with a version check and the daemon argument for the dockerd command was made optional. Manual compilation, manual test, and the solution is confirmed - it's working.
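For illustration, here is a minimal sketch of that kind of version check, with made-up function names rather than the actual libmachine code: the daemon argument is appended to the generated ExecStart line only when the detected Docker version is older than 1.12.

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// dockerDaemonArg returns the subcommand to append to the dockerd invocation
// in the generated systemd unit. Following the upstream master branch
// behaviour, the "daemon" argument is only added for Docker versions older
// than 1.12; newer versions ship a standalone dockerd binary that takes no
// subcommand. Illustrative sketch only, not the actual
// libmachine/provision/coreos.go code.
func dockerDaemonArg(dockerVersion string) string {
    parts := strings.SplitN(dockerVersion, ".", 3)
    if len(parts) < 2 {
        return ""
    }
    major, errMajor := strconv.Atoi(parts[0])
    minor, errMinor := strconv.Atoi(parts[1])
    if errMajor != nil || errMinor != nil {
        return ""
    }
    if major < 1 || (major == 1 && minor < 12) {
        return " daemon"
    }
    return ""
}

func main() {
    // With Docker 1.12.5 (as found on the broken machines) the argument is
    // omitted; with an older engine it would still be added.
    for _, version := range []string{"1.11.2", "1.12.5"} {
        fmt.Printf("Docker %s -> ExecStart=/usr/lib/coreos/dockerd%s --host=unix:///var/run/docker.sock ...\n", version, dockerDaemonArg(version))
    }
}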
The new version of docker-machine was deployed on shared-runners-manager-1 and shared-runners-manager-2, and after a few minutes the graphs in Grafana showed that pending builds were starting to be executed.
A few minutes of waiting to be 100% sure that the problem was resolved, and after this the other four managers were updated.
The problem is resolved. We can go to sleep!
Not so fast!
OK, not yet.
We know what happened, and we know what helped. We still need to know why this happened and how to notice similar problems earlier in the future.
For the question of why - I have no idea. The systemd configuration had been provisioned in the same way since June/July 2016. Version 1.12.0 of Docker was released on 2016-07-28, and version 1.12.3 (which I found on the machine after logging on to it) was released on 2016-10-26. Since then until yesterday this configuration was working without problems. And now it's not. And yes - a manual test of running dockerd daemon throws the message dockerd: "dockerd" requires 0 arguments. Were the Docker binaries updated yesterday? I have no clue here, and I think it would be good to use some contacts with Docker people to find out why this happened. The graph of build metrics suggests that something happened at ~2017-01-04T13:00:00.
For the question of how to discover such problems earlier in the future, the discussion is open. But I have some initial ideas (a rough sketch of the corresponding checks follows the list).
- Shared runners pending builds metric
This is probably not the best name, because this is not a graph of pending builds for shared runners, but of pending builds for projects with shared runners enabled. But looking at this graph we can see that the number of such pending builds became way bigger than in the previous few hours. Querying Grafana for the last 7 days shows that the maximum value (~2.5K pending builds) was more than 5 times bigger than the previous maximum over those 7 days (~0.5K pending builds). The problem was "obvious" to a log reader after ~21:00 UTC, when the pending builds count reached ~1.2-1.3K, but if we had been notified that the pending builds count reached 2*maximum_from_last_7_days, with some magical, mathematical formula that also includes the average from this period, then maybe we would have found this problem 6 hours earlier.
Idea: think about setting a maximum threshold for the number of pending builds (in correlation with an average).
- Builds running on runners owned by GitLab Inc. metric
Oh! This one is even better. Looking at the shared runners graph we can see that at ~21:10 UTC the number of builds executed on shared runners dropped to zero. And the number of builds on shared runners (almost?) never drops to zero! 21:00 UTC is a few hours after the problem probably showed up (the pending builds count had been growing since ~13:00 UTC, so for 8 hours!) but it's a few minutes before the first user notifications were received and a few minutes after the last picked-up build was started. At this moment we could have tweeted about the problem and made users aware of it before they noticed it themselves.
Idea: Set an alert if the number of builds on shared runners (and probably others as well) is 0.
- Rate of machines creation/deletion metric
Looking at the shared-runners created and shared-runners used graphs here, we can see that at ~20:52 UTC the rate of created and used machines dropped to 0. This is exactly the same time when the last "Starting docker-machine build..." log line was reported. Like the previous one, it's almost at the time when the problem was seen by users and escalated, but it would still have given us almost half an hour to check what's happening and tweet that there are problems we're investigating.
Idea: Set an alert if the rate of machines creation and usage for shared runners (and probably others as well) is 0.
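As a rough sketch of how these three checks could be expressed (the 2x threshold and the function names are my own assumptions, not an existing alerting configuration), something along these lines could be evaluated against the metrics we already collect:

package main

import "fmt"

// pendingBuildsAlert covers the first idea: fire when the current number of
// pending builds exceeds twice the maximum observed over the last 7 days.
// A real rule would likely also factor in an average, as described above.
func pendingBuildsAlert(current, max7d float64) bool {
    return current > 2*max7d
}

// zeroActivityAlert covers the second and third ideas: fire when the number
// of builds running on our runners, or the rate of machine creation/usage,
// drops to zero.
func zeroActivityAlert(runningBuilds, machinesCreatedPerMinute float64) bool {
    return runningBuilds == 0 || machinesCreatedPerMinute == 0
}

func main() {
    // Values roughly matching this incident: ~2.5K pending builds against a
    // previous 7-day maximum of ~0.5K, and shared-runner activity at zero.
    fmt.Println("pending builds alert:", pendingBuildsAlert(2500, 500))
    fmt.Println("zero activity alert:", zeroActivityAlert(0, 0))
}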
What more can we do? The discussion is open; feel free to ask questions and propose solutions/actions to be taken.