Commit 49aa226c authored by Kamil Trzcinski, committed by Alex Hanselka

Add details about CI architecture, graphs and troubleshooting

 
### CI
 
* [Introduction to Shared Runners](troubleshooting/ci_introduction.md)
* [Understand CI graphs](troubleshooting/ci_graphs.md)
* [Large number of CI pending builds](troubleshooting/ci_pending_builds.md)
* [The CI runner manager reports a high DO Token Rate Limit usage](troubleshooting/ci_runner_manager_do_limits.md)
* [The CI runner manager reports a high number of errors](troubleshooting/ci_runner_manager_errors.md)
* [Runners cache is down](troubleshooting/runners_cache_is_down.md)
New images added under `img/ci/`: auto_scaling_details.png, jobs_graph.png, long_polling_state.png, long_polling_states.png, machine_creation.png, queued_histogram.png, requests_handled.png, requests_queued.png, runner_details.png

## CI graphs
When you go to https://performance.gitlab.net/dashboard/db/ci you will see a number of graphs.
This document explains what you see and what each value indicates.
## GitLab-view of Jobs
![](../img/ci/jobs_graph.png)
* **pending jobs for projects with shared runners enabled**:
a list of "potential" builds that are in the queue and could be picked up by shared runners.
Today it represents the number of pending jobs for projects that have shared runners enabled.
Currently we cannot filter this value by "stuck builds",
"tag matching of runners", or "shared runners minutes",
so it is possible that this value is artificially high.
This is subject to change.
* **pending jobs for projects without shared runners enabled**:
similar to the previous one, but for projects that do not have shared runners enabled.
* **running jobs on shared runners**:
the current number (from GitLab's perspective) of jobs running on runners marked as shared.
* **running jobs on specific runners**:
the current number (from GitLab's perspective) of jobs running on runners marked as specific.
* **stale jobs**:
jobs that are "running" but have not been updated for the last hour.
These are jobs that can be considered dead, for example because someone shut the runner down or the runner crashed.
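If you need the raw number behind the **stale jobs** series, a minimal sketch (assuming `gitlab-rails runner` is available on an app node and the `Ci::Build` scopes are as in the current codebase) is:
```bash
# Count jobs still marked as "running" that were not updated for the last hour.
sudo gitlab-rails runner "puts Ci::Build.running.where('updated_at < ?', 1.hour.ago).count"
```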
## Runner-view of Jobs
![](../img/ci/runner_details.png)
These graphs represent data exported from the Runner Manager via its Prometheus exporter.
The first graph shows the number of jobs currently running on a specific Runner Manager.
The second graph shows, from GitLab's perspective, the number of jobs running on:
* **private-runners**: specific runners owned by GitLab Inc. (not all GitLab specific runners), currently: `docker-ci-X.gitlap.com`,
* **shared-runners-gitlab-org**: shared runners owned by GitLab Inc. that are used for running jobs tagged `gitlab-org`: `gitlab-shared-runners-manager-X.gitlab.com`,
* **shared-runners**: shared runners owned by GitLab Inc. that are used for running all public jobs: `shared-runners-manager-X.gitlab.com`.
The third graph shows, from the runner's point of view, how many jobs are currently in a given stage.
Most of the stages are self-explanatory, except one:
* **stage**: the name of this stage is currently misleading and can be considered an error.
It indicates that the job is in the preparation phase: downloading Docker images and configuring services.
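If you need the raw numbers behind these graphs, you can scrape the Runner Manager's Prometheus endpoint directly. This is a hedged sketch: the listener address/port and the exact metric names depend on the local `config.toml` and Runner version.
```bash
# Dump the job-related metrics exported by this Runner Manager
# (port 9252 and the gitlab_runner_jobs metric name are assumptions; check config.toml).
curl -s http://localhost:9252/metrics | grep -E '^gitlab_runner_jobs'
```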
## Runners Manager: Auto-scaling
![](../img/ci/auto_scaling_details.png)
This is a very important graph as it represents the health of auto-scaling.
You can read more about auto-scaling of Docker in this document: https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/blob/master/docs/configuration/autoscale.md#autoscaling-algorithm-and-parameters.
This graph is a `gauge`, so it doesn't represent a rate of change
but the state at a given moment.
The naming of the groups is the same as in the previous paragraph;
what is interesting are the states:
* **acquired**: the number of machines that are "locked" for requesting jobs from GitLab;
it roughly translates to the number of requests executed by the runner against GitLab's job request endpoint.
The high number is a result of the change described in this MR: https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/merge_requests/518.
* **creating**: the number of machines that are currently being provisioned and will later be used to run new builds,
* **idle**: the number of machines that are currently idle and can be used to run new builds when needed,
* **used**: the number of machines that are currently assigned to a specific received job and are used to run the job payload,
* **removing**: the number of machines that are currently being removed.
### How to interpret this data?
A low number of **idle** machines means that the Runner Manager is unable to provide enough machines for the demand.
It can be an indication of an error, but not necessarily.
The rate of Runner Manager machine creation is defined by [IdleCount](https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/blob/master/docs/configuration/autoscale.md#how-current-limit-and-idlecount-generate-the-upper-limit-of-running-machines).
We can easily increase `IdleCount`,
but we need to be aware of the rate limits of the API used to provision new machines,
as the system may become unstable once we hit them.
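To see which auto-scaling parameters a given Runner Manager currently uses, a quick check (assuming the standard config location) is:
```bash
# Show the autoscale-related keys on this Runner Manager;
# key names follow the autoscale documentation linked above.
sudo grep -E 'IdleCount|IdleTime|MaxBuilds|limit' /etc/gitlab-runner/config.toml
```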
## Runners Manager: rate of machine creation (Machines operations rates)
![](../img/ci/machine_creation.png)
This is another interesting graph that gives insight into what is happening with auto-scaling.
This graph represents a **counter** metric.
The group names have the same meaning as in the previous graph.
### How to interpret this data?
A low number of **created** combined with a low number of **idle** in the previous graph may indicate that we are unable to create new machines.
It may be a problem with the API, a problem with docker-machine, or just a bug in GitLab Runner.
A high number of **created** with a low number of **idle** may indicate that we are creating machines,
but that these machines are broken for some reason, as they are very short-lived.
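A quick way to cross-check these graphs on the Runner Manager itself is to count machines per state (a sketch, assuming a docker-machine version that supports `--format`):
```bash
# Number of machines per state as seen by docker-machine right now.
docker-machine ls --format '{{.State}}' | sort | uniq -c
```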
## Jobs queue
This graph shows the number of pending jobs that could be run by shared runners (keeping in mind the limitations described in *GitLab-view of Jobs*).
There is one group, **namespace**, which currently aggregates all namespaces that had fewer than 10 jobs at that time.
It is the sink for all of them.
If this value is high, it means we have a lot of jobs across many namespaces,
which is what should be expected.
Seeing a **namespace** with a specific ID and a very high number may indicate abuse.
It is worth verifying what is in that namespace.
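To identify which namespace is behind a spike, a hedged console sketch (model and scope names are assumed from the GitLab codebase of that time; verify on a console before relying on it):
```bash
# Top 10 namespaces by number of pending jobs.
sudo gitlab-rails runner "
  puts Ci::Build.pending
         .joins(project: :namespace)
         .group('namespaces.path')
         .count
         .sort_by { |_, c| -c }
         .first(10)
         .to_h
"
```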
## Workhorse Long polling
Workhorse long polling implements `builds/register` and `jobs/request` in long-polling mode.
A request is held for up to 50s while watching for a Redis value change.
If no value change is detected, we return the information that there is no build
currently available.
If the request cannot be handled, it is proxied to GitLab.
![](../img/ci/long_polling_states.png)
This graph represents the number of hits for a given state when handling a job request in Workhorse:
* **body-parse-error**, **body-read-error**, **missing-values**: we received an invalid body that is too large, has invalid content, or is missing required arguments; in this case the request is proxied to GitLab,
* **watch-error**: we failed to start the watcher process; the request is proxied to GitLab,
* **no-change**: we received a notification, but its value is the same as the one sent by the Runner; we return no new jobs to the Runner,
* **seen-change**: we received a notification and its value is different than the one sent by the Runner; we return no new jobs to the Runner, and the Runner will retry in a few seconds,
* **timeout**: we did not receive a notification and the request timed out; we return no new jobs to the Runner,
* **already-changed**: we checked the value of the notification before starting to watch and it is different than the one sent by the Runner; we proxy the request to GitLab.
Here we aim to minimize the number of errors, as they indicate that a request cannot be long polled, due to missing data (old runners), an invalid body, etc.
![](../img/ci/long_polling_state.png)
This graph represents the current number of requests in a given state:
* **reading**: Workhorse is reading the request body; if a client is slow to send the body we can see some number here,
* **watching**: Workhorse is long polling the request and watching for a Redis change notification,
* **proxying**: Workhorse is proxying the request to GitLab.
Generally, we try to increase **watching** and minimize **proxying**.
A request that is watching is long polled; it is currently held for up to 50s.
## Workhorse Queueing
We use Workhorse request queueing to limit the capacity given to the job request endpoint.
Only this endpoint is affected. This allows us to easily control what percentage of a single server's resources job requesting can use.
All requests to this endpoint end up in the queue. The queue has:
* width: how many requests we can run concurrently,
* timeout: how long we allow a request to stay in the queue,
* length: how many requests we allow to be in the queue.
Requests that:
* end up in the queue but time out are rejected with **503 Service Unavailable**,
* cannot be enqueued because we are over the length limit are rejected with **429 Too Many Requests**.
Seeing a large number of 429s means that someone is blocking the queue and we are building up too big a backlog,
but we can still process the queued requests in a reasonable time (lower than the timeout), so we reject the excess quickly.
Seeing a large number of 503s means that most of the requests that do end up in the queue time out,
which also indicates that someone is blocking the queue, but may also mean that our queue is too long.
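To see which queueing limits the running Workhorse was started with, a hedged check (the `-apiLimit`/`-apiQueueLimit`/`-apiQueueDuration` flag names are assumed from the Workhorse options of that era):
```bash
# Print the api* flags (and their values) of the running gitlab-workhorse process.
ps -o args= -C gitlab-workhorse | tr ' ' '\n' | grep -A1 -i '^-api'
```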
![](../img/ci/requests_handled.png)
This graph shows the number of requests that are currently being processed.
Seeing this value hit the limit means that we have to queue and delay job request processing.
It means that the API endpoint with the given capacity (width/length/timeout) is underperforming.
We aim to keep **handled** as low as possible.
![](../img/ci/requests_queued.png)
This graph shows the current number of enqueued requests.
Enqueued requests are delayed in order to slow all runners down to the point where we can process what they ask for.
A high value indicates the same as above: the API endpoint with the given capacity is underperforming.
![](../img/ci/queued_histogram.png)
This graph shows the delay introduced to requests.
A high delay means that we have to significantly slow down job requests to handle the demand.
We expect this value to be as low as possible.
## Runners uptime
This graph shows runner uptime. We expect the values not to change very often.
If some of the runners die at unpredictable times, it is an indicator of Runner Manager crashes.
We should log in to that runner and check whether there were any recent panics:
```
grep panic /var/log/syslog
```
## CI troubleshooting introduction
The GitLab.com and dev.gitlab.org shared runners consist of a number of components and machines.
### Components
We can define these components:
1. GitLab Sidekiq - processes pipelines and updates job statuses,
1. GitLab Unicorn - used to request new jobs and to download or upload artifacts,
1. Workhorse - used to implement long polling and capacity limiting of the `builds/register` and `jobs/request` endpoints,
1. Runner Manager - asks GitLab for new jobs, provisions new machines, and runs received jobs on the provisioned machines,
1. Machine - an actual provisioned VM on which jobs are run. Usually it runs a Docker Engine to which the Runner Manager connects to instrument job creation.
### Data flow
Let's briefly describe the data flow and the most crucial components of the shared runners setup on GitLab.com and dev.gitlab.org.
1. Everything starts at the GitLab application level.
1. A user pushes changes to GitLab,
1. The changes are received and processed by the Git daemon,
1. The Git daemon executes the `gitlab-shell` post-receive hook of the repository,
1. The post-receive hook enqueues a `PostReceive` Sidekiq job on Redis,
1. The Sidekiq job is then executed,
1. During `PostReceive` execution a `CreatePipelineService` is fired,
1. We read and analyze `.gitlab-ci.yml` and create `ci_pipeline` and `ci_builds` objects,
1. We then execute `ProcessPipelineWorker` on the `ci_pipeline` to enqueue jobs,
1. Any job that should be run by a runner changes its state from `created` to `pending`,
1. The Runner asks either the `builds/register` (old CI API) or `jobs/request` (new API v4) endpoint (see the sketch after this list),
1. GitLab Unicorn executes a SQL query that checks for the list of "potential" jobs that could be executed by the runner in question,
1. We validate that the potential runner can run the job; if so, we transition the job from `pending` to `running` and attach the `runner_id`,
1. The serialized job data is returned to the Runner Manager,
1. When the Runner Manager receives a job, it starts an executor (docker, kubernetes or docker+machine),
1. The Runner reads the received payload and creates a set of containers: helper (to clone sources and download/upload artifacts and caches), build (to run the user-provided script), and services (as defined in `.gitlab-ci.yml`),
1. Once all containers finish, the result of the job is sent to GitLab.
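For illustration, this is roughly what the v4 job request looks like on the wire (a sketch: `RUNNER_TOKEN` is a placeholder; a real runner token returns `201` with a job payload or `204` when no job is available, while an invalid one returns `403`):
```bash
# Ask GitLab for a job the way a Runner does and print only the HTTP status code.
curl -s -o /dev/null -w '%{http_code}\n' \
  -X POST "https://gitlab.com/api/v4/jobs/request" \
  -H 'Content-Type: application/json' \
  -d '{"token": "RUNNER_TOKEN"}'
```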
### Creating machines
The Runner Manager manages machines as described in this document: https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/blob/master/docs/configuration/autoscale.md#autoscaling-algorithm-and-parameters.
## Large CI pending builds
The most common problem is a report that we have a large number of pending CI builds.
1. Check the `CI dashboard` and verify that we have a large number of CI builds,
2. Verify the graphs and their potential outcomes as described in [CI graphs](ci_graphs.md),
3. Verify whether we have [a high DO Token Rate Limit usage](ci_runner_manager_do_limits.md),
4. Verify whether we have [a high number of errors](ci_runner_manager_errors.md),
5. Verify that machines are being created on `shared-runners-manager-X.gitlab.com`,
6. Verify that docker-machine operates correctly.
## 1. Check `CI dashboard` and verify that we have a large number of CI builds
Look at the graph with the number of CI builds:
![](../img/ci/jobs_graph.png)
## 2. Verify graphs and potential outcomes as described in [CI graphs](ci_graphs.md)
To understand what may be wrong, you need to find the cause.
1. Check runner auto-scaling: [CI auto-scaling graphs](ci_graphs.md#Runners-Manager-Auto-scaling),
and look for the `Idle` number,
2. Verify job queues: [Jobs queue graphs](ci_graphs.md#Jobs-queue).
If you see a single namespace with a lot of builds, verify what projects are in that namespace and whether it is an abuser.
3. Verify long polling behavior (we are not yet aware of potential problems as of now),
4. Verify Workhorse queueing: [Workhorse queueing graphs](ci_graphs.md#Workhorse-queueing).
If you see a large number of requests ending up in the queue, it may indicate that the CI API is degraded.
Verify the performance of the `builds/register` endpoint: https://performance.gitlab.net/dashboard/db/grape-endpoints?var-action=Grape%23POST%20%2Fbuilds%2Fregister&var-database=Production,
5. Verify runner uptime. If runner uptime varies a lot, it most likely indicates that the Runner Manager is dying because of crashes. This will show up in the runner manager logs: `grep panic /var/log/messages`.
## 3. Verify whether we have [a high DO Token Rate Limit usage](ci_runner_manager_do_limits.md)
You will see alerts in the `#alerts` channel. They indicate that, since we are hitting API limits, we will no longer be able to provision new machines.
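To check how much of the rate limit the token has left, a hedged sketch (header names per the DigitalOcean API docs; `$DO_TOKEN` stands for the token configured on the runners manager):
```bash
# Print the RateLimit-* headers returned by the DigitalOcean API for this token.
curl -sI -H "Authorization: Bearer $DO_TOKEN" \
  "https://api.digitalocean.com/v2/droplets?per_page=1" | grep -i '^ratelimit'
```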
## 4. Verify whether we have [a high number of errors](ci_runner_manager_errors.md)
Generally this is not a big problem, but it generates a lot of noise in the logs. It is safe to run that runbook.
You should also be aware that you should then cross-check the state between DigitalOcean and the runners manager as described in
this issue: https://gitlab.com/gitlab-com/infrastructure/issues/921 (this should be moved to a script and runbook, see the sketch below).
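A minimal sketch of such a cross-check (assuming `doctl` is installed and authenticated with the same token, and that all runner droplets are named `runner-*`):
```bash
# Droplets that DigitalOcean knows about but docker-machine on this manager does not.
comm -13 <(docker-machine ls -q | sort) \
         <(doctl compute droplet list --format Name --no-header | grep '^runner-' | sort)
```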
## 5. Verify that machines are created on `shared-runners-manager-X.gitlab.com`
Log in to the runners manager and execute:
```bash
$ journalctl -xef | grep "Machine created"
```
You should see a constant stream of machines being created:
```
Mar 20 13:16:36 shared-runners-manager-2 gitlab-ci-multi-runner[19931]: time="2017-03-20T13:16:36Z" level=info msg="Machine created" fields.time=43.913563388s name=runner-4e4528ca-machine-1490015752-629c75cb-digital-ocean-4gb now=2017-03-20 13:16:36.246859005 +0000 UTC retries=0 time=43.913563388s
```
If you don't see it, check the docker-machine creation logs:
```bash
journalctl -xef | grep operation=create
```
```
Mar 20 13:17:56 shared-runners-manager-2 gitlab-runner[19931]: time="2017-03-20T13:17:56Z" level=info msg="Running pre-create checks..." driver=digitalocean name=runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb operation=create
Mar 20 13:17:57 shared-runners-manager-2 gitlab-runner[19931]: time="2017-03-20T13:17:57Z" level=info msg="Creating machine..." driver=digitalocean name=runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb operation=create
Mar 20 13:17:57 shared-runners-manager-2 gitlab-runner[19931]: time="2017-03-20T13:17:57Z" level=info msg="(runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb) Creating SSH key..." driver=digitalocean name=runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb operation=create
Mar 20 13:17:58 shared-runners-manager-2 gitlab-runner[19931]: time="2017-03-20T13:17:58Z" level=info msg="(runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb) Creating Digital Ocean droplet..." driver=digitalocean name=runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb operation=create
Mar 20 13:18:03 shared-runners-manager-2 gitlab-runner[19931]: time="2017-03-20T13:18:03Z" level=info msg="(runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb) Waiting for IP address to be assigned to the Droplet..." driver=digitalocean name=runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb operation=create
Mar 20 13:18:04 shared-runners-manager-2 gitlab-runner[19931]: time="2017-03-20T13:18:04Z" level=info msg="(runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb) Created droplet ID 42980631, IP address 159.203.179.170" driver=digitalocean name=runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb operation=create
Mar 20 13:18:34 shared-runners-manager-2 gitlab-runner[19931]: time="2017-03-20T13:18:34Z" level=info msg="Waiting for machine to be running, this may take a few minutes..." driver=digitalocean name=runner-4e4528ca-machine-1490015876-441093ee-digital-ocean-4gb operation=create
```
If machine creation fails, you will see an error message here.
## 6. Verify that docker-machine operates correctly
Try to create a machine manually:
```bash
$ docker-machine create -d digitalocean test-machine --digitalocean-image=coreos-stable --digitalocean-ssh-user=core --digitalocean-access-token=GET_TOKEN_FROM_ETC_GITLAB_RUNNER_CONFIG --digitalocean-region=nyc1 --digitalocean-size=2gb --digitalocean-private-networking --engine-registry-mirror=http://runners-cache-2-internal.gitlab.com:1444 --digitalocean-userdata=/etc/gitlab-runner/cloudinit.sh
```
This should succeed. If it does not, investigate why.
Once the machine is created, you can log in to it:
```bash
$ docker-machine ssh test-machine
```
and try to run some Docker containers to verify that networking and DNS work properly:
```bash
$ docker run -it docker:git /bin/sh
# git clone https://gitlab.com/gitlab-org/gitlab-ce
```
Afterwards, tear down the machine:
```bash
$ docker-machine rm test-machine
```
If any of the commands fails, it can mean one of the following:
1. there's a problem with docker-machine creating the machine,
2. there's a problem with the docker-engine on the machine,
3. there's a problem with connectivity from docker-machine.
You may need to:
1. verify whether it's a problem with the `docker version`,
2. verify whether it's a problem with `coreos-stable`,
3. verify whether it's a problem with networking out of the container (DNS?), as sketched below.
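A quick, hedged way to check the last point from the test machine created in step 6 (re-create it or run this before tearing it down; names are illustrative):
```bash
# Verify DNS resolution and outbound connectivity from inside a container on the test machine.
docker-machine ssh test-machine 'docker run --rm alpine sh -c "nslookup gitlab.com && ping -c 1 gitlab.com"'
```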