WIP: docker+slurm executor


What does this MR do?

This MR adds a new executor called docker+slurm which runs Docker jobs on resources allocated through the cluster scheduling engine SLURM. I haven't cleaned it up or added tests yet; before doing that, I'd like some feedback on whether there is interest in merging something like this at all.

This executor does not run the actual jobs in SLURM, it only uses SLURM to do the resource scheduling. I picked this approach because

  1. We want to run all jobs in Docker containers, and as Docker is a global system service, it seemed weird (and probably complicated) to spin up a Docker daemon from inside a SLURM job. Apart from that, all of those SLURM jobs would basically need to be run as root.

  2. I wanted to be able to piggy-back on top of the existing docker executor, in a manner similar to docker+machine. That way, the new executor benefits from bugfixes and improvements.

In order for this approach to work, you need to run a Docker daemon on each cluster node that will run CI builds, and that daemon must be reachable over the network (secured with TLS and certificate authentication, of course). The actual build is submitted directly to the Docker daemon on the node allocated by SLURM and runs outside of SLURM, but is restricted to the allocated cpuset and memory.
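For reference, exposing the daemon over TLS can be done with a dockerd daemon.json along these lines (the paths are examples, not part of this MR):

```json
{
  "hosts": ["unix:///var/run/docker.sock", "tcp://0.0.0.0:2376"],
  "tlsverify": true,
  "tlscacert": "/etc/docker/pki/ca.pem",
  "tlscert": "/etc/docker/pki/server-cert.pem",
  "tlskey": "/etc/docker/pki/server-key.pem"
}
```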

Here is a sample configuration file:

concurrent = 100
check_interval = 0

[[runners]]
  name = "2c-4g-1h"
  limit = 30
  url = "https://gitlab.example.com"
  token = "..."
  executor = "docker+slurm"
  [runners.docker]
    tls_cert_path = "/etc/gitlab-runner/pki"
    tls_verify = true
    image = "alpine"
    resource_limits_for_cache = true     # applies resource limits to cache container as well
    resource_limits_for_services = true  # applies resource limits to service containers as well
    privileged = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.cache]
  [runners.slurm]
    partition = "gitlab-ci"                  # the SLURM partition to run in
    qos = "2c"                               # a SLURM QoS level (optional)
    max_pending_builds = 1                   # don't request new builds if this many jobs are pending in the partition
    cpus = 4                                 # the number of CPUs allocated to the job
    memory = "4G"                            # the amount of memory allocated to the job
    kernel_memory = "4G"                     # the amount of kernel memory allocated to the job (see the Docker docs)
    timeout = 65                             # the job timeout. This should be set to the timeout of the SLURM partition
    allocation_timeout = 240                 # If the job doesn't get started by SLURM within this time, the build will fail
    overhead_time = 5                        # This gets added to the timeout requested by the build when setting the SLURM timeout
    abort_on_expected_allocation_timeout = false
    truncate_job_timeout = true              # Whether to reduce the build timeout or fail the build if the per-project value is too large
    allocation_script = "/cluster/bin/ciscript.sh" # The script that we run as our SLURM job
    user = "gitlab-ci"                       # User to run SLURM commands under
    group = ""                               # Group to run SLURM commands under

# second runner for smaller jobs, shares the same SLURM partition
[[runners]]
  name = "1c-2g-30m"
  limit = 70
  url = "https://gitlab.example.com"
  token = "..."
  executor = "docker+slurm"
  [runners.docker]
    tls_cert_path = "/etc/gitlab-runner/pki"
    tls_verify = true
    image = "alpine"
    resource_limits_for_cache = true
    resource_limits_for_services = true
    privileged = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.cache]
  [runners.slurm]
    partition = "gitlab-ci"
    qos = "1c"
    max_pending_builds = 5
    cpus = 1
    memory = "2G"
    kernel_memory = "2G"
    timeout = 35
    allocation_timeout = 240
    overhead_time = 5
    abort_on_expected_allocation_timeout = true
    truncate_job_timeout = false
    allocation_script = "/cluster/bin/ciscript.sh"
    user = "gitlab-ci"
    group = ""

As you can see, you can have multiple runners with different amounts of resources. These may also run on different SLURM partitions. We simply add corresponding runner tags in GitLab, e.g. cores:2,mem:4g for the first runner above. That way, users can specify their requirements in their .gitlab-ci.yml, and as long as there is a matching runner, everything works fine.
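For illustration, a job could then pin itself to a suitable runner via such tags in .gitlab-ci.yml (the job name and script here are made up):

```yaml
build:
  image: alpine
  tags:
    - cores:2
    - mem:4g
  script:
    - make test
```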

The way it works is basically like this:

  1. After checking with the provider that it can handle an additional job, we request one.

  2. The executor submits a dummy interactive SLURM job with the correct resource requirements (cores, memory, partition, QoS, runtime) using srun and stores the SLURM job id. This job acts as a kind of handle for the cluster resources reserved for our build. We have to make sure to terminate this job as soon as the build is finished (by closing stdin of the job). The job script is very simple, but it has to strictly adhere to the output format shown in the following example:

#!/bin/bash
echo "gitlab-runner: hostname: $(hostname)"
echo "gitlab-runner: cpuset: $(taskset -cp $$ | sed -e 's/.*: //')"
exec cat # cheap way to make sure the program terminates as soon as we close stdin

  3. (optional) If the SLURM job does not start immediately, we wait for SLURM to allocate it. During this time, we periodically write an update to the trace to let the user know about the state of their job (it is already shown as running in GitLab). This also ensures that GitLab doesn't consider the job stuck (which normally happens if there is no update for an hour).

  4. As soon as SLURM starts the job, its job script tells us about the host that we were allocated to as well as the cpuset that we are running on.

  5. We set up a custom runner configuration with that host, the cpuset and further restrictions (right now, only memory) and pass that on to the wrapped docker executor, which does its job as usual.

  6. In the cleanup phase, we make sure to terminate the SLURM job to avoid leaking resources.
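The srun submission in step 2 and the output parsing in step 4 can be sketched in Go roughly as follows. The type and function names are illustrative, not the MR's actual code; only the srun flags and the "gitlab-runner:" output prefixes come from the description above:

```go
package main

import (
	"fmt"
	"strings"
)

// SlurmConfig mirrors a subset of the [runners.slurm] settings shown above.
type SlurmConfig struct {
	Partition string
	QoS       string
	CPUs      int
	Memory    string
	Timeout   int // minutes, passed to --time
	Script    string
}

// srunArgs assembles the argument list for the placeholder srun job.
func srunArgs(c SlurmConfig) []string {
	args := []string{
		"--partition=" + c.Partition,
		fmt.Sprintf("--cpus-per-task=%d", c.CPUs),
		"--mem=" + c.Memory,
		fmt.Sprintf("--time=%d", c.Timeout),
	}
	if c.QoS != "" {
		args = append(args, "--qos="+c.QoS)
	}
	return append(args, c.Script)
}

// parseJobInfo extracts hostname and cpuset from the job script's output,
// matching the strict format the example script prints.
func parseJobInfo(output string) (hostname, cpuset string) {
	for _, line := range strings.Split(output, "\n") {
		if strings.HasPrefix(line, "gitlab-runner: hostname: ") {
			hostname = strings.TrimPrefix(line, "gitlab-runner: hostname: ")
		}
		if strings.HasPrefix(line, "gitlab-runner: cpuset: ") {
			cpuset = strings.TrimPrefix(line, "gitlab-runner: cpuset: ")
		}
	}
	return hostname, cpuset
}

func main() {
	c := SlurmConfig{Partition: "gitlab-ci", QoS: "2c", CPUs: 4,
		Memory: "4G", Timeout: 65, Script: "/cluster/bin/ciscript.sh"}
	fmt.Println("srun", strings.Join(srunArgs(c), " "))

	h, s := parseJobInfo("gitlab-runner: hostname: node17\ngitlab-runner: cpuset: 0-3\n")
	fmt.Println(h, s)
}
```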

Apart from adding the new executor, the MR also contains some changes to underlying components of the runner:

  • The docker executor can now limit the amount of memory and kernel memory available to the build. This is really important to avoid stepping on other jobs running on the same cluster node, as memory is often a critical resource in cluster computing.
  • The docker executor can now optionally apply resource limits to auxiliary containers like services. There are two new configuration options for this.
  • The provider's CanCreate() and Create() methods now take the runner configuration as a parameter, and CanCreate() (which was not used anywhere until now) now gets called to decide whether we request a new job from GitLab, in addition to checking the concurrency limit. The SLURM provider uses this to limit the number of pending jobs in the SLURM queue. Otherwise, it would accept any and all jobs and queue them in SLURM, which might starve other runners that still have capacity available.
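The pending-jobs check could be sketched like this, assuming the provider shells out to squeue and counts the pending jobs in its partition (the function names and plumbing are hypothetical; the squeue flags themselves are standard SLURM):

```go
package main

import (
	"fmt"
	"strings"
)

// countPending counts the non-empty lines of output from something like
//   squeue --noheader --states=PENDING --partition=gitlab-ci --user=gitlab-ci
// where each line is one pending job.
func countPending(squeueOutput string) int {
	n := 0
	for _, line := range strings.Split(squeueOutput, "\n") {
		if strings.TrimSpace(line) != "" {
			n++
		}
	}
	return n
}

// canCreate sketches the provider-side decision: refuse to request new
// builds from GitLab once max_pending_builds jobs are already waiting.
func canCreate(pending, maxPending int) bool {
	return pending < maxPending
}

func main() {
	out := " 1234 gitlab-ci ci gitlab-c PD\n 1235 gitlab-ci ci gitlab-c PD\n"
	fmt.Println(countPending(out), canCreate(countPending(out), 5))
}
```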

Finally, the handling of context timeouts gets a little more complicated: Right now, the Build creates a context with timeout before it even asks the provider for an executor. The timeout used here is the one set in the runner configuration, without taking into account the per-project timeout sent by GitLab. This is problematic for a number of reasons:

  • We don't know when the job will be allocated by SLURM, so we have to set a very high timeout initially.
  • On the other hand, once the job has been started, we still want to limit its actual run time.
  • Finally, we can improve the scheduling done by SLURM if we provide it with accurate job timeouts: e.g. if the per-runner timeout is set to 60 min but the project timeout is only 10 min, SLURM only needs to reserve the resources for 10 min.

In order to work around this problem, this MR makes two changes:

  • Instead of requesting the initial build timeout from the job, we now request it from the provider, which can take into account stuff like SLURM scheduling delays.
  • The executor is now expected to support GetContext() after Prepare(). This method can either return the original context with the timeout given by the provider, or a new context with a shorter timeout.

The slurm provider has a setting for the maximum amount of time it waits until it gives up on the SLURM allocation and fails the job, and the slurm executor takes the per-project timeout into account. It will only ever reduce the per-runner setting, though.

Why was this MR needed?

We (and other people, see e.g. #47) use GitLab in an academic context, where we have computing resources available for testing, but those are managed by cluster engines like SLURM. Without playing nice with that infrastructure, we cannot run tests on it.

Are there points in the code the reviewer needs to double check?

Does this MR meet the acceptance criteria?

  • Documentation created/updated
  • Tests
    • Added for this feature/bug
    • All builds are passing
  • Branch has no merge conflicts with master (if you do - rebase it please)

What are the relevant issue numbers?

This originally came up a long time ago in #47 and was also discussed in #317.