WIP: docker+slurm executor
## What does this MR do?

This adds a new executor called `docker+slurm`, which runs Docker jobs on resources allocated through the cluster scheduling engine SLURM. I haven't really cleaned it up or added tests yet; before doing that, I'd like some feedback on whether there is interest in merging something like this at all.
This executor does not run the actual jobs in SLURM; it only uses SLURM to do the resource scheduling. I picked this approach because:

- We want to run all jobs in Docker containers, and as Docker is a global system service, it seemed weird (and probably complicated) to spin up a Docker daemon from inside a SLURM job. Apart from that, all of those SLURM jobs would basically need to be run as `root`.
- I wanted to be able to piggy-back on top of the existing `docker` executor, in a manner similar to `docker+machine`. That way, the new executor benefits from bugfixes and improvements.
In order for this approach to work, you need to run a Docker daemon on each cluster node that will run CI builds, and that daemon must be available over the network (secured with TLS and certificate authentication, of course). The actual build is submitted directly to the Docker daemon on the node allocated by SLURM and runs outside of SLURM, but it is restricted to the allocated cpuset and memory.
Here is a sample configuration file:
```toml
concurrent = 100
check_interval = 0

[[runners]]
  name = "2c-4g-1h"
  limit = 30
  url = "https://gitlab.example.com"
  token = "..."
  executor = "docker+slurm"
  [runners.docker]
    tls_cert_path = "/etc/gitlab-runner/pki"
    tls_verify = true
    image = "alpine"
    resource_limits_for_cache = true    # applies resource limits to the cache container as well
    resource_limits_for_services = true # applies resource limits to service containers as well
    privileged = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.cache]
  [runners.slurm]
    partition = "gitlab-ci"  # the SLURM partition to run in
    qos = "2c"               # a SLURM QoS level (optional)
    max_pending_builds = 1   # don't request new builds if this many jobs are pending in the partition
    cpus = 4                 # the number of CPUs allocated to the job
    memory = "4G"            # the amount of memory allocated to the job
    kernel_memory = "4G"     # the amount of kernel memory allocated to the job (see the Docker docs)
    timeout = 65             # the job timeout; this should be set to the timeout of the SLURM partition
    allocation_timeout = 240 # if the job doesn't get started by SLURM within this time, the build fails
    overhead_time = 5        # added to the timeout requested by the build when setting the SLURM timeout
    abort_on_expected_allocation_timeout = false
    truncate_job_timeout = true # whether to reduce the build timeout (true) or fail the build (false) if the per-project value is too large
    allocation_script = "/cluster/bin/ciscript.sh" # the script that we run as our SLURM job
    user = "gitlab-ci"       # user to run SLURM commands under
    group = ""               # group to run SLURM commands under

# second runner for smaller jobs, shares the same SLURM partition
[[runners]]
  name = "1c-2g-30m"
  limit = 70
  url = "https://gitlab.example.com"
  token = "..."
  executor = "docker+slurm"
  [runners.docker]
    tls_cert_path = "/etc/gitlab-runner/pki"
    tls_verify = true
    image = "alpine"
    resource_limits_for_cache = true
    resource_limits_for_services = true
    privileged = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.cache]
  [runners.slurm]
    partition = "gitlab-ci"
    qos = "1c"
    max_pending_builds = 5
    cpus = 1
    memory = "2G"
    kernel_memory = "2G"
    timeout = 35
    allocation_timeout = 240
    overhead_time = 5
    abort_on_expected_allocation_timeout = true
    truncate_job_timeout = false
    allocation_script = "/cluster/bin/ciscript.sh"
    user = "gitlab-ci"
    group = ""
```
As you can see, you can have multiple runners with different amounts of resources. These may also run on different SLURM partitions. We just add corresponding runner tags in GitLab, e.g. `cores:2,mem:4g` for the first runner above. That way, users can specify their requirements in their `.gitlab-ci.yml`, and as long as there is a matching runner, it all works fine.
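To illustrate the user-facing side, here is a hypothetical `.gitlab-ci.yml` fragment for a project that wants the first runner above (the job name and script are made up; the tags assume the runner was registered with `cores:2,mem:4g`):

```yaml
test:
  script:
    - make test
  tags:
    - cores:2
    - mem:4g
```

GitLab then only schedules this job on runners carrying both tags, so the resource request in the CI config lines up with the SLURM allocation behind that runner.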
The way it works is basically like this:

- After checking with the provider that it can handle an additional job, we request one.
- The executor submits a dummy interactive SLURM job with the correct resource requirements (cores, memory, partition, QoS, runtime) using `srun` and stores the SLURM job ID. This job acts as a kind of handle for the cluster resources reserved for our build. We have to make sure to terminate this job as soon as the build is finished (by closing the job's stdin). The job script is very simple, but it has to strictly adhere to the output format given by the following example:

  ```bash
  #!/bin/bash
  echo "gitlab-runner: hostname: $(hostname)"
  echo "gitlab-runner: cpuset: $(taskset -cp $$ | sed -e 's/.*: //')"
  exec cat # cheap way to make sure the program terminates as soon as we close stdin
  ```

- (optional) If the SLURM job does not start immediately, we wait for SLURM to allocate it. During this time, we periodically write an update to the trace to let the user know about the state of their job (it is already shown as running in GitLab). This also makes sure that GitLab doesn't consider the job stuck (which normally happens if there is no update for an hour).
- As soon as SLURM starts the job, its job script tells us which host we were allocated as well as the cpuset we are running on.
- We set up a custom runner configuration with that host, the cpuset, and further restrictions (right now, only memory) and pass it on to the wrapped `docker` executor, which does its job as usual.
- In the cleanup phase, we make sure to terminate the SLURM job to avoid leaking resources.
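To make the job-script handshake concrete, here is a minimal, hypothetical Go sketch (not the actual MR code; names are made up) of how the executor could parse the two `gitlab-runner:` lines printed by the job script:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// allocation holds what the dummy SLURM job reports back to the runner.
type allocation struct {
	Hostname string
	Cpuset   string
}

// parseAllocation scans the job script's stdout for the two handshake
// lines ("gitlab-runner: hostname: ..." and "gitlab-runner: cpuset: ...")
// and extracts the allocated host and cpuset. It returns an error if
// either line is missing, since the executor cannot proceed without both.
func parseAllocation(output string) (allocation, error) {
	var a allocation
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := sc.Text()
		if v, ok := strings.CutPrefix(line, "gitlab-runner: hostname: "); ok {
			a.Hostname = v
		} else if v, ok := strings.CutPrefix(line, "gitlab-runner: cpuset: "); ok {
			a.Cpuset = v
		}
	}
	if a.Hostname == "" || a.Cpuset == "" {
		return a, fmt.Errorf("incomplete handshake from SLURM job")
	}
	return a, nil
}

func main() {
	out := "gitlab-runner: hostname: node042\ngitlab-runner: cpuset: 4-7\n"
	a, _ := parseAllocation(out)
	fmt.Println(a.Hostname, a.Cpuset) // node042 4-7
}
```

This is also why the job script must strictly adhere to the output format: anything that mangles those two prefixed lines breaks the handshake.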
Apart from adding the new executor, the MR also contains some changes to underlying components of the runner:

- The `docker` executor can now limit the amount of memory and kernel memory available to the build. This is really important to avoid stepping on other jobs running on the same cluster node, as memory is often a critical resource in cluster computing.
- The `docker` executor can now optionally apply resource limits to auxiliary containers like services. There are two new configuration options for this.
- The provider's `CanCreate()` and `Create()` methods now take the runner configuration as a parameter, and `CanCreate()` (which was not used anywhere until now) now gets called to decide whether we request a new job from GitLab, in addition to checking the concurrency limit. The SLURM provider uses this to limit the number of pending jobs in the SLURM queue. Otherwise, it would accept any and all jobs and queue them in SLURM, which might starve other runners that still have capacity available.
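As a simplified, hypothetical sketch of that gating logic (field and function names are made up; these are not the MR's actual signatures):

```go
package main

import "fmt"

// runnerConfig is a pared-down stand-in for the runner configuration
// that CanCreate()/Create() now receive as a parameter.
type runnerConfig struct {
	Partition        string
	MaxPendingBuilds int
}

// canCreate mimics the provider-side decision: only request a new build
// from GitLab if fewer than MaxPendingBuilds jobs are already pending in
// our SLURM partition, so other runners with free capacity get a chance.
func canCreate(cfg runnerConfig, pendingInPartition int) bool {
	return pendingInPartition < cfg.MaxPendingBuilds
}

func main() {
	cfg := runnerConfig{Partition: "gitlab-ci", MaxPendingBuilds: 1}
	fmt.Println(canCreate(cfg, 0)) // true: queue is empty, accept a build
	fmt.Println(canCreate(cfg, 1)) // false: one job pending, leave the build for others
}
```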
Finally, the handling of context timeouts gets a little more complicated: right now, the `Build` creates a context with a timeout before it even requests the provider. The timeout used here is the one set in the runner configuration, without taking into account the per-project timeout sent by GitLab. This is problematic for a number of reasons:

- We don't know when the job will be allocated by SLURM, so we have to set a very high timeout initially.
- On the other hand, once the job has been started, we still want to limit its actual run time.
- Finally, we can improve the scheduling done by SLURM if we provide it with accurate job timeouts, e.g. if the per-runner timeout is set to 60 min but the project timeout is only 10 min.
In order to work around this problem, this MR makes two changes:

- Instead of requesting the initial build timeout from the job, we now request it from the provider, which can take into account things like SLURM scheduling delays.
- The executor is now expected to support `GetContext()` after `Prepare()`. This method can either return the original context with the timeout given by the provider, or a new context with a shorter timeout.

The `slurm` provider has a setting for the maximum amount of time it waits until it gives up on the SLURM allocation and fails the job, and the `slurm` executor takes the per-project timeout into account. It will only ever reduce the per-runner setting, though.
## Why was this MR needed?

We (and other people, see e.g. #47) use GitLab in an academic context, where we have computing resources available for testing, but those are managed by cluster engines like SLURM. Without playing nicely with that infrastructure, we cannot run tests on it.
## Are there points in the code the reviewer needs to double check?

## Does this MR meet the acceptance criteria?

- [ ] Documentation created/updated
- Tests
  - [ ] Added for this feature/bug
  - [ ] All builds are passing
- [ ] Branch has no merge conflicts with `master` (if you do - rebase it please)
## What are the relevant issue numbers?
This originally came up a long time ago in #47 and was also discussed in #317.