This is a general Merge Request template. Consider choosing a more specific template from the list above if one matches your case better.
## What does this MR do?
This adds a new executor called `docker+slurm`, which runs Docker jobs on resources allocated through the cluster scheduling engine SLURM. I haven't really cleaned it up or added tests yet; before doing that, I'd like some feedback on whether there is interest in merging something like this at all.
This executor does not run the actual jobs in SLURM; it only uses SLURM to do the resource scheduling. I picked this approach because:

- We want to run all jobs in Docker containers, and as Docker is a global system service, it seemed weird (and probably complicated) to spin up a Docker daemon from inside a SLURM job. Apart from that, all of those SLURM jobs would basically need to be run as
- I wanted to be able to piggy-back on top of the existing `docker` executor, in a manner similar to `docker+machine`. That way, the new executor benefits from its bugfixes and improvements.
In order for this approach to work, you need to run a Docker daemon on each cluster node that will run CI builds, and that daemon must be available over the network (secured with TLS and certificate authentication, of course). The actual build is submitted directly to the Docker daemon on the node allocated by SLURM and runs outside of SLURM, but is restricted to the allocated cpuset and memory.
Here is a sample configuration file:
```toml
concurrent = 100
check_interval = 0

[[runners]]
  name = "2c-4g-1h"
  limit = 30
  url = "https://gitlab.example.com"
  token = "..."
  executor = "docker+slurm"
  [runners.docker]
    tls_cert_path = "/etc/gitlab-runner/pki"
    tls_verify = true
    image = "alpine"
    resource_limits_for_cache = true     # applies resource limits to the cache container as well
    resource_limits_for_services = true  # applies resource limits to service containers as well
    privileged = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.cache]
  [runners.slurm]
    partition = "gitlab-ci"   # the SLURM partition to run in
    qos = "2c"                # a SLURM QoS level (optional)
    max_pending_builds = 1    # don't request new builds if this many jobs are pending in the partition
    cpus = 4                  # the number of CPUs allocated to the job
    memory = "4G"             # the amount of memory allocated to the job
    kernel_memory = "4G"      # the amount of kernel memory allocated to the job (see the Docker docs)
    timeout = 65              # the job timeout; this should be set to the timeout of the SLURM partition
    allocation_timeout = 240  # if the job doesn't get started by SLURM within this time, the build will fail
    overhead_time = 5         # this gets added to the timeout requested by the build when setting the SLURM timeout
    abort_on_expected_allocation_timeout = false
    truncate_job_timeout = true  # whether to reduce the build timeout or fail the build if the per-project value is too large
    allocation_script = "/cluster/bin/ciscript.sh"  # the script that we run as our SLURM job
    user = "gitlab-ci"        # user to run SLURM commands under
    group = ""                # group to run SLURM commands under

# second runner for smaller jobs, shares the same SLURM partition
[[runners]]
  name = "1c-2g-30m"
  limit = 70
  url = "https://gitlab.example.com"
  token = "..."
  executor = "docker+slurm"
  [runners.docker]
    tls_cert_path = "/etc/gitlab-runner/pki"
    tls_verify = true
    image = "alpine"
    resource_limits_for_cache = true
    resource_limits_for_services = true
    privileged = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.cache]
  [runners.slurm]
    partition = "gitlab-ci"
    qos = "1c"
    max_pending_builds = 5
    cpus = 1
    memory = "2G"
    kernel_memory = "2G"
    timeout = 35
    allocation_timeout = 240
    overhead_time = 5
    abort_on_expected_allocation_timeout = true
    truncate_job_timeout = false
    allocation_script = "/cluster/bin/ciscript.sh"
    user = "gitlab-ci"
    group = ""
```
As you can see, you can have multiple runners with different amounts of resources. These may also run on different SLURM queues. We just add corresponding runner labels in GitLab, e.g. `cores:2,mem:4g` for the first runner above. That way, users can specify their requirements in their `.gitlab-ci.yml`, and as long as there is a matching runner, it all works fine.
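For illustration, a job could then select a matching runner through its tags. This `.gitlab-ci.yml` snippet is made up (the job name and script are placeholders); the tag names follow the labeling scheme suggested above:

```yaml
unit-tests:
  stage: test
  tags:
    - cores:2
    - mem:4g
  script:
    - make test
```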
The way it works is basically like this:
1. After checking with the provider that it can handle an additional job, we request one.
2. The executor submits a dummy interactive SLURM job with the correct resource requirements (cores, memory, partition, QoS, runtime) using `srun` and stores the SLURM job ID. This job acts as a kind of handle for the cluster resources reserved for our build. We have to make sure to terminate this job as soon as the build is finished (by closing the job's stdin). The job script is very simple, but it has to strictly adhere to the output format given by the following example:

   ```bash
   #!/bin/bash
   echo "gitlab-runner: hostname: $(hostname)"
   echo "gitlab-runner: cpuset: $(taskset -cp $$ | sed -e 's/.*: //')"
   exec cat # cheap way to make sure the program terminates as soon as we close stdin
   ```

3. (optional) If the SLURM job does not start immediately, we wait for SLURM to allocate it. During this time, we periodically write an update to the trace to let the user know about the state of their job (it's already shown as running in GitLab). This also makes sure that GitLab doesn't consider the job stuck (which normally happens if there is no update for an hour).
4. As soon as SLURM starts the job, its job script tells us the host we were allocated to as well as the cpuset we are running on.
5. We set up a custom runner configuration with that host, the cpuset, and further restrictions (right now, only memory) and pass it on to the wrapped `docker` executor, which does its job as usual.
6. In the cleanup phase, we make sure to terminate the SLURM job to avoid leaking resources.
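The handshake in steps 2 and 4 boils down to parsing the two `gitlab-runner:` lines printed by the job script. A minimal Go sketch (Go being the runner's own language); the `allocation` type and `parseAllocation` function are made up for illustration and are not the MR's actual code:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// allocation describes the cluster resources SLURM handed us.
// (Hypothetical type, for illustration only.)
type allocation struct {
	Hostname string
	Cpuset   string
}

// parseAllocation reads the "gitlab-runner: ..." lines that the job
// script prints once SLURM has started it, and extracts the node
// hostname and the cpuset the job was pinned to.
func parseAllocation(output string) (allocation, error) {
	var a allocation
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case strings.HasPrefix(line, "gitlab-runner: hostname: "):
			a.Hostname = strings.TrimPrefix(line, "gitlab-runner: hostname: ")
		case strings.HasPrefix(line, "gitlab-runner: cpuset: "):
			a.Cpuset = strings.TrimPrefix(line, "gitlab-runner: cpuset: ")
		}
	}
	if a.Hostname == "" || a.Cpuset == "" {
		return a, fmt.Errorf("incomplete allocation handshake")
	}
	return a, nil
}

func main() {
	out := "gitlab-runner: hostname: node042\ngitlab-runner: cpuset: 0-3\n"
	a, err := parseAllocation(out)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s %s\n", a.Hostname, a.Cpuset) // prints "node042 0-3"
}
```

This is also why the job script has to adhere strictly to the output format: anything else on those lines would break the parse.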
Apart from adding the new executor, the MR also contains some changes to underlying components of the runner:
- The `docker` executor can now limit the amount of memory and kernel memory available to the build. This is really important to avoid stepping on other jobs running on the same cluster node, as memory is often a critical resource in cluster computing.
- The `docker` executor can now optionally apply resource limits to auxiliary containers like services. There are two new configuration options for this (`resource_limits_for_services` and `resource_limits_for_cache`, shown in the sample configuration above).
- The provider's `Create()` methods now take the runner configuration as a parameter, and `CanCreate()` (which was not used anywhere until now) now gets called to decide whether we request a new job from GitLab, in addition to checking the concurrency limit. The SLURM provider uses this to limit the number of pending jobs in the SLURM queue. Otherwise, it would accept any and all jobs and queue them in SLURM, which might starve other runners that still have capacity available.
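The pending-jobs gate behind `CanCreate()` can be sketched like this. This is an illustrative Go snippet, not the MR's implementation; the helper names are made up, and the pending count would in practice come from querying SLURM (e.g. something along the lines of `squeue -h -p <partition> -t PENDING -o %i`, which is elided here):

```go
package main

import (
	"fmt"
	"strings"
)

// pendingJobs counts pending job IDs in squeue-style output,
// one job ID per line. The actual squeue invocation is omitted.
func pendingJobs(squeueOutput string) int {
	trimmed := strings.TrimSpace(squeueOutput)
	if trimmed == "" {
		return 0
	}
	return len(strings.Split(trimmed, "\n"))
}

// canCreate mirrors the idea behind the provider's CanCreate():
// refuse new builds once too many of our jobs are already queued
// in SLURM, so other runners with free capacity can pick them up.
func canCreate(pending, maxPending int) bool {
	return pending < maxPending
}

func main() {
	out := "1234\n1235\n" // two pending jobs reported by SLURM
	fmt.Println(canCreate(pendingJobs(out), 5)) // prints "true"
}
```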
Finally, the handling of context timeouts gets a little more complicated: right now, the `Build` creates a context with a timeout before it even requests the provider. The timeout used here is the one set in the runner configuration, without taking into account the per-project timeout sent by GitLab. This is problematic for a number of reasons:
- We don't know when the job will be allocated by SLURM, so we have to set a very high timeout initially.
- On the other hand, once the job has been started, we still want to limit its actual run time.
- Finally, we can improve the scheduling done by SLURM if we provide it with accurate job timeouts, e.g. if the per-runner timeout is set to 60 min, but the project timeout is only 10 min.
In order to work around this problem, this MR makes two changes:
- Instead of requesting the initial build timeout from the job, we now request it from the provider, which can take into account stuff like SLURM scheduling delays.
- The executor is now expected to support `Prepare()`. This method can either return the original context with the timeout given by the provider, or a new context with a shorter timeout.

The `slurm` provider has a setting for the maximum amount of time it waits until it gives up on the SLURM allocation and fails the job, and the `slurm` executor takes the per-project timeout into account. It will only ever reduce the per-runner setting, though.
## Why was this MR needed?
We (and other people, see e.g. #47) use GitLab in an academic context, where we have computing resources available for testing, but those are managed by cluster engines like SLURM. Without playing nice with that infrastructure, we cannot run tests on it.
## Are there points in the code the reviewer needs to double check?
## Does this MR meet the acceptance criteria?
- [ ] Added for this feature/bug
- [ ] All builds are passing
- [ ] Branch has no merge conflicts with `master` (if you do - rebase it please)