Commit 72222e58 authored by Tomasz Maczukin, committed by Marat Kalibekov

Improve CI alerts

parent 65c90492
1 merge request: !236 Improve CI alerts
## Pending builds for projects with shared runners enabled
ALERT TooManyPendingBuildsOnSharedRunnerProject
IF topk(1, predict_linear(ci_builds_total{shared_runners_enabled_projects="1",status="pending",fqdn="db1.cluster.gitlab.com"}[30m], 3600)) > 1000
FOR 5m
LABELS {channel="ci"}
ANNOTATIONS {
title="The number of pending builds for projects with shared runners will be too high in 1h: {{$value | printf \"%.2f\" }}",
description="The number of pending builds for projects with shared runners is increasing and will be too high in 1h ({{$value}}). This may suggest problems with auto-scaling provider or Runner stability. You should check Runner's logs. Check http://performance.gitlab.net/dashboard/db/ci.",
}
## Pending jobs per namespace over limit
ALERT TooManyPendingJobsPerNamespace
IF max(ci_pending_builds{shared_runners="yes",namespace!=""}) > 500
FOR 1m
LABELS {channel="production"}
ANNOTATIONS {
title="Number of pending jobs per namespace too high: {{$value}}",
description="Number of pending jobs per namespace for projects with shared runners enabled is too high ({{$value}}). Check https://performance.gitlab.net/dashboard/db/ci?panelId=33&fullscreen",
runbook="troubleshooting/ci_pending_builds.md#2-verify-graphs-and-potential-outcomes-out-of-the-graphs-as-described-in-ci-graphsci_graphsmd"
}
## Pending jobs for projects with shared runners enabled
ALERT TooManyPendingBuildsOnSharedRunnerProject
IF topk(1, predict_linear(ci_pending_builds{shared_runners="yes"}[30m], 3600)) > 1000
FOR 5m
LABELS {severity="warn", channel="ci-cd"}
ANNOTATIONS {
title="The number of pending builds for projects with shared runners will be too high in 1h: {{$value | printf \"%.2f\" }}",
description="The number of pending builds for projects with shared runners is increasing and will be too high in 1h ({{$value}}). This may suggest problems with the auto-scaling provider or Runner stability. You should check Runner's logs. Check http://performance.gitlab.net/dashboard/db/ci.",
}
## Pending jobs per namespace over limit
ALERT TooManyPendingJobsPerNamespace
IF max(ci_pending_builds{shared_runners="yes",namespace!=""}) > 500
FOR 1m
LABELS {severity="warn", channel="production"}
ANNOTATIONS {
title="Number of pending jobs per namespace too high: {{$value}}",
description="Number of pending jobs per namespace for projects with shared runners enabled is too high ({{$value}}). Check https://performance.gitlab.net/dashboard/db/ci?panelId=33&fullscreen",
runbook="troubleshooting/ci_pending_builds.md#2-verify-graphs-and-potential-outcomes-out-of-the-graphs-as-described-in-ci-graphsci_graphsmd"
}
## Runners manager jobs
ALERT NoJobsOnSharedRunners
IF sum(ci_runner_builds{job="shared-runners"}) == 0
FOR 5m
LABELS {severity="warn", channel="ci-cd"}
ANNOTATIONS {
title="Number of builds running on shared runners is too low: {{$value}}",
description="Number of builds running on shared runners for the last 5 minutes is 0. This may suggest problems with the auto-scaling provider or Runner stability. You should check Runner's logs. Check http://performance.gitlab.net/dashboard/db/ci.",
}
## Runners manager status
ALERT RunnersManagerDown
IF up{job=~"shared-runners|shared-runners-gitlab-org|private-runners"} == 0
FOR 5m
LABELS {severity="critical", pager="pagerduty"}
ANNOTATIONS {
title="Runners manager is down on {{ $labels.instance }}",
runbook="troubleshooting/runners_manager_is_down.md",
description="This impacts CI execution builds, consider tweeting: !tweet 'Builds are being delayed due to our shared runners manager being non responsive. We are restarting it to restore the service and then investigating the root cause'. Hosts impacted - {{ $labels.instance }}"
}
## Machine operations rate
ALERT RunnerMachineCreationRateHigh
IF sum(ci_docker_machines_provider{state="creating"}) / (sum(ci_docker_machines_provider{state="idle"}) + 1) > 100
FOR 1m
LABELS {severity="warn", channel="production"}
ANNOTATIONS {
title="Machine creation rate for runners is too high: {{$value | printf \"%.2f\" }}",
description="Machine creation rate for the last 1 minute is at least {{$value}} times greater than machines idle rate. This may be a symptom of problems with the auto-scaling provider. Check http://performance.gitlab.net/dashboard/db/ci.",
runbook="troubleshooting/ci_graphs.md#runners-manager-auto-scaling"
}
## RUNNERS NGINX,DOCKER,REGISTRY,CACHE
gitlab:runners_cache_registry_docker_service = label_replace(
drop_common_labels(
## Machine operations rate
ALERT RunnerMachineCreationRateHigh
IF max(sum without(instance) (rate(ci_docker_machines{type="created"}[20m]))) > 5
LABELS {channel="ci"}
ANNOTATIONS {
title="Machine creation rate for runners is too high: {{$value | printf \"%.2f\" }}",
description="Machine creation rate for the last 20 minutes is over 5. This may be a symptom of problems with the auto-scaling provider. Check http://performance.gitlab.net/dashboard/db/ci.",
}
## Shared runners machines creation rate
ALERT SharedRunnerMachineCreationRateLow
IF max(sum without(instance) (rate(ci_docker_machines{type="created",job="shared-runners"}[10m]))) < 0.1
LABELS {channel="ci"}
ANNOTATIONS {
title="Machine creation rate for shared runners is too low: {{$value | printf \"%.2f\" }}",
description="Machine creation rate for shared runners for the last 10 minutes is less than 0.1. This may suggest problems with the auto-scaling provider. Check http://performance.gitlab.net/dashboard/db/ci.",
}
## Shared runners machines usage rate
ALERT SharedRunnerMachineUsageRateLow
IF max(sum without(instance) (rate(ci_docker_machines{type="used",job="shared-runners"}[10m]))) < 0.1
LABELS {channel="ci"}
ANNOTATIONS {
title="Machine usage rate for shared runners is too low: {{$value | printf \"%.2f\" }}",
description="Machine usage rate for shared runners for the last 10 minutes is less than 0.1. This may suggest problems with the auto-scaling provider. Check http://performance.gitlab.net/dashboard/db/ci.",
}
## Runners manager builds
ALERT NoBuildsOnSharedRunners
IF sum by(job) (ci_runner_builds{job="shared-runners"}) == 0
FOR 5m
LABELS {channel="ci"}
ANNOTATIONS {
title="Number of builds running on shared runners is too low: {{$value}}",
description="Number of builds running on shared runners for the last 5 minutes is 0. This may suggest problems with auto-scaling provider or Runner stability. You should check Runner's logs. Check http://performance.gitlab.net/dashboard/db/ci.",
}
## RUNNERS MANAGER STATUS
ALERT RunnersManagerDown
IF up{job=~"shared-runners|private-runners"} == 0
FOR 5m
LABELS {severity="critical", pager="pagerduty"}
ANNOTATIONS {
title="Runners manager is down on {{ $labels.instance }}",
runbook="troubleshooting/runners_manager_is_down.md",
description="This impacts CI execution builds, consider tweeting: !tweet 'Builds are being delayed due to our shared runners manager being non responsive. We are restarting it to restore the service and then investigating the root cause'. Hosts impacted - {{ $labels.instance }}"
}
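
For reviewers unfamiliar with the `predict_linear(...[30m], 3600) > 1000` condition in `TooManyPendingBuildsOnSharedRunnerProject`, here is a rough Python sketch of the idea: fit a least-squares line through the samples in the 30-minute window and extrapolate it one hour ahead. The function names, the sample layout, and the choice to extrapolate from the last sample's timestamp are assumptions for illustration only; Prometheus's own implementation may differ in details, and the `FOR 5m` clause is not modeled here.

```python
def predict_linear(samples, horizon_s):
    """Extrapolate a least-squares line fitted to (timestamp_s, value) samples.

    Hypothetical helper: returns the fitted value `horizon_s` seconds after
    the last sample. Only approximates what the PromQL function computes.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Ordinary least squares: slope = cov(t, v) / var(t)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("all samples share the same timestamp")
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + horizon_s) + intercept


def would_exceed(samples, horizon_s=3600, threshold=1000):
    """True if the extrapolated value is above the threshold (FOR clause ignored)."""
    return predict_linear(samples, horizon_s) > threshold


# Made-up data: one sample per minute over 30 minutes.
growing = [(t, 400 + 0.2 * t) for t in range(0, 1801, 60)]  # queue growing steadily
flat = [(t, 400.0) for t in range(0, 1801, 60)]             # queue stable

print(would_exceed(growing))  # queue is 760 now but trending past 1000 within 1h
print(would_exceed(flat))     # no growth, so no predicted breach
```

The point of alerting on the prediction rather than on the current value is that the growing series above would trigger while the queue is still below the threshold, giving time to inspect the Runner logs before the backlog builds up.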