Commit 897be1f3 authored by Marat Kalibekov

alert rules in runbooks

parent 59740d64
1 merge request: !87 Alert rules in runbooks
@@ -29,11 +29,14 @@ The aim of this project is to have a quick guide of what to do when an emergency
* [Knife ssh does not work](troubleshooting/chef.md)
* [Sidekiq Queues are out of control](troubleshooting/large-sidekiq-queue.md)
* [Workers have huge load because of cat-files](troubleshooting/workers-high-load.md)
* [Runners cache is down](troubleshooting/runners_cache_is_down.md)
* [Runners registry is down](troubleshooting/runners_registry_is_down.md)
* [Runners cache free disk space is less than 20%](troubleshooting/runners_cache_disk_space.md)
* [Kibana is down](troubleshooting/kibana_is_down.md)
 
## Alerting and monitoring
 
* [Alert creating manual](alerts/alerts_manual.md)
* [Alerts list](alerts/README.md)
* [Alert creating manual](howto/alerts_manual.md)
* [Working with Grafana](monitoring/grafana.md)
* [Working with Prometheus](monitoring/prometheus.md)
 
## Alerts list
* [Manual for creating and modifying alerts](alerts_manual.md)
### GitLab.com related
* Registry is down (TBD)
* PostgreSQL is down (TBD)
### GitLab Runners related
* [Runners cache is down](runners_cache_is_down.md)
* [Runners registry is down](runners_registry_is_down.md)
* [Runners cache free disk space is less than 20%](runners_cache_disk_space.md)
### Infrastructure related
* [Kibana is down](kibana_is_down.md)
## KIBANA IS DOWN
ALERT kibana_is_down
IF node_systemd_unit_state{name="kibana.service",state="active"} == 0
FOR 10s
LABELS {severity="critical", pager="slack", pager="pagerduty"}
ANNOTATIONS {
title="Kibana is down",
runbook="alerts/kibana_is_down.md"
}
## RUNNERS CACHE
ALERT runners_cache_is_down
IF probe_success{job="runners-cache",instance="127.0.0.1:9000"} == 0
FOR 10s
LABELS {severity="critical", pager="pagerduty"}
ANNOTATIONS {
title="Runners cache is down",
runbook="alerts/runners_cache_is_down.md",
description="This impacts CI execution builds, consider tweeting: !tweet 'CI executions are being delayed due to our runners cache being down at GitLab.com, we are investigating the root cause'"
}
## RUNNERS REGISTRY
ALERT runners_registry_is_down
IF probe_success{job="runners-cache",instance="127.0.0.1:5000"} == 0
FOR 10s
LABELS {severity="critical", pager="pagerduty"}
ANNOTATIONS {
title="Runners registry is down",
runbook="alerts/runners_registry_is_down.md",
description="This impacts CI execution builds, consider tweeting: !tweet 'CI executions are being delayed due to our runners registry being down at GitLab.com, we are investigating the root cause'"
}
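
The three rules above all use `FOR 10s`, so an alert should only fire once its `IF` condition has held continuously for that long. Below is a minimal Python sketch of that pending-to-firing behaviour, not Prometheus code. The `AlertState` class, its field names, and the 5-second evaluation interval in the usage loop are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Hypothetical tracker for one alert: inactive -> pending -> firing."""

    for_seconds: float                    # the rule's FOR duration
    active_since: Optional[float] = None  # time the IF condition first held

    def evaluate(self, condition_holds: bool, now: float) -> str:
        if not condition_holds:
            # A single evaluation where the condition is false resets the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    state = AlertState(for_seconds=10)
    # (timestamp in seconds, probe_success value); the interval is assumed.
    samples = [(0, 1), (5, 0), (10, 0), (15, 0), (20, 1)]
    for timestamp, probe_success in samples:
        print(timestamp, state.evaluate(probe_success == 0, now=timestamp))
```

Under this model a probe that fails once and recovers before 10 seconds have passed never leaves the pending state, so a single missed scrape would not page anyone.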
## RUNNERS CACHE FREE DISK SPACE IS LESS THAN 20%
ALERT runners_cache_disk_space
IF node_filesystem_avail{device="/dev/vda1",instance="runners-cache-1"} / node_filesystem_size{device="/dev/vda1",instance="runners-cache-1"} < 0.20
FOR 1m
LABELS { severity = "critical" }
ANNOTATIONS {
title = "There is less than 20% of free disk space left on runners cache",
runbook = "alerts/runners_cache_disk_space.md",
description = 'More detailed message will be added here, but now take care of space'
}
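
The `runners_cache_disk_space` condition divides available bytes by total bytes and compares the result with 0.20. The arithmetic can be sketched in Python as follows. The function names and the sample sizes are hypothetical, and the sketch assumes both metrics report bytes for the same filesystem.

```python
def free_space_ratio(avail_bytes: float, size_bytes: float) -> float:
    """Fraction of the filesystem still available (0.0 to 1.0)."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return avail_bytes / size_bytes


def disk_space_condition(avail_bytes: float, size_bytes: float,
                         threshold: float = 0.20) -> bool:
    """True when the free fraction is strictly below the threshold,
    mirroring the `< 0.20` comparison in the rule."""
    return free_space_ratio(avail_bytes, size_bytes) < threshold


if __name__ == "__main__":
    gib = 2 ** 30
    # Assumed example: a 100 GiB volume with 15 GiB, then 20 GiB, available.
    print(disk_space_condition(15 * gib, 100 * gib))
    print(disk_space_condition(20 * gib, 100 * gib))
```

Because the comparison is strict, a volume with exactly 20% free does not satisfy the condition. The rule's `FOR 1m` clause additionally requires the condition to hold for a minute before the alert fires.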
File moved
File moved