Commit f209cf6d authored by Alejandro Rodríguez

Add runbooks and recording/alert rules for Gitaly

parent 4909063e
Merge request: !217 Add runbooks and recording/alert rules for Gitaly
@@ -19,6 +19,7 @@ The aim of this project is to have a quick guide of what to do when an emergency
* [GitLab registry is down](troubleshooting/gitlab-registry.md)
* [Sidekiq stats no longer showing](troubleshooting/sidekiq_stats_no_longer_showing.md)
* [Sentry is down](troubleshooting/sentry-is-down.md)
* [Gitaly error rate is too high](troubleshooting/gitaly-error-rate.md)
 
### Replication fails
 
gitaly:grpc_server_handled_total:error_max_rate1m = max(sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) by (grpc_method))
## Gitaly error rate
ALERT gitaly_error_rate_too_high
IF gitaly:grpc_server_handled_total:error_max_rate1m > 5
FOR 5m
LABELS {severity="critical", channel="gitaly-alerts"}
ANNOTATIONS {
title="Gitaly error rate is too high: {{$value | printf \"%.2f\" }}",
description="Gitaly error rate has been over 5 for the last 5 minutes. Check Gitaly logs and consider disabling it.",
runbook="troubleshooting/gitaly_error_rate.md"
}
@@ -10,3 +10,6 @@ job_backend:haproxy_backend_response_errors_total:irate1m = sum(irate(haproxy_ba
 
# Total redis operations by command.
cmd:redis_command_call_duration_seconds_count:irate1m = sum(irate(redis_command_call_duration_seconds_count[1m])) by (cmd)
# GRPC calls handled by Gitaly
gitaly:grpc_server_handled_total:rate1m = sum(rate(grpc_server_handled_total[1m])) by (grpc_method)
# Gitaly error rate is too high
## First and foremost
*Don't Panic*
## Symptoms
* Message in prometheus-alerts _Gitaly error rate is too high_
## 1. Identify the problematic instance
- Go to https://performance.gitlab.net/dashboard/db/gitaly?panelId=2&fullscreen and
identify the instance with a high error rate.
- SSH into that instance and check its Gitaly server log for the post-mortem:
```
sudo less /var/log/gitlab/gitaly/current
```
## 2. Disable Gitaly
- In chef-repo, update the relevant role for the problematic instance and set the gitaly override to `enable: false` (under override_attributes -> omnibus-gitlab -> gitlab_rb -> gitaly).
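
The attribute nesting for this override can be sketched as the role fragment below. This is only an illustration, assuming the role is stored as JSON; a Ruby role file would express the same hierarchy in its own syntax. Other role fields (name, run list, existing overrides) are omitted and must be left as they are in the real role.

```json
{
  "override_attributes": {
    "omnibus-gitlab": {
      "gitlab_rb": {
        "gitaly": {
          "enable": false
        }
      }
    }
  }
}
```

Only the `gitaly` -> `enable` key changes; merge it into the existing `gitlab_rb` block rather than replacing that block.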