Commit c392ea76 authored by Alex Hanselka

Merge branch 'master' of gitlab.com:gitlab-com/runbooks

* 'master' of gitlab.com:gitlab-com/runbooks:
  Update alerts_manual.md
  How to add more slack channels to alertmanager
  Add information about available channels for alertmanager
  Add documentation about snapshot backups from chef
  Add support for killing a Sidekiq Job ID
  Update manage-chef.md
  Update oncall.md
  Update elasticsearch.md
  Update the documentation on runners cache disk space
  add link to on-call log
  Add section of things to do while on-call
parents 2a24fbc7 63f53bcb
@@ -68,6 +68,10 @@ The aim of this project is to have a quick guide of what to do when an emergency
 
## How do I
 
### On Call
* [Common tasks to perform while on-call](howto/oncall.md)
### Deploy
 
* [Get the diff between dev versions](howto/dev-environment.md#figure-out-the-diff-of-deployed-versions)
@@ -87,9 +91,10 @@ The aim of this project is to have a quick guide of what to do when an emergency
* [Use aptly](howto/aptly.md)
* [Disable PackageCloud](howto/stop-or-start-packagecloud.md)
 
### Work with the Database
### Restore Backups
 
* [Database Backups and Replication with Wal-E](howto/using-wale.md)
* [LVM and Azure snapshots](howto/manage-snapshot-backups.md)
 
### Work with storage
 
@@ -23,7 +23,7 @@ The common procedure is as follows:
ALERT runners_cache_is_down
IF probe_success{job="runners-cache", instance="localhost:9100"} == 0
FOR 10s
LABELS {severity="critical", channel="infrastructure", pager="pagerduty"}
LABELS {severity="critical", channel="production", pager="pagerduty"}
ANNOTATIONS {
title="Runners cache has been down for the past 10 seconds",
runbook="howto/howto/manage-cehpfs.md"
@@ -31,7 +31,7 @@ ALERT runners_cache_is_down
}
```
 
This will result in a critical alert posted in the slack channels `#prometheus-alerts` and `#infrastructure`, and a PagerDuty page with a link to https://dev.gitlab.com/cookbooks/runbooks/blob/master/howto/manage-cehpfs.md. The important part is the end of the URL - `howto/manage-cehpfs.md`. It is taken from the `runbook` annotation. The runbook will provide information on how to handle the situation that was alerted on. The main principle of a runbook should be: `don't make me think`.
This will result in a critical alert posted in the slack channels `#prometheus-alerts` and `#production`, and a PagerDuty page with a link to https://dev.gitlab.com/cookbooks/runbooks/blob/master/howto/manage-cehpfs.md. The important part is the end of the URL - `howto/manage-cehpfs.md`. It is taken from the `runbook` annotation. The runbook will provide information on how to handle the situation that was alerted on. The main principle of a runbook should be: `don't make me think`. For `channel` you can use the `#production`, `#ci`, or `#gitaly` values.
 
### What if I want to add more data?
 
@@ -53,7 +53,7 @@ All alerts are routed to slack and additionally can be paged to PagerDuty.
 
1. Since all alerts are sent to slack, you can only control the type of alert.
1. All alerts will be shown in `#prometheus-alerts` channel.
1. Additionally you can send alerts to the `#ci` and `#infrastructure` channels. This is controlled with the labels `channel='ci'` and `channel='infrastructure'`.
1. Additionally you can send alerts to the `#ci`, `#gitaly`, and `#production` channels. This is controlled with the `channel` label, for example `channel='ci'`.
1. Alerts with `severity=critical` are red-colored messages with the `.title`, a link to the corresponding runbook, and the `.description` value from the alert.
1. Alerts with `severity=warn` are yellow-colored messages with the `.title`, a link to the corresponding runbook, and the `.description` value from the alert.
1. Alerts with `severity=info` are green-colored messages with the `.title`, a link to the corresponding runbook, and the `.description` value from the alert.
@@ -70,6 +70,30 @@ All alerts are routed to slack and additionally can be paged to PagerDuty.
 
Currently we are not using email alerting rules.
 
### Add more slack channels to alerts
1. In the routes section of the [alertmanager.yml template](https://gitlab.com/gitlab-cookbooks/gitlab-prometheus/blob/master/templates/default/alertmanager.yml.erb) add the following:
```
- match:
    channel: gitaly
  receiver: slack_gitaly
  continue: true
```
1. In the receivers section of the [alertmanager.yml template](https://gitlab.com/gitlab-cookbooks/gitlab-prometheus/blob/master/templates/default/alertmanager.yml.erb) add the following. Note that the `send_resolved`, `icon_emoji`, etc. values must be taken from the `slack_production` receiver:
```
- name: slack_gitaly
  slack_configs:
  - api_url: '<%= @conf['slack']['api_url'] %>'
    channel: '#gitaly'
    send_resolved: true
    icon_emoji: ...
    title: ...
    title_link: ...
    text: ...
    fallback: ...
```
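1. Once both sections are in place, the change has to reach the node that runs alertmanager. A minimal rollout sketch, assuming the node converges the `gitlab-prometheus` cookbook with `chef-client` and that the rendered config ends up at `/etc/prometheus/alertmanager.yml` (the path is an assumption):
```
# on the node running alertmanager, converge chef so the updated template is rendered
sudo chef-client
# confirm the new route and receiver appear in the rendered config (path is an assumption)
sudo grep -A2 'slack_gitaly' /etc/prometheus/alertmanager.yml
```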
### Note about alerts which do not fit in any routes
 
1. Alerts which are routed by the default route will be sent to the `#prometheus-alerts` channel in slack.
@@ -47,3 +47,16 @@ curl 'http://localhost:9200/_cat/thread_pool?v'
```
curl http://localhost:9200/_template/logstash?pretty
```
### Create new index
```
curl -XPUT localhost:9200/gitlab -d '{
  "settings": {
    "index": {
      "number_of_shards": 120,
      "number_of_replicas": 1
    }
  }
}'
```
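### Check the index settings
A quick way to confirm the index came up with the intended settings, assuming the same local node and the `gitlab` index name from the example above:
```
curl 'localhost:9200/gitlab/_settings?pretty'
curl 'localhost:9200/_cat/shards/gitlab?v'
```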
@@ -33,8 +33,8 @@ This is done locally by another chef admin:
To do this it will be necessary to create a new keypair. Because of how chef behaves, the key has to be called _default_.
 
* ssh into the chef server
* run `sudo -i chef-server-ctl remove-user-key _username_ default` to remove the default.
* run `sudo -i chef-server-ctl add-user-key _username_ default` to create a new default key.
* run `sudo -i chef-server-ctl delete-user-key _username_ default` to remove the default.
* run `sudo -i chef-server-ctl add-user-key _username_ --key-name default` to create a new default key.
* copy the private key generated by chef for this user.
 
If _default_ is not declared, chef will use the fingerprint of the key as a name.
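To double-check the rotation, you can list the user's keys on the chef server; a small sketch (the output format may differ between chef server versions):

```
# on the chef server: the user should now have a single key named "default"
sudo -i chef-server-ctl list-user-keys _username_
```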
# Backups and restore
We currently have multiple backup solutions:
- AWS snapshots by `ebs.gitlap.com`
- Azure snapshots by `azure.gitlap.com`
- GitLab backup for database and pages
## AWS snapshots
### Snapshots
Every night on `ebs.gitlap.com` the snapshot script `/opt/gitlab-backup/bin/gitlab-ebs-snapshot` creates snapshots for all EC2 instances.
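To confirm that last night's run actually happened, check the cron logs on the backup host; a minimal sketch, assuming the job is run from cron and logs to syslog under the script name:
```
# on ebs.gitlap.com: look for recent runs of the snapshot script
grep gitlab-ebs-snapshot /var/log/syslog | tail -n 20
```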
### Restore
https://dev.gitlab.org/cookbooks/gitlab-backup/blob/master/doc/gitlab-ebs-snapshot.md#restoring
## Azure snapshots
### Snapshots
Every night on `azure.gitlap.com` the snapshot script `/opt/gitlab-backup/bin/gitlab-azure-snapshots` creates snapshots for the Azure instances mentioned in `/etc/gitlab-azure-snapshots.yml`.
Just add the chef role `"role[azure-snapshot]"` to a node and snapshots will be created.
### Restore
On `azure.gitlap.com` you can use the script `/opt/gitlab-backup/bin/gitlab-azure-restore` to restore the snapshots and attach the data disks to a newly created instance.
So to restore, for example, file-storage1.cluster.gitlab.com, follow these steps:
#### Create new node
```
~/chef-repo/tools/bin/azure-create-node --role backend file-storage2.cluster.gitlab.com
```
#### Restore snapshot and attach the data disks
Log in to `azure.gitlap.com`.
You can get a list of available epochs by checking the snapshot info files in `/var/lib/gitlab-azure-snapshots`:
```
ls -al /var/lib/gitlab-azure-snapshots/
```
Restore snapshots:
```
/opt/gitlab-backup/bin/gitlab-azure-restore --epoch 1463965202 --source file-storage1.cluster.gitlab.com file-storage2.cluster.gitlab.com
```
#### Activate logical volume
Activate the logical volume in the newly created node:
```
# LVM support
sudo apt-get install -y lvm2
# look for gitlab_vg on the attached drives
sudo vgchange -ay gitlab_vg
# make sure the mountpoint exists
sudo mkdir -p /var/opt/gitlab
# mount the logical volume at /var/opt/gitlab
sudo mount /dev/gitlab_vg/gitlab_com /var/opt/gitlab
```
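After mounting, a quick sanity check that the restored data is actually there (using the same paths as above):
```
# confirm the logical volume is mounted and the restored data is visible
df -h /var/opt/gitlab
sudo ls /var/opt/gitlab
```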
# So you got yourself on call
To start off on the right foot, let's define a set of tasks that are nice things to do before you go any further into your week.
By performing these tasks we will keep the [broken window effect](https://en.wikipedia.org/wiki/Broken_windows_theory) under control, preventing future pain and mess.
## Things to keep an eye on
### On-call log
First check [the on-call log](https://docs.google.com/document/d/1nWDqjzBwzYecn9Dcl4hy1s4MLng_uMq-8yGRMxtgK6M/edit#heading=h.nmt24c52ggf5) to familiarize yourself with what has been happening lately; if anything is on fire, it should be written down there in the **Pending actions** section.
### Alerts
Start by checking how many alerts are in flight right now. To do this (see the sketch after this list):
- go to the [fleet overview dashboard](https://performance.gitlab.net/dashboard/db/fleet-overview) and check the number of Active Alerts; it should be 0. If it is not 0:
- go to the alerts dashboard and check what is [being triggered](https://prometheus.gitlab.com/alerts); each alert here should point you to the right runbook to fix it.
- if they don't, you have more work to do.
- be sure to create an issue, particularly to declare toil so we can work on it and suppress it.
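If you prefer the terminal, the same information can be pulled from the Prometheus API via the built-in `ALERTS` metric; a minimal sketch, assuming the API is reachable from your machine and `jq` is installed:
```
# list currently firing alerts from the Prometheus API
curl -sg 'https://prometheus.gitlab.com/api/v1/query?query=ALERTS{alertstate="firing"}' | jq '.data.result[].metric'
```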
### Nodes status
Go to your chef repo and run `knife status`; if you see hosts that are red, it means that chef hasn't been running there for a long time. Check the on-call log to see whether they were disabled for a particular reason; if they were not, and there is no mention of an ongoing issue, consider jumping in to check why chef has not been running there (a quick way to spot the stale nodes is sketched below).
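A rough sketch of doing this from the command line; the exact `knife status` output format may vary a little between knife versions, and `<node-name>` is a placeholder:
```
# from your chef-repo: nodes that have not checked in for hours or days are the suspects
knife status | grep -E 'hours?|days?'
# then log on to a suspect node and run chef by hand to see why it is failing
ssh <node-name>
sudo chef-client
```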
### Prometheus targets down
Check how many targets are not being scraped at the moment. To do this (see the sketch after this list):
- go to the [fleet overview dashboard](https://performance.gitlab.net/dashboard/db/fleet-overview) and check the number of Targets down. It should be 0. If it is not 0:
- go to the [targets down list](https://prometheus.gitlab.com/consoles/up.html) and check which targets are down.
- try to figure out why there are scraping problems and try to fix them. Note that sometimes there can be temporary scraping problems because of exporter errors.
- be sure to create an issue, particularly to declare toil so we can work on it and suppress it.
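The same list can be pulled from the Prometheus API; a minimal sketch, assuming the API is reachable and `jq` is installed:
```
# list scrape targets that are currently down
curl -sg 'https://prometheus.gitlab.com/api/v1/query?query=up==0' | jq '.data.result[].metric'
```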
@@ -6,11 +6,11 @@
#
# If you need to run this on an Omnibus GitLab machine, run:
#
# sudo gitlab-rails runner /full_pathname/sq.rb [kill|show] <worker name>
# sudo gitlab-rails runner /full_pathname/sq.rb [kill|show|kill_jid] <worker name or Job ID>
#
# Or:
#
# BUNDLE_GEMFILE=/opt/gitlab/embedded/service/gitlab-rails/Gemfile /opt/gitlab/embedded/bin/bundle exec /opt/gitlab/embedded/bin/ruby sq.rb -h <hostname> -a <password> [kill|show] <worker name>
# BUNDLE_GEMFILE=/opt/gitlab/embedded/service/gitlab-rails/Gemfile /opt/gitlab/embedded/bin/bundle exec /opt/gitlab/embedded/bin/ruby sq.rb -h <hostname> -a <password> [kill|show|kill_jid] <worker name or Job ID>
#
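# Example of killing a single job by its ID (the Job ID below is hypothetical):
#
#   sudo gitlab-rails runner /full_pathname/sq.rb kill_jid 2c9f1b5e8d3a4f6b
#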
require 'optparse'
require 'sidekiq/api'
@@ -20,14 +20,14 @@ Options = Struct.new(
:dry_run,
:hostname,
:password,
:socket,
:socket
)
 
def parse_options(argv)
options = Options.new
 
opt_parser = OptionParser.new do |opt|
opt.banner = "Usage: #{__FILE__} [options] [kill|show] <worker name>"
opt.banner = "Usage: #{__FILE__} [options] [kill|show|kill_jid] <worker name or job ID>"
 
opt.on('-a', '--auth PASSWORD', 'Redis password') do |password|
options.password = password
@@ -111,6 +111,21 @@ def kill_jobs_by_worker_name(options, worker_name)
count
end
 
def kill_job_by_id(options, job_id)
  # Walk every Sidekiq queue and delete the first job whose JID matches.
  # Returns true if a matching job was found (it is deleted unless --dry-run), false otherwise.
  queue = Sidekiq::Queue.all
  queue.each do |q|
    q.each do |job|
      next unless job.jid == job_id
      job.delete unless options.dry_run
      return true
    end
  end
  false
end

def pretty_print(data)
  data = data.sort_by { |_key, value| value }.reverse
 
@@ -119,27 +134,40 @@ def pretty_print(data)
  end
end
 
def show_sidekiq_data(options)
  queue_data, job_data = load_sidekiq_queue_data
  puts '-----------'
  puts 'Queue size:'
  puts '-----------'
  pretty_print(queue_data)
  puts '------------------------------'
  puts 'Top job counts with arguments:'
  puts '------------------------------'
  pretty_print(job_data)
def show_sidekiq_data
  queue_data, job_data = load_sidekiq_queue_data
  puts '-----------'
  puts 'Queue size:'
  puts '-----------'
  pretty_print(queue_data)
  puts '------------------------------'
  puts 'Top job counts with arguments:'
  puts '------------------------------'
  pretty_print(job_data)
end
 
if $PROGRAM_NAME == __FILE__
  options = parse_options(ARGV)
  configure_sidekiq(options)

  show_sidekiq_data(options) unless options.command.length > 0
  show_sidekiq_data unless options.command.length > 0

  case options.command[0]
  when 'show'
    show_sidekiq_data
  when 'kill_jid'
    if options.command.length != 2
      puts 'Specify a Job ID to kill'
      exit
    end
    jid = options.command[1]
    result = kill_job_by_id(options, jid)
    if result
      puts "Killed job ID #{jid}"
    else
      puts "Unable to find job ID #{jid}"
    end
  when 'kill'
    if options.command.length != 2
      puts 'Specify a worker (e.g. RepositoryUpdateMirrorWorker)'
@@ -4,26 +4,29 @@ Free disk space on runners cache node is less than 20%.
 
## Possible checks
 
SSH to `runners-cache-1.gitlab.com`. You can check available space by executing `df -h | grep /dev/vda1`.
Check which directory is consuming the most space by executing `du -h -d 1 /opt`; in most cases it will
be either `/opt/gitlab/minio` or `/opt/gitlab/registry`.
* SSH to `runners-cache-1.gitlab.com`.
* Check available space by executing `df -h | grep /dev/vda1`.
* Check which directory is consuming the most space by executing `cd /; du -h -d 1 -x`; the `-x` flag limits the scan to a single filesystem so it does not descend into larger mounted drives.
* In most cases it will be either `/opt/minio` or `/opt/registry`.
* Some hosts mount the `cache` (AKA `minio`) and `registry` folders under `/opt/gitlab/` on a different drive.
 
## Fixing `/opt/gitlab/minio`
## Fixing `/opt/minio` or `/opt/gitlab/cache`
 
On the host there is a cron job which runs every hour and deletes files from the cache that are more than 4 days old.
But you can delete files from the cache that are more than three days old by running the following command
> Adjust these commands depending on the filesystem structure.
You can delete files from the cache that are more than three days old by running the following command
 
```
sudo find /opt/gitlab/minio/runner/runner/ -mindepth 3 -maxdepth 6 -ctime +3 -exec rm -rf {} \;
sudo find /opt/minio/runner/runner/ -mindepth 3 -maxdepth 6 -ctime +3 -exec rm -rf {} \;
```
 
Or more than two days old
 
```
sudo find /opt/gitlab/minio/runner/runner/ -mindepth 3 -maxdepth 6 -ctime +2 -exec rm -rf {} \;
sudo find /opt/minio/runner/runner/ -mindepth 3 -maxdepth 6 -ctime +2 -exec rm -rf {} \;
```
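To preview what would be removed before actually deleting anything, swap `-exec rm -rf {} \;` for `-print`; a cautious sketch using the same path and age assumptions as above:

```
sudo find /opt/minio/runner/runner/ -mindepth 3 -maxdepth 6 -ctime +3 -print | head -n 50
```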
 
## Fixing `/opt/gitlab/registry`
## Fixing `/opt/registry`
 
First, stop the registry container
 
@@ -31,10 +34,10 @@ First, stop the registry container
sudo docker stop registry
```
 
then remove everything in `/opt/gitlab/registry`
then remove everything in `/opt/registry`
 
```
sudo rm -r /opt/gitlab/registry/*
sudo rm -r /opt/registry/*
```
 
and finally start the registry container again
@@ -42,3 +45,4 @@ and finally start the registry container again
```
sudo docker start registry
```
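Finally, confirm that space was actually reclaimed and that the registry container is back up:

```
df -h | grep /dev/vda1
sudo docker ps | grep registry
```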