Commit c392ea76 authored by Alex Hanselka

Merge branch 'master' of gitlab.com:gitlab-com/runbooks

* 'master' of gitlab.com:gitlab-com/runbooks:
  Update alerts_manual.md
  How to add more slack channels to alertmanager
  Add information about available channels for alertmanager
  Add documentation about snapshot backups from chef
  Add support for killing a Sidekiq Job ID
  Update manage-chef.md
  Update oncall.md
  Update elasticsearch.md
  Update the documentation on runners cache disk space
  add link to on-call log
  Add section of things to do while on-call
parents 2a24fbc7 63f53bcb
@@ -68,6 +68,10 @@ The aim of this project is to have a quick guide of what to do when an emergency
 
## How do I
 
### On Call
* [Common tasks to perform while on-call](howto/oncall.md)
### Deploy
 
* [Get the diff between dev versions](howto/dev-environment.md#figure-out-the-diff-of-deployed-versions)
@@ -87,9 +91,10 @@ The aim of this project is to have a quick guide of what to do when an emergency
* [Use aptly](howto/aptly.md)
* [Disable PackageCloud](howto/stop-or-start-packagecloud.md)
 
### Work with the Database
### Restore Backups
 
* [Database Backups and Replication with Wal-E](howto/using-wale.md)
* [LVM and Azure snapshots](howto/manage-snapshot-backups.md)
 
### Work with storage
 
@@ -23,7 +23,7 @@ The common procedure is as follows:
ALERT runners_cache_is_down
IF probe_success{job="runners-cache", instance="localhost:9100"} == 0
FOR 10s
LABELS {severity="critical", channel="infrastructure", pager="pagerduty"}
LABELS {severity="critical", channel="production", pager="pagerduty"}
ANNOTATIONS {
title="Runners cache has been down for the past 10 seconds",
runbook="howto/howto/manage-cehpfs.md"
@@ -31,7 +31,7 @@ ALERT runners_cache_is_down
}
```
 
This will result in a critical alert posted in the slack channels `#prometheus-alerts` and `#infrastructure`, and a PagerDuty page with a link to https://dev.gitlab.com/cookbooks/runbooks/blob/master/howto/manage-cehpfs.md. The important part is the end of the URL - `howto/manage-cehpfs.md`. It is taken from the `runbook` annotation. The runbook will provide information on how to handle the situation that was alerted on. The main principle of a runbook should be: `don't make me think`.
This will result in a critical alert posted in the slack channels `#prometheus-alerts` and `#production`, and a PagerDuty page with a link to https://dev.gitlab.com/cookbooks/runbooks/blob/master/howto/manage-cehpfs.md. The important part is the end of the URL - `howto/manage-cehpfs.md`. It is taken from the `runbook` annotation. The runbook will provide information on how to handle the situation that was alerted on. The main principle of a runbook should be: `don't make me think`. For `channel` you can use the `#production`, `#ci`, or `#gitaly` values.
 
### What if I want to add more data?
 
@@ -53,7 +53,7 @@ All alerts are routed to slack and additionally can be paged to PagerDuty.
 
1. Since all alerts are sent to slack, you can only control the type of alert.
1. All alerts will be shown in `#prometheus-alerts` channel.
1. Additionally you can send alerts to the `#ci` and `#infrastructure` channels. This is controlled with the labels `channel='ci'` and `channel='infrastructure'`.
1. Additionally you can send alerts to the `#ci`, `#gitaly`, and `#production` channels. This is controlled with the `channel` label, for example `channel='ci'`.
1. Alerts with `severity=critical` are red-colored messages with the `.title`, a link to the corresponding runbook, and the `.description` value from the alert.
1. Alerts with `severity=warn` are yellow-colored messages with the `.title`, a link to the corresponding runbook, and the `.description` value from the alert.
1. Alerts with `severity=info` are green-colored messages with the `.title`, a link to the corresponding runbook, and the `.description` value from the alert.
@@ -70,6 +70,30 @@ All alerts are routed to slack and additionally can be paged to PagerDuty.
 
Currently we are not using email alerting rules.
 
### Add more slack channels to alerts
1. In the routes section of the [alertmanager.yml template](https://gitlab.com/gitlab-cookbooks/gitlab-prometheus/blob/master/templates/default/alertmanager.yml.erb) add the following:
```
- match:
    channel: gitaly
  receiver: slack_gitaly
  continue: true
```
1. In the receivers section of the [alertmanager.yml template](https://gitlab.com/gitlab-cookbooks/gitlab-prometheus/blob/master/templates/default/alertmanager.yml.erb) add the following. Note that the `send_resolved`, `icon_emoji`, etc. values must be taken from the `slack_production` receiver:
```
- name: slack_gitaly
  slack_configs:
  - api_url: '<%= @conf['slack']['api_url'] %>'
    channel: '#gitaly'
    send_resolved: true
    icon_emoji: ...
    title: ...
    title_link: ...
    text: ...
    fallback: ...
```
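1. Once both sections are in place, the change has to reach the node that runs alertmanager. A minimal rollout sketch, assuming the node converges the `gitlab-prometheus` cookbook with `chef-client` and that the rendered config ends up at `/etc/prometheus/alertmanager.yml` (the path is an assumption):
```
# on the node running alertmanager, converge chef so the updated template is rendered
sudo chef-client
# confirm the new route and receiver appear in the rendered config (path is an assumption)
sudo grep -A2 'slack_gitaly' /etc/prometheus/alertmanager.yml
```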
### Note about alerts which do not fit in any routes
 
1. Alerts which are routed by the default route will be sent to the `#prometheus-alerts` channel in slack.
@@ -47,3 +47,16 @@ curl 'http://localhost:9200/_cat/thread_pool?v'
```
curl http://localhost:9200/_template/logstash?pretty
```
### Create new index
```
curl -XPUT localhost:9200/gitlab -d '{
  "settings": {
    "index": {
      "number_of_shards": 120,
      "number_of_replicas": 1
    }
  }
}'
```
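### Check the index settings
A quick way to confirm the index came up with the intended settings, assuming the same local node and the `gitlab` index name from the example above:
```
curl 'localhost:9200/gitlab/_settings?pretty'
curl 'localhost:9200/_cat/shards/gitlab?v'
```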
@@ -33,8 +33,8 @@ This is done locally by another chef admin:
To do this it will be necessary to create a new keypair. Because of how chef behaves, the key has to be called _default_.
 
* ssh into the chef server
* run `sudo -i chef-server-ctl remove-user-key _username_ default` to remove the default.
* run `sudo -i chef-server-ctl add-user-key _username_ default` to create a new default key.
* run `sudo -i chef-server-ctl delete-user-key _username_ default` to remove the default.
* run `sudo -i chef-server-ctl add-user-key _username_ --key-name default` to create a new default key.
* copy the private key generated by chef for this user.
 
If _default_ is not declared, chef will use the fingerprint of the key as a name.
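To double-check the rotation, you can list the user's keys on the chef server; a small sketch (the output format may differ between chef server versions):

```
# on the chef server: the user should now have a single key named "default"
sudo -i chef-server-ctl list-user-keys _username_
```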
# Backups and restore
We currently have multiple backup solutions:
- AWS snapshots by `ebs.gitlap.com`
- Azure snapshots by `azure.gitlap.com`
- GitLab backup for database and pages
## AWS snapshots
### Snapshots
Every night on `ebs.gitlap.com` the snapshot script `/opt/gitlab-backup/bin/gitlab-ebs-snapshot` creates snapshots for all EC2 instances.
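To confirm that last night's run actually happened, check the cron logs on the backup host; a minimal sketch, assuming the job is run from cron and logs to syslog under the script name:
```
# on ebs.gitlap.com: look for recent runs of the snapshot script
grep gitlab-ebs-snapshot /var/log/syslog | tail -n 20
```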
### Restore
https://dev.gitlab.org/cookbooks/gitlab-backup/blob/master/doc/gitlab-ebs-snapshot.md#restoring
## Azure snapshots
### Snapshots
Every night on `azure.gitlap.com` the snapshot script `/opt/gitlab-backup/bin/gitlab-azure-snapshots` creates snapshots for the Azure instances mentioned in `/etc/gitlab-azure-snapshots.yml`.
Just add the chef role `"role[azure-snapshot]"` to a node and snapshots will be created.
### Restore
On `azure.gitlap.com` you can use the script `/opt/gitlab-backup/bin/gitlab-azure-restore` to restore the snapshots and attach the data disks to a newly created instance.
So to restore, for example, file-storage1.cluster.gitlab.com, follow these steps:
#### Create new node
```
~/chef-repo/tools/bin/azure-create-node --role backend file-storage2.cluster.gitlab.com
```
#### Restore snapshot and attach the data disks
Log in to `azure.gitlap.com`.
You can get a list of available epochs by checking the snapshot info files in `/var/lib/gitlab-azure-snapshots`:
```
ls -al /var/lib/gitlab-azure-snapshots/
```
Restore snapshots:
```
/opt/gitlab-backup/bin/gitlab-azure-restore --epoch 1463965202 --source file-storage1.cluster.gitlab.com file-storage2.cluster.gitlab.com
```
#### Activate logical volume
Activate the logical volume in the newly created node:
```
# LVM support
sudo apt-get install -y lvm2
# look for gitlab_vg on the attached drives
sudo vgchange -ay gitlab_vg
# make sure the mountpoint exists
sudo mkdir -p /var/opt/gitlab
# mount the logical volume at /var/opt/gitlab
sudo mount /dev/gitlab_vg/gitlab_com /var/opt/gitlab
```
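After mounting, a quick sanity check that the restored data is actually there (using the same paths as above):
```
# confirm the logical volume is mounted and the restored data is visible
df -h /var/opt/gitlab
sudo ls /var/opt/gitlab
```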
# So you got yourself on call
To start off on the right foot, let's define a set of tasks that are nice things to do before you go any further into your week.
By performing these tasks we will keep the [broken window effect](https://en.wikipedia.org/wiki/Broken_windows_theory) under control, preventing future pain and mess.
## Things to keep an eye on
### On-call log
First check [the on-call log](https://docs.google.com/document/d/1nWDqjzBwzYecn9Dcl4hy1s4MLng_uMq-8yGRMxtgK6M/edit#heading=h.nmt24c52ggf5) to familiarize yourself with what has been happening lately; if anything is on fire, it should be written down there in the **Pending actions** section.
### Alerts
Start by checking how many alerts are in flight right now. To do this (see the sketch after this list):
- go to the [fleet overview dashboard](https://performance.gitlab.net/dashboard/db/fleet-overview) and check the number of Active Alerts; it should be 0. If it is not 0:
- go to the alerts dashboard and check what is [being triggered](https://prometheus.gitlab.com/alerts); each alert here should point you to the right runbook to fix it.
- if they don't, you have more work to do.
- be sure to create an issue, particularly to declare toil so we can work on it and suppress it.
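If you prefer the terminal, the same information can be pulled from the Prometheus API via the built-in `ALERTS` metric; a minimal sketch, assuming the API is reachable from your machine and `jq` is installed:
```
# list currently firing alerts from the Prometheus API
curl -sg 'https://prometheus.gitlab.com/api/v1/query?query=ALERTS{alertstate="firing"}' | jq '.data.result[].metric'
```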
### Nodes status
Go to your chef repo and run `knife status`; if you see hosts that are red, it means that chef hasn't been running there for a long time. Check the on-call log to see whether they were disabled for a particular reason; if they were not, and there is no mention of an ongoing issue, consider jumping in to check why chef has not been running there (a quick way to spot the stale nodes is sketched below).
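A rough sketch of doing this from the command line; the exact `knife status` output format may vary a little between knife versions, and `<node-name>` is a placeholder:
```
# from your chef-repo: nodes that have not checked in for hours or days are the suspects
knife status | grep -E 'hours?|days?'
# then log on to a suspect node and run chef by hand to see why it is failing
ssh <node-name>
sudo chef-client
```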
### Prometheus targets down
Check how many targets are not being scraped at the moment. To do this (see the sketch after this list):
- go to the [fleet overview dashboard](https://performance.gitlab.net/dashboard/db/fleet-overview) and check the number of Targets down. It should be 0. If it is not 0:
- go to the [targets down list](https://prometheus.gitlab.com/consoles/up.html) and check which targets are down.
- try to figure out why there are scraping problems and try to fix them. Note that sometimes there can be temporary scraping problems because of exporter errors.
- be sure to create an issue, particularly to declare toil so we can work on it and suppress it.
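The same list can be pulled from the Prometheus API; a minimal sketch, assuming the API is reachable and `jq` is installed:
```
# list scrape targets that are currently down
curl -sg 'https://prometheus.gitlab.com/api/v1/query?query=up==0' | jq '.data.result[].metric'
```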
@@ -6,11 +6,11 @@
#
# If you need to run this on an Omnibus GitLab machine, run:
#
# sudo gitlab-rails runner /full_pathname/sq.rb [kill|show] <worker name>
# sudo gitlab-rails runner /full_pathname/sq.rb [kill|show|kill_jid] <worker name or Job ID>
#
# Or:
#
# BUNDLE_GEMFILE=/opt/gitlab/embedded/service/gitlab-rails/Gemfile /opt/gitlab/embedded/bin/bundle exec /opt/gitlab/embedded/bin/ruby sq.rb -h <hostname> -a <password> [kill|show] <worker name>
# BUNDLE_GEMFILE=/opt/gitlab/embedded/service/gitlab-rails/Gemfile /opt/gitlab/embedded/bin/bundle exec /opt/gitlab/embedded/bin/ruby sq.rb -h <hostname> -a <password> [kill|show|kill_jid] <worker name or Job ID>
#
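# Example of killing a single job by its ID (the Job ID below is hypothetical):
#
#   sudo gitlab-rails runner /full_pathname/sq.rb kill_jid 2c9f1b5e8d3a4f6b
#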
require 'optparse'
require 'sidekiq/api'
@@ -20,14 +20,14 @@ Options = Struct.new(
:dry_run,
:hostname,
:password,
:socket,
:socket
)
 
def parse_options(argv)
options = Options.new
 
opt_parser = OptionParser.new do |opt|
opt.banner = "Usage: #{__FILE__} [options] [kill|show] <worker name>"
opt.banner = "Usage: #{__FILE__} [options] [kill|show|kill_jid] <worker name or job ID>"
 
opt.on('-a', '--auth PASSWORD', 'Redis password') do |password|
options.password = password
@@ -111,6 +111,21 @@ def kill_jobs_by_worker_name(options, worker_name)
count
end
 
def kill_job_by_id(options, job_id)
  # Walk every Sidekiq queue and delete the first job whose JID matches.
  # Returns true if a matching job was found (it is deleted unless --dry-run), false otherwise.
  queue = Sidekiq::Queue.all
  queue.each do |q|
    q.each do |job|
      next unless job.jid == job_id
      job.delete unless options.dry_run
      return true
    end
  end
  false
end

def pretty_print(data)
  data = data.sort_by { |_key, value| value }.reverse
 
@@ -119,27 +134,40 @@ def pretty_print(data)
  end
end
 
def show_sidekiq_data(options)
  queue_data, job_data = load_sidekiq_queue_data
  puts '-----------'
  puts 'Queue size:'
  puts '-----------'
  pretty_print(queue_data)
  puts '------------------------------'
  puts 'Top job counts with arguments:'
  puts '------------------------------'
  pretty_print(job_data)
def show_sidekiq_data
  queue_data, job_data = load_sidekiq_queue_data
  puts '-----------'
  puts 'Queue size:'
  puts '-----------'
  pretty_print(queue_data)
  puts '------------------------------'
  puts 'Top job counts with arguments:'
  puts '------------------------------'
  pretty_print(job_data)
end
 
if $PROGRAM_NAME == __FILE__
  options = parse_options(ARGV)
  configure_sidekiq(options)

  show_sidekiq_data(options) unless options.command.length > 0
  show_sidekiq_data unless options.command.length > 0

  case options.command[0]
  when 'show'
    show_sidekiq_data
  when 'kill_jid'
    if options.command.length != 2
      puts 'Specify a Job ID to kill'
      exit
    end
    jid = options.command[1]
    result = kill_job_by_id(options, jid)
    if result
      puts "Killed job ID #{jid}"
    else
      puts "Unable to find job ID #{jid}"
    end
  when 'kill'
    if options.command.length != 2
      puts 'Specify a worker (e.g. RepositoryUpdateMirrorWorker)'
@@ -4,26 +4,29 @@ Free disk space on runners cache node is less than 20%.
 
## Possible checks
 
SSH to `runners-cache-1.gitlab.com`. You can check available space by executing `df -h | grep /dev/vda1`.
Check which directory is consuming the most space by executing `du -h -d 1 /opt`; in most cases it will
be either `/opt/gitlab/minio` or `/opt/gitlab/registry`.
* SSH to `runners-cache-1.gitlab.com`.
* Check available space by executing `df -h | grep /dev/vda1`.
* Check which directory is consuming the most space by executing `cd /; du -h -d 1 -x`; the `-x` flag limits the scan to a single filesystem so it does not descend into larger mounted drives.
* In most cases it will be either `/opt/minio` or `/opt/registry`.
* Some hosts mount the `cache` (AKA `minio`) and `registry` folders under `/opt/gitlab/` on a different drive.
 
## Fixing `/opt/gitlab/minio`
## Fixing `/opt/minio` or `/opt/gitlab/cache`
 
On the host there is a cron job which runs every hour and deletes files from the cache that are more than 4 days old.
But you can delete files from the cache that are more than three days old by running the following command
> Adjust these commands depending on the filesystem structure.
You can delete files from the cache that are more than three days old by running the following command
 
```
sudo find /opt/gitlab/minio/runner/runner/ -mindepth 3 -maxdepth 6 -ctime +3 -exec rm -rf {} \;
sudo find /opt/minio/runner/runner/ -mindepth 3 -maxdepth 6 -ctime +3 -exec rm -rf {} \;
```
 
Or more than two days old
 
```
sudo find /opt/gitlab/minio/runner/runner/ -mindepth 3 -maxdepth 6 -ctime +2 -exec rm -rf {} \;
sudo find /opt/minio/runner/runner/ -mindepth 3 -maxdepth 6 -ctime +2 -exec rm -rf {} \;
```
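To preview what would be removed before actually deleting anything, swap `-exec rm -rf {} \;` for `-print`; a cautious sketch using the same path and age assumptions as above:

```
sudo find /opt/minio/runner/runner/ -mindepth 3 -maxdepth 6 -ctime +3 -print | head -n 50
```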
 
## Fixing `/opt/gitlab/registry`
## Fixing `/opt/registry`
 
First, stop the registry container
 
@@ -31,10 +34,10 @@ First, stop the registry container
sudo docker stop registry
```
 
then remove everything in `/opt/gitlab/registry`
then remove everything in `/opt/registry`
 
```
sudo rm -r /opt/gitlab/registry/*
sudo rm -r /opt/registry/*
```
 
and finally start the registry container again
@@ -42,3 +45,4 @@ and finally start the registry container again
```
sudo docker start registry
```
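Finally, confirm that space was actually reclaimed and that the registry container is back up:

```
df -h | grep /dev/vda1
sudo docker ps | grep registry
```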