GitLab On Call Run Books
The aim of this project is to provide a quick guide to what to do when an emergency arises.
CRITICAL
- Spend one minute to create an issue for the outage. Don't forget the outage label, as specified in the handbook.
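As an illustration of the step above, here is a minimal sketch that builds the GitLab REST API request for opening an issue with the outage label. The host, project path, token, and issue title are hypothetical placeholders, not values from this project, and the helper name `build_outage_issue_request` is made up for the example. The request is only constructed, not sent.

```python
import json
import urllib.parse
import urllib.request


def build_outage_issue_request(base_url, project_path, token, title, description=""):
    """Build (but do not send) a POST request that opens an issue labelled 'outage'."""
    # A project path must be URL-encoded when used in place of a numeric project ID.
    project = urllib.parse.quote(project_path, safe="")
    url = f"{base_url}/api/v4/projects/{project}/issues"
    payload = {"title": title, "description": description, "labels": "outage"}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values: replace with the real issue tracker and a personal access token.
request = build_outage_issue_request(
    "https://gitlab.example.com",
    "group/infrastructure",
    "TOKEN",
    "Outage: GitLab Pages returns 404",
)
```

Sending it would be a single `urllib.request.urlopen(request)` call. Keeping construction separate from sending makes the payload easy to inspect before anything is posted.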
What to do when
- Sidekiq Queues are out of control
- Workers have huge load because of cat-files
- GitLab Pages returns 404
- HAProxy is missing workers
- Worker's root filesystem is running out of space
- Azure Load Balancers Misbehave
- Kibana is down
- SSL certificate expires
- GitLab registry is down
- Sidekiq stats no longer showing
- Sentry is down
Replication fails
Chef/Knife
CI
- The CI runner manager reports a high DO Token Rate Limit usage
- The CI runner manager reports a high number of errors
- Runners cache is down
- Runners registry is down
- Runners cache free disk space is less than 20%
CephFS
Alerting and monitoring
- GitLab monitoring overview
- How to add alerts: Alerts manual
- How to silence alerts
- Alert for SSL certificate expiration
- Working with Grafana
- Working with Prometheus
- Upgrade Prometheus and exporters
Outdated
- The NFS server backend4 is gone
- The DB server db[45] is under heavy load
- Redis keys state UNKNOWN
- Locks in PostgreSQL or Stuck Sidekiq workers
- Postfix queue is stale/growing
- Errors are reported in LOG files
How do I
Deploy
- Get the diff between dev versions
- Deploy GitLab.com
- Rollback GitLab.com
- Deploy staging.GitLab.com
- Refresh data on staging.gitlab.com
Work with the fleet and the rails app
- Restart unicorn with zero downtime
- Gracefully restart sidekiq jobs
- Start a rails console in the staging environment
- Start a redis console in the staging environment
- Start a psql console in the staging environment
- Force a failover with postgres or redis
- Use aptly
- Disable PackageCloud
Work with the Database
Work with storage
Mangle front end load balancers
Work with Chef
- Create users, rotate or remove keys from chef
- Update packages manually for a given role
- Rename a node already in Chef
- Speed up chefspec tests
- Retrieve old values in a Chef vault
- Manage Chef Cookbooks
- Best practices and tips
Work with CI Infrastructure
- Update GitLab Runner on runners managers
- Investigate Abuse Reports
- Create runners manager for GitLab.com
- Update docker-machine
Work with Infrastructure Providers (VMs)
- Create a DO VM for a Service Engineer
- Create VMs in Azure, add disks, etc
- Bootstrap a new VM
- Remove existing node checklist
Manually ban an IP or netblock
Debug and monitor
- Tracing the source of an expensive query
- Work with Kibana (logs view)
- Work with Check_MK (Notifications, scheduled downtime, acknowledge problems)
- Reload CheckMK metrics
- Run pgbadger to analyze queries
Manage backups
General guidelines in an emergency
- Confirm that it is actually an emergency, challenge this: are we losing data? Is GitLab.com not working?
- Tweet in a reassuring but informative way to let the people know what's going on
- Join the #infrastructure channel
- Define a point person or incident owner: this is the person who will gather all the data and coordinate the efforts.
- Organize:
- Establish who is the point person on the incident in the #infrastructure channel: "@here I'm taking point" and pin the message for the duration of the emergency.
- Start a war room using Zoom if it will save time
- Share the link in the #infrastructure channel
- If the point person needs someone to do something, give a direct command: @SOMEONE: please run this command
- Be sure to be in sync - if you are going to reboot a service, say so: I'm bouncing server X
- If you have conflicting information, stop and think, bounce ideas, escalate
- Gather information when the incident is done: logs, samples of graphs, whatever could help figure out what happened
- If we lack monitoring or alerting, open an issue and label it as monitoring, even if you close the issue immediately. See the handbook.
Guidelines
Other Servers and Services
Adding runbooks rules
- Make it quick - add links for checks
- Don't make me think - write clear guidelines, write expectations
- Recommended structure
- Symptoms - how can I quickly tell that this is what is going on
- Pre-checks - how can I be 100% sure
- Resolution - what do I have to do to fix it
- Post-checks - how can I be 100% sure that it is solved
- Rollback - optional, how can I undo my fix
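The recommended structure above can be turned into a starting outline for a new runbook. The following is a minimal sketch, not a tool that exists in this project: the function name `runbook_skeleton` and the plain-text output layout are assumptions. The section names and guiding questions are taken from the list above.

```python
# Section names and guiding questions from the recommended structure.
RUNBOOK_SECTIONS = [
    ("Symptoms", "How can I quickly tell that this is what is going on?"),
    ("Pre-checks", "How can I be 100% sure?"),
    ("Resolution", "What do I have to do to fix it?"),
    ("Post-checks", "How can I be 100% sure that it is solved?"),
    ("Rollback", "Optional: how can I undo my fix?"),
]


def runbook_skeleton(title):
    """Return a plain-text runbook outline with one prompt per section."""
    lines = [title, ""]
    for name, question in RUNBOOK_SECTIONS:
        # Each section starts with its name, followed by the question it must answer.
        lines += [name, f"- {question}", ""]
    return "\n".join(lines)


# Example: outline for one of the scenarios listed in this document.
print(runbook_skeleton("GitLab Pages returns 404"))
```

Replacing each question with concrete links and commands keeps the finished runbook aligned with the "make it quick" and "don't make me think" rules.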