
runbooks

    Pablo Carranza authored

    GitLab On Call Run Books

    The aim of this project is to provide a quick guide on what to do when an emergency arises.

    CRITICAL

    • Spend one minute to create an issue for the outage. Don't forget the outage label, as specified in the handbook.

    What to do when

    Replication fails

    Chef/Knife

    CI

    CephFS

    Alerting and monitoring

    Outdated

    How do I

    Deploy

    Work with the fleet and the rails app

    Work with the Database

    Work with storage

    Mangle front end load balancers

    Work with Chef

    Work with CI Infrastructure

    Work with Infrastructure Providers (VMs)

    Manually ban an IP or netblock

    Debug and monitor

    Manage backups

    General guidelines in an emergency

    • Confirm that it is actually an emergency. Challenge this: are we losing data? Is GitLab.com not working?
    • Tweet in a reassuring but informative way to let people know what's going on
    • Join the #infrastructure channel
    • Define a point person or incident owner. This is the person who will gather all the data and coordinate the efforts.
    • Organize:
      • Establish who is the point person on the incident in the #infrastructure channel: "@here I'm taking point" and pin the message for the duration of the emergency.
      • Start a war room using Zoom if it will save time
      • Share the link in the #infrastructure channel
      • If the point person needs someone to do something, give a direct command: @SOMEONE: please run this command
    • Stay in sync. If you are going to reboot a service, say so: "I'm bouncing server X"
    • If you have conflicting information, stop and think, bounce ideas, escalate
    • Gather information when the incident is done: logs, samples of graphs, whatever could help figure out what happened
    • If we lack monitoring or alerting, open an issue and label it as monitoring, even if you close the issue immediately. See the handbook.

    Guidelines

    Other Servers and Services

    Adding runbooks rules

    • Make it quick - add links for checks
    • Don't make me think - write clear guidelines, write expectations
    • Recommended structure
      • Symptoms - how can I quickly tell that this is what is going on
      • Pre-checks - how can I be 100% sure
      • Resolution - what do I have to do to fix it
      • Post-checks - how can I be 100% sure that it is solved
      • Rollback - optional, how can I undo my fix
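
The recommended structure above can be checked mechanically before a runbook is merged. The following is a minimal sketch, not part of this project: the function name `missing_sections` is hypothetical, and it assumes each section is introduced by a markdown heading such as `## Symptoms`.

```python
import re

# Section names taken from the recommended structure above.
# Rollback is optional, so it is not required here.
REQUIRED_SECTIONS = ["Symptoms", "Pre-checks", "Resolution", "Post-checks"]


def missing_sections(runbook_text):
    """Return the required section names that have no heading in the text."""
    # Assumption: sections are markdown headings, e.g. "## Symptoms".
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#+\s*(.+?)\s*$", runbook_text, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

An empty result means every required section has a heading. It does not say whether the content under each heading is useful, so a human review is still needed.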

    But always remember!

    Don't Panic