GitLab.com outage 2017-10-06
Context
At 11:33 we noticed a spike in 500 errors on the web nodes, and the API nodes started running at 100% CPU. The investigation that followed led us to understand that the package version had been overridden and the site was running an outdated GitLab version. We quickly fixed the underlying root cause and installed the correct package across the fleet manually.
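For illustration only, a minimal sketch of the kind of fleet-wide version audit this implies, assuming SSH access to apt-based hosts running the omnibus gitlab-ee package; the host names and the expected version string are placeholders, not the actual inventory:

```python
import subprocess

EXPECTED = "10.0.3-ee.0"  # version the fleet should be running (assumption)
HOSTS = ["web-01", "api-01", "sidekiq-01", "git-01"]  # placeholder host names

def installed_version(host: str) -> str:
    """Ask dpkg on a remote host which gitlab-ee version is installed."""
    # Single-quoted so ${Version} is expanded by dpkg-query, not the remote shell.
    cmd = "dpkg-query -W -f='${Version}' gitlab-ee"
    result = subprocess.run(
        ["ssh", host, cmd],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

for host in HOSTS:
    version = installed_version(host)
    status = "OK" if version == EXPECTED else "OUTDATED"
    print(f"{host}: {version} [{status}]")
```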
Timeline
11:33 - 500 errors increase on the web nodes, high load on the app
11:33 - High load on the API nodes
11:38 - We detect an increase in Disk IO Wait (this turned out to be a red herring)
11:40 - Following the database lead, we see an increase in total updates on notes, which may explain the IO wait
11:41 - We realize that everything is pointing to redis3: the Redis cache instance has no connections, and all the throughput is going to redis3.
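A quick way to confirm this kind of imbalance, sketched here under assumptions (redis-py installed, network access to both instances, placeholder hostnames), is to compare the INFO counters on the cache instance and on redis3:

```python
import redis

# Placeholder hostnames; the real instance addresses are assumptions.
INSTANCES = {
    "redis-cache": "redis-cache.example.internal",
    "redis3": "redis3.example.internal",
}

for name, host in INSTANCES.items():
    # INFO exposes connection counts and instantaneous throughput per instance.
    info = redis.Redis(host=host, port=6379).info()
    print(f"{name}: clients={info['connected_clients']} "
          f"ops/sec={info['instantaneous_ops_per_sec']}")
```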
12:26 - Stopped all Sidekiq processes to check whether this would bring the load down.
12:29 - We detect that the version running on GitLab.com is 9.4
12:30 - Stopped everything everywhere, taking downtime.
12:31 - Installing the right version everywhere.
12:34 - We installed the correct version on the following and they are back up:
- Web
- Git
- API
- Registry
12:35 - Propagating the changes
12:40 - Sidekiq (realtime, asap, besteffort) are running with the correct version
12:40 - Registry node is back up with the latest version
12:45 - Deploy node is back up with the latest version
12:54 - Re-checking versions all around
- Web is 10.0.3
- Sidekiq is 10.0.3
- Mailroom is 10.0.3
We still have some hosts on 9.4:
9.4.0-ee.0: 52.179.157.139 52.184.187.253 52.179.154.226 52.184.197.33 10.69.14.101 52.184.197.56 52.225.219.227 40.70.67.3 10.69.14.102
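The remediation itself was done manually during the incident; purely as a sketch of this step, assuming apt-based hosts reachable over SSH with sudo, pinning the package version looks roughly like this (the host subset below is just the internal addresses from the list above):

```python
import subprocess

TARGET = "10.0.3-ee.0"
# Subset of the hosts still reporting 9.4, for illustration only.
OUTDATED_HOSTS = ["10.69.14.101", "10.69.14.102"]

for host in OUTDATED_HOSTS:
    # Installing with an explicit version pin brings the host to exactly
    # the requested gitlab-ee release.
    subprocess.run(
        ["ssh", host, "sudo", "apt-get", "install", "-y", f"gitlab-ee={TARGET}"],
        check=True,
    )
    print(f"{host}: gitlab-ee pinned to {TARGET}")
```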
12:56 - Performing a rolling restart across the FE fleet just to be sure. All nodes restarted on the right version.
13:03 - We realize that Gitaly is running the wrong version.
13:06 - Everything is 10.0.3