GitLab.com outage 2017-10-06
Context
At 11:33 we noticed a spike in 500 errors on the web nodes, and the API nodes started running at 100% CPU. The investigation that followed led us to understand that the package version had been overridden and the site was running an outdated GitLab version. We quickly fixed the underlying root cause and installed the correct package across the fleet manually.
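For illustration only, a minimal sketch of the kind of fleet-wide version audit this implies, assuming SSH access to apt-based hosts running the omnibus gitlab-ee package; the host names and the expected version string are placeholders, not the actual inventory:

```python
import subprocess

EXPECTED = "10.0.3-ee.0"  # version the fleet should be running (assumption)
HOSTS = ["web-01", "api-01", "sidekiq-01", "git-01"]  # placeholder host names

def installed_version(host: str) -> str:
    """Ask dpkg on a remote host which gitlab-ee version is installed."""
    # Single-quoted so ${Version} is expanded by dpkg-query, not the remote shell.
    cmd = "dpkg-query -W -f='${Version}' gitlab-ee"
    result = subprocess.run(
        ["ssh", host, cmd],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

for host in HOSTS:
    version = installed_version(host)
    status = "OK" if version == EXPECTED else "OUTDATED"
    print(f"{host}: {version} [{status}]")
```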
Timeline
11:33 - 500 errors increase on the web nodes, high load on the app
11:33 - High load on the API nodes
11:38 - We detect an increase in Disk IO Wait (this turned out to be a red herring)
11:40 - Following the database lead, we see an increase in total updates on notes, which may explain the IO wait
11:41 - We realize that everything is pointing to redis3: the Redis cache instance has no connections, and all the throughput is going to redis3.
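A quick way to confirm this kind of imbalance, sketched here under assumptions (redis-py installed, network access to both instances, placeholder hostnames), is to compare the INFO counters on the cache instance and on redis3:

```python
import redis

# Placeholder hostnames; the real instance addresses are assumptions.
INSTANCES = {
    "redis-cache": "redis-cache.example.internal",
    "redis3": "redis3.example.internal",
}

for name, host in INSTANCES.items():
    # INFO exposes connection counts and instantaneous throughput per instance.
    info = redis.Redis(host=host, port=6379).info()
    print(f"{name}: clients={info['connected_clients']} "
          f"ops/sec={info['instantaneous_ops_per_sec']}")
```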
12:26 - Stopped all Sidekiq processes to check whether this would bring the load down.
12:29 - We detect that the version running on GitLab.com is 9.4
12:30 - Stopped everything everywhere, taking downtime.
12:31 - Installing the right version everywhere.
12:34 - We installed the correct version on the following and they are back up:
- Web
- Git
- API
- Registry
12:35 - Propagating the changes
12:40 - Sidekiq (realtime, asap, besteffort) are running with the correct version
12:40 - Registry node is back up with the latest version
12:45 - Deploy node is back up with the latest version
12:54 - Re-checking versions all around
- Web is 10.0.3
- Sidekiq is 10.0.3
- Mailroom is 10.0.3
We still have some hosts on 9.4:
9.4.0-ee.0: 52.179.157.139 52.184.187.253 52.179.154.226 52.184.197.33 10.69.14.101 52.184.197.56 52.225.219.227 40.70.67.3 10.69.14.102
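The remediation itself was done manually during the incident; purely as a sketch of this step, assuming apt-based hosts reachable over SSH with sudo, pinning the package version looks roughly like this (the host subset below is just the internal addresses from the list above):

```python
import subprocess

TARGET = "10.0.3-ee.0"
# Subset of the hosts still reporting 9.4, for illustration only.
OUTDATED_HOSTS = ["10.69.14.101", "10.69.14.102"]

for host in OUTDATED_HOSTS:
    # Installing with an explicit version pin brings the host to exactly
    # the requested gitlab-ee release.
    subprocess.run(
        ["ssh", host, "sudo", "apt-get", "install", "-y", f"gitlab-ee={TARGET}"],
        check=True,
    )
    print(f"{host}: gitlab-ee pinned to {TARGET}")
```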
12:56 - Performing a rolling restart across the FE fleet just to be sure. All nodes restarted on the right version.
13:03 - We realize that Gitaly is running the wrong version.
13:06 - Everything is 10.0.3