Post-mortem for the nfs011 issue where repositories went temporarily missing for customers
Context
A set of users with repositories stored on a single storage node (nfs011) saw some of their repositories disappear on GitLab.com. This was caused by an issue with the repository-existence cache in Redis. At the same time, a single user generated heavy I/O, resulting in a CPU spike on the same storage node.
Timeline
Date: 2017-08-15
- 00:15 UTC - First user report of a missing repository - https://gitlab.com/gitlab-com/support-forum/issues/2320
- 07:00 UTC - Slowness on GitLab.com noticed by GitLab employees - https://gitlab.slack.com/archives/C101F3796/p1502780359000006
- 07:00 UTC - Hacker News post - https://news.ycombinator.com/item?id=15016173
- 07:15 UTC - On-call engineer paged via PagerDuty; https://gitlab.com/gitlab-com/infrastructure/issues/2503 opened to track the issue
- 07:18 UTC - First tweet sent to alert users of the issue - https://gitlab.slack.com/archives/C101F3796/p1502781533000208
- 08:38 UTC - Root cause identified as a single user causing problems by sending us a large amount of data
- 08:52 UTC - Script started to clear the Redis cache on nfs011
- 08:57 UTC - Script to clear the Redis cache on nfs011 finished; the missing-repository issue was resolved
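As a reference for the improvement noted under "What can be improved" below, here is a minimal sketch of what the cache-clearing step could look like if packaged as a Rake task shipped with the application. The task name, the cache key layout, and the `repository_storage` filter are illustrative assumptions, not the script that was actually run during the incident.

```ruby
# lib/tasks/cache/clear_repository_exists.rake (hypothetical)
# Clears the cached "repository exists?" flags for every project on one
# storage shard, forcing the application to re-check the disk on next access.
namespace :cache do
  desc 'Clear cached repository-existence flags for a single storage shard'
  task :clear_repository_exists, [:shard] => :environment do |_t, args|
    shard = args[:shard] || 'nfs011'

    # Assumes projects record which storage shard they live on.
    Project.where(repository_storage: shard).find_each do |project|
      # Cache key format is illustrative; the real key layout may differ.
      Rails.cache.delete("repository/#{project.full_path}/exists?")
    end
  end
end
```

Delivered this way it could be invoked as `bundle exec rake "cache:clear_repository_exists[nfs011]"` and exercised in CI, so it cannot silently drift out of date with the application code.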
Incident Analysis
- How was the incident detected? - By a user, and later by GitLab employees looking at the fleet overview dashboard and Slack.
- Is there anything that could have been done to improve the time to detection? - Better alerting.
- How was the root cause discovered? - A production engineer examined nfs011 over SSH.
- Was this incident triggered by a change? - No
- Was there an existing issue that would have either prevented this incident or reduced the impact? - No
Root Cause Analysis
- TBD
...
What went well
- Coordination: the team worked well together.
What can be improved
- It took longer than it should have to run the cache-cleaning script; this should probably be a Rake task (see the sketch after the timeline above), delivered and tested with the application
- We were unable to view the Gitaly logs in ELK - https://gitlab.com/gitlab-com/infrastructure/issues/2505
- Better Gitaly metrics and monitoring, including per-process monitoring for Gitaly and git on the storage nodes
- Gitaly Child Process accounting: https://gitlab.com/gitlab-org/gitaly/merge_requests/284: with this we'll be able to identify repositories that consume a disproportionate amount of system resources.
- Repository limits: the user had a 91 GB Git repository. We should impose limits on this type of usage.
- Gitaly Process CGroup: the NFS servers should be protected from Gitaly and its child processes: https://gitlab.com/gitlab-com/infrastructure/issues/2364
- Additional Runbook Information: On-call engineers should be better informed of when and how to kill Gitaly git child processes - https://gitlab.com/gitlab-com/runbooks/merge_requests/330
  - Particularly zombie git processes whose parent processes have been killed. Any git process with a PPID of 1 can almost always be killed; we could even script this (see the sketch after this list).
- Runbook Scripts Out of Date: if possible, runbook utilities like the one used to clear the cache should be checked into the Git repo as application code and tested as part of CI, so that they don't go stale and break due to application code changes.
  - This happened during this incident and prolonged it.
- Repository Exists cache TTL: the cache should expire after seconds rather than persist indefinitely, especially once it moves over to Gitaly (a sketch follows after this list).
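On the runbook point in the list above about scripting the cleanup of orphaned git processes: a rough sketch of what such a check could look like, scanning procfs for git processes whose parent has already exited (PPID 1). This is an illustration, not an existing runbook script, and the kill step should remain an explicit operator decision.

```ruby
#!/usr/bin/env ruby
# List git processes that have been re-parented to init (PPID == 1), i.e.
# their original parent (for example a Gitaly worker) has already exited.
# Pass --kill to also send them SIGTERM.

orphans = Dir.glob('/proc/[0-9]*/stat').map do |stat_path|
  begin
    fields = File.read(stat_path).split
    pid, comm, ppid = fields[0].to_i, fields[1], fields[3].to_i
    # Match "git" and "git-*" helpers, but not Gitaly itself.
    git_like = comm == '(git)' || comm.start_with?('(git-')
    [pid, comm] if git_like && ppid == 1
  rescue Errno::ENOENT, Errno::ESRCH
    nil # the process exited while we were scanning
  end
end.compact

orphans.each do |pid, comm|
  puts "orphaned git process: pid=#{pid} comm=#{comm}"
  Process.kill('TERM', pid) if ARGV.include?('--kill')
end
```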
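And for the cache TTL bullet in the list above: a sketch of how the existence check could be cached with a short, bounded expiry instead of an unbounded one, so a stale entry heals itself within seconds. The key name and `repository_exists_on_disk?` are hypothetical placeholders, not the actual GitLab implementation.

```ruby
# Hypothetical helper: cache the expensive existence check for a short,
# bounded time rather than indefinitely.
def cached_repository_exists?(project)
  Rails.cache.fetch("repository/#{project.full_path}/exists?",
                    expires_in: 30.seconds) do
    repository_exists_on_disk?(project) # placeholder for the real Gitaly/disk check
  end
end
```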
Corrective actions
- https://gitlab.com/gitlab-com/infrastructure/issues/1498
- https://gitlab.com/gitlab-com/infrastructure/issues/2511
- https://gitlab.com/gitlab-com/runbooks/merge_requests/330