Post-mortem for the nfs011 issue where repositories went temporarily missing for customers
Context
A set of users with repositories stored on a single storage node (nfs011) saw some of their repositories disappear on GitLab.com. This was caused by an issue with the repository-existence cache in Redis. At the same time, a single user generated heavy I/O, resulting in a CPU spike on the same storage node.
Timeline
Date: 2017-08-15
- 00:15 UTC - First user report of a missing repository - https://gitlab.com/gitlab-com/support-forum/issues/2320
- 07:00 UTC - Slowness on GitLab.com noticed by GitLab employees - https://gitlab.slack.com/archives/C101F3796/p1502780359000006
- 07:00 UTC - Hacker News post - https://news.ycombinator.com/item?id=15016173
- 07:15 UTC - On-call engineer paged via PagerDuty; https://gitlab.com/gitlab-com/infrastructure/issues/2503 opened to track the issue
- 07:18 UTC - First tweet sent to alert users of the issue - https://gitlab.slack.com/archives/C101F3796/p1502781533000208
- 08:38 UTC - Root cause identified as a single user causing problems by sending us a large amount of data
- 08:52 UTC - Script started to clear the Redis cache on nfs011
- 08:57 UTC - Script to clear the Redis cache on nfs011 finished; the missing-repository issue was resolved
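As a reference for the improvement noted under "What can be improved" below, here is a minimal sketch of what the cache-clearing step could look like if packaged as a Rake task shipped with the application. The task name, the cache key layout, and the `repository_storage` filter are illustrative assumptions, not the script that was actually run during the incident.

```ruby
# lib/tasks/cache/clear_repository_exists.rake (hypothetical)
# Clears the cached "repository exists?" flags for every project on one
# storage shard, forcing the application to re-check the disk on next access.
namespace :cache do
  desc 'Clear cached repository-existence flags for a single storage shard'
  task :clear_repository_exists, [:shard] => :environment do |_t, args|
    shard = args[:shard] || 'nfs011'

    # Assumes projects record which storage shard they live on.
    Project.where(repository_storage: shard).find_each do |project|
      # Cache key format is illustrative; the real key layout may differ.
      Rails.cache.delete("repository/#{project.full_path}/exists?")
    end
  end
end
```

Delivered this way it could be invoked as `bundle exec rake "cache:clear_repository_exists[nfs011]"` and exercised in CI, so it cannot silently drift out of date with the application code.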
Incident Analysis
- How was the incident detected? - By a user, and later by GitLab employees looking at the fleet overview dashboard and Slack.
- Is there anything that could have been done to improve the time to detection? - Better alerting.
- How was the root cause discovered? - A production engineer examined nfs011 over SSH.
- Was this incident triggered by a change? - No
- Was there an existing issue that would have either prevented this incident or reduced the impact? - No
Root Cause Analysis
- TBD
...
What went well
- Coordination: the team worked well together.
What can be improved
- It took longer than it should have to run the cache-cleaning script; this should probably be a Rake task (see the sketch after the timeline above), delivered and tested with the application
- We were unable to view the Gitaly logs in ELK - https://gitlab.com/gitlab-com/infrastructure/issues/2505
- Better Gitaly metrics and monitoring, including per-process monitoring for Gitaly and git on the storage nodes
- Gitaly Child Process accounting: https://gitlab.com/gitlab-org/gitaly/merge_requests/284: with this we'll be able to identify repositories that consume a disproportionate amount of system resources.
- Repository limits: the user had a 91 GB Git repository. We should impose limits on this type of usage.
- Gitaly Process CGroup: the NFS servers should be protected from Gitaly and its child processes: https://gitlab.com/gitlab-com/infrastructure/issues/2364
- Additional Runbook Information: On-call engineers should be better informed of when and how to kill Gitaly git child processes - https://gitlab.com/gitlab-com/runbooks/merge_requests/330
  - Particularly zombie git processes whose parent processes have been killed. Any git process with a PPID of 1 can almost always be killed; we could even script this (see the sketch after this list).
- Runbook Scripts Out of Date: if possible, runbook utilities like the one used to clear the cache should be checked into the Git repo as application code and tested as part of CI, so that they don't go stale and break due to application code changes.
  - This happened during this incident and prolonged it.
- Repository Exists cache TTL: the cache should expire after seconds rather than persist indefinitely, especially once it moves over to Gitaly (a sketch follows after this list).
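On the runbook point in the list above about scripting the cleanup of orphaned git processes: a rough sketch of what such a check could look like, scanning procfs for git processes whose parent has already exited (PPID 1). This is an illustration, not an existing runbook script, and the kill step should remain an explicit operator decision.

```ruby
#!/usr/bin/env ruby
# List git processes that have been re-parented to init (PPID == 1), i.e.
# their original parent (for example a Gitaly worker) has already exited.
# Pass --kill to also send them SIGTERM.

orphans = Dir.glob('/proc/[0-9]*/stat').map do |stat_path|
  begin
    fields = File.read(stat_path).split
    pid, comm, ppid = fields[0].to_i, fields[1], fields[3].to_i
    # Match "git" and "git-*" helpers, but not Gitaly itself.
    git_like = comm == '(git)' || comm.start_with?('(git-')
    [pid, comm] if git_like && ppid == 1
  rescue Errno::ENOENT, Errno::ESRCH
    nil # the process exited while we were scanning
  end
end.compact

orphans.each do |pid, comm|
  puts "orphaned git process: pid=#{pid} comm=#{comm}"
  Process.kill('TERM', pid) if ARGV.include?('--kill')
end
```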
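And for the cache TTL bullet in the list above: a sketch of how the existence check could be cached with a short, bounded expiry instead of an unbounded one, so a stale entry heals itself within seconds. The key name and `repository_exists_on_disk?` are hypothetical placeholders, not the actual GitLab implementation.

```ruby
# Hypothetical helper: cache the expensive existence check for a short,
# bounded time rather than indefinitely.
def cached_repository_exists?(project)
  Rails.cache.fetch("repository/#{project.full_path}/exists?",
                    expires_in: 30.seconds) do
    repository_exists_on_disk?(project) # placeholder for the real Gitaly/disk check
  end
end
```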
Corrective actions
- https://gitlab.com/gitlab-com/infrastructure/issues/1498
- https://gitlab.com/gitlab-com/infrastructure/issues/2511
- https://gitlab.com/gitlab-com/runbooks/merge_requests/330