Per the discussion in https://gitlab.com/gitlab-org/gitlab-ce/issues/26897#note_31309725, currently when an NFS server has an outage, GitLab.com has an outage. This is unnecessary: it should be possible to have GitLab.com continue to run while serving an error message that can be understood by any user when the git repo that they are trying to access can't be reached. (I'd like to restrict the discussion to NFS servers with git repos, to limit scope and because there are other solutions in the works for non git repos.)
Proposal
As a user, I expect to be able to surf around to issue boards, merge requests, and different projects, even when an NFS server is down. Instead of GitLab.com being down, I'd expect to see an error message in that part of the screen where I would have seen information from / about the git repo, that simply states "Sorry, this repo seems to be temporarily unavailable, for more details check the status of NFS server XX on monitor.gitlab.net/{relevant-deeplink-that-was-automatically-prepopulated}"
Deliverable
This issue is a Deliverable for the Circuit Breaker functionality only; making the rest of the application degrade gracefully is a lot more complex and isn't part of this Deliverable.
As a user, I expect to be able to surf around to issue boards, merge requests, and different projects, even when an NFS server is down.
This is technically doable, but harder than it may seem because tons and tons of pages talk to the Git repo at some point. But we can of course track those down and add proper error handling.
Instead of GitLab.com being down, I'd expect to see an error message in that part of the screen where I would have seen information from / about the git repo, that simply states "Sorry, this repo seems to be temporarily unavailable, for more details check the status of NFS server XX on monitor.gitlab.net/{relevant-deeplink-that-was-automatically-prepopulated}"
We can show an error, but not a GitLab.com specific one, since GitLab doesn't (and shouldn't) know anything about NFS, or monitor.gitlab.net.
This is technically doable, but harder than it may seem because tons and tons of pages talk to the Git repo at some point. But we can of course track those down and add proper error handling.
Awesome! I think that would be the way to start. GitLab would have to be aware enough to know which server the repo is on... or maybe not even that, for a first pass?
Why shouldn't GitLab know anything about NFS? It would be great to have more insight into what causes load on NFS, and two way visibility would help with that... I assume.
@mydigitalself without going into too much detail, what happens is that a user hits a page, the page creates a Rugged object to read something from a repo, that object blocks waiting for the NFS storage timeout while trying to load the repo's refs file, and the whole unicorn process is locked.
Multiply that by multiple users doing something similar in repos on the same storage, and in little time all the unicorn processes are locked waiting on a really long timeout ==> we have no processes left, Nginx starts returning 502 errors because everything is timing out, and GitLab.com is down.
Consider that we access git from many, many, many, many places in the app; just think of people checking MRs or browsing commits, it doesn't take much to have a couple of hundred people doing this.
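Roughly, the blocking path looks like this; a minimal sketch, assuming a made-up controller helper rather than the actual GitLab code:

```ruby
# Illustrative only: shows how a single Rugged call can pin a unicorn worker.
require 'rugged'

def show_project(repo_path)
  # The constructor and any ref lookup read files under the repo on disk
  # (HEAD, refs/, packed-refs). If that path sits on a hung NFS mount, the
  # read blocks until the NFS timeout expires, and the worker blocks with it.
  repo = Rugged::Repository.new(repo_path)
  repo.head.target_id
end
```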
Is there a way to get 90% of the repositories back within a few minutes? I think for this we would need to prevent calls to the affected NFS server. Can we maybe do that by simply not mounting the affected server?
The solution is to use Gitaly completely and deprecate using NFS on the app servers.
I think we all agree on that, but there are a few caveats:
At our current pace of development, it could easily be 6-12 months before all git controllers can be handled by Gitaly (@andrewn feel free to weigh in on that rough estimate)
@andrewn can you add a design diagram there of what the world will look like "post-NFS"?
It's not yet clear how GitLab.com behaves when Gitaly goes down; this is being discussed and investigated in gitaly#200. So it is quite possible that even in the "post-NFS" world, a Gitaly server going down would still lead to GitLab.com going down.
So based on this, I think we need to build graceful degradation in the application regardless of whether git repos are held by NFS or Gitaly.
Certainly to reach availability of GitLab.com >= 99.9% within the next quarter, we'll need to focus on the degradation side.
This is technically doable, but harder than it may seem because tons and tons of pages talk to the Git repo at some point. But we can of course track those down and add proper error handling.
Yeah, I'm sure we'll find lots of places where we didn't expect Git repo access, but I think we'll need to figure this out in any case to transition fully to Gitaly.
Is there a way to get 90% of the repositories back within a few minutes? I think for this we would need to prevent calls to the affected NFS server. Can we maybe do that by simply not mounting the affected server?
I don't think there's a simple way of getting a close-enough clone of the filesystem that went down in a timely manner; just building a new machine and bootstrapping it with Chef already takes about as long as the TTR of a single host going down.
Regarding not mounting the affected server: I think that could be an interesting test to run. Maybe we could have a spare small-ish NFS server available to swap in when a server crashes; then, when we recover, we would need to sync back the files that were added to it.
I'm not sure how well that would work, so we should test how the application behaves, because I do wonder how badly it would fail depending on which files are missing. Maybe we could try this in staging and check, and if it makes sense, script it so we reduce the TTR for regular files that way.
maybe we can improve the MTTR by manually switching to a degraded service.
Yes, we could trigger the degraded service manually indeed, and it makes a lot of sense, we just need to define what a degraded mode is.
As long as we're still relying on NFS, graceful degradation in Gitaly is meaningless, because the whole server is still out of service. This won't change until NFS is totally removed.
When NFS is gone, Gitaly will simply return an error if a git call is made from Rails and the file server is down. The problem of graceful degradation is really about how the Rails developers handle that error.
@pcarranza I believe that rebooting the server takes 45 minutes. I think that the storage of the NFS server is separate (Azure storage disks). Instead of rebooting the server, can we just attach the disks to a standby server that we can quickly set up with the right configuration?
The ideal route to solving this problem is using Gitaly
Gitaly will solve this problem when all git traffic is routed to Gitaly
Because gRPC supports timeouts and deadlines, and because of Golang's green-threading, I am confident that it will be easy to deal with this issue in Gitaly, even before we have completed testing https://gitlab.com/gitlab-org/gitaly/issues/200
However, as long as some operations are going over NFS, Gitaly cannot be relied on as the complete solution
What can we do until Gitaly can solve this problem? I have several proposals which could be used independently or together.
As @pcarranza mentioned, the problem with NFS is unicorn workers queuing up while trying to perform reads from the faulty NFS volume.
This is different but somewhat similar to a big problem we had early on with Gitter: since we rely so heavily on the GitHub API, Gitter's availability was dependent on GitHub's availability. With effort, we've made it so that GitHub outages now have almost no effect on Gitter.
Gates: if we know that a particular NFS mount is down, we should throw an exception immediately rather than trying to read from the volume.
The idea: keep a list of known-bad NFS volumes. Store this in Redis or somewhere. Allow admins to toggle a storage volume as down.
Gitlab::Git objects take a repository path in their constructor. Wrap these objects in a Gate that checks the repository location against the known-bad volumes. If the repo matches, throw an exception immediately, so that the unicorn worker doesn't get hung up waiting for NFS timeouts.
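A minimal sketch of what such a Gate could look like; the Redis key, class name, and wiring are made up for illustration, not the actual implementation:

```ruby
# Illustrative sketch of the Gate idea: fail fast when a repo lives on a
# storage volume that an admin has flagged as down.
require 'redis'

class StorageGate
  StorageUnavailable = Class.new(StandardError)

  def self.check!(repo_path)
    # 'known_bad_storages' is a hypothetical Redis set maintained by admins.
    bad_volumes = Redis.new.smembers('known_bad_storages')
    return unless bad_volumes.any? { |volume| repo_path.start_with?(volume) }

    # Raise immediately so the unicorn worker never touches the hung NFS mount.
    raise StorageUnavailable, "storage backing #{repo_path} is marked as down"
  end
end

# Called before constructing a Gitlab::Git object, for example:
#   StorageGate.check!(repository_path)
```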
Timeouts:
The idea: wrap all the Gitlab::Git objects with timeouts
This can be done in the same way we dynamically wrap the Gitlab::Git objects with timing instrumentation
For the web interface (rails) this can be "relatively" short, say 20 or 30 seconds
When an NFS volume fails, web requests will time out after that period instead of getting hung up waiting on NFS
Still not ideal, but at least this gives us time to mark the NFS volume as bad using the Gates technique above (for example).
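As a rough sketch of that wrapping idea (the delegation below is illustrative, not the real instrumentation code, and note that Ruby's Timeout cannot always interrupt reads that block inside C extensions):

```ruby
require 'timeout'

# Illustrative wrapper: delegate every call to the underlying Gitlab::Git
# object, but give up after a fixed period instead of hanging the worker.
class TimedGitAccess
  GIT_TIMEOUT = 30 # seconds; "relatively" short for web requests

  def initialize(git_object)
    @git_object = git_object
  end

  def method_missing(name, *args, &block)
    # Raises Timeout::Error rather than waiting on a stuck NFS read forever.
    Timeout.timeout(GIT_TIMEOUT) { @git_object.public_send(name, *args, &block) }
  end

  def respond_to_missing?(name, include_private = false)
    @git_object.respond_to?(name, include_private) || super
  end
end
```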
Stale Caching: Gitter is heavily dependent on caching, for speed but also for availability. Unlike many caching strategies, we use stale data if the backend is unavailable
Hot cache items are served immediately
On a cache miss, we attempt to fetch from the source, but if that operation fails, we'll use whatever stale data we have cached, even if it's out of date (would you prefer a slightly out-of-date page, or an error page?)
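Something along these lines; a toy sketch where the in-memory hash stands in for a real cache store such as Redis:

```ruby
# Illustrative stale-cache helper: serve fresh data when we can, fall back to
# stale data when the backend (e.g. a git call) is unavailable.
STALE_CACHE = {}

def fetch_with_stale_fallback(key, fresh_ttl: 60)
  entry = STALE_CACHE[key]
  return entry[:value] if entry && Time.now - entry[:written_at] < fresh_ttl

  begin
    value = yield # hit the backend
    STALE_CACHE[key] = { value: value, written_at: Time.now }
    value
  rescue StandardError
    raise unless entry
    entry[:value] # backend down: slightly out-of-date beats an error page
  end
end

# Usage: fetch_with_stale_fallback("project:readme") { read_readme_from_repo }
```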
Complexity goes beyond just respawning a host with the drives from another one. Let's unfurl all the steps that would be required to get here:
We stop the machine that crashed so Azure does not try to boot it up (this will take some minutes too; I've seen it take up to 10 at least once).
We detach all the drives from this crashed machine.
We spin up a new host with the OS and all the available drives using Terraform.
We wait for this server to perform drive checks, because the drives will be marked as not properly shut down.
We change the NFS mountpoint in all the fleet servers, force an umount, maybe kill the processes that are in D state.
We remount the NFS drive across the whole fleet.
We restart all the services across the whole fleet.
I think that this, with a lot of good automation and all the moving parts working perfectly, would leave us in the same ballpark of MTTR; maybe if everything goes fine we could shave 5 minutes off it.
But this sort of low-level operation under high stress adds a lot of risk, because it all has to happen while we are taking downtime. The chance of something going wrong, and the server not recovering because of all the drive detaching and attaching, increases disproportionately.
Honestly, trying to solve this at the host level by rebuilding a critical host under crazy pressure while taking downtime is the wrong solution. Hosts will go down; the application has to survive it.
I think we need to move in the direction of stability patterns like the ones @andrewn talks about: the application has to survive a file host going down and error out cleanly and quickly, instead of locking in an endless wait.
Been thinking about this and have a potential short-term work-around that would be easy to implement:
On each worker that mounts NFS volumes, the NFS mounts are located at /var/opt/gitlab/...
What we do:
mount a readonly tmpfs volume at /var/opt/gitlab/ and then mount the NFS volumes under this directory, as before
We run a cronjob every 30 seconds that:
For each mount_point, the cronjob attempts to touch a file /var/opt/gitlab/${mount_point}/.monitor/${client_host_name} and read the contents of the file.
This operation has a defined timeout (1 minute? 10 seconds? I don't know); if the timeout is exceeded, the NFS volume is in trouble and needs to be unmounted immediately, before the unicorn worker processes start queuing up:
# touch_and_write_file and send_volume_alert_to_ops stand in for the real check and alert scripts
for dir in /var/opt/gitlab/*; do (timeout 30 touch_and_write_file "${dir}") || (send_volume_alert_to_ops "${dir}"; umount -f "${dir}"); done
When the volume is unmounted, any application code that attempts to read or write to the bad volume will error immediately without hanging. Obviously this will result in more 500's on certain repos but that's better than an outage.
When the volume unmounts, any writes will go to the read-only tmpfs volume at /var/opt/gitlab/ (and will error) and reads will hit non-existent files (and will also error).
This might seem pretty heavy-handed, but it's a quick workaround that doesn't require changes to application logic, and it might be worth experimenting with.
@andrewn that's an interesting proposal, I think that the tmpfs variant makes it simpler. But I think that we should be triggering this manually (at least initially).
@ernstvn the problem with a host going down is that it is out of our control, so it all falls on Azure's side to investigate why it takes so long to boot up.
I like @andrewn's idea of using a read-only tmpfs. I think we would have to solve gitlab-org/gitlab-ce#33117 for this to work, otherwise repositories will go missing until we manually expire the cache. Once we have this, I think the next step is try to set up a separate GitLab instance and swap out an NFS mount and see exactly how things break.
@pcarranza totally agree that we should be triggering it manually, at least until we know the right parameters (timeouts etc.) for doing this automatically.
Having said that, I think we should run the cron that polls the filesystems from the outset, having it send an alert to PagerDuty, so that we can get a good idea of exactly when an unmount would occur if it were automated.
I'd expect to see an error message in that part of the screen where I would have seen information from / about the git repo, that simply states "Sorry, this repo seems to be temporarily unavailable, for more details check the status of NFS server XX on monitor.gitlab.net/{relevant-deeplink-that-was-automatically-prepopulated}"
It sounds as though this is still being worked out @ernstvn. Is the UX described above still needed? I want to make sure UX has time to get eyes on this if needed.
@sarrahvesselov I think we do still need a form of graceful degradation, and it will probably require a UX component. But I'd like to ask @stanhu to weigh in on technical priorities and @victorwu (right? as this pertains mostly to rendering of "discussion" items?) for prioritization from the product perspective.
At our current pace of development, it could easily be 6-12 months before all git controllers can be handled by Gitaly (@andrewn feel free to weigh in on that rough estimate)
Per the Gitaly team OKR, they will aim to tackle 24 endpoint migrations per quarter. Per the original sheet that was used to determine the order of priority of migrations, there are 211 in total, and 63 with a p99 timing (pre-Gitaly) greater than 800 ms. Moving fully to Gitaly requires resolving all 211 endpoints, which at the pace of 24 per quarter will take roughly 9 quarters.
@andrewn the only concern I have with using tmpfs is that we will be accepting writes into this drive, and when the original NFS drive recovers we will need to get things back in sync, which in the tmpfs case means syncing from every host that mounts its own tmpfs. I think a better approach would be to replace one NFS mountpoint with another NFS mount; that way we can sync data back from a single place.
I'll make the assumption that 1 is the most important and @DouweM correct me if I'm wrong, but that's effectively what @reprazent said he was working on in !11449 (merged). In this case, whilst it's still unhelpful to a user who experiences this and receives an error, it's helpful to the wider server health and availability.
We can certainly then look at 2 at a later date, but right now we're just completely swamped and under-resourced with all of the licensing work we have going on.
If 1 is happening right now, can we then agree that we can push this issue out into a future milestone?
Note this is also related to the conversation we had regarding https://gitlab.com/gitlab-com/infrastructure/issues/1943, where we're struggling with how to prioritise these issues. /me thinks it's time for a conversation on this in real time so we can get to a better place on how we can help each other with production performance & availability issues. Spinning 10-minute cycles asynchronously on the topic doesn't seem to be moving us forward.
Once it is confirmed that !11449 (merged) takes care of the first step, I'll gladly spawn a separate new issue for the UX side of things, and close this issue.
So instead of having nginx fail to find a unicorn, we'd have Rails render a 503 and try to avoid accessing the misbehaving FS. That means that instead of GitLab becoming unavailable, only the pages that touch something on the misbehaving FS become unavailable.
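For illustration, the controller-side handling could look roughly like this; the rescued error class reuses the hypothetical StorageGate sketch from earlier in this thread, and this is not the actual code from !11449:

```ruby
# Illustrative: turn a fail-fast storage error into a 503 for just the
# affected pages instead of letting the whole worker pool lock up.
class ApplicationController < ActionController::Base
  rescue_from StorageGate::StorageUnavailable do |_exception|
    render plain: 'The git storage for this project is temporarily unavailable.',
           status: :service_unavailable # 503
  end
end
```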
Once it is confirmed that !11449 (merged) takes care of the first step, I'll gladly spawn a separate new issue for the UX side of things, and close this issue
From this comment @dimitrieh, it looks as though there is no UX at issue here. Once the pending merge request is taken care of, this will be closed and a UX issue opened. Thanks for checking in on this!