Per the discussion in https://gitlab.com/gitlab-org/gitlab-ce/issues/26897#note_31309725, currently when an NFS server has an outage, GitLab.com has an outage. This is unnecessary: it should be possible to have GitLab.com continue to run while serving an error message that can be understood by any user when the git repo that they are trying to access can't be reached. (I'd like to restrict the discussion to NFS servers with git repos, to limit scope and because there are other solutions in the works for non git repos.)
Proposal
As a user, I expect to be able to surf around to issue boards, merge requests, and different projects, even when an NFS server is down. Instead of GitLab.com being down, I'd expect to see an error message in that part of the screen where I would have seen information from / about the git repo, that simply states "Sorry, this repo seems to be temporarily unavailable, for more details check the status of NFS server XX on monitor.gitlab.net/{relevant-deeplink-that-was-automatically-prepopulated}"
Deliverable
This issue is a Deliverable for the Circuit Breaker functionality only; making the rest of the application degrade gracefully is a lot more complex and isn't part of this Deliverable.
As a user, I expect to be able to surf around to issue boards, merge requests, and different projects, even when an NFS server is down.
This is technically doable, but harder than it may seem because tons and tons of pages talk to the Git repo at some point. But we can of course track those down and add proper error handling.
Instead of GitLab.com being down, I'd expect to see an error message in that part of the screen where I would have seen information from / about the git repo, that simply states "Sorry, this repo seems to be temporarily unavailable, for more details check the status of NFS server XX on monitor.gitlab.net/{relevant-deeplink-that-was-automatically-prepopulated}"
We can show an error, but not a GitLab.com specific one, since GitLab doesn't (and shouldn't) know anything about NFS, or monitor.gitlab.net.
This is technically doable, but harder than it may seem because tons and tons of pages talk to the Git repo at some point. But we can of course track those down and add proper error handling.
Awesome! I think that would be the way to start. GitLab would have to be aware enough to know which server the repo is on... or maybe not even that, for a first pass?
Why shouldn't GitLab know anything about NFS? It would be great to have more insight into what causes load on NFS, and two way visibility would help with that... I assume.
@mydigitalself without going into too much detail, what happens is that a user hits a page, the page creates a Rugged object to read something from a repo, that object blocks waiting for the NFS storage timeout while trying to load the repo's refs file, and the whole unicorn process is locked.
Multiply that by multiple users doing something similar in repos on the same storage, and in little time all the unicorn processes are locked waiting on a really long timeout ==> we have no processes left, Nginx starts returning 502 errors because everything is timing out, and GitLab.com is down.
Consider that we access git from many, many, many, many places in the app; just think of people checking MRs or browsing commits, it doesn't take much to have a couple of hundred people doing this.
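Roughly, the blocking path looks like this; a minimal sketch, assuming a made-up controller helper rather than the actual GitLab code:

```ruby
# Illustrative only: shows how a single Rugged call can pin a unicorn worker.
require 'rugged'

def show_project(repo_path)
  # The constructor and any ref lookup read files under the repo on disk
  # (HEAD, refs/, packed-refs). If that path sits on a hung NFS mount, the
  # read blocks until the NFS timeout expires, and the worker blocks with it.
  repo = Rugged::Repository.new(repo_path)
  repo.head.target_id
end
```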
Is there a way to get 90% of the repositories back within a few minutes? I think for this we would need to prevent calls to the affected NFS server. Can we maybe do that by simply not mounting the affected server?
The solution is to use Gitaly completely and deprecate using NFS on the app servers.
I think we all agree on that, but there are a few caveats:
At our current pace of development, it could easily be 6-12 months before all git controllers can be handled by Gitaly (@andrewn feel free to weigh in on that rough estimate)
@andrewn can you add a design diagram there of what the world will look like "post-NFS"?
It's not yet clear how GitLab.com behaves when Gitaly goes down; this is being discussed and investigated in gitaly#200. So it is quite possible that even in the "post-NFS" world, a Gitaly server going down would still lead to GitLab.com going down.
So based on this, I think we need to build graceful degradation in the application regardless of whether git repos are held by NFS or Gitaly.
Certainly to reach availability of GitLab.com >= 99.9% within the next quarter, we'll need to focus on the degradation side.
This is technically doable, but harder than it may seem because tons and tons of pages talk to the Git repo at some point. But we can of course track those down and add proper error handling.
Yeah, I'm sure we'll find lots of places where we didn't expect Git repo access, but I think we'll need to figure this out in any case to transition fully to Gitaly.
Is there a way to get 90% of the repositories back within a few minutes? I think for this we would need to prevent calls to the affected NFS server. Can we maybe do that by simply not mounting the affected server?
I don't think there's a simple way of getting a close-enough clone of the filesystem that went down in a timely manner; just building a new machine and bootstrapping it with Chef already takes about as long as the TTR of a single host going down.
Regarding not mounting the affected server: I think that could be an interesting test to run. Maybe we could have a spare small-ish NFS server available to swap in when a server crashes; then, when we recover, we would need to sync back the files that were added to it.
I'm not sure how well that would work, so we should test how the application behaves, because I do wonder how badly it would fail depending on which files are missing. Maybe we could try this in staging and check, and if it makes sense, script it so we reduce the TTR for regular files that way.
maybe we can improve the MTTR by manually switching to a degraded service.
Yes, we could trigger the degraded service manually indeed, and it makes a lot of sense, we just need to define what a degraded mode is.
As long as we're still relying on NFS, graceful degradation in Gitaly is meaningless, because the whole server is still out of service. This won't change until NFS is totally removed.
When NFS is gone, Gitaly will simply return an error if a git call is made from Rails and the file server is down. The problem of graceful degradation is really about how the Rails developers handle that error.
@pcarranza I believe that rebooting the server takes 45 minutes. I think that the storage of the NFS server is separate (Azure storage disks). Instead of rebooting the server, can we just attach the disks to a standby server that we can quickly set up with the right configuration?
The ideal route to solving this problem is using Gitaly
Gitaly will solve this problem when all git traffic is routed to Gitaly
Because gRPC supports timeouts and deadlines, and because of Golang's green-threading, I am confident that it will be easy to deal with this issue in Gitaly, even before we have completed testing https://gitlab.com/gitlab-org/gitaly/issues/200
However, as long as some operations are going over NFS, Gitaly cannot be relied on as the complete solution
What can we do until Gitaly can solve this problem? I have several proposals which could be used independently or together.
As @pcarranza mentioned, the problem with NFS is unicorn workers queuing up while trying to perform reads from the faulty NFS volume.
This is different but somewhat similar to a big problem we had early on with Gitter: since we rely so heavily on the GitHub API, Gitter's availability was dependent on GitHub's availability. With effort, we've made it so that GitHub outages now have almost no effect on Gitter.
Gates: if we know that a particular NFS mount is down, we should throw an exception immediately rather than trying to read from the volume.
The idea: keep a list of known-bad NFS volumes. Store this in Redis or somewhere. Allow admins to toggle a storage volume as down.
Gitlab::Git objects take a repository path in their constructor. Wrap these objects in a Gate that checks the repository location against the known-bad volumes. If the repo matches, throw an exception immediately, so that the unicorn worker doesn't get hung up waiting for NFS timeouts.
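A minimal sketch of what such a Gate could look like; the Redis key, class name, and wiring are made up for illustration, not the actual implementation:

```ruby
# Illustrative sketch of the Gate idea: fail fast when a repo lives on a
# storage volume that an admin has flagged as down.
require 'redis'

class StorageGate
  StorageUnavailable = Class.new(StandardError)

  def self.check!(repo_path)
    # 'known_bad_storages' is a hypothetical Redis set maintained by admins.
    bad_volumes = Redis.new.smembers('known_bad_storages')
    return unless bad_volumes.any? { |volume| repo_path.start_with?(volume) }

    # Raise immediately so the unicorn worker never touches the hung NFS mount.
    raise StorageUnavailable, "storage backing #{repo_path} is marked as down"
  end
end

# Called before constructing a Gitlab::Git object, for example:
#   StorageGate.check!(repository_path)
```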
Timeouts:
The idea: wrap all the Gitlab::Git objects with timeouts
This can be done in the same way we dynamically wrap the Gitlab::Git objects with timing instrumentation
For the web interface (rails) this can be "relatively" short, say 20 or 30 seconds
When an NFS volume fails, web requests will time out after that period instead of getting hung up waiting on NFS
Still not ideal, but at least this gives us time to mark the NFS volume as bad using the Gates technique above (for example).
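As a rough sketch of that wrapping idea (the delegation below is illustrative, not the real instrumentation code, and note that Ruby's Timeout cannot always interrupt reads that block inside C extensions):

```ruby
require 'timeout'

# Illustrative wrapper: delegate every call to the underlying Gitlab::Git
# object, but give up after a fixed period instead of hanging the worker.
class TimedGitAccess
  GIT_TIMEOUT = 30 # seconds; "relatively" short for web requests

  def initialize(git_object)
    @git_object = git_object
  end

  def method_missing(name, *args, &block)
    # Raises Timeout::Error rather than waiting on a stuck NFS read forever.
    Timeout.timeout(GIT_TIMEOUT) { @git_object.public_send(name, *args, &block) }
  end

  def respond_to_missing?(name, include_private = false)
    @git_object.respond_to?(name, include_private) || super
  end
end
```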
Stale Caching: Gitter is heavily dependent on caching, for speed but also for availability. Unlike many caching strategies, we use stale data if the backend is unavailable
Hot cache items are served immediately
On a cache miss, we attempt to fetch from the source, but if that operation fails, we'll use whatever stale data we have cached, even if it's out of date (would you prefer a slightly out-of-date page, or an error page?)
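Something along these lines; a toy sketch where the in-memory hash stands in for a real cache store such as Redis:

```ruby
# Illustrative stale-cache helper: serve fresh data when we can, fall back to
# stale data when the backend (e.g. a git call) is unavailable.
STALE_CACHE = {}

def fetch_with_stale_fallback(key, fresh_ttl: 60)
  entry = STALE_CACHE[key]
  return entry[:value] if entry && Time.now - entry[:written_at] < fresh_ttl

  begin
    value = yield # hit the backend
    STALE_CACHE[key] = { value: value, written_at: Time.now }
    value
  rescue StandardError
    raise unless entry
    entry[:value] # backend down: slightly out-of-date beats an error page
  end
end

# Usage: fetch_with_stale_fallback("project:readme") { read_readme_from_repo }
```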
Complexity goes beyond just respawning a host with the drives from another one. Let's unfurl all the steps that would be required to get here:
We stop the machine that crashed so Azure does not try to boot it up (this will take some minutes too; I've seen it take up to 10 at least once).
We detach all the drives from this crashed machine.
We spin up a new host with the OS and all the available drives using Terraform.
We wait for this server to perform drive checks, because the drives will be marked as not properly shut down.
We change the NFS mountpoint in all the fleet servers, force an umount, maybe kill the processes that are in D state.
We remount the NFS drive across the whole fleet.
We restart all the services across the whole fleet.
I think that this, with a lot of good automation and all the moving parts working perfectly, would leave us in the same ballpark of MTTR; maybe if everything goes fine we could shave 5 minutes off it.
But this sort of low-level operation under high stress adds a lot of risk, because it all has to happen while we are taking downtime. The chance of something going wrong, and the server not recovering because of all the drive detaching and attaching, increases disproportionately.
Honestly, trying to solve this at the host level by rebuilding a critical host under crazy pressure while taking downtime is the wrong solution. Hosts will go down; the application has to survive it.
I think we need to move in the direction of stability patterns like the ones @andrewn talks about: the application has to survive a file host going down and error out cleanly and quickly, instead of locking in an endless wait.
Been thinking about this and have a potential short-term work-around that would be easy to implement:
On each worker that mounts NFS volumes, the NFS mounts are located at /var/opt/gitlab/...
What we do:
mount a readonly tmpfs volume at /var/opt/gitlab/ and then mount the NFS volumes under this directory, as before
We run a cronjob every 30 seconds that:
For each mount_point, the cronjob attempts to touch a file /var/opt/gitlab/${mount_point}/.monitor/${client_host_name} and read the contents of the file.
This operation has a defined timeout (1 minute? 10 seconds? I don't know); if the timeout is exceeded, the NFS volume is in trouble and needs to be unmounted immediately, before the unicorn worker processes start queuing up:
# touch_and_write_file and send_volume_alert_to_ops stand in for the real check and alert scripts
for dir in /var/opt/gitlab/*; do (timeout 30 touch_and_write_file "${dir}") || (send_volume_alert_to_ops "${dir}"; umount -f "${dir}"); done
When the volume is unmounted, any application code that attempts to read or write to the bad volume will error immediately without hanging. Obviously this will result in more 500's on certain repos but that's better than an outage.
When the volume unmounts, any writes will go to the read-only tmpfs volume at /var/opt/gitlab/ (and will error) and reads will hit non-existent files (and will also error).
This might seem pretty heavy-handed, but it's a quick workaround that doesn't require changes to application logic, and it might be worth experimenting with.
@andrewn that's an interesting proposal, I think that the tmpfs variant makes it simpler. But I think that we should be triggering this manually (at least initially).
@ernstvn the problem with a host going down is that it is out of our control, so it all falls on Azure's side to investigate why it takes so long to boot up.
I like @andrewn's idea of using a read-only tmpfs. I think we would have to solve gitlab-org/gitlab-ce#33117 for this to work, otherwise repositories will go missing until we manually expire the cache. Once we have this, I think the next step is try to set up a separate GitLab instance and swap out an NFS mount and see exactly how things break.
@pcarranza totally agree that we should be triggering it manually, at least until we know the right parameters (timeouts etc.) for doing this automatically.
Having said that, I think we should run the cron that polls the filesystems from the outset, having it send an alert to PagerDuty, so that we can get a good idea of exactly when an unmount would occur if it were automated.
I'd expect to see an error message in that part of the screen where I would have seen information from / about the git repo, that simply states "Sorry, this repo seems to be temporarily unavailable, for more details check the status of NFS server XX on monitor.gitlab.net/{relevant-deeplink-that-was-automatically-prepopulated}"
It sounds as though this is still being worked out @ernstvn. Is the UX described above still needed? I want to make sure UX has time to get eyes on this if needed.
@sarrahvesselov I think we do still need a form of graceful degradation, and it will probably require a UX component. But I'd like to ask @stanhu to weigh in on technical priorities and @victorwu (right? as this pertains mostly to rendering of "discussion" items?) for prioritization from the product perspective.
At our current pace of development, it could easily be 6-12 months before all git controllers can be handled by Gitaly (@andrewn feel free to weigh in on that rough estimate)
Per the Gitaly team OKR, they will aim to tackle 24 endpoint migrations per quarter. Per the original sheet that was used to determine the order of priority of migrations, there are 211 in total, and 63 with a p99 timing (pre-Gitaly) greater than 800 ms. Moving fully to Gitaly requires resolving all 211 endpoints, which at the pace of 24 per quarter will take roughly 9 quarters.
@andrewn the only concern I have with using tmpfs is that we will be accepting writes into this drive, and when the original NFS drive recovers we will need to get things back in sync, which in the tmpfs case means syncing from every host that mounts its own tmpfs. I think a better approach would be to replace one NFS mountpoint with another NFS mount; that way we can sync data back from a single place.
I'll make the assumption that 1 is the most important and @DouweM correct me if I'm wrong, but that's effectively what @reprazent said he was working on in !11449 (merged). In this case, whilst it's still unhelpful to a user who experiences this and receives an error, it's helpful to the wider server health and availability.
We can certainly then look at 2 at a later date, but right now we're just completely swamped and under-resourced with all of the licensing work we have going on.
If 1 is happening right now, can we then agree that we can push this issue out into a future milestone?
Note this is also related to the conversation we had regarding https://gitlab.com/gitlab-com/infrastructure/issues/1943, where we're struggling with how to prioritise these issues. /me thinks it's time for a conversation on this in real time so we can get to a better place on how we can help each other with production performance & availability issues. Spinning 10-minute cycles asynchronously on the topic doesn't seem to be moving us forward.
Once it is confirmed that !11449 (merged) takes care of the first step, I'll gladly spawn a separate new issue for the UX side of things, and close this issue.
So instead of having nginx fail to find a unicorn, we'd have Rails render a 503 and try to avoid accessing the misbehaving FS. That means that instead of GitLab becoming unavailable, only the pages that touch something on the misbehaving FS become unavailable.
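For illustration, the controller-side handling could look roughly like this; the rescued error class reuses the hypothetical StorageGate sketch from earlier in this thread, and this is not the actual code from !11449:

```ruby
# Illustrative: turn a fail-fast storage error into a 503 for just the
# affected pages instead of letting the whole worker pool lock up.
class ApplicationController < ActionController::Base
  rescue_from StorageGate::StorageUnavailable do |_exception|
    render plain: 'The git storage for this project is temporarily unavailable.',
           status: :service_unavailable # 503
  end
end
```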
Once it is confirmed that !11449 (merged) takes care of the first step, I'll gladly spawn a separate new issue for the UX side of things, and close this issue
From this comment @dimitrieh, it looks as though there is no UX at issue here. Once the pending merge request is taken care of, this will be closed and a UX issue opened. Thanks for checking in on this!