Circuitbreaker to avoid and monitor access to stale NFS mounts
When an NFS mount becomes inaccessible, the `exists?` cache is incorrectly set when trying to access a repository on that mount, causing the repository to appear missing, as described in https://gitlab.com/gitlab-com/infrastructure/issues/1775.
To avoid this, we can wrap the repository NFS calls in a `CircuitBreaker` as suggested by @pcarranza.
If the repository call fails because of a `Rugged::OSError`, we will check whether the `repository_storage_path` is available using `Pathname#realpath`. If it is, we will just re-raise the exception, since it could simply mean the repository is missing, which is acceptable.
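A minimal sketch of what that wrapper could look like; the `CircuitBreaker` class shape, the `storage_path` argument, and the `perform` method are illustrative assumptions, not existing GitLab code:

```ruby
require 'pathname'
require 'rugged' # for Rugged::OSError; Rugged is GitLab's libgit2 binding

class CircuitBreaker
  def initialize(storage_path)
    @storage_path = storage_path
  end

  # Wraps a repository call so a stale mount can be told apart from a
  # repository that is genuinely missing.
  #
  # Usage (hypothetical): CircuitBreaker.new(storage_path).perform { repo.exists? }
  def perform
    yield
  rescue Rugged::OSError
    # Pathname#realpath resolves the path on disk, so it raises
    # Errno::EIO (or hangs) when the NFS mount is stale. An Errno::EIO
    # raised here propagates to the caller, which records the failure.
    Pathname.new(@storage_path).realpath

    # The storage itself is reachable, so the repository is probably
    # just missing; re-raise the original error as-is.
    raise
  end
end
```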
If the check using `Pathname#realpath` raises an `Errno::EIO`, we will store the time of this exception and increment a failure counter in Redis, then raise the original exception wrapped in a `StorageNotAvailable` exception.
The exception information stored in Redis needs to be specific to a host and a storage path, because an NFS mount not responding on one host does not mean it won't respond on another host.
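A hedged sketch of how the failure could be recorded, assuming a hypothetical `StorageFailureTracker` class; the key layout, the field names, and the direct use of redis-rb are illustrative:

```ruby
require 'redis'
require 'socket'

StorageNotAvailable = Class.new(StandardError)

class StorageFailureTracker
  def initialize(storage_path, redis: Redis.new)
    @storage_path = storage_path
    @redis = redis
  end

  # The key includes the hostname, because a stale mount on one host
  # says nothing about the same storage mounted on another host.
  def key
    "storage_failures:#{Socket.gethostname}:#{@storage_path}"
  end

  # Record the time of the failure, bump the counter, and surface the
  # problem as a StorageNotAvailable error.
  def track_failure!(original_error)
    @redis.multi do |multi|
      multi.hset(key, 'last_failure', Time.now.to_i)
      multi.hincrby(key, 'failure_count', 1)
    end

    raise StorageNotAvailable, original_error.message
  end
end
```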
If a second request comes in but the last failure was less than 5 seconds ago, we will not retry accessing the storage, since calls to the storage take a long time when the NFS is not responding. This avoids clogging up the web workers.
If there are more than 10 consecutive failures for the same storage on the same host, the request will be blocked, since something is probably wrong with the storage in that case, and we raise a different exception.
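The two guard conditions could look roughly like this, reusing the hypothetical tracker from above; the 5-second window and the 10-failure threshold come directly from this proposal:

```ruby
CircuitOpen = Class.new(StandardError)

class StorageFailureTracker
  RETRY_WAIT   = 5  # seconds before the storage may be touched again
  MAX_FAILURES = 10 # consecutive failures before requests are blocked

  def check!
    last_failure, failure_count = @redis.hmget(key, 'last_failure', 'failure_count')

    if failure_count.to_i > MAX_FAILURES
      # Too many consecutive failures: the storage is probably broken,
      # so block the request with a distinct exception.
      raise CircuitOpen, "#{@storage_path} failed #{failure_count} times"
    end

    if last_failure && Time.now.to_i - last_failure.to_i < RETRY_WAIT
      # The storage failed moments ago; skip the slow NFS call so the
      # web workers don't get clogged up waiting for it.
      raise StorageNotAvailable, "#{@storage_path} failed recently"
    end
  end
end
```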
There could be a view in the admin panel showing the number of issues with a certain storage, and a button to clear the stored exceptions so the requests are executed again.
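The clear button could then simply delete the stored keys for that storage across all hosts, again assuming the key layout sketched above:

```ruby
class StorageFailureTracker
  # Clear the stored exceptions for a storage on every host, so
  # requests against it are executed again.
  def self.reset!(storage_path, redis: Redis.new)
    redis.scan_each(match: "storage_failures:*:#{storage_path}") do |key|
      redis.del(key)
    end
  end
end
```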
In addition, we need to make sure the Ruby app starts even when a storage is not available. Startup is currently blocked in the `06_validations` initializer by the same `Errno::EIO` exception.
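One way to achieve that is to rescue the exception in the initializer instead of letting it abort boot; the validation body below is assumed for illustration, not taken from the actual initializer:

```ruby
# config/initializers/06_validations.rb (sketch)
def validate_storage_path!(name, path)
  Pathname.new(path).realpath # raises Errno::EIO on a stale mount
rescue Errno::EIO => e
  # Log and keep booting; the circuit breaker guards runtime access.
  Rails.logger.warn("Storage #{name} (#{path}) is unavailable: #{e.message}")
end
```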