Circuitbreaker to avoid and monitor access to stale NFS mounts
When an NFS mount becomes inaccessible, the `exists?` cache is incorrectly set when trying to access a repository on that mount, causing the repository to appear missing, as described in https://gitlab.com/gitlab-com/infrastructure/issues/1775.
To avoid this, we can wrap the repository NFS calls in a `CircuitBreaker` as suggested by @pcarranza.
If the repository call fails because of a `Rugged::OSError`, we will check whether the `repository_storage_path` is available using `Pathname#realpath`. If it is, we will just re-raise the exception, since it could simply mean the repository is missing, which is acceptable.
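A minimal sketch of what that wrapper could look like; the `CircuitBreaker` class shape, the `storage_path` argument, and the `perform` method are illustrative assumptions, not existing GitLab code:

```ruby
require 'pathname'
require 'rugged' # for Rugged::OSError; Rugged is GitLab's libgit2 binding

class CircuitBreaker
  def initialize(storage_path)
    @storage_path = storage_path
  end

  # Wraps a repository call so a stale mount can be told apart from a
  # repository that is genuinely missing.
  #
  # Usage (hypothetical): CircuitBreaker.new(storage_path).perform { repo.exists? }
  def perform
    yield
  rescue Rugged::OSError
    # Pathname#realpath resolves the path on disk, so it raises
    # Errno::EIO (or hangs) when the NFS mount is stale. An Errno::EIO
    # raised here propagates to the caller, which records the failure.
    Pathname.new(@storage_path).realpath

    # The storage itself is reachable, so the repository is probably
    # just missing; re-raise the original error as-is.
    raise
  end
end
```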
If the check using `Pathname#realpath` raises an `Errno::EIO`, we will store the time of this exception and increment a failure counter in Redis, then raise the original exception wrapped in a `StorageNotAvailable` exception.
The exception information stored in Redis needs to be specific to a host and a storage path, because an NFS mount not responding on one host does not mean it won't respond on another host.
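A hedged sketch of how the failure could be recorded, assuming a hypothetical `StorageFailureTracker` class; the key layout, the field names, and the direct use of redis-rb are illustrative:

```ruby
require 'redis'
require 'socket'

StorageNotAvailable = Class.new(StandardError)

class StorageFailureTracker
  def initialize(storage_path, redis: Redis.new)
    @storage_path = storage_path
    @redis = redis
  end

  # The key includes the hostname, because a stale mount on one host
  # says nothing about the same storage mounted on another host.
  def key
    "storage_failures:#{Socket.gethostname}:#{@storage_path}"
  end

  # Record the time of the failure, bump the counter, and surface the
  # problem as a StorageNotAvailable error.
  def track_failure!(original_error)
    @redis.multi do |multi|
      multi.hset(key, 'last_failure', Time.now.to_i)
      multi.hincrby(key, 'failure_count', 1)
    end

    raise StorageNotAvailable, original_error.message
  end
end
```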
If a second request comes in but the last failure was less than 5 seconds ago, we will not retry accessing the storage, since calls to the storage take a long time when the NFS is not responding. This avoids clogging up the web workers.
If there are more than 10 consecutive failures for the same storage on the same host, the request will be blocked, since something is probably wrong with the storage in that case, and we raise a different exception.
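The two guard conditions could look roughly like this, reusing the hypothetical tracker from above; the 5-second window and the 10-failure threshold come directly from this proposal:

```ruby
CircuitOpen = Class.new(StandardError)

class StorageFailureTracker
  RETRY_WAIT   = 5  # seconds before the storage may be touched again
  MAX_FAILURES = 10 # consecutive failures before requests are blocked

  def check!
    last_failure, failure_count = @redis.hmget(key, 'last_failure', 'failure_count')

    if failure_count.to_i > MAX_FAILURES
      # Too many consecutive failures: the storage is probably broken,
      # so block the request with a distinct exception.
      raise CircuitOpen, "#{@storage_path} failed #{failure_count} times"
    end

    if last_failure && Time.now.to_i - last_failure.to_i < RETRY_WAIT
      # The storage failed moments ago; skip the slow NFS call so the
      # web workers don't get clogged up waiting for it.
      raise StorageNotAvailable, "#{@storage_path} failed recently"
    end
  end
end
```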
There could be a view in the admin panel showing the number of issues with a certain storage, and a button to clear the stored exceptions so the requests are executed again.
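The clear button could then simply delete the stored keys for that storage across all hosts, again assuming the key layout sketched above:

```ruby
class StorageFailureTracker
  # Clear the stored exceptions for a storage on every host, so
  # requests against it are executed again.
  def self.reset!(storage_path, redis: Redis.new)
    redis.scan_each(match: "storage_failures:*:#{storage_path}") do |key|
      redis.del(key)
    end
  end
end
```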
In addition, we need to make sure the Ruby app starts even when a storage is not available. Startup is currently blocked in the `06_validations` initializer by the same `Errno::EIO` exception.
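One way to achieve that is to rescue the exception in the initializer instead of letting it abort boot; the validation body below is assumed for illustration, not taken from the actual initializer:

```ruby
# config/initializers/06_validations.rb (sketch)
def validate_storage_path!(name, path)
  Pathname.new(path).realpath # raises Errno::EIO on a stale mount
rescue Errno::EIO => e
  # Log and keep booting; the circuit breaker guards runtime access.
  Rails.logger.warn("Storage #{name} (#{path}) is unavailable: #{e.message}")
end
```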