@stanhu said he expired the cache and the project became available. This mimics previous times when we've had NFS mount issues: an access request was issued against an "unavailable" mount and then got cached as "non-existent".
@jtevnan where do we stand with monitoring whether all the NFS mounts for a system are viable, and alerting if they're not?
It only monitors LFS shares for the moment, as that's where the mounts were causing under-writes. It can be adapted to monitor more shares. It's still a manual install, running from the crontab belonging to my user account on each worker.
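For reference, the check is roughly of this shape; the sketch below is illustrative only, with assumed mount paths and an assumed 5-second budget, not the actual script:

```sh
#!/bin/sh
# Illustrative sketch of a mount-viability check, not the actual cron job.
# Mount paths and the 5-second budget are assumptions.
for mnt in /var/opt/gitlab/git-data-file03 /var/opt/gitlab/gitlab-rails/shared/lfs-objects; do
  if ! timeout 5 stat -t "$mnt" > /dev/null 2>&1; then
    echo "NFS mount $mnt is not responding" | logger -t nfs-mount-check
  fi
done
```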
So the gist of that blog post is that the NFS timeo should not be smaller than the TCP timeout, and we should rely on the TCP protocol to retransmit packets rather than the NFS approach of resending the whole RPC call.
I see the wisdom there, and it may be the correct approach in our fragile network environment. It is still my opinion, though, that we should not keep the default of 60 seconds to time out: for that full minute the application will keep accepting connections from the outside, filling up application threads (each of which in turn creates an RPC call that takes 60 seconds to time out), rather than failing faster.
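To make the trade-off concrete, the two ends of that spectrum look roughly like this as mount options. These are illustrative values with a placeholder server/export, not our actual fstab entries (timeo is in tenths of a second, so timeo=600 is the 60-second default being discussed):

```
# Illustrative only, not our actual fstab.
# Default-ish: hard mount over TCP, 60-second major timeout, TCP handles retransmission.
nfs-server:/git-data  /var/opt/gitlab/git-data-file03  nfs  rw,hard,proto=tcp,timeo=600,retrans=2  0 0

# Fail-faster variant being argued for: shorter timeo so application threads aren't tied up for a full minute.
nfs-server:/git-data  /var/opt/gitlab/git-data-file03  nfs  rw,hard,proto=tcp,timeo=100,retrans=2  0 0
```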
@jtevnan If the choice is between connections piling up and repos being marked empty, I'd lean towards the former. I think it would be worth testing this on staging to see what happens: the NFS server could have a temporary rule enabled to block NFS traffic, and then a user could try to access a repo.
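Something along these lines on the staging NFS server should be enough to simulate the mount going away (illustrative commands; tighten them to staging source IPs as needed, and remove the rules afterwards):

```sh
# Drop NFS (2049) and portmapper (111) traffic to simulate an unreachable mount; illustrative only.
iptables -I INPUT -p tcp --dport 2049 -j DROP
iptables -I INPUT -p tcp --dport 111  -j DROP

# Undo afterwards:
iptables -D INPUT -p tcp --dport 2049 -j DROP
iptables -D INPUT -p tcp --dport 111  -j DROP
```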
One of the things we can do to improve the situation is to adjust the TCP keepalive settings in the kernel. This helps with sockets between hosts that should normally have low latency to each other.
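For reference, these are the kernel knobs involved; the values below are just an example of a more aggressive keepalive policy, not something we've agreed on (the kernel defaults are 7200/75/9):

```sh
# Example of tighter TCP keepalives; illustrative values only.
sysctl -w net.ipv4.tcp_keepalive_time=60     # seconds of idle before the first probe
sysctl -w net.ipv4.tcp_keepalive_intvl=10    # seconds between probes
sysctl -w net.ipv4.tcp_keepalive_probes=5    # failed probes before the connection is considered dead
```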
Another thing: we should possibly fix the Git client so that it can tell the difference between a broken connection and an empty repo. Maybe this is a Gitaly thing?
@bjk-gitlab Since the plan is for Gitaly to run on the NFS servers, there will not be NFS timeouts, but there will be gRPC timeouts. What code generates the message "The repository for this project does not exist."? We should keep this issue in mind when migrating that code.
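To illustrate the concern (a sketch assuming a Gitaly client with a repository-exists RPC, not the actual code): only an explicit negative answer should ever end up cached as "repo does not exist"; a transport-level failure must not be treated the same way.

```ruby
require 'grpc'

# Sketch only; `client` and `request` stand for an assumed Gitaly repository-exists RPC.
def repository_exists?(client, request)
  client.repository_exists(request).exists
rescue GRPC::DeadlineExceeded, GRPC::Unavailable
  # Infrastructure problem, not a missing repository:
  # propagate the error instead of caching a negative result.
  raise
end
```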
All with the same outcome: after the initial 502, we started throwing 500s, and after the port was unblocked, the repo was fine. Sometimes a restart of Unicorn was needed, but in general there was no problem that I could track.
I also tried e.g. project/import on a project that I knew didn't exist yet on the NFS server. Nothing was able to reproduce the error with timeouts.
It seems that I am unable to trigger the correct functions that should reset the cache for a repo. @stanhu, can someone give us a hand in testing this?
I suspect what's happening here is that a push happens, we expire the caches in certain situations (e.g. when it's time to garbage collect), and if the repo doesn't happen to be there when we next look, it gets cached as non-existent.
This work should be fairly orthogonal to the changes in the caching strategy, although once it's been ported I would suggest we benchmark Redis caching of the exists? method vs. simple memoization + Gitaly backend. For such a straightforward request (one fstat), the overhead of Redis may not be worthwhile.
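To make the comparison concrete, these are roughly the two shapes I'd benchmark; the class and method names below are illustrative, not the current Repository implementation:

```ruby
# Illustrative sketch of the two caching strategies for the exists? check.
class Repository
  def initialize(path)
    @path = path
  end

  # Variant 1: cross-process cache in Redis via Rails.cache (one Redis round-trip per call).
  def exists_with_redis_cache?
    Rails.cache.fetch("repository/#{@path}/exists") { raw_repository_exists? }
  end

  # Variant 2: simple per-instance memoization in front of the backend check (no Redis round-trip).
  def exists_with_memoization?
    return @exists unless @exists.nil?

    @exists = raw_repository_exists?
  end

  private

  # Stand-in for the real backend check: one fstat today, a Gitaly call after the port.
  def raw_repository_exists?
    File.directory?(@path)
  end
end
```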
I wonder if we should expire the cache only during a few specific operations on the repository, instead of making it possible to expire the cache on each request.
@grzesiek and I managed to reproduce this in staging:
we dropped the NFS timeout to 1 second (timeo=10) and rebooted web01
we confirmed we were both able to see https://staging.gitlab.com/gitlab-org/gitlab-ce
we added a drop rule on the NFS server for ports 111 and 2049 (as before)
refreshed https://staging.gitlab.com/gitlab-org/gitlab-ce and almost immediately received a 500 error
verified the Gitlab::Git::Repository::NoRepository error in Sentry (see below)
this exception triggered the expiry of the repo_exists? key
trace:
Rugged::OSError: Failed to resolve path '/var/opt/gitlab/git-data-file03/repositories/gitlab-org/gitlab-ce.git': Input/output error
  from lib/gitlab/git/repository.rb:65:in `new'
  from lib/gitlab/git/repository.rb:65:in `rugged'
  from app/models/repository.rb:450:in `method_missing'
  from app/models/repository.rb:217:in `ref_exists?'
  from app/models/merge_request.rb:804:in `ref_fetched?'
  from app/models/merge_request.rb:808:in `ensure_ref_fetched'
  from app/controllers/projects/merge_requests_controller.rb:702:in `ensure_ref_fetched'
@grzesiek I'm probably misunderstanding this, but... is your suggestion not to expire the cache on each request (or call to exists?), so that if the repo is intermittently unavailable, we still report that it's there?
If that's the reasoning, I wonder if we should cache that... so for X time it reports the latest status. This would prevent hitting the disk so often for this check, and better differentiate intermittent NFS errors from genuinely missing repos.
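Roughly what I mean, as a sketch; the key name, helper names, and the 1-minute TTL are placeholders:

```ruby
# Sketch: serve the last known answer for up to a minute instead of re-checking
# (and possibly re-caching a bogus "does not exist") on every request.
def cached_exists?
  Rails.cache.fetch("repository/#{full_path}/exists", expires_in: 1.minute) do
    raw_repository.exists?
  end
end
```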
@stanhu @grzesiek Looks like a related issue here... exists? is overloaded. There is at least a four-state logic instead of a binary state: (1) does not exist, (2) created and present, (3) created and not present (AWOL), (4) does not "exist", yet is present (other root cause).
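Purely as an illustration of those four states (none of these helpers exist today), something like this would make the distinction explicit instead of collapsing it into a boolean:

```ruby
# Illustrative only: separate "do we expect a repo" from "can we see it right now".
# repository_created? and repository_on_disk? are hypothetical helpers.
def repository_state(project)
  created = project.repository_created?
  on_disk = project.repository_on_disk?

  if created && on_disk
    :present      # (2) created and present
  elsif created
    :awol         # (3) created but not reachable, e.g. NFS trouble
  elsif on_disk
    :untracked    # (4) not "created", yet present on disk (other root cause)
  else
    :absent       # (1) does not exist
  end
end
```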
We have Project#repo_exists?, Project#repository_exists?, Project#repository, Project#repo, etc. Some methods catch exceptions, some don't; some invalidate the cache, some do not. We need a tech debt issue about it.
@grzesiek The change has the expected effect and might be a solution to the problem; however, I am a bit worried about the Unicorn process returning 500s after the test.
@grzesiek Yes, that's correct: it would fix the incorrect exists? cache, and it tries to prevent filesystem access to the failing system in order to keep the Unicorns happy.
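For anyone following along, the rough shape of that approach is a circuit breaker in front of storage access. The sketch below is a simplification with placeholder thresholds, not the actual code in the MR:

```ruby
# Simplified circuit-breaker sketch (placeholder thresholds, not the actual MR):
# after a few consecutive failures on a storage shard, stop touching its filesystem
# for a while so Unicorn workers fail fast instead of blocking on a dead NFS mount.
class StorageCircuitBreaker
  FAILURE_THRESHOLD = 3
  COOL_OFF_SECONDS = 60

  def initialize
    @failures = Hash.new(0)
    @tripped_at = {}
  end

  def perform(storage)
    raise "circuit open for storage #{storage}" if open?(storage)

    result = yield
    @failures[storage] = 0 # a success resets the consecutive-failure count
    result
  rescue Errno::EIO, Errno::ESTALE, Errno::ETIMEDOUT
    @failures[storage] += 1
    @tripped_at[storage] = Time.now if @failures[storage] >= FAILURE_THRESHOLD
    raise
  end

  private

  def open?(storage)
    tripped = @tripped_at[storage]
    return false unless tripped
    return true if Time.now - tripped < COOL_OFF_SECONDS

    # Cool-off period elapsed: reset and try the filesystem again.
    @failures[storage] = 0
    @tripped_at.delete(storage)
    false
  end
end
```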
@jtevnan @grzesiek @reprazent From a quick read here, my understanding is that this is something that needs a resolution on the application side rather than increasing infrastructure limits to an unreasonable setting.
If this is so, can we just close this issue and point to the right one that will fix the application behavior? If not, could someone point out what the next action from the production side is?
I'm asking because this issue has been dragging on for 3 weeks already, and it doesn't look like it will be resolved for at least another 3 weeks, which is when we will release the next version.
@pcarranza @grzesiek gitlab-org/gitlab-ce!11449 is being worked on by @reprazent, but it is currently not his top priority and likely will not be until the end of the 9.4 development month (July 7th). After that, I expect him to get back to it.