@stanhu said he expired the cache and the project became available. This mimics previous times when we've had NFS mount issues: an access request was issued against an "unavailable" mount and then got cached as "non-existent".
@jtevnan where do we stand with monitoring whether all the NFS mounts for a system are viable, and alerting if they're not?
It only monitors LFS shares for the moment, as that's where the mounts were causing under-writes. It can be adapted to monitor more shares. It's still a manual install, running from the crontab belonging to my user account on each worker.
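For reference, the check is roughly of this shape; the sketch below is illustrative only, with assumed mount paths and an assumed 5-second budget, not the actual script:

```sh
#!/bin/sh
# Illustrative sketch of a mount-viability check, not the actual cron job.
# Mount paths and the 5-second budget are assumptions.
for mnt in /var/opt/gitlab/git-data-file03 /var/opt/gitlab/gitlab-rails/shared/lfs-objects; do
  if ! timeout 5 stat -t "$mnt" > /dev/null 2>&1; then
    echo "NFS mount $mnt is not responding" | logger -t nfs-mount-check
  fi
done
```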
So the gist of that blog post is that the NFS timeo should not be smaller than the TCP timeout, and we should rely on the TCP protocol to retransmit packets rather than the NFS approach of resending the whole RPC call.
I see the wisdom there, and it may be the correct approach in our fragile network environment. It is still my opinion, though, that we should not keep the default of 60 seconds to time out: for that full minute the application will keep accepting connections from the outside, filling up application threads (each of which in turn creates an RPC call that takes 60 seconds to time out), rather than failing faster.
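To make the trade-off concrete, the two ends of that spectrum look roughly like this as mount options. These are illustrative values with a placeholder server/export, not our actual fstab entries (timeo is in tenths of a second, so timeo=600 is the 60-second default being discussed):

```
# Illustrative only, not our actual fstab.
# Default-ish: hard mount over TCP, 60-second major timeout, TCP handles retransmission.
nfs-server:/git-data  /var/opt/gitlab/git-data-file03  nfs  rw,hard,proto=tcp,timeo=600,retrans=2  0 0

# Fail-faster variant being argued for: shorter timeo so application threads aren't tied up for a full minute.
nfs-server:/git-data  /var/opt/gitlab/git-data-file03  nfs  rw,hard,proto=tcp,timeo=100,retrans=2  0 0
```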
@jtevnan If the choice is between connections piling up and repos being marked empty, I'd lean towards the former. I think it would be worth testing this on staging to see what happens: the NFS server could have a temporary rule enabled to block NFS traffic, and then a user could try to access a repo.
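Something along these lines on the staging NFS server should be enough to simulate the mount going away (illustrative commands; tighten them to staging source IPs as needed, and remove the rules afterwards):

```sh
# Drop NFS (2049) and portmapper (111) traffic to simulate an unreachable mount; illustrative only.
iptables -I INPUT -p tcp --dport 2049 -j DROP
iptables -I INPUT -p tcp --dport 111  -j DROP

# Undo afterwards:
iptables -D INPUT -p tcp --dport 2049 -j DROP
iptables -D INPUT -p tcp --dport 111  -j DROP
```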
One of the things we can do to improve the situation is to adjust the TCP keepalive settings in the kernel. This helps with sockets between hosts that should normally have low latency to each other.
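For reference, these are the kernel knobs involved; the values below are just an example of a more aggressive keepalive policy, not something we've agreed on (the kernel defaults are 7200/75/9):

```sh
# Example of tighter TCP keepalives; illustrative values only.
sysctl -w net.ipv4.tcp_keepalive_time=60     # seconds of idle before the first probe
sysctl -w net.ipv4.tcp_keepalive_intvl=10    # seconds between probes
sysctl -w net.ipv4.tcp_keepalive_probes=5    # failed probes before the connection is considered dead
```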
Another thing: we should possibly fix the Git client so that it can tell the difference between a broken connection and an empty repo. Maybe this is a Gitaly thing?
@bjk-gitlab Since the plan is for Gitaly to run on the NFS servers, there will not be NFS timeouts, but there will be gRPC timeouts. What code generates the message "The repository for this project does not exist."? We should keep this issue in mind when migrating that code.
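To illustrate the concern (a sketch assuming a Gitaly client with a repository-exists RPC, not the actual code): only an explicit negative answer should ever end up cached as "repo does not exist"; a transport-level failure must not be treated the same way.

```ruby
require 'grpc'

# Sketch only; `client` and `request` stand for an assumed Gitaly repository-exists RPC.
def repository_exists?(client, request)
  client.repository_exists(request).exists
rescue GRPC::DeadlineExceeded, GRPC::Unavailable
  # Infrastructure problem, not a missing repository:
  # propagate the error instead of caching a negative result.
  raise
end
```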
All with the same outcome: after the initial 502, we started throwing 500s, and after the port was unblocked, the repo was fine. Sometimes a restart of Unicorn was needed, but in general there was no problem that I could track.
I also tried e.g. project/import on a project that I knew didn't exist yet on the NFS server. Nothing was able to reproduce the error with timeouts.
It seems that I am unable to trigger the correct functions that should reset the cache for a repo. @stanhu, can someone give us a hand in testing this?
I suspect what's happening here is that a push happens, we expire the caches in certain situations (e.g. when it's time to garbage collect), and if the repo doesn't happen to be there when we next look, it gets cached as non-existent.
This work should be fairly orthogonal to the changes in the caching strategy, although once it's been ported I would suggest we benchmark Redis caching of the exists? method vs. simple memoization + Gitaly backend. For such a straightforward request (one fstat), the overhead of Redis may not be worthwhile.
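To make the comparison concrete, these are roughly the two shapes I'd benchmark; the class and method names below are illustrative, not the current Repository implementation:

```ruby
# Illustrative sketch of the two caching strategies for the exists? check.
class Repository
  def initialize(path)
    @path = path
  end

  # Variant 1: cross-process cache in Redis via Rails.cache (one Redis round-trip per call).
  def exists_with_redis_cache?
    Rails.cache.fetch("repository/#{@path}/exists") { raw_repository_exists? }
  end

  # Variant 2: simple per-instance memoization in front of the backend check (no Redis round-trip).
  def exists_with_memoization?
    return @exists unless @exists.nil?

    @exists = raw_repository_exists?
  end

  private

  # Stand-in for the real backend check: one fstat today, a Gitaly call after the port.
  def raw_repository_exists?
    File.directory?(@path)
  end
end
```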
I wonder if we should expire the cache only during a few specific operations on the repository, instead of making it possible to expire the cache on each request.
@grzesiek and I managed to reproduce this in staging:
we dropped the NFS timeout to 1 second (timeo=10) and rebooted web01
we confirmed we were both able to see https://staging.gitlab.com/gitlab-org/gitlab-ce
we added a drop rule on the NFS server for ports 111 and 2049 (as before)
refreshed https://staging.gitlab.com/gitlab-org/gitlab-ce and almost immediately received a 500 error
verified the Gitlab::Git::Repository::NoRepository error in Sentry (see below)
this exception triggered the expiry of the repo_exists? key
trace:
Rugged::OSError: Failed to resolve path '/var/opt/gitlab/git-data-file03/repositories/gitlab-org/gitlab-ce.git': Input/output error
  from lib/gitlab/git/repository.rb:65:in `new'
  from lib/gitlab/git/repository.rb:65:in `rugged'
  from app/models/repository.rb:450:in `method_missing'
  from app/models/repository.rb:217:in `ref_exists?'
  from app/models/merge_request.rb:804:in `ref_fetched?'
  from app/models/merge_request.rb:808:in `ensure_ref_fetched'
  from app/controllers/projects/merge_requests_controller.rb:702:in `ensure_ref_fetched'
@grzesiek I'm probably misunderstanding this, but... is your suggestion not to expire the cache on each request (or call to exists?), so that if the repo is intermittently unavailable, we still report that it's there?
If that's the reasoning, I wonder if we should cache that... so for X time it reports the latest status. This would prevent hitting the disk so often for this check, and better differentiate intermittent NFS errors from genuinely missing repos.
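Roughly what I mean, as a sketch; the key name, helper names, and the 1-minute TTL are placeholders:

```ruby
# Sketch: serve the last known answer for up to a minute instead of re-checking
# (and possibly re-caching a bogus "does not exist") on every request.
def cached_exists?
  Rails.cache.fetch("repository/#{full_path}/exists", expires_in: 1.minute) do
    raw_repository.exists?
  end
end
```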
@stanhu @grzesiek Looks like a related issue here... exists? is overloaded. There is at least a four-state logic instead of a binary state: (1) does not exist, (2) created and present, (3) created and not present (AWOL), (4) does not "exist", yet is present (other root cause).
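Purely as an illustration of those four states (none of these helpers exist today), something like this would make the distinction explicit instead of collapsing it into a boolean:

```ruby
# Illustrative only: separate "do we expect a repo" from "can we see it right now".
# repository_created? and repository_on_disk? are hypothetical helpers.
def repository_state(project)
  created = project.repository_created?
  on_disk = project.repository_on_disk?

  if created && on_disk
    :present      # (2) created and present
  elsif created
    :awol         # (3) created but not reachable, e.g. NFS trouble
  elsif on_disk
    :untracked    # (4) not "created", yet present on disk (other root cause)
  else
    :absent       # (1) does not exist
  end
end
```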
We have Project#repo_exists?, Project#repository_exists?, Project#repository, Project#repo, etc. Some methods catch exceptions, some don't; some invalidate the cache, some do not. We need a tech debt issue about it.
@grzesiek The change has the expected effect and might be a solution to the problem; however, I am a bit worried about the Unicorn process returning 500s after the test.
@grzesiek Yes, that's correct: it would fix the incorrect exists? cache, and it tries to prevent filesystem access to the failing system in order to keep the Unicorns happy.
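For anyone following along, the rough shape of that approach is a circuit breaker in front of storage access. The sketch below is a simplification with placeholder thresholds, not the actual code in the MR:

```ruby
# Simplified circuit-breaker sketch (placeholder thresholds, not the actual MR):
# after a few consecutive failures on a storage shard, stop touching its filesystem
# for a while so Unicorn workers fail fast instead of blocking on a dead NFS mount.
class StorageCircuitBreaker
  FAILURE_THRESHOLD = 3
  COOL_OFF_SECONDS = 60

  def initialize
    @failures = Hash.new(0)
    @tripped_at = {}
  end

  def perform(storage)
    raise "circuit open for storage #{storage}" if open?(storage)

    result = yield
    @failures[storage] = 0 # a success resets the consecutive-failure count
    result
  rescue Errno::EIO, Errno::ESTALE, Errno::ETIMEDOUT
    @failures[storage] += 1
    @tripped_at[storage] = Time.now if @failures[storage] >= FAILURE_THRESHOLD
    raise
  end

  private

  def open?(storage)
    tripped = @tripped_at[storage]
    return false unless tripped
    return true if Time.now - tripped < COOL_OFF_SECONDS

    # Cool-off period elapsed: reset and try the filesystem again.
    @failures[storage] = 0
    @tripped_at.delete(storage)
    false
  end
end
```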
@jtevnan @grzesiek @reprazent From a quick read here, my understanding is that this is something that needs a resolution on the application side rather than increasing infrastructure limits to an unreasonable setting.
If this is so, can we just close this issue and point to the right one that will fix the application behavior? If not, could someone point out what the next action from the production side is?
I'm asking because this issue has been dragging on for 3 weeks already, and it doesn't look like it will be resolved for at least another 3 weeks, which is when we will release the next version.
@pcarranza @grzesiek gitlab-org/gitlab-ce!11449 is being worked on by @reprazent, but it is currently not his top priority and likely will not be until the end of the 9.4 development month (July 7th). After that, I expect him to get back to it.