The 'exists' cache should never return false, so the additional check shouldn't be necessary. For existing setups we can do either of the following (whichever is better):
> Schedule a migration to flush the "exists" cache
What would a migration that flushes the 'exists' cache look like? Is that really feasible at gitlab.com's scale?
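For what it's worth, such a migration would mostly be a loop over projects that expires only the cached exists? entry. A minimal sketch, assuming a per-repository expiry helper like Repository#expire_exists_cache is available (the class name and batch size are illustrative):

```ruby
# Hypothetical post-deployment migration that drops only the cached `exists?`
# entry for every project. Assumes Repository#expire_exists_cache (or an
# equivalent single-key delete) exists; class name and batch size are
# illustrative, not taken from the codebase.
class FlushRepositoryExistsCache < ActiveRecord::Migration
  DOWNTIME = false

  def up
    Project.find_each(batch_size: 1_000) do |project|
      project.repository.expire_exists_cache
    end
  end

  def down
    # No-op: deleted cache entries cannot be restored.
  end
end
```

Feasibility at GitLab.com's scale then comes down to how fast one cache delete per project can be issued without hurting Redis, which is the same write-rate concern discussed further down.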
> Just wait since caches expire after 2 weeks
The real-world problem that prompted this discussion was a repository that had become completely inaccessible to 'git push' and 'git pull', let alone browsing via the web UI. This is not the sort of glitch a user wants to wait 2 weeks for to self-heal. "Sorry, no updates to www-gitlab-com for the next 2 weeks."
I think we need to act under the assumption that the existing Rails cache for Repository#exists? contains invalid false values. That means we either design a new solution that handles the invalid cache data, or we move to a new cache key.
@jacobvosmaer-gitlab The cache key is based on the method name, so that's not all that easy. I don't really see a problem with flushing the cache; it may take a few minutes, but it's not that big of a deal.
@jacobvosmaer-gitlab @rspeicher Hm, I missed that, though it does seem to use a delay of sorts. I think in the past when we ran rake cache:clear it would usually complete much faster.
We could rename the method? Add a version number to it?
That means having to change all the code that uses it, and doing that every time we make changes to the internals. This doesn't scale very well; I'd rather flush the cache if we can do it in, e.g., 10 minutes. We can add a version number to RepositoryCache, but then it will affect all caches.
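To make the version-number idea concrete, here is a minimal sketch (the constructor and key format are simplified assumptions, not the real RepositoryCache): bumping the constant makes every previously written key unreachable, which is exactly why it invalidates all caches at once rather than just the exists? entries.

```ruby
# Simplified sketch of a versioned cache namespace. Bumping VERSION orphans
# every key written under the old namespace; the old entries are not deleted,
# they simply age out on their own.
class RepositoryCache
  VERSION = 2 # bump to invalidate everything written under the old namespace

  def initialize(repository, backend: Rails.cache)
    @namespace = "v#{VERSION}:#{repository.full_path}:#{repository.project.id}"
    @backend = backend
  end

  def cache_key(type)
    "#{type}:#{@namespace}"
  end

  def fetch(key, &block)
    @backend.fetch(cache_key(key), &block)
  end
end
```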
The reason we introduced a delay in the cache clear was that a high number of writes would cause Redis to fail over to a new master: gitlab-com/infrastructure#1682 (closed)
This may no longer be an issue now that we doubled the RAM in each node of the Redis cluster, but we have not tested this.
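If we do decide to flush, a throttled clear might avoid that failover risk: walk the keyspace with SCAN and delete in small batches with a pause between them, rather than issuing one large burst of writes. A rough sketch, with the connection helper, key pattern, batch size, and sleep interval all assumed rather than taken from the codebase:

```ruby
# Rough sketch of a rate-limited cache clear. SCAN keeps Redis responsive while
# we walk the keyspace, and the sleep spaces out the DEL traffic so replication
# is not overwhelmed the way a bulk clear was in gitlab-com/infrastructure#1682.
Gitlab::Redis.with do |redis|
  cursor = '0'

  loop do
    cursor, keys = redis.scan(cursor, match: 'cache:gitlab:exists?:*', count: 1_000)

    redis.del(*keys) unless keys.empty?
    sleep 0.1 # throttle between batches

    break if cursor == '0'
  end
end
```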
Not caching the false values, as @stanhu mentions, would avoid repositories appearing missing. But it would still try to access the filesystem on every request, running into the mount timeout and blocking requests.
Currently, if access to a repository fails once, we won't try it again because of the cache, which means we don't touch the failing FS and Unicorn serves the request without blocking. I'm worried that if we removed this cache, the problem would snowball into a bigger problem.
> Not caching the false values, as @stanhu mentions, would avoid repositories appearing missing. But it would still try to access the filesystem on every request, running into the mount timeout and blocking requests.
This is a good point. This might be an argument to cache the false value for a short time (e.g. 1 minute) to allow some recovery time.
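A minimal sketch of that asymmetric expiry (the key format, expiry values, and raw_repository call are simplified assumptions, not the actual Repository#exists? implementation):

```ruby
# Sketch: cache a positive exists? result for the usual long period, but a
# negative result only briefly, so a transient storage failure stops making
# the repository look missing after a minute instead of two weeks.
def cached_exists?
  key = "exists?:#{full_path}"

  cached = Rails.cache.read(key)
  return cached unless cached.nil? # a cached `false` is still returned here

  value = raw_repository.exists?
  Rails.cache.write(key, value, expires_in: value ? 2.weeks : 1.minute)

  value
end
```

That keeps most of the protection described above, since the failing FS is touched at most once per minute per repository rather than on every request, while still letting the cache recover on its own.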