The 'exists' cache should never return false, so the additional check shouldn't be necessary. For existing setups we can do either of the following (whichever is better):
> Schedule a migration to flush the "exists" cache
What would a migration that flushes the 'exists' cache look like? Is that really feasible at gitlab.com's scale?
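For what it's worth, such a migration would mostly be a loop over projects that expires only the cached exists? entry. A minimal sketch, assuming a per-repository expiry helper like Repository#expire_exists_cache is available (the class name and batch size are illustrative):

```ruby
# Hypothetical post-deployment migration that drops only the cached `exists?`
# entry for every project. Assumes Repository#expire_exists_cache (or an
# equivalent single-key delete) exists; class name and batch size are
# illustrative, not taken from the codebase.
class FlushRepositoryExistsCache < ActiveRecord::Migration
  DOWNTIME = false

  def up
    Project.find_each(batch_size: 1_000) do |project|
      project.repository.expire_exists_cache
    end
  end

  def down
    # No-op: deleted cache entries cannot be restored.
  end
end
```

Feasibility at GitLab.com's scale then comes down to how fast one cache delete per project can be issued without hurting Redis, which is the same write-rate concern discussed further down.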
> Just wait since caches expire after 2 weeks
The real-world problem that prompted this discussion was a repository that had become completely inaccessible to 'git push' and 'git pull', let alone browsing via the web UI. This is not the sort of glitch a user wants to wait 2 weeks for to self-heal. "Sorry, no updates to www-gitlab-com for the next 2 weeks."
I think we need to act under the assumption that the existing Rails cache for Repository#exists? contains invalid false values. That means we either design a new solution that handles the invalid cache data, or we move to a new cache key.
@jacobvosmaer-gitlab The cache key is based on the method name, so that's not all that easy. I don't really see a problem with flushing the cache; it may take a few minutes, but it's not that big of a deal.
@jacobvosmaer-gitlab @rspeicher Hm, I missed that, though it does seem to use a delay of sorts. I think in the past when we ran rake cache:clear it would usually complete much faster.
We could rename the method? Add a version number to it?
That means having to change all the code that uses it, and doing that every time we make changes to the internals. This doesn't scale very well; I'd rather flush the cache if we can do it in, e.g., 10 minutes. We can add a version number to RepositoryCache, but then it will affect all caches.
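To make the version-number idea concrete, here is a minimal sketch (the constructor and key format are simplified assumptions, not the real RepositoryCache): bumping the constant makes every previously written key unreachable, which is exactly why it invalidates all caches at once rather than just the exists? entries.

```ruby
# Simplified sketch of a versioned cache namespace. Bumping VERSION orphans
# every key written under the old namespace; the old entries are not deleted,
# they simply age out on their own.
class RepositoryCache
  VERSION = 2 # bump to invalidate everything written under the old namespace

  def initialize(repository, backend: Rails.cache)
    @namespace = "v#{VERSION}:#{repository.full_path}:#{repository.project.id}"
    @backend = backend
  end

  def cache_key(type)
    "#{type}:#{@namespace}"
  end

  def fetch(key, &block)
    @backend.fetch(cache_key(key), &block)
  end
end
```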
The reason we introduced a delay in the cache clear was that a high number of writes would cause Redis to fail over to a new master: gitlab-com/infrastructure#1682 (closed)
This may no longer be an issue now that we doubled the RAM in each node of the Redis cluster, but we have not tested this.
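If we do decide to flush, a throttled clear might avoid that failover risk: walk the keyspace with SCAN and delete in small batches with a pause between them, rather than issuing one large burst of writes. A rough sketch, with the connection helper, key pattern, batch size, and sleep interval all assumed rather than taken from the codebase:

```ruby
# Rough sketch of a rate-limited cache clear. SCAN keeps Redis responsive while
# we walk the keyspace, and the sleep spaces out the DEL traffic so replication
# is not overwhelmed the way a bulk clear was in gitlab-com/infrastructure#1682.
Gitlab::Redis.with do |redis|
  cursor = '0'

  loop do
    cursor, keys = redis.scan(cursor, match: 'cache:gitlab:exists?:*', count: 1_000)

    redis.del(*keys) unless keys.empty?
    sleep 0.1 # throttle between batches

    break if cursor == '0'
  end
end
```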
Not caching the false values, as @stanhu mentions, would avoid repositories appearing missing. But it would still try to access the filesystem on every request, running into the mount timeout and blocking requests.
Currently, if access to a repository fails once, we won't try it again because of the cache, which means we don't touch the failing FS and Unicorn serves the request without blocking. I'm worried that if we removed this cache, the problem would snowball into a bigger problem.
> Not caching the false values, as @stanhu mentions, would avoid repositories appearing missing. But it would still try to access the filesystem on every request, running into the mount timeout and blocking requests.
This is a good point. This might be an argument to cache the false value for a short time (e.g. 1 minute) to allow some recovery time.
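A minimal sketch of that asymmetric expiry (the key format, expiry values, and raw_repository call are simplified assumptions, not the actual Repository#exists? implementation):

```ruby
# Sketch: cache a positive exists? result for the usual long period, but a
# negative result only briefly, so a transient storage failure stops making
# the repository look missing after a minute instead of two weeks.
def cached_exists?
  key = "exists?:#{full_path}"

  cached = Rails.cache.read(key)
  return cached unless cached.nil? # a cached `false` is still returned here

  value = raw_repository.exists?
  Rails.cache.write(key, value, expires_in: value ? 2.weeks : 1.minute)

  value
end
```

That keeps most of the protection described above, since the failing FS is touched at most once per minute per repository rather than on every request, while still letting the cache recover on its own.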