At 09:28 UTC we noticed that gitlab-com/www-gitlab-com was unavailable. A quick investigation found that the cache Rails uses was corrupted. We fixed this by manually expiring the cache, and at 10:26 UTC the repository was back online.
To-do's:
Check if other repositories are experiencing the same issue and expire the caches for all of them.
Write a runbook that explains how to perform the manual expiration in case of necessity.
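For the runbook, a minimal Rails console sketch of the manual expiration, assuming current method names (find_by_full_path, expire_content_cache); verify against the running version before relying on it:

# Drop the corrupt cached state for the affected project, then let the next
# exists? call re-check the disk and repopulate the cache.
project = Project.find_by_full_path('gitlab-com/www-gitlab-com')
project.repository.expire_content_cache
project.repository.exists?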
Questions:
Why did the cache become corrupted?
How can we prevent this from happening again?
How can we monitor this?
Who is the point of contact for this kind of issue?
A blanket 'fix' to repair other poisoned caches could be to run gitlab-rake cache:clear. This command uses Redis SCAN to find and delete all Rails cache keys from Redis.
It doesn't fix the application problem but at least it does not rely on users reporting problematic repositories to us. I think it is the simplest thing that fixes the symptoms.
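For reference, a minimal redis-rb sketch of what a SCAN-based clear looks like; the cache:gitlab:* key prefix is an assumption about the cache namespace, not the verified production value:

require 'redis'

# Non-blocking SCAN over the assumed Rails cache namespace, deleting keys as we go.
redis = Redis.new(url: ENV.fetch('REDIS_URL', 'redis://localhost:6379'))

redis.scan_each(match: 'cache:gitlab:*', count: 1000) do |key|
  redis.del(key)
end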
@jacobvosmaer-gitlab we can't do this anymore because we have a 55G Redis instance that is keeping everything. If we issue such a command, Redis gets out of sync and triggers a failover, which would end up in ~15 minutes of downtime.
I opened this issue asking to split the cache from persistent data in Redis; once that happens we can just blow away the cache Redis instance and wipe the cache that way.
Related to this were the 500 errors on attempting to view an MR in www-gitlab-com. Scanning through Sentry, I wasn't able to find them immediately, but I did come across this, which may be related:
Should we run this in a loop to blow away all the method caches? Could we decrease the scope somehow? Could we make it slow enough that it doesn't impact Redis badly?
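One way to narrow the scope and throttle the deletions, sketched under the same key-prefix assumption as above:

require 'redis'

redis = Redis.new(url: ENV.fetch('REDIS_URL', 'redis://localhost:6379'))

# Delete in small batches and pause between them so Redis never gets saturated.
redis.scan_each(match: 'cache:gitlab:*', count: 100).each_slice(100) do |keys|
  redis.del(*keys)
  sleep 0.1
end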
We would then need to check whether the repository is there via the Rails console:
p = Project.find(project_id)
p.repository.exists?
system("ls -al #{p.repository.path}")  # double quotes so the path interpolates
If we find a false positive (exists? says false but the directory is on disk), then the repository in question is actually there.
We may need to isolate this further by trying to focus on one repository where we repeatedly push new tags and see if we can make exists? become false.
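A hypothetical watch loop for a single suspect project, to catch the moment exists? flips to false while the repository is still on disk (method names assumed from the console snippet above):

# Compare the possibly-cached exists? answer with the on-disk state every 30s.
project = Project.find_by_full_path('gitlab-com/www-gitlab-com')

loop do
  cached_answer = project.repository.exists?
  on_disk       = File.directory?(project.repository.path)

  if cached_answer != on_disk
    Rails.logger.warn("exists? diverged for #{project.full_path}: cache=#{cached_answer} disk=#{on_disk}")
  end

  sleep 30
end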
In this particular case, running a single MONITOR client can reduce the throughput by more than 50%. Running more MONITOR clients will reduce throughput even more.
Right. I meant if we ran a loop looking through the cache, like you showed above. I'm running #expire_content_cache on repositories as needed to fix people up.
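Roughly what that per-repository fix looks like in the console (the list of project paths below is illustrative, not the actual set of affected repositories):

# Expire the content cache for each project reported as broken.
%w[gitlab-com/www-gitlab-com].each do |path|
  project = Project.find_by_full_path(path)
  next unless project

  project.repository.expire_content_cache
end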
I'm also wondering if all these recent NFS issues are related. For example, if the application hits some I/O error opening the repository, I wonder whether the exists? state may incorrectly be set to false.
The alternative is to look at each key and clear the state if it is false, using a Redis multi-read. I don't quite understand the serialization format, but maybe it's enough to look for the string valueF.
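A rough sketch of that scan-and-inspect approach with redis-rb, assuming the relevant keys match an exists? pattern and that a serialized false really does contain the literal valueF bytes mentioned above:

require 'redis'

redis = Redis.new(url: ENV.fetch('REDIS_URL', 'redis://localhost:6379'))
suspect_keys = []

# Multi-read (MGET) in batches and flag raw values that contain the false marker.
redis.scan_each(match: 'cache:gitlab:*:exists?', count: 1000).each_slice(100) do |keys|
  redis.mget(*keys).each_with_index do |raw, index|
    suspect_keys << keys[index] if raw&.include?('valueF')
  end
end

puts suspect_keys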
The majority of these messages appear to be coming from one host:
Aside: currently the hostname field in Kibana is set to not-aggregatable, which makes it difficult to visualise errors by host. We should fix this, as breaking errors down by host is an easy way to spot problems.
The problematic timeout (timeo=10, i.e. a fairly reasonable 1 second) was increased yesterday to timeo=50 (5 seconds) here, and @northrup did a rolling reboot of our NFS clients to make sure the new setting is active. This should definitely mitigate the timeouts.
I just checked the git hosts, and with the new setting of 5(!) seconds, 3/5 servers still showed problems reaching the shares:
I would ask Azure at this point as this is abnormal network response time for hosts that should be in the same data center with only marginal hop count.
@Tarun-ASfS just to clarify, are you recommend_ing_ that we raise a Sev B service request? Or are you saying that, per @northrup's recommendation, you did raise a Sev B service request? The latter would be more helpful :-)
No offence @Tarun-ASfS ... but if the only service we get is "I'd recommend that you open a ticket", where does the benefit of this partnership come into play?
@stanhu I have opened a Sev B ticket in the portal to investigate this issue.
@northrup FYI : I am authorized to raise tickets on your behalf only when the Azure portal is down.
On the partnership level, I (as SDM) can assist you with light advisory questions and get you connected to the right engineering folks for deeper advisory or architectural reviews. I can work with the product teams on your feedback about product features or support and get you updates from them. We can also work on getting you access to certain Private Previews on Azure, and, last but not least, we get you traction on all your service requests. There are other aspects to the support beyond the above; based on your requests we can help on a need basis.
As this is a pilot program, things are changing rapidly. We take your feedback and suggestions very seriously, so we are also working on allowing SDMs to raise cases on behalf of customers, but that will take some time; I have no ETA on it yet.
@northrup @stanhu Let me know the SR# for the Sev B case that you have created so I can monitor/expedite as needed.
Also, one of the primary reasons we ask customers to submit a case through the portal, instead of SDMs/TAMs creating the SR, is that the support engineers get your subscription details, resource details, and other telemetry when you create an SR through the portal directly. This greatly improves the support engineer's ability to resolve a case quickly, versus an SDM/TAM creating an SR on the customer's behalf through our internal tools.
It sounds to me like the problem has been significantly improved by changing the NFS timeouts, but it is still happening occasionally. Look at this graph:
It seems to me we need to:
Follow up with Azure to check the network connectivity between our nodes and the NFS nodes
Investigate why we are still seeing NFS timeouts, perhaps tune settings
I just went and checked and the web workers have not been recycled, so they still have timeouts of 5 seconds, while the git workers have timeouts of 1 second. I'm going to correct these for the fleet and then cycle the git workers (again!).
I just scanned through this thread. We talked about the Repository#exists? method a lot. Would it make sense to stop trusting the Rails cache when it says Repository#exists? is false?
It feels to me like we have not discussed the unfortunate fact that one NFS glitch can put a repository in a state where it 'does not exist' until an administrator fixes that in the Rails console. I'd like it better if the application recovered on its own.
Some ideas:
only cache the positive result ("yes the repository exists") but always check on disk if the cache says 'false'
cache the negative result but use a shorter expiry time (1 minute?)
Is it possible for a project to transition from 'has a repository' to 'does not have a repository'? We know some projects have no repository on purpose, because users chose to only use the issue tracker.
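A minimal sketch of the first idea above (with the second as a commented variant); cache and raw_repository are illustrative stand-ins for however the repository cache is actually wired, not a real patch:

# Only persist a positive answer; a cached false is ignored and re-checked on disk.
def exists?
  return true if cache.read(:exists?) == true

  raw_repository.exists?.tap do |result|
    cache.write(:exists?, true) if result
    # Idea 2 variant: cache the negative too, but only briefly:
    # cache.write(:exists?, false, expires_in: 1.minute) unless result
  end
end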
@jacobvosmaer-gitlab that's an interesting thought. Would it make sense to check what the impact would be of not trusting the cache for the false case? (My gut feeling is that it should not be huge.)
I think @yorickpeterse raised that same idea @jacobvosmaer-gitlab a while ago. I forgot the conclusion of that, but your proposal sounds like a good idea.
@jacobvosmaer-gitlab A cache should never return false, so the additional check shouldn't be necessary. For existing setups we can either (whichever is better):