At 09:28 UTC we noticed that gitlab-com/www-gitlab-com was unavailable. A quick investigation found that the cache Rails uses was corrupted. We fixed this by manually expiring the cache, and at 10:26 UTC the repository was back online.
To-do's:
Check if other repositories are experiencing the same issue and expire the caches for all of them.
Write a runbook that explains how to perform the manual expiration in case of necessity.
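For the runbook, a minimal Rails console sketch of the manual expiration, assuming current method names (find_by_full_path, expire_content_cache); verify against the running version before relying on it:

# Drop the corrupt cached state for the affected project, then let the next
# exists? call re-check the disk and repopulate the cache.
project = Project.find_by_full_path('gitlab-com/www-gitlab-com')
project.repository.expire_content_cache
project.repository.exists?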
Questions:
Why did the cache become corrupted?
How can we prevent this from happening again?
How can we monitor this?
Who is the point of contact for this kind of issue?
A blanket 'fix' to repair other poisoned caches could be to run gitlab-rake cache:clear. This command uses Redis SCAN to find and delete all Rails cache keys from Redis.
It doesn't fix the application problem but at least it does not rely on users reporting problematic repositories to us. I think it is the simplest thing that fixes the symptoms.
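For reference, a minimal redis-rb sketch of what a SCAN-based clear looks like; the cache:gitlab:* key prefix is an assumption about the cache namespace, not the verified production value:

require 'redis'

# Non-blocking SCAN over the assumed Rails cache namespace, deleting keys as we go.
redis = Redis.new(url: ENV.fetch('REDIS_URL', 'redis://localhost:6379'))

redis.scan_each(match: 'cache:gitlab:*', count: 1000) do |key|
  redis.del(key)
end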
@jacobvosmaer-gitlab we can't do this anymore because we have a 55G Redis instance that is keeping everything. If we issue such a command, Redis gets out of sync and triggers a failover, which would end up in ~15 minutes of downtime.
I opened this issue asking to split the cache from persistent data in Redis; once that happens we can just blow away the cache Redis instance and wipe the cache that way.
Related to this were the 500 errors on attempting to view an MR in www-gitlab-com. Scanning through Sentry, I wasn't able to find them immediately, but I did come across this, which may be related:
Should we run this in a loop to blow away all the method caches? Could we decrease the scope somehow? Could we make it slow enough that it doesn't impact Redis badly?
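One way to narrow the scope and throttle the deletions, sketched under the same key-prefix assumption as above:

require 'redis'

redis = Redis.new(url: ENV.fetch('REDIS_URL', 'redis://localhost:6379'))

# Delete in small batches and pause between them so Redis never gets saturated.
redis.scan_each(match: 'cache:gitlab:*', count: 100).each_slice(100) do |keys|
  redis.del(*keys)
  sleep 0.1
end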
We would then need to check whether the repository is there via the Rails console:
p = Project.find(project_id)
p.repository.exists?
system("ls -al #{p.repository.path}")  # double quotes so the path interpolates
If we find a false positive (exists? says false but the directory is on disk), then the repository in question is actually there.
We may need to isolate this further by trying to focus on one repository where we repeatedly push new tags and see if we can make exists? become false.
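A hypothetical watch loop for a single suspect project, to catch the moment exists? flips to false while the repository is still on disk (method names assumed from the console snippet above):

# Compare the possibly-cached exists? answer with the on-disk state every 30s.
project = Project.find_by_full_path('gitlab-com/www-gitlab-com')

loop do
  cached_answer = project.repository.exists?
  on_disk       = File.directory?(project.repository.path)

  if cached_answer != on_disk
    Rails.logger.warn("exists? diverged for #{project.full_path}: cache=#{cached_answer} disk=#{on_disk}")
  end

  sleep 30
end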
In this particular case, running a single MONITOR client can reduce the throughput by more than 50%. Running more MONITOR clients will reduce throughput even more.
Right. I meant if we ran a loop looking through the cache, like you showed above. I'm running #expire_content_cache on repositories as needed to fix people up.
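Roughly what that per-repository fix looks like in the console (the list of project paths below is illustrative, not the actual set of affected repositories):

# Expire the content cache for each project reported as broken.
%w[gitlab-com/www-gitlab-com].each do |path|
  project = Project.find_by_full_path(path)
  next unless project

  project.repository.expire_content_cache
end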
I'm also wondering if all these recent NFS issues are related. For example, if the application hits some I/O error opening the repository, I wonder whether the exists? state may incorrectly be set to false.
The alternative is to look at each key and clear the state if it is false, using a Redis multi-read. I don't quite understand the serialization format, but maybe it's enough to look for the string valueF.
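A rough sketch of that scan-and-inspect approach with redis-rb, assuming the relevant keys match an exists? pattern and that a serialized false really does contain the literal valueF bytes mentioned above:

require 'redis'

redis = Redis.new(url: ENV.fetch('REDIS_URL', 'redis://localhost:6379'))
suspect_keys = []

# Multi-read (MGET) in batches and flag raw values that contain the false marker.
redis.scan_each(match: 'cache:gitlab:*:exists?', count: 1000).each_slice(100) do |keys|
  redis.mget(*keys).each_with_index do |raw, index|
    suspect_keys << keys[index] if raw&.include?('valueF')
  end
end

puts suspect_keys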
The majority of these messages appear to be coming from one host:
Aside: currently the hostname field in Kibana is set to not-aggregatable, which makes it difficult to visualise errors by host. We should fix this, as breaking errors down by host is an easy way to spot problems.
The problematic timeout (timeo=10, i.e. a fairly reasonable 1 second) was increased yesterday to timeo=50 (5 seconds) here, and @northrup did a rolling reboot of our NFS clients to make sure the new setting is active. This should definitely mitigate the timeouts.
I just checked the git hosts, and with the new setting of 5(!) seconds, 3/5 servers still showed problems reaching the shares:
I would ask Azure at this point as this is abnormal network response time for hosts that should be in the same data center with only marginal hop count.
@Tarun-ASfS just to clarify, are you recommend_ing_ that we raise a Sev B service request? Or are you saying that, per @northrup's recommendation, you did raise a Sev B service request? The latter would be more helpful :-)
No offence @Tarun-ASfS ... but if the only service we get is "I'd recommend that you open a ticket", where does the benefit of this partnership come into play?
@stanhu I have opened a Sev B ticket in the portal to investigate this issue.
@northrup FYI : I am authorized to raise tickets on your behalf only when the Azure portal is down.
On the partnership level, I (as SDM) can assist you with light advisory questions and get you connected to the right engineering folks for deeper advisory or architectural reviews. I can work with the product teams on your feedback about product features or support and get you updates from them. We can also work on getting you access to certain Private Previews on Azure, and, last but not least, we get you traction on all your service requests. There are other aspects to the support beyond the above; based on your requests we can help on a need basis.
As this is a pilot program, things are changing rapidly. We take your feedback and suggestions very seriously, so we are also working on allowing SDMs to raise cases on behalf of customers, but that will take some time; I have no ETA on it yet.
@northrup @stanhu Let me know the SR# for the Sev B case that you have created so I can monitor/expedite as needed.
Also, one of the primary reasons we ask customers to submit a case through the portal, instead of SDMs/TAMs creating the SR, is that the support engineers get your subscription details, resource details, and other telemetry when you create an SR through the portal directly. This greatly improves the support engineer's ability to resolve a case quickly, versus an SDM/TAM creating an SR on the customer's behalf through our internal tools.
It sounds to me like the problem has been significantly improved by changing the NFS timeouts, but it is still happening occasionally. Look at this graph:
It seems to me we need to:
Follow up with Azure to check the network connectivity between our nodes and the NFS nodes
Investigate why we are still seeing NFS timeouts, perhaps tune settings
I just went and checked and the web workers have not been recycled, so they still have timeouts of 5 seconds, while the git workers have timeouts of 1 second. I'm going to correct these for the fleet and then cycle the git workers (again!).
I just scanned through this thread. We talked about the Repository#exists? method a lot. Would it make sense to stop trusting the Rails cache when it says Repository#exists? is false?
It feels to me like we have not discussed the unfortunate fact that one NFS glitch can put a repository in a state where it 'does not exist' until an administrator fixes that in the Rails console. I'd like it better if the application recovered on its own.
Some ideas:
only cache the positive result ("yes the repository exists") but always check on disk if the cache says 'false'
cache the negative result but use a shorter expiry time (1 minute?)
Is it possible for a project to transition from 'has a repository' to 'does not have a repository'? We know some projects have no repository on purpose, because users chose to only use the issue tracker.
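A minimal sketch of the first idea above (with the second as a commented variant); cache and raw_repository are illustrative stand-ins for however the repository cache is actually wired, not a real patch:

# Only persist a positive answer; a cached false is ignored and re-checked on disk.
def exists?
  return true if cache.read(:exists?) == true

  raw_repository.exists?.tap do |result|
    cache.write(:exists?, true) if result
    # Idea 2 variant: cache the negative too, but only briefly:
    # cache.write(:exists?, false, expires_in: 1.minute) unless result
  end
end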
@jacobvosmaer-gitlab that's an interesting thought. Would it make sense to check what the impact would be of not trusting the cache for the false case? (My gut feeling is that it should not be huge.)
I think @yorickpeterse raised that same idea @jacobvosmaer-gitlab a while ago. I forgot the conclusion of that, but your proposal sounds like a good idea.
@jacobvosmaer-gitlab A cache should never return false, so the additional check shouldn't be necessary. For existing setups we can either (whichever is better):