Add API call to flush caches of a namespace or project
We currently do this from a Rails console whenever the cache gets poisoned.
We should formalize it as an API call so we can run it through ChatOps.
The calls we are using are like this:
projects = Namespace.find_by(name: 'namespace').projects
projects.map(&:repository).each(&:expire_all_method_caches)
cc/ @DouweM
- Maintainer
Will this be restricted to admins? Seems like it should be so we don't allow any user to clear the cache on their project.
- Author Maintainer
@dblessing yes, I see this as an admin only feature, something we can call with ChatOps.
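A minimal sketch of what an admin-only flush could look like, based on the console calls in the description. The `Repository`, `Project`, `Namespace`, and `User` structs are hypothetical stand-ins so the example runs outside a GitLab console, and `flush_namespace_caches` and `NotAuthorized` are illustrative names, not existing GitLab code.

```ruby
# Hypothetical stand-ins so the sketch runs outside a GitLab console.
# Inside GitLab, the real Namespace, Project and Repository models
# (including Repository#expire_all_method_caches) would take their place.
Repository = Struct.new(:expired) do
  def expire_all_method_caches
    self.expired = true
  end
end

Project   = Struct.new(:repository)
Namespace = Struct.new(:name, :projects)

User = Struct.new(:admin) do
  def admin?
    admin
  end
end

# Illustrative error class, not part of GitLab.
class NotAuthorized < StandardError; end

# Hypothetical helper: expires the method caches of every project in a
# namespace and returns how many projects were touched. Non-admins are refused.
def flush_namespace_caches(namespace, current_user)
  raise NotAuthorized, 'admin access required' unless current_user.admin?

  namespace.projects.each do |project|
    project.repository.expire_all_method_caches
  end

  namespace.projects.size
end

namespace = Namespace.new(
  'namespace',
  [Project.new(Repository.new(false)), Project.new(Repository.new(false))]
)

puts flush_namespace_caches(namespace, User.new(true))
```

A ChatOps command or API endpoint would presumably wrap a helper like this, so the admin check sits in one place regardless of how the flush is triggered.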
- Maintainer
@pcarranza How often do we do something like this?
- Maintainer
@DouweM Quite often. Caches are notoriously fickle.
- username-removed-274314 added AP1 label
- Developer
@pcarranza can you clarify the availability or performance gain that makes this issue deserve an AP1 label?
- Author Maintainer
@ernstvn this one would prevent generating issues of the "perceived data loss" kind.
- Developer
Thanks @pcarranza we're seeing those ~1-2 times per month right? (only counting the variety that would be resolved by this).
- Author Maintainer
@ernstvn judging by @dblessing's comment (https://gitlab.com/gitlab-org/gitlab-ce/issues/34265#note_33501858), I have the feeling there is a much higher load from this than we are aware of in the production team.
Maybe @lbot could give us some visibility on the frequency of these events?
@pcarranza I'd say we've seen this happen somewhere between 10 and 20 times on .com. That said, I haven't seen it happen on the EE on-prem side.
Personally, I think getting SWAT up and running, so we could make this a function there, would be a better "first" iteration than getting it into the API. Then we find the root cause in the app and make that more robust. If, after SWATTING it enough, we conclude that this belongs in the API because we can't make the app more robust, sure.
An API means supporting this as a tool for our customers, and while I think it's useful, it's also debt that I want to weigh appropriately.
- Developer
@ernstvn since this "popped up", which I would say was a few weeks ago. @dblessing can confirm, as he's the one who runs these (@markglenfletcher too), but I've noticed this happening over the past few weeks.
- Author Maintainer
I'm using this script right now to blow all the caches after the NFS disaster:
removed = 0 # assumed: initialized earlier in the session, before the pasted excerpt

Gitlab::Redis.with do |redis|
  cursor = '0'

  loop do
    cursor, keys = redis.scan(
      cursor,
      match: 'cache:gitlab:exists?:*',
      count: 1000
    )

    redis.del(*keys) if keys.any?

    removed += keys.length

    break if cursor == '0'
  end
end
=> nil

puts "Removed #{removed} keys"
Removed 1015632 keys
Same thing, but at a massive scale, that results in downtime.
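The cursor-based pattern in that script can be tried without a live Redis. The sketch below assumes a hypothetical in-memory `FakeRedis` stand-in that implements only the subset of `scan` and `del` needed here; `delete_matching` is an illustrative name. The point it shows is that keys are removed in bounded batches, and the loop ends only when the server hands back cursor `'0'`.

```ruby
# Hypothetical in-memory stand-in for a Redis client. It mimics only the
# parts of SCAN and DEL this sketch needs; a real client would replace it.
class FakeRedis
  def initialize(keys)
    @keys = keys.dup
    @snapshot = []
  end

  # Returns [next_cursor, matching_keys]. A next cursor of '0' means the
  # iteration is complete, mirroring how SCAN signals completion.
  def scan(cursor, match:, count:)
    offset = cursor.to_i
    @snapshot = @keys.dup if offset.zero?

    batch = @snapshot[offset, count] || []
    next_offset = offset + count
    next_cursor = next_offset >= @snapshot.length ? '0' : next_offset.to_s

    matching = batch.select do |key|
      @keys.include?(key) && File.fnmatch?(match, key)
    end

    [next_cursor, matching]
  end

  def del(*keys)
    @keys -= keys
    keys.length
  end

  def size
    @keys.length
  end
end

# Deletes every key matching the pattern, one SCAN batch at a time,
# and returns the number of keys removed.
def delete_matching(redis, pattern, batch_size: 1000)
  removed = 0
  cursor = '0'

  loop do
    cursor, keys = redis.scan(cursor, match: pattern, count: batch_size)
    redis.del(*keys) if keys.any?
    removed += keys.length
    break if cursor == '0'
  end

  removed
end

redis = FakeRedis.new(
  %w[cache:gitlab:exists?:a cache:gitlab:exists?:b cache:gitlab:other:c]
)
puts "Removed #{delete_matching(redis, 'cache:gitlab:exists?:*', batch_size: 2)} keys"
```

Iterating with `SCAN` rather than `KEYS` avoids blocking the server on a single large command, though deleting around a million keys still shifts a lot of load onto whatever those caches were protecting.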
- yorickpeterse-staging changed milestone to %Backlog
- Ernst van Nierop added Platform label
- Developer
I suspect this fits with Platform, so I added the label. However, I'm also changing the AP1 label to SL2. The AP labels really only pertain to availability and performance, per https://about.gitlab.com/handbook/engineering/performance/#performance-labels . Issues that tie to data exposure or (perceived) data loss fit in the Security categories described on https://about.gitlab.com/handbook/engineering/security/#security-priority-labels . I'm making it SL2 instead of SL1 since it isn't actual exposure or loss, only perceived.
- Ernst van Nierop added SL2 and removed AP1 labels
- Ernst van Nierop changed milestone to %Next 2-3 months
- Developer
Changing milestone to be a bit sooner than Backlog since that refers to stuff that is 6 months out per https://about.gitlab.com/handbook/product/#planning-for-future-releases
- Maintainer
@ernstvn I think that with https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/11449 landing in 9.5, a lot of cache issues caused by NFS troubles should be "fixed". It closes https://gitlab.com/gitlab-org/gitlab-ce/issues/33117, https://gitlab.com/gitlab-com/infrastructure/issues/1946 and https://gitlab.com/gitlab-com/infrastructure/issues/1775.
- Developer
9.5 is out; is this still a concern?
- Author Maintainer
I think it shouldn't be anymore
- username-removed-274314 closed