Add API call to flush caches of a namespace or project

Will this be restricted to admins? Seems like it should be so we don't allow any user to clear the cache on their project.

@dblessing yes, I see this as an admin only feature, something we can call with ChatOps.

@pcarranza How often do we do something like this?

@DouweM Quite often. Caches are notoriously fickle.

added AP1 label

@pcarranza can you clarify the availability or performance gain that makes this issue deserve an AP1 label?

@ernstvn this one would prevent generating issues of the "perceived data loss" kind.

Thanks @pcarranza we're seeing those ~1-2 times per month right? (only counting the variety that would be resolved by this).

@ernstvn I have the feeling, judging by @dblessing's comment (https://gitlab.com/gitlab-org/gitlab-ce/issues/34265#note_33501858), that there is a much higher load on this than what we are aware of in the production team.

Maybe @lbot could give us some visibility on the frequency of these events?

@pcarranza I'd say we've seen this happen somewhere between 10-20 times on .com. That said, I haven't seen this happening on the EE on-prem side.

I'm personally of the mindset that getting SWAT up and running so we could make this a function there would probably be a better "first" iteration than getting it into the API. Then we find the root cause of this in the app and make that more robust. If, after SWATTING it enough, we ultimately say "yes, this goes in the API because we can't make the app more robust", sure.

API means support as a tool to our customers and while I think it's useful it's also debt that I want to weigh appropriately.

@lbot "between 10-20 times on .com": over what time period?

@ernstvn since this "popped up" which I would say was a few weeks ago. @dblessing can confirm as he's the one that runs these (@markglenfletcher too) but in the past few weeks I've noticed this happening.

I'm using this script right now to blow all the caches after the NFS disaster:

# Counter must exist before the block, otherwise `removed += ...` raises
removed = 0

Gitlab::Redis.with do |redis|
  cursor = '0'

  loop do
    # SCAN walks the keyspace in batches instead of blocking Redis like KEYS
    cursor, keys = redis.scan(
      cursor,
      match: 'cache:gitlab:exists?:*',
      count: 1000
    )

    redis.del(*keys) if keys.any?

    removed += keys.length

    # A returned cursor of '0' means the scan has completed
    break if cursor == '0'
  end
end

puts "Removed #{removed} keys"
Removed 1015632 keys

Same thing, but at a massive scale, that results in downtime.
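
For comparison, the SCAN-and-delete loop above could in principle be narrowed to a single project instead of every key. This is only a rough sketch: `flush_matching` and `project_cache_pattern` are hypothetical helpers, not existing GitLab methods, and the key layout after `cache:gitlab:exists?:` is an assumption made for illustration. The only client calls used are redis-rb style `scan` and `del`.

```ruby
# Delete every key matching `pattern`, walking the keyspace with SCAN so
# Redis is never blocked the way a single KEYS call would block it.
# `redis` is any client exposing redis-rb style `scan` and `del`.
# Returns the number of keys Redis reports as deleted.
def flush_matching(redis, pattern, batch: 1000)
  removed = 0
  cursor = '0'

  loop do
    cursor, keys = redis.scan(cursor, match: pattern, count: batch)
    removed += redis.del(*keys) if keys.any?
    break if cursor == '0'
  end

  removed
end

# Hypothetical helper: builds the match pattern for one project's entries.
# ASSUMPTION: keys embed the project's full path right after the prefix.
def project_cache_pattern(full_path)
  "cache:gitlab:exists?:#{full_path}:*"
end
```

From a console this would be called as `Gitlab::Redis.with { |redis| flush_matching(redis, project_cache_pattern('group/app')) }`, which touches far fewer keys than the global flush.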

changed milestone to %Backlog

added Platform label

I suspect this fits with Platform, so I added the label. However, I'm also changing the AP1 label to SL2. The AP labels really only pertain to availability and performance, per https://about.gitlab.com/handbook/engineering/performance/#performance-labels. Issues that tie to data exposure or (perceived) data loss fit in the Security categories described on https://about.gitlab.com/handbook/engineering/security/#security-priority-labels. I'm making it SL2 instead of SL1 since it isn't actual exposure or loss, only perceived.

added SL2 and removed AP1 labels

changed milestone to %Next 2-3 months

Changing milestone to be a bit sooner than Backlog since that refers to stuff that is 6 months out per https://about.gitlab.com/handbook/product/#planning-for-future-releases

@ernstvn I think that with https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/11449 landing in 9.5, a lot of cache issues caused by NFS troubles should be "fixed". It closes https://gitlab.com/gitlab-org/gitlab-ce/issues/33117, https://gitlab.com/gitlab-com/infrastructure/issues/1946 and https://gitlab.com/gitlab-com/infrastructure/issues/1775.

9.5 is out; is this still a concern?

I think it shouldn't be anymore

closed
