I would propose doing something similar to how we deal with the security and customer support labels, which use a 1-3 grading of severity. Issues labelled availability may not always take priority over performance ones, depending on how frequent or likely they are to occur, or whether there are valid short-term workarounds, for instance.
- Any P1 issues we will look to schedule for the upcoming release; they should be considered Deliverable.
- Any P2 issues we should look to address within a reasonable timeframe, provided development bandwidth is not fully saturated by P1s.
- Any P3 issues will only be addressed once all P2 issues have been addressed.
@DouweM @yorickpeterse @pcarranza @ernstvn What do you think of this? I want to make sure we're not just dedicating development time to "Performance & Availability" but rather being smart about how we do this, as right now we really don't have a good set of criteria for making prioritisation decisions on these issues.
A difficult thing here is that performance issues are either super critical or all equally important. For example, regressions usually fall in the first category.
Instead of using priorities I wonder if we can somehow "tag" issues to indicate what percentage of users (let's use GitLab.com as a reference for that) is affected by them. That's a bit more useful than "P1". The idea here would be to start with what affects the most users, which we can measure by counting how many requests are sent to the controller associated with the issue.
So in other words, I propose focusing on the number of impacted users and not so much on an arbitrarily defined priority.
If we were to only look at the number of requests I'd say this issue is not that important as 751 requests is only 0.00065% of the total number of requests per 24 hours (roughly).
In other words, fixing this isn't unimportant, but it will only affect a tiny fraction of the total number of requests. I'd rather work on issues based on the number of affected requests, in descending order.
Translating this directly to request counts is a bit tricky (e.g. a lot of our traffic is API traffic for CI, so it's very hard to ever get a controller to go above 50% of the daily total), but at least we can use the request count as an indicator.
To give an idea: our most requested controller is Projects::GitHttpController#info_refs with 725 000 requests. This is only 0.6% of the daily total (1440 minutes × roughly 80 000 requests per minute = 115 200 000 requests per 24 hours).
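To make that arithmetic concrete, here is a minimal sketch of the "sort by affected requests" idea; the second controller name is hypothetical, and the request counts are the two figures quoted above:

```python
# Rank controllers by their share of daily traffic, highest first.
# "Hypothetical::SlowController#show" is a made-up name; the request
# counts are the ones quoted in this discussion.

DAILY_TOTAL = 1440 * 80_000  # 1440 minutes x ~80 000 requests/minute

controllers = {
    "Projects::GitHttpController#info_refs": 725_000,
    "Hypothetical::SlowController#show": 751,
}

for name, count in sorted(controllers.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {count} requests ({count / DAILY_TOTAL:.5%} of daily total)")
```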
@yorickpeterse the Gitaly team were trying to figure this out exactly, and what they did is to take count × p99 time as the metric to sort on. This makes sense to me... How hard would it be to add a column to the table that lists this "impact" metric?
We can then certainly use a numerical cutoff to define P1, P2, P3. I am strongly in favor of quantified labels, and they need to be logarithmically separated, i.e. P1 needs to be 2×, 3×, or 10× more important than P2, otherwise it just leads to saturation of P labels.
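For illustration, a rough sketch of that count × p99 metric with logarithmically separated cutoffs; the thresholds below are invented to show the 10× spacing, not agreed values:

```python
# Impact = total time spent in the slow tail: request count x p99 latency.
def impact(request_count, p99_seconds):
    return request_count * p99_seconds

# Cutoffs separated by 10x so the labels don't saturate at the top.
def p_label(impact_seconds):
    if impact_seconds >= 100_000:
        return "P1"
    if impact_seconds >= 10_000:
        return "P2"
    return "P3"

print(p_label(impact(725_000, 0.5)))  # busy endpoint -> P1
print(p_label(impact(751, 5.0)))      # slow but rarely hit -> P3
```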
@mydigitalself I started proposing something quantifiable in https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/6019; read the discussion there, we decided against it... but that was for use inside the Production Team. It may make sense to revive it, but using it only and explicitly for items that need escalation beyond the Production Team.
@ernstvn I really dislike labels such as "P1" because their names are not descriptive. When we still used those in the past, every time I used them I first had to look up what exactly they meant. I prefer something more verbose, like "High priority" or "Priority: high".
> and what they did is to take count × p99 time as the metric to sort on. This makes sense to me
I'm not entirely sure. This could end up prioritising one controller over another even if the impact is lower. For example, consider these two controllers:
- Controller A: 2 requests, p99 of 5 seconds = "impact" of 10 sec
- Controller B: 10 requests, p99 of 500 ms = "impact" of 5 sec
Here A is technically slower, but it's used less than B; solving A would probably have a smaller real-world impact than solving B first. Apdex would be useful, but we can't automatically calculate it using InfluxDB (and doing it manually is too annoying).
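Here is that counterexample in numbers, with the standard Apdex formula included for reference (it's the textbook definition, not something we can compute from InfluxDB today):

```python
def impact(count, p99_seconds):
    return count * p99_seconds

a = impact(2, 5.0)    # Controller A: 10 "impact seconds"
b = impact(10, 0.5)   # Controller B:  5 "impact seconds"
assert a > b          # the metric picks A first, despite B's wider reach

# Standard Apdex: (satisfied + tolerating / 2) / total samples.
def apdex(satisfied, tolerating, total):
    return (satisfied + tolerating / 2) / total
```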
To me, "high priority" does not convey enough information. Unless we codify what qualifies as "high priority", its value erodes over time to the point where everything becomes high priority, since that's the label that seems to get attention. But then if you do codify what it means to qualify as high priority, I don't care whether you call it High Priority or P1... that's just the name you give the beast.
Your example is correct, and I stand by the reasoning that Controller A would need to be addressed before Controller B. BUT, it would be informative to also look at p50, and maybe p90, since p99 can reflect truly outlying behavior instead of the behavior most commonly seen by the bulk of users. AND it is also fair to consider the "cost" of the work... so that whatever scores highest on (impact)/(cost) is done first. We followed that approach in the Risk Assessment, but it seems overkill in this context, at least overkill when done in detail...
Bottom line(s):
- I'm in favor of helping @mydigitalself and the rest of the company understand relative priorities through a quantified metric.
- My proposal for the metrics was started in www-gitlab-com!6019 (merged) but would need to be changed to be more specific to availability and performance.
- For performance, I favor the count × timing metric, whether we settle on p50, p90, or p99 (see the sketch below).
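As a rough illustration of ranking by (impact)/(cost) with the count × timing metric: the choice of p90, the issue names, counts, and costs are all invented for the sake of the example:

```python
# Rank issues by impact per unit of cost, highest first. Impact is
# request count x p90 timing; cost is a rough effort estimate.
issues = [
    {"name": "slow diff rendering", "count": 50_000, "p90_s": 2.0, "cost": 5},
    {"name": "N+1 in MR list", "count": 300_000, "p90_s": 0.8, "cost": 2},
]

for issue in issues:
    issue["score"] = issue["count"] * issue["p90_s"] / issue["cost"]

for issue in sorted(issues, key=lambda i: i["score"], reverse=True):
    print(f"{issue['name']}: score {issue['score']:.0f}")
```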
@ernstvn @yorickpeterse whilst I tend to agree with you, we already have a precedent with security (SL1, etc.) and support (SP1, etc.), so from a planning perspective we always know to look at the 1's first.
I don't mind if the definitions include some metric timings and impacts, but it would just make life pretty simple and well understood if we have PP1, PP2 & PP3.
I'm with @yorickpeterse regarding the names being descriptive, but I also see the value this has for scheduling in an async fashion.
Regarding prioritizing, when I created the initial batch of issues the reasoning was: sort by amount, in descending order. If the p99 is over the SLA we want to achieve (1s), create an issue. If the mean is over the SLA, then it should be a high priority.
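A minimal sketch of that triage rule, assuming the 1 s SLA; the timings passed in are placeholders rather than real controller measurements:

```python
SLA = 1.0  # seconds

def triage(p99, mean):
    if mean > SLA:
        return "create issue, high priority"  # the typical request is slow
    if p99 > SLA:
        return "create issue"  # only the slow tail breaches the SLA
    return "within SLA"

print(triage(p99=2.5, mean=1.4))  # create issue, high priority
print(triage(p99=1.2, mean=0.3))  # create issue
print(triage(p99=0.8, mean=0.2))  # within SLA
```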
> If the p99 is over the SLA we want to achieve (1s), create an issue.
I assume you are referring specifically to issues here that relate to Rails controller timings, whether git, sql, nfs, or cache timings? But of course there are other performance and availability issues that don't yet lend themselves to such easy measurement.
Even so, the sentence above indicates that you do have a way to quantify priority by how slowly something is performing. What would be the equivalent way to quantify the priority of issues that affect availability? Once we settle on the metrics for those two attributes, we can quantify :all_the_things: and prioritize in a fully data-driven way.
> I assume you are referring specifically to issues here that relate to Rails controller timings, whether git, sql, nfs, or cache timings?
Correct, I was describing what I did when I created a bunch of performance issues.
> But of course there are other performance and availability issues that don't yet lend themselves to such easy measurement.
Correct, and that is why I'm trying to understand whether what we are doing here is separating priorities by team, like SP1 for Security Priority 1 (I'm assuming that's what it stands for) and PP1 for Production Priority 1, in which case we should also add BP1 for Build Priority 1 and maybe GP1 for Gitaly Priority 1. Then we find ourselves at the crossroads of which one is a higher priority: build? gitaly? production? security?
> the sentence above indicates that you do have a way to quantify priority by how slowly something is performing. What would be the equivalent way to quantify the priority of issues that affect availability? Once we settle on the metrics for those two attributes, we can quantify :all_the_things: and prioritize in a fully data-driven way.
I do see the value in having a single query to prioritize the work that should be picked for the next milestone, basically because this is the way people are used to working. It may be performance or availability or stability, but in general I would like to have a tool to say: gimme this ASAP because it's causing pain.
> in which case we should also add BP1 for Build Priority 1 and maybe GP1 for Gitaly Priority 1. Then we find ourselves at the crossroads of which one is a higher priority: build? gitaly? production? security?
It's possible that it snowballs into that. But I don't think that is as much of a problem at this point compared to the jungle of issues without prioritization. In any case, we can also reduce the noise by using PP labels for anything related to availability and performance, as I had started drafting in https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/6019. It can be used by the Gitaly team, the Build team... any team. If we need to rename the label to AP or PA, to stand for "Availability & Performance", in order to make it clear that it's not intended to be constrained to one team, then we can do that.
@pcarranza did you mean AP1, AP2, etc... or just AP as a general category?
> Then we find ourselves at the crossroads of which one is a higher priority: build? gitaly? production? security?
That's just my life in general. I have to prioritise strategic/direction initiatives vs improvements vs customer requests vs customer bugs vs new EE features vs production/performance issues.
All I'm saying is that, without being as close to the production environment as you folks are, I'm struggling to determine your priorities and what's most important for you that we can work on.
Right, there is only one label type, namely AP, with levels 1-3. In the merge request I use shorthand for urgency and impact to help pick the right level for the AP label, but one is not supposed to just use U1 and I2; one uses the resulting AP1 instead.
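To illustrate how the urgency/impact shorthand could roll up into a single AP level: the actual mapping lives in the merge request above, so the matrix below is purely a guess for illustration:

```python
# Hypothetical roll-up of urgency (U1-U3) and impact (I1-I3), where
# 1 is the most severe, into a single AP level. The real mapping is
# defined in www-gitlab-com!6019; this one is invented.

def ap_level(urgency, impact):
    level = min(urgency, impact)  # start from the worse dimension
    # Soften by one level when the other dimension is at its mildest.
    if max(urgency, impact) == 3 and level < 3:
        level += 1
    return level

print(f"AP{ap_level(urgency=1, impact=2)}")  # U1 + I2 -> AP1 in this sketch
```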