I would propose doing something similar to how we deal with the security and customer support labels, which use a 1-3 grading of severity. Issues labelled availability may not always take priority over performance ones, depending on how frequent or likely they are to occur, or whether there are valid short-term workarounds, for instance.
- Any P1 issues we will look to schedule for the upcoming release; they should be considered Deliverable.
- Any P2 issues we should look to address within a reasonable timeframe, provided development bandwidth is not fully saturated by P1s.
- Any P3 issues will only be addressed once all P2 issues have been addressed.
@DouweM @yorickpeterse @pcarranza @ernstvn What do you think of this? I want to make sure we're not just dedicating development time to "Performance & Availability" but rather being smart about how we do this, as right now we really don't have a good set of criteria for making prioritisation decisions on these issues.
A difficult thing here is that performance issues are either super critical or all equally important. For example, regressions usually fall in the first category.
Instead of using priorities I wonder if we can somehow "tag" issues to indicate what percentage of users (let's use GitLab.com as a reference for that) is affected by them. That's a bit more useful than "P1". The idea here would be to start with what affects the most users, which we can measure by counting how many requests are sent to the controller associated with the issue.
So in other words, I propose focusing on the number of impacted users and not so much on an arbitrarily defined priority.
If we were to only look at the number of requests I'd say this issue is not that important as 751 requests is only 0.00065% of the total number of requests per 24 hours (roughly).
In other words, fixing this isn't unimportant, but it will only affect a tiny fraction of the total number of requests. I'd rather work on issues based on the number of affected requests, in descending order.
Translating this directly to request counts is a bit tricky (e.g. a lot of our traffic is API traffic for CI, so it's very hard to ever get a controller to go above 50% of the daily total), but at least we can use the request count as an indicator.
To give an idea: our most requested controller is Projects::GitHttpController#info_refs with 725 000 requests. This is only 0.6% of the daily total (1440 minutes × roughly 80 000 requests per minute = 115 200 000 requests per 24 hours).
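To make that arithmetic concrete, here is a minimal sketch of the "sort by affected requests" idea; the second controller name is hypothetical, and the request counts are the two figures quoted above:

```python
# Rank controllers by their share of daily traffic, highest first.
# "Hypothetical::SlowController#show" is a made-up name; the request
# counts are the ones quoted in this discussion.

DAILY_TOTAL = 1440 * 80_000  # 1440 minutes x ~80 000 requests/minute

controllers = {
    "Projects::GitHttpController#info_refs": 725_000,
    "Hypothetical::SlowController#show": 751,
}

for name, count in sorted(controllers.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {count} requests ({count / DAILY_TOTAL:.5%} of daily total)")
```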
@yorickpeterse the Gitaly team were trying to figure this out exactly, and what they did is to take count × p99 time as the metric to sort on. This makes sense to me... How hard would it be to add a column to the table that lists this "impact" metric?
We can then certainly use a numerical cutoff to define P1, P2, P3. I am strongly in favor of quantified labels, and they need to be logarithmically separated, i.e. P1 needs to be 2×, 3×, or 10× more important than P2, otherwise it just leads to saturation of P labels.
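For illustration, a rough sketch of that count × p99 metric with logarithmically separated cutoffs; the thresholds below are invented to show the 10× spacing, not agreed values:

```python
# Impact = total time spent in the slow tail: request count x p99 latency.
def impact(request_count, p99_seconds):
    return request_count * p99_seconds

# Cutoffs separated by 10x so the labels don't saturate at the top.
def p_label(impact_seconds):
    if impact_seconds >= 100_000:
        return "P1"
    if impact_seconds >= 10_000:
        return "P2"
    return "P3"

print(p_label(impact(725_000, 0.5)))  # busy endpoint -> P1
print(p_label(impact(751, 5.0)))      # slow but rarely hit -> P3
```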
@mydigitalself I started proposing something quantifiable in https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/6019; read the discussion there, we decided against it... but that was for use inside the Production Team. It may make sense to revive it, but using it only and explicitly for items that need escalation beyond the Production Team.
@ernstvn I really dislike labels such as "P1" because their names are not descriptive. When we still used those in the past, every time I used them I first had to look up what exactly they meant. I prefer something more verbose, like "High priority" or "Priority: high".
> and what they did is to take count × p99 time as the metric to sort on. This makes sense to me
I'm not entirely sure. This could end up prioritising one controller over another even if the impact is lower. For example, consider these two controllers:
- Controller A: 2 requests, p99 of 5 seconds = "impact" of 10 sec
- Controller B: 10 requests, p99 of 500 ms = "impact" of 5 sec
Here A is technically slower, but it's used less than B; solving A would probably have a smaller real-world impact than solving B first. Apdex would be useful, but we can't automatically calculate it using InfluxDB (and doing it manually is too annoying).
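Here is that counterexample in numbers, with the standard Apdex formula included for reference (it's the textbook definition, not something we can compute from InfluxDB today):

```python
def impact(count, p99_seconds):
    return count * p99_seconds

a = impact(2, 5.0)    # Controller A: 10 "impact seconds"
b = impact(10, 0.5)   # Controller B:  5 "impact seconds"
assert a > b          # the metric picks A first, despite B's wider reach

# Standard Apdex: (satisfied + tolerating / 2) / total samples.
def apdex(satisfied, tolerating, total):
    return (satisfied + tolerating / 2) / total
```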
To me, "high priority" does not convey enough information. Unless we codify what qualifies as "high priority", its value erodes over time to the point where everything becomes high priority, since that's the label that seems to get attention. But then if you do codify what it means to qualify as high priority, I don't care whether you call it High Priority or P1... that's just the name you give the beast.
Your example is correct, and I stand by the reasoning that Controller A would need to be addressed before Controller B. BUT, it would be informative to also look at p50, and maybe p90, since p99 can reflect truly outlying behavior instead of the behavior most commonly seen by the bulk of users. AND it is also fair to consider the "cost" of the work... so that whatever scores highest on (impact)/(cost) is done first. We followed that approach in the Risk Assessment, but it seems overkill in this context, at least overkill when done in detail...
Bottom line(s):
- I'm in favor of helping @mydigitalself and the rest of the company understand relative priorities through a quantified metric.
- My proposal for the metrics was started in www-gitlab-com!6019 (merged) but would need to be changed to be more specific to availability and performance.
- For performance, I favor the count × timing metric, whether we settle on p50, p90, or p99 (see the sketch below).
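As a rough illustration of ranking by (impact)/(cost) with the count × timing metric: the choice of p90, the issue names, counts, and costs are all invented for the sake of the example:

```python
# Rank issues by impact per unit of cost, highest first. Impact is
# request count x p90 timing; cost is a rough effort estimate.
issues = [
    {"name": "slow diff rendering", "count": 50_000, "p90_s": 2.0, "cost": 5},
    {"name": "N+1 in MR list", "count": 300_000, "p90_s": 0.8, "cost": 2},
]

for issue in issues:
    issue["score"] = issue["count"] * issue["p90_s"] / issue["cost"]

for issue in sorted(issues, key=lambda i: i["score"], reverse=True):
    print(f"{issue['name']}: score {issue['score']:.0f}")
```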
@ernstvn @yorickpeterse whilst I tend to agree with you, we already have a precedent with security (SL1, etc.) and support (SP1, etc.), so from a planning perspective we always know to look at the 1's first.
I don't mind if the definitions include some metric timings and impacts, but it would just make life pretty simple and well understood if we have PP1, PP2 & PP3.
I'm with @yorickpeterse regarding the names being descriptive, but I also see the value this has for scheduling in an async fashion.
Regarding prioritizing, when I created the initial batch of issues the reasoning was: sort by amount, in descending order. If the p99 is over the SLA we want to achieve (1s), create an issue. If the mean is over the SLA, then it should be a high priority.
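A minimal sketch of that triage rule, assuming the 1 s SLA; the timings passed in are placeholders rather than real controller measurements:

```python
SLA = 1.0  # seconds

def triage(p99, mean):
    if mean > SLA:
        return "create issue, high priority"  # the typical request is slow
    if p99 > SLA:
        return "create issue"  # only the slow tail breaches the SLA
    return "within SLA"

print(triage(p99=2.5, mean=1.4))  # create issue, high priority
print(triage(p99=1.2, mean=0.3))  # create issue
print(triage(p99=0.8, mean=0.2))  # within SLA
```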
> If the p99 is over the SLA we want to achieve (1s), create an issue.
I assume you are referring specifically to issues here that relate to Rails controller timings, whether git, sql, nfs, or cache timings? But of course there are other performance and availability issues that don't yet lend themselves to such easy measurement.
Even so, the sentence above indicates that you do have a way to quantify priority by how slowly something is performing. What would be the equivalent way to quantify the priority of issues that affect availability? Once we settle on the metrics for those two attributes, we can quantify :all_the_things: and prioritize in a fully data-driven way.
> I assume you are referring specifically to issues here that relate to Rails controller timings, whether git, sql, nfs, or cache timings?
Correct, I was describing what I did when I created a bunch of performance issues.
> But of course there are other performance and availability issues that don't yet lend themselves to such easy measurement.
Correct, and that is why I'm trying to understand whether what we are doing here is separating priorities by team, like SP1 for Security Priority 1 (I'm assuming that's what it stands for) and PP1 for Production Priority 1, in which case we should also add BP1 for Build Priority 1 and maybe GP1 for Gitaly Priority 1. Then we find ourselves at the crossroads of which one is a higher priority: build? gitaly? production? security?
> the sentence above indicates that you do have a way to quantify priority by how slowly something is performing. What would be the equivalent way to quantify the priority of issues that affect availability? Once we settle on the metrics for those two attributes, we can quantify :all_the_things: and prioritize in a fully data-driven way.
I do see the value in having a single query to prioritize the work that should be picked for the next milestone, basically because this is the way people are used to working. It may be performance or availability or stability, but in general I would like to have a tool to say: gimme this ASAP because it's causing pain.
> in which case we should also add BP1 for Build Priority 1 and maybe GP1 for Gitaly Priority 1. Then we find ourselves at the crossroads of which one is a higher priority: build? gitaly? production? security?
It's possible that it snowballs into that. But I don't think that is as much of a problem at this point compared to the jungle of issues without prioritization. In any case, we can also reduce the noise by using PP labels for anything related to availability and performance, as I had started drafting in https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/6019. It can be used by the Gitaly team, the Build team... any team. If we need to rename the label to AP or PA, to stand for "Availability & Performance", in order to make it clear that it's not intended to be constrained to one team, then we can do that.
@pcarranza did you mean AP1, AP2, etc... or just AP as a general category?
> Then we find ourselves at the crossroads of which one is a higher priority: build? gitaly? production? security?
That's just my life in general. I have to prioritise strategic/direction initiatives vs improvements vs customer requests vs customer bugs vs new EE features vs production/performance issues.
All I'm saying is that, without being as close to the production environment as you folks are, I'm struggling to determine your priorities and what's most important for you that we can work on.
Right, there is only one label type, namely AP, with levels 1-3. In the merge request I use shorthand for urgency and impact to help pick the right level for the AP label, but one is not supposed to just use U1 and I2; one uses the resulting AP1 instead.
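To illustrate how the urgency/impact shorthand could roll up into a single AP level: the actual mapping lives in the merge request above, so the matrix below is purely a guess for illustration:

```python
# Hypothetical roll-up of urgency (U1-U3) and impact (I1-I3), where
# 1 is the most severe, into a single AP level. The real mapping is
# defined in www-gitlab-com!6019; this one is invented.

def ap_level(urgency, impact):
    level = min(urgency, impact)  # start from the worse dimension
    # Soften by one level when the other dimension is at its mildest.
    if max(urgency, impact) == 3 and level < 3:
        level += 1
    return level

print(f"AP{ap_level(urgency=1, impact=2)}")  # U1 + I2 -> AP1 in this sketch
```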