diff --git a/source/handbook/engineering/performance/index.html.md b/source/handbook/engineering/performance/index.html.md
index a23cf20c456947d173deff01dd50a78381476bc0..48261bb2d984d02702657da1d1dee2f7b6632619 100644
--- a/source/handbook/engineering/performance/index.html.md
+++ b/source/handbook/engineering/performance/index.html.md
@@ -160,6 +160,51 @@ revision, new revision, and ref (e.g. tag or branch) name.
 1. Sidekiq updates PostgreSQL
 1. Unicorn can now query PostgreSQL.
 
+
+## Availability and Performance Priority Labels
+{: #performance-labels}
+
+To clarify the priority of issues that relate to GitLab.com's availability and
+performance consider adding an _Availability and Performance Priority Label_,
+`~AP1` through `~AP3`. This is similar to what is in use in the Support and
+Security teams, they use `~SE` and `~SL` labels respectively to indicate
+priority.
+
+Use the following as a guideline to determine which Availability and Performance
+Priority label to use for bugs and feature proposals. Consider the _likelihood_
+and _urgency_ of the "scenario" that could result from this issue (not) being
+resolved.
+
+- **Urgency:** _Examples_
+  - U1
+    - Outage likely within a month.
+    - Affects many team members and/or many GitLab.com users
+  - U2
+    - Outage likely within three months.
+    - Affects some team members and/or a few GitLab.com users
+  - U3
+    - Outage can happen, but not likely in next three months.
+    - Affects some team members but no GitLab.com users
+
+- **Impact:** _Examples_
+  - I1
+    - Outage of >= 25 minutes.
+    - Performance improvement (or avoiding degradation) of >= 100 ms expected.
+  - I2
+    - Outage of 5 - 25 minutes.
+    - Performance improvement (or avoiding degradation) of 10-100 ms expected.
+  - I3
+    - Outage of 0 - 5 minutes.
+    - Performance improvement (or avoiding degradation) of <= 10 ms expected.
+
+
+| **Urgency \ Impact**       | **I1 - High** | **I2 - Medium**  | **I3 - Low**   |
+|----------------------------|---------------|------------------|----------------|
+| **U1 - High**              | `AP1`         | `AP1`            | `AP2`          |
+| **U2 - Medium**            | `AP1`         | `AP2`            | `AP3`          |
+| **U3 - Low**               | `AP2`         | `AP3`            | `AP3`          |
+
+
 ## Database Performance
 
 Some general notes about parameters that affect database performance, at a very
diff --git a/source/handbook/infrastructure/index.html.md b/source/handbook/infrastructure/index.html.md
index 7af4efda640460c8db7511c3542e886778d13109..8f5847411192314882e167f3ef754f757bc3f268 100644
--- a/source/handbook/infrastructure/index.html.md
+++ b/source/handbook/infrastructure/index.html.md
@@ -28,6 +28,7 @@ title: "Infrastructure"
 
 - [GitLab.com architecture](production-architecture/)
 - [Monitoring GitLab.com](monitoring/)
+- [Performance of GitLab.com](/handbook/engineering/performance)
 - [Database team handbook](database/)
 - [Gitaly team handbook](gitaly/)
 - [Production team handbook](production/)
@@ -46,7 +47,7 @@ infrastructure team works on is in fact the first issue in the gitlab-ce issue
 tracker; for more on pingdom see the
 [monitoring page](/handbook/infrastructure/monitoring/)), measured per
 calendar month, and as recorded on
- [pingdom](http://stats.pingdom.com/81vpf8jyr1h9/1902794/history).
+ [pingdom](http://stats.pingdom.com/81vpf8jyr1h9/1902794/history).
 1. GitLab.com's performance.
     - Current goal: [99% of user requests < 1 second](https://performance.gitlab.net/dashboard/db/transaction-overview?panelId=2&fullscreen&orgId=1)
     - Latency here is _currently_ measured via the "Transaction Timings"
@@ -270,6 +271,7 @@ in the invite, or get in touch in the production chat channel to ask.
 
 Any team or individual can initiate a change to GitLab.com by following this checklist.
 Create an issue in the infrastructure [issue tracker](https://gitlab.com/gitlab-com/infrastructure/issues) and select the `change_checklist` template
+
 ## Make GitLab.com settings the default
 
 As said in the [production engineer job description](jobs/production-engineer/index.html)
diff --git a/source/handbook/infrastructure/production/index.html.md b/source/handbook/infrastructure/production/index.html.md
index 8cefbea3e72ae96c448c58542983cb5a7455cf3d..783688c5ef50d16fcdb67d45e3f2852f87548393 100644
--- a/source/handbook/infrastructure/production/index.html.md
+++ b/source/handbook/infrastructure/production/index.html.md
@@ -41,25 +41,26 @@ own time as the main scarce resource.
 1. Transparency, clarity and directness: public and explicit by default, we work in the open, we strive to get signal over noise.
 1. Efficiency: smart resource usage, we should not fix scalability problems by throwing more resources at it but by understanding where the waste is happening and then working to make it disappear. We should work hard to reduce toil to a minimum by automating all the boring work out of our way.
 
-## Prioritizing Issues
+## Workflow
 
-Given the variety of responsibilities and number of "interfaces" between the Production
-team and all the other teams at GitLab, here is a guideline on how to prioritize
-the issues we work on. Basing this on the [goals of the Infrastructure team](../#infragoals) as
-well as our [values](/handbook/values/) and [workflows](/handbook/engineering/workflow)
-as a company as whole, the priority should be:
+### Workout of the Week (WoW) Milestone
 
-1. keeping GitLab.com available - and secure
-1. unblocking others
-1. automating tasks to reduce toil and increase _team_ availability (but be
-   explicit about the [costs](https://xkcd.com/1319/) and [benefits](https://xkcd.com/1205/)
-1. improving performance of GitLab.com while being conscious of cost
-1. reducing costs of running GitLab.com
+Issues in the tracker are organized into [milestones](https://gitlab.com/gitlab-com/infrastructure/milestones)
+to define the "workout of the week" (WoW) from one week to the next. The "week"
+runs from Wednesday to end of Tuesday. The other milestone in use is "Next WoW"
+to track items scheduled for the next week. Every week, the Production Lead
+renames the WoW to "WoW ending yyyy-mm-dd", and closes it; then renames "Next
+WoW" to "WoW". By doing this, the closed milestones provide a history of what
+the team has worked on, while the team only needs to be concerned with two open
+ milestones. If issues are added to the "WoW" after the week has already
+ started, add the `~unscheduled` label (not needed if the issue is `~outage`
+ since those are by definition unscheduled).
 
 ### Labeling Issues
 
-We use [issue labels](https://gitlab.com/gitlab-com/infrastructure/labels) to
-assist in organizing issues within the Infrastructure issue tracker. Prioritized labels are
+We use [issue labels](https://gitlab.com/gitlab-com/infrastructure/labels)
+within the Infrastructure issue tracker to assist in prioritizing and organizing
+work. Prioritized labels are:
 
 - `~(perceived) data loss`
 - `~critical`
@@ -68,13 +69,17 @@ assist in organizing issues within the Infrastructure issue tracker. Prioritized
 - `~outage`
 - `~blocked`
 
-### Workout of the Week (WoW) Milestone
-
-Issues in this tracker are organized into [milestones](https://gitlab.com/gitlab-com/infrastructure/milestones) to define the "workout of the week" (WoW) from one week to the next. The "week" runs from Wednesday to end of Tuesday. The other milestone in use is "Next WoW" to track items scheduled for the next week. Every week, the Production Lead renames the WoW to "WoW ending yyyy-mm-dd", and closes it; then renames "Next WoW" to "WoW". By doing this, the closed milestones provide a history of what the team has worked on, while the team only needs to be concerned with two open milestones. If issues are added to the "WoW" after the week has already started, add the `~unscheduled` label (not needed if the issue is `~outage` since those are by definition unscheduled).
+We also use the `~AP1`, `~AP2`, `~AP3` labels as described in [availability &
+performance priority labels](/handbook/engineering/performance/#performance-labels).
+Those are mainly used to communicate priority of issues to Product Managers, for
+scheduling purposes.
 
 ### Issue or outage hand off
 
-Ongoing outages, as well as issues that have the `~(perceived) data loss` label and are (therefore) actively being worked on need a hand off to happen as team members cycle in and out of their timezones and availability. The on call log can be used to assist with this. (See link at top to on-call log).
+Ongoing outages, as well as issues that have the `~(perceived) data loss` label
+ and are (therefore) actively being worked on need a hand off to happen as team
+ members cycle in and out of their timezones and availability. The on call log
+ can be used to assist with this. (See link at top to on-call log).
 
 ## Production events logging