Commit 23ebfc5e authored by Anthony Sandoval

Merge branch 'scoped-change-labels-production-project' into 'master'

cleaning up the change-management page and adding scoped change labels

See merge request gitlab-com/www-gitlab-com!36900
parents 80e140b6 a42563fc
@@ -11,10 +11,11 @@ title: "Change Management"
 
# Changes
 
Changes are any **modification to the production environment** and are classified into two types:
 
* **Service changes** are regular, routine changes executed through well-tested, automated procedures with minimal human interaction. They may cause predictable, limited performance degradation and no downtime. As such, service changes do not require review or approval except on their very first iteration.
* **Maintenance changes** are complex changes that require manual intervention and will cause downtime or significant performance degradation. These changes require strict scheduling, careful planning and review, and approval by the Director of Reliability.
 
**Deployments** are a special change metatype: depending on their scope and the effect they may have on the environment, they fall into one of the types defined above. As we make progress towards CI/CD, we aim to turn all deployments into simple service changes.
 
@@ -44,153 +45,91 @@ Change severities encapsulate the risk associated with a change in the environme
* ~S3 and ~S4 changes are allowed to take place concurrently as long as there is awareness of said concurrency.
* The Infrastructure on-call resource has veto power over any and all changes.
 
## Change Plans
All changes should have change plans. Planning is how the infrastructure department assesses and mitigates the risks that changes introduce. Change plans generate awareness and serve as the focal point for scheduling, communicating, and recording changes.
 
# Change Request Workflows
 
Plan issues are opened in the [production](https://gitlab.com/gitlab-com/gl-infra/production/issues) project tracker. Each issue should be opened using the issue template for the corresponding level of criticality: `C1`, `C2`, `C3`, or `C4`. It must provide a detailed description of the proposed change and include all the information requested in the template. Every plan issue is initially labeled `~"change::unscheduled"` until it can be reviewed and scheduled with a Due Date. After the plan is approved and scheduled, it should be labeled `~"change::scheduled"` for visibility.
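For teams that script this workflow, the following is a minimal sketch of opening a plan issue with the initial scoped label via the GitLab API, assuming the `python-gitlab` client and a `GITLAB_TOKEN` environment variable with `api` scope; the title, description body, and dates are illustrative placeholders, not part of the documented process.

```python
# Minimal sketch: open a change plan issue in the production tracker with
# the initial scoped label, then reschedule it once approved.
import os

import gitlab

gl = gitlab.Gitlab("https://gitlab.com", private_token=os.environ["GITLAB_TOKEN"])
project = gl.projects.get("gitlab-com/gl-infra/production")

issue = project.issues.create(
    {
        # Placeholders: the description should be filled in from the matching
        # criticality template (e.g. change_c2.md).
        "title": "Change plan: <summary of the proposed change>",
        "description": "<body of the change_c2.md template, filled in>",
        "labels": ["change::unscheduled"],
    }
)

# Once the plan is reviewed and scheduled, set the Due Date and flip the label.
issue.due_date = "2020-07-01"  # placeholder date
issue.labels = ["change::scheduled"]
issue.save()
```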
 
## Change Criticalities
 
### Criticality 1
These are changes with high impact or high risk. If a change is going to cause downtime to the environment, it is always categorized as a `C1`. Before implementing the change, the plan must be approved through the steps described below.
**Examples of Criticality 1:**
 
1. Any changes to Postgres hosts that affect DB functionality - quantity of nodes, changes to backup or replication strategy
1. Architectural changes to Infra as code (IaC)
1. IaC changes to pets - Postgres, Redis, and other Single Points of Failure
1. Changes of major vendors - CDN, mail, DNS
1. Major version upgrades of tooling (HAProxy, Chef)
 
#### Approval
1. Add a Due Date to the issue and an event to the [GitLab Production](https://calendar.google.com/calendar/embed?src=gitlab.com_si2ach70eb1j65cnu040m3alq0%40group.calendar.google.com) calendar.
1. Have the change approved by Reliability Engineering management.
1. Identify the Engineer On-Call (EOC) scheduled for the time of the change and review the plan with them.
1. Announce the start of the plan execution in the `#production` Slack channel and obtain written approval from the EOC in both the issue and Slack (a scripted sketch of this announcement follows the template link below).
1. Join the "Situation Room" Zoom channel with the EOC and obtain verbal approval to start the plan execution.

The EOC must be engaged for the entire duration of the plan execution.
[Criticality 1 plan template](https://gitlab.com/gitlab-com/gl-infra/production/blob/master/.gitlab/issue_templates/change_c1.md)
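The announcement step above can be partly automated. This is a minimal sketch assuming the `slack_sdk` package and a bot token permitted to post in `#production`; the issue URL in the message is a placeholder, and written EOC approval must still be captured on the issue as described above.

```python
# Minimal sketch: announce the start of plan execution in #production.
import os

from slack_sdk import WebClient

slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
slack.chat_postMessage(
    channel="#production",
    text=(
        "Starting execution of change plan "
        "https://gitlab.com/gitlab-com/gl-infra/production/issues/<id> "
        "- EOC please confirm approval in the issue and in this thread."
    ),
)
```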
### Criticality 2
These are changes that are not expected to cause downtime, but which still carry some risk of impact if something unexpected happens. For example, reducing the size of a fleet of cattle is usually ok because we've identified over-provisioning, but we need to take care and monitor carefully before and after.
**Examples of Criticality 2:**
1. Load Balancer Configuration - major changes to backends or front ends, fundamental to traffic flow
1. IaC changes to cattle / quantity when there is a decrease
1. Minor version upgrades of tools or components (HAProxy)
1. Removing old hosts from IaC (like removals of legacy infrastructure)
 
#### Approval
1. Add a Due Date to the issue and an event to the [GitLab Production](https://calendar.google.com/calendar/embed?src=gitlab.com_si2ach70eb1j65cnu040m3alq0%40group.calendar.google.com) calendar.
1. Identify the Engineer On-Call (EOC) scheduled for the time of the change and review the plan with them.
1. Announce the start of the plan execution in the `#production` Slack channel and obtain written approval from the EOC in both the issue and Slack.
[Criticality 2 plan template](https://gitlab.com/gitlab-com/gl-infra/production/blob/master/.gitlab/issue_templates/change_c2.md)
 
### Criticality 3
These are changes with either no or very low risk of negative impact, but where there is still some inherent complexity, or where the change is not fully automated and hands-off.
 
**Examples of Criticality 3:**
1. IaC changes to cattle / quantity when there is an increase (not requiring reboot or destroy/recreate)
1. Changes in configuration for current systems serving customers related to DNS or CDN
 
#### Approval
 
1. Add a Due Date to the issue.
1. Identify the Engineer On-Call (EOC) scheduled for the time of the change and review the plan with them.
 
[Criticality 3 plan template](https://gitlab.com/gitlab-com/gl-infra/production/blob/master/.gitlab/issue_templates/change_c3.md)
 
### Criticality 4
 
These are changes that are exceedingly low risk and commonly executed, or which are fully automated. Often these will be changes that are mainly being recorded for visibility rather than as a substantial control measure.
 
**Examples of Criticality 4:**
1. Any procedural invocation such as a SQL script, a Ruby script module, or a rake task performed on a production console server, either using `gitlab-rails` or `gitlab-rake`.
1. Any invocation of an existing code pathway which will ultimately perform a mutate operation on live data. This is distinguished from diagnostic investigation operations, which should typically be limited to read-only operations. It is left to the discretion of the engineer whether a peer should be included to co-observe the invocation of such diagnostics.
 
#### Approval
 
No approval required.
 
[Criticality 4 plan template](https://gitlab.com/gitlab-com/gl-infra/production/blob/master/.gitlab/issue_templates/change_c4.md)
 
### Change Plans Summary
 
With change plans, we develop a solid library of change procedures. Even more importantly, they provide detailed blueprints for the implementation of defensive automation. Building on that, every change request that uses some sort of script _must have a dry-run capability_: the script should be run in dry-run mode and its output provided in the CR for review. Ideally, the planner and the executor should be different individuals.
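As an illustration of the dry-run requirement, here is a minimal sketch of the pattern: the script defaults to dry-run and only mutates when an explicit flag is passed, so the dry-run output can be attached to the CR for review. The `--execute` flag name and the orphaned-uploads cleanup are illustrative assumptions, not a prescribed interface.

```python
# Minimal sketch of the dry-run pattern: read-only by default, mutate only
# when --execute is passed. The cleanup itself is a made-up example change.
import argparse


def mutate(description: str, action, execute: bool) -> None:
    """Log the intended operation; perform it only when execute is True."""
    print(f"[{'EXECUTE' if execute else 'DRY-RUN'}] {description}")
    if execute:
        action()


def main() -> None:
    parser = argparse.ArgumentParser(description="Example change script")
    parser.add_argument(
        "--execute",
        action="store_true",
        help="apply the change for real (default: dry-run)",
    )
    args = parser.parse_args()

    # Placeholder discovery step; a real script would query for candidates.
    orphaned = ["upload-123", "upload-456"]
    for upload in orphaned:
        mutate(
            f"delete orphaned upload {upload}",
            lambda u=upload: print(f"deleted {u}"),  # placeholder mutation
            args.execute,
        )


if __name__ == "__main__":
    main()
```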
 
@@ -209,15 +148,6 @@ The following table has the original schedule for changes based on the criticali
 
Please use the time slots on the Production calendar when scheduling Criticality 1 and 2 change requests. For the other criticalities, add the change directly to the calendar.
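The calendar step can also be scripted. The following is a minimal sketch using the Google Calendar API, assuming the `google-api-python-client` package and a service account with write access to the Production calendar; the credentials file, event summary, and times (UTC) are placeholders.

```python
# Minimal sketch: add a scheduled change window to the Production calendar.
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder credentials file
    scopes=["https://www.googleapis.com/auth/calendar"],
)
calendar = build("calendar", "v3", credentials=creds)

calendar.events().insert(
    # Calendar ID taken from the embed URL referenced above.
    calendarId="gitlab.com_si2ach70eb1j65cnu040m3alq0@group.calendar.google.com",
    body={
        "summary": "C2 change: <summary> (production#<id>)",  # placeholder
        "start": {"dateTime": "2020-07-01T09:00:00Z"},
        "end": {"dateTime": "2020-07-01T10:00:00Z"},
    },
).execute()
```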
 
## Change Execution
 
If the change is executed by a script, it should be run from the bastion host.
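As a sketch of what that can look like when wrapped in tooling, the snippet below launches a change script on a bastion over SSH and captures its output for the change issue. The bastion hostname and script path are placeholders; in practice this is often simply an interactive SSH session on the bastion.

```python
# Minimal sketch: run a change script on a bastion host over SSH and keep
# its output so it can be attached to the change issue.
import subprocess

BASTION = "bastion-01.example.gitlab.net"  # placeholder hostname

result = subprocess.run(
    ["ssh", BASTION, "/path/to/change_script", "--execute"],  # placeholder path
    capture_output=True,
    text=True,
    check=False,
)
print(result.stdout)
print(result.stderr)
```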
@@ -254,8 +184,6 @@ Maintenance changes require change reviews. The reviews are intended to bring to
| `CT` | **Change Team** |
| | The Change Team is primarily composed of technical staff performing the change.|
 
Information is a key asset during any change. Properly managing the flow of information to its intended destination is critical in keeping interested stakeholders apprised of developments in a timely fashion. The awareness that a change is happening is critical in helping stakeholders plan for said changes.
Loading
Loading