WIP: Re-organization of Infrastructure handbook, and description of goals / priorities
Merge request reports
Activity
added 110 commits
- 97eaf1bb...f8da540c - 107 commits from branch master
- 04fd0dca - Merge branch 'master' into evn-shuffle-infra
- e8effac2 - edits on main infra page
- b94473dd - remove duplicate sections on production page
added 661 commits
- b94473dd...84d6b69e - 658 commits from branch master
- a1c1d28b - update w/ master
- cb3cce00 - Merge branch 'master' into evn-shuffle-infra
- aa6f2a1b - Further word smithing
@gl-infra ready for your comments!
- Resolved by Ernst van Nierop
- Resolved by Ernst van Nierop
- Resolved by Ernst van Nierop
- Resolved by Ernst van Nierop
1. Security: reduce risk to its minimum, and make the minimum explicit.
1. Transparency, clarity and directness: public and explicit by default, we work in the open, we strive to get signal over noise.
1. Efficiency: smart resource usage, we should not fix scalability problems by throwing more resources at them but by understanding where the waste is happening and then working to make it disappear. We should work hard to reduce toil to a minimum by automating all the boring work out of our way.

- [Public monitoring infrastructure](http://monitor.gitlab.net/):
  - No auth is required
  - Automatically syncs from the private monitoring infrastructure on every chef client execution. Don't change dashboards here, they will be overwritten.
- [Private monitoring infrastructure](https://performance.gitlab.net):
  - Highly Available setup
  - Alerting feeds from this setup
  - Toggle "public" to have the dashboard appear on the Public monitoring infrastructure. **Default to making public, unless you can specify a good reason not to**.
  - Private GitLab account is required to access
  - Separated from the public for security and availability reasons; they should have exactly the same graphs after we deprecate InfluxDB

## Production and Staging Access

This section will need work per https://gitlab.com/gitlab-com/infrastructure/issues/1231
For the time being we could just include senior service engineers so they are covered in the handbook itself. I think that this is what is happening de facto right now.
- You can find the infrastructure archive [here](https://docs.google.com/document/d/19yzyIHY9F_m5p0B0e6STSZyhzfo-vLIRVQ1zRRevWRM/edit#heading=h.lz1c6r6c9ejd).
- Automated tasks and schedules
  - Weekly automatic OS updates are performed on Monday at 10:10 UTC.
- Monitoring: we do monitoring with Prometheus, leveraging available exporters like the node or the postgresql exporters, and we build whatever else is necessary within production engineering itself. We maintain 2 monitoring infrastructures:
  - [Public monitoring infrastructure](http://monitor.gitlab.net/):
    - No auth is required
    - Is automatically synced from the private monitoring infrastructure on every chef client execution. Don't change dashboards here, they will be overwritten.
  - [Private monitoring infrastructure](https://performance.gitlab.net):
    - Highly Available setup
    - Alerting feeds from this setup
    - Private GitLab account is required to access
    - Separated from the public for security and availability reasons; they should have exactly the same graphs after we deprecate InfluxDB

## Automated tasks and schedules

Weekly automatic OS updates are performed on Monday at 10:10 UTC.

## Documentation

Production engineers also have a strong focus on building the right toolsets and automations to enable development to ship features as fast and bug-free as possible, leveraging the tools provided by GitLab.com itself - we must dogfood.

Another part of the job is building monitoring tools that allow quick troubleshooting as a first step, then turning this into alerts that notify based on symptoms, and then fixing the problem or automating the remediation. We can only scale GitLab.com by being smart and using resources effectively, starting with our own time as the main scarce resource.

### Tenets

1. Security: reduce risk to its minimum, and make the minimum explicit.
1. Transparency, clarity and directness: public and explicit by default, we work in the open, we strive to get signal over noise.
1. Efficiency: smart resource usage, we should not fix scalability problems by throwing more resources at them but by understanding where the waste is happening and then working to make it disappear. We should work hard to reduce toil to a minimum by automating all the boring work out of our way.

## Prioritizing Issues

Given the variety of responsibilities and number of "interfaces" between the Production team and all the other teams at GitLab, here is a guideline on how to prioritize the issues we work on. Basing this on the [goals of the Infrastructure team](../#infragoals) as well as our [values](/handbook/values/) and [workflows](handbook/engineering/workflow) as a company as a whole, the priority should be:

1. keeping GitLab.com available - and secure
1. unblocking others
1. automating tasks to reduce toil and increase _team_ availability (but be explicit about the costs and benefits)
1. improving performance of GitLab.com while being conscious of cost
1. reducing costs of running GitLab.com

See listed priorities above. This is my proposal, welcome input. @stanhu @pcarranza
We need to identify a bit better what it means to unblock others. This could mean that whenever we have a task that keeps popping up and is blocking others, we would never reach the point where we automate it.
Also, I'm missing the scope of what unblocking others means; the way it's written, it could apply to pretty much anything.
1. reducing costs of running GitLab.com

### Labeling Issues

We use [issue labels](https://gitlab.com/gitlab-com/infrastructure/labels) to assist in organizing issues within the Infrastructure issue tracker. Prioritized labels are

- `~critical`
- `~security`
- `~chef`
- `~noise`
- `~blocked`

Issues in this tracker should be organized into [boards](https://gitlab.com/gitlab-com/infrastructure/boards) and [milestones](https://gitlab.com/gitlab-com/infrastructure/milestones) to define projects and timelines respectively.

@pcarranza See lines above, I think we should pick some categories of work / projects and sort issues by project, since there are 300 open. I didn't do that work yet... but if you agree it is the better way to tackle projects (e.g. the fleet switch to ARM could have been done this way), then let's spend time on it.
### Production Priority Labels

- Performance improvement (or avoiding degradation) likely, but not clear if a lot.
- Cost reduction by $10k/yr
- I3
  - No outage expected from this.
  - Toil/noise reduction by <= 30 mins per week.
  - No significant performance improvement (or avoiding degradation).
  - No significant cost reduction.

| **Urgency \ Impact** | **I1 - High** | **I2 - Medium** | **I3 - Low** |
|----------------------|---------------|-----------------|--------------|
| **U1 - High**        | `PP1`         | `PP1`           | `PP2`        |
| **U2 - Medium**      | `PP1`         | `PP2`           | `PP3`        |
| **U3 - Low**         | `PP2`         | `PP3`           | `PP3`        |

## Production events logging

I kept this section, but it may need some updating @pcarranza ?
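For quick reference, the Urgency × Impact matrix above can be read as a simple lookup. A throwaway sketch of that mapping (not an existing tool or any label automation we have):

```ruby
# Maps [urgency, impact] to the production priority label from the table above.
PRODUCTION_PRIORITY = {
  %w[U1 I1] => 'PP1', %w[U1 I2] => 'PP1', %w[U1 I3] => 'PP2',
  %w[U2 I1] => 'PP1', %w[U2 I2] => 'PP2', %w[U2 I3] => 'PP3',
  %w[U3 I1] => 'PP2', %w[U3 I2] => 'PP3', %w[U3 I3] => 'PP3'
}.freeze

puts PRODUCTION_PRIORITY[%w[U2 I1]] # => PP1 (medium urgency, high impact)
```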
- Resolved by Ernst van Nierop
- Resolved by Ernst van Nierop
- Resolved by Ernst van Nierop
- Resolved by Ernst van Nierop
- Resolved by Ernst van Nierop
- Resolved by Ernst van Nierop
- Resolved by Ernst van Nierop
- Resolved by Ernst van Nierop
- Resolved by Ernst van Nierop
- Resolved by Ernst van Nierop
- Resolved by Ernst van Nierop
Production engineers also have a strong focus on building the right toolsets and automations to enable development to ship features as fast and bug-free as possible, leveraging the tools provided by GitLab.com itself - we must dogfood.

Another part of the job is building monitoring tools that allow quick troubleshooting as a first step, then turning this into alerts that notify based on symptoms, and then fixing the problem or automating the remediation. We can only scale GitLab.com by being smart and using resources effectively, starting with our own time as the main scarce resource.

[Production Engineer](jobs/production-engineer/index.html) job description.

## Monitoring

We do monitoring with Prometheus, leveraging available exporters like the node or the postgresql exporters, and we build whatever else is necessary within production engineering itself. We maintain 2 monitoring infrastructures:

Good question. The Prometheus team develops Prometheus itself and its integration in GitLab... what role do / should they have in maintaining the underlying infrastructure? /cc @stanhu
The Prometheus Team's goal is the development of Prometheus as a monitoring platform and its integration within the GitLab product. I think it would be a waste of resources to make them responsible for monitoring production GitLab.com and the ancillary GitLab resources. Monitoring of the infrastructure should be owned and managed by the production team, who are tasked with architecting, deploying, and managing the resources... monitoring those as a check and balance on the architecture and deployment (as well as an aid in management) flows naturally from that.
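As a concrete reference for the exporters mentioned above, here is a minimal sketch of a Prometheus scrape configuration. The job names, hostnames, and ports are placeholders for illustration, not our actual setup:

```yaml
# prometheus.yml -- illustrative sketch only; targets are hypothetical
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['web01.example.gitlab.net:9100', 'web02.example.gitlab.net:9100']  # node_exporter default port
  - job_name: 'postgres'
    static_configs:
      - targets: ['db01.example.gitlab.net:9187']  # postgres_exporter default port
```

Alerting rules would then be layered on top of these metrics in the private (alerting) setup described above.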
Any other engineer, or lead, or manager at any level will not have access to production, and, in case some information is needed from production, it must be obtained by a production engineer through an issue in the infrastructure issue tracker.

There is one temporary exception: release managers require production access to perform deploys; they will have production access until production engineering can offer a deployment automation that does not require chef nor ssh access. This is an ongoing effort.

- _What to do when_: points to specific runbooks to run in stressful situations (on-call)
- _How do I_: points to general administration texts that explain how to perform different administration tasks.

When writing a new runbook, be mindful of what the goal of it is:

- General documentation that customers may also benefit from: don't write it in the runbooks, write it into [GitLab documentation](https://docs.gitlab.com/).
- If it is for on-call situations, make it crisp and brief. Try to keep the following structure: pre-check, resolution, post-check.
- If it is for general management, it can be freely formatted.

## Chef cookbooks

Generally our [chef cookbooks](https://gitlab.com/groups/gitlab-cookbooks) live in the open, and they get mirrored back to our [internal cookbooks group](https://dev.gitlab.org/cookbooks) for availability reasons.

- Cookbooks should be developed as a team. We use merge requests and code review to share knowledge and build the best product we can.
- Cookbooks should be covered with ChefSpec and TestKitchen testing in order to ensure they do what they are supposed to and don't have conflicts.
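To make the testing expectation in the last bullet concrete, here is a minimal ChefSpec sketch; the cookbook name, recipe, and resources are hypothetical and only illustrate the pattern:

```ruby
# spec/unit/recipes/default_spec.rb -- illustrative only; cookbook and resources are made up
require 'chefspec'

describe 'gitlab-example::default' do
  # Converge the hypothetical default recipe against a known platform
  let(:chef_run) do
    ChefSpec::SoloRunner.new(platform: 'ubuntu', version: '16.04')
                        .converge(described_recipe)
  end

  it 'installs the nginx package' do
    expect(chef_run).to install_package('nginx')
  end

  it 'enables and starts the nginx service' do
    expect(chef_run).to enable_service('nginx')
    expect(chef_run).to start_service('nginx')
  end
end
```

Test Kitchen would then converge the same cookbook on a disposable instance to verify the end-to-end behaviour.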
There may be cases of cookbooks that could become a security concern, in which case it is OK to keep them on our GitLab private instance. This should be assessed on a case-by-case basis, and documented properly.

### Documentation specific to GitLab.com

There is some documentation that is specific to GitLab.com available in the [private Chef Repo](https://dev.gitlab.org/cookbooks/chef-repo). Things that are specific to our infrastructure providers or that would create a security threat for our installation are documented there.

- Resolved by Ernst van Nierop
- Resolved by Ernst van Nierop
mentioned in merge request !5981 (merged)
Changing title name everywhere: https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/5981
mentioned in merge request !5982 (merged)
Move handbook/database to handbook/infrastructure/database and make associated edits: https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/5982
mentioned in merge request !5983 (merged)
Moving handbook/gitaly to handbook/infrastructure/gitaly and making associated edits: https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/5983
Moving production-specific items from handbook/infrastructure to handbook/infrastructure/production https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/5998
mentioned in merge request !6019 (merged)
Proposal for priority labels (does not seem urgent at the moment) https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/6019
mentioned in merge request !6020 (merged)
Highlighting monitoring infra and point about making it public: https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/6020