
WIP: Re-organization of Infrastructure handbook, and description of goals / priorities

Closed Ernst van Nierop requested to merge evn-shuffle-infra into master
14 unresolved threads

Activity

301 301 locality: Hilversum
  • 49
    50 1. Security: reduce risk to its minimum, and make the minimum explicit.
    51 1. Transparency, clarity and directness: public and explicit by default, we work in the open, we strive to get signal over noise.
    52 1. Efficiency: smart resource usage, we should not fix scalability problems by throwing more resources at it but by understanding where the waste is happening and then working to make it disappear. We should work hard to reduce toil to a minimum by automating all the boring work out of our way.
    59 - [Public monitoring infrastructure](http://monitor.gitlab.net/):
    60 - No auth is required
    61 - Automatically syncs from the private monitoring infrastructure on every chef client execution. Don't change dashboards here, they will be overwritten.
    62 - [Private monitoring infrastructure](https://performance.gitlab.net):
    63 - Highly Available setup
    64 - Alerting feeds from this setup
    65 - Toggle "public" to have the dashboard appear on the Public monitoring infrastructure. **Default to making public, unless you can specify a good reason not to**.
    66 - Private GitLab account is required to access
    67 - Separated from the public for security and availability reasons, they should have exactly the same graphs after we deprecate InfluxDB
    53 68
    54 #### Production and Staging Access
    69 ## Production and Staging Access
  • 72 - You can find the infrastructure archive [here](https://docs.google.com/document/d/19yzyIHY9F_m5p0B0e6STSZyhzfo-vLIRVQ1zRRevWRM/edit#heading=h.lz1c6r6c9ejd).
    73 - Automated tasks and schedules
    74 - Weekly automatic OS updates are performed on Monday at 10:10 UTC.
    75 - Monitoring: we do monitoring with prometheus leveraging available exporters like the node or the postgresql exporters, and we build whatever else is necessary within production engineering itself. We maintain 2 monitoring infrastructures:
    76 - [Public monitoring infrastructure](http://monitor.gitlab.net/):
    77 - No auth is required
    78 - Is automatically sync from the private monitoring infrastructure on every chef client execution. Don't change dashboards here, they will be overwritten.
    79 - [Private monitoring infrastructure](https://performance.gitlab.net):
    80 - Highly Available setup
    81 - Alerting feeds from this setup
    82 - Private GitLab account is required to access
    83 - Separated from the public for security and availability reasons, they should have exactly the same graphs after we deprecate InfluxDB
    79 ## Automated tasks and schedules
    84 80
    85 ## Documentation
    81 Weekly automatic OS updates are performed on Monday at 10:10 UTC.
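The weekly schedule in the hunk above (Monday at 10:10 UTC) can be turned into a small calculation of the next update window. This is a minimal Python sketch, not the team's actual tooling: the function name `next_os_update` is hypothetical, and it assumes the caller passes a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone


def next_os_update(now: datetime) -> datetime:
    """Return the next Monday 10:10 UTC strictly after `now`.

    Hypothetical helper: assumes `now` is timezone-aware.
    """
    now = now.astimezone(timezone.utc)
    # Start from 10:10 UTC on the current day.
    candidate = now.replace(hour=10, minute=10, second=0, microsecond=0)
    # Move forward to Monday (weekday() == 0); adds 0 days if today is Monday.
    candidate += timedelta(days=(0 - now.weekday()) % 7)
    # If this week's window has already started or passed, use next week's.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```

For example, `next_os_update(datetime.now(timezone.utc))` gives the upcoming window relative to the current time.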
  • 25
    26 Production engineers also have a strong focus on building the right toolsets
    27 and automations to enable development to ship features as fast and bug free as
    28 possible, leveraging the tools provided by GitLab.com itself - we must dogfood.
    29
    30 Another part of the job is building monitoring tools that allow quick
    31 troubleshooting as a first step, then turning this into alerts to notify based on
    32 symptoms, to then fixing the problem or automating the remediation. We can only scale
    33 GitLab.com by being smart and using resources effectively, starting with our
    34 own time as the main scarce resource.
    35
    36 ### Tenets
    37
    38 1. Security: reduce risk to its minimum, and make the minimum explicit.
    39 1. Transparency, clarity and directness: public and explicit by default, we work in the open, we strive to get signal over noise.
    40 1. Efficiency: smart resource usage, we should not fix scalability problems by throwing more resources at it but by understanding where the waste is happening and then working to make it disappear. We should work hard to reduce toil to a minimum by automating all the boring work out of our way.
  • 41
    42
    43 ## Prioritizing Issues
    44
    45 Given the variety of responsibilities and number of "interfaces" between the Production
    46 team and all the other teams at GitLab, here is a guideline on how to prioritize
    47 the issues we work on. Basing this on the [goals of the Infrastructure team](../#infragoals) as
    48 well as our [values](/handbook/values/) and [workflows](handbook/engineering/workflow)
    49 as a company as a whole, the priority should be:
    50
    51 1. keeping GitLab.com available - and secure
    52 1. unblocking others
    53 1. automating tasks to reduce toil and increase _team_ availability (but be explicit about the costs and benefits)
    54 1. improving performance of GitLab.com while being conscious of cost
    55 1. reducing costs of running GitLab.com
    56
  • 57 1. reducing costs of running GitLab.com
    58
    59 ### Labeling Issues
    60
    61 We use [issue labels](https://gitlab.com/gitlab-com/infrastructure/labels) to
    62 assist in organizing issues within the Infrastructure issue tracker. Prioritized labels are
    63
    64 - `~critical`
    65 - `~security`
    66 - `~chef`
    67 - `~noise`
    68 - `~blocked`
    69
    70 Issues in this tracker should be organized into [boards](https://gitlab.com/gitlab-com/infrastructure/boards)
    71 and [milestones](https://gitlab.com/gitlab-com/infrastructure/milestones) to define projects and timelines respectively.
    72
    • @pcarranza See lines above, I think we should pick some categories of work / projects, sort issues by projects since there are 300 open. I didn't do that work yet... but if you agree it is the better way to tackle projects (e.g. the fleet switch to ARM could have been done this way), then let's spend time on it.

  • 73 ### Production Priority Labels
    74
    • And... this one is likely to generate discussion. I think it will help to make explicit how urgent and impactful issues are, so we can prioritize better and escalate the things that really hurt to the wider product and development teams.

  • 106 - Performance improvement (or avoiding degradation) likely, but not clear if a lot.
    107 - Cost reduction by $10k/yr
    108 - I3
    109 - No outage expected from this.
    110 - Toil/noise reduction by <= 30 mins per week.
    111 - No significant performance improvement (or avoiding degradation).
    112 - No significant cost reduction.
    113
    114
    115 | **Urgency \ Impact** | **I1 - High** | **I2 - Medium** | **I3 - Low** |
    116 |----------------------------|---------------|------------------|----------------|
    117 | **U1 - High** | `PP1` | `PP1` | `PP2` |
    118 | **U2 - Medium** | `PP1` | `PP2` | `PP3` |
    119 | **U3 - Low** | `PP2` | `PP3` | `PP3` |
    120
    121 ## Production events logging
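The urgency/impact table in the hunk above reduces to a plain lookup. This is a minimal Python sketch of that mapping, assuming the `U1`-`U3` and `I1`-`I3` codes are passed as strings. The names `PRIORITY_MATRIX` and `production_priority` are hypothetical and not part of the handbook or of GitLab itself.

```python
# Hypothetical encoding of the proposed Urgency x Impact table.
# Keys are (urgency, impact); values are the production priority label.
PRIORITY_MATRIX = {
    ("U1", "I1"): "PP1", ("U1", "I2"): "PP1", ("U1", "I3"): "PP2",
    ("U2", "I1"): "PP1", ("U2", "I2"): "PP2", ("U2", "I3"): "PP3",
    ("U3", "I1"): "PP2", ("U3", "I2"): "PP3", ("U3", "I3"): "PP3",
}


def production_priority(urgency: str, impact: str) -> str:
    """Look up the priority label for an urgency/impact pair."""
    key = (urgency.upper(), impact.upper())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"unknown urgency/impact pair: {urgency!r}, {impact!r}")
    return PRIORITY_MATRIX[key]
```

Keeping the table as data rather than nested conditionals means a change to the proposal only requires editing one dictionary entry.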
  • Ernst van Nierop changed title from WIP: Clarifying handbook structure with team structure to WIP: Re-organization of Infrastructure handbook, and description of goals / priorities


  • 35 51
    36 Production engineers also have a strong focus on building the right toolsets
    37 and automations to enable development to ship features as fast and bug free as
    38 possible, leveraging the tools provided by GitLab.com itself - we must dogfood.
    39 52
    40 Another part of the job is building monitoring tools that allow quick
    41 troubleshooting as a first step, then turning this into alerts to notify based on
    42 symptoms, to then fixing the problem or automating the remediation. We can only scale
    43 GitLab.com by being smart and using resources effectively, starting with our
    44 own time as the main scarce resource.
    53 ## Monitoring
    45 54
    46 [Production Engineer](jobs/production-engineer/index.html) job description.
    55 We do monitoring with Prometheus, leveraging available exporters like the node
    56 or the postgresql exporters, and we build whatever else is necessary within
    57 production engineering itself. We maintain 2 monitoring infrastructures:
    • Which team maintains the monitoring infrastructure? From the previous team definitions, it seems there is no direct owner of the monitoring infrastructure.

    • Good question. The Prometheus team develops Prometheus itself and its integration in GitLab. What role do they have, or should they have, in maintaining the underlying infrastructure? /cc @stanhu

    • The Prometheus team's goal is to develop Prometheus as a monitoring platform and its integration within the GitLab product. I think it would be a waste of resources to make them responsible for monitoring production GitLab.com and the ancillary GitLab resources. Monitoring of the infrastructure should be owned and managed by the production team, which is tasked with architecting, deploying, and managing those resources. Monitoring them, as a check and balance on the architecture and deployment (and as an aid in management), flows naturally from that.

  • 59 74
    60 75 Any other engineer, or lead, or manager at any level will not have access to production, and, in case some information is needed from production it must be obtained by a production engineer through an issue in the infrastructure issue tracker.
    61 76
    62 There is one temporary exception: [release managers](/release-managers) require production access to perform deploys, they will have production access until production engineering can offer a deployment automation that does not require chef nor ssh access. This is an ongoing effort.
    77 There is one temporary exception: release managers require production access to perform deploys, they will have production access until production engineering can offer a deployment automation that does not require chef nor ssh access. This is an ongoing effort.
  • 101 - What to do when: points to specific runbooks to run on stressful situations (on-call)
    102 - How do I: points to general administration texts that explain how to perform different administration tasks.
    95 - _What to do when_: points to specific runbooks to run in stressful situations (on-call)
    96 - _How do I_: points to general administration texts that explain how to perform different administration tasks.
    103 97
    104 98 When writing a new runbook, be mindful what the goal of it is:
    105 99
    100 - General documentation that customers may also benefit from: don't write it in the runbooks, write it into [GitLab documentation](https://docs.gitlab.com/).
    106 101 - If it is for on-call situations, make it crisp and brief. Try to keep the following structure: pre-check, resolution, post-check .
    107 102 - If it is for general management, it can be freely formatted.
    108 103
    109 ### Chef cookbooks
    104 ## Chef cookbooks
    105
    106 Generally our [chef cookbooks](https://gitlab.com/groups/gitlab-cookbooks) live in the open, and they get mirrored back to our
    107 [internal cookbooks group](https://dev.gitlab.org/cookbooks) for availability reasons.
  • 118 116 - Cookbooks should be developed using the team. We use merge requests and code review to share knowledge and build the best product we can.
    119 117 - Cookbooks should be covered with ChefSpec and TestKitchen testing in order to ensure they do what they are supposed to and don't have conflicts.
    120 118
    121 Generally our [chef cookbooks](https://gitlab.com/groups/gitlab-cookbooks) live in the open, and they get mirrored back to our
    122 [internal cookbooks group](https://dev.gitlab.org/cookbooks) for availability reasons.
    119 There may be cases of cookbooks that could become a security concern, in which case it is OK to keep them on our GitLab
    120 private instance. This should be assessed on a case by case basis, and documented properly.
    123 121
    124 There may be cases of cookbooks that could become a security concern, in which case it is ok to keep them in our GitLab
    125 private instance. This should be assessed in a case by case and documented properly.
    122 ### Documentation specific to GitLab.com
    126 123
    127 ### Internal documentation
    124 There is some documentation that is specific to GitLab.com available in the [private Chef Repo](https://dev.gitlab.org/cookbooks/chef-repo).
    125 Things that are specific to our infrastructure
    126 providers or that would create a security threat for our installation are documented there.
  • This merge request has become too unwieldy. I am going to close it and start new smaller ones. Will link them back here.

  • Ernst van Nierop mentioned in merge request !5981 (merged)


  • Ernst van Nierop mentioned in merge request !5982 (merged)


  • Move handbook/database to handbook/infrastructure/database and make associated edits: https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/5982

  • Ernst van Nierop mentioned in merge request !5983 (merged)


  • Moving handbook/gitaly to handbook/infrastructure/gitaly and making associated edits: https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/5983

  • Moving production-specific items from handbook/infrastructure to handbook/infrastructure/production https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/5998

  • Ernst van Nierop mentioned in merge request !6019 (merged)


  • Proposal for priority labels (does not seem urgent at the moment) https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/6019

  • Ernst van Nierop mentioned in merge request !6020 (merged)


  • Highlighting monitoring infra and point about making it public: https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/6020
