Add production readiness questionnaire
A helpful tool for anyone seeking to get their code into production, to see how ready it is.
The production change checklist portion has moved to !5849 (merged).
mentioned in issue infrastructure#1544 (closed)
@stanhu Please review, and merge if you agree :-) @pcarranza and I are both quite pleased with this idea: some structure, not too much overhead. But we're also both out next week. So if you like it, please merge, and then please also announce it on the team call.
mentioned in issue infrastructure#1361 (closed)
assigned to @stanhu
assigned to @ernstvn
assigned to @pcarranza
When this merges in, it will close https://gitlab.com/gitlab-com/infrastructure/issues/1240
added 1 commit
- 020a1ee9 - Add production readiness questionnaire to the infrastructure page
assigned to @ernstvn
mentioned in issue infrastructure#1240 (closed)
mentioned in issue infrastructure#1649 (closed)
added 896 commits
- 020a1ee9...0e0cc031 - 894 commits from branch master
- d1ad39e2 - Merge branch 'master' into pc-add-production-change-procedure
- e52c4a42 - edits to rename resources to people and clarify when the process applies
@stanhu Can you please give this another review? I made quite a few edits.
added 1 commit
- 96f63a7c - further edits aimed at being more open and enabling
Assigning to @stanhu for review.
assigned to @stanhu
assigned to @ernstvn
@stanhu I think that the part that needs most work is the production readiness checklist, so I'm going to pull that out and put it in a separate MR.
mentioned in merge request !5849 (merged)
Since the discussion here is about the Production Readiness Questionnaire, I will rename the MR to match. The other parts, covering how to bring changes into production, are now in !5849 (merged), which is set to merge when the pipeline succeeds.
added 45 commits
- 96f63a7c...cef6f6b1 - 42 commits from branch master
- d22fb87e - Remove everything related to checklist
- 197d30b5 - Fixing merge conflicts
- 7c885f9b - Move questionnaire to below checklist
@stanhu conflicts removed, scope of MR reduced.
assigned to @pcarranza
added 1014 commits
- e6eb23f0...b85be855 - 1013 commits from branch master
- 2ad2f433 - Merge branch 'master' into 'pc-add-production-change-procedure'
assigned to @ernstvn
@pcarranza @jtevnan I still struggle with this. To me, the purpose is not yet clear, nor is "when should I use this questionnaire?" or "who should use this questionnaire?". I don't want this questionnaire to become a "hurdle" that must be passed before something is sufficiently developed to face the next "hurdle" of going through the "making a change in production" checklist. At the same time, some of the questions (e.g. are we storing data, and did a DB specialist look at it?) belong in guidance for all open source developers in the gitlab-ce repo, not just in a production readiness questionnaire in our handbook. So... I'm a bit at a loss here.
Can you please help me take a step back and answer these questions:
- why is this needed?
- who needs it?
- when should it be used?
- is it a "must be used" or "nice to use" situation?
- how does it mesh with gitlab-ce development guidelines?
- how does it mesh with production change checklist?
Sorry to seem to be difficult here... but not too sorry to be asking the questions that will make it abundantly clear if and where we need this :-)
@ernstvn Don't worry about asking questions, it helps to improve the process. It will be my pleasure to answer.
- Why is this needed?
This is needed to understand the implications of introducing a new piece of infrastructure into the production environment.
Think of the case of starting to use Elasticsearch in production: https://gitlab.com/gitlab-com/infrastructure/issues/1597. This involves building at least 3 to 5 servers that will need capacity planning, proper logging, some form of backups, and HA; they will also increase the attack surface, raise security concerns, etc. And this new service will also need to be supported (owned) by the production engineering team.
Reducing the management of large pieces of infrastructure to a task of "just throw this onto a host and we'll figure it out", without any form of analysis and planning, is a ticket to a painful trip. It is particularly painful for production engineers given our current structure and scale.
The same goes for running Gitaly in production: do we need HA? Do we need backups? Who owns it?
We just need to ask these questions to understand what we are getting into.
- Who needs it?
Given our current structure, we, the production engineers, need it. We are increasing the pool of things we own and are responsible for, so we want to make this decision in an informed way: understanding what we are walking into, ensuring that we have enough capacity to support a new piece of infrastructure or running service, and knowing how it will impact the rest of the pool.
- When should it be used?
Whenever someone needs to add a new piece of (virtual) hardware or software to production, call it a service, that will require us to support it.
- Is it a "must be used" or "nice to use" situation?
It is a must be used, else we will refuse to provision resources and own unknowns.
- How does it mesh with gitlab-ce development guidelines?
I don't know. Development seems to be moving at a different pace than production, and the responsibilities seem to be split in two for now, so we need to have this middle-ground conversation about what this looks like in production, and not in a tiny development environment without any load or durability requirements.
- How does it mesh with production change checklist?
This is a prior step to the change checklist.
The result of this questionnaire is an understanding of whether the service is in a state in which it can actually be run in production. Can we own this piece of infrastructure given our capacity? Will we know how to keep it running and how to recover it when it fails?
The result of the production change checklist is that we have something deployed in production.
As you can see, this is a two-step process.
Thanks @pcarranza .
> It is a must be used, else we will refuse to provision resources and own unknowns.
This is what I imagined to be the case, and it is understandable, but as a consequence it feels more like a hurdle than like an aid to getting things rolled out. In that sense it feels like it works in a direction opposite to integrating and embedding production engineering knowledge and talent in development, and vice versa.
Perhaps a better way to think about this is as a helpful questionnaire that Reliability Experts and Production Engineers embedded with Development teams can use as a guide in thinking through scalability, durability, and maintainability?
@ernstvn I think the way to think about this is that whenever we ship something to production, someone will end up owning it. The way we work now is that we ship it, and then production engineering has to start reverse engineering what was shipped. This is an attempt to turn that around.
This questionnaire forces thinking about scalability, durability, maintainability, and security.
From our perspective, it will be used to understand the starting point of a service, so that we have a reasonable expectation of how it will behave in production.
This means we can get something into production even after answering all the questions with "I don't know". But given this starting point, the way the service will behave and the SLA we can offer will also match that "I don't know", simply because we will be starting from the point of "this is an experiment that is not even understood by the original builder".
So answering it is a must, even though it doesn't require that you know everything. It sets the groundwork by showing what we will need to learn and explore along the way, which will also help us with capacity planning.
@bjk-gitlab I would like to get your input here too if possible.
I agree with @ernstvn, we don't want to get into hoop jumping if we can avoid it. The operational structure of running gitlab.com would need some discussion and changes for this to go from being a "guide" to being a "review". See the PRR Model. I don't think we're a big enough org for that to work at this time.
Picking this one up, will make some edits and assign to @pcarranza for review.
assigned to @ernstvn
added 1190 commits
- 2ad2f433...95c717a6 - 1187 commits from branch master
- 04b9c7d1 - various edits based on discussion
- 3a45e498 - Move page to its own folder, and add links back to it
- cd60b321 - Merge branch 'master' into pc-add-production-change-procedure
@pcarranza Please review. You may want to review the individual commits, since I first edited the content and then moved the entire page to a new location.
assigned to @pcarranza
mentioned in commit 2a488f4c
  - [Performance](/handbook/engineering/performance)
  - [Issue Triage Policies](/handbook/engineering/issues/issue-triage-policies)
  - [Critical Security Release Process](/handbook/engineering/critical-release-process)
+ - [Performance of GitLab](/handbook/engineering/performance)
@pcarranza This link is already above, as Performance.