Add production readiness questionnaire

Merged username-removed-274314 requested to merge pc-add-production-change-procedure into master
1 unresolved thread

Helpful tool for anyone seeking to get their code into production to see how ready it is for that.

Production changes Checklist aspect moved to !5849 (merged)

Activity

  • added 1 commit

  • Ernst van Nierop resolved all discussions

  • @stanhu Please review, and merge if you agree :-) @pcarranza and I are both quite pleased with this idea... some structure, not too much overhead. But we're also both out next week. So if you like it, please merge, then please also announce on team call.

  • assigned to @stanhu

  • Stan Hu
  • assigned to @ernstvn

  • Re-assigning to Pablo since we discussed expanding this a bit to include a checklist of things to consider when wondering about whether something is production-ready or not.

  • added 1 commit

    • 020a1ee9 - Add production readiness questionnaire to the infrastructure page

  • Ernst van Nierop added 896 commits

    • 020a1ee9...0e0cc031 - 894 commits from branch master
    • d1ad39e2 - Merge branch 'master' into pc-add-production-change-procedure
    • e52c4a42 - edits to rename resources to people and clarify when the process applies

  • @stanhu can you please give this another review? I made quite a few edits :scream:

  • I want to make further edits to reduce a creeping sense of bureaucracy... but how?

  • Ernst van Nierop resolved all discussions

  • added 1 commit

    • 96f63a7c - further edits aimed at being more open and enabling

  • Assigning to @stanhu for review.

  • assigned to @stanhu

  • @stanhu I think that the part that needs most work is the production readiness checklist, so I'm going to pull that out and put it in a separate MR.

  • Ernst van Nierop mentioned in merge request !5849 (merged)

  • Since the discussion here is about the Production Readiness Questionnaire, I will rename the MR to match. The other parts, of how to bring changes into Production, are now in !5849 (merged) which is set to merge when pipeline succeeds.

  • Ernst van Nierop changed title from Add production change procedure to Add production readiness questionnaire

  • Maintainer

    Conflicts now?

  • Ernst van Nierop added 45 commits

  • @stanhu conflicts removed, scope of MR reduced.

  • added 2 commits

    • ebdbf78d - Removed production readiness questionnaire to separate page
    • e6eb23f0 - Extracted production readiness questionaire to a new page for a conflict free discussion

  • added 1014 commits

  • I extracted the checklist to a new page to prevent conflicts, and then resolved the ones that were there.

  • @pcarranza @jtevnan I still struggle with this. To me, the purpose is not yet clear... nor the "when should I use this questionnaire?" or the "who should use this questionnaire?". I don't want this questionnaire to become a "hurdle" that must be passed before something is sufficiently developed to face the next "hurdle" of going through the "making a change in production checklist". At the same time, some of the questions, e.g. "are we storing data, and did a DB specialist look at it?", belong in guidance for all open source developers in the gitlab-ce repo, not just in a production readiness questionnaire in our handbook. So... I'm a bit at a loss here.

    Can you please help me take a step back and answer these questions:

    • why is this needed?
    • who needs it?
    • when should it be used?
    • is it a "must be used" or "nice to use" situation?
    • how does it mesh with gitlab-ce development guidelines?
    • how does it mesh with production change checklist?

    Sorry to seem to be difficult here... but not too sorry to be asking the questions that will make it abundantly clear if and where we need this :-)

  • Ernst van Nierop removed assignee

  • @ernstvn Don't worry about asking questions, it helps to improve the process. It will be my pleasure to answer.

    • Why is this needed?

    This is needed to understand the implications of introducing a new piece of infrastructure into the production environment.

    Think of the case of starting to use Elasticsearch in production (https://gitlab.com/gitlab-com/infrastructure/issues/1597). This involves building at least 3 to 5 servers that will need capacity planning, proper logging, some form of backups, and HA, and that will increase the attack surface, raise security concerns, etc. This new service will also need to be supported (owned) by the production engineering team.

    Reducing the management of large pieces of infrastructure to "just throw this onto a host and we'll figure it out", without any form of analysis or planning, is a ticket to a painful trip. It is particularly painful for production engineers given our current structure and scale.

    The same goes for running Gitaly in production: do we need HA? Do we need backups? Who owns it?

    We just need to ask these questions to understand what we are getting into.

    • Who needs it?

    Given our current structure, we, production engineers, need it. We are increasing the pool of what we own and are responsible for, so we want to make this decision in an informed way: understanding what we are walking into, ensuring that we have enough capacity to support a new piece of infrastructure or running service, and knowing how it will impact the rest of the pool.

    • When should it be used?

    Whenever someone needs to add a new piece of hardware (virtual) or software to production, call it a service, that will require us to support it.

    • Is it a "must be used" or "nice to use" situation?

    It is a must be used, else we will refuse to provision resources and own unknowns.

    • How does it mesh with gitlab-ce development guidelines?

    I don't know. Development seems to be going at a different pace than production, and the responsibilities seem to be split in two for now, so we need to have this middle-ground conversation about what this looks like in production, and not in a tiny development environment without any load or durability.

    • How does it mesh with production change checklist?

    This is a prior step for the change checklist.

    The result of this questionnaire is an understanding of whether the service is in a state in which it can actually be run in production: can we own this piece of infrastructure given our capacity? Will we know how to keep it running and how to recover it when it fails?

    The result of the production change checklist is that we have something deployed in production.

    As you can see, this is a two-step process.

  • Thanks @pcarranza .

    > It is a must be used, else we will refuse to provision resources and own unknowns.

    This is what I was imagining to be the case, and it is understandable, but as a consequence it feels more like a hurdle than like an aid to getting things rolled out. In that sense it feels like it works in a direction opposite to integrating / embedding production engineering knowledge and talent in development and vice versa.

    Perhaps a better way to think about this is as a helpful questionnaire that Reliability Experts and Production Engineers embedded with Development teams can use as a guide in thinking through scalability, durability, and maintainability?

  • @ernstvn I think the way to think about this is that whenever we want to ship something to production, someone will end up owning it. The way we work now is that we ship it, and then production engineering has to start reverse engineering what was shipped. This is trying to turn that around.

    This questionnaire forces thinking about scalability, durability, maintainability and security.

    From our perspective, it will be used to understand the starting point of a service, so that we have a reasonable expectation of how it will behave in production.

    This means we can get something into production even after answering all the questions with "I don't know". But given this starting point, the way the service will behave and the SLA we can offer will also match that "I don't know", simply because we will be starting from the point of "this is an experiment that is not even understood by the original builder".

    So, answering it is a must, even though it doesn't require that you know everything. It sets the groundwork by showing what we will need to learn and explore along the way, which will also help us do capacity planning.

  • @bjk-gitlab I would like to get your input here too if possible.

  • I agree with @ernstvn, we don't want to get into hoop jumping if we can avoid it. The operational structure of running gitlab.com would need some discussion and changes for this to go from being a "guide" to being a "review". See the PRR Model. I don't think we're big enough of an org for that to work at this time.

  • Picking this one up, will make some edits and assign to @pcarranza for review.

  • Ernst van Nierop resolved all discussions

  • Ernst van Nierop added 1190 commits

    • 2ad2f433...95c717a6 - 1187 commits from branch master
    • 04b9c7d1 - various edits based on discussion
    • 3a45e498 - Move page to its own folder, and add links back to it
    • cd60b321 - Merge branch 'master' into pc-add-production-change-procedure

  • @pcarranza Please review. You may want to review the individual commits since I first edited the content, and I then moved the entire page to a new location...

  • Love it, let's iterate!

  • mentioned in commit 2a488f4c

  • 21 21 - [Performance](/handbook/engineering/performance)
    22 22 - [Issue Triage Policies](/handbook/engineering/issues/issue-triage-policies)
    23 23 - [Critical Security Release Process](/handbook/engineering/critical-release-process)
    24 - [Performance of GitLab](/handbook/engineering/performance)