[META] Stabilize deployment procedure
We have a lot of problems with our current deployment procedure. One of the main ones is that it has no tests whatsoever, so when we introduce a change we can only see how it behaves in production.
As a result, we avoid changing it at all, let alone making any major change to it, mostly because we are stuck in the fear cycle.
We need to overhaul this process completely by moving it out of a rake task and into a full-blown project that has solid test coverage and that we can change confidently.
How to do it?
The steps to get better are the following:
- Move the rake task out of the chef-repo to a separate project without changing it #1905 (closed)
- Start adding smoke and integration level testing of this process so that we gain confidence through test visibility.
- As we add integration tests, we factor out the current procedure into specific units of work that only have an execution step and are covered by rspec tests (or similar)
- When we are done extracting these execution steps, we start improving the process by surrounding them with pre and post checks, adding the required validations that are proposed in multiple issues.
- Rinse and repeat until we are confident with our deployment procedure.
- Move to the next step of deploying using Terraform and prebuilt images (pending multiple other tasks from canary deployment)
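To make the "units of work" and "pre and post checks" steps more concrete, here is a minimal Ruby sketch of what such a structure might look like. It is only an illustration under assumptions: `Step`, `CheckedStep`, `CheckFailed`, the context hash, and the example checks are all hypothetical names and do not come from the existing rake task.

```ruby
# Hypothetical sketch: none of these names exist in the current rake task.

# A unit of work that only has an execution step.
class Step
  attr_reader :name

  def initialize(name, &execute)
    @name = name
    @execute = execute
  end

  # Runs the execution step against a shared context hash.
  def call(context)
    @execute.call(context)
  end
end

# Surrounds a step with pre and post checks.
# Each check is a { description => callable } pair; the callable
# receives the context and returns true when the check passes.
class CheckedStep
  class CheckFailed < StandardError; end

  def initialize(step, pre: {}, post: {})
    @step = step
    @pre = pre
    @post = post
  end

  def name
    @step.name
  end

  def call(context)
    verify!(@pre, context, "pre")
    result = @step.call(context)
    verify!(@post, context, "post")
    result
  end

  private

  # Raises on the first failing check so the deployment stops early.
  def verify!(checks, context, phase)
    checks.each do |description, check|
      next if check.call(context)

      raise CheckFailed, "#{phase}-check failed for #{name}: #{description}"
    end
  end
end

# Example usage with an illustrative "install" step (assumed, not real).
context = { version: "1.2.3" }
install = Step.new("install package") { |ctx| ctx[:installed] = ctx[:version] }
checked_install = CheckedStep.new(
  install,
  pre: { "a version is selected" => ->(ctx) { !ctx[:version].to_s.empty? } },
  post: { "installed version matches the selected one" => ->(ctx) { ctx[:installed] == ctx[:version] } }
)
checked_install.call(context)
```

Because each step and each check is a plain object taking a context hash, they could be exercised in isolation by rspec examples (or similar) without touching production, which is the property the steps above are aiming for.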
Relevant links
- Revamp the deployment procedure using Terraform #1739 - next step to move closer to Canary deployments #1504.
- Inconsistency in the rakefile #1612 (closed)
- Add prechecks to the deployment process #1603 (closed)
- Disable chef-client across the fleet before starting the deployment #1456 (closed)
- Stop the API when performing downtime deployments #1401 (closed) - maybe not relevant anymore as downtime deployments are forbidden now.
- Rebuild the deployment procedure #1386 (closed)
- Issues encountered during RC2 deployment #1365
- Manual restart of unicorn is usually needed when deploying #575 (closed)
- Deployment ended up wrong because we pick the wrong version #347 (closed)
- Write runbook explaining how to recover from a bad deployment #244 (closed)
- Improve deploy page message #74 (closed)
- Gitaly is not being updated on deploy on the NFS servers #1890 (closed)
- Restart processes in a given order #2107 (closed)
- Post-migrations may go unexecuted in between versions https://gitlab.com/gitlab-com/infrastructure/issues/2134#note_33695139
Edited by James Lopez