9.4.0 RC2 deployment post-mortem
We encountered several issues while attempting to deploy RC2 to production.
- The deploy call started around 17:00 UTC with @northrup, @jamedjo, @mikegreiling, and @jivanvl on the call.
- While running the deploy rake task the first time, the db migrations went way faster than expected and then it failed with:
Chef::Exceptions::User: Cannot modify user terraform - does not exist!
- Issue with summary on terraform user: https://gitlab.com/gitlab-com/infrastructure/issues/2239 (wip, internal)
- https://gitlab.slack.com/archives/C101F3796/p1499966179703943
- fix: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/903
- @ilyaf, @stanhu, @twk3, and @omame joined the call.
- In addition to the terraform user issue, it appeared that the deploy node was not installing the right package version.
- @omame corrected this by updating some chef roles (workaround commit: https://dev.gitlab.org/cookbooks/chef-repo/commit/69a90e2831de82505bae38b64d044b8707a83a49)
- workaround currently requires manual intervention during deploy. Change for automating it is in review: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/904
- 19:00 UTC We ran the deploy task again and it failed in the middle of a db migration that was adding a bunch of foreign keys.
- related: gitlab-org/gitlab-ee!2223
- Apparently sidekiq was not stopped before the migration, and @stanhu noticed that a project was deleted sometime during the migration, triggering a race condition.
- https://gitlab.slack.com/archives/C101F3796/p1499971870946796
- We disabled sidekiq and logged into deploy.gitlab.com to revert the migration manually and then re-run it. At some point @yorickpeterse was brought in.
- 20:00 UTC We then enabled sidekiq again after we thought the problematic migrations were finished. Turns out this was premature and we needed to disable sidekiq again. (I believe @ilyaf handled this)
- We finally completed the manual db migration run on deploy.gitlab.com and switched back to the chef repo to deploy using the rake task. We commented out several lines to get it to bypass steps that had already been completed and proceeded to run it again.
- 20:45 We encountered an error with PrometheusHandler that caused the deploy task to fail again:
ERROR: PrometheusHandler: #<TypeError: no implicit conversion of Hash into String> 52.179.183.168 - PrometheusHandler
- https://gitlab.slack.com/files/mgreiling/F69FABHFH/latest_fail_log.txt
- Ultimately we ended up bypassing this and moving on to the post-deploy migrations.
- 21:55 The package installations on a few of the nodes took way longer than expected. Affected nodes: sidekiq-asap0[134], issue with possible reasons: https://gitlab.com/gitlab-com/infrastructure/issues/2246 (needs investigation)
- GitLab.com started displaying 500 errors. PagerDuty was triggered. The database reached a max connection limit (I think?). @northrup investigated, noted that one DB CPU was maxed out. Some fiddling with sidekiq relieved this.
psql: FATAL: sorry, too many clients already
@yorickpeterse investigated and couldn't find a definitive culprit. Other than the high DB load, there was not that much unusual activity in the monitoring graphs or PostgreSQL logs.
- 22:30 RC2 deployment complete. All deployment issues resolved.
This is a very broad gist of my memory. Please fill in the details where appropriate and note any issues that ought to be opened to improve the process for next time.
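One process improvement suggested by the timeline above is to make the ordering explicit: stop sidekiq, run the migrations, and only then start sidekiq again. The following is a minimal Python sketch of such a guard, not the actual deploy tooling. The class and method names are hypothetical and only model the ordering; a real check would have to query the workers and the migration status.

```python
class DeployState:
    """Hypothetical model of the deploy ordering discussed in the timeline."""

    def __init__(self) -> None:
        # Assumption: background workers are running when the deploy starts.
        self.workers_running = True
        self.migrations_done = False

    def stop_workers(self) -> None:
        self.workers_running = False

    def run_migrations(self) -> None:
        # Refuse to migrate while background jobs could still change rows
        # (for example, delete a project) underneath the migration.
        if self.workers_running:
            raise RuntimeError("stop background workers before migrating")
        self.migrations_done = True

    def start_workers(self) -> None:
        # Refuse to re-enable workers until the migrations have finished.
        if not self.migrations_done:
            raise RuntimeError("migrations not finished; keep workers stopped")
        self.workers_running = True


state = DeployState()
state.stop_workers()
state.run_migrations()
state.start_workers()
```

Calling `run_migrations()` before `stop_workers()`, or `start_workers()` before the migrations finish, raises instead of proceeding, which covers both ordering mistakes noted at 19:00 and 20:00 UTC.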
Merge request validation/license issues
- 21:08 UTC We notice MRs failing to create in https://sentry.gitlap.com/gitlab/gitlabcom/issues/37315/, but don't realise that this could be preventing all merge request creation
- 21:34 UTC @fatihacet first notices Approver validation prevents MR edits https://gitlab.slack.com/archives/C02PF508L/p1499981659622754
- 23:39 UTC An increasing number of users are unable to create merge requests due to https://gitlab.com/gitlab-org/gitlab-ce/issues/35077
- Attempts are made to diagnose the problem, but it doesn't appear on staging and doesn't occur on all local installs.
- https://gitlab.slack.com/archives/C02PF508L/p1499989184983550
- https://gitlab.slack.com/archives/C02PF508L/p1499991325479615
- 00:13 @jamedjo notices that the 'Related Issues' box which appeared on dev and staging isn't appearing on issues in production https://gitlab.slack.com/archives/C0XM5UU6B/p1499991232607604
- Around 00:30 UTC we identify a change which causes MRs to throw 500 errors if validation fails
- 00:46 UTC We identify that MR creation validation fails because of approvers https://gitlab.slack.com/archives/C02PF508L/p1499993207856587?thread_ts=1499991325.479615&cid=C02PF508L
- 01:15 @nick.thomas suggests that 'Related Issues' not appearing, 'Multiple issue boards' not working and the merge requests failing due to approvers could all be related https://gitlab.slack.com/archives/C02PF508L/p1499994918184243?thread_ts=1499991325.479615&cid=C02PF508L
- 01:27 UTC - @stanhu uploads new license to GitLab.com, and MRs start working again
This means merge requests couldn't be created/updated for 4h 19mins
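The 4h 19mins figure is the gap between the first entry (21:08 UTC) and the last (01:27 UTC the next day). A small Python sketch of that arithmetic follows, together with a helper for cross-checking times against the Slack links. The helper assumes that the `p…` part of a Slack archive link is the message's Unix timestamp in seconds with six microsecond digits appended. That assumption is consistent with the times quoted in this list, but it is not taken from the source. Function names are illustrative.

```python
from datetime import datetime, timezone


def minutes_between(start_hhmm: str, end_hhmm: str) -> int:
    """Minutes from start to end, assuming end is less than 24h after start."""

    def to_minutes(hhmm: str) -> int:
        hours, minutes = hhmm.split(":")
        return int(hours) * 60 + int(minutes)

    # The modulo handles timelines that cross midnight UTC.
    return (to_minutes(end_hhmm) - to_minutes(start_hhmm)) % (24 * 60)


def slack_link_to_utc(permalink_id: str) -> datetime:
    """Assumption: 'p' + Unix seconds + six microsecond digits."""
    digits = permalink_id.lstrip("p")
    seconds = int(digits[:-6])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


outage_minutes = minutes_between("21:08", "01:27")
print(divmod(outage_minutes, 60))  # (4, 19)
print(slack_link_to_utc("p1499981659622754").strftime("%H:%M"))  # 21:34
```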
Edited by Stan Hu