[Meta] Observations during the 8.15.2 release
List of things that I observed or that went wrong during the 8.15.2 release.
- We entered the week after Christmas with both the release manager and the deputy release manager OOO
- We did not have anyone assigned who could make a call and do a release
- We had around 25 merged MRs with the ~"Pick into Stable" label
- We had a couple of merge requests with the label but without a milestone
- The merge request that originally created problems had a ~"Pick into Stable" label: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8286#note_20461082. This caused problems with the release down the road
- CE and EE specs take around an hour to finish, maybe more
- Release tool clones EE, CE and omnibus-gitlab. It also syncs master even when doing a release. https://gitlab.com/gitlab-org/release-tools/issues/43
- Deploy to staging takes time
- Mailroom fails on deploy regularly
- Package installation is slow during the unpack phase
- Staging is in an inconsistent state, so most pages you load when you log in will give you a 500 page: https://gitlab.com/gitlab-com/infrastructure/issues/936
- The MR that got picked to stable (mentioned above) caused issues and more errors in the logs
- We were 30 minutes in after the packages were released to the general public. Due to the errors that the version was causing, we decided to yank the release
- Yanking packages took 5 minutes. => Could possibly be solved by https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1830
- There is a bug where openSUSE packages cannot be removed from the package server, neither from the UI nor using the API => Contacted Packagecloud with the bug report.
- We removed the Docker images from Docker Hub
- We did not re-tag the `latest` tag, which caused issues for Docker users
- We removed all the 8.15.2 tags from repositories to "prevent" users who install from source from upgrading
- We reverted the offending commit and decided to do a new 8.15.2 release
- Handoff between the US and EU timezones was clearly communicated: https://gitlab.slack.com/archives/releases/p1482890817002008
- We re-tagged the `latest` Docker tag to point to 8.15.1
- In order to push the new SUSE packages, it was necessary to go into the packages.gitlab.com Rails console and trigger an individual delete of the SUSE 8.15.2 packages
- Building packages on a cold cache takes 50 minutes: https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1472
- In order to not build the same package again, it was necessary to manually fetch the built packages and push them using the packagecloud API
- In order to build the Docker image (which is an extra step), it was necessary to go into the Rails console on dev.gitlab.org and change the values of the pipeline to trigger the builds that were marked as "skipped": https://gitlab.com/gitlab-org/gitlab-ce/issues/4054
- Deploy to staging again showed the same symptoms
- The deploy documentation does not state that offline migrations need to be run after the deploy is done. There was an offline migration on staging that was not run from the 8.15.1 deploy: https://gitlab.com/gitlab-com/infrastructure/issues/938
- Deploy to production took 25 minutes. 2 minutes for rake cache:clear: https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1826
- Some workers did not restart Unicorn correctly, so a manual restart was necessary
- The blog post was merged but it did not get published: https://gitlab.com/gitlab-org/gitlab-ce/issues/26159 , fixed with https://gitlab.com/gitlab-com/www-gitlab-com/commit/89a4c07d2b4b522b7dd06dd5c97f9275a01584c5
- Unrelated to the 8.15.2 release, but we might have an issue that causes data loss: https://gitlab.com/gitlab-org/gitlab-ce/issues/26158