In #1924 (closed), we now have a set of canary machines to which we should be able to deploy iterations before they hit an RC. What do we need to do to use this?
One complication discussed in Slack is that Sidekiq jobs may need to be versioned in some way, since it's possible for the Rails process to schedule new jobs or change parameters for an existing worker that are not present in older versions.
Here is a strawman proposal for how we might use this canary deployment in practice:
Use the nightly EE omnibus package
The package gets deployed on the canary (e.g. via ChatOps). The package can be rolled back if necessary.
Alerts for the canary get sent out to flag possible issues.
What do we need to do to address each point?
UPDATE:
Per discussion below, the following four points need to be addressed to unblock this:
An omnibus package gets built for EE for the latest master.
This is easy to do now: we have a way of triggering builds through CI. This could even be automatic if we decide to do so; it is not complicated to change.
The problem is that it depends on our mirroring inside GitLab, which has been unreliable, to say the least.
Couldn't we simplify the first 4 steps by just using an EE nightly package? The state of EE compared to CE is basically irrelevant for these purposes: it only matters for a release.
@pcarranza I think the main issue left is Sidekiq versioning. How do we get Sidekiq workers to only execute Sidekiq jobs they actually know how to process, and/or how do we write Sidekiq jobs so that this doesn't matter most of the time?
How do we get Sidekiq workers to only execute Sidekiq jobs they actually know how to process
Perhaps a description of the payload needs to be part of the queue name (similar to the version portion of Protocol Buffers), and the workers can then pull from queues whose names match data they know how to process.
Use different queue names for every Sidekiq worker [...] (annoying)
Worse than annoying: it sounds like a lot of manual configuration until there is some dynamic service discovery mechanism where the workers can advertise what their capabilities are, and the app can push into the correct queue based on that.
Use a special Redis cluster/DB for Sidekiq in the canary deployment
Doesn't this pretty much need to be done anyway? Consider the case where a controller in the canary has a different model (newer version) than the production nodes and stuffs a copy into the Rails.cache (which is backed by Redis) ... what prevents one of the prior FE nodes from pulling a model instance out of a shared Rails.cache that it does not fully understand?
See also: Issue gitlab-org/gitlab-ce#33182 (related)
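One way to sidestep that particular footgun might be to fold the running code revision into the cache key of anything that stores marshalled model instances. A minimal sketch, assuming a revision constant such as `Gitlab::REVISION` is available; the helper below is hypothetical, not existing code:

```ruby
# Hypothetical helper: version cache entries by code revision so a canary node
# and an older node never deserialize each other's marshalled objects.
def cached_project(project_id)
  Rails.cache.fetch("project:#{project_id}:rev:#{Gitlab::REVISION}") do
    Project.find(project_id)
  end
end
```

The obvious cost is a cold cache for whichever nodes just changed revision.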
Perhaps a description of the payload needs to be part of the queue name (similar to the version portion of Protocol Buffers), and the workers can then pull from queues whose names match data they know how to process.
I don't think Sidekiq really works well with this model. If you wanted to do this, you would have to have a worker pull from the queue and requeue the job if it doesn't match. That seems quite awkward and could lead to unintended consequences.
you would have to have a worker pull from the queue and requeue the job if it doesn't match
There is no need for re-queueing if the queue name (the Redis key) includes a description of the data and version; workers would only pull from queues whose names they recognize.
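A minimal sketch of what that could look like with plain Sidekiq, assuming a hypothetical per-worker `PAYLOAD_VERSION` constant; the worker and queue names are illustrative, not existing GitLab code:

```ruby
class ExamplePayloadWorker
  include Sidekiq::Worker

  # Hypothetical constant, bumped whenever the argument shape changes.
  PAYLOAD_VERSION = 2

  sidekiq_options queue: "example_payload_v#{PAYLOAD_VERSION}"

  def perform(project_id)
    # Only jobs enqueued by code that knows the v2 shape land in
    # example_payload_v2; an older deployment keeps pulling from
    # example_payload_v1, whose shape it understands.
  end
end
```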
At the moment I see that gitlab-ce/master has 5 Sidekiq queue names in use (plus 31 dedicated workers):
cronjob
repository_check
build
pipeline
pages
31 dedicated workers, formatted as `name.sub(/Worker\z/, '').underscore.tr('/', '_')`
Each of those queue names above could be suffixed with the payload type+version.
How is that different from the first idea I proposed?
If I understood your first idea ("e.g. mailers1, mailers2, etc.") correctly, it had a separate number suffix for each worker; a single number does not convey the capability without some sort of additional binding work.
With the suffix as "payload type + version" (both of which could correspond to an ActiveRecord model + version), it does not need dynamic service discovery to perform the bindings.
PS: there are 36 sites in the gitlab-ce/master code, 36 queues listed in `config/sidekiq_queues.yml`, and 57 spec test cases???
Yes, the number was meant to convey a version. I still don't think that's a great idea because you could easily get into a situation where jobs get stuck in some queue that's never pulled.
Ahh ... I previously read it as instance number rather than version number
get into a situation where jobs get stuck in some queue that's never pulled.
In large-scale orchestrations I have seen, you need:
service providers for later versions to also pull from queues of older versions, with internal adapter-pattern objects to reformat the data at runtime from one version to another (no separate migration step for data already in the queues); the adapter-pattern object can generally reuse the DB migration code and Rails framework (see the sketch after this list)
a monitoring console for queue depth and max age, with alerts for queues that are too old or too big
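A rough sketch of the first point, with purely illustrative class and queue names (none of this exists in GitLab today): the newer deployment keeps listening on the old queue as well, e.g. `sidekiq -q mailers_v1 -q mailers_v2`, and retains a thin shim worker that adapts old arguments to the new code path.

```ruby
# All names are illustrative. v1 jobs were enqueued with a single argument;
# the v2 code path also expects a notification type.
class MailersV1Worker
  include Sidekiq::Worker
  sidekiq_options queue: 'mailers_v1'

  # Adapter: upgrade the v1 argument list and delegate to the v2 worker.
  def perform(user_id)
    MailersV2Worker.new.perform(user_id, 'default')
  end
end

class MailersV2Worker
  include Sidekiq::Worker
  sidekiq_options queue: 'mailers_v2'

  def perform(user_id, notification_type)
    # ... deliver using the current code path ...
  end
end
```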
@stanhu don't both of your suggestions prevent the canary from picking up Sidekiq jobs scheduled from non-canaries? We want that to be possible, both because the new version should work with jobs scheduled by the old version, and because it allows us to get a greater variety of jobs run by the canary.
On migrations: I don't see any mention of staging-before-canary, but surely that's essential for testing the timings and likely impact? Will we have a canary in staging, or just use regular staging?
don't both of your suggestions prevent the canary from picking up Sidekiq jobs scheduled from non-canaries? We want that to be possible, both because the new version should work with jobs scheduled by the old version, and because it allows us to get a greater variety of jobs run by the canary.
@smcgivern Right, the first idea might make it possible if the canary can pick up jobs of older versions. Do you have other suggestions?
@stanhu nothing in particular, other than manual inspection of `git diff $production...$canary app/workers` - my suspicion is that most of the time they could work together just fine
@smcgivern @stanhu I think that we do need to get to the point where canary and standard can work together; that will enable us to deploy many more versions in production without impact.
@smcgivern I think that we should not have a canary in staging. If we want, we could be deploying to staging, but keep in mind that the canary should have a compatible model, or that the application should support having the canary running there and then having it go away.
One of the core values of a canary is that it may fail hard, which should be perfectly fine for the whole application. Imagine this:
we spin up a canary with a new compatible version.
it fails miserably, raising a lot of errors, but writes something to the database or Sidekiq.
we stop it and tear it down to go fix the errors.
We should reach a point where we are confident that we don't need staging anymore and that we can deploy to canary directly, so adding yet another step would be going the other way.
@pcarranza is the rest of that plan written down anywhere? I think I missed something while I was away, and that seems like it would change our release process quite a bit.
This is the continuation of an issue that took a lot of shortcuts to enable one canary deployment in production, which has already been delivered from the production and build side: https://gitlab.com/gitlab-com/infrastructure/issues/1924
I think that @stanhu is trying to open the discussion of handling whatever is left to deploy the first canary in production, which, as stated before, is already available as a resource.
Right, this issue is specifically targeted to making use of the canary now that we have it.
For example, RC1 is supposed to be out sometime this week. I could see us trying it out on staging to verify nothing blows up. If that passes, could we run this on the canary host this week?
Just looking at the diff between 9-2-stable and master, I already see that we introduced new Sidekiq workers (e.g. NamespacelessProjectDestroyWorker and RemoveOldWebHookLogsWorker). The former won't cause an issue because it uses its own dedicated queue. However, the latter uses the shared CronjobQueue, so any OLD Sidekiq process attempting to run it will fail.
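For reference, the new worker is wired up roughly like this (a simplified sketch with the body omitted, not the actual implementation). Because it rides on the shared cronjob queue, an older Sidekiq process listening on that queue will fail to constantize the class, raise an error, and push the job onto the retry set:

```ruby
class RemoveOldWebHookLogsWorker
  include Sidekiq::Worker
  include CronjobQueue # puts the worker on the shared 'cronjob' queue

  def perform
    # ... cleanup logic lives here in the real worker ...
  end
end
```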
I see a number of pitfalls with deploying RC1 to the canary:
We have one migration (`20170503140201_reschedule_project_authorizations.rb`) that reschedules project authorizations for all users in a Sidekiq job. We don't want this job to be picked up by older workers.
We have new Sidekiq workers that will be unknown to older workers, which will cause the jobs to fail and be retried later.
Note that we drop the `authorized_projects_populated` column, which is needed by older app servers, in a post-deployment migration. So we shouldn't run the post-deployment migrations until the full fleet is deployed (which is true regardless of the canary).
The quickest way forward would be to have a separate Redis instance just for the canary. However, that runs the risk we'll schedule Sidekiq jobs that can only be picked up by the canary.
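If we went that route, the wiring itself would be small. A sketch, assuming a made-up `CANARY_SIDEKIQ_REDIS_URL` variable that is only set on the canary:

```ruby
# Hypothetical initializer, enabled on the canary only; the variable name is
# an assumption, not an existing setting.
redis_url = ENV['CANARY_SIDEKIQ_REDIS_URL'] || 'redis://localhost:6379/0'

Sidekiq.configure_server do |config|
  config.redis = { url: redis_url }
end

Sidekiq.configure_client do |config|
  config.redis = { url: redis_url }
end
```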
@stanhu my concern there would be that we would not have both environments running on the same production data.
A thing to note, though, is that currently the canary has Sidekiq completely disabled due to some things not behaving correctly. We could deploy to the canary and not process the tasks, and maybe only run the new queues that belong to the new canary with sidekiq-cluster; that way we would be testing the new things, just manually for now.
This entire thread and related issues (#1924 (closed), #1504) are a bold step in a scalable direction.
I am uncomfortable with what is not being discussed in this thread, the primary issue being the potential for data corruption when the new code does not visibly fail yet has a different semantic understanding of the data than the old code. (Has it been discussed on Slack, or in another issue?)
One example is `lib/gitlab/current_settings.rb` -- even if there is no schema change, one may not want the new node to propagate its default settings (which may alter existing feature toggles and other configs) to the old nodes -- and, right now, it will, in certain edge cases involving unplanned network partitioning.
In my experience with larger orchestrations and moves to highly scalable microservices, everything needs to be versioned: views, models (schema, data-in-motion, and data-at-rest), controllers, and, most importantly, the interfaces between them. Havoc, or a general meltdown in the velocity of new features, ensued when those principles were not observed. I would be sad to see the scramble and the (unnecessary) heroism needed to triage and repair those sorts of issues.
In the current GitLab deployment, as a big monolith, at least the versions stay in sync once the migration of persistent storage is done. (Are the existing migrations migrating objects stored in Redis?)
There is a lot of additional scaffolding needed to support "safer" canary testing in production, depending on how much risk your organization (include sales and support in the dialog) is willing to take for the sake of deploying new features. Starting with:
(1) parallel execution of idempotent operations on the live stream of HTTP request data with old/new versions of MVC, comparing results along a set of metrics that include performance and resource consumption as well as functionality;
(2) scaffolding to duplicate a portion of the live stream into canaries (based on user/group/namespace/project/random selection);
(3) adapter objects to allow controllers of one generation to talk to model instances of a different generation.
The strategy I have seen work multiple times is to first invest in sharding the data/transactions in middleware before the Controllers are invoked, and then to allow migration of an MVC monolith for a single shard to be performed independently of all other shards -- each shard gets its own instance of the existing monolith -- and, with this approach, one can build an A/B canary of a fully functional shard with all of its data without risk to all of the other shards. This initial investment gives a lot of runway for the app team to start breaking out smaller services from the original MVC monolith.
There's a lot more to be discussed, depending on the direction the team wishes to take.
Right now, I think we might be able to use the canary for changes that don't involve changes in Sidekiq workers. We'd also have to look at whether there are new migrations. Perhaps the simplest thing to do is try to see if we can use it for 9.3 RC3 and subsequent release candidates.
I think the first step is to start getting comfortable with deploying on canary at a frequent basis. When it becomes something we can use, we can find ways to route people automatically to the canary.
We have to adapt the deploy rake task so it skips over the Gitaly Chef run on the canary:
```
✖ Running chef-client on gitlab-canary-git-data-storage (2.17 sec)
FATAL: No nodes returned from search
rake aborted!
Failed to execute command: bundle exec knife ssh -a ipaddress 'roles:gitlab-canary-git-data-storage' 'sudo chef-client'
/Users/clement/Open/chef-repo/Rakefile:110:in `block in run_with_progress'
/Users/clement/Open/chef-repo/Rakefile:76:in `run_with_progress'
/Users/clement/Open/chef-repo/Rakefile:121:in `run_command_on_roles'
/Users/clement/Open/chef-repo/Rakefile:339:in `block (3 levels) in <top (required)>'
/Users/clement/Open/chef-repo/Rakefile:197:in `yield_and_wait_until_reload'
/Users/clement/Open/chef-repo/Rakefile:338:in `block (2 levels) in <top (required)>'
/Users/clement/Open/chef-repo/Rakefile:337:in `each'
/Users/clement/Open/chef-repo/Rakefile:337:in `block in <top (required)>'
/Users/clement/.rvm/gems/ruby-2.3.2/gems/rake-12.0.0/exe/rake:27:in `<top (required)>'
/Users/clement/.rvm/gems/ruby-2.3.2/bin/ruby_executable_hooks:15:in `eval'
/Users/clement/.rvm/gems/ruby-2.3.2/bin/ruby_executable_hooks:15:in `<main>'
Tasks: TOP => deploy
(See full trace by running task with --trace)
```
Ideally, we would deploy 9.4 RC1 to the canary and run that for a day or so before rolling out to the entire fleet. Is this feasible? I just did a scan of impending 9.4 changes in EE and CE compared to 9.3.5:
There are no new Sidekiq workers
There are a few columns that are removed (`position` in the `merge_request_diffs` table and `position` in the `issue_metrics` table)
I thought I saw in my GDK that the removal of the `position` column caused problems. But let's say the removal of the column is no trouble for the rest of the fleet. Could we potentially then run RC1 on the canary node for 24 hours? Are there other changes that we need to consider?
I also looked around for EE nightly packages here (https://packages.gitlab.com/gitlab/unstable), and even tried to use `apt-cache madison` to see if I could find the most recent copy, to no avail. Are our EE nightly packages being built?
@stanhu We have an issue with how our CI artifacts work; there seems to be a regression. If we don't find it and fix it, our release will be in danger. I am looking into where the issue is, and will ask the CI team for help.
Ok, it looks like our nightly builds for EE are now showing up. Great!
One issue that concerns me is that the package names are not right: https://gitlab.com/gitlab-org/omnibus-gitlab/issues/864. For example, last night's build is prefaced with `gitlab-ee-8.1.0+git.3298.915a987.56671-rc1.ce.0`. That actually looks like a downgrade from 9.4, and the ee mixed in with ce is also confusing.
Couldn't we simplify the first 4 steps by just using an EE nightly package? The state of EE compared to CE is basically irrelevant for these purposes: it only matters for a release.
One issue that concerns me is that the package names are not right
Also, I think this issue is still not resolved: the latest nightly package is named `gitlab-ee-8.1.0+git.3353.19a72ff.57769-rc1.ce.0.el7.x86_64.rpm`.
This canary will now have migrations not present in 9.5.0. I presume that we still want to update GitLab.com with 9.5.x in forthcoming releases. That means the canary should really just reflect master, and we shouldn't update other nodes to use the same version.
We have to make sure database migrations are truly backwards compatible. If something goes wrong, we either have to revert the DB changes or upgrade the whole fleet to the EE nightly. We don't have a good mechanism to revert the DB changes right now.
Assuming we can pull that off, that means post-deployment migrations (e.g. to remove a column) will never run from the 23rd to September 22.
Lastly, I think we will have to figure out a plan to handle Sidekiq versioning issues to make this work.
@stanhu What is needed to implement the plan you outline:
the canary should really just reflect master
make sure database migrations are truly backwards compatible
make a good mechanism to revert the DB changes
figure out a plan to handle Sidekiq versioning issues
First, do we need all of these points for a nightly deploy? Second, do we know how to do each of the needed points? And third, do we have issues and timelines to resolve each of the needed points?
First, do we need all of these points for a nightly deploy? Second, do we know how to do each of the needed points? And third, do we have issues and timelines to resolve each of the needed points?
I think we do. We should create separate issues.
the canary should really just reflect master
Yes, because the nightly builds use master.
make sure database migrations are truly backwards compatible
This should be the case with the zero-downtime deploys, but it's possible we hit edge cases where something goes wrong.
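For reference, the zero-downtime pattern for something like adding a column with a default looks roughly like this (a sketch only; the migration name and column are made up, and it assumes the `add_column_with_default` helper from `Gitlab::Database::MigrationHelpers`):

```ruby
# Hypothetical migration: adds a column with a default in batches so that the
# 9.x code still running on the old nodes is unaffected while it runs.
class AddCanaryEnabledToApplicationSettings < ActiveRecord::Migration
  include Gitlab::Database::MigrationHelpers

  DOWNTIME = false

  disable_ddl_transaction!

  def up
    add_column_with_default :application_settings, :canary_enabled, :boolean, default: false
  end

  def down
    remove_column :application_settings, :canary_enabled
  end
end
```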
make a good mechanism to revert the DB changes
Yes. Let's say master introduces 10 new database migrations. Right now, to roll them back, we'd have to identify which ones were introduced or are breaking the current deploy, and then manually roll them back. We could create some tooling around this to make it less painful.
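A sketch of what that tooling could look like, assuming we record the schema version before each canary deploy (the baseline bookkeeping and the environment variable are assumptions, not existing tooling):

```ruby
# Hypothetical rollback helper: roll back every migration newer than the
# recorded baseline using the standard Rails task, newest first.
baseline = Integer(ENV.fetch('BASELINE_SCHEMA_VERSION'))

Dir['db/migrate/*.rb', 'db/post_migrate/*.rb']
  .map { |path| File.basename(path)[/\A\d+/].to_i }
  .select { |version| version > baseline }
  .sort
  .reverse_each do |version|
    system("bundle exec rake db:migrate:down VERSION=#{version}") ||
      abort("Failed to roll back #{version}")
  end
```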
figure out a plan to handle Sidekiq versioning issues
Yes, that's an important one. There is lots of good discussion about this here already.
We have to make sure database migrations are truly backwards compatible. If something goes wrong, we either have to revert the DB changes or upgrade the whole fleet to the EE nightly. We don't have a good mechanism to revert the DB changes right now.
This also means that we can't remove a column in a single MR: we need to ignore it, merge that, and then create a separate MR to actually remove it. (As obviously we can't roll back the data otherwise.)
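For concreteness, a sketch of the two-step removal, assuming a helper along the lines of the `ignore_column` concern in gitlab-ce (the exact class and helper names may differ):

```ruby
# First MR: stop ActiveRecord from reading or writing the column.
class MergeRequestDiff < ActiveRecord::Base
  include IgnorableColumn
  ignore_column :position
end

# Second, later MR: actually drop the column once every node ignores it.
class RemovePositionFromMergeRequestDiffs < ActiveRecord::Migration
  DOWNTIME = false

  def change
    remove_column :merge_request_diffs, :position, :integer
  end
end
```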
How long do we need to leave between those two steps? They can happen in a single release, but not in a single nightly deploy. Is a day enough of a gap?
A day is enough of a gap in most cases. But we do have cases where our mirroring fails to sync repos, and then we get all the updates at once when it gets noticed/resolved.
I created 4 follow-up issues, updated the body of the issue description here, and marked this issue as Blocked. @stanhu I'd appreciate your views on each of the four new issues in terms of who should "own" it, which product manager should be involved, and so forth.
@stanhu how feasible is it to deploy nightly versions automatically to the canary to ensure that our application changes are backwards compatible with the old schema?