In #1924 (closed), we now have a set of canary machines to which we should be able to deploy iterations before they hit an RC. What do we need to do to use this?
One complication discussed in Slack is that Sidekiq jobs may need to be versioned in some way, since it's possible for the Rails process to schedule new jobs or change parameters for an existing worker that are not present in older versions.
Here is a strawman proposal for how we might use this canary deployment in practice:
Use the nightly EE omnibus package
The package gets deployed on the canary (e.g. via ChatOps). The package can be rolled back if necessary.
Alerts for the canary get sent out to flag possible issues.
What do we need to do to address each point?
UPDATE:
Per discussion below, the following four points need to be addressed to unblock this:
An omnibus package gets built for EE for the latest master.
This is easy to do now: we have a way of triggering builds through CI. This could even be automatic if we decide to do so; it is not complicated to change.
The problem is that it depends on our mirroring inside GitLab, which has been unreliable, to say the least.
Couldn't we simplify the first 4 steps by just using an EE nightly package? The state of EE compared to CE is basically irrelevant for these purposes: it only matters for a release.
@pcarranza I think the main issue left is Sidekiq versioning. How do we get Sidekiq workers to only execute Sidekiq jobs they actually know how to process, and/or how do we write Sidekiq jobs so that this doesn't matter most of the time?
How do we get Sidekiq workers to only execute Sidekiq jobs they actually know how to process
Perhaps a description of the payload needs to be part of the queue name (similar to the version portion of Protocol Buffers), and the workers can then pull from queues whose names match data they know how to process.
Use different queue names for every Sidekiq worker [...] (annoying)
Worse than annoying: it sounds like a lot of manual configuration until there is some dynamic service discovery mechanism where the workers can advertise what their capabilities are, and the app can push into the correct queue based on that.
Use a special Redis cluster/DB for Sidekiq in the canary deployment
Doesn't this pretty much need to be done anyway? Consider the case where a controller in the canary has a different model (newer version) than the production nodes and stuffs a copy into the Rails.cache (which is backed by Redis) ... what prevents one of the prior FE nodes from pulling a model instance out of a shared Rails.cache that it does not fully understand?
See also: Issue gitlab-org/gitlab-ce#33182 (related)
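One way to sidestep that particular footgun might be to fold the running code revision into the cache key of anything that stores marshalled model instances. A minimal sketch, assuming a revision constant such as `Gitlab::REVISION` is available; the helper below is hypothetical, not existing code:

```ruby
# Hypothetical helper: version cache entries by code revision so a canary node
# and an older node never deserialize each other's marshalled objects.
def cached_project(project_id)
  Rails.cache.fetch("project:#{project_id}:rev:#{Gitlab::REVISION}") do
    Project.find(project_id)
  end
end
```

The obvious cost is a cold cache for whichever nodes just changed revision.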
Perhaps a description of the payload needs to be part of the queue name (similar to the version portion of Protocol Buffers), and the workers can then pull from queues whose names match data they know how to process.
I don't think Sidekiq really works well with this model. If you wanted to do this, you would have to have a worker pull from the queue and requeue the job if it doesn't match. That seems quite awkward and could lead to unintended consequences.
you would have to have a worker pull from the queue and requeue the job if it doesn't match
There is no need for re-queueing if the queue name (the Redis key) includes a description of the data and version; workers would only pull from queues whose names they recognize.
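A minimal sketch of what that could look like with plain Sidekiq, assuming a hypothetical per-worker `PAYLOAD_VERSION` constant; the worker and queue names are illustrative, not existing GitLab code:

```ruby
class ExamplePayloadWorker
  include Sidekiq::Worker

  # Hypothetical constant, bumped whenever the argument shape changes.
  PAYLOAD_VERSION = 2

  sidekiq_options queue: "example_payload_v#{PAYLOAD_VERSION}"

  def perform(project_id)
    # Only jobs enqueued by code that knows the v2 shape land in
    # example_payload_v2; an older deployment keeps pulling from
    # example_payload_v1, whose shape it understands.
  end
end
```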
At the moment I see that gitlab-ce/master has 5 Sidekiq queue names in use (plus 31 dedicated workers):
cronjob
repository_check
build
pipeline
pages
31 dedicated workers, formatted as `name.sub(/Worker\z/, '').underscore.tr('/', '_')`
Each of those queue names above could be suffixed with the payload type+version.
How is that different from the first idea I proposed?
If I understood your first idea ("e.g. mailers1, mailers2, etc.") correctly, it had a separate number suffix for each worker; a single number does not convey the capability without some sort of additional binding work.
With the suffix as "payload type + version" (both of which could correspond to an ActiveRecord model + version), it does not need dynamic service discovery to perform the bindings.
PS: there are 36 sites in the gitlab-ce/master code, 36 queues listed in `config/sidekiq_queues.yml`, and 57 spec test cases???
Yes, the number was meant to convey a version. I still don't think that's a great idea because you could easily get into a situation where jobs get stuck in some queue that's never pulled.
Ahh ... I previously read it as instance number rather than version number
get into a situation where jobs get stuck in some queue that's never pulled.
In large-scale orchestrations I have seen, you need:
service providers for later versions to also pull from queues of older versions, with internal adapter-pattern objects to reformat the data at runtime from one version to another (no separate migration step for data already in the queues); the adapter-pattern object can generally reuse the DB migration code and Rails framework (see the sketch after this list)
a monitoring console for queue depth and max age, with alerts for queues that are too old or too big
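A rough sketch of the first point, with purely illustrative class and queue names (none of this exists in GitLab today): the newer deployment keeps listening on the old queue as well, e.g. `sidekiq -q mailers_v1 -q mailers_v2`, and retains a thin shim worker that adapts old arguments to the new code path.

```ruby
# All names are illustrative. v1 jobs were enqueued with a single argument;
# the v2 code path also expects a notification type.
class MailersV1Worker
  include Sidekiq::Worker
  sidekiq_options queue: 'mailers_v1'

  # Adapter: upgrade the v1 argument list and delegate to the v2 worker.
  def perform(user_id)
    MailersV2Worker.new.perform(user_id, 'default')
  end
end

class MailersV2Worker
  include Sidekiq::Worker
  sidekiq_options queue: 'mailers_v2'

  def perform(user_id, notification_type)
    # ... deliver using the current code path ...
  end
end
```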
@stanhu don't both of your suggestions prevent the canary from picking up Sidekiq jobs scheduled from non-canaries? We want that to be possible, both because the new version should work with jobs scheduled by the old version, and because it allows us to get a greater variety of jobs run by the canary.
On migrations: I don't see any mention of staging-before-canary, but surely that's essential for testing the timings and likely impact? Will we have a canary in staging, or just use regular staging?
don't both of your suggestions prevent the canary from picking up Sidekiq jobs scheduled from non-canaries? We want that to be possible, both because the new version should work with jobs scheduled by the old version, and because it allows us to get a greater variety of jobs run by the canary.
@smcgivern Right, the first idea might make it possible if the canary can pick up jobs of older versions. Do you have other suggestions?
@stanhu nothing in particular, other than manual inspection of `git diff $production...$canary app/workers` - my suspicion is that most of the time they could work together just fine
@smcgivern @stanhu I think that we do need to get to the point where canary and standard can work together; that will enable us to deploy many more versions in production without impact.
@smcgivern I think that we should not have a canary in staging. If we want, we could be deploying to staging, but keep in mind that the canary should have a compatible model, or that the application should support having the canary running there and then having it go away.
One of the core values of a canary is that it may fail hard, which should be perfectly fine for the whole application. Imagine this:
we spin up a canary with a new compatible version.
it fails miserably, raising a lot of errors, but writes something to the database or Sidekiq.
we stop it and tear it down to go fix the errors.
We should reach a point where we are confident that we don't need staging anymore and that we can deploy to canary directly, so adding yet another step would be going the other way.
@pcarranza is the rest of that plan written down anywhere? I think I missed something while I was away, and that seems like it would change our release process quite a bit.
This is the continuation of an issue that took a lot of shortcuts to enable one canary deployment in production, which has already been delivered from the production and build side: https://gitlab.com/gitlab-com/infrastructure/issues/1924
I think that @stanhu is trying to open the discussion of handling whatever is left to deploy the first canary in production, which, as stated before, is already available as a resource.
Right, this issue is specifically targeted to making use of the canary now that we have it.
For example, RC1 is supposed to be out sometime this week. I could see us trying it out on staging to verify nothing blows up. If that passes, could we run this on the canary host this week?
Just looking at the diff between 9-2-stable and master, I already see that we introduced new Sidekiq workers (e.g. NamespacelessProjectDestroyWorker and RemoveOldWebHookLogsWorker). The former won't cause an issue because it uses its own dedicated queue. However, the latter uses the shared CronjobQueue, so any OLD Sidekiq process attempting to run it will fail.
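For reference, the new worker is wired up roughly like this (a simplified sketch with the body omitted, not the actual implementation). Because it rides on the shared cronjob queue, an older Sidekiq process listening on that queue will fail to constantize the class, raise an error, and push the job onto the retry set:

```ruby
class RemoveOldWebHookLogsWorker
  include Sidekiq::Worker
  include CronjobQueue # puts the worker on the shared 'cronjob' queue

  def perform
    # ... cleanup logic lives here in the real worker ...
  end
end
```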
I see a number of pitfalls with deploying RC1 to the canary:
We have one migration (`20170503140201_reschedule_project_authorizations.rb`) that reschedules project authorizations for all users in a Sidekiq job. We don't want this job to be picked up by older workers.
We have new Sidekiq workers that will be unknown to older workers, which will cause the jobs to fail and be retried later.
Note that we drop the `authorized_projects_populated` column, which is needed by older app servers, in a post-deployment migration. So we shouldn't run the post-deployment migrations until the full fleet is deployed (which is true regardless of the canary).
The quickest way forward would be to have a separate Redis instance just for the canary. However, that runs the risk we'll schedule Sidekiq jobs that can only be picked up by the canary.
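If we went that route, the wiring itself would be small. A sketch, assuming a made-up `CANARY_SIDEKIQ_REDIS_URL` variable that is only set on the canary:

```ruby
# Hypothetical initializer, enabled on the canary only; the variable name is
# an assumption, not an existing setting.
redis_url = ENV['CANARY_SIDEKIQ_REDIS_URL'] || 'redis://localhost:6379/0'

Sidekiq.configure_server do |config|
  config.redis = { url: redis_url }
end

Sidekiq.configure_client do |config|
  config.redis = { url: redis_url }
end
```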
@stanhu my concern there would be that we would not have both environments running on the same production data.
A thing to note, though, is that currently the canary has Sidekiq completely disabled due to some things not behaving correctly. We could deploy to the canary and not process the tasks, and maybe only run the new queues that belong to the new canary with sidekiq-cluster; that way we would be testing the new things, just manually for now.
This entire thread and related issues (#1924 (closed), #1504) are a bold step in a scalable direction.
I am uncomfortable with what is not being discussed in this thread, the primary issue being the potential for data corruption when the new code does not visibly fail yet has a different semantic understanding of the data than the old code. (Has it been discussed on Slack, or in another issue?)
One example is `lib/gitlab/current_settings.rb` -- even if there is no schema change, one may not want the new node to propagate its default settings (which may alter existing feature toggles and other configs) to the old nodes -- and, right now, it will, in certain edge cases involving unplanned network partitioning.
In my experience with larger orchestrations and moves to highly scalable microservices, everything needs to be versioned: views, models (schema, data-in-motion, and data-at-rest), controllers, and, most importantly, the interfaces between them. Havoc, or a general meltdown in the velocity of new features, ensued when those principles were not observed. I would be sad to see the scramble and the (unnecessary) heroism needed to triage and repair those sorts of issues.
In the current GitLab deployment, as a big monolith, at least the versions stay in sync once the migration of persistent storage is done. (Are the existing migrations migrating objects stored in Redis?)
There is a lot of additional scaffolding needed to support "safer" canary testing in production, depending on how much risk your organization (include sales and support in the dialog) is willing to take for the sake of deploying new features. Starting with:
(1) parallel execution of idempotent operations on the live stream of HTTP request data with old/new versions of MVC, comparing results along a set of metrics that include performance and resource consumption as well as functionality;
(2) scaffolding to duplicate a portion of the live stream into canaries (based on user/group/namespace/project/random selection);
(3) adapter objects to allow controllers of one generation to talk to model instances of a different generation.
The strategy I have seen work multiple times is to first invest in sharding the data/transactions in middleware before the Controllers are invoked, and then to allow migration of an MVC monolith for a single shard to be performed independently of all other shards -- each shard gets its own instance of the existing monolith -- and, with this approach, one can build an A/B canary of a fully functional shard with all of its data without risk to all of the other shards. This initial investment gives a lot of runway for the app team to start breaking out smaller services from the original MVC monolith.
There's a lot more to be discussed, depending on the direction the team wishes to take.
Right now, I think we might be able to use the canary for changes that don't involve changes in Sidekiq workers. We'd also have to look at whether there are new migrations. Perhaps the simplest thing to do is try to see if we can use it for 9.3 RC3 and subsequent release candidates.
I think the first step is to start getting comfortable with deploying on canary at a frequent basis. When it becomes something we can use, we can find ways to route people automatically to the canary.
We have to adapt the deploy rake task so it skips over the Gitaly Chef run on the canary:
```
✖ Running chef-client on gitlab-canary-git-data-storage (2.17 sec)
FATAL: No nodes returned from search
rake aborted!
Failed to execute command: bundle exec knife ssh -a ipaddress 'roles:gitlab-canary-git-data-storage' 'sudo chef-client'
/Users/clement/Open/chef-repo/Rakefile:110:in `block in run_with_progress'
/Users/clement/Open/chef-repo/Rakefile:76:in `run_with_progress'
/Users/clement/Open/chef-repo/Rakefile:121:in `run_command_on_roles'
/Users/clement/Open/chef-repo/Rakefile:339:in `block (3 levels) in <top (required)>'
/Users/clement/Open/chef-repo/Rakefile:197:in `yield_and_wait_until_reload'
/Users/clement/Open/chef-repo/Rakefile:338:in `block (2 levels) in <top (required)>'
/Users/clement/Open/chef-repo/Rakefile:337:in `each'
/Users/clement/Open/chef-repo/Rakefile:337:in `block in <top (required)>'
/Users/clement/.rvm/gems/ruby-2.3.2/gems/rake-12.0.0/exe/rake:27:in `<top (required)>'
/Users/clement/.rvm/gems/ruby-2.3.2/bin/ruby_executable_hooks:15:in `eval'
/Users/clement/.rvm/gems/ruby-2.3.2/bin/ruby_executable_hooks:15:in `<main>'
Tasks: TOP => deploy
(See full trace by running task with --trace)
```
Ideally, we would deploy 9.4 RC1 to the canary and run that for a day or so before rolling out to the entire fleet. Is this feasible? I just did a scan of impending 9.4 changes in EE and CE compared to 9.3.5:
There are no new Sidekiq workers
There are a few columns that are removed (`position` in the `merge_request_diffs` table and `position` in the `issue_metrics` table)
I thought I saw in my GDK that the removal of the `position` column caused problems. But let's say the removal of the column is no trouble for the rest of the fleet. Could we potentially then run RC1 on the canary node for 24 hours? Are there other changes that we need to consider?
I also looked around for EE nightly packages here (https://packages.gitlab.com/gitlab/unstable), and even tried to use `apt-cache madison` to see if I could find the most recent copy, to no avail. Are our EE nightly packages being built?
@stanhu We have an issue with how our CI artifacts work; there seems to be a regression. If we don't find it and fix it, our release will be in danger. I am looking into where the issue is, and will ask the CI team for help.
Ok, it looks like our nightly builds for EE are now showing up. Great!
One issue that concerns me is that the package names are not right: https://gitlab.com/gitlab-org/omnibus-gitlab/issues/864. For example, last night's build is prefaced with `gitlab-ee-8.1.0+git.3298.915a987.56671-rc1.ce.0`. That actually looks like a downgrade from 9.4, and the ee mixed in with ce is also confusing.
Couldn't we simplify the first 4 steps by just using an EE nightly package? The state of EE compared to CE is basically irrelevant for these purposes: it only matters for a release.
One issue that concerns me is that the package names are not right
Also, I think this issue is still not resolved: the latest nightly package is named `gitlab-ee-8.1.0+git.3353.19a72ff.57769-rc1.ce.0.el7.x86_64.rpm`.
This canary will now have migrations not present in 9.5.0. I presume that we still want to update GitLab.com with 9.5.x in forthcoming releases. That means the canary should really just reflect master, and we shouldn't update other nodes to use the same version.
We have to make sure database migrations are truly backwards compatible. If something goes wrong, we either have to revert the DB changes or upgrade the whole fleet to the EE nightly. We don't have a good mechanism to revert the DB changes right now.
Assuming we can pull that off, that means post-deployment migrations (e.g. to remove a column) will never run from the 23rd to September 22.
Lastly, I think we will have to figure out a plan to handle Sidekiq versioning issues to make this work.
@stanhu What is needed to implement the plan you outline:
the canary should really just reflect master
make sure database migrations are truly backwards compatible
make a good mechanism to revert the DB changes
figure out a plan to handle Sidekiq versioning issues
First, do we need all of these points for a nightly deploy? Second, do we know how to do each of the needed points? And third, do we have issues and timelines to resolve each of the needed points?
First, do we need all of these points for a nightly deploy? Second, do we know how to do each of the needed points? And third, do we have issues and timelines to resolve each of the needed points?
I think we do. We should create separate issues.
the canary should really just reflect master
Yes, because the nightly builds use master.
make sure database migrations are truly backwards compatible
This should be the case with the zero-downtime deploys, but it's possible we hit edge cases where something goes wrong.
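For reference, the zero-downtime pattern for something like adding a column with a default looks roughly like this (a sketch only; the migration name and column are made up, and it assumes the `add_column_with_default` helper from `Gitlab::Database::MigrationHelpers`):

```ruby
# Hypothetical migration: adds a column with a default in batches so that the
# 9.x code still running on the old nodes is unaffected while it runs.
class AddCanaryEnabledToApplicationSettings < ActiveRecord::Migration
  include Gitlab::Database::MigrationHelpers

  DOWNTIME = false

  disable_ddl_transaction!

  def up
    add_column_with_default :application_settings, :canary_enabled, :boolean, default: false
  end

  def down
    remove_column :application_settings, :canary_enabled
  end
end
```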
make a good mechanism to revert the DB changes
Yes. Let's say master introduces 10 new database migrations. Right now, to roll them back, we'd have to identify which ones were introduced or are breaking the current deploy, and then manually roll them back. We could create some tooling around this to make it less painful.
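A sketch of what that tooling could look like, assuming we record the schema version before each canary deploy (the baseline bookkeeping and the environment variable are assumptions, not existing tooling):

```ruby
# Hypothetical rollback helper: roll back every migration newer than the
# recorded baseline using the standard Rails task, newest first.
baseline = Integer(ENV.fetch('BASELINE_SCHEMA_VERSION'))

Dir['db/migrate/*.rb', 'db/post_migrate/*.rb']
  .map { |path| File.basename(path)[/\A\d+/].to_i }
  .select { |version| version > baseline }
  .sort
  .reverse_each do |version|
    system("bundle exec rake db:migrate:down VERSION=#{version}") ||
      abort("Failed to roll back #{version}")
  end
```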
figure out a plan to handle Sidekiq versioning issues
Yes, that's an important one. There is lots of good discussion about this here already.
We have to make sure database migrations are truly backwards compatible. If something goes wrong, we either have to revert the DB changes or upgrade the whole fleet to the EE nightly. We don't have a good mechanism to revert the DB changes right now.
This also means that we can't remove a column in a single MR: we need to ignore it, merge that, and then create a separate MR to actually remove it. (As obviously we can't roll back the data otherwise.)
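For concreteness, a sketch of the two-step removal, assuming a helper along the lines of the `ignore_column` concern in gitlab-ce (the exact class and helper names may differ):

```ruby
# First MR: stop ActiveRecord from reading or writing the column.
class MergeRequestDiff < ActiveRecord::Base
  include IgnorableColumn
  ignore_column :position
end

# Second, later MR: actually drop the column once every node ignores it.
class RemovePositionFromMergeRequestDiffs < ActiveRecord::Migration
  DOWNTIME = false

  def change
    remove_column :merge_request_diffs, :position, :integer
  end
end
```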
How long do we need to leave between those two steps? They can happen in a single release, but not in a single nightly deploy. Is a day enough of a gap?
A day is enough of a gap in most cases. But we do have cases where our mirroring fails to sync repos, and then we get all the updates at once when it gets noticed/resolved.
I created 4 follow-up issues, updated the body of the issue description here, and marked this issue as Blocked. @stanhu I'd appreciate your views on each of the four new issues in terms of who should "own" it, which product manager should be involved, and so forth.
@stanhu how feasible is it to deploy nightly versions automatically to the canary to ensure that our application changes are backwards compatible with the old schema?