Depending on your deployment strategy for preserving cache data, you may need to open a new issue to implement the scripting process that splits the existing data into the various clusters.
Also, by default, there are 3 new Redis cluster types available (sketched below):

- caching (of frequently used and expensive-to-calculate items)
- queues (Sidekiq)
- shared state (keeps the front-end nodes in sync with each other for things like browser session data)
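To make the split concrete, the application ends up holding three separate connections, one per role, roughly like this (the hostnames and the bare `redis` gem usage are placeholders, not our real endpoints):

```ruby
# Sketch only: one dedicated Redis connection per cluster type.
require 'redis'

REDIS_CACHE        = Redis.new(url: 'redis://redis-cache.example.internal:6379')        # expiring cache entries
REDIS_QUEUES       = Redis.new(url: 'redis://redis-queues.example.internal:6379')       # Sidekiq queues
REDIS_SHARED_STATE = Redis.new(url: 'redis://redis-shared-state.example.internal:6379') # sessions and other shared state
```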
Deploy to staging.
You may want to plan some dry runs with the actual production Redis data to test the splitting process for performance and accuracy.
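As a very rough idea of what such a dry run could exercise (not an agreed implementation; the `cache:*` key pattern, the hostnames, and the DUMP/RESTORE approach are all assumptions to validate against the real data):

```ruby
# Copy cache keys from the current combined instance into the new cache
# instance, so we can time the process and spot-check the copied data.
# Assumes the destination does not already hold these keys: RESTORE raises
# a BUSYKEY error if it does.
require 'redis'

src = Redis.new(url: 'redis://current-redis.example.internal:6379')
dst = Redis.new(url: 'redis://redis-cache.example.internal:6379')

copied = 0
src.scan_each(match: 'cache:*', count: 1000) do |key|
  ttl = src.pttl(key)            # -2: key is gone, -1: no expiry, else ms left
  next if ttl == -2

  payload = src.dump(key)
  next if payload.nil?

  dst.restore(key, ttl == -1 ? 0 : ttl, payload)
  copied += 1
end

puts "copied #{copied} cache keys"
```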
Support for multiple Redis instances in Omnibus is scheduled for 10.0 and any kind of work between now and then would be redundant, fatuous and ultimately dangerous.
I've offered to help the build team expedite the delivery as soon as there is a solid implementation strategy.
This was released in 9.5. You could use it today by configuring an environment variable to point the Redis cache to the file with that configuration, as we did in https://gitlab.com/gitlab-org/gitlab-ce/issues/36514 to work around an issue. Having omnibus support would make that a little easier, but do we really need to wait?
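For context, the override described there boils down to something like this on the Rails side (the environment variable name below is my assumption; the linked issue has the exact mechanism that was used):

```ruby
# Resolve a dedicated cache config file from an (assumed) override variable,
# falling back to the file shipped in the Rails config directory.
require 'yaml'

config_file  = ENV.fetch('GITLAB_REDIS_CACHE_CONFIG_FILE', 'config/redis.cache.yml')
environment  = ENV.fetch('RAILS_ENV', 'production')
cache_params = YAML.load_file(config_file).fetch(environment)
# cache_params is then e.g. { 'url' => 'redis://...', 'sentinels' => [...] }
```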
As a team, we need to strive to only use the things that we ship with the omnibus package. But in this issue, we are being told to work around the omnibus package and use environment variables to inject things from the outside.
This will not be easily usable by our customers and will definitely not just work.
This leaves us in a conflict situation in which we can't make this investment unless it's explicitly signed off by someone else in the chain of command.
This also means that whatever we do today with environment variables will need to be undone and then redone when we have omnibus ready to do it, which could mean multiple Redis migrations depending on how it is implemented down the line; we just don't know yet.
@sitschner, @stanhu do you sign off on spending time building something that will not be shipped to our customers and will only improve GitLab.com availability and performance?
- We create a new config file to point to this cache, and use Chef to deploy it (a Chef sketch of this follows below).
- We define a Rails.env to use this.
- We document this in the GitLab docs if others want to do this.
In 10.0, when we have omnibus support, we:

- Remove the config file and let omnibus create the config.
- Remove the custom Rails.env setting.
We shouldn't need to migrate anything in Redis, just change config files. This will allow us to validate that the cache is working before we fully support it in omnibus.
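A minimal Chef sketch of the first step above, just to show the shape of it (the path, ownership, and placeholder URL would need to match the real roles and cookbook):

```ruby
# Drop a redis.cache.yml next to the Rails config so the app sends cache
# traffic to the dedicated instance. Omnibus takes ownership of this in 10.0.
file '/opt/gitlab/embedded/service/gitlab-rails/config/redis.cache.yml' do
  owner 'git'
  group 'git'
  mode  '0644'
  content <<~YML
    production:
      url: redis://redis-cache.example.internal:6379
  YML
end
```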
If you feel like we can wait another month, then it's fine to push this into 10.0. But if we think that Redis has a > 10% chance of falling over before then, then we should just do the quick workaround so that we have something running.
In chatting with @pcarranza about this topic, he pointed out that "Redis seems to be floating between 82-90G out of 110", and we've had repo and user growth of 200k and 400k respectively compared to the last time Pablo looked (not sure about the time frame on that). So it could fall over before 10.0 but it is difficult to get an actual probability on that.
So here we have a classic case of balancing A: "keep GitLab.com stable and performant" with B: "use what we ship / ship what we use" for a scenario where those two things are not fully aligned in terms of timescale. IMO, in cases where sticking to B negatively impacts A, A beats B.
That consideration, taken together with the benefit of thus also validating that the cache is working before we fully support it in omnibus (as @stanhu mentioned), leads me to vote for doing the quick workaround. But... I'm no longer Dir of Infra :-) So, @sitschner do you give this a yay (quick workaround) or nay (wait)?
This is more a note to self rather than anything else.
As part of this I'm going to create separate sentinel clusters, isolated from the Redis servers. @marin kindly tested that omnibus can in fact support this configuration; the results were successful. The only requirement is that we set `redis.master_ip` and `redis.master_address` in the sentinel settings.
I'm working on the Chef roles and Terraform configuration to get the structure up in staging.
Update 2017-09-06: `master_password` is also required.
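For reference, on a sentinel node this amounts to settings along these lines in `/etc/gitlab/gitlab.rb` (placeholder values; the attribute names should be double-checked against the Omnibus docs for the version we deploy):

```ruby
# Point the sentinel at the Redis master for this cluster.
redis['master_name']     = 'gitlab-redis-cache'
redis['master_ip']       = '10.0.0.10'
redis['master_password'] = 'REDACTED'

sentinel['enable'] = true
sentinel['quorum'] = 2
```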
I just managed to split the Redis cache in staging. This is still a manual process, but now I know how to configure the cookbook to inject the `redis.cache.yml` file in the right place.
It turns out that currently you can't change the persistence settings in Omnibus. You always write to disk using the default settings.
For the time being I can disable `chef-client` on `redis-cache-01.db.stg.gitlab.com` and continue to work on the `gitlab_redis` cookbook (I'm writing tests at the moment), but nothing is going to see production until that issue is solved.
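For clarity, "persistence" here means RDB snapshots and AOF; since Omnibus always writes its own settings, turning them off currently has to happen out of band. At runtime that is roughly (placeholder host):

```ruby
# Disable persistence on a cache node at runtime. This is not written back to
# redis.conf, so it is lost on restart unless the config file is also changed,
# which is exactly what Omnibus does not let us do yet.
require 'redis'

redis = Redis.new(url: 'redis://redis-cache-01.example.internal:6379')
redis.config(:set, 'save', '')          # no RDB snapshots
redis.config(:set, 'appendonly', 'no')  # no append-only file
```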
I think I can remove this issue from the WoW as there is no way I will be able to close it in the next two days. Omnibus needs to support both the client-side Redis split (https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2389) and the persistence settings in Redis server (no MR yet).
Since this isn't supported in Omnibus yet, we are going to delete these servers from staging (gitlab-com-infrastructure!147 (merged)). This will save us some money until we can actually use it in omnibus.
It looks like today we've hit the resource limitations of our current Redis setup. So we decided to move on and push this to production in a semi-manual mode.
I created a new Redis cache infrastructure (3 servers, 3 sentinels) in https://gitlab.com/gitlab-com/gitlab-com-infrastructure/merge_requests/159 and manually disabled persistence (the `chef-client` service is stopped on the `redis-cache*` nodes). Then I manually added `/opt/gitlab/embedded/service/gitlab-rails/config/redis.cache.yml` on all nodes running Unicorn and restarted Unicorn.
Everything seems to have recovered and the cache is flowing to a volatile Redis instance.
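A quick client-side sanity check of that state looks roughly like this (the master name and sentinel hostnames are placeholders for the real ones):

```ruby
# Connect through the sentinels the way the app does and confirm the cache
# master answers and has AOF switched off.
require 'redis'

cache = Redis.new(
  url: 'redis://gitlab-redis-cache',   # master name, resolved via the sentinels
  sentinels: [
    { host: 'sentinel-cache-01.example.internal', port: 26379 },
    { host: 'sentinel-cache-02.example.internal', port: 26379 },
    { host: 'sentinel-cache-03.example.internal', port: 26379 }
  ]
)

puts cache.ping                                # => "PONG"
puts cache.info('persistence')['aof_enabled']  # => "0" when AOF is off
```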