Redesign Redis installation
We are wasting money: we don't need 5 Redis servers. We need 5 Sentinel servers, and 3 Redis servers are enough.
This means that we need to:
- Move Sentinel out of Redis into a new set of hosts
- Remove Sentinel from all the Redis hosts
- Drop 2 Redis servers
Here's my revised plan after all the pain we felt recently.
We need to start with 5 small Sentinel hosts, independent of the Redis nodes. Then we need to split Redis into a number of instance types with different characteristics. This is still open for discussion, but for now I'd start with at least two: persist and cache.
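To make the "5 Sentinels" number concrete, here is a minimal sketch of the majority arithmetic and of the Sentinel config fragment it would lead to. The master name, address, and timeout values are placeholders I made up for illustration, and setting the quorum equal to the majority is an assumption rather than a requirement.

```python
def majority(sentinels: int) -> int:
    """Smallest number of Sentinels that forms a strict majority."""
    return sentinels // 2 + 1


def tolerated_failures(sentinels: int) -> int:
    """How many Sentinels can be lost while a majority is still reachable."""
    return sentinels - majority(sentinels)


def sentinel_conf(name: str, host: str, port: int, sentinels: int) -> str:
    """Render a sentinel.conf fragment for one monitored master.

    All values passed in (and the timeouts below) are hypothetical.
    """
    quorum = majority(sentinels)  # assumption: quorum == majority
    lines = [
        f"sentinel monitor {name} {host} {port} {quorum}",
        f"sentinel down-after-milliseconds {name} 5000",
        f"sentinel failover-timeout {name} 60000",
        f"sentinel parallel-syncs {name} 1",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(f"majority of 5: {majority(5)}, tolerated failures: {tolerated_failures(5)}")
    print(sentinel_conf("persist", "10.0.0.1", 6379, sentinels=5))
```

With 5 Sentinels a majority is 3, so we can lose 2 of them and still authorize a failover. Each Redis pod would get its own `sentinel monitor` entry on the same 5 hosts.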
Persist
This type is used by every key that requires strong persistence. We can use the same settings that we currently have. The only difference will be the number of keys stored and the data set size, both of which will be significantly smaller, resulting in a much more stable instance.
We could use a master and two slaves for this type. Redis Cluster could be implemented, but there are limitations that we need to be aware of, especially around multi-key operations.
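The multi-key limitation comes from how Redis Cluster maps keys to hash slots: a multi-key command only works when all of its keys land in the same slot. The sketch below reproduces the slot calculation (CRC16 of the key modulo 16384, honouring `{hash tags}`) so we can check candidate key names offline. The function names are mine, and the example keys are illustrative, not keys from our application.

```python
import binascii

SLOTS = 16384  # number of hash slots in Redis Cluster


def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that is hashed.

    If the key contains '{' followed later by '}' with at least one byte
    in between, only that substring is hashed; otherwise the whole key is.
    """
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:  # a closing brace exists and the tag is non-empty
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """Hash slot for a key: CRC16 (XMODEM variant) modulo 16384."""
    return binascii.crc_hqx(hash_tag(key.encode()), 0) % SLOTS


def same_slot(*keys: str) -> bool:
    """True if a multi-key command over these keys could run in a cluster."""
    return len({key_slot(k) for k in keys}) == 1


if __name__ == "__main__":
    for pair in (("user:1:name", "user:2:name"),
                 ("{user:1}:name", "{user:1}:email")):
        print(pair, "same slot:", same_slot(*pair))
```

Keys that must be used together in one command (for example in `MGET` or a transaction) would need a shared hash tag, which means auditing how the application names its keys before moving this type to Redis Cluster.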
Cache
All the volatile data should end up here. There is no persistence configured on this type, so if an instance goes down, everything on it is lost. This tradeoff should allow this type to be very performant.
We can replicate to another instance using diskless replication. My idea is that on these instances the disk is never touched other than for writing logs. Bonus points if we can shard from the application but I realise it's a big ask.
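As a starting point for the cookbook change, here is a minimal sketch that renders the Redis config lines I have in mind for the cache type: RDB snapshots off, AOF off, and diskless replication on. The helper name and the delay default are assumptions for illustration, not values taken from our current setup.

```python
def cache_conf(diskless_delay_s: int = 5) -> str:
    """Render a redis.conf fragment for a cache-only instance.

    diskless_delay_s is a hypothetical default: how long the master waits
    for more replicas to arrive before starting a diskless transfer.
    """
    lines = [
        'save ""',                 # disable RDB snapshots
        "appendonly no",           # disable the append-only file
        "repl-diskless-sync yes",  # stream the snapshot to replicas over the socket
        f"repl-diskless-sync-delay {diskless_delay_s}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(cache_conf())
```

One caveat to verify for the Redis version we run: `repl-diskless-sync` only covers the master side. Depending on the version, the replica may still write the snapshot it receives to its own disk before loading it.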
I've already started working on something in https://gitlab.com/gitlab-com/gitlab-com-infrastructure/merge_requests/23. I'm currently working to understand how to size those nodes. Then we need to adjust the Chef roles and cookbooks to do what we want.
We could implement other Redis "pods", like for Sidekiq, sessions, CI, etc. Let's identify them so I can build them too.
Update 25 May: we've got some 40 GB of cache in Redis. This means we need to rethink the cache pod. We need to shard this data and we could make good use of Redis Cluster for this. I'm going to ponder a bit on this and then update this issue with a plan.
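To get a first feel for the sharding, here is a rough sizing sketch: how many shards the 40 GB would need for a given node size. The node sizes and the usable-memory fraction are hypothetical inputs I picked for illustration, not decisions; the fraction leaves headroom for replication buffers, fragmentation, and growth.

```python
import math


def shards_needed(dataset_gb: float, node_ram_gb: float,
                  usable_fraction: float = 0.6) -> int:
    """Number of shards needed to hold dataset_gb.

    usable_fraction is an assumed share of node RAM we let Redis fill.
    """
    if dataset_gb <= 0 or node_ram_gb <= 0:
        raise ValueError("sizes must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable_gb = node_ram_gb * usable_fraction
    return math.ceil(dataset_gb / usable_gb)


if __name__ == "__main__":
    for ram in (16, 32, 64):  # hypothetical node sizes in GB
        print(f"{ram} GB nodes -> {shards_needed(40, ram)} shards")
```

If each shard also gets a replica, the node count doubles, so the node size we settle on changes the cost quite a bit.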