I have a valid explanation for at least the first one, but looking at the system behavior, at least 3. also happens as described:
What happens is this: we have DB load balancing, where one database is read-write and the second one is read-only: basically a master and a slave.
When a runner asks for new builds it asks a specific endpoint that can use either of those databases, and this is where the magic happens:
1. We push a new build to the read-write database,
2. We change the key in Redis and notify all runners to retry picking builds,
3. There's replication delay,
4. The runner connects to builds/register and checks the list of builds,
5. No build is found, so we return the changed key,
6. The endpoint checks the read-only DB, sees that there are no new builds, and we return 204 together with the changed notification key,
7. The runner stores the notification key and asks again, hitting the happy path because the notification key is the same,
8. Replication catches up and the build is pending, but the runner's stored notification key tells it that everything is up to date,
9. It fails to pick the new build (the sketch below shows this ordering).
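To make the race concrete, here is a self-contained Ruby sketch (plain hashes stand in for the primary, the replica and Redis; none of this is the actual GitLab code):

```ruby
require 'securerandom'

# In-memory stand-ins: primary (read-write), replica (read-only, replays with
# a delay) and Redis (effectively "lag-free").
primary = { builds: [] }
replica = { builds: [] }
redis   = {}

# Steps 1-2: push a new build to the primary and bump the notification key.
primary[:builds] << { id: 1, status: 'pending' }
redis['last_update'] = SecureRandom.hex(4)

# Steps 4-6: builds/register reads from the replica, which is still behind.
runner_last_update = redis['last_update']
pending = replica[:builds].select { |b| b[:status] == 'pending' }
puts "204 No Content, last_update=#{runner_last_update}" if pending.empty?

# Steps 7-9: replication catches up, but the runner already holds the newest
# key, so its next request takes the "nothing changed" path and the pending
# build is never picked up.
replica[:builds] = primary[:builds].dup
puts 'runner skips asking for builds' if redis['last_update'] == runner_last_update
```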
What we need:
After enqueuing a build (changing its status from pending to running) there should be a barrier before creating the notification or enqueueing Sidekiq jobs, to ensure that all slaves are up-to-date.
Otherwise we get a split-brain problem: we expect that information stored in Redis is an indication of something that is committed to the DB. With the current configuration this is not true, as the commit on the replicas may happen after the Redis value has been saved.
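A rough sketch of that ordering (the helper names here are placeholders, not existing GitLab code):

```ruby
# Desired ordering: nothing observable (the Redis notification key, Sidekiq
# jobs) is published before every replica has replayed the commit.
build.update!(status: 'running')       # 1. commit the state change on the primary

wait_for_all_replicas_to_catch_up      # 2. hypothetical barrier (see the loop sketch below)

bump_runner_notification_key(build)    # 3. only now change the Redis key...
enqueue_follow_up_sidekiq_jobs(build)  #    ...and enqueue dependent Sidekiq jobs
```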
In the short term, we are applying a patch to the load balancer code to force all runner API calls (e.g. api/v4/builds/register.json) to talk to the primary database. This will eliminate problems with inconsistent state due to replication lag.
We should also consider adding a feature flag in Workhorse to disable it from using the Redis LAST_UPDATE value.
@ayufan Having given this some thought, I see two potential solutions:
1. Synchronous commits using application logic (giving us more control)
2. Changing the key checking logic
Synchronous Commit
This would basically change the procedure from a regular write to a write followed by a loop. This loop will break once all secondaries are in sync with the master. This gives you the following pseudo code:
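(Roughly along these lines; the helpers and timeout are illustrative, not existing GitLab code, and the WAL function names are the PostgreSQL 9.x ones.)

```ruby
# Illustrative only: perform the write, then block until every secondary has
# replayed the primary's WAL position (or give up after a timeout).
def write_then_wait_for_secondaries(primary, secondaries, timeout: 30)
  primary.transaction { yield }

  # Current WAL position on the primary
  # (pg_current_xlog_location() up to PostgreSQL 9.6,
  #  pg_current_wal_lsn() from PostgreSQL 10 onwards).
  target = primary.select_value('SELECT pg_current_xlog_location()')

  deadline = Time.now + timeout
  loop do
    caught_up = secondaries.all? do |db|
      diff = db.select_value(
        "SELECT pg_xlog_location_diff(pg_last_xlog_replay_location(), '#{target}')"
      )
      diff.to_i >= 0 # the replica has replayed at least up to the primary's position
    end
    break if caught_up
    raise 'secondaries did not catch up in time' if Time.now > deadline
    sleep 0.1
  end
end
```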
The benefit here is that after a write the data is in sync, and everything else can continue as expected. The (big) downsides are that slow replication can stall requests (until they time out), it requires EE-specific code mixed with CE code (causing conflicts), and it can still lead to inconsistencies if a replica dies and comes back later without having the data yet.
In short, I'd rather not do this.
Changing Key Logic
Here the write procedure is the same. However, when we check the data in Redis we take the following steps:
```
if redis-key-tells-us-to-do-work
  if there-are-any-builds-with-the-status-we-want
    do-the-actual-work
  end
end
```
Here the builds check will hit the database, slightly increasing the number of queries. However, in this setup we don't need EE-specific code and we don't need to rely on synchronous commits.
Is this something that would work in the current setup?
Also worth mentioning: we can't really make an exception for CI and have it always use the primary. Since CI takes up a huge chunk of our traffic it would be a near complete waste of having DB load balancing in the first place. It also makes our infrastructure less robust, as the primary then becomes a single point of failure for both reads and writes.
Long story short, we need to stop assuming all writes are synchronous and data is immediately available in the case of CI. This wouldn't be a problem if we could somehow figure out how to stick CI requests to the primary momentarily (as we do with regular users/API calls).
Synchronous commit may also not work, because it is custom-made logic that is not at the database level, and it is possible that some API call will access the data anyway.
This wouldn't be a problem if we could somehow figure out how to stick CI requests to the primary momentarily (as we do with regular users/API calls).
Can you point me to that? Maybe there's some way for us to do this. Most of the time we would start with a replica, and if that failed for some reason we could fall back to the primary as a last resort.
@ayufan As far as I understand DB load balancing, only reads are balanced and writes are done on the primary. How about storing the current state of a job in something that is replicated faster, e.g. Redis? Then API requests from the Runner could get up-to-date data from the cache, and each save would write the current state to the DB and update Redis.
@yorickpeterse I see how it works now. It creates a window of 30s where we force the use of the primary, right? And we kind of reduce the chance of things being broken. I think that we need something like that for the runner.
Writing data to the DB - so while creating, updating, finishing, dropping... the job - we would additionally write the updated state of the object to Redis. On read we would read the object from Redis instead of the database. This would give us a reliable solution for storage (the current state of the job is always stored in the DB) and, for API calls, a central and always up-to-date version of the same object.
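A minimal sketch of that write-through idea (JobStateCache and the key layout are made up for illustration, not an existing GitLab class):

```ruby
require 'json'
require 'redis'

# Hypothetical write-through cache for job state: every save also writes the
# fresh state to Redis, and API reads prefer the cached copy.
class JobStateCache
  TTL = 3600 # seconds

  def initialize(redis = Redis.new)
    @redis = redis
  end

  def write(job_id, attributes)
    @redis.setex("ci:job_state:#{job_id}", TTL, attributes.to_json)
  end

  def read(job_id)
    raw = @redis.get("ci:job_state:#{job_id}")
    raw && JSON.parse(raw)
  end
end

# On every state change: save to the DB first, then refresh the cache, e.g.
#   cache.write(job.id, { 'status' => job.status, 'updated_at' => job.updated_at })
# On API reads: try the cache, fall back to the (possibly lagging) replica, e.g.
#   state = cache.read(job.id) || load_job_from_database(job.id)
```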
@yorickpeterse I think sticking to the primary will not work 100% for builds/register.json: if you have any timeout on primary stickiness then there will always be some replica that fell behind, which will create an inconsistent view for some workers.
A better solution to distribute the load and maintain consistency would be to just shard whatever data builds/register.json uses.
If 100% consistency is not required, however, then I think the same "shard key" could be used in our current setup to route all queries for a specific shard key to the primary in the 30s window, or until all replicas catch up.
By shard key I mean things like project or group, or anything by which we could separate workers into logical groups that do not need to know about each other.
@tmaczukin Redis also has replication lag, so if you move all queries to it you'll create the same bottleneck and eventually the same replication-induced problems, just in Redis instead of the DB.
@ayufan From a distributed-systems point of view I'd suggest just assuming there is always lag after a write, and using something else that is "lagless" to coordinate readers so that they access either the primary or a replica - probably using the same mechanism the DB load balancer uses to check for lag. I guess just waiting 30s or longer will help in 99% of cases.
After a write has been performed GitLab will stick to using the primary for a certain period of time, scoped to the user that performed the write. GitLab will revert back to using secondaries when they have either caught up, or after 30 seconds.
However, in the CI case you can't scope by user. Maybe project or group could work for workers?
@pchojnacki This is basically how DB load balancing currently seems to work; it makes it possible to end up with an inconsistent view across different API calls. Everything depends on the replication lag. Since most of the data is slow to change this is mostly OK, but CI data changes very often, which leads to potential problems.
What is interesting to see is how to actually make it more robust.
So basically doing primary sticking for CI also makes sense, but we have to do that for all CI API requests and base it on the CI runner token and the build token.
Like this:
1. When the runner queue gets changed (a build is enqueued) we stick to the primary for 30s,
2. When a build becomes running we stick to the primary for 30s,
3. Authorization calls for git clone and Docker auth are based on the build token: we stick to the primary for 30s as in 2.
I see how it works now. It creates a window of 30s where we force the use of the primary, right? And we kind of reduce the chance of things being broken. I think that we need something like that for the runner.
Sticking basically works this way:
1. You perform a write.
2. The load balancer marks the current session so all following queries go to the primary.
3. At the end of the request we note down the current WAL pointer from the primary and store this, along with the user's ID (if any), in Redis (the ID is in the key). This key expires automatically after 30 seconds.
4. On the next request we check this value and, if present, compare it with the secondaries. If all secondaries are in sync (= they have passed the WAL pointer) we remove the Redis key and use a secondary again; otherwise we keep using the primary (see the sketch below).
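A condensed sketch of that mechanism (helper and key names are illustrative, not the exact load balancer code):

```ruby
# Illustrative sketch of WAL-pointer based sticking, scoped by user ID.
STICK_TTL = 30 # seconds

# End of a request that performed a write: remember how far the primary got.
def record_stick(redis, user_id, primary)
  wal = primary.select_value('SELECT pg_current_xlog_location()')
  redis.setex("load_balancing:stick:user:#{user_id}", STICK_TTL, wal)
end

# Start of the next request: decide whether we still need the primary.
def use_primary?(redis, user_id, secondaries)
  wal = redis.get("load_balancing:stick:user:#{user_id}")
  return false unless wal # nothing recorded (or expired): secondaries are fine

  caught_up = secondaries.all? do |db|
    db.select_value(
      "SELECT pg_xlog_location_diff(pg_last_xlog_replay_location(), '#{wal}')"
    ).to_i >= 0
  end

  redis.del("load_balancing:stick:user:#{user_id}") if caught_up
  !caught_up # keep using the primary until everyone has passed the pointer
end
```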
If we can somehow use an identifier (similar to the user ID) for multiple runners (e.g. a build batch ID or something like that) we might be able to stick queries for builds/register to the primary. We can (and probably should) also add sticking based on the runner ID, so that the same runner executing multiple API calls can stick properly.
@pchojnacki I'm aware of the Redis replication lag, but it's still faster than DB replication. Also, caching seems to be one of the most popular use cases for Redis. I have used Redis/memcache as a DB cache for clustered applications with success, that's why I'm writing about such an option :)
If we can somehow use an identifier (similar to the user ID) for multiple runners (e.g. a build batch ID or something like that)
Each job has its own ID and its own token, which we know on the Runner's side. The token is also sent as one of the parameters (or as a header) while updating the job, uploading artifacts, or patching the trace. Could this be the identifier you're looking for?
Sticking a runner to the primary based on a runner ID is something that can be implemented relatively easily, given the Grape API endpoint exposes a method such as current_runner (similar to current_user). This will solve the case of a runner doing a write in API call 1, then doing a read in API call 2. This however won't solve the builds/register problem since IIRC multiple runners can try to request the same data.
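A rough sketch of what exposing such a helper in Grape could look like (the class, route and token lookup here are assumptions for illustration, not the actual API code):

```ruby
require 'grape'

class Runners < Grape::API
  helpers do
    # Hypothetical counterpart to current_user: look the runner up by its
    # token so load balancer sticking can be keyed by the runner's ID.
    def current_runner
      @current_runner ||= Ci::Runner.find_by(token: params[:token])
    end
  end

  post 'builds/register' do
    # e.g. stick/unstick based on current_runner.id, the same way regular
    # requests stick based on current_user.id.
  end
end
```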
@yorickpeterse It will solve it, as builds/register also uses the runner token, and we are already making builds/register more and more of a runner-aware endpoint due to caching. So applying the same mechanics would work.
@ayufan But is it not possible for multiple runners (with different tokens) to try and obtain the same data, potentially missing it when they read from the secondary? Or is that not a problem?
If we update tick_runner_queue to store in Redis the information to stick to the primary, together with the WAL pointer, we will use the primary until the replicas catch up. Which is what we want to achieve.
The second place is when we transition from pending to running, as we need to make Ci::Build access from the API stick to the primary too. If we store the WAL pointer there as well it will also be a valid approach, and we should not hit the API problems anymore.
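A sketch of where those two hooks could live (the Sticking module is hypothetical and mirrors the user-based sticking sketched earlier; tick_runner_queue and the pending-to-running transition are the existing touch points mentioned above):

```ruby
# Hypothetical helper: record the primary's WAL pointer for a given scope,
# so the load balancer can keep that scope on the primary until replicas
# have caught up.
module Sticking
  def self.stick(redis, primary, scope, id)
    wal = primary.select_value('SELECT pg_current_xlog_location()')
    redis.setex("load_balancing:stick:#{scope}:#{id}", 30, wal)
  end
end

# 1. Inside tick_runner_queue (wherever the Redis notification key is bumped),
#    additionally stick the runner to the primary:
#      Sticking.stick(redis, primary, :runner, runner.id)
#
# 2. When a build transitions from pending to running, stick by build token so
#    the follow-up API calls (trace updates, artifact uploads, git clone auth)
#    also read fresh data:
#      Sticking.stick(redis, primary, :build, build.token)
```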
```sql
SELECT count(*)
FROM "ci_pipelines"
WHERE "ci_pipelines"."status" IN ('pending', 'running')
  AND ((SELECT count(*) FROM "ci_builds"
        WHERE "ci_builds"."type" IN ('Ci::Build')
          AND (commit_id = ci_pipelines.id)
          AND "ci_builds"."status" IN ('success', 'failed', 'canceled')) > 0)
  AND ((SELECT count(*) FROM "ci_builds"
        WHERE "ci_builds"."type" IN ('Ci::Build')
          AND (commit_id = ci_pipelines.id)
          AND "ci_builds"."status" = 'created') > 0)
  AND ((SELECT count(*) FROM "ci_builds"
        WHERE "ci_builds"."type" IN ('Ci::Build')
          AND (commit_id = ci_pipelines.id)
          AND "ci_builds"."status" IN ('running', 'pending')) = 0)
```