I have a valid explanation for at least the first one, but looking at the system behavior, at least 3. also happens as described:
What happens is this: we have DB load balancing, where one database is read-write and the second one is read-only: basically a master and a slave.
When a runner asks for new builds it asks a specific endpoint that can use either of those databases, and this is where the magic happens:
1. We push a new build to the read-write database,
2. We change the key in Redis and notify all runners to retry picking builds,
3. There's replication delay,
4. The runner connects to builds/register and checks the list of builds,
5. No build is found, so we return the changed key,
6. The endpoint checks the read-only DB, sees that there are no new builds, and we return 204 together with the changed notification key,
7. The runner stores the notification key and asks again, hitting the happy path because the notification key is the same,
8. Replication catches up and the build is pending, but the runner's stored notification key tells it that everything is up to date,
9. It fails to pick the new build (the sketch below shows this ordering).
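To make the race concrete, here is a self-contained Ruby sketch (plain hashes stand in for the primary, the replica and Redis; none of this is the actual GitLab code):

```ruby
require 'securerandom'

# In-memory stand-ins: primary (read-write), replica (read-only, replays with
# a delay) and Redis (effectively "lag-free").
primary = { builds: [] }
replica = { builds: [] }
redis   = {}

# Steps 1-2: push a new build to the primary and bump the notification key.
primary[:builds] << { id: 1, status: 'pending' }
redis['last_update'] = SecureRandom.hex(4)

# Steps 4-6: builds/register reads from the replica, which is still behind.
runner_last_update = redis['last_update']
pending = replica[:builds].select { |b| b[:status] == 'pending' }
puts "204 No Content, last_update=#{runner_last_update}" if pending.empty?

# Steps 7-9: replication catches up, but the runner already holds the newest
# key, so its next request takes the "nothing changed" path and the pending
# build is never picked up.
replica[:builds] = primary[:builds].dup
puts 'runner skips asking for builds' if redis['last_update'] == runner_last_update
```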
What we need:
After enqueuing a build (changing its status from pending to running) there should be a barrier before creating the notification or enqueueing Sidekiq jobs, to ensure that all slaves are up-to-date.
Otherwise we get a split-brain problem: we expect that information stored in Redis is an indication of something that is committed to the DB. With the current configuration this is not true, as the commit on the replicas may happen after the Redis value has been saved.
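A rough sketch of that ordering (the helper names here are placeholders, not existing GitLab code):

```ruby
# Desired ordering: nothing observable (the Redis notification key, Sidekiq
# jobs) is published before every replica has replayed the commit.
build.update!(status: 'running')       # 1. commit the state change on the primary

wait_for_all_replicas_to_catch_up      # 2. hypothetical barrier (see the loop sketch below)

bump_runner_notification_key(build)    # 3. only now change the Redis key...
enqueue_follow_up_sidekiq_jobs(build)  #    ...and enqueue dependent Sidekiq jobs
```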
In the short term, we are applying a patch to the load balancer code to force all runner API calls (e.g. api/v4/builds/register.json) to talk to the primary database. This will eliminate problems with inconsistent state due to replication lag.
We should also consider adding a feature flag in Workhorse to disable it from using the Redis LAST_UPDATE value.
@ayufan Having given this some thought, I see two potential solutions:
1. Synchronous commits using application logic (giving us more control)
2. Changing the key checking logic
Synchronous Commit
This would basically change the procedure from a regular write to a write followed by a loop. This loop will break once all secondaries are in sync with the master. This gives you the following pseudo code:
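(Roughly along these lines; the helpers and timeout are illustrative, not existing GitLab code, and the WAL function names are the PostgreSQL 9.x ones.)

```ruby
# Illustrative only: perform the write, then block until every secondary has
# replayed the primary's WAL position (or give up after a timeout).
def write_then_wait_for_secondaries(primary, secondaries, timeout: 30)
  primary.transaction { yield }

  # Current WAL position on the primary
  # (pg_current_xlog_location() up to PostgreSQL 9.6,
  #  pg_current_wal_lsn() from PostgreSQL 10 onwards).
  target = primary.select_value('SELECT pg_current_xlog_location()')

  deadline = Time.now + timeout
  loop do
    caught_up = secondaries.all? do |db|
      diff = db.select_value(
        "SELECT pg_xlog_location_diff(pg_last_xlog_replay_location(), '#{target}')"
      )
      diff.to_i >= 0 # the replica has replayed at least up to the primary's position
    end
    break if caught_up
    raise 'secondaries did not catch up in time' if Time.now > deadline
    sleep 0.1
  end
end
```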
The benefit here is that after a write the data is in sync, and everything else can continue as expected. The (big) downsides are that slow replication can stall requests (until they time out), it requires EE-specific code mixed with CE code (causing conflicts), and it can still lead to inconsistencies if a replica dies and comes back later without having the data yet.
In short, I'd rather not do this.
Changing Key Logic
Here the write procedure is the same. However, when we check the data in Redis we take the following steps:
```
if redis-key-tells-us-to-do-work
  if there-are-any-builds-with-the-status-we-want
    do-the-actual-work
  end
end
```
Here the builds check will hit the database, slightly increasing the number of queries. However, in this setup we don't need EE-specific code and we don't need to rely on synchronous commits.
Is this something that would work in the current setup?
Also worth mentioning: we can't really make an exception for CI and have it always use the primary. Since CI takes up a huge chunk of our traffic it would be a near complete waste of having DB load balancing in the first place. It also makes our infrastructure less robust, as the primary then becomes a single point of failure for both reads and writes.
Long story short, we need to stop assuming all writes are synchronous and data is immediately available in the case of CI. This wouldn't be a problem if we could somehow figure out how to stick CI requests to the primary momentarily (as we do with regular users/API calls).
Synchronous commit may also not work, because it is custom-made logic that is not at the database level, and it is possible that some API call will access the data anyway.
This wouldn't be a problem if we could somehow figure out how to stick CI requests to the primary momentarily (as we do with regular users/API calls).
Can you point me to that? Maybe there's some way for us to do this. Most of the time we would start with a replica, and if that failed for some reason we could fall back to the primary as a last resort.
@ayufan As far as I understand DB load balancing, only reads are balanced and writes are done on the primary. How about storing the current state of a job in something that is replicated faster, e.g. Redis? Then API requests from the Runner could get up-to-date data from the cache, and each save would write the current state to the DB and update Redis.
@yorickpeterse I see how it works now. It creates a window of 30s where we force the use of the primary, right? And we kind of reduce the chance of things being broken. I think that we need something like that for the runner.
Writing data to the DB - so while creating, updating, finishing, dropping... the job - we would additionally write the updated state of the object to Redis. On read we would read the object from Redis instead of the database. This would give us a reliable solution for storage (the current state of the job is always stored in the DB) and, for API calls, a central and always up-to-date version of the same object.
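A minimal sketch of that write-through idea (JobStateCache and the key layout are made up for illustration, not an existing GitLab class):

```ruby
require 'json'
require 'redis'

# Hypothetical write-through cache for job state: every save also writes the
# fresh state to Redis, and API reads prefer the cached copy.
class JobStateCache
  TTL = 3600 # seconds

  def initialize(redis = Redis.new)
    @redis = redis
  end

  def write(job_id, attributes)
    @redis.setex("ci:job_state:#{job_id}", TTL, attributes.to_json)
  end

  def read(job_id)
    raw = @redis.get("ci:job_state:#{job_id}")
    raw && JSON.parse(raw)
  end
end

# On every state change: save to the DB first, then refresh the cache, e.g.
#   cache.write(job.id, { 'status' => job.status, 'updated_at' => job.updated_at })
# On API reads: try the cache, fall back to the (possibly lagging) replica, e.g.
#   state = cache.read(job.id) || load_job_from_database(job.id)
```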
@yorickpeterse I think sticking to the primary will not work 100% for builds/register.json: if you have any timeout on primary stickiness then there will always be some replica that fell behind, which will create an inconsistent view for some workers.
A better solution to distribute the load and maintain consistency would be to just shard whatever data builds/register.json uses.
If 100% consistency is not required, however, then I think the same "shard key" could be used in our current setup to route all queries for a specific shard key to the primary in the 30s window, or until all replicas catch up.
By shard key I mean things like project or group, or anything by which we could separate workers into logical groups that do not need to know about each other.
@tmaczukin Redis also has replication lag, so if you move all queries to it you'll create the same bottleneck and eventually the same replication-induced problems, just in Redis instead of the DB.
@ayufan From a distributed-systems point of view I'd suggest just assuming there is always lag after a write, and using something else that is "lagless" to coordinate readers so that they access either the primary or a replica - probably using the same mechanism the DB load balancer uses to check for lag. I guess just waiting 30s or longer will help in 99% of cases.
After a write has been performed GitLab will stick to using the primary for a certain period of time, scoped to the user that performed the write. GitLab will revert back to using secondaries when they have either caught up, or after 30 seconds.
However, in the CI case you can't scope by user. Maybe project or group could work for workers?
@pchojnacki This is basically how DB load balancing currently seems to work; it makes it possible to end up with an inconsistent view across different API calls. Everything depends on the replication lag. Since most of the data is slow to change this is mostly OK, but CI data changes very often, which leads to potential problems.
What is interesting to see is how to actually make it more robust.
So basically doing primary sticking for CI also makes sense, but we have to do that for all CI API requests and base it on the CI runner token and the build token.
Like this:
1. When the runner queue gets changed (a build is enqueued) we stick to the primary for 30s,
2. When a build becomes running we stick to the primary for 30s,
3. Authorization calls for git clone and Docker auth are based on the build token: we stick to the primary for 30s as in 2.
I see how it works now. It creates a window of 30s where we force the use of the primary, right? And we kind of reduce the chance of things being broken. I think that we need something like that for the runner.
Sticking basically works this way:
1. You perform a write.
2. The load balancer marks the current session so all following queries go to the primary.
3. At the end of the request we note down the current WAL pointer from the primary and store this, along with the user's ID (if any), in Redis (the ID is in the key). This key expires automatically after 30 seconds.
4. On the next request we check this value and, if present, compare it with the secondaries. If all secondaries are in sync (= they have passed the WAL pointer) we remove the Redis key and use a secondary again; otherwise we keep using the primary (see the sketch below).
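A condensed sketch of that mechanism (helper and key names are illustrative, not the exact load balancer code):

```ruby
# Illustrative sketch of WAL-pointer based sticking, scoped by user ID.
STICK_TTL = 30 # seconds

# End of a request that performed a write: remember how far the primary got.
def record_stick(redis, user_id, primary)
  wal = primary.select_value('SELECT pg_current_xlog_location()')
  redis.setex("load_balancing:stick:user:#{user_id}", STICK_TTL, wal)
end

# Start of the next request: decide whether we still need the primary.
def use_primary?(redis, user_id, secondaries)
  wal = redis.get("load_balancing:stick:user:#{user_id}")
  return false unless wal # nothing recorded (or expired): secondaries are fine

  caught_up = secondaries.all? do |db|
    db.select_value(
      "SELECT pg_xlog_location_diff(pg_last_xlog_replay_location(), '#{wal}')"
    ).to_i >= 0
  end

  redis.del("load_balancing:stick:user:#{user_id}") if caught_up
  !caught_up # keep using the primary until everyone has passed the pointer
end
```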
If we can somehow use an identifier (similar to the user ID) for multiple runners (e.g. a build batch ID or something like that) we might be able to stick queries for builds/register to the primary. We can (and probably should) also add sticking based on the runner ID, so that the same runner executing multiple API calls can stick properly.
@pchojnacki I'm aware of the Redis replication lag, but it's still faster than DB replication. Also, caching seems to be one of the most popular use cases for Redis. I have used Redis/memcache as a DB cache for clustered applications with success, that's why I'm writing about such an option :)
If we can somehow use an identifier (similar to the user ID) for multiple runners (e.g. a build batch ID or something like that)
Each job has its own ID and its own token, which we know on the Runner's side. The token is also sent as one of the parameters (or as a header) while updating the job, uploading artifacts, or patching the trace. Could this be the identifier you're looking for?
Sticking a runner to the primary based on a runner ID is something that can be implemented relatively easily, given the Grape API endpoint exposes a method such as current_runner (similar to current_user). This will solve the case of a runner doing a write in API call 1, then doing a read in API call 2. This however won't solve the builds/register problem since IIRC multiple runners can try to request the same data.
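A rough sketch of what exposing such a helper in Grape could look like (the class, route and token lookup here are assumptions for illustration, not the actual API code):

```ruby
require 'grape'

class Runners < Grape::API
  helpers do
    # Hypothetical counterpart to current_user: look the runner up by its
    # token so load balancer sticking can be keyed by the runner's ID.
    def current_runner
      @current_runner ||= Ci::Runner.find_by(token: params[:token])
    end
  end

  post 'builds/register' do
    # e.g. stick/unstick based on current_runner.id, the same way regular
    # requests stick based on current_user.id.
  end
end
```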
@yorickpeterse It will solve it, as builds/register also uses the runner token, and we are already making builds/register more and more of a runner-aware endpoint due to caching. So applying the same mechanics would work.
@ayufan But is it not possible for multiple runners (with different tokens) to try and obtain the same data, potentially missing it when they read from the secondary? Or is that not a problem?
If we update tick_runner_queue to store in Redis the information to stick to the primary, together with the WAL pointer, we will use the primary until the replicas catch up. Which is what we want to achieve.
The second place is when we transition from pending to running, as we need to make Ci::Build access from the API stick to the primary too. If we store the WAL pointer there as well it will also be a valid approach, and we should not hit the API problems anymore.
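A sketch of where those two hooks could live (the Sticking module is hypothetical and mirrors the user-based sticking sketched earlier; tick_runner_queue and the pending-to-running transition are the existing touch points mentioned above):

```ruby
# Hypothetical helper: record the primary's WAL pointer for a given scope,
# so the load balancer can keep that scope on the primary until replicas
# have caught up.
module Sticking
  def self.stick(redis, primary, scope, id)
    wal = primary.select_value('SELECT pg_current_xlog_location()')
    redis.setex("load_balancing:stick:#{scope}:#{id}", 30, wal)
  end
end

# 1. Inside tick_runner_queue (wherever the Redis notification key is bumped),
#    additionally stick the runner to the primary:
#      Sticking.stick(redis, primary, :runner, runner.id)
#
# 2. When a build transitions from pending to running, stick by build token so
#    the follow-up API calls (trace updates, artifact uploads, git clone auth)
#    also read fresh data:
#      Sticking.stick(redis, primary, :build, build.token)
```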
```sql
SELECT count(*)
FROM "ci_pipelines"
WHERE "ci_pipelines"."status" IN ('pending', 'running')
  AND ((SELECT count(*) FROM "ci_builds"
        WHERE "ci_builds"."type" IN ('Ci::Build')
          AND (commit_id = ci_pipelines.id)
          AND "ci_builds"."status" IN ('success', 'failed', 'canceled')) > 0)
  AND ((SELECT count(*) FROM "ci_builds"
        WHERE "ci_builds"."type" IN ('Ci::Build')
          AND (commit_id = ci_pipelines.id)
          AND "ci_builds"."status" = 'created') > 0)
  AND ((SELECT count(*) FROM "ci_builds"
        WHERE "ci_builds"."type" IN ('Ci::Build')
          AND (commit_id = ci_pipelines.id)
          AND "ci_builds"."status" IN ('running', 'pending')) = 0)
```