WIP: Greatly optimise RegisterBuildService by intelligently accessing tags

Kamil Trzciński requested to merge optimise-register-build-service into master

What does this MR do?

Currently, when RegisterBuildService is executed, we execute N*2 queries due to tag matching. It's terribly slow, as we can see here: https://performance.gitlab.net/dashboard/db/grape-endpoints?var-action=Grape%23POST%20%2Fbuilds%2Fregister&var-database=Production.

What is slow there is that we are evaluating tags, on every request. This is crazy.

This MR adds an acceleration table: ci_runner_builds is a list of builds that could potentially be picked up by a given runner, where these builds are already filtered and eligible for processing.
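
For illustration, a minimal sketch of what such a table could look like; the column names and indexes here are assumptions, not the final schema:

```ruby
# Hypothetical migration sketching the shape of the acceleration table.
# Column names and indexes are assumptions, not the final schema.
class CreateCiRunnerBuilds < ActiveRecord::Migration
  def change
    create_table :ci_runner_builds do |t|
      t.integer :runner_id, null: false
      t.integer :build_id, null: false
      t.timestamps null: false
    end

    # One row per (runner, build) pair; lookups are by runner on register,
    # and by build when cleaning up after a state transition.
    add_index :ci_runner_builds, [:runner_id, :build_id], unique: true
    add_index :ci_runner_builds, :build_id
  end
end
```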

Because this is a DB table, the amount of changes is minimal, and it basically scales very well, assuming we have only a few runners, not hundreds.

The runner_builds entries are created asynchronously when a build transitions to pending, and they are removed automatically when it transitions out of pending.
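
Roughly, the lifecycle could be wired into the build state machine as below. This is a hedged sketch: the worker name and the Ci::RunnerBuild model are placeholders for whatever this MR actually introduces.

```ruby
# A hedged sketch of the intended lifecycle, assuming the state_machine
# DSL already used by Ci::Build. Worker and model names are placeholders.
class Ci::Build < ActiveRecord::Base
  state_machine :status do
    after_transition any => :pending do |build|
      # Fan out asynchronously so the transition itself stays cheap.
      RunnerBuildsWorker.perform_async(build.id)
    end

    after_transition :pending => any do |build|
      # Drop the acceleration entries once the build leaves the queue.
      Ci::RunnerBuild.where(build_id: build.id).delete_all
    end
  end
end
```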

We effectively execute a single query to get the data out of the table, picking just the most recent entry.
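
Something along these lines (illustrative only; the model name and the ordering used are assumptions):

```ruby
# Illustrative: one indexed lookup on the small acceleration table,
# then loading the build itself. The ordering is an assumption.
entry = Ci::RunnerBuild
        .where(runner_id: current_runner.id)
        .order(id: :desc)
        .first

build = entry && Ci::Build.find_by(id: entry.build_id)
```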

It makes builds/register.json constant-cost, instead of variable-cost as it is now. It puts a little more work on Sidekiq, which is now responsible for "filtering" builds to runners.

There's one change in behavior: if runner tags get changed, this will not affect currently enqueued builds. The current behavior is that if runner tags are changed, the runner picks up matching builds immediately. (Something to reconsider: should this be changed?)

Thoughts

  1. This uses the DB.
  2. Every enqueued build creates N entries, where N is the number of runners that can run this build (see the worker sketch after this list).
  3. Once the build transitions out of pending, we remove these N entries.
  4. This puts pressure on the table and creates a lot of dead tuples, but it also keeps the table very small. For 2000 builds in the queue we would expect roughly 4000-6000 entries; for 0 builds, 0 entries. It can be considered more like a temporary table.
  5. We use a table (rather than Redis) because we have fair scheduling (balancing jobs across runners); implementing that with Redis would be quite difficult, though possible. Switching to FIFO (in the case of Redis) is a no-go for now, because if someone pushes 1000 jobs it would basically block processing. We could probably introduce multiple buckets per shared runner to prevent this from happening.
  6. This brings me to the question of what the impact on the DB would be, and whether this is the correct approach right now.
  7. It solves the problem of slow builds/register, making this request basically a single query on a small table that would be fully in memory. Currently it's N+1 queries, as shown on the endpoint page linked above. This approach also opens the way to a more efficient solution later: a proper queueing system based on some existing solution. Yet to decide.
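
As referenced in point 2 above, here is a minimal sketch of the fan-out worker. `Ci::Runner.can_pick(build)` is a hypothetical scope standing in for the existing tag/shared-runner matching logic:

```ruby
# Hypothetical Sidekiq worker creating one ci_runner_builds entry per
# runner that can run the build. Ci::Runner.can_pick(build) stands in
# for the existing tag/shared-runner matching logic.
class RunnerBuildsWorker
  include Sidekiq::Worker

  def perform(build_id)
    build = Ci::Build.find_by(id: build_id)
    # Skip if the build was picked or cancelled before we ran.
    return unless build && build.pending?

    Ci::Runner.can_pick(build).find_each do |runner|
      Ci::RunnerBuild.create(runner_id: runner.id, build_id: build.id)
    end
  end
end
```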

Does this MR meet the acceptance criteria?

What are the relevant issue numbers?
