Skip to content

Fair usage of Shared Runners

Kamil Trzcińśki requested to merge fair-usage-of-shared-runners into master

What does this MR do?

Introduces a fair usage scheduler for shared runners.

It tries to assign builds to shared runner from projects that have the lowest number of builds currently running on shared runners.

Example 1:

We have following builds in queue:
build 1 for project 1
build 2 for project 1
build 3 for project 1
build 4 for project 2
build 5 for project 2
build 6 for project 3

With the new algorithm we will assign builds in following order:
- We choose build 1, because project 1 doesn't run currently any builds and has the lowest build number from projects that doesn't run builds,
- We choose build 4, because project 2 doesn't run currently any builds and has the lowest build number from projects that doesn't run builds,
- We choose build 6, because project 3 doesn't run currently any builds and has the lowest build number from projects that doesn't run builds,
- We choose build 2, because project 1 as other it runs 1 build,
- We choose build 5, because project 2 runs 1 build, where project 1 runs 2 builds now,
- We choose build 3, because project 1 and runs 2 builds.

Example 2:

We have following builds in queue:
build 1 for project 1
build 2 for project 1
build 3 for project 1
build 4 for project 2
build 5 for project 2
build 6 for project 3

With the new algorithm we will assign builds in following order:
- We choose build 1, because project 1 doesn't run currently any builds and has the lowest build number from projects that doesn't run builds,
- We finish build 1,
- We choose build 2, because project 1 doesn't run currently any builds and has the lowest build number from projects that doesn't run builds,
- We choose build 4, because project 2 doesn't run currently any builds and has the lowest build number from projects that doesn't run builds,
- We finish build 4,
- We choose build 5, because project 2 doesn't run currently any builds and has the lowest build number from projects that doesn't run builds,
- We choose build 6, because project 3 doesn't run currently any builds,
- We choose build 3, because project 1, 2 and 3 runs exactly one build now,

Why was this MR needed?

Currently, we are scheduling builds using FIFO. This is catastrophic if there are projects that create a 100-300 jobs, this basically eats most of available shared runners.

Performance

All this logic is implemented with the help of SQL queries, because this is the fastest way to process 1k-2k pending builds in queue. It's not the fastest SQL query, because it sorts based on number of running_builds, and this forces to calculate a number of running builds for all dependent projects. However, since we have one/two shared runners that asks every few seconds for builds this should have minimal impact on DB performance.

 explain analyze SELECT "ci_builds".* FROM "ci_builds" JOIN (SELECT "ci_builds"."gl_project_id", count(case when status = 'running' AND runner_id = (SELECT "ci_runners"."id" FROM "ci_runners" WHERE "ci_runners"."is_shared" = 't') then 1 end) as running_builds FROM "ci_builds" INNER JOIN "projects" ON "projects"."id" = "ci_builds"."gl_project_id" AND "projects"."pending_delete" = 'f' WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."status" IN ('running', 'pending') AND "projects"."builds_enabled" = 't' AND "projects"."shared_runners_enabled" = 't' GROUP BY "ci_builds"."gl_project_id") AS projects ON ci_builds.gl_project_id=projects.gl_project_id WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."status" = 'pending' AND "ci_builds"."runner_id" IS NULL  ORDER BY projects.running_builds ASC, ci_builds.id ASC;
                                                                                  QUERY PLAN                                                                        
           
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------
 Sort  (cost=64777.28..64777.29 rows=1 width=1010) (actual time=301.794..302.535 rows=1537 loops=1)
   Sort Key: (count(CASE WHEN (((public.ci_builds.status)::text = 'running'::text) AND (public.ci_builds.runner_id = $0)) THEN 1 ELSE NULL::integer END)), public.ci
_builds.id
   Sort Method: quicksort  Memory: 1423kB
   ->  Nested Loop  (cost=63279.78..64777.27 rows=1 width=1010) (actual time=66.384..298.724 rows=1537 loops=1)
         ->  HashAggregate  (cost=63177.15..63177.30 rows=15 width=15) (actual time=65.641..65.851 rows=187 loops=1)
               InitPlan 1 (returns $0)
                 ->  Seq Scan on ci_runners  (cost=0.00..26963.66 rows=1 width=4) (actual time=1.145..34.381 rows=1 loops=1)
                       Filter: is_shared
                       Rows Removed by Filter: 6965
               ->  Nested Loop  (cost=0.00..36186.34 rows=2715 width=15) (actual time=0.065..29.717 rows=1710 loops=1)
                     ->  Index Scan using index_ci_builds_on_status on ci_builds  (cost=0.00..8913.95 rows=3577 width=15) (actual time=0.051..12.012 rows=2583 loops
=1)
                           Index Cond: ((status)::text = ANY ('{running,pending}'::text[]))
                           Filter: ((type)::text = 'Ci::Build'::text)
                           Rows Removed by Filter: 1219
                     ->  Index Scan using projects_pkey on projects  (cost=0.00..7.61 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=2583)
                           Index Cond: (id = public.ci_builds.gl_project_id)
                           Filter: ((NOT pending_delete) AND builds_enabled AND shared_runners_enabled)
                           Rows Removed by Filter: 0
         ->  Bitmap Heap Scan on ci_builds  (cost=102.63..106.64 rows=1 width=1002) (actual time=1.216..1.231 rows=8 loops=187)
               Recheck Cond: ((gl_project_id = public.ci_builds.gl_project_id) AND ((status)::text = 'pending'::text))
               Filter: ((runner_id IS NULL) AND ((type)::text = 'Ci::Build'::text))
               ->  BitmapAnd  (cost=102.63..102.63 rows=1 width=0) (actual time=1.201..1.201 rows=0 loops=187)
                     ->  Bitmap Index Scan on index_ci_builds_on_gl_project_id  (cost=0.00..10.52 rows=241 width=0) (actual time=0.406..0.406 rows=1944 loops=187)
                           Index Cond: (gl_project_id = public.ci_builds.gl_project_id)
                     ->  Bitmap Index Scan on index_ci_builds_on_status  (cost=0.00..91.78 rows=3089 width=0) (actual time=0.652..0.652 rows=3362 loops=187)
                           Index Cond: ((status)::text = 'pending'::text)
 Total runtime: 303.832 ms

Specific runners

It doesn't affect the specific runners which still serve builds FIFO.

@stanhu @markpundsack @yorickpeterse What do you think?

Merge request reports