Sequential scans on "routes" table increased from 0 to 1 billion scans per minute

added ~874211 Next Patch Release ~480950 labels

changed milestone to %9.1

The commits that I could find referencing routes in their messages are:

cd4db7b4
7a774d1a
e6cc7a0a
71932711
dd996223 (why is this here twice?)

Hm, the above commits don't seem to contain anything that stands out as a possible cause.

@yorickpeterse I think it's just https://gitlab.com/gitlab-org/gitlab-ce/commit/2989192d1aa8051aa09164cd097418bd3063d4ad (the commit that added this), as that was only deployed this morning.

MR - https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8979

@yorickpeterse @smcgivern we now get the project/group full name and path from the routes table, so we obviously use this table a lot.

Discussion here https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8845 and merged mr https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8979.

cc @DouweM

@dzaporozhets There's a difference between "we use this a lot" and "we perform sequential scans a lot". A sequential scan means PostgreSQL will iterate over every single row in the table to find its data, skipping any indexes. This is terrible for performance. We need to make sure that whatever queries were added use the right indexes.

@yorickpeterse OK, thanks for the explanation. Do you have any idea which index is missing or what we are doing wrong there?

Maybe this scope does not use an index? https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8979/diffs#f6e4c93f19717cbef4f7d9401b7bd7bd07dee5da_13_12

mentioned in issue #29534 (closed)

https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/10004 adds this scope to the todos page, which didn't have it before, but it seems like that will still do a sequential scan (it will just do one instead of hundreds), so an index might be needed there.

@dzaporozhets

> Do you have any idea which index is missing or what we are doing wrong there?

Unfortunately not. Looking at the routes table the path column is indexed. Perhaps the code somewhere is filtering by name (which is not indexed)?

Ah, I think I found the problem: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8979/diffs#d77bcf2e32d9b5be36cec41f45385e3a96ae2202_232_235

For example:

gitlabhq_production=# explain analyze select * from routes where path like 'gitlab-org/%';
                                                 QUERY PLAN                                                 
------------------------------------------------------------------------------------------------------------
 Seq Scan on routes  (cost=0.00..68721.08 rows=315 width=73) (actual time=169.016..730.816 rows=81 loops=1)
   Filter: ((path)::text ~~ 'gitlab-org/%'::text)
   Rows Removed by Filter: 3255576
 Planning time: 0.810 ms
 Execution time: 730.890 ms
(5 rows)

For that LIKE query to use an index you need to use trigram indexes, which:

- Are expensive to maintain
- Are a total pain to maintain code-wise because MySQL doesn't support them
- Take up a lot of storage space

Alternatively, we can use PostgreSQL's text_pattern_ops operator class (https://www.postgresql.org/docs/current/static/indexes-opclass.html), but I don't know if there is a MySQL equivalent.

I'm adding a test index to see how varchar_pattern_ops (since path is a varchar) works.

Well that is disappointing:

gitlabhq_production=# create index concurrently index_routes_on_path_text_pattern_ops ON routes (path varchar_pattern_ops);
CREATE INDEX
gitlabhq_production=# explain analyze select * from routes where path ilike 'gitlab-org/%';
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Seq Scan on routes  (cost=0.00..70042.20 rows=326 width=73) (actual time=931.092..3785.028 rows=81 loops=1)
   Filter: ((path)::text ~~* 'gitlab-org/%'::text)
   Rows Removed by Filter: 3255618
 Planning time: 0.398 ms
 Execution time: 3785.085 ms
(5 rows)

gitlabhq_production=# analyze routes;
ANALYZE
gitlabhq_production=# explain analyze select * from routes where path ilike 'gitlab-org/%';
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Seq Scan on routes  (cost=0.00..70042.24 rows=326 width=73) (actual time=748.342..3752.162 rows=81 loops=1)
   Filter: ((path)::text ~~* 'gitlab-org/%'::text)
   Rows Removed by Filter: 3255619
 Planning time: 0.376 ms
 Execution time: 3752.233 ms
(5 rows)

Let's try text_pattern_ops instead.

gitlabhq_production=# create index concurrently index_routes_on_path_text_pattern_ops ON routes (path text_pattern_ops);
CREATE INDEX
gitlabhq_production=# explain analyze select * from routes where path ilike 'gitlab-org/%';
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Seq Scan on routes  (cost=0.00..70042.74 rows=326 width=73) (actual time=874.987..3609.586 rows=81 loops=1)
   Filter: ((path)::text ~~* 'gitlab-org/%'::text)
   Rows Removed by Filter: 3255674
 Planning time: 0.246 ms
 Execution time: 3609.628 ms
(5 rows)

gitlabhq_production=# analyze routes;
ANALYZE
gitlabhq_production=# explain analyze select * from routes where path ilike 'gitlab-org/%';
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Seq Scan on routes  (cost=0.00..70042.94 rows=326 width=73) (actual time=766.538..3599.232 rows=81 loops=1)
   Filter: ((path)::text ~~* 'gitlab-org/%'::text)
   Rows Removed by Filter: 3255674
 Planning time: 0.384 ms
 Execution time: 3599.324 ms
(5 rows)

That doesn't seem to work either :<

Oh derp, I was using ILIKE. LIKE seems to use an index just fine when using text_pattern_ops. I'll try varchar_pattern_ops again.

gitlabhq_production=# create index concurrently index_routes_on_path_text_pattern_ops ON routes (path varchar_pattern_ops);
CREATE INDEX
gitlabhq_production=# explain analyze select * from routes where path like 'gitlab-org/%';
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using index_routes_on_path_text_pattern_ops on routes  (cost=0.43..8.45 rows=326 width=73) (actual time=0.018..0.150 rows=81 loops=1)
   Index Cond: (((path)::text ~>=~ 'gitlab-org/'::text) AND ((path)::text ~<~ 'gitlab-org0'::text))
   Filter: ((path)::text ~~ 'gitlab-org/%'::text)
 Planning time: 0.348 ms
 Execution time: 0.193 ms
(5 rows)

That is much better.
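
The Index Cond in the plan above shows how PostgreSQL turns the anchored pattern into a range scan: everything from 'gitlab-org/' up to, but not including, 'gitlab-org0' ('0' being the byte right after '/'). Below is a minimal Ruby sketch of that rewrite, assuming plain ASCII prefixes and byte-wise comparison, which is what the pattern_ops operator classes provide. The helper name `prefix_range` and the sample paths are made up for illustration; this is not how PostgreSQL or GitLab implement it.

```ruby
# Hypothetical helper: compute the [lower, upper) bounds that a byte-wise
# B-tree index can use to answer `path LIKE 'prefix%'`.
# Assumes a non-empty ASCII prefix whose last byte is not 0xFF.
def prefix_range(prefix)
  bytes = prefix.bytes
  bytes[-1] += 1 # bump the last byte: '/' (0x2F) becomes '0' (0x30)
  [prefix, bytes.pack('C*').force_encoding(prefix.encoding)]
end

lower, upper = prefix_range('gitlab-org/')

# Illustrative sample data, not taken from the real routes table.
paths = [
  'gitlab-org',
  'gitlab-org/gitlab-ce',
  'gitlab-org/nested/project',
  'gitlab-org0',
  'gitlab-organization'
]

# Emulate the index condition: lower <= path < upper, compared byte by byte.
matches = paths.select do |path|
  (path.b <=> lower.b) >= 0 && (path.b <=> upper.b) < 0
end

puts upper
puts matches.inspect
```

Only the two paths below 'gitlab-org/' fall inside the range; the group itself and look-alike siblings such as 'gitlab-organization' fall outside it, which is why the trailing slash in the pattern matters.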

This basically leaves us with two options:

1. We add the index (this has to be added on top of the existing normal path index), and deal with all the pain of maintaining PostgreSQL-specific code (e.g. setting up a new DB from scratch should include this index)
2. We find a way to modify the code so that it doesn't have to use a LIKE

Impact on sequential scans:

> We find a way to modify the code so that it doesn't have to use a LIKE

@yorickpeterse I believe the use of path like 'gitlab-org/%' is the only easy way to get the list of descendants for a group and avoid a recursive lookup (considering nested groups).

mentioned in issue #29578 (closed)

mentioned in issue #29579 (closed)

Anything that is before the wildcard in LIKE queries should use the index. Maybe there is another place where we do sequential scans?

Oddly enough I'm not seeing a reduction in response timings with this index in place, and a very minor reduction in SQL timings. I wonder if PostgreSQL hasn't fully optimised queries just yet.

@vsizov

> Anything that is before the wildcard in LIKE queries should use the index.

Not when using the default index operator class in combination with a non-C locale (we use UTF8/Unicode). We're probably bitten by this in other places.

@dzaporozhets

Fair enough. In that case we need to:

- Add a migration that adds this index concurrently, but only for PostgreSQL. I believe we can set the operator class using the opclasses: option
- Ensure the index is created when setting up a new DB. This is done by loading the migration into lib/tasks/migrate/setup_postgresql.rake and manually migrating it

For MySQL we have a hack that ignores the opclasses: option, so everything else should just work
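
A rough sketch of the PostgreSQL-only rule described above, written as plain Ruby rather than as a real Rails migration so that it runs standalone. The function name, its parameters, and the adapter strings are assumptions for illustration only; the actual fix would go through a migration and the opclasses: option. The index name and the CREATE INDEX statement are the ones from the psql session earlier in this thread.

```ruby
INDEX_NAME = 'index_routes_on_path_text_pattern_ops'

# Hypothetical helper: return the SQL needed to add the pattern-ops index,
# or nil when there is nothing to do.
def pattern_ops_index_sql(adapter:, existing_indexes:)
  # Operator classes are a PostgreSQL feature, so other adapters are skipped.
  return nil unless adapter == 'postgresql'
  # Skip creation when an index with this name is already present
  # (for example one that was created by hand while testing).
  return nil if existing_indexes.include?(INDEX_NAME)

  "CREATE INDEX CONCURRENTLY #{INDEX_NAME} ON routes (path varchar_pattern_ops);"
end

puts pattern_ops_index_sql(adapter: 'postgresql', existing_indexes: [])
```

Checking for an existing index by name keeps the step idempotent, so running it against a database that already has the index is a no-op.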

This is where the slow query logs might come in handy. We should pull up the pgbadger output in a graphical format so everyone can see the biggest culprits.

@stanhu Ironically I ran pgbadger earlier today, and it did not spit out any routes queries. Instead, most of the queries were queries involving project_authorizations and massively nested sub-queries/unions.

One more problem I see there (probably minor) is that we allow "_" in a path. If it appears unescaped in a LIKE pattern it is treated as the "any single character" wildcard, so the index can only be used for the part of the pattern before it.

UPDATE: We need to escape it explicitly.
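
A minimal sketch of what explicit escaping could look like, assuming backslash is the LIKE escape character (the PostgreSQL default). The helper names `escape_like` and `descendants_pattern` are hypothetical; in a Rails codebase the built-in `sanitize_sql_like` serves the same purpose.

```ruby
# Hypothetical helper: escape LIKE wildcards so they match literally.
# The backslash itself is escaped too, so existing backslashes in the
# value cannot turn into escape sequences.
def escape_like(value)
  value.gsub(/[\\%_]/) { |char| "\\#{char}" }
end

# Hypothetical helper: build the pattern that matches everything nested
# under a group path.
def descendants_pattern(group_path)
  "#{escape_like(group_path)}/%"
end

puts descendants_pattern('my_group')
```

With the underscore escaped, a group named my_group no longer matches paths such as myXgroup/project, and the literal prefix available to the index extends all the way to the trailing slash.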

Looking at the 99th percentile of SQL timings there is a drop, but it's too early to tell if this is related or just coincidence; timings over the past 24 hours go up and down in waves.

Could we perhaps enable auto explain on dev or staging environments and see what comes up there?

Leaving this here so I don't forget:

:vertical_traffic_light::vertical_traffic_light::vertical_traffic_light: ~~Once we properly solve and deploy this we need to remove the index index_routes_on_path_text_pattern_ops.~~ The migration will re-use this index name and check for existence, removing the need for this.

Also, I'm moving this to 9.0, since this is a pretty serious regression that needs to be fixed there.

changed milestone to %9.0

@dzaporozhets Maybe the "_" symbol is not responsible for the performance degradation, but at least it causes this: https://gitlab.com/gitlab-org/gitlab-ce/issues/29583

assigned to @yorickpeterse

@stanhu Yes, auto explain sounds like something that would be useful.

Regarding the problem in this issue, I'll take care of it since there's no MR for this yet.

mentioned in commit 2d542358

mentioned in merge request !10060 (merged)

mentioned in issue #29170 (closed)

mentioned in commit 69af06dd

closed via commit b3d77dba

closed via commit 69af06dd

closed via merge request !10060 (merged)

mentioned in commit b3d77dba

mentioned in commit dcb55f3e

mentioned in merge request !14785
