current_node is a GeoNode, which maps to a table in the main database, as does the projects relation, which boils down to Project.all.
Geo::ProjectRegistry is in the separate, read-write, per-geo-secondary database. So we have to pluck the IDs here. But what happens when a couple of million projects have been synced? What sort of query do we generate?
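For reference, the code in question boils down to something like this (a sketch of the shape, not the exact scope):

```ruby
# Pluck every registered project ID out of the Geo tracking database...
synced_project_ids = Geo::ProjectRegistry.pluck(:project_id)

# ...then exclude them from the projects relation on the main database.
# With a couple of million registrations this turns into a NOT IN (...)
# clause containing millions of literal IDs.
current_node.projects.where.not(id: synced_project_ids)
```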
Steps to reproduce
Sync a few million projects, then try to sync a few more.
What is the current bug behavior?
Postgres will probably fail with some sort of "query too large" error.
What is the expected correct behavior?
We need to be able to scale to several million rows in the project_registry table. I'm pretty sure this query doesn't scale that far.
Possible fixes
Could we use where(last_repository_updated_at: nil) instead?
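Something along these lines (a sketch; it assumes the column is on the projects relation and is NULL for repositories we haven't dealt with yet, which needs verifying):

```ruby
# Sketch: avoid plucking registry IDs entirely by filtering on a column that
# is NULL until the repository has been updated/synced. Whether this matches
# the semantics we need is exactly the open question above.
current_node.projects.where(last_repository_updated_at: nil)
```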
Postgres can handle very big queries, although I agree this might not always be desirable. For instance, I did this:
```
$ irb
irb(main):001:0> open('query', 'w').puts("SELECT COUNT(*) FROM projects WHERE id IN (#{1.upto(5_000_000).to_a.join(',')});")
=> nil
irb(main):002:0> exit
$ du -h query
 37M    query
$ gdk psql -d gitlabhq_development
(in /Users/seanmcgivern/Code/gitlab-development-kit)
psql (9.6.3)
Type "help" for help.

gitlabhq_development=# \i gitlab/query
 count
-------
    20
(1 row)

Time: 7408.511 ms
```
I think what we're seeing here is the cost of the IN() clause being multiplied per row in the projects table.
Where we get that huge list of values from (pluck vs. fdw) is less of an issue than the cost of checking array membership for every row in the projects table.
PG can optimise this from a linear array lookup to a lookup in a set - EXPLAIN ANALYZE would tell us if this is already happening or not. Certainly on my local DB, it's already happening.
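(For anyone who wants to check themselves: from a Rails console it's just a matter of running EXPLAIN ANALYZE over the same statement, e.g. the sketch below with an arbitrary ID list.)

```ruby
# Sketch: inspect the plan Postgres picks for a large literal IN () list.
# The ID range here is arbitrary; substitute a real plucked list to match.
ids = (1..100_000).to_a.join(',')
plan = ActiveRecord::Base.connection.execute(
  "EXPLAIN ANALYZE SELECT COUNT(*) FROM projects WHERE id IN (#{ids})"
)
plan.each { |row| puts row['QUERY PLAN'] }
```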
Looks good, but how feasible would it be to get it into %10.0?
I'm wondering if we can do something a bit like this as a stopgap:
```sql
-- on RO replica
SELECT id
FROM projects
ORDER BY last_repository_updated_at

-- on node DB
SELECT external_project_id
FROM (VALUES (1), (2), ...) AS t(external_project_id)
LEFT OUTER JOIN project_registry
  ON t.external_project_id = project_registry.project_id
WHERE project_registry.id IS NULL
LIMIT <batch-size>
```
It's still awful, but hopefully it wouldn't suffer from statement timeouts. WDYT?
I can run the above query locally (only a few columns in project_registry though) with 5 million VALUES:
```ruby
values = (1..5_000_000).map { |id| "(#{id})" }.join(","); nil
sql = "SELECT external_project_id FROM (VALUES #{values}) AS t(external_project_id) LEFT OUTER JOIN project_registry ON t.external_project_id = project_registry.project_id WHERE project_registry.id IS NULL"; nil
t1 = Time.now
puts Geo::ProjectRegistry.connection.execute(sql).to_a.size # 4999972 records
puts (Time.now - t1).to_i # 23 seconds
```
In no sense do I like this. Do we have other options for %10.0? What's the query timeout we need to work to? 5 seconds?
> What's the query timeout we need to work to? 5 seconds?
Currently it's 60 seconds but there are plans to reduce this to 15 seconds. However, we try to limit the amount of SQL time per page to 200 ms.
Regarding the query, why are we not storing Geo tables in the same database as GitLab? We seem to rely on data from GitLab tables, so storing them in the same database would make it trivial to get the data. FDWs are also an option, but I have no idea if/how such an approach would scale.
@yorickpeterse the Geo secondary needs to be able to keep track of which projects it has synced, and which it has not. The database it has access to is read-only, so it has a separate RW database into which the project_registry table is placed.
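Concretely, the registry models point at that separate database with their own connection, roughly like this (a sketch from memory; the exact base class and configuration key may differ from the current code):

```ruby
# Sketch of the split: regular models read from the (read-only) replica of the
# main database, while registry models use the secondary's own read-write
# tracking database. The geo_database configuration key is assumed.
module Geo
  class TrackingBase < ActiveRecord::Base
    self.abstract_class = true
    establish_connection Rails.configuration.geo_database
  end

  class ProjectRegistry < TrackingBase
    self.table_name = 'project_registry'
  end
end
```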
Since we introduced RepositoryCreatedEvent, I think this query is only used to pick up projects that existed before the secondary was added (i.e., backfill), and those where the RepositoryCreatedEvent fails to be handled for some reason. So perhaps we can consider alternatives that would allow us to remove this query entirely?
> So perhaps we can consider alternatives that would allow us to remove this query entirely?
The less code/queries/etc the better, but I can see us still having to do something like this query. FDWs are an option we can definitely investigate for that.
Thinking on it some more, the VALUES approach could be used in the existing direction (rather than plucking all the project IDs into the node DB, continue to pluck the project registry project IDs into the secondary). That's a smaller change that should perform at least as "well" as the query above.
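Roughly like this, to make the direction concrete (a sketch only; the empty-registry case and batch size are hand-waved):

```ruby
# Sketch: keep the existing direction (registry IDs flow towards the projects
# query on the read-only replica), but anti-join against a VALUES list instead
# of generating a huge NOT IN (...) clause.
registered_ids = Geo::ProjectRegistry.pluck(:project_id)
values = registered_ids.map { |id| "(#{id})" }.join(',')

sql = <<~SQL
  SELECT projects.id
  FROM projects
  LEFT OUTER JOIN (VALUES #{values}) AS t(project_id)
    ON projects.id = t.project_id
  WHERE t.project_id IS NULL
  LIMIT 1000 -- batch size, placeholder value
SQL

unsynced_ids = Project.connection.execute(sql).map { |row| row['id'] }
```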