Geo: huge cross-database pluck and `NOT IN (...)` query for LFS objects and attachments
Summary
Sentry event: https://sentry.gitlap.com/gitlab/geo-testbed/issues/70345/
Similarly to https://gitlab.com/gitlab-org/gitlab-ee/issues/3269 , it looks like we're having problems backfilling many LFS attachments in Geo because we pluck a huge list of IDs from the tracking database then run a NOT IN (...)
query on the read-only slave. This takes forever to execute.
Steps to reproduce
Add a fresh Geo secondary to a server with many LFS objects
Example Project
(If possible, please create an example project here on GitLab.com that exhibits the problematic behaviour, and link to it here in the bug report)
(If you are using an older version of GitLab, this will also determine whether the bug has been fixed in a more recent version)
What is the current bug behavior?
PG::TRSerializationFailure: ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
from peek/views/pg.rb:17:in `async_exec'
from peek/views/pg.rb:17:in `async_exec'
from active_record/connection_adapters/postgresql_adapter.rb:592:in `block in exec_no_cache'
from active_record/connection_adapters/abstract_adapter.rb:484:in `block in log'
from active_support/notifications/instrumenter.rb:20:in `instrument'
from active_record/connection_adapters/abstract_adapter.rb:478:in `log'
from active_record/connection_adapters/postgresql_adapter.rb:592:in `exec_no_cache'
from active_record/connection_adapters/postgresql_adapter.rb:584:in `execute_and_clear'
from active_record/connection_adapters/postgresql/database_statements.rb:160:in `exec_query'
from active_record/connection_adapters/abstract/database_statements.rb:356:in `select'
from active_record/connection_adapters/abstract/database_statements.rb:32:in `select_all'
from active_record/connection_adapters/abstract/query_cache.rb:70:in `select_all'
from active_record/relation/calculations.rb:186:in `pluck'
from app/workers/geo/file_download_dispatch_worker.rb:36:in `find_lfs_object_ids'
from app/workers/geo/file_download_dispatch_worker.rb:12:in `load_pending_resources'
from app/workers/geo/base_scheduler_worker.rb:122:in `update_pending_resources'
from app/workers/geo/base_scheduler_worker.rb:43:in `block (2 levels) in perform'
from app/workers/geo/base_scheduler_worker.rb:39:in `loop'
from app/workers/geo/base_scheduler_worker.rb:39:in `block in perform'
from app/workers/geo/base_scheduler_worker.rb:152:in `try_obtain_lease'
from app/workers/geo/base_scheduler_worker.rb:34:in `perform'
from sidekiq/processor.rb:199:in `execute_job'
.where.not(id: downloaded_ids)
.order(created_at: :desc)
.limit(db_retrieve_batch_size)
.pluck(:id)
.map { |id| [id, :lfs] }
end
What is the expected correct behavior?
We should use postgres_fdw
to perform these queries. Until that's ready, we need to implement the approach we used in #3269 (closed) to avoid the huge NOT IN
fragment.