Improve GitLab Geo backfill so that it can be managed properly at scale
A follow-up to gitlab-org/gitlab-ee#1366, which was closed because the methods discussed there are no longer applicable.
From that issue https://gitlab.com/gitlab-org/gitlab-ee/issues/1366#note_22651105:
Backfill at scale
Use cases:
- Initial import
- Recovery from backup
- Failed replication (lost jobs) / Redis cleanup
Old way:
Rsync repositories from the primary to the secondary node
Goals
- Be able to monitor progress
- Reduce the network load introduced by the first iteration
- Run continuously (automatically)
- Be able to enable/disable (pause/unpause)
- Be able to throttle how often we scan and/or how many updates run concurrently
This part is still applicable.
Current state
`Geo::RepositorySyncWorker` runs every 5 minutes. It scans the Projects table for projects that are not yet present in the `ProjectRegistry` (the tracking database). Those projects get scheduled for sync and added to the `ProjectRegistry`.
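In rough pseudocode, the scan amounts to something like the following (a simplified sketch, not the actual implementation; the per-project job name `Geo::ProjectSyncWorker` is assumed here):

```ruby
module Geo
  class RepositorySyncWorker
    include Sidekiq::Worker

    def perform
      # Pull every registered project ID out of the tracking database...
      registered_ids = Geo::ProjectRegistry.pluck(:project_id)

      # ...then ask the main database for projects missing from the
      # registry. With millions of projects, this ID list is the
      # "gigantic cross-database pluck" described under Problems.
      Project.where.not(id: registered_ids).find_each do |project|
        Geo::ProjectSyncWorker.perform_async(project.id) # job name assumed
      end
    end
  end
end
```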
Problems
- In gitlab-org/gitlab-ee#3453 we noticed this backfill is too slow to work at GitLab.com scale.
- To find projects that are not present in the registry yet, a gigantic cross-database ID pluck is needed. A possible (partial) solution is to use Postgres FDW; see the sketch after this list.
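For illustration, with a `postgres_fdw` foreign table for `projects` mapped into the tracking database (the `fdw_gitlab` schema name here is an assumption, not the actual setup), the anti-join can run entirely on the database side instead of plucking IDs into Ruby:

```ruby
# Sketch, assuming fdw_gitlab.projects is a postgres_fdw foreign table
# pointing at the read-only replica of the main database. The anti-join
# happens in a single SQL query; no ID list crosses into Ruby.
unsynced_ids = Geo::ProjectRegistry.connection.select_values(<<~SQL)
  SELECT p.id
  FROM fdw_gitlab.projects p
  LEFT JOIN project_registry r ON r.project_id = p.id
  WHERE r.id IS NULL
SQL
```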
Possible improvements
- Bulk-backfill projects into the `ProjectRegistry`, so the cross-database pluck is no longer needed on every run; one possible shape is sketched after this list (see also the discussion at https://gitlab.com/gitlab-org/gitlab-ee/issues/3259#note_38920807)
- ???
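One possible shape for the bulk backfill, reusing the assumed `fdw_gitlab.projects` foreign table from the sketch above (column names are also assumptions): a single `INSERT ... SELECT` populates the registry in one pass, after which the worker can drive entirely off the registry's own state.

```ruby
# Sketch of a one-shot bulk backfill into the tracking database.
# After this runs, the periodic worker only needs to scan
# project_registry itself for rows that still need syncing.
Geo::ProjectRegistry.connection.execute(<<~SQL)
  INSERT INTO project_registry (project_id, created_at)
  SELECT p.id, NOW()
  FROM fdw_gitlab.projects p
  LEFT JOIN project_registry r ON r.project_id = p.id
  WHERE r.id IS NULL
SQL
```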
Related issues/MRs
- gitlab-org/gitlab-ee!2838 is changing the behavior of the Geo Log Cursor, so changes detected by the cursor are handled instantaneously instead of by `Geo::RepositorySyncWorker`
- gitlab-org/gitlab-ee#1598 discusses how `Geo::RepositorySyncWorker` currently takes care of syncing projects that are updated by "pull from mirror" (because no `Geo::RepositoryUpdatedEvent` is created for those updates at the moment)