Improve GitLab Geo backfill so that it can be managed properly at scale
A follow-up to gitlab-org/gitlab-ee#1366, which was closed because the methods discussed there are no longer applicable.
From that issue https://gitlab.com/gitlab-org/gitlab-ee/issues/1366#note_22651105:
Backfill at scale
Use cases:
- Initial import
- Recovery from backup
- Failed replication (lost jobs) / Redis cleanup
Old way:
Rsync repositories from the primary to the secondary node
Goals
- Be able to monitor progress
- Reduce the network load introduced by the first iteration
- Run continuously (automatically)
- Be able to enable/disable (pause/unpause)
- Be able to throttle how often we scan and/or how many updates run concurrently
This part is still applicable.
Current state
`Geo::RepositorySyncWorker` runs every 5 minutes. It scans the Projects table for projects that are not yet present in the `ProjectRegistry` (the tracking database). Those projects get scheduled for sync and added to the `ProjectRegistry`.
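In rough pseudocode, the scan amounts to something like the following (a simplified sketch, not the actual implementation; the per-project job name `Geo::ProjectSyncWorker` is assumed here):

```ruby
module Geo
  class RepositorySyncWorker
    include Sidekiq::Worker

    def perform
      # Pull every registered project ID out of the tracking database...
      registered_ids = Geo::ProjectRegistry.pluck(:project_id)

      # ...then ask the main database for projects missing from the
      # registry. With millions of projects, this ID list is the
      # "gigantic cross-database pluck" described under Problems.
      Project.where.not(id: registered_ids).find_each do |project|
        Geo::ProjectSyncWorker.perform_async(project.id) # job name assumed
      end
    end
  end
end
```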
Problems
- In gitlab-org/gitlab-ee#3453 we noticed this backfill is too slow to work at GitLab.com scale.
- To find projects that are not present in the registry yet, a gigantic cross-database ID pluck is needed. A possible (partial) solution is to use Postgres FDW; see the sketch after this list.
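For illustration, with a `postgres_fdw` foreign table for `projects` mapped into the tracking database (the `fdw_gitlab` schema name here is an assumption, not the actual setup), the anti-join can run entirely on the database side instead of plucking IDs into Ruby:

```ruby
# Sketch, assuming fdw_gitlab.projects is a postgres_fdw foreign table
# pointing at the read-only replica of the main database. The anti-join
# happens in a single SQL query; no ID list crosses into Ruby.
unsynced_ids = Geo::ProjectRegistry.connection.select_values(<<~SQL)
  SELECT p.id
  FROM fdw_gitlab.projects p
  LEFT JOIN project_registry r ON r.project_id = p.id
  WHERE r.id IS NULL
SQL
```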
Possible improvements
- Bulk-backfill projects into the `ProjectRegistry`, so the cross-database pluck is no longer needed on every run; one possible shape is sketched after this list (see also the discussion at https://gitlab.com/gitlab-org/gitlab-ee/issues/3259#note_38920807)
- ???
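One possible shape for the bulk backfill, reusing the assumed `fdw_gitlab.projects` foreign table from the sketch above (column names are also assumptions): a single `INSERT ... SELECT` populates the registry in one pass, after which the worker can drive entirely off the registry's own state.

```ruby
# Sketch of a one-shot bulk backfill into the tracking database.
# After this runs, the periodic worker only needs to scan
# project_registry itself for rows that still need syncing.
Geo::ProjectRegistry.connection.execute(<<~SQL)
  INSERT INTO project_registry (project_id, created_at)
  SELECT p.id, NOW()
  FROM fdw_gitlab.projects p
  LEFT JOIN project_registry r ON r.project_id = p.id
  WHERE r.id IS NULL
SQL
```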
Related issues/MRs
- gitlab-org/gitlab-ee!2838 is changing the behavior of the Geo Log Cursor, so changes detected by the cursor are handled instantaneously instead of by `Geo::RepositorySyncWorker`
- gitlab-org/gitlab-ee#1598 discusses how `Geo::RepositorySyncWorker` currently takes care of syncing projects that are updated by "pull from mirror" (because no `Geo::RepositoryUpdatedEvent` is created for those updates at the moment)