So these repositories are failing because the storages aren't available on the secondary? Does the secondary have the missing storages in the config, or not?
In addition to fixing this queuing bug, assuming the repository storages aren't in the config, I think we should:
Not even try to sync repositories that are in a missing storage
Display a large warning in the admin area if projects exist that reference a storage not in the config. This can be CE+EE
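A rough sketch of what such a check could look like (this assumes the repository_storage column on projects and Gitlab.config.repositories.storages as they exist in CE today; it is only an illustration, not a concrete implementation):

```ruby
# Sketch: find projects whose repository_storage is not in the config,
# so the admin area can warn about them and Geo can skip syncing them.
configured_storages = Gitlab.config.repositories.storages.keys

orphaned = Project.where.not(repository_storage: configured_storages)

orphaned.find_each do |project|
  # Surface these in an admin banner instead of trying to sync them.
  Rails.logger.warn(
    "Project #{project.full_path} references missing storage #{project.repository_storage}"
  )
end
```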
With https://gitlab.com/gitlab-org/gitlab-ee/issues/3230 we can schedule the repository sync when processing the project create/update events in the Geo Log Cursor, leaving the Geo::RepositorySyncWorker responsible only for the initial backfill, which makes those queries simpler. @nick.thomas Wdyt?
We should think about whether this task needs to be performed while the secondary node is up or down. If the node needs to be down, how can we ensure that we create a Geo::ProjectRegistry entry for projects created on the primary node during this interval?
This should be used by Geo::RepositorySyncWorker, so we can remove the interleaving of Project & Geo::ProjectRegistry.
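To illustrate the direction, a backfill worker that relies only on the registry could pick its work with a query like the one below. This is just a sketch; it assumes the resync_repository and last_repository_synced_at columns on Geo::ProjectRegistry and is not meant as the final query.

```ruby
# Sketch of a registry-only query for Geo::RepositorySyncWorker: no join
# or interleaving with Project, just "which registry rows still need a sync?".
def find_registries_to_sync(batch_size)
  Geo::ProjectRegistry
    .where(resync_repository: true)
    .order(:last_repository_synced_at) # never-synced rows would need NULLS FIRST handling
    .limit(batch_size)
end
```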
@dbalexandre I think #full_scan! can be improved a lot by avoiding the create inside missing_projects.each and instead using something like active_record-pg_generate_series, like you are doing in gitlab-org/gitlab-ee!2701 (maybe we need activerecord-import instead?).
That makes sense. But isn't that already taken care of by GitLab::Geo::LogCursor::Daemon#handle_repository_update?
We should think about whether this task needs to be performed while the secondary node is up or down. If the node needs to be down, how can we ensure that we create a Geo::ProjectRegistry entry for projects created on the primary node during this interval?
This is only needed to backfill the projects that existed before the secondary started processing Geo::EventLog, so won't the cursor pick up the projects that are created between the #full_scan and the processing of the Geo::EventLog?
With #3230 (closed) we can remove the interleaving of Project & Geo::ProjectRegistry, but we still need to pluck the project IDs.
Do we? I'd like to have a system where the Geo::RepositorySyncWorker can fully rely on Geo::ProjectRegistry.
@dbalexandre I think #full_scan! can be improved a lot by avoiding the create inside missing_projects.each and instead using something like active_record-pg_generate_series, like you are doing in gitlab-org/gitlab-ee!2701 (maybe we need activerecord-import instead?).
@to1ne Yes, we have a lot of room for improvements here.
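For example, with activerecord-import (or a raw bulk INSERT) the scan could build all missing registry rows in a single statement instead of one INSERT per project. A sketch of the shape only; it assumes the activerecord-import gem is available, which isn't decided yet:

```ruby
# Sketch of a bulk #full_scan!: collect the project ids that have no
# registry entry yet and insert them in one statement.
missing_ids = Project.where.not(id: Geo::ProjectRegistry.select(:project_id)).pluck(:id)

rows = missing_ids.map { |id| Geo::ProjectRegistry.new(project_id: id) }

# activerecord-import adds .import; validations are skipped because the
# rows only carry a project_id.
Geo::ProjectRegistry.import(rows, validate: false)
```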
That makes sense. But isn't that already taken care of by GitLab::Geo::LogCursor::Daemon#handle_repository_update?
GitLab::Geo::LogCursor::Daemon#handle_repository_update handles only repository update events (e.g. push events). We need to create a new event type, Geo::RepositoryCreatedEvent, and when processing it from the Geo::EventLog we can create a new Geo::ProjectRegistry entry and schedule the repository sync.
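Something like the sketch below. The event class and handler are hypothetical (they don't exist yet), and it assumes the resync_repository flag on Geo::ProjectRegistry plus Geo::ProjectSyncWorker for the actual sync:

```ruby
# Hypothetical Log Cursor handler for a Geo::RepositoryCreatedEvent:
# make sure a registry row exists, then schedule the first sync.
def handle_repository_created(event, created_at)
  registry = ::Geo::ProjectRegistry.find_or_initialize_by(project_id: event.project_id)
  registry.resync_repository = true
  registry.save!

  ::Geo::ProjectSyncWorker.perform_async(event.project_id, created_at)
end
```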
This is only needed to backfill the projects that existed before the secondary started processing Geo::EventLog, so won't the cursor pick up the projects that are created between the #full_scan and the processing of the Geo::EventLog?
No, not if the Geo::RepositorySyncWorker relies only on Geo::ProjectRegistry and we run this task while the node is down.
Do we? I'd like to have a system where the Geo::RepositorySyncWorker can fully rely on Geo::ProjectRegistry.
I mean that we can remove the interleaving of projects that have never been synced with projects that were updated recently.
So #find_project_ids_not_synced will only return projects that are not in the registry.
#find_project_ids_updated_recently already returns projects that are in the registry.
These are interleaved, and slowly every project will have a sync attempted and be added to the registry.
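Roughly how the interleaving behaves today, as a simplified sketch (not the actual worker code; the method names follow the ones above):

```ruby
# Simplified sketch of the current behaviour: both lists share the same
# batch, so backfill and retries compete for capacity.
def find_project_ids(batch_size)
  not_synced = find_project_ids_not_synced.take(batch_size)              # not yet in the registry
  updated_recently = find_project_ids_updated_recently.take(batch_size)  # already in the registry

  # Interleave the two lists and cap the result at batch_size.
  not_synced.zip(updated_recently).flatten.compact.uniq.take(batch_size)
end
```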
Pitfall
Due to the interleaving, in every iteration the projects that failed in a previous iteration are retried (those returned by #find_project_ids_updated_recently). So this halves the capacity available to sync projects that have not been attempted yet.
Although, this might be a quick fix we can do in a %9.5 patch release, so we can get our .com testbed up and running?