So these repositories are failing because the storages aren't available on the secondary? Does the secondary have the missing storages in the config, or not?
In addition to fixing this queuing bug, assuming the repository storages aren't in the config, I think we should:
Not even try to sync repositories that are in a missing storage
Display a large warning in the admin area if projects exist that reference a storage not in the config. This can be CE+EE
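A rough sketch of what such a check could look like (this assumes the repository_storage column on projects and Gitlab.config.repositories.storages as they exist in CE today; it is only an illustration, not a concrete implementation):

```ruby
# Sketch: find projects whose repository_storage is not in the config,
# so the admin area can warn about them and Geo can skip syncing them.
configured_storages = Gitlab.config.repositories.storages.keys

orphaned = Project.where.not(repository_storage: configured_storages)

orphaned.find_each do |project|
  # Surface these in an admin banner instead of trying to sync them.
  Rails.logger.warn(
    "Project #{project.full_path} references missing storage #{project.repository_storage}"
  )
end
```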
With https://gitlab.com/gitlab-org/gitlab-ee/issues/3230 we can schedule the repository sync when processing the project create/update events in the Geo Log Cursor, leaving the Geo::RepositorySyncWorker responsible only for the initial backfill, which makes those queries simpler. @nick.thomas Wdyt?
We should think about whether this task needs to be performed while the secondary node is up or down. If the node needs to be down, how can we ensure that we create a Geo::ProjectRegistry entry for projects created on the primary node during this interval?
This should be used by Geo::RepositorySyncWorker, so we can remove the interleaving of Project & Geo::ProjectRegistry.
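To illustrate the direction, a backfill worker that relies only on the registry could pick its work with a query like the one below. This is just a sketch; it assumes the resync_repository and last_repository_synced_at columns on Geo::ProjectRegistry and is not meant as the final query.

```ruby
# Sketch of a registry-only query for Geo::RepositorySyncWorker: no join
# or interleaving with Project, just "which registry rows still need a sync?".
def find_registries_to_sync(batch_size)
  Geo::ProjectRegistry
    .where(resync_repository: true)
    .order(:last_repository_synced_at) # never-synced rows would need NULLS FIRST handling
    .limit(batch_size)
end
```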
@dbalexandre I think #full_scan! can be improved a lot by avoiding the create inside missing_projects.each and instead using something like active_record-pg_generate_series, like you are doing in gitlab-org/gitlab-ee!2701 (maybe we need activerecord-import instead?).
That makes sense. But isn't that already taken care of by GitLab::Geo::LogCursor::Daemon#handle_repository_update?
We should think about whether this task needs to be performed while the secondary node is up or down. If the node needs to be down, how can we ensure that we create a Geo::ProjectRegistry entry for projects created on the primary node during this interval?
This is only needed to backfill the projects that existed before the secondary started processing Geo::EventLog, so won't the cursor pick up the projects that are created between the #full_scan and the processing of the Geo::EventLog?
With #3230 (closed) we can remove the interleaving of Project & Geo::ProjectRegistry, but we still need to pluck the project IDs.
Do we? I'd like to have a system where the Geo::RepositorySyncWorker can fully rely on Geo::ProjectRegistry.
@dbalexandre I think #full_scan! can be improved a lot by avoiding the create inside missing_projects.each and instead using something like active_record-pg_generate_series, like you are doing in gitlab-org/gitlab-ee!2701 (maybe we need activerecord-import instead?).
@to1ne Yes, we have a lot of room for improvements here.
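For example, with activerecord-import (or a raw bulk INSERT) the scan could build all missing registry rows in a single statement instead of one INSERT per project. A sketch of the shape only; it assumes the activerecord-import gem is available, which isn't decided yet:

```ruby
# Sketch of a bulk #full_scan!: collect the project ids that have no
# registry entry yet and insert them in one statement.
missing_ids = Project.where.not(id: Geo::ProjectRegistry.select(:project_id)).pluck(:id)

rows = missing_ids.map { |id| Geo::ProjectRegistry.new(project_id: id) }

# activerecord-import adds .import; validations are skipped because the
# rows only carry a project_id.
Geo::ProjectRegistry.import(rows, validate: false)
```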
That makes sense. But isn't that already taken care of by GitLab::Geo::LogCursor::Daemon#handle_repository_update?
GitLab::Geo::LogCursor::Daemon#handle_repository_update handles only repository update events (e.g. push events). We need to create a new event type, Geo::RepositoryCreatedEvent, and when processing it from the Geo::EventLog we can create a new Geo::ProjectRegistry entry and schedule the repository sync.
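Something like the sketch below. The event class and handler are hypothetical (they don't exist yet), and it assumes the resync_repository flag on Geo::ProjectRegistry plus Geo::ProjectSyncWorker for the actual sync:

```ruby
# Hypothetical Log Cursor handler for a Geo::RepositoryCreatedEvent:
# make sure a registry row exists, then schedule the first sync.
def handle_repository_created(event, created_at)
  registry = ::Geo::ProjectRegistry.find_or_initialize_by(project_id: event.project_id)
  registry.resync_repository = true
  registry.save!

  ::Geo::ProjectSyncWorker.perform_async(event.project_id, created_at)
end
```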
This is only needed to backfill the projects that existed before the secondary started processing Geo::EventLog, so won't the cursor pick up the projects that are created between the #full_scan and the processing of the Geo::EventLog?
No, not if the Geo::RepositorySyncWorker relies only on Geo::ProjectRegistry and we run this task while the node is down.
Do we? I'd like to have a system where the Geo::RepositorySyncWorker can fully rely on Geo::ProjectRegistry.
I mean that we can remove the interleaving of projects that have never been synced with projects that were updated recently.
So #find_project_ids_not_synced will only return projects that are not in the registry.
#find_project_ids_updated_recently already returns projects that are in the registry.
These are interleaved, and slowly every project will have a sync attempted and be added to the registry.
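Roughly how the interleaving behaves today, as a simplified sketch (not the actual worker code; the method names follow the ones above):

```ruby
# Simplified sketch of the current behaviour: both lists share the same
# batch, so backfill and retries compete for capacity.
def find_project_ids(batch_size)
  not_synced = find_project_ids_not_synced.take(batch_size)              # not yet in the registry
  updated_recently = find_project_ids_updated_recently.take(batch_size)  # already in the registry

  # Interleave the two lists and cap the result at batch_size.
  not_synced.zip(updated_recently).flatten.compact.uniq.take(batch_size)
end
```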
Pitfall
Due to the interleaving, in every iteration the projects that failed in a previous iteration are retried (those returned by #find_project_ids_updated_recently). So this halves the capacity available to sync projects that have not been attempted yet.
Although, this might be a quick fix we can do in a %9.5 patch release, so we can get our .com testbed up and running?