Currently GitLab Geo uses the Elasticsearch application settings and therefore the same Elasticsearch cluster as the primary. We may need to consider adding the ability to specify Geo Elasticsearch nodes.
Having geo-local nodes to query would be great; is there a mechanism in ElasticSearch to keep these nodes up to date, or would we need each Geo instance to build the ElasticSearch index itself, locally?
This is easy to solve: you can just create a separate ES node (replica shard) that is close to your Geo node, and that's it. The replication process will be taken care of by ES.
@vsizov Is it wise to use ES replication over a WAN? We would have to open some ports between these nodes, and I'm not certain that communication is encrypted. Not to mention I'm not really sure that ES is meant to handle the latency over a WAN for replication purposes.
I wonder if the easiest implementation for Geo would be to allow specifying ES connection details for secondary nodes and let each secondary Geo instance index its own dataset.
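A minimal sketch of that idea, assuming a hypothetical per-instance override (spelled here as a GEO_SECONDARY_ELASTICSEARCH_URL environment variable purely for illustration) and assuming the shared value stays on ApplicationSetting#elasticsearch_url:

```ruby
# Illustrative only: GEO_SECONDARY_ELASTICSEARCH_URL is a made-up override,
# not an existing setting. A secondary would prefer it over the shared
# application setting so it can index and query its own local ES cluster.
require 'elasticsearch'

def elasticsearch_client_for_this_node
  local_url = ENV['GEO_SECONDARY_ELASTICSEARCH_URL']
  url = local_url.presence || ApplicationSetting.current.elasticsearch_url

  Elasticsearch::Client.new(url: url)
end
```

The secondary's indexers and search queries would then go through this client rather than the primary's cluster.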
@stanhu ES clusters communicate on a proprietary protocol on port 9300 IIRC. Also, clusters assume they are on the same local network. I also found this post from Elastic talking about the topic: "We are frequently asked whether it is advisable to distribute an Elasticsearch cluster across multiple data centers (DCs). The short answer is "no"". See https://www.elastic.co/blog/clustering_across_multiple_data_centers
What problem are we trying to solve? Additional latency on search requests due to slow connections to remote ES servers?
If yes, then I'm a bit confused, because we're talking about a single HTTP(S) request (the actual search query). So, theoretically, this should not create any problems. Unfortunately, I don't have any practical information on this...
For the DR case, it is (sort of) important to have an independent copy of the elasticsearch data in the same physical location as each secondary.
I don't think we should rely on the secondaries to build an independent elasticsearch index - replicating the existing ES data to the secondary in some manner seems like the better option. Ideally, ES would have some kind of standby mode similar to postgresql secondaries. The Geo secondaries could perform queries against that with some "override application_settings" configuration - perhaps that would go on the GeoNode?
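A rough sketch of that override, assuming a new (currently non-existent) elasticsearch_url column on GeoNode and the shared setting remaining on ApplicationSetting#elasticsearch_url:

```ruby
# Hypothetical: GeoNode has no elasticsearch_url column today. When set, a
# secondary would query its local, replicated ES cluster; otherwise it falls
# back to the shared application setting, so primaries behave as they do now.
def elasticsearch_url_for_search
  node_url = Gitlab::Geo.current_node&.elasticsearch_url

  node_url.presence || ApplicationSetting.current.elasticsearch_url
end
```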
@dblessing the first barrier is that we currently store index status in the main GitLab database, which is read-only on secondaries. So there's some significant engineering work required up front.
Then, actually building the index is computationally expensive. Not everyone is at GitLab.com scale, but we could easily be talking about a week to a month of processing on each secondary - and search would be broken on the secondary until it completed. I'd much rather have a bandwidth capacity problem than a CPU capacity problem :)
I'd much rather have a bandwidth capacity problem than a CPU capacity problem
I'm not sure everyone would agree - especially, say, a customer on AWS replicating to a secondary in another provider (or even another region, though I'm not sure about network costs in that case). It could get really expensive.
I understand the issue with the database state. I'm not sure it's simple enough to say that replicating is the better way to go from a bandwidth vs. CPU standpoint. The 'hit' for building the index would be a one-time cost when provisioning a secondary. Of course, the same is true of the bandwidth hit, so it's mostly a matter of your first point plus what each approach might cost from a network vs. CPU perspective in a cloud. Something to think about...
CPU time on AWS isn't cheap either. We could probably compute relative costs for the GitLab.com case, to see if there's a clear winner either way? My intuition is that bandwidth wins, but I could easily be wrong
Once the backfill is done, there's also a (smaller) ongoing cost as new documents show up, in both scenarios.
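To make the "compute relative costs" idea concrete, a back-of-envelope sketch - every figure below is a placeholder to be replaced with real GitLab.com numbers, not a measurement:

```ruby
# Placeholder values only; plug in real figures before drawing conclusions.
index_size_gb     = 500.0    # assumed size of the ES index to transfer once
egress_usd_per_gb = 0.09     # assumed inter-region/inter-provider egress price
reindex_cpu_hours = 24 * 14  # assumed two weeks of indexing on the secondary
cpu_usd_per_hour  = 0.20     # assumed on-demand instance price

replicate_cost = index_size_gb * egress_usd_per_gb
rebuild_cost   = reindex_cpu_hours * cpu_usd_per_hour

puts format('replicate: ~$%.0f, rebuild: ~$%.0f', replicate_cost, rebuild_cost)
```

The ongoing cost would need the same treatment per new document: bytes shipped vs. CPU spent.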
I do suggest moving the index status to elasticsearch itself in https://gitlab.com/gitlab-org/gitlab-ee/issues/2341 - this would resolve one technical barrier to pursuing independent rebuilds on each secondary.
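For context, a sketch of what "index status in elasticsearch itself" could look like - the index name and document shape here are illustrative only, not what that issue specifies:

```ruby
# Illustrative only: 'gitlab-index-status' and the field names are made up.
require 'time'
require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9200')

project_id  = 42           # example values
last_commit = 'deadbeef'

# Write the status alongside the search data itself...
client.index(index: 'gitlab-index-status', type: 'doc', id: project_id,
             body: { last_indexed_commit: last_commit,
                     indexed_at: Time.now.utc.iso8601 })

# ...so a secondary (or a fresh rebuild) can read it back from ES instead of
# the read-only GitLab database.
status = client.get(index: 'gitlab-index-status', type: 'doc', id: project_id)['_source']
```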
The time issue remains worrying - since replication involves no reprocessing, I think it's safe to claim that elasticsearch-level replication will inevitably be faster than independent rebuilds. https://gitlab.com/gitlab-org/gitlab-ee/issues/3492 could help to make this concern manageable, though.