Currently GitLab Geo uses the Elasticsearch application settings and therefore the same Elasticsearch cluster as the primary. We may need to consider adding the ability to specify Geo Elasticsearch nodes.
Having geo-local nodes to query would be great; is there a mechanism in ElasticSearch to keep these nodes up to date, or would we need each Geo instance to build the ElasticSearch index itself, locally?
This is easy to solve: you can just create a separate ES node (replica shard) that is close to your Geo node, and that's it. The replication process will be taken care of by ES.
@vsizov Is it wise to use ES replication over a WAN? We would have to open some ports between these nodes, and I'm not certain that communication is encrypted. Not to mention I'm not really sure that ES is meant to handle the latency over a WAN for replication purposes.
I wonder if the easiest implementation for Geo would be to allow specifying ES connection details for secondary nodes and let each secondary Geo instance index its own dataset.
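A minimal sketch of that idea, assuming a hypothetical per-instance override (spelled here as a GEO_SECONDARY_ELASTICSEARCH_URL environment variable purely for illustration) and assuming the shared value stays on ApplicationSetting#elasticsearch_url:

```ruby
# Illustrative only: GEO_SECONDARY_ELASTICSEARCH_URL is a made-up override,
# not an existing setting. A secondary would prefer it over the shared
# application setting so it can index and query its own local ES cluster.
require 'elasticsearch'

def elasticsearch_client_for_this_node
  local_url = ENV['GEO_SECONDARY_ELASTICSEARCH_URL']
  url = local_url.presence || ApplicationSetting.current.elasticsearch_url

  Elasticsearch::Client.new(url: url)
end
```

The secondary's indexers and search queries would then go through this client rather than the primary's cluster.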
@stanhu ES clusters communicate on a proprietary protocol on port 9300 IIRC. Also, clusters assume they are on the same local network. I also found this post from Elastic talking about the topic: "We are frequently asked whether it is advisable to distribute an Elasticsearch cluster across multiple data centers (DCs). The short answer is "no"". See https://www.elastic.co/blog/clustering_across_multiple_data_centers
What problem are we trying to solve? Additional latency on search requests due to slow connections to remote ES servers?
If yes, then I'm a bit confused, because we're talking about a single HTTP(S) request (the actual search query). So, theoretically, this should not create any problems. Unfortunately, I don't have any practical information on this...
For the DR case, it is (sort of) important to have an independent copy of the elasticsearch data in the same physical location as each secondary.
I don't think we should rely on the secondaries to build an independent elasticsearch index - replicating the existing ES data to the secondary in some manner seems like the better option. Ideally, ES would have some kind of standby mode similar to postgresql secondaries. The Geo secondaries could perform queries against that with some "override application_settings" configuration - perhaps that would go on the GeoNode?
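A rough sketch of that override, assuming a new (currently non-existent) elasticsearch_url column on GeoNode and the shared setting remaining on ApplicationSetting#elasticsearch_url:

```ruby
# Hypothetical: GeoNode has no elasticsearch_url column today. When set, a
# secondary would query its local, replicated ES cluster; otherwise it falls
# back to the shared application setting, so primaries behave as they do now.
def elasticsearch_url_for_search
  node_url = Gitlab::Geo.current_node&.elasticsearch_url

  node_url.presence || ApplicationSetting.current.elasticsearch_url
end
```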
@dblessing the first barrier is that we currently store index status in the main GitLab database, which is read-only on secondaries. So there's some significant engineering work required up front.
Then, actually building the index is computationally expensive. Not everyone is at GitLab.com scale, but we could easily be talking about a week to a month of processing on each secondary - and search would be broken on the secondary until it completed. I'd much rather have a bandwidth capacity problem than a CPU capacity problem :)
I'd much rather have a bandwidth capacity problem than a CPU capacity problem
I'm not sure everyone would agree - especially, say, a customer on AWS replicating to a secondary in another provider (or even another region, though I'm not sure about network costs in that case). It could get really expensive.
I understand the issue with the database state. I'm not sure it's simple enough to say that replicating is the better way to go from a bandwidth vs. CPU standpoint. The 'hit' for building the index would be a one-time cost when provisioning a secondary. Of course, the same is true of the bandwidth hit, so it's mostly a matter of your first point plus what each approach might cost from a network vs. CPU perspective in a cloud. Something to think about...
CPU time on AWS isn't cheap either. We could probably compute relative costs for the GitLab.com case, to see if there's a clear winner either way? My intuition is that bandwidth wins, but I could easily be wrong
Once the backfill is done, there's also a (smaller) ongoing cost as new documents show up, in both scenarios.
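To make the "compute relative costs" idea concrete, a back-of-envelope sketch - every figure below is a placeholder to be replaced with real GitLab.com numbers, not a measurement:

```ruby
# Placeholder values only; plug in real figures before drawing conclusions.
index_size_gb     = 500.0    # assumed size of the ES index to transfer once
egress_usd_per_gb = 0.09     # assumed inter-region/inter-provider egress price
reindex_cpu_hours = 24 * 14  # assumed two weeks of indexing on the secondary
cpu_usd_per_hour  = 0.20     # assumed on-demand instance price

replicate_cost = index_size_gb * egress_usd_per_gb
rebuild_cost   = reindex_cpu_hours * cpu_usd_per_hour

puts format('replicate: ~$%.0f, rebuild: ~$%.0f', replicate_cost, rebuild_cost)
```

The ongoing cost would need the same treatment per new document: bytes shipped vs. CPU spent.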
I do suggest moving the index status to elasticsearch itself in https://gitlab.com/gitlab-org/gitlab-ee/issues/2341 - this would resolve one technical barrier to pursuing independent rebuilds on each secondary.
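For context, a sketch of what "index status in elasticsearch itself" could look like - the index name and document shape here are illustrative only, not what that issue specifies:

```ruby
# Illustrative only: 'gitlab-index-status' and the field names are made up.
require 'time'
require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9200')

project_id  = 42           # example values
last_commit = 'deadbeef'

# Write the status alongside the search data itself...
client.index(index: 'gitlab-index-status', type: 'doc', id: project_id,
             body: { last_indexed_commit: last_commit,
                     indexed_at: Time.now.utc.iso8601 })

# ...so a secondary (or a fresh rebuild) can read it back from ES instead of
# the read-only GitLab database.
status = client.get(index: 'gitlab-index-status', type: 'doc', id: project_id)['_source']
```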
The time issue remains worrying - since replication involves no reprocessing, I think it's safe to claim that elasticsearch-level replication will inevitably be faster than independent rebuilds. https://gitlab.com/gitlab-org/gitlab-ee/issues/3492 could help to make this concern manageable, though.