I did the research and I could not find a silver bullet for our case. Elasticsearch does not integrate well with a system where data has many relations, as in our case. Right now we pass all project IDs to Elasticsearch to cut off the items the user has no access to. The only optimization I see is to at least not pass the IDs of public projects. In that case all searchable entities (issues, merge requests, milestones) would need a redundant public flag, set according to the visibility of the project. But then we have another problem: what if the project visibility changes? We would have to update all those records, and the amount of them could be insane; for example, GitLab CE has more than 10,200 issues and almost 6,000 MRs. Elasticsearch can't update records "in place"; it has to retrieve and push the whole object back each time you update it, so it's expensive :) Anyway, I think it's still way better than collecting nearly 200,000 public project IDs (for GitLab.com) and pushing them to ES. Moreover, I'm sure huge projects will not change visibility too often :) and even if they do, it will happen in the background, which is way better. So @stanhu, what do you think?
@vsizov What is the downside of removing the visibility settings from the Elasticsearch index and relying on the SQL DB as the single source of truth? Searches would require getting a list of accessible projects from the SQL DB first.
@stanhu This is how it works now; we rely on the database only. But in this case we pass project IDs to ES, and the public project IDs are on that list too! So the process that serves the web request has to build this list and pass it to ES. Maybe I didn't understand your question, though.
The idea I described above is that we pass only the IDs of the private projects the user has access to, and store the project visibility level on issues, MRs, and milestones. Then we can use a condition like public OR (private AND project_id IN ({passed ids})) inside ES, so the list of passed IDs would not be so huge.
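A rough sketch of what that filter could look like in the Elasticsearch query DSL. This is just an illustration; the field names (visibility_level, project_id) and the PUBLIC constant are assumptions about our mapping, not the current schema:

```ruby
# Hypothetical sketch: a global issue search that matches either public documents
# or documents in the (private) projects we explicitly pass in.
PUBLIC = 20 # mirrors Gitlab::VisibilityLevel::PUBLIC, assumed to be denormalized onto each doc

def issue_search_query(term, private_project_ids)
  {
    query: {
      bool: {
        must: { match: { title: term } },
        filter: {
          bool: {
            should: [
              { term:  { visibility_level: PUBLIC } },
              { terms: { project_id: private_project_ids } }
            ]
          }
        }
      }
    }
  }
end

# e.g. Issue.__elasticsearch__.search(issue_search_query('elastic', [1, 42, 1337]))
```

The point is that the `terms` list only contains private project IDs, which stays small, instead of every public project ID on the instance.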
@vsizov As you pointed out, the public state can change very easily with a GitLab setting change. Since we can't update ES very easily, I don't think we should put anything in ES that could get out of date quickly. A user could easily create a public project and change his/her mind later, which then should restrict search as well. This opens up a number of security issues. We're better off being conservative here and explicitly requiring an allowed projects list. Even with PostgreSQL search, we'll likely need to limit the number of projects that can be searched at once.
@vsizov Let's imagine a scenario where a project has a significant number of issues, merge requests, and milestones (e.g. GitLab CE):
1. User changes the visibility setting from public to private.
2. Elasticsearch results are still available.
3. We then issue a task to delete all the data and reimport it.
4. There is some lag before search results are available again.
The security concern I have is at step 2. It may take a long time to delete all data to change the visibility field.
A user could then change the project from private to public again. This would invalidate step 3, so we'd either need a way of cancelling that job or of queuing another job after the first one.
It seems like a tricky situation, and I'd imagine many people have more private projects than public ones.
I see what you mean, BUT it depends on how we implement the worker that updates the documents. Removing the data and indexing it again would introduce the security issue for sure, but we don't need to do that. Instead we just need to update the existing docs with the new value (internally ES will remove and create new docs, but we don't care; that just explains why it's expensive). So every time the project visibility level changes, we create a new job that sets that visibility level on every doc. But even in this case we probably need to cancel all existing jobs that change the visibility level: https://github.com/mperham/sidekiq/wiki/FAQ#how-do-i-cancel-a-sidekiq-job
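Something along these lines is what I have in mind. Just a sketch: the worker name is made up, and I'm assuming the update_document_attributes helper from elasticsearch-model is available on the proxy:

```ruby
# Hypothetical sketch of the visibility-update job. It rewrites the denormalized
# visibility_level on every searchable document that belongs to the project.
class ProjectVisibilityUpdateWorker
  include Sidekiq::Worker

  def perform(project_id, visibility_level)
    project = Project.find(project_id)

    [project.issues, project.merge_requests, project.milestones].each do |relation|
      relation.find_each do |record|
        # elasticsearch-model's update_document_attributes sends a partial update;
        # internally ES still re-creates the doc, which is why this is expensive.
        record.__elasticsearch__.update_document_attributes(visibility_level: visibility_level)
      end
    end
  end
end

# Enqueued whenever the project visibility changes, e.g.:
# ProjectVisibilityUpdateWorker.perform_async(project.id, Gitlab::VisibilityLevel::PRIVATE)
```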
On the other hand, I don't see a problem with "queuing another job after the first one" - this is how it would work by default; we don't need to do anything special for that.
@stanhu From what you wrote, the only problem I see is "The security concern I have is at step 2. It may take a long time to delete all data to change the visibility field". But to be honest, I don't think it's a serious problem.
@vsizov I think updating every commit, MR, and issue to match the public visibility setting in the DB requires too much CPU time, complexity, and data duplication. This also triggers a reindex on Elasticsearch, which consumes a lot of CPU. What happens if something goes wrong in the middle of this update? Also, canceling Sidekiq jobs is something we don't really do today and seems tricky to get right.
If we are paginating project IDs, what is the issue with passing in a set of project IDs to search against?
@stanhu I propose we set up a call so we don't waste each other's time, because this discussion could go on forever. Proposed agenda for the call:
1. Explain why we have the current solution and why we need to pass project IDs.
2. Propose an optimization where we only pass the IDs of private projects, because a user can't have access to an unbounded number of private projects, but they can have access to a huge number of public projects, as on GitLab.com (more than 200,000 public projects). It does not make sense to build and send that list on every global search. Entities involved: issues, MRs, milestones (code and commits will follow soon). See the sketch after this list.
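For reference, building that (much shorter) private-ID list could look roughly like this. I'm assuming User#authorized_projects and Gitlab::VisibilityLevel are the pieces we'd use; the exact scopes are a sketch, not the code we ship today:

```ruby
# Hypothetical sketch: only the non-public projects the user is authorized for.
def private_project_ids_for(user)
  user.authorized_projects
      .where('visibility_level < ?', Gitlab::VisibilityLevel::PUBLIC)
      .pluck(:id)
end

# This list stays small even on GitLab.com, unlike the ~200,000 public project IDs.
```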
@stanhu Actually, there is a third evil I forgot to mention: using one index for everything. Right now we use a separate index for every model because it's much easier to support and it's the default behaviour of elasticsearch-rails. If you change a field in the index mapping you have to reindex the whole index; just imagine we change the mapping for issues and then have to reindex even repositories - everything. That sounds insane, right? This is why we chose a separate index per model. But having one common index would let us use parent/child relationships in ES, which is a kind of SQL JOIN. I think this is the solution for us now.
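Roughly what I mean, sketched against the older parent/child mapping syntax (where _parent lives on the child type). The type and field names are made up for illustration:

```ruby
# Hypothetical sketch: one shared index where issues are children of projects.
mappings = {
  project: {
    properties: { visibility_level: { type: 'integer' } }
  },
  issue: {
    _parent: { type: 'project' },
    properties: { title: { type: 'string' } }
  }
}

# A search can then "join" on the parent, so the visibility check lives only on
# the project document and never has to be denormalized onto issues/MRs:
query = {
  query: {
    bool: {
      must: { match: { title: 'elastic' } },
      filter: {
        has_parent: {
          parent_type: 'project',
          query: { term: { visibility_level: 20 } } # 20 == public, as an example
        }
      }
    }
  }
}
```

With this shape, a project visibility change only touches the single project document instead of every issue, MR, and milestone under it.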