In https://gitlab.com/gitlab-org/gitlab-ee/issues/1618 , we moved initial elasticsearch indexing of repositories into sidekiq, as the rake task was taking literally weeks to complete. The database job was "not a bottleneck yet".
I've just had a chance to run rake gitlab:elastic:index_database against staging, and it's taken 30 minutes to index just a quarter of the projects. It hasn't gotten to notes, MRs, issues, etc., yet.
So for GitLab.com, we're looking at more than a working day to run this rake task, which I consider to be unreasonable. We're also missing any sort of status feedback for this part of the indexing process.
For 9.0, I've got just enough time to take the existing rake task and parallelise it. We'll still have a long-running rake task, but this should allow it to complete inside a single working day.
@nick.thomas I'm not sure if threads will work here, because Project is a parent document for the rest of the documents. So we would need to build a pipeline: index projects first, then the rest of the models in parallel.
That said, I'm not sure whether the elasticsearch-rails gem is thread-safe; I hope it is.
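Roughly this shape, assuming the gem copes with it (Elastic::Indexer.index_all is a made-up helper, just to illustrate the ordering):

```ruby
# Sketch of the pipeline: parents first, then child models in parallel.
# Elastic::Indexer.index_all is a hypothetical helper, not an existing API.
Elastic::Indexer.index_all(Project) # parent documents must exist first

threads = [Issue, MergeRequest, Snippet, Note, Milestone].map do |klass|
  Thread.new do
    # each thread needs its own ActiveRecord connection from the pool
    ActiveRecord::Base.connection_pool.with_connection do
      Elastic::Indexer.index_all(klass)
    end
  end
end
threads.each(&:join)
```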
The last time we ran a database indexer on gitlab.com, it took a couple of hours, but now we have a lot more data.
The database indexer's speed didn't matter before because it's fine as long as it takes less time than the repository indexing. The slower it works, the less it stresses the database.
Right now I'm running into an issue where we have a connection pool size of 5 in the GDK and, of course, we have 6 classes to run the indexer on. I think it'll be easier to tweak the default GDK configuration than to do anything clever about it here.
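For reference, each concurrent indexer checks out its own ActiveRecord connection for the duration of its batch loop, so a crude check like this (class list hard-coded for illustration) shows why a pool of 5 falls one short:

```ruby
# Each parallel indexer holds one ActiveRecord connection while it runs,
# so the pool must cover all of them at once.
INDEXED_CLASSES = [Project, Issue, MergeRequest, Snippet, Note, Milestone].freeze

pool_size = ActiveRecord::Base.connection_pool.size # 5 by default in the GDK
if pool_size < INDEXED_CLASSES.size
  warn "Connection pool (#{pool_size}) is smaller than the #{INDEXED_CLASSES.size} " \
       "indexers we want to run; bump `pool:` in config/database.yml"
end
```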
I've abandoned threads in favour of rake's multitask, which gives us more flexibility in terms of what we run anyway.
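For the curious, the layout is roughly this (task names abbreviated here, not necessarily what ends up in the MR):

```ruby
# lib/tasks/gitlab/elastic.rake (sketch) -- `multitask` runs its
# prerequisites in separate threads instead of serially.
namespace :gitlab do
  namespace :elastic do
    multitask index_database: %i[index_projects index_issues index_merge_requests
                                 index_snippets index_notes index_milestones]

    task index_projects: :environment do
      logger = Logger.new(STDOUT)
      logger.info 'Indexing Project...'
      Project.import # elasticsearch-model bulk import
      logger.info 'Indexing Project... done'
    end

    # ...one task per remaining model, all following the same pattern...
  end
end
```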
@vsizov are you certain that we need to index projects before indexing child records? I seem to be able to index without projects at all:
```
gitlab-mbp:gitlab lupine$ rake gitlab:elastic:recreate_index
Index recreated
gitlab-mbp:gitlab lupine$ bundle exec rake gitlab:elastic:index_database
I, [2017-03-06T13:31:35.693106 #29527] INFO -- : Indexing MergeRequest...
I, [2017-03-06T13:31:35.693740 #29527] INFO -- : Indexing Snippet...
I, [2017-03-06T13:31:35.693037 #29527] INFO -- : Indexing Issue...
I, [2017-03-06T13:31:35.693820 #29527] INFO -- : Indexing Note...
I, [2017-03-06T13:31:35.693883 #29527] INFO -- : Indexing Milestone...
I, [2017-03-06T13:31:36.056019 #29527] INFO -- : Indexing Issue... done
I, [2017-03-06T13:31:36.127640 #29527] INFO -- : Indexing MergeRequest... done
I, [2017-03-06T13:31:36.279106 #29527] INFO -- : Indexing Snippet... done
I, [2017-03-06T13:31:36.358592 #29527] INFO -- : Indexing Milestone... done
I, [2017-03-06T13:31:38.707780 #29527] INFO -- : Indexing Note... done
```
The amount of time this rake task takes to complete is important. It's an unfair burden on our infrastructure people to give them ad-hoc processes that they have to keep an eye on for multiple days, and having a single error somewhere in the process mean that you have to restart from the beginning, per https://gitlab.com/gitlab-org/gitlab-ee/issues/1840 , is not good for morale.
If I can't get this job to complete reliably on staging within the timescale of a single working day, I'll be unhappy.
Moving this backfill entirely into sidekiq is probably the right thing to do longer-term, but I've got a day :D.
Indexing without projects seems to work more or less as expected. There are no search results until the project gets indexed, but once it is, the issues indexed before the project was indexed are made searchable.
I think the ES team changed this behaviour in 5.x, because in 2.4 it failed. It's still weird, though: the parent document determines which shard the child document will be saved on.
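For reference, with a `_parent` mapping the child has to be indexed with the parent's id, and that id is what drives the shard routing; roughly like this (index/type names and the payload are just for illustration):

```ruby
# Illustration only: indexing a child document with an explicit parent id.
# Elasticsearch uses the parent id as the routing key, so the child lands
# on the same shard as its parent project.
issue  = Issue.first
client = Issue.__elasticsearch__.client

client.index index:  'gitlab-production',
             type:   'issue',
             id:     issue.id,
             parent: issue.project_id,   # routing: same shard as the project
             body:   issue.as_indexed_json
```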
@nick.thomas one thing to consider here is that if it finishes quickly but brings down the database or the filesystem, we just won't run it, which would mean requiring downtime to enable it.
So consider including some form of backpressure to prevent it from killing the site completely.
The changes increase the database load of the job by 6x relative to that. If you feel that's likely to cause problems, I can introduce an additional wait between batches (we read some data from the db, push to elasticsearch, repeat, so there's already some waiting).
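Concretely, I'd do something like this (assuming the elasticsearch-model `import` we use yields the bulk response per batch; the batch size and sleep are just placeholders):

```ruby
# Crude backpressure sketch: throttle between bulk batches so the indexer
# never hammers Postgres/Elasticsearch flat out.
[Project, Issue, MergeRequest, Snippet, Note, Milestone].each do |klass|
  klass.import(batch_size: 1_000) do |response|
    Rails.logger.warn("#{klass}: bulk import reported errors") if response['errors']
    sleep 1 # extra wait between batches; tune (or drop) based on observed load
  end
end
```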
Since we have six tables to work on concurrently, and given the times the other tables take, that suggests the total time to index the database will be around the five-hour mark, with peak load at ~3,000 tps, dropping to ~1,000 tps after the first hour and ~500 tps after two hours.
Production seems to hold steady at 10-15k tps without maxing out CPU or disks, so this isn't an unprecedented amount of load. If it does cause problems, the current codebase still supports running the jobs in series rather than in parallel. Overall, I think I'm happy.