Running the import takes well over 30 minutes (this varies with connection speed and processing power).
There's no progress shown to the user during this time.
Debugging
I've monkey-patched client.rb to print the last API response, including the status code. It's clear we're not hitting the GitHub rate limit, yet the time between requests is long. The bottleneck may be the creation of the merge requests/issues on the GitLab side.
Since cycle analytics was introduced, we see a large number of updates to the merge_request_metrics table when importing.
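For anyone who wants to repeat the timing check, here is a rough sketch of that kind of patch. It is not the actual change made to client.rb: `FakeClient`, its `get` method, and the `ResponseLogging` module are hypothetical stand-ins, and the real importer client has a different interface. The idea is only to wrap each request, record the status code, and measure how long the call took.

```ruby
# Hypothetical stand-in for the importer's API client (the real client.rb differs).
class FakeClient
  Response = Struct.new(:status, :body)

  def get(path)
    Response.new(200, "payload for #{path}")
  end
end

# Prepended module: wraps #get, times the call, and records the status code.
module ResponseLogging
  def get(path)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    response = super
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    log_lines << format("GET %s -> %d (%.3fs)", path, response.status, elapsed)
    response
  end

  # One formatted line per request made through this client instance.
  def log_lines
    @log_lines ||= []
  end
end

FakeClient.prepend(ResponseLogging)

client = FakeClient.new
client.get("/repos/example/pulls")
puts client.log_lines
```

Comparing the per-request duration with the gap between consecutive log lines gives a hint as to whether the time is spent waiting on the API or on work done between requests.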
@stanhu This is a blocker right now for a customer+ moving to GitLab from GitHub. The project they're importing has 14,000 commits, 20 branches, 400 releases, and 3,000 pull requests. They've run the import job for well over 12 hours and it's still going.
For reference, I started an import of GitLabHQ on GitLab.com. It has now been running for 11 hours and hasn't completed. I can't see the RepositoryImportWorker running in Sidekiq, so it seems the job has stopped, yet the import is still in a "processing" state.
@MrChrisW I'm pretty sure that got killed by the infamous MemoryKiller :D In that case, the import remains in the started state forever. This could be tricky, since we need memory usage to stay very low to keep the import going after the 15-minute grace period, and it's easy to exceed 1 GB if other threads are also doing memory-consuming work :/
So this would be an exercise in two things: 1. Keep memory usage as low as possible. 2. Make it faster.
@stanhu Importing PRs is slow because fetching the source ref for each PR from GitHub is slow, or at least adds up over time. I'm not sure whether recent code added another bottleneck. Maybe we should investigate running this stage (PR importing) in parallel jobs.
@MrChrisW's case is a bit weird because nothing seems to be imported except for the repo itself; the failed cases we've seen at least got some stuff imported before failing. Probably it got hit early by the memory killer like @jameslopez suggested.
I'm unable to reproduce the "MemoryKiller" theory, although I did get halfway there (469948 KB). Funnily enough, during this import test the worker finished successfully, but only the project was imported (no issues or pull requests). API log: https://drive.google.com/a/gitlab.com/file/d/0B_4wYK1qcPT1d3JCaFpwV19IUzg/
We're clearly nowhere near the rate limit, yet it's so slow, and the MRs/issues failed while the import is marked as a success. Also, no MemoryKiller.
@ahmadsherif Can you benchmark the import of https://github.com/gitlabhq/gitlabhq and see where the time is going? That would help inform whether your theory of pulling in individual PRs is causing the problem.
Wow! Is this slow because we have to do a git fetch for each pull request that references a deleted commit? How do we make this faster? Could we, for example, do a bulk git fetch? http://stackoverflow.com/a/25098004/1992201
Fetching refs in bulk helped a little; it brought the time for importing PRs down to 8.8 hours. I think there's still room for improvement, so I'll continue working on it.
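For reference, a minimal sketch of what a bulk fetch can look like, as opposed to one fetch per pull request. This is not the change that was actually made in the importer. It relies on GitHub exposing each pull request head under `refs/pull/<n>/head`; the helper names and the local `refs/remotes/<remote>/pull/*` destination are assumptions chosen for illustration.

```ruby
# Refspec mapping every GitHub pull request head (refs/pull/<n>/head) to a
# local remote-tracking ref, so a single fetch retrieves all of them.
def pull_request_refspec(remote)
  "+refs/pull/*/head:refs/remotes/#{remote}/pull/*"
end

# Hypothetical helper: argv that registers the refspec in the repository config.
def configure_bulk_fetch_command(remote = "origin")
  ["git", "config", "--add", "remote.#{remote}.fetch", pull_request_refspec(remote)]
end

# Hypothetical helper: argv for one fetch that uses every configured refspec.
def bulk_fetch_command(remote = "origin")
  ["git", "fetch", remote]
end

puts configure_bulk_fetch_command.join(" ")
puts bulk_fetch_command.join(" ")
```

Each array can be passed to `system(*command, chdir: repo_path)`, where `repo_path` is the imported repository on disk. After the fetch, the PR heads are available locally, so looking up a source ref no longer needs a network round trip.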
We saw a customer today who had a slow import for one 24,000-commit repo, and then that repo was dying on new commits with ProcessCommitWorker throwing a "stack level too deep" error:
We can fetch all refs in a single run; we just need to change the git config to do that. It's similar to doing a git clone --mirror and to how we replicate data in Geo. See:
We have a high-priority strategic partner who is considering migrating from GitHub and wants to do a POC of the migration and of GitLab. Currently, @dbalexandre has been assisting them: their import failed, and so did his own attempt. I would really like to see this done as soon as possible. @jameslopez cc/ @DouweM
@eliran.mesika This is scheduled for 9.1, and it's a priority. If there turns out to be a lot of work here, the idea is to split it and ship something in 9.1 anyway. I am not full-time on this, since I was release manager for 8.17 and now for 9.0.
We're seeing the same issue when trying to import a repo by URL from Bitbucket. We are running hosted GitLab EE. It would be great if this could be resolved. We are blocked :(
https://support.gitlab.com/hc/en-us/requests/71686