Currently our 99th percentile request response time is more than 5 seconds. By the dates below, 99% of GitLab HTTP(S) requests will take less than the stated time:
As discussed today in the infrastructure meeting, our first focus is the issues page; we will use this page to understand what the main offenders are and how we need to act in each case.
We will use this exercise to write the corresponding documentation, which the whole development team will follow so that we make GitLab fast.
This issue should perhaps contain a table of measurements, with (at least) one measurement per version, to track progress over time. It currently has the text "3,625 ms at one point", which is not very useful from a tracking perspective.
@pcarranza My reference is mainly to the issue header, which contains the statement "GitLab.com issue https get => 8,444 ms earlier, 3,625 ms at some point, 6,052 ms on 2016-06-23". This would be much more transparent (and probably usable) if it contained a simple table of measurements, perhaps weekly, perhaps a day after each deployment, indicating progress. Natural columns could be Date, Version and Milliseconds.
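For illustration, such a table might look like the sketch below; only the 2016-06-23 value comes from the figures quoted above, and the version column is left as a placeholder.

| Date       | Version | Milliseconds |
|------------|---------|--------------|
| 2016-06-23 | …       | 6,052        |
| …          | …       | …            |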
Yorick's fix for referenced mentionables also made a big difference, I would assume the diff improvements were useful, and of course there were plenty of smaller improvements!
Stan: If I had to guess, it would be a few percentage points off the original ~6 s, so maybe at most 0.3 s of savings. I think the large chunks of time savings came from reducing the number of DB queries, eliminating the need to call referenced_mentionables again for system notes, memoizing the author's badge level, and reducing the number of network round-trips needed to look up Banzai references from the cache. Before, we were spending 30% of the page load time of operations/42 just trying to find the lousy owner badge for all notes, and another 20% trying to figure out whether we should show each system note to the user.
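As a rough illustration of the memoization pattern Stan describes (the class and method names below are hypothetical, not the actual GitLab code):

```ruby
# Hypothetical sketch: cache each author's badge level once per request
# instead of recomputing it for every note on the page.
class NoteBadgeCache
  def initialize
    @badge_levels = {}
  end

  # Returns the badge level for the given author, running the expensive
  # lookup only the first time that author is seen.
  def badge_level_for(author)
    @badge_levels[author.id] ||= expensive_badge_lookup(author)
  end

  private

  # Stand-in for whatever query or service call actually computes the badge.
  def expensive_badge_lookup(author)
    author.highest_badge_level
  end
end
```

The same batching idea applies to the Banzai reference lookups: fetching them from the cache for the whole page at once means one round-trip instead of one per note.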
I'm not certain; I'll also add monitoring for HTTP to be sure about this.
The SSH key lookup is done with a database query, which takes about 11 ms on average, plus roughly 200 ms for the allowed call (see the graph underneath). The only issue here is that more than one key lookup may be necessary, depending on how the user has their SSH keys set up; if you use a single key for everything (as most users do), that is not an issue.
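A minimal sketch of why offering several keys multiplies the cost; Key, find_by(fingerprint:) and access_allowed? are illustrative names here, not the literal GitLab internals:

```ruby
# Each key the client offers costs another fingerprint lookup (~11 ms on
# average), and the matching key then pays the ~200 ms allowed call.
def authorize_ssh_connection(offered_fingerprints, repository)
  offered_fingerprints.each do |fingerprint|
    key = Key.find_by(fingerprint: fingerprint) # indexed DB lookup, ~11 ms avg
    next unless key

    return access_allowed?(key, repository)     # the ~200 ms allowed call
  end

  false
end
```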
My gut feeling with all this is that the NFS server is the main bottleneck. It is maxing out all the time in CPU time dedicated to IO (see the graph), and our latest finding also points in the direction that the way the LVM is set up (linear mapping) is not efficient at all: it generates a lot of overhead by fragmenting the actual filesystem for reads (on rewrites) and by concentrating all writes onto a single drive at a time.
The first test I want to run is to create the new NFS server and add monitoring to it, to conclusively rule the NFS server in or out as the main bottleneck; hopefully this will happen today.
@sytses There are two reasons why we are not making everything public:
1. There are some things being monitored in dev.gitlab.org.
2. Our private monitoring is used as the testbed for pulling things into the public one.
If you take a look at our private monitoring, it is simply a mess. I have to clean it up but haven't had the time yet, and I like to give users something already digested, simple, and easy to read so they can understand how things are working, so I run the experiments in the private one.
I'm fine with making all the blackbox monitoring public (even though it's a hack at this point) and with exposing some crucial and interesting metrics from whatever we are running internally (I don't think users care a lot about how many merged writes per second are happening on one node in our infrastructure), but we are still working on setting up the general monitoring, so we are not there yet.
Regarding the NFS, it's an assumption at this point, strongly backed by how our other servers and environments are behaving. The thing I'm extremely curious about is how much the LVM setup changes everything. But I think we are going to have a CephFS shard before we start moving projects from one shard to the other.
We are cautiously optimistic that the dbus patch, which required a full reboot of our worker machines, has helped bring SSH latencies down to reasonable levels:
```
gdk-ee(use-ee)$ for i in {0..20} ; do { time git ls-remote git@gitlab.com:gitlab-org/gitlab-development-kit.git ; } 2>&1
done | awk 'BEGIN {total=0 ; max=0 ; min=10} /^git/ {if($10 > max) max = $10} {if ($10 != "" && $10 < min) min=$10} {total+=$10} END {print "total " total " avg " total/20 " min " min " max " max}'
total 58.48 avg 2.924 min 2.518 max 3.210

gdk-ee(use-ee)$ for i in {0..20} ; do { time git ls-remote https://gitlab.com/gitlab-org/gitlab-development-kit.git ; } 2>&1
done | awk 'BEGIN {total=0 ; max=0 ; min=10} /^git/ {if($10 > max) max = $10} {if ($10 != "" && $10 < min) min=$10} {total+=$10} END {print "total " total " avg " total/20 " min " min " max " max}'
total 24.843 avg 1.24215 min 1.002 max 1.667
```
HTTPS is still 2x faster than SSH, but we are getting there.
It looks to me like the resolve-comment feature is partly to blame for a significant performance regression here, due to this permission check being run for every note:
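As a hypothetical illustration of that kind of per-note check (not the literal code being referenced; resolve_note and render_resolve_button are assumed names):

```ruby
# Evaluating the ability separately for every note means the permission
# machinery, and any queries it needs, runs once per note on the page
# instead of once per issue.
notes.each do |note|
  render_resolve_button(note) if can?(current_user, :resolve_note, note)
end
```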
We should also mention the outstanding improvements in large diff loading made by @yorickpeterse now in 8.12. Our test case is a Linux kernel commit that used to time out (> 60 s):
At the moment, it seems to me that we are hurting ourselves by loading Project for each note. This appears to be a regression, because in the past I don't think we loaded the same Project over and over for a single issue. Can we look into this?
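One way to avoid that, sketched here under the assumption that notes expose the usual project and author associations, is to eager-load them once for the whole issue (or simply reuse issue.project, since every note on the page belongs to the same project):

```ruby
# Sketch only: preload the associations in one query per table so each note
# reuses the already-loaded Project instead of triggering its own lookup.
notes = issue.notes.includes(:project, :author)
```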
The progress above looks great, and that is why we moved to GitLab cloud, but page load time for us is still very slow. Also, merge requests with significant changes mostly give 500 errors.
Personally, how fast the backend is (i.e. git push) really doesn't matter much to me.
For the most part, a slow page load is annoying, but not the end of the world.
It's the damn dynamic ajax lookups for all the little "assignee", "label", etc. dropdowns. Those items just do not change very frequently; the lists are mainly static. Stop using ajax and push the data into the JavaScript of the page. You're already dynamically generating these pages, and I'd much rather have to reload the page if for some reason a new label was added than spend an extra 2 seconds for every single issue/merge request waiting for one of those lists of options to appear.
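A minimal sketch of that suggestion, assuming a Rails helper and association names that are purely illustrative:

```ruby
# Hypothetical helper: serialize the mostly-static label and assignee lists
# once at render time, so the dropdowns can read data already embedded in
# the page instead of making an extra ajax round-trip per issue.
def issuable_static_options(project)
  {
    labels:    project.labels.map { |label| { id: label.id, title: label.title } },
    assignees: project.users.map  { |user|  { id: user.id,  name: user.name } }
  }.to_json
end
```

The rendered JSON could then be dropped into a script tag or data attribute when the page is generated.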
Even better would be getting a competent frontend/JavaScript developer and moving that information into offline/local browser storage. Add in a little async for the updates and it would actually be possible to use GitLab even when it is heavily loaded; the updates would just be delayed by a few extra seconds.