Currently our 99th percentile request response time is more than 5 seconds. By the dates below, 99% of GitLab HTTP(S) requests will take less than the stated time:
As discussed today in the infrastructure meeting, our first focus is the issues page; we will use this page to understand what the main offenders are and how we need to act in each case.
We will use this exercise to write the corresponding documentation, which the whole development team will follow so that we make GitLab fast.
This issue should perhaps contain a table of measurements, with (at least) one measurement per version, to track progress over time. It currently has the text "3,625 ms at one point", which is not very useful from a tracking perspective.
@pcarranza My reference is mainly to the issue header, which contains the statement "GitLab.com issue https get => 8,444 ms earlier, 3,625 ms at some point, 6,052 ms on 2016-06-23". This would be much more transparent (and probably usable) if it contained a simple table of measurements, perhaps weekly, perhaps a day after each deployment, indicating progress. Natural columns could be Date, Version and Milliseconds.
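For illustration, such a table might look like the sketch below; only the 2016-06-23 value comes from the figures quoted above, and the version column is left as a placeholder.

| Date       | Version | Milliseconds |
|------------|---------|--------------|
| 2016-06-23 | …       | 6,052        |
| …          | …       | …            |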
Yorick's fix for referenced mentionables also made a big difference, I would assume the diff improvements were useful, and of course there were plenty of smaller improvements!
Stan: If I had to guess, it would be a few percentage points off the original ~6 s, so maybe at most 0.3 s of savings. I think the large chunks of time savings came from reducing the number of DB queries, eliminating the need to call referenced_mentionables again for system notes, memoizing the author's badge level, and reducing the number of network round-trips needed to look up Banzai references from the cache. Before, we were spending 30% of the page load time of operations/42 just trying to find the lousy owner badge for all notes, and another 20% trying to figure out whether we should show each system note to the user.
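As a rough illustration of the memoization pattern Stan describes (the class and method names below are hypothetical, not the actual GitLab code):

```ruby
# Hypothetical sketch: cache each author's badge level once per request
# instead of recomputing it for every note on the page.
class NoteBadgeCache
  def initialize
    @badge_levels = {}
  end

  # Returns the badge level for the given author, running the expensive
  # lookup only the first time that author is seen.
  def badge_level_for(author)
    @badge_levels[author.id] ||= expensive_badge_lookup(author)
  end

  private

  # Stand-in for whatever query or service call actually computes the badge.
  def expensive_badge_lookup(author)
    author.highest_badge_level
  end
end
```

The same batching idea applies to the Banzai reference lookups: fetching them from the cache for the whole page at once means one round-trip instead of one per note.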
I'm not certain; I'll also add monitoring for HTTP to be sure about this.
The SSH key lookup is done with a database query, which takes about 11 ms on average, plus roughly 200 ms for the allowed call (see the graph underneath). The only issue here is that more than one key lookup may be necessary, depending on how the user has their SSH keys set up; if you use a single key for everything (as most users do), that is not an issue.
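A minimal sketch of why offering several keys multiplies the cost; Key, find_by(fingerprint:) and access_allowed? are illustrative names here, not the literal GitLab internals:

```ruby
# Each key the client offers costs another fingerprint lookup (~11 ms on
# average), and the matching key then pays the ~200 ms allowed call.
def authorize_ssh_connection(offered_fingerprints, repository)
  offered_fingerprints.each do |fingerprint|
    key = Key.find_by(fingerprint: fingerprint) # indexed DB lookup, ~11 ms avg
    next unless key

    return access_allowed?(key, repository)     # the ~200 ms allowed call
  end

  false
end
```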
My gut feeling with all this is that the NFS server is the main bottleneck. It is maxing out all the time in CPU time dedicated to IO (see the graph), and our latest finding also points in the direction that the way the LVM is set up (linear mapping) is not efficient at all: it generates a lot of overhead by fragmenting the actual filesystem for reads (on rewrites) and by concentrating all writes onto a single drive at a time.
The first test I want to run is to create the new NFS server and add monitoring to it, to conclusively rule the NFS server in or out as the main bottleneck; hopefully this will happen today.
@sytses There are two reasons why we are not making everything public:
1. There are some things being monitored in dev.gitlab.org.
2. Our private monitoring is used as the testbed for pulling things into the public one.
If you take a look at our private monitoring, it is simply a mess. I have to clean it up but haven't had the time yet, and I like to give users something already digested, simple, and easy to read so they can understand how things are working, so I run the experiments in the private one.
I'm fine with making all the blackbox monitoring public (even though it's a hack at this point) and with exposing some crucial and interesting metrics from whatever we are running internally (I don't think users care a lot about how many merged writes per second are happening on one node in our infrastructure), but we are still working on setting up the general monitoring, so we are not there yet.
Regarding the NFS, it's an assumption at this point, strongly backed by how our other servers and environments are behaving. The thing I'm extremely curious about is how much the LVM setup changes everything. But I think we are going to have a CephFS shard before we start moving projects from one shard to the other.
We are cautiously optimistic that the dbus patch, which required a full reboot of our worker machines, has helped bring SSH latencies down to reasonable levels:
```
gdk-ee(use-ee)$ for i in {0..20} ; do { time git ls-remote git@gitlab.com:gitlab-org/gitlab-development-kit.git ; } 2>&1
done | awk 'BEGIN {total=0 ; max=0 ; min=10} /^git/ {if($10 > max) max = $10} {if ($10 != "" && $10 < min) min=$10} {total+=$10} END {print "total " total " avg " total/20 " min " min " max " max}'
total 58.48 avg 2.924 min 2.518 max 3.210

gdk-ee(use-ee)$ for i in {0..20} ; do { time git ls-remote https://gitlab.com/gitlab-org/gitlab-development-kit.git ; } 2>&1
done | awk 'BEGIN {total=0 ; max=0 ; min=10} /^git/ {if($10 > max) max = $10} {if ($10 != "" && $10 < min) min=$10} {total+=$10} END {print "total " total " avg " total/20 " min " min " max " max}'
total 24.843 avg 1.24215 min 1.002 max 1.667
```
HTTPS is still 2x faster than SSH, but we are getting there.
It looks to me like the resolve-comment feature is partly to blame for a significant performance regression here, due to this permission check being run for every note:
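As a hypothetical illustration of that kind of per-note check (not the literal code being referenced; resolve_note and render_resolve_button are assumed names):

```ruby
# Evaluating the ability separately for every note means the permission
# machinery, and any queries it needs, runs once per note on the page
# instead of once per issue.
notes.each do |note|
  render_resolve_button(note) if can?(current_user, :resolve_note, note)
end
```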
We should also mention the outstanding improvements in large diff loading made by @yorickpeterse now in 8.12. Our test case is a Linux kernel commit that used to time out (> 60 s):
At the moment, it seems to me that we are hurting ourselves by loading Project for each note. This appears to be a regression, because in the past I don't think we loaded the same Project over and over for a single issue. Can we look into this?
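One way to avoid that, sketched here under the assumption that notes expose the usual project and author associations, is to eager-load them once for the whole issue (or simply reuse issue.project, since every note on the page belongs to the same project):

```ruby
# Sketch only: preload the associations in one query per table so each note
# reuses the already-loaded Project instead of triggering its own lookup.
notes = issue.notes.includes(:project, :author)
```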
The progress above looks great, and that is why we moved to GitLab cloud, but page load time for us is still very slow. Also, merge requests with significant changes mostly give 500 errors.
Personally, how fast the backend is (i.e. git push) really doesn't matter much to me.
For the most part, a slow page load is annoying, but not the end of the world.
It's the damn dynamic ajax lookups for all the little "assignee", "label", etc. dropdowns. Those items just do not change very frequently; the lists are mainly static. Stop using ajax and push the data into the JavaScript of the page. You're already dynamically generating these pages, and I'd much rather have to reload the page if for some reason a new label was added than spend an extra 2 seconds for every single issue/merge request waiting for one of those lists of options to appear.
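A minimal sketch of that suggestion, assuming a Rails helper and association names that are purely illustrative:

```ruby
# Hypothetical helper: serialize the mostly-static label and assignee lists
# once at render time, so the dropdowns can read data already embedded in
# the page instead of making an extra ajax round-trip per issue.
def issuable_static_options(project)
  {
    labels:    project.labels.map { |label| { id: label.id, title: label.title } },
    assignees: project.users.map  { |user|  { id: user.id,  name: user.name } }
  }.to_json
end
```

The rendered JSON could then be dropped into a script tag or data attribute when the page is generated.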
Even better would be getting a competent frontend/JavaScript developer and moving that information into offline/local browser storage. Add in a little async for the updates and it would actually be possible to use GitLab even when it is heavily loaded; the updates would just be delayed by a few extra seconds.