New Relic data can't be trusted as it only runs on a single server; our handy InfluxDB setup, on the other hand, is much more reliable. Transaction timings from the past 24 hours:
The 99th percentile response time over the last 24 hours is 6.7 seconds, based on the following query:
```sql
select percentile("duration", 99) / 1000 from rails_transactions where time > now() - 24h;
```
These timings are Rails application timings only; they don't include any time spent in Workhorse, Nginx, etc.
@yorickpeterse Why do we only have one server reporting to New Relic? AFAIK you can point many servers at one application in New Relic without incurring additional costs; it's all priced per application.
@sytses I'm not sure just yet. I've seen a few bits of code being called in different controllers while being slow in all cases, so I'll look at that in the coming days. Basically my next goal will be finding/resolving bits of badly performing code that are shared between controllers. Fixing those cases should hopefully lead to a lower overall response time.
@dblessing I'm not sure what the exact reasoning is/was, but I'm pretty sure you pay per host/server, unless this changed in the past.
@dblessing New Relic is priced per server, and $150 * 20 servers is too much. I also think it's likely that having it on all servers would not improve the data; it should already be in the ballpark when running on a single server, since traffic is distributed evenly by HAProxy.
In https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/2421 the topic of increasing the Unicorn memory limit was brought up. We discovered the limits shipped with CE and Omnibus differ and are currently testing (by just letting it run for an hour or so while collecting data) a slightly more reasonable limit. Ideally this would cut down the amount of worker restarts which in turn could reduce request timings (due to requests not having to wait for workers to become available). Should this help we'll try to also get this into 8.4.
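For background, the limit in question is enforced with a memory-based worker killer (GitLab uses the unicorn-worker-killer gem). A minimal sketch of that mechanism, with illustrative numbers and config.ru layout rather than the actual CE/Omnibus defaults:

```ruby
# config.ru -- minimal sketch, not GitLab's actual configuration.
require 'unicorn/worker_killer'

# Restart a worker once its RSS exceeds a random threshold between min and
# max bytes; the randomisation prevents all workers restarting at once.
min = 300 * (1 << 20) # ~300 MB (illustrative)
max = 350 * (1 << 20) # ~350 MB (illustrative)
use Unicorn::WorkerKiller::Oom, min, max

require ::File.expand_path('../config/environment', __FILE__)
run Rails.application
```

Raising the min/max means workers are recycled less often, so fewer requests end up queueing behind a worker that is busy restarting.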
I've added dev.gitlab.org to the description, which has normal latency, so the problems are due to the size of GitLab.com in all its aspects (more users, more issues, larger DB, spread among multiple machines, higher load, etc.).
I'll be digging into more low level operations and setting up metrics there. I suspect performance is affected by either the Azure network in general, or our NFS cluster. As a start to clear things up I'll be instrumenting Rugged (besides Gitlab::Git which we already instrument) just in case we use this somewhere directly. I'll also make sure any other Git related code is instrumented.
In an effort to make local testing of Git operations easier I'll be looking into a way to artificially slow down Git/IO operations. Because I use an SSD and have very little load going on locally, GitLab performs far better on my machine than on GitLab.com, which makes it much harder to determine the impact of certain operations. Basically I just want to add something like sleep(100 ms) before an operation to mimic a slow NFS.
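As an illustration of that idea (not something actually added to GitLab), the sleep could be injected with a small monkey-patch in development:

```ruby
# Illustrative only: artificially delay repository access in development to
# mimic a slow NFS mount. Not part of GitLab itself.
require 'rugged'

module SlowGitIO
  DELAY = 0.1 # 100 ms

  def new(*args)
    sleep(DELAY)
    super
  end
end

# Intercept Rugged::Repository.new so every repository open pays the delay.
Rugged::Repository.singleton_class.prepend(SlowGitIO)
```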
In the meantime I'll continue digging through the existing metrics/tooling/etc. to see what I can find on GitLab.com.
To set up a slow file system I used the following Fish script (after a day of googling/trying things out, etc):
```fish
#!/usr/bin/env fish

sudo tc qdisc del dev lo root
sudo tc qdisc add dev lo root netem delay 2ms

sshfs -o cache=no -f \
    yorickpeterse@localhost:/home/yorickpeterse/Projects/gitlab/repositories \
    /mnt/repositories

sudo tc qdisc del dev lo root
```
This mounts my repositories directory under /mnt/repositories and slows down traffic by 2ms.
Certain database queries run on many pages even when they're not always needed. For example, BroadcastMessage.current was called on a lot of (if not all) pages and performed DB queries when it could easily be cached. @joshfng fixed this in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/2633.
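Conceptually the fix is to cache the lookup instead of querying on every request; a rough sketch (the key name, expiry, and exact caching approach in the MR may differ):

```ruby
class BroadcastMessage < ActiveRecord::Base
  def self.current
    # Serve the active message from the Rails cache instead of querying the
    # database on every request. Key name and expiry are illustrative.
    Rails.cache.fetch('broadcast_message_current', expires_in: 1.minute) do
      where('starts_at <= :now AND ends_at >= :now', now: Time.zone.now).last
    end
  end
end
```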
- Measure simple ping timings from a worker to the NFS and database. ICMP seems to be disabled on either the workers or the NFS, as a ping from worker11 doesn't yield a response.
- Measure read/write timings (e.g. just writing a dumb "Hello world" to a file and reading it back).
- Measure simple Rugged operations (as in, without involving Gitlab::Git) from a worker (e.g. grabbing the latest commit of the master branch).
The idea of the above is to get a better understanding of the baseline latency present on GitLab.com. For example, if reading from a text file already takes 50 ms, more complex Git operations will only end up taking longer.
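A quick way to collect those baseline numbers from a worker could look like this (the text file path is hypothetical; the repository path is the one used in the stress test below):

```ruby
require 'benchmark'
require 'rugged'

# Hypothetical file on the NFS mount; the repository path is the same one
# used in the stress test below.
NFS_FILE  = '/var/opt/gitlab/git-data/latency-test.txt'
REPO_PATH = '/var/opt/gitlab/git-data/repositories/gitlab-org/gitlab-ce.git'

def time_ms(label)
  puts format('%-12s %8.2f ms', label, Benchmark.realtime { yield } * 1000)
end

# Plain write/read round trip to the NFS mount.
time_ms('write') { File.write(NFS_FILE, 'Hello world') }
time_ms('read')  { File.read(NFS_FILE) }

# Simple Rugged operation, without going through Gitlab::Git:
# grab the latest commit on the master branch.
time_ms('rugged') do
  Rugged::Repository.new(REPO_PATH).branches['master'].target
end
```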
Next I decided to put some extra pressure on the NFS while running a benchmark. For this I used the following script:
```ruby
require 'benchmark'
require 'rugged'

path    = '/var/opt/gitlab/git-data/repositories/gitlab-org/gitlab-ce.git'
timings = []
start   = Time.now
stress  = true

threads = 5.times.map do
  Thread.new do
    while stress
      Rugged::Repository.new(path)
    end
  end
end

while Time.now - start <= 10
  timings << Benchmark.measure { Rugged::Repository.new(path) }.real * 1000
end

stress = false
threads.each(&:join)

timings = timings.sort
avg     = timings.inject(:+) / timings.length
min     = timings[0]
max     = timings[-1]

puts "Minimum: #{min.round(8)} ms | Maximum: #{max.round(8)} ms | Average: #{avg.round(8)} ms"
```
This basically starts 5 threads that constantly initialize a repository while the main thread does the same and measures the timings. The idea is to see if/how high load influences timings. This script spits out the following:
Minimum: 0.4011 ms | Maximum: 1703.291789 ms | Average: 8.02635465 ms
Next I adjusted the script to dump the timings to a text file so I could plot it. The raw timings can be found at timings.txt.
This was plotted using the following gnuplot script:
set title "Timings"set xlabel "Iteration"set ylabel "Time (ms)"set rmargin 10set term svg size 1400,500set gridset style line 1 lc rgb "#0060ad" lt 1 lw 1set style data linesplot "/tmp/timings.txt" using 1 ls 1 title "Timings"
The graph looks as follows:
What's interesting here is the recurring spike of ~500 ms. Based on this it looks like the NFS works fine when there's little load but starts to misbehave when load increases.
Fetches => Number of table rows fetched by bitmap scans when argument is a table, or table rows fetched by simple index scans using the index when argument is an index
Almost 1m fetches a second! We should really bring that down.
@sytses some things that might be worth noting, based on some poking around I've done:
- autocomplete_source.json is loaded for every single Issue and Merge Request. For this issue, it took more than a second to load and includes every single issue, user, and merge request in a given project. This can't be good for the site's speed, and it's an extra (in this case) 175KB that shouldn't be loaded unless explicitly needed. For the gitlab-ce repository, it's more than 350KB! Unless I'm missing something that necessitates this, it shouldn't be served on every single page load.
- It looks like a file called notes is loaded every ~15 seconds, which checks for updates to content on the page. This may not cause a huge performance burden, but it's something simple that may be beneficial to change to ~45 second intervals, if it's even necessary at all. I currently have 5 GitLab tabs open, meaning that I'm polling the notes endpoint 20 times per minute. These things add up.
- Emoji are being served in the aforementioned autocomplete_source.json file as well as in an 813KB PNG sprite file. Which are we using, and why are we loading these twice?
- The award emoji picker is loaded inline rather than dynamically with JavaScript. I would think something like that might be used in, at most, 3% of Issue page loads, probably less. We could see slightly faster load times if that were taken out.
- Non-Gravatar avatar images (such as my own) are loaded at their full resolution, usually between 50KB and 150KB per avatar with a resolution around 800px, if not more. The Gravatar images are always less than 10KB, served over HTTP/2, and are only 48px. Some huge potential gains to be had there.
Have you looked into using a CDN (I'd strongly recommend CloudFlare, I use it for my Rails apps)? Could be a huge performance improvement and take significant pressure off your servers. I realize it's not open source, but seeing as the CI Builds run on Digital Ocean and Azure, that's probably not a huge problem.
Sorry this was a little long-winded, or if the tone sounds negative in any way. It's not intentional, it's just late at night here and my brain isn't working at full capacity :P
Testing issue #1 (closed) locally using Sherlock reveals the following:
Out of the ~21 seconds it takes to load this issue (using my slower file system) only 0.45 seconds is spent running SQL queries.
Looking at the line profiler output most time appears to be spent in app/models/repository.rb and the Markdown rendering pipeline. Details below.
app/models/repository.rb
84% of the time (roughly 15 seconds) is spent in Repository#empty? on line 47
9% of the time (roughly 1.6 seconds) is spent in Repository#has_visible_content? on line 60
6% of the time (roughly 1 second) is spent in Repository#root_ref on line 483
The remaining time is spread across a few different lines with most timings taking only a few dozen milliseconds
lib/banzai/filter/relative_link_filter.rb
This file makes the calls to Repository#empty? as well as Repository#exists? in Banzai::Filter::RelativeLinkFilter#linkable_files? on line 32. All the time spent in this file is spent on this line.
Conclusion
So loading issues seems to be slow because it's doing a bunch of Git operations that don't have to be performed for every request. Rugged internally uses git_repository_is_empty to check if a repository is empty (https://libgit2.github.com/libgit2/#HEAD/group/repository/git_repository_is_empty), which as far as I can tell is only true before the first commit is ever pushed. For example, if you follow these steps:
1. Initialize a repo
2. Add a file and commit it
3. Remove said file, commit the changes
4. Run Rugged::Repository.new('path/to/repo').empty?
Then step 4 will return false. As a result of this I think we can just cache the output of Repository#empty? forever.
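A rough Rugged equivalent of those steps (illustrative only; the paths and commit details are made up):

```ruby
require 'rugged'
require 'tmpdir'

dir  = Dir.mktmpdir
repo = Rugged::Repository.init_at(dir)    # 1. initialize a repo
repo.empty?                               # => true, no commits yet

author = { name: 'Example', email: 'example@example.com', time: Time.now }
index  = repo.index

# 2. Add a file and commit it.
File.write(File.join(dir, 'README'), 'hello')
index.add('README')
Rugged::Commit.create(repo,
  author: author, committer: author, message: 'Add README',
  tree: index.write_tree(repo), parents: [], update_ref: 'HEAD')

# 3. Remove said file and commit the change.
index.remove('README')
Rugged::Commit.create(repo,
  author: author, committer: author, message: 'Remove README',
  tree: index.write_tree(repo), parents: [repo.head.target], update_ref: 'HEAD')

# 4. The repository no longer contains any files, yet empty? is still false
#    because it has commits, so the value can safely be cached.
Rugged::Repository.new(dir).empty?        # => false
```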
The output of Repository#has_visible_content? can also be cached as it simply checks if the number of branches is greater than 0. This cache could be flushed whenever the last branch of a repository is removed (we might need an extra hook of sorts for this).
Repository#root_ref can probably also be cached as it simply returns the name of the default branch. The cache for this can be flushed whenever the user changes the default branch. Caching this further reduces the loading time to around 2.8 seconds. The diff would be:
```diff
diff --git a/app/models/repository.rb b/app/models/repository.rb
index e813c94..086e989 100644
--- a/app/models/repository.rb
+++ b/app/models/repository.rb
@@ -44,7 +44,7 @@ class Repository
   end
 
   def empty?
-    raw_repository.empty?
+    cache.fetch(:empty) { raw_repository.empty? }
   end
 
   #
@@ -57,7 +57,9 @@ class Repository
   # This method return true if repository contains some content visible in project page.
   #
   def has_visible_content?
-    raw_repository.branch_count > 0
+    cache.fetch(:has_visible_content) do
+      raw_repository.branch_count > 0
+    end
   end
 
   def commit(id = 'HEAD')
@@ -480,7 +482,7 @@ class Repository
   end
 
   def root_ref
-    @root_ref ||= raw_repository.root_ref
+    @root_ref ||= cache.fetch(:root_ref) { raw_repository.root_ref }
   end
 
   def commit_dir(user, path, message, branch)
```
Created another merge request that makes our cache flushing less aggressive: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/2769. tl;dr previously branch statistics were flushed upon every push (for all branches), now we only flush for branches where it's actually needed. A push to the default branch still results in all branch caches being flushed.
Does this mean it takes 800 milliseconds to initialize a repository? If not, maybe it's best to run through these graphs and their meanings during the devops meeting, saves pinging 17 participants in this issue :)
To test the impact of 8.5 RC1 so far I ran the following query:
select percentile("duration", 95) / 1000 as "duration_95th" from rails_transactions where action = 'Projects::IssuesController#show' and time > '2016-01-11 14:25:00' and time < '2016-02-11 14:25:00' and request_uri = '/gitlab-org/gitlab-ce/issues/1';
In other words, for the 24 hours leading up to the deploy the 95th percentile timing was roughly 11.5 seconds.
Next I ran the following query:
percentile("duration", 95) / 1000 as "duration_95th" from rails_transactions where action = 'Projects::IssuesController#show' and time > '2016-02-11 14:25:00' and request_uri = '/gitlab-org/gitlab-ce/issues/1';
This calculates the 95th percentile since today's 8.5 deploy; the output is as follows:
And for the sake of it, here are the timings when we calculate the mean instead of the 95th percentile:
```
> select mean("duration") / 1000 as "mean" from rails_transactions where action = 'Projects::IssuesController#show' and timestamp > '2016-01-11 14:25:00' and timestamp < '2016-02-11 14:25:00' and request_uri = '/gitlab-org/gitlab-ce/issues/1';
name: rails_transactions
------------------------
time                    mean
0                       8.047509426767022

> select mean("duration") / 1000 as "mean" from rails_transactions where action = 'Projects::IssuesController#show' and time > '2016-02-11 14:25:00' and request_uri = '/gitlab-org/gitlab-ce/issues/1';
name: rails_transactions
------------------------
time                    mean
1455200700000000001     2.933122648561403
```
Per this comment in a thread on switching to GitLab, pulling the same repo from GitHub vs. GitLab, GitLab takes significantly longer:
GitHub:
```
❯❯❯ time git pull origin master
From github.com:CWSpear/local-persist
 * branch            master     -> FETCH_HEAD
Already up-to-date.
git pull origin master  0.02s user 0.04s system 4% cpu 1.396 total
```
GitLab:
```
❯❯❯ time git pull gitlab master
From gitlab.com:CWSpear/local-persist
 * branch            master     -> FETCH_HEAD
Already up-to-date.
git pull gitlab master  0.02s user 0.02s system 0% cpu 6.837 total
```
Interestingly, cloning a new repo from scratch (in this case it isn't a very big repo), GitHub leads by only about half a second. Tested from Colorado.
The most notable difference between the two that I can find is that GitHub has `remote: Total 708 (delta 0), reused 0 (delta 0), pack-reused 708` whereas GitLab has `remote: Total 708 (delta 370), reused 708 (delta 370)`. Not sure what that means exactly, but maybe it makes a difference?
Merge request https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/2987 will make global search around 10x faster (at least). For example, the time for the search query "yorick" went from > 30 seconds to 2-3 seconds. Searching for "git" now takes around 5-6 seconds instead of timing out (meaning it used to take over 2 minutes at least).
Project specific search queries have also been improved. For example, searching for "git" in the GitLab CE repository takes around 2-3 seconds.
This applies to setups using CE's search feature; these changes do not apply to Elasticsearch-based searching.
Merge request gitlab-org/gitlab-ce!3123 (and https://gitlab.com/gitlab-org/gitlab_git/merge_requests/69) will hopefully make displaying commit logs faster. What the final impact is of these merge requests on GitLab.com has yet to be tested though.
Merge request https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/3128 adds some methods for counting and caching the number of branches and tags. Using these methods the loading times of a project's branches page can be improved by up to 1.45x; in my local setup the timings went from 6.65 seconds on average to 4.6 seconds.
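Presumably these methods follow the same cache.fetch pattern as the diff earlier in this issue; a rough sketch (the method and cache key names in the actual MR may differ):

```ruby
# Rough sketch of cached count methods on Repository, following the same
# cache.fetch pattern as the diff earlier in this issue; the actual method
# and cache key names in the MR may differ.
class Repository
  def cached_branch_count
    cache.fetch(:branch_count) { raw_repository.branch_count }
  end

  def cached_tag_count
    cache.fetch(:tag_count) { raw_repository.rugged.tags.count }
  end

  # These caches would have to be flushed whenever branches or tags are
  # created or removed (e.g. from the post-receive hooks).
end
```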
I'm currently looking into the following two issues:
- gitlab-org/gitlab-ce#13651
- gitlab-org/gitlab-ce#14315
The first issue affects viewing individual issues and has a pretty significant impact on performance. The second issue affects basically everything where we render Markdown, though it will probably be quite hard to resolve as the Markdown rendering pipeline is quite complex.
Wanted to note that we don't have GZIP enabled on GitLab.com, from gitlab-org/gitlab-ce#14344:
Per Google PageSpeed, we could reduce this page's size by 1.4MB by enabling gzip. That's a huge difference.
PageSpeed gives that issue page a 37/100 on Desktop and a 31/100 on Mobile. We shouldn't base everything off the PageSpeed stats, but we should aim for at least 70/100 on both. I don't claim to be a security researcher, but from what I can tell leaving gzip disabled is an unnecessary precaution, and I wanted to make sure it was discussed more. Based on gitlab-org/gitlab-ce#4088, I think this can just as easily be mitigated by removing SPDY support.
Merge request gitlab-org/gitlab-ce!3389 was merged and is currently running on GitLab.com along with the other changes that made it into 8.7.0 RC1. I'll continue to monitor performance in the next hour or so to see what the impact is of this RC. After that I'll continue looking into the performance of Issue#referenced_merge_requests and associated code as this appears to currently be the main cause of loading issues being slow.
Next I want to look into storing the pre-rendered Markdown HTML in a database column instead of in Redis, for two reasons:

- The records to render are already fetched, thus the column containing the pre-rendered HTML (if any) is also already there. This removes the need for any extra network calls.
- Data cached in Redis expires after 2 weeks (IIRC); storing this in the DB works around that. The cache expiring means that every now and then we'd have to re-build the HTML for all data involved (e.g. we'd have to rebuild the HTML of all notes in this particular issue). A rough sketch of the idea follows below.
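A minimal sketch of what that could look like for, say, note bodies (the column, callback, and helper names here are hypothetical, not the eventual implementation):

```ruby
# Hypothetical example for note bodies; column and method names are made up.
class AddNoteHtmlToNotes < ActiveRecord::Migration
  def change
    add_column :notes, :note_html, :text
  end
end

class Note < ActiveRecord::Base
  # Render the Markdown once on write and reuse the stored HTML on read.
  # Because the HTML lives in the same row as the note, it is fetched along
  # with the record (no extra round trip) and it never expires.
  before_save :cache_rendered_note

  def rendered_note
    note_html || Banzai.render(note, project: project)
  end

  private

  def cache_rendered_note
    self.note_html = Banzai.render(note, project: project) if note_changed?
  end
end
```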
Since this will take a few days (at least) to implement I'll first look into loading the "Related merge requests" section of an issue asynchronously. While this won't make that part faster it will at least cut down the loading time of the document request when loading an issue.
Git-wise, when viewing individual issues there's still Git code involved that's slow as a snail. For example, Rugged::BranchCollection#each seems to be called somewhere and in the worst case takes around 1.6 seconds to complete. There also seems to be revision parsing going on using Rugged::Repository#rev_parse (taking up to half a second); this is probably used for rendering commit references. Other than that there are a handful of random Git-related methods running, but most only take 20-30 milliseconds with an occasional outlier (again probably used for rendering Git-related references).
Thanks @yorickpeterse, removing Rugged::BranchCollection#each calls in individual issues looks important to me. Good idea to make "Related merge requests" async.
@yorickpeterse I'd be interested in seeing if changes in the next RC make any difference. Are those timings for serverside rendering or clientside? If clientside, I would think the 99th percentile would decrease some with the smaller page size, although I may be wrong.
@connorshea All timings are server side. Due to InfluxDB overwriting data I have to re-verify what the actual impact of RC1 was (as I no longer trust the data we had at that time). Our timings did quite significantly drop the moment we fixed InfluxDB (see https://gitlab.com/gitlab-com/operations/issues/175) so maybe RC1 did have the desired impact.
Worth mentioning: I managed to more or less confirm that on the page used for displaying single issues (Projects::IssuesController#show) Git code is no longer the main offender, at least for 95% of the requests:
Here the little boxes at the top (ignore the colors of the text, I'm still making adjustments to the thresholds) show the 95th percentiles of the various timings. There are still some random spikes (e.g. in one instance Rugged::Repository.new took over 7 seconds) but in 95% of the requests we only seem to spend around 85 milliseconds in either Gitlab::Git or Rugged code.
It seems to me there is a negative correlation between the number of requests and the timings, which I find a bit weird and perhaps even suspicious: the peak number of requests occurs between 17:00 and 18:00, which is also a period of low mean timing. Likewise, the spike in requests around 20:30 or so corresponds to a very low mean timing.
I would instinctively expect that a high number of requests would increase the load on the database and lead to (s)lower performance.
@elygre There's no guarantee that more requests equals higher timings as this all depends on what those requests are doing. There may also be external factors (e.g. requests for other controllers) that can impact timings.
Small status update since it's been a while in this issue:
- https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/3969 adds support for batch retrieval of Markdown references and refactors the code used for getting the participants of an Issuable. This MR as a whole greatly cuts down the number of queries executed when viewing individual issues/merge requests (or any code that happens to get the participants of a Participable).
- I'm working on another set of changes that cut down the number of queries executed in IssueReferenceFilter when rendering issue references. This is still a giant hack so there's no MR available for it yet (I hope to create one tomorrow or so); a simplified sketch of the batching idea follows below.
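The gist of the IssueReferenceFilter change is to collect all referenced issue IIDs from a document first and load them in a single query, instead of issuing one query per reference. A simplified sketch of that idea (not the actual filter code; cross-project references and the real reference pattern are omitted):

```ruby
# Simplified sketch of batching issue reference lookups. The real filter
# works on Nokogiri nodes, is project-aware and handles cross-project
# references; all of that is omitted here.
def issues_for_nodes(project, nodes)
  # Collect every referenced IID (e.g. "#123") from the document in one pass.
  iids = nodes.flat_map { |node| node.text.scan(/#(\d+)/) }.flatten.map(&:to_i).uniq

  # One query for all references instead of one query per reference.
  project.issues.where(iid: iids).index_by(&:iid)
end

# Rendering a reference then becomes a hash lookup:
#   issues = issues_for_nodes(project, doc.search('text()'))
#   issue  = issues[iid]
```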
@sytses Yes, they landed in RC1. In various cases query counts dropped significantly, though timings were not reduced as much as I hoped. Having said that we did have the noise of Unicorn workers being killed very quickly and RC1 introduced some changes that ended up running extra SQL queries for every note rendered.
@sytses that's the average over 7 days; the current numbers are down to ~3.5 seconds after the memory change that happened last Thursday. So I'm expecting that number to drop below 4s, to roughly 3.5s, in the coming days.