- Aug 31, 2016
-
-
Douwe Maan authored
Parse Git attribute files using Ruby

Commit 340e111e contains all the details. It's quite the read so the short summary is:

> Rugged is slow as heck because it runs multiple IO calls every time you request a set of Git attributes. gitlab_git now provides a pure Ruby parser that avoids this and is between 4 and 6 times faster.

Here's a Grafana screenshot to show how bad it can get: [Grafana screenshot]

See https://gitlab.com/gitlab-org/gitlab-ce/issues/10785 for more information.

See merge request !121
-
Yorick Peterse authored
Rugged provides a way of parsing Git attribute files such as the one located in $GIT_DIR/info/attributes. Per GitLab's performance monitoring tools quite a lot of time can be spent in parsing/retrieving attributes. This commit introduces a pure Ruby parser for gitlab_git that performs drastically better than the one provided by Rugged.

== Production Timings

As an example, take the commit https://gitlab.com/nrclark/dummy_project/commit/81ebdea5df2fb42e59257cb3eaad671a5c53ca36 (as taken from https://gitlab.com/gitlab-org/gitlab-ce/issues/10785). When loading this commit we spend between 4 and 6 seconds in Rugged::Repository#fetch_attributes. This method is called around 1100 times. This is the result of two problems:

1. For every diff we call Gitlab::Git::Repository#diffable? and pass it a blob. This method in turn returns a boolean (based on the Git attributes for the blob's path) indicating if the content is diffable.
2. For every diff we use the GitLab class Gitlab::Highlight which calls Repository#gitattribute in the #custom_language method. This is used to determine what language to use for highlighting a diff.

As a result, in the worst case we'll end up with 2 calls per diff to Gitlab::Git::Repository#attributes (previously delegated to Rugged::Repository#attributes).

== Rugged Implementation

Rugged in turn implements the "attributes" method in a rather inefficient way. The first time this method is called it will run at least a single open() call to open the file. On top of that it appears to run 2 stat() calls for every call to Rugged::Repository#attributes.

In other words, if you call it 100 times you will end up with 201 IO calls:

* 200 stat() calls
* 1 open() call

== Rugged IO Overhead

To confirm the IO overhead of Rugged I created the following script (saved as "confirm.rb"):

    require 'rugged'

    path = '/tmp/test/.git'
    repo = Rugged::Repository.new(path)

    10.times do
      repo.attributes('README.md')['gitlab-language']
    end

I then ran this as follows:

    strace -f ruby confirm.rb 2>&1 | grep -i 'info/attributes' | wc -l

This counts the number of IO calls that refer to the "$GIT_DIR/info/attributes" file. The output is "21", meaning 21 IO calls were executed (2 stat() calls for each of the 10 iterations, plus 1 open() call).

While this may not be a big problem when using physical storage (even less so when using SSDs), this _will_ be a problem when using network storage. For example, say every operation takes 2 milliseconds to complete. With the 201 IO calls from the example above this would result in _at least_ 400 milliseconds being spent in _just_ the IO operations.

The Ruby parser on the other hand only uses a single open() IO call.

== Benchmarking

To measure the performance of this code I wrote the following benchmark:

    require 'rugged'
    require 'benchmark/ips'
    require_relative 'lib/gitlab_git/attributes'

    repo = Rugged::Repository.new('/tmp/test/.git')
    attr = Gitlab::Git::Attributes.new(repo.path)

    Benchmark.ips(time: 10) do |bench|
      bench.report 'Rugged' do
        repo.attributes('test.haml.html')['gitlab-language']
      end

      bench.report 'gitlab_git' do
        attr.attributes('test.haml.html')['gitlab-language']
      end

      bench.compare!
    end

The contents of /tmp/test/.git/info/attributes are as follows:

    # This is a comment, it should be ignored.

    *.txt     text
    *.jpg     -text
    *.sh      eol=lf gitlab-language=shell
    *.haml.*  gitlab-language=haml
    foo/bar.* foo
    *.cgi     key=value?p1=v1&p2=v2

    # This uses a tab instead of spaces to ensure the parser also supports this.
    *.md	gitlab-language=markdown

Running this benchmark on my development environment produces the following output:

    Warming up --------------------------------------
                  Rugged     9.543k i/100ms
              gitlab_git    43.277k i/100ms
    Calculating -------------------------------------
                  Rugged    100.261k (± 2.0%) i/s -      1.012M in  10.093380s
              gitlab_git    482.186k (± 1.7%) i/s -      4.847M in  10.055286s

    Comparison:
              gitlab_git:   482185.6 i/s
                  Rugged:   100260.6 i/s - 4.81x slower

The exact output depends on system load but usually the new Ruby based parser is between 4 and 6 times faster than Rugged.

To further test this I wrote the following benchmark:

    require 'benchmark'

    amount = 5000
    rugged = Rugged::Repository.new('/var/opt/gitlab/git-data-ceph/repositories/gitlab-org/gitlab-ce.git')
    attrs = Gitlab::Git::Attributes.new(rugged.path)

    rugged = amount.times.map do
      timing = Benchmark.measure do
        rugged.attributes('README.md').to_h
      end

      timing.real * 1000.0
    end

    ruby = amount.times.map do
      timing = Benchmark.measure do
        attrs.attributes('README.md')
      end

      timing.real * 1000.0
    end

    puts "Rugged: #{rugged.inject(:+)} ms"
    puts "Ruby: #{ruby.inject(:+)} ms"

This script uses Rugged and the new attributes parser, looks up the attributes for the same path 5000 times with each, and then sums the processing time. Running this script on worker1 produced the following output:

    Rugged: 131.95287296548486 ms
    Ruby: 30.17003694549203 ms

Here the Ruby based solution is around 4.5 times faster than Rugged.

== Further Improvements

GitLab may decide to at some point cache the parsed data structures in for example Redis, which is now possible due to them being proper Ruby data structures. Note that this is only really beneficial in cases where Git attributes are requested for the same file path in different requests.

This also requires careful cache invalidation. For example, we don't want to invalidate the entire cache when modifying some unrelated file.

Because of the complexity involved it's best to leave this for later and only implement it once we're certain it will actually be beneficial.
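
The approach described in this commit (read the attributes file once, keep the parsed rules as plain Ruby data, and answer lookups without further IO) can be sketched roughly as follows. This is only an illustration and not the code from gitlab_git: the class name `SimpleAttributes`, its methods, and the simplified pattern matching via `File.fnmatch?` are assumptions made for this example.

```ruby
require 'tempfile'

# Hypothetical sketch of a pure Ruby attributes parser; not the gitlab_git code.
class SimpleAttributes
  # Reads and parses the attributes file once; later lookups do no IO.
  def initialize(path)
    @rules = []
    return unless File.file?(path)

    File.foreach(path) do |line|
      line = line.strip
      next if line.empty? || line.start_with?('#')

      # Pattern and attributes may be separated by spaces or tabs.
      pattern, *attrs = line.split(/\s+/)
      @rules << [pattern, parse_attributes(attrs)]
    end
  end

  # Merges the attributes of every matching rule; later rules win.
  def attributes(file_path)
    @rules.each_with_object({}) do |(pattern, attrs), result|
      result.merge!(attrs) if match?(pattern, file_path)
    end
  end

  private

  # "name" => true, "-name" => false, "name=value" => "value"
  def parse_attributes(attrs)
    attrs.each_with_object({}) do |attr, hash|
      if attr.start_with?('-')
        hash[attr[1..-1]] = false
      elsif attr.include?('=')
        key, value = attr.split('=', 2)
        hash[key] = value
      else
        hash[attr] = true
      end
    end
  end

  # Simplified matching (an assumption of this sketch): patterns without a
  # slash are matched against the basename only.
  def match?(pattern, file_path)
    target = pattern.include?('/') ? file_path : File.basename(file_path)
    File.fnmatch?(pattern, target, File::FNM_PATHNAME | File::FNM_DOTMATCH)
  end
end

# Usage: write a small attributes file and query it.
file = Tempfile.new('attributes')
file.write("# comment\n*.jpg -text\n*.haml.* gitlab-language=haml\n*.md\tgitlab-language=markdown\n")
file.close

attrs = SimpleAttributes.new(file.path)
puts attrs.attributes('test.haml.html')['gitlab-language']
```

Because the parsed rules are an ordinary array of hashes, they could also be serialized for caching, which is the point made under "Further Improvements".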
- Aug 29, 2016
-
-
Yorick Peterse authored
Add Repository#find_branch to speed up branch lookups See merge request !119
- Aug 27, 2016
-
-
Stan Hu authored
-
Stan Hu authored
-
Stan Hu authored
refs DB See https://gitlab.com/gitlab-org/gitlab-ce/issues/15392#note_14538333
-
Stan Hu authored
A common call in GitLab is to look up a single branch, but previously this was done by calling Repository#branches, which loads all the branches into memory unnecessarily and causes many filesystem accesses for each branch. With Repository#find_branch, we can do a direct lookup for the branch we care about.
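
The difference between scanning all branches and a direct lookup can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `FakeRefStore` stands in for the repository's ref storage and simply counts reads, and `RefBackedRepository` only mimics the shape of the two methods named in the commit message.

```ruby
# Hypothetical stand-in for a repository's ref storage; counts ref reads.
class FakeRefStore
  attr_reader :reads

  def initialize(refs)
    @refs = refs # e.g. { 'master' => 'sha', ... }
    @reads = 0
  end

  def names
    @refs.keys
  end

  def read(name)
    @reads += 1
    @refs[name]
  end
end

Branch = Struct.new(:name, :target)

# Hypothetical repository wrapper, not the gitlab_git class.
class RefBackedRepository
  def initialize(store)
    @store = store
  end

  # Loads every branch, reading each ref along the way.
  def branches
    @store.names.map { |name| Branch.new(name, @store.read(name)) }
  end

  # Reads only the requested ref; returns nil when it does not exist.
  def find_branch(name)
    target = @store.read(name)
    target && Branch.new(name, target)
  end
end

refs = Hash[(1..100).map { |i| ["branch-#{i}", "sha-#{i}"] }]

scan_store = FakeRefStore.new(refs)
RefBackedRepository.new(scan_store).branches.find { |b| b.name == 'branch-42' }
puts scan_store.reads # prints 100

direct_store = FakeRefStore.new(refs)
RefBackedRepository.new(direct_store).find_branch('branch-42')
puts direct_store.reads # prints 1
```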
-
Stan Hu authored
Fix broken specs caused by update to gitlab-git-test Taken from https://gitlab.com/gitlab-org/gitlab_git/merge_requests/118 See merge request !120
-
Stan Hu authored
Taken from https://gitlab.com/gitlab-org/gitlab_git/merge_requests/118
-
- Aug 17, 2016
-
-
Yorick Peterse authored
Remove unneeded call to Repository#root_ref in #log See merge request !117
-
Ahmad Sherif authored
From our monitoring data, it seems that Repository#root_ref can be slow sometimes (probably because it involves iterating over all branches), and there's no need to have it in the `default_options` hash since a similar effect is achieved in `actual_ref = options[:ref] || root_ref` below, and subsequent calls don't need a `:ref` key in the passed options.
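
A minimal sketch of why dropping the `:ref` default helps, assuming a simplified `log` method: with `||`, `root_ref` is only evaluated when the caller did not pass a `:ref`. The class `LogSketch`, the call counter, and the `:limit`/`:offset` defaults are made up for this illustration; only `default_options`, `root_ref`, and the `actual_ref` expression come from the commit message.

```ruby
# Hypothetical sketch; not the gitlab_git Repository class.
class LogSketch
  attr_reader :root_ref_calls

  def initialize
    @root_ref_calls = 0
  end

  # Stand-in for a potentially slow lookup.
  def root_ref
    @root_ref_calls += 1
    'master'
  end

  def log(options = {})
    # No :ref key here, so root_ref is not called just to build defaults.
    default_options = { limit: 10, offset: 0 }
    options = default_options.merge(options)

    # root_ref is only evaluated when no :ref was given.
    actual_ref = options[:ref] || root_ref
    actual_ref
  end
end

repo = LogSketch.new
repo.log(ref: 'feature')
puts repo.root_ref_calls # prints 0
repo.log
puts repo.root_ref_calls # prints 1
```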
- Aug 15, 2016
-
-
Douwe Maan authored
Optimize fetch of the author and committer of a Rugged commit Closes #26 See merge request !116
- Aug 12, 2016
-
-
Alejandro Rodríguez authored
-
- Aug 08, 2016
-
-
Yorick Peterse authored
We're using Forwardable so we need to require it See merge request !92
-
- Aug 05, 2016
-
-
Yorick Peterse authored
Compare returns an empty collection of commits on nil refs See merge request !114
-
Paco Guzman authored
-
Rémy Coutable authored
Write .gitattributes in binary mode to prevent Rails from converting ASCII-8BIT to UTF-8

This avoids Sidekiq errors in the PostReceive task due to .gitattributes files having ISO-8859 characters, such as:

```
Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8
```

Closes gitlab-org/gitlab-ce#20647

See merge request !115
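
The failure mode and the fix can be reproduced in isolation. The sketch below is not the gitlab_git code: the helper name `write_attributes` and the sample content are assumptions. It shows that converting binary (ASCII-8BIT) content with non-ASCII bytes to UTF-8 raises, while writing the same bytes to a file opened in binary mode (`'wb'`) copies them unchanged.

```ruby
require 'tmpdir'

# Hypothetical helper: write content byte-for-byte, with no transcoding.
def write_attributes(path, content)
  File.open(path, 'wb') { |file| file.write(content) }
end

# Binary (ASCII-8BIT) content containing non-ASCII bytes.
content = "*.txt text # caf\xC3\xA9\n".b

# Converting such a string to UTF-8 is the step that fails.
begin
  content.encode('UTF-8')
rescue Encoding::UndefinedConversionError => error
  puts "conversion failed: #{error.class}"
end

# Binary mode avoids the conversion entirely.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'attributes')
  write_attributes(path, content)
  puts File.binread(path) == content # prints true
end
```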
-
Stan Hu authored
Closes gitlab-org/gitlab-ce#20647
- Aug 04, 2016
-
-
Yorick Peterse authored
Lazy load compare commits See merge request !113
- Aug 03, 2016
-
-
Paco Guzman authored
-
- Aug 02, 2016
-
-
Stan Hu authored
-
Ahmad Sherif authored
-
Yorick Peterse authored
Add deltas_only option for DiffCollection See merge request !109
- Jul 29, 2016
-
-
Ahmad Sherif authored
It helps avoid loading the actual patch (which can consume lots of memory) when not needed.
-
- Jul 28, 2016
-
-
Yorick Peterse authored
Improve performance of a decorated DiffCollection instance See merge request !108
-
Paco Guzman authored
-
- Jul 26, 2016
-
-
Douwe Maan authored
Add forwardable require to Repository This class makes use of the Forwardable stdlib but didn't require it explicitly. See merge request !107
-
Robert Speicher authored
This class makes use of the Forwardable stdlib but didn't require it explicitly.
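
For context, a short sketch of the pattern this change concerns: a class that extends `Forwardable` should require the library itself rather than rely on something else having loaded it. The class `BranchList` and the delegated methods are made up for this example.

```ruby
# Explicit require, so the class does not depend on load order elsewhere.
require 'forwardable'

# Hypothetical class; not part of gitlab_git.
class BranchList
  extend Forwardable

  # Forward these calls to the wrapped array.
  def_delegators :@names, :size, :empty?, :first

  def initialize(names)
    @names = names
  end
end

list = BranchList.new(%w[master stable])
puts list.size  # prints 2
puts list.first # prints master
```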
-
- Jul 22, 2016
-
-
Robert Speicher authored
Test against Ruby 2.3 and use caching of gems See merge request !106
-
Zeger-Jan van de Weg authored
-