- Oct 25, 2016
-
-
Alejandro Rodríguez authored
!103 introduced an optimization where a Ref's target would point to the dereferenced object. This saved us a lot of operations, but it became inconsistent with Rugged's API, where target always references the immediate object pointed at. These changes add a new property, dereference_target, so that we stay consistent with Rugged but can still get the dereferenced target efficiently.
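The difference between the two properties can be illustrated with a minimal sketch. The `Commit`, `TagAnnotation` and `Ref` structs below are hypothetical stand-ins, not the gitlab_git or Rugged classes; only the property name `dereference_target` is taken from the message above.

```ruby
# Hypothetical stand-ins for Git objects; not the gitlab_git or Rugged classes.
Commit = Struct.new(:sha)
TagAnnotation = Struct.new(:sha, :target) # an annotated tag points at another object

Ref = Struct.new(:name, :target) do
  # Follow tag annotations until a non-tag object is reached.
  def dereference_target
    object = target
    object = object.target while object.is_a?(TagAnnotation)
    object
  end
end

commit = Commit.new('c0ffee')
annotation = TagAnnotation.new('7a9', commit)
ref = Ref.new('refs/tags/v1.0', annotation)

ref.target             # the immediate object: the tag annotation
ref.dereference_target # the commit the tag ultimately points at
```

For a branch, which points directly at a commit, both properties would return the same object in this sketch.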
-
- Oct 21, 2016
-
-
Alejandro Rodríguez authored
Before 10.6.9, if a Rugged diff met the conditions for being too large (which obviously meant it was also collapsible), too-large would take precedence. This restores that behavior, since gitlab-ce depends on it.
-
- Oct 12, 2016
-
-
Yorick Peterse authored
This prevents us from raising a hard error when parsing Git attribute files that are encoded using an invalid/unsupported encoding. This in turn means a GitLab user won't be presented with an HTTP 500 error. Fixes #28
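One way to avoid a hard error is to stop reading at the first line whose bytes are not valid in the expected encoding. The sketch below is an assumption about how such a guard could look, not the code from this commit; `each_valid_line` is a hypothetical helper name.

```ruby
require 'tempfile'

# Yields each stripped line of the file at `path`, stopping at the first line
# whose bytes are not valid UTF-8 instead of raising later during parsing.
def each_valid_line(path)
  return enum_for(:each_valid_line, path) unless block_given?

  File.open(path, 'r:UTF-8') do |handle|
    handle.each_line do |line|
      break unless line.valid_encoding?

      yield line.strip
    end
  end
end

file = Tempfile.new('attributes')
file.binmode
file.write("*.txt text\n")
file.write("*.bin \xFF\xFE\n".b) # bytes that are not valid UTF-8
file.close

lines = each_valid_line(file.path).to_a # only the first, valid line survives
```

Skipping only the offending line (using `next` instead of `break`) would be an equally plausible design; which one the commit chose is not stated in the message.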
-
- Oct 07, 2016
-
-
Alejandro Rodríguez authored
Our diff thresholds for pruning are fairly small, but we used to iterate over *each* line, even of a 100 MB patch, just to check these small thresholds. Now we stop as soon as any of the limits is met.
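The early exit can be sketched as follows. The limits and the `over_limits?` helper are hypothetical and chosen only for illustration; the real thresholds live in gitlab_git.

```ruby
# Hypothetical limits; the real thresholds may differ.
MAX_LINES = 3
MAX_BYTES = 1_000

# Returns true as soon as either limit is exceeded, without visiting the
# remaining lines. `lines` can be any Enumerable, including a lazy one.
def over_limits?(lines, max_lines: MAX_LINES, max_bytes: MAX_BYTES)
  line_count = 0
  byte_count = 0

  lines.each do |line|
    line_count += 1
    byte_count += line.bytesize
    return true if line_count > max_lines || byte_count > max_bytes
  end

  false
end

# A patch with a million lines that records how many lines were actually read.
visited = 0
huge_patch = Enumerator.new do |yielder|
  1_000_000.times do |i|
    visited += 1
    yielder << "+line #{i}\n"
  end
end

result = over_limits?(huge_patch) # stops after reading the fourth line
```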
-
- Sep 29, 2016
-
-
Dmitriy Zaporozhets authored
Signed-off-by: Dmitriy Zaporozhets <dmitriy.zaporozhets@gmail.com>
-
- Sep 22, 2016
-
-
James Lopez authored
Also added relevant spec with broken repo.
-
- Sep 14, 2016
-
-
Yorick Peterse authored
This adds Git attribute parser support for file paths that don't contain any attributes.
-
- Sep 10, 2016
-
-
tiagonbotelho authored
-
- Sep 08, 2016
-
-
Yorick Peterse authored
Prior to this commit the DiffCollection class was responsible for checking if a diff had to be collapsed or was too large to be displayed altogether. This commit changes both DiffCollection and Diff so that Diff itself checks if it's too large or has to be collapsed. These checks happen when the Diff is being initialised.

The patch size is based on the size of every line in every hunk of the diff, instead of relying on the diff as a string including diff markers.

DiffCollection still has an extra check to collapse diffs when it has iterated over too many files. Since this is unrelated to the actual sizes this has been kept as-is.

For binary files no pruning takes place as the diffs for these files are not displayed. In the past the size of a diff was reported based on the diff's size (including metadata). If we were to use the actual file's size a diff would be marked as being too large and in the case of an image would never be displayed.
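A rough sketch of a diff that prunes itself on initialisation is shown below. The class name `SketchDiff`, its constructor and the 10 KB / 100 KB limits are placeholders for illustration, not the actual gitlab_git Diff class.

```ruby
# Hypothetical sketch; class name, limits and constructor are illustrative only.
class SketchDiff
  COLLAPSE_LIMIT = 10 * 1024  # bytes
  SIZE_LIMIT     = 100 * 1024 # bytes

  attr_reader :lines

  # `hunks` is an array of hunks, each an array of line contents
  # (without the leading "+", "-" or " " diff markers).
  def initialize(hunks, binary: false)
    @too_large = false
    @collapsed = false
    @lines = hunks.flatten

    # Binary diffs are never displayed, so they are not pruned.
    prune!(@lines.sum(&:bytesize)) unless binary
  end

  def too_large?
    @too_large
  end

  def collapsed?
    @collapsed
  end

  private

  # The too-large check comes first, so it takes precedence over collapsing.
  def prune!(size)
    if size >= SIZE_LIMIT
      @too_large = true
      @lines = []
    elsif size >= COLLAPSE_LIMIT
      @collapsed = true
      @lines = []
    end
  end
end

small  = SketchDiff.new([%w[a b]])
medium = SketchDiff.new([['x' * (20 * 1024)]])
large  = SketchDiff.new([['x' * (200 * 1024)]])
binary = SketchDiff.new([['x' * (200 * 1024)]], binary: true)
```

Because the state is computed once in the constructor, a collection iterating over many diffs only has to read `too_large?` and `collapsed?`.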
-
- Sep 06, 2016
-
-
Yorick Peterse authored
When the "binary" option is set to true the "diff" option is to be set to false automatically.
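As an illustration, the rule could be applied right after the attributes of a line have been parsed. `parse_attributes` is a hypothetical helper written for this sketch, not the parser's real method.

```ruby
# Hypothetical helper: parses the attribute part of one line into a Hash.
def parse_attributes(string)
  values = {}

  string.split.each do |chunk|
    if chunk.include?('=')
      key, value = chunk.split('=', 2)
      values[key] = value
    elsif chunk.start_with?('-')
      values[chunk[1..-1]] = false
    else
      values[chunk] = true
    end
  end

  # "binary" implies that the file should not be diffed.
  values['diff'] = false if values['binary'] == true

  values
end

binary_attrs = parse_attributes('binary')       # binary is true, diff is false
text_attrs   = parse_attributes('eol=lf -text') # eol is "lf", text is false
```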
-
- Sep 05, 2016
-
- Sep 01, 2016
-
-
Yorick Peterse authored
Previously using Gitlab::Git::Attributes when $GIT_DIR/info/attributes didn't exist would result in a runtime error. This commit fixes this so an empty Hash is produced instead.
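A minimal sketch of such a guard, assuming a hypothetical `load_attribute_patterns` helper that maps each pattern to its raw attribute string:

```ruby
require 'tmpdir'

# Hypothetical loader: maps each pattern to its raw attribute string, or
# returns an empty Hash when $GIT_DIR/info/attributes does not exist.
def load_attribute_patterns(git_dir)
  path = File.join(git_dir, 'info', 'attributes')
  return {} unless File.exist?(path)

  File.foreach(path).each_with_object({}) do |line, patterns|
    line = line.strip
    next if line.empty? || line.start_with?('#')

    pattern, attributes = line.split(/\s+/, 2)
    patterns[pattern] = attributes.to_s
  end
end

# A fresh directory has no info/attributes file, so the result is empty.
empty_result = Dir.mktmpdir { |git_dir| load_attribute_patterns(git_dir) }
```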
-
- Aug 31, 2016
-
-
Yorick Peterse authored
Rugged provides a way of parsing Git attribute files such as the one located in $GIT_DIR/info/attributes. Per GitLab's performance monitoring tools quite a lot of time can be spent in parsing/retrieving attributes. This commit introduces a pure Ruby parser for gitlab_git that performs drastically better than the one provided by Rugged.

== Production Timings

As an example, take the commit https://gitlab.com/nrclark/dummy_project/commit/81ebdea5df2fb42e59257cb3eaad671a5c53ca36 (as taken from https://gitlab.com/gitlab-org/gitlab-ce/issues/10785). When loading this commit we spend between 4 and 6 seconds in Rugged::Repository#fetch_attributes. This method is called around 1100 times. This is the result of two problems:

1. For every diff we call Gitlab::Git::Repository#diffable? and pass it a blob. This method in turn returns a boolean (based on the Git attributes for the blob's path) indicating if the content is diffable.
2. For every diff we use the GitLab class Gitlab::Highlight which calls Repository#gitattribute in the #custom_language method. This is used to determine what language to use for highlighting a diff.

As a result, in the worst case we'll end up with 2 calls per diff to Gitlab::Git::Repository#attributes (previously delegated to Rugged::Repository#attributes).

== Rugged Implementation

Rugged in turn implements the "attributes" method in a rather inefficient way. The first time this method is called it will run at least a single open() call to open the file. On top of that it appears to run 2 stat() calls for every call to Rugged::Repository#attributes. In other words, if you call it 100 times you will end up with 201 IO calls:

* 200 stat() calls
* 1 open() call

== Rugged IO Overhead

To confirm the IO overhead of Rugged I created the following script (saved as "confirm.rb"):

    require 'rugged'

    path = '/tmp/test/.git'
    repo = Rugged::Repository.new(path)

    10.times do
      repo.attributes('README.md')['gitlab-language']
    end

I then ran this as follows:

    strace -f ruby confirm.rb 2>&1 | grep -i 'info/attributes' | wc -l

This counts the number of IO calls that refer to the "$GIT_DIR/info/attributes" file. The output is "21", meaning 21 IO calls were executed.

While this may not be a big problem when using physical storage (even less so when using SSDs), this _will_ be a problem when using network storage. For example, say every operation takes 2 milliseconds to complete. For the 201 IO calls above this would result in _at least_ 400 milliseconds being spent in _just_ the IO operations.

The Ruby parser on the other hand only uses a single open() IO call.

== Benchmarking

To measure the performance of this code I wrote the following benchmark:

    require 'rugged'
    require 'benchmark/ips'
    require_relative 'lib/gitlab_git/attributes'

    repo = Rugged::Repository.new('/tmp/test/.git')
    attr = Gitlab::Git::Attributes.new(repo.path)

    Benchmark.ips(time: 10) do |bench|
      bench.report 'Rugged' do
        repo.attributes('test.haml.html')['gitlab-language']
      end

      bench.report 'gitlab_git' do
        attr.attributes('test.haml.html')['gitlab-language']
      end

      bench.compare!
    end

The contents of /tmp/test/.git/info/attributes are as follows:

    # This is a comment, it should be ignored.

    *.txt     text
    *.jpg     -text
    *.sh      eol=lf gitlab-language=shell
    *.haml.*  gitlab-language=haml
    foo/bar.* foo
    *.cgi     key=value?p1=v1&p2=v2

    # This uses a tab instead of spaces to ensure the parser also supports this.
    *.md	gitlab-language=markdown

Running this benchmark on my development environment produces the following output:

    Warming up --------------------------------------
                  Rugged     9.543k i/100ms
              gitlab_git    43.277k i/100ms
    Calculating -------------------------------------
                  Rugged    100.261k (± 2.0%) i/s -      1.012M in  10.093380s
              gitlab_git    482.186k (± 1.7%) i/s -      4.847M in  10.055286s

    Comparison:
              gitlab_git:   482185.6 i/s
                  Rugged:   100260.6 i/s - 4.81x slower

The exact output differs depending on system load, but usually the new Ruby based parser is between 4 and 6 times faster than Rugged.

To further test this I wrote the following benchmark:

    require 'benchmark'

    amount = 5000
    rugged = Rugged::Repository.new('/var/opt/gitlab/git-data-ceph/repositories/gitlab-org/gitlab-ce.git')
    attrs = Gitlab::Git::Attributes.new(rugged.path)

    rugged = amount.times.map do
      timing = Benchmark.measure do
        rugged.attributes('README.md').to_h
      end

      timing.real * 1000.0
    end

    ruby = amount.times.map do
      timing = Benchmark.measure do
        attrs.attributes('README.md')
      end

      timing.real * 1000.0
    end

    puts "Rugged: #{rugged.inject(:+)} ms"
    puts "Ruby: #{ruby.inject(:+)} ms"

This script uses Rugged and the new attributes parser, parses the same attributes file 5000 times, and then sums the total processing time. Running this script on worker1 produced the following output:

    Rugged: 131.95287296548486 ms
    Ruby: 30.17003694549203 ms

Here the Ruby based solution is around 4.4 times faster than Rugged.

== Further Improvements

GitLab may decide to at some point cache the parsed data structures in for example Redis, which is now possible due to them being proper Ruby data structures. Note that this is only really beneficial in cases where Git attributes are requested for the same file path in different requests.

This also requires careful cache invalidation. For example, we don't want to invalidate the entire cache when modifying some unrelated file. Because of the complexity involved it's best to leave this for later and only implement it once we're certain it will actually be beneficial.
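To illustrate why a pure Ruby parser needs only one open() call, here is a small sketch that reads the file once and answers every later lookup from memory. The `CachedAttributes` class is hypothetical and is not the Gitlab::Git::Attributes implementation; matching with `File.fnmatch?` is a simplification of Git's real pattern rules.

```ruby
require 'tempfile'

# Hypothetical sketch: parses the attributes file on first use and keeps the
# result in memory, so repeated lookups cause no further IO.
class CachedAttributes
  def initialize(path)
    @path = path
    @patterns = nil
  end

  # Merges the attributes of every matching pattern; later lines win.
  def attributes(file_path)
    patterns.each_with_object({}) do |(pattern, values), merged|
      merged.merge!(values) if File.fnmatch?(pattern, file_path)
    end
  end

  private

  def patterns
    @patterns ||= parse
  end

  # Returns an array of [pattern, attributes Hash] pairs in file order.
  def parse
    File.readlines(@path).each_with_object([]) do |line, result|
      line = line.strip
      next if line.empty? || line.start_with?('#')

      pattern, *chunks = line.split
      result << [pattern, chunks.map { |chunk| parse_chunk(chunk) }.to_h]
    end
  end

  def parse_chunk(chunk)
    return chunk.split('=', 2) if chunk.include?('=')
    return [chunk[1..-1], false] if chunk.start_with?('-')

    [chunk, true]
  end
end

file = Tempfile.new('attributes')
file.write("# comment\n*.sh eol=lf gitlab-language=shell\n" \
           "*.haml.* gitlab-language=haml\nfoo/bar.*\n*.md\tgitlab-language=markdown\n")
file.close

attrs = CachedAttributes.new(file.path)
haml_language = attrs.attributes('test.haml.html')['gitlab-language']

file.unlink # the file is gone, but the parsed patterns are still in memory
markdown_language = attrs.attributes('README.md')['gitlab-language']
```

Since the parsed result is a plain array of strings and hashes, it could also be serialised into an external cache, which is the direction the "Further Improvements" section hints at.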
-
- Aug 27, 2016
-
-
Stan Hu authored
-
Stan Hu authored
-
Stan Hu authored
A common call in GitLab is to look up a single branch, but previously this was done by calling Repository#branches, which loads all the branches into memory unnecessarily and causes many filesystem accesses for each branch. With Repository#find_branch, we can do a direct lookup for the branch we care about.
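The cost difference can be sketched with an in-memory stand-in that counts how many branch objects are built. `SketchRepository` is hypothetical and does not use Rugged; only the method names `branches` and `find_branch` mirror the message above.

```ruby
# Hypothetical in-memory stand-in for a repository's refs; it only counts how
# many branch objects get materialised.
class SketchRepository
  Branch = Struct.new(:name, :sha)

  attr_reader :loaded

  def initialize(refs)
    @refs = refs # e.g. { 'refs/heads/master' => 'abc123' }
    @loaded = 0
  end

  # Builds an object for every branch, as the old lookup path did.
  def branches
    @refs.map { |ref, sha| build(ref, sha) }
  end

  # Looks up a single ref by its full name and builds only that branch.
  def find_branch(name)
    ref = "refs/heads/#{name}"
    sha = @refs[ref]
    sha && build(ref, sha)
  end

  private

  def build(ref, sha)
    @loaded += 1
    Branch.new(ref.sub('refs/heads/', ''), sha)
  end
end

refs = (1..1000).map { |i| ["refs/heads/branch-#{i}", format('%040x', i)] }.to_h

slow = SketchRepository.new(refs)
slow_branch = slow.branches.find { |branch| branch.name == 'branch-500' }
slow.loaded # 1000 branch objects built

fast = SketchRepository.new(refs)
fast_branch = fast.find_branch('branch-500')
fast.loaded # 1 branch object built
```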
-
- Aug 17, 2016
-
- Aug 12, 2016
-
-
Alejandro Rodríguez authored
-
- Aug 05, 2016
-
-
Paco Guzman authored
-
Stan Hu authored
Closes gitlab-org/gitlab-ce#20647
-
- Aug 03, 2016
-
-
Paco Guzman authored
-
- Jul 28, 2016
-
-
Paco Guzman authored
-
- Jul 21, 2016
-
-
Alejandro Rodríguez authored
-
- Jul 19, 2016
-
-
Alejandro Rodríguez authored
- Jul 18, 2016
-
-
Alejandro Rodríguez authored
Notice that in `spec/tag_spec.rb` I changed the sha because previously they referenced the tag annotation commit. With the refactor, the sha must be the one of the commit that the tag actually points at.
-
Paco Guzman authored
We want to give the user the opportunity to see more files while preserving safe limits on rendering diffs. For files bigger than 10KB (we consider 100KB too large), their lines are not taken into account when deciding whether the collection overflows, and the user can decide whether to expand them on click. To keep good performance we start collapsing all the files once we are over the safe limits. To do that there are two methods, `collapsed?` and `collapse`: the first tells whether the diff is quite large, the other is used when the collection collapsed it while iterating as a whole. It's quite similar to the `too_large` concept but gives us a little bit more flexibility.
-
Valery Sizov authored
-
- Jul 06, 2016
-
-
tiagonbotelho authored
-
tiagonbotelho authored
-
tiagonbotelho authored
-
tiagonbotelho authored
- Jul 05, 2016
-
-
Robert Schilling authored
-
- Jun 23, 2016
- Jun 22, 2016
-
-
Stan Hu authored
The encode! method can add bytes to or remove bytes from the data for a number of reasons (e.g. bad encoding). We now store the loaded data size before the data is altered. Closes gitlab-org/gitlab-ce#18690
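A small sketch of why the size has to be recorded first. It uses `String#scrub` as a stand-in for the cleanup step; that substitution is an assumption for illustration, and the actual `encode!` helper may behave differently.

```ruby
# Hypothetical data: ISO-8859-1 bytes that are not valid UTF-8.
raw = "caf\xE9 data".b

# Record the size of the data exactly as it was loaded.
loaded_size = raw.bytesize

# Stand-in for the cleanup step: dropping the invalid byte changes the length.
cleaned = raw.dup.force_encoding(Encoding::UTF_8).scrub('')

loaded_size      # 9 bytes as loaded
cleaned.bytesize # 8 bytes after cleanup
```

Comparing `cleaned.bytesize` against a size limit or a reported blob size would therefore give a different answer than comparing `loaded_size`.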
-
- Jun 16, 2016
-
-
Stan Hu authored
-