Commits · master · GitLab.org / gitlab_git

Oct 25, 2016

Refactor Ref's `target` to be consistent with Rugged · 43f5700b

Alejandro Rodríguez authored 8 years ago

!103 introduced an optimization where a Ref's target would point to the
dereferenced object, which saved us a lot of operations but became
inconsistent with Rugged's API, where target always references the
immediate object pointed at. These changes add a new property,
dereference_target, to stay consistent with Rugged while still being
able to get the dereferenced target efficiently.
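
The distinction can be sketched in plain Ruby without Rugged. `FakeCommit`, `FakeTagAnnotation`, and this `Ref` class are hypothetical stand-ins rather than the gitlab_git classes; only the property name comes from the message above.

```ruby
# Hypothetical stand-ins for Git objects; not the gitlab_git or Rugged classes.
FakeCommit = Struct.new(:oid)
FakeTagAnnotation = Struct.new(:oid, :target)

class Ref
  attr_reader :name, :target

  def initialize(name, target)
    @name = name
    @target = target # the immediate object the ref points at
  end

  # Follows tag annotations until a non-annotation object is reached.
  # Memoized so repeated lookups stay cheap.
  def dereference_target
    @dereference_target ||= begin
      object = target
      object = object.target while object.is_a?(FakeTagAnnotation)
      object
    end
  end
end

commit = FakeCommit.new('abc123')
annotation = FakeTagAnnotation.new('def456', commit)
tag = Ref.new('v1.0.0', annotation)

puts tag.target.oid             # the immediate object: the annotation
puts tag.dereference_target.oid # the peeled object: the commit
```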

43f5700b

Oct 21, 2016

Revert to previous diff pruning behavior · c297aa44

Alejandro Rodríguez authored 8 years ago

Before 10.6.9, if a Rugged diff met the conditions for being too-large
(which obviously meant it was also collapsable), too-large would take
precedence. This reverts to that behavior, since gitlab-ce
depends on it.
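
A minimal sketch of that precedence, assuming a 10KB collapse limit and a 100KB too-large limit (the figures mentioned in the DiffCollection commit further down this page). `SketchDiff` and its `prune!` method are hypothetical names, not the gitlab_git implementation.

```ruby
# Hypothetical thresholds and class, for illustration only.
COLLAPSE_LIMIT = 10 * 1024   # bytes
TOO_LARGE_LIMIT = 100 * 1024 # bytes

class SketchDiff
  attr_reader :diff

  def initialize(patch)
    @diff = patch
    @too_large = false
    @collapsed = false
    prune!
  end

  def too_large?
    @too_large
  end

  def collapsed?
    @collapsed
  end

  private

  # Any diff over TOO_LARGE_LIMIT is also over COLLAPSE_LIMIT, so the
  # too-large check has to run first for it to take precedence.
  def prune!
    if @diff.bytesize >= TOO_LARGE_LIMIT
      @diff = ''
      @too_large = true
    elsif @diff.bytesize >= COLLAPSE_LIMIT
      @diff = ''
      @collapsed = true
    end
  end
end

big = SketchDiff.new('a' * (200 * 1024))
puts big.too_large?
puts big.collapsed?
```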

c297aa44

Oct 12, 2016

Ignore invalid encoding when parsing attributes · 45580062

Yorick Peterse authored 8 years ago

This prevents us from raising a hard error when parsing Git attribute
files that are encoded using an invalid/unsupported encoding. This in
turn means a GitLab user won't be presented with an HTTP 500 error.
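
One way to avoid the hard error, sketched under assumptions: check each line with `String#valid_encoding?` before doing any regex work on it, since regex matching on a string with invalid bytes can raise `ArgumentError`. The helper name `each_valid_line` is hypothetical, and skipping the offending line (rather than, say, stopping at it) is a choice made for this sketch, not something the commit message specifies.

```ruby
require 'stringio'

# Hypothetical helper: yields each usable line of an attributes source and
# skips lines whose bytes are not valid in the string's encoding.
def each_valid_line(io)
  return enum_for(:each_valid_line, io) unless block_given?

  io.each_line do |line|
    next unless line.valid_encoding?

    yield line.strip
  end
end

# The second line contains bytes that are not valid UTF-8.
data = "*.txt text\n*.bin \xFF\xFE binary\n*.md gitlab-language=markdown\n"
lines = each_valid_line(StringIO.new(data)).to_a
p lines
```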

Fixes #28

Unverified

45580062

Oct 07, 2016

Optimize diff creation from Rugged::Patch · ad3a18a7

Alejandro Rodríguez authored 8 years ago

Our diff threshold for pruning is fairly small, but we used to iterate
over *each* line, even in a 100 MB patch, just to check these small
thresholds. Now we stop as soon as any of the limits is met.
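
The early exit can be sketched as follows. The limits and the `over_limits?` helper are hypothetical, and a generated enumerator stands in for the lines of a Rugged::Patch so the example runs without Rugged.

```ruby
# Hypothetical limits; the real thresholds live in gitlab_git.
MAX_LINES = 3
MAX_BYTES = 1024

# Returns true as soon as either limit is exceeded, without visiting
# the remaining lines. `lines` can be any object that responds to #each.
def over_limits?(lines)
  line_count = 0
  byte_count = 0

  lines.each do |line|
    line_count += 1
    byte_count += line.bytesize
    return true if line_count > MAX_LINES || byte_count > MAX_BYTES
  end

  false
end

# Counts how many lines were actually produced by the fake patch.
visited = 0
huge_patch = Enumerator.new do |yielder|
  1_000_000.times do |i|
    visited += 1
    yielder << "+line #{i}\n"
  end
end

puts over_limits?(huge_patch)
puts visited
```

Only the first few lines of the million-line enumerator are generated before the method returns, which is the point of the optimization.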

ad3a18a7

Sep 29, 2016
- Add changelog entry for straight diff feature · 595ed5d8
  Dmitriy Zaporozhets authored 8 years ago
  
  Signed-off-by: Dmitriy Zaporozhets <dmitriy.zaporozhets@gmail.com>
  Verified
  
  595ed5d8
Sep 22, 2016
- Commit.find returns nil and no longer throws an error on an empty repository · 7e4bde8f
  James Lopez authored 8 years ago
  
  Also added relevant spec with broken repo.
  7e4bde8f
Sep 14, 2016
- Attribute parser support for paths without attrs · 8509cae3
  Yorick Peterse authored 8 years ago
  
  This adds Git attribute parser support for file paths that don't contain any attributes.
  Verified
  
  8509cae3
Sep 10, 2016
- Retain file mode whenever file is renamed · 0f258836
  tiagonbotelho authored 8 years ago
  
  0f258836
Sep 08, 2016

Check for large diffs upon initialisation · 4c008a2f

Yorick Peterse authored 8 years ago

Prior to this commit the DiffCollection class was responsible for
checking if a diff had to be collapsed or was too large to be displayed
altogether.

This commit changes both DiffCollection and Diff so that Diff itself
checks if it's too large or has to be collapsed. These checks happen when
the Diff is being initialised. The patch size is based on the size of
every line in every hunk of the diff, instead of relying on the diff as
a string including diff markers.

DiffCollection still has an extra check to collapse diffs when it has
iterated over too many files. Since this is unrelated to the actual
sizes this has been kept as-is.

For binary files no pruning takes place as the diffs for these files are
not displayed. In the past the size of a diff was reported based on the
diff's size (including metadata). If we were to use the actual file's
size a diff would be marked as being too large and in the case of an
image would never be displayed.

Unverified

4c008a2f

Sep 06, 2016
- Fix attribute support for the "binary" option · 0f62db54
  Yorick Peterse authored 8 years ago
  
  When the "binary" option is set to true the "diff" option is to be set to false automatically.
  Verified
  
  0f62db54
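
The rule from the "binary" fix above can be sketched in a few lines. This assumes parsed attributes are kept as a Hash with String keys and boolean flags; `expand_binary` is a hypothetical helper name.

```ruby
# Hypothetical helper: applies the "binary" rule after parsing.
# Assumes attributes are a Hash with String keys and true/false flags.
def expand_binary(attrs)
  return attrs unless attrs['binary'] == true

  attrs.merge('diff' => false)
end

p expand_binary({ 'binary' => true })
p expand_binary({ 'text' => true })
```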
Sep 05, 2016
- Release 10.6.2 · d47cc16e
  Yorick Peterse authored 8 years ago
  
  View commits for tag v10.6.2 v10.6.2 Unverified
  
  d47cc16e
Sep 01, 2016

Handle missing attribute files when parsing · afaeb8f2

Yorick Peterse authored 8 years ago

Previously using Gitlab::Git::Attributes when $GIT_DIR/info/attributes
didn't exist would result in a runtime error. This commit fixes this so
an empty Hash is produced instead.
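
A minimal sketch of the guard, assuming the parser is handed the Git directory path. `parse_attributes` is a hypothetical function, and the parsing itself is reduced to the bare minimum here.

```ruby
require 'tmpdir'

# Hypothetical reader: maps each pattern in info/attributes to its raw
# attribute strings, and returns an empty Hash when the file is absent.
def parse_attributes(git_dir)
  path = File.join(git_dir, 'info', 'attributes')
  return {} unless File.exist?(path)

  File.foreach(path).each_with_object({}) do |line, patterns|
    pattern, *attrs = line.strip.split(/\s+/)
    next if pattern.nil? || pattern.start_with?('#')

    patterns[pattern] = attrs
  end
end

Dir.mktmpdir do |dir|
  p parse_attributes(dir) # a fresh directory has no info/attributes file
end
```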

Verified

afaeb8f2

Aug 31, 2016

Parse Git attribute files using Ruby · 340e111e

Yorick Peterse authored 8 years ago

Rugged provides a way of parsing Git attribute files such as the one
located in $GIT_DIR/info/attributes. Per GitLab's performance monitoring
tools quite a lot of time can be spent in parsing/retrieving attributes.

This commit introduces a pure Ruby parser for gitlab_git that performs
drastically better than the one provided by Rugged.

== Production Timings

As an example, take the commit https://gitlab.com/nrclark/dummy_project/commit/81ebdea5df2fb42e59257cb3eaad671a5c53ca36
(as taken from https://gitlab.com/gitlab-org/gitlab-ce/issues/10785).
When loading this commit we spend between 4 and 6 seconds in
Rugged::Repository#fetch_attributes. This method is called around 1100
times. This is the result of two problems:

1. For every diff we call Gitlab::Git::Repository#diffable? and pass it
   a blob. This method in turn returns a boolean (based on the Git
   attributes for the blob's path) indicating if the content is
   diffable.

2. For every diff we use the GitLab class Gitlab::Highlight which calls
   Repository#gitattribute in the #custom_language method. This is used
   to determine what language to use for highlighting a diff.

As a result, in the worst case we'll end up with 2 calls to
Gitlab::Git::Repository#attributes (previously delegated to
Rugged::Repository#attributes) for every diff.

== Rugged Implementation

Rugged in turn implements the "attributes" method in a rather
inefficient way. The first time this method is called it will run at
least a single open() call to open the file. On top of that it appears
to run 2 stat() calls for every call to Rugged::Repository#attributes.
In other words, if you call it 100 times you will end up with 201 IO
calls:

* 200 stat() calls
* 1 open() call

== Rugged IO Overhead

To confirm the IO overhead of Rugged I created the following script
(saved as "confirm.rb"):

    require 'rugged'

    path = '/tmp/test/.git'
    repo = Rugged::Repository.new(path)

    10.times do
      repo.attributes('README.md')['gitlab-language']
    end

I then ran this as follows:

    strace -f ruby confirm.rb 2>&1 | grep -i 'info/attributes' | wc -l

This counts the number of instances an IO call refers to the
"$GIT_DIR/info/attributes" file. The output is "21", meaning 21 IO calls
were executed.

While this may not be a big problem when using physical storage (even
less so when using SSDs), this _will_ be a problem when using network
storage. For example, say every IO operation takes 2 milliseconds to
complete. The 201 IO calls from the 100-call example above would then
take _at least_ 400 milliseconds in _just_ the IO operations.

The Ruby parser on the other hand only uses a single open() IO call.
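
The idea of reading the file once and answering every later lookup from memory can be sketched as below. `SketchAttributes` is a hypothetical class, not the code added by this commit; it takes an IO object so the example runs without a repository, uses `File.fnmatch?` for pattern matching, and lets later lines take precedence as a rough approximation of Git's rules.

```ruby
require 'stringio'

# Hypothetical pure-Ruby parser: reads the attributes source once and
# serves all lookups from the parsed data.
class SketchAttributes
  def initialize(io)
    # Reverse so that later lines in the file are checked first.
    @patterns = parse(io).reverse_each.to_a
  end

  # Returns the attributes of the first stored pattern matching `path`,
  # or an empty Hash when nothing matches.
  def attributes(path)
    @patterns.each do |pattern, attrs|
      return attrs if File.fnmatch?(pattern, path, File::FNM_DOTMATCH)
    end

    {}
  end

  private

  def parse(io)
    io.each_line.each_with_object({}) do |line, patterns|
      pattern, *pairs = line.strip.split(/\s+/)
      next if pattern.nil? || pattern.start_with?('#')

      patterns[pattern] = pairs.each_with_object({}) do |pair, attrs|
        key, value = pair.split('=', 2)
        if key.start_with?('-')
          attrs[key[1..-1]] = false # "-text" unsets the attribute
        else
          attrs[key] = value.nil? ? true : value
        end
      end
    end
  end
end

source = <<~ATTRS
  # This is a comment, it should be ignored.
  *.jpg     -text
  *.sh      eol=lf gitlab-language=shell
  *.haml.*  gitlab-language=haml
  *.cgi     key=value?p1=v1&p2=v2
ATTRS

attrs = SketchAttributes.new(StringIO.new(source))
puts attrs.attributes('test.haml.html')['gitlab-language']
```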

== Benchmarking

To measure the performance of this code I wrote the following benchmark:

    require 'rugged'
    require 'benchmark/ips'

    require_relative 'lib/gitlab_git/attributes'

    repo = Rugged::Repository.new('/tmp/test/.git')
    attr = Gitlab::Git::Attributes.new(repo.path)

    Benchmark.ips(time: 10) do |bench|
      bench.report 'Rugged' do
        repo.attributes('test.haml.html')['gitlab-language']
      end

      bench.report 'gitlab_git' do
        attr.attributes('test.haml.html')['gitlab-language']
      end

      bench.compare!
    end

The contents of /tmp/test/.git/info/attributes are as follows:

    # This is a comment, it should be ignored.

    *.txt     text
    *.jpg     -text
    *.sh      eol=lf gitlab-language=shell
    *.haml.*  gitlab-language=haml
    foo/bar.* foo
    *.cgi     key=value?p1=v1&p2=v2

    # This uses a tab instead of spaces to ensure the parser also supports this.
    *.md	gitlab-language=markdown

Running this benchmark on my development environment produces the
following output:

    Warming up --------------------------------------
                  Rugged     9.543k i/100ms
              gitlab_git    43.277k i/100ms
    Calculating -------------------------------------
                  Rugged    100.261k (± 2.0%) i/s -      1.012M in  10.093380s
              gitlab_git    482.186k (± 1.7%) i/s -      4.847M in  10.055286s

    Comparison:
              gitlab_git:   482185.6 i/s
                  Rugged:   100260.6 i/s - 4.81x  slower

The exact output varies with system load, but usually the new Ruby-based
parser is between 4 and 6 times faster than Rugged.

To further test this I wrote the following benchmark:

    require 'benchmark'
    require 'rugged'

    require_relative 'lib/gitlab_git/attributes'

    amount = 5000
    repo = Rugged::Repository.new('/var/opt/gitlab/git-data-ceph/repositories/gitlab-org/gitlab-ce.git')
    attrs = Gitlab::Git::Attributes.new(repo.path)

    # Wall-clock time in milliseconds of each Rugged lookup.
    rugged_timings = amount.times.map do
      timing = Benchmark.measure do
        repo.attributes('README.md').to_h
      end

      timing.real * 1000.0
    end

    # Wall-clock time in milliseconds of each lookup using the Ruby parser.
    ruby_timings = amount.times.map do
      timing = Benchmark.measure do
        attrs.attributes('README.md')
      end

      timing.real * 1000.0
    end

    puts "Rugged: #{rugged_timings.inject(:+)} ms"
    puts "Ruby: #{ruby_timings.inject(:+)} ms"

This script uses Rugged and the new attributes parser, parses the same
attributes file 5000 times, and then counts the total processing time.
Running this script on worker1 produced the following output:

    Rugged: 131.95287296548486 ms
    Ruby: 30.17003694549203 ms

Here the Ruby-based solution is around 4.4 times faster than Rugged.

== Further Improvements

GitLab may decide to at some point cache the parsed data structures in
for example Redis, which is now possible due to them being proper Ruby
data structures. Note that this is only really beneficial in cases where
Git attributes are requested for the same file path in different
requests. This also requires careful cache invalidation. For example, we
don't want to invalidate the entire cache when modifying some unrelated
file.

Because of the complexity involved it's best to leave this for later and
only implement it once we're certain it will actually be beneficial.

Unverified

340e111e

Aug 27, 2016
- Add CHANGELOG entry about Repository#reload_rugged · 6cb92440
  Stan Hu authored 8 years ago
  
  6cb92440
- Provide Repository#reload_rugged to allow other callers to refresh the repository · 6a75ea50
  Stan Hu authored 8 years ago
  
  6a75ea50
- Add Repository#find_branch to speed up branch lookups · 3606265d
  Stan Hu authored 8 years ago
  
  A common call in GitLab is to look up a single branch, but previously this was done by calling Repository#branches, which loads all the branches into memory unnecessarily and causes many filesystem accesses for each branch. With Repository#find_branch, we can do a direct lookup for the branch we care about.
  3606265d
Aug 17, 2016
- Release 10.4.7 · bd8946f8
  Yorick Peterse authored 8 years ago
  
  View commits for tag v10.4.7 v10.4.7 Unverified
  
  bd8946f8
Aug 12, 2016
- Optimize fetch of the author and committer of a Rugged commit · fd30738d
  Alejandro Rodríguez authored 8 years ago
  
  fd30738d
Aug 05, 2016
- Compare returns an empty collection of commits on nil refs · 83b7fec6
  Paco Guzman authored 8 years ago
  
  83b7fec6
- Write .gitattributes in binary mode to prevent Rails from converting ASCII-8BIT to UTF-8 · d754c8bb
  Stan Hu authored 8 years ago
  
  Closes gitlab-org/gitlab-ce#20647
  d754c8bb
Aug 03, 2016
- Lazy load compare commits · 36f97f00
  Paco Guzman authored 8 years ago
  
  36f97f00
Jul 28, 2016
- Improve performance of a decorated DiffCollection instance · c43d998d
  Paco Guzman authored 8 years ago
  
  c43d998d
Jul 21, 2016
- Expose tags git object sha (if it's an annotated tag) · 36943fec
  Alejandro Rodríguez authored 8 years ago
  
  36943fec
Jul 19, 2016
- Bump version to 10.4.0 · 37c5c24c
  Douwe Maan authored 8 years ago
  
  View commits for tag v10.4.0 v10.4.0
  
  37c5c24c
- Move implementation of `local_branches` to `Gitlab::Git::Repository` from gitlab-ce · d481bf4e
  Alejandro Rodríguez authored 8 years ago
  
  d481bf4e
Jul 18, 2016

Refactor Refs to preserve their target objects instead of just a string representation · 396cba39

Alejandro Rodríguez authored 8 years ago

Notice that in `spec/tag_spec.rb` I changed the SHAs because they previously
referenced the tag annotation commit. With the refactor, the SHA must be that
of the commit the tag actually points at.

396cba39

Handle collapsable DiffCollection · 703105c3

Paco Guzman authored 8 years ago

We want to give the user the opportunity to see more
files while preserving safe limits on rendering diffs.
The lines of files bigger than 10KB (we consider 100KB
too large) are not taken into account when deciding
whether the collection overflows, and the user can
choose to expand those files on click.

To keep good performance we start collapsing all
the files once we are over the safe limits.

To do that there are two methods, `collapsed?` and
`collapse`: the first tells whether the file is quite
large, the other whether the collection collapsed it
when iterating as a whole.

It's quite similar to the `too_large` concept but
gives us a little more flexibility.

703105c3

Optional ref update on commit create · c0067588
Valery Sizov authored 8 years ago

c0067588

Jul 06, 2016
- 10.3.0 is released · 607d6845
  Douwe Maan authored 8 years ago
  
  View commits for tag v10.3.0 v10.3.0
  
  607d6845
- changes tabs to spaces on CHANGELOG file · 413aaa19
  tiagonbotelho authored 8 years ago
  
  413aaa19
- adds new test for rename method and refactors code for the test · 4701bcfa
  tiagonbotelho authored 8 years ago
  
  4701bcfa
- changes tabs to spaces on CHANGELOG file · cfb423fb
  tiagonbotelho authored 8 years ago
  
  cfb423fb
- adds new test for rename method and refactors code for the test · 4bb6289c
  tiagonbotelho authored 8 years ago
  
  4bb6289c
Jul 05, 2016
- Remove Repository#add_tag · 721b5156
  Robert Schilling authored 8 years ago
  
  721b5156
Jun 23, 2016
- Handle nil blob data · 38b190fc
  Stan Hu authored 8 years ago
  
  38b190fc
- Mark CHANGELOG to include Repository#format_patch in version 10.3.0 · 2d9903c5
  Stan Hu authored 8 years ago
  
  2d9903c5
- Add CHANGELOG entry for reverting format_patch changes · 14f00d70
  Stan Hu authored 8 years ago
  
  14f00d70
Jun 22, 2016

Fix bug where truncated? would always be true for files · 480774bb

Stan Hu authored 8 years ago

The encode! method can add bytes to or remove bytes from the data for a
number of reasons (e.g. bad encoding). We now store the loaded data size
before the data is altered.
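
The fix can be sketched as follows, with `String#scrub` standing in for the project's encode! helper. `SketchBlob` is a hypothetical class; the point is only that the loaded size is captured before the bytes change.

```ruby
# Hypothetical blob wrapper: records the byte size of the loaded data
# before re-encoding, so truncated? compares the original sizes.
class SketchBlob
  attr_reader :data, :size

  # data: the bytes actually loaded; size: the full blob size in bytes.
  def initialize(data, size)
    @loaded_size = data.bytesize # captured before the data is altered
    @size = size
    # Stand-in for encode!: drop bytes that are not valid UTF-8.
    @data = data.dup.force_encoding(Encoding::UTF_8).scrub('')
  end

  def truncated?
    @size > @loaded_size
  end
end

raw = "caf\xC3\xA9 \xFF".b # 7 bytes; the last one is not valid UTF-8
blob = SketchBlob.new(raw, raw.bytesize)
puts blob.data.bytesize
puts blob.truncated?
```

Comparing the full size against `data.bytesize` after cleaning would report this fully loaded blob as truncated, because the cleaned data is one byte shorter.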

Closes gitlab-org/gitlab-ce#18690

480774bb

Jun 16, 2016
- Bump to 10.2.0 · 4b3fd678
  Stan Hu authored 8 years ago
  
  4b3fd678
- Fix bug where truncated? would erroneously change after the first iteration · c08d688d
  Stan Hu authored 8 years ago
  
  encode! is called on data, causing all Unicode characters to be converted into UTF-8.
  c08d688d
