Skip to content
Snippets Groups Projects
  1. Oct 10, 2016
  2. Oct 07, 2016
  3. Oct 03, 2016
    • Douwe Maan's avatar
      Merge branch 'rs-be_valid_commit-matcher' into 'master' · 08984b77
      Douwe Maan authored
      Fix invalid `be_valid_commit` matcher
      
      This matcher could have theoretically evaluated to the following:
      
      ```ruby
      true
      false
      true
      true
      ```
      
      ...and never would have caused a failure. Indeed, it should have been
      failing for quite a while.
      
      See merge request !126
      08984b77
  4. Oct 02, 2016
    • Robert Speicher's avatar
      Fix invalid `be_valid_commit` matcher · 80e3a7a5
      Robert Speicher authored
      This matcher could have theoretically evaluated to the following:
      
      ```ruby
      true
      false
      true
      true
      ```
      
      ...and never would have caused a failure. Indeed, it should have been
      failing for quite a while.
      80e3a7a5
  5. Sep 29, 2016
  6. Sep 23, 2016
  7. Sep 22, 2016
  8. Sep 14, 2016
  9. Sep 12, 2016
  10. Sep 10, 2016
  11. Sep 09, 2016
  12. Sep 08, 2016
    • Douwe Maan's avatar
      Merge branch 'mark-blobs-binary' into 'master' · 43cc7189
      Douwe Maan authored
      Mark blobs as binary whenever this is known
      
      This removes the need for using `Linguist::BlobHelper#binary?` whenever the binary status is already known, in turn reducing loading times of https://gitlab.com/nrclark/dummy_project/commit/81ebdea5df2fb42e59257cb3eaad671a5c53ca36 (by about 2-ish seconds locally).
      
      See merge request !123
      43cc7189
    • Yorick Peterse's avatar
      Mark blobs as binary whenever this is known · 3bc9611e
      Yorick Peterse authored
      Previously we would rely on Linguist::BlobHelper to determine if a blob
      was binary or not. Since Rugged knows if a blob is binary we can instead
      just inherit this information and fall back to BlobHelper if the binary
      flag wasn't set explicitly.
      
      When testing this with https://gitlab.com/nrclark/dummy_project/commit/81ebdea5df2fb42e59257cb3eaad671a5c53ca36
      it reduces loading times (locally) by around 2 seconds.
      Verified
      3bc9611e
    • Douwe Maan's avatar
      Merge branch 'better-large-diff-handling' into 'master' · 750b9e73
      Douwe Maan authored
      Improve handling of large diffs
      
      This MR adjusts the way checking for large diffs takes place. Prior to this MR the procedure was basically as follows:
      
      1. Iterate over every diff in a collection
      2. Just load the entire diff into memory, why not
      3. Check if the resulting content _including_ any diff markers/meta data exceed a threshold
      4. Prune or collapse the diff
      
      This MR changes things around so the procedure is instead as follows:
      
      1. Iterate over every diff in a collection
      2. Check if the data modified (excluding diff markers) is larger than a threshold
      3. If this is not the case, proceed as usual. if this _is_ the case we'll prune/collapse the diff
      
      See merge request !122
      750b9e73
    • Yorick Peterse's avatar
      Check for large diffs upon initialisation · 4c008a2f
      Yorick Peterse authored
      Prior to this commit the DiffCollection class was responsible for
      checking if a diff had to be collapsed or was too large to be displayed
      altogether.
      
      This commit changes both DiffCollection and Diff so that Diff itself
      checks if its too large or has to be collapsed. These checks happen when
      the Diff is being initialised. The patch size is based on the size of
      every line in every hunk of the diff, instead of relying on the diff as
      a string including diff markers.
      
      DiffCollection still has an extra check to collapse diffs when it has
      iterated over too many files. Since this is unrelated to the actual
      sizes this has been kept as-is.
      
      For binary files no pruning takes place as the diffs for these files are
      not displayed. In the past the size of a diff was reported based on the
      diff's size (including metadata). If we were to use the actual file's
      size a diff would be marked as being too large and in the case of an
      image would never be displayed.
      Unverified
      4c008a2f
  13. Sep 06, 2016
  14. Sep 05, 2016
  15. Sep 01, 2016
  16. Aug 31, 2016
    • Douwe Maan's avatar
    • Douwe Maan's avatar
      Merge branch 'ruby-gitattributes-parser' into 'master' · 62927165
      Douwe Maan authored
      Parse Git attribute files using Ruby
      
      Commit 340e111e contains all the details. It's quite the read so the short summary is:
      
      > Rugged is slow as heck because it runs multiple IO calls every time you request a set of Git attributes. gitlab_git now provides a pure Ruby parser that avoids this and is between 4 and 6 times faster.
      
      Here's a Grafana screenshot to show how bad it can get:
      
      ![timings](/uploads/39f7b6b7b6a8d97f2b11a20a088988e4/timings.jpg)
      
      See https://gitlab.com/gitlab-org/gitlab-ce/issues/10785 for more information.
      
      See merge request !121
      62927165
    • Yorick Peterse's avatar
      Parse Git attribute files using Ruby · 340e111e
      Yorick Peterse authored
      Rugged provides a way of parsing Git attribute files such as the one
      located in $GIT_DIR/info/attributes. Per GitLab's performance monitoring
      tools quite a lot of time can be spent in parsing/retrieving attributes.
      
      This commit introduces a pure Ruby parser for gitlab_git that performs
      drastically better than the one provided by Rugged.
      
      == Production Timings
      
      As an example, take the commit https://gitlab.com/nrclark/dummy_project/commit/81ebdea5df2fb42e59257cb3eaad671a5c53ca36
      (as taken from https://gitlab.com/gitlab-org/gitlab-ce/issues/10785).
      When loading this commit we spend between 4 and 6 seconds in
      Rugged::Repository#fetch_attributes. This method is called around 1100
      times. This is the result of two problems:
      
      1. For every diff we call Gitlab::Git::Repository#diffable? and pass it
         a blob. This method in turn returns a boolean (based on the Git
         attributes for the blob's path) indicating if the content is
         diffable.
      
      2. For every diff we use the GitLab class Gitlab::Highlight which calls
         Repository#gitattribute in the #custom_language method. This is used
         to determine what language to use for highlighting a diff.
      
      As a result in the worst case we'll end up with 2 calls to
      Gitlab::Git::Repository#attributes (previously delegated to
      Rugged::Repository#attributes).
      
      == Rugged Implementation
      
      Rugged in turn implements the "attributes" method in a rather
      in-efficient way. The first time this method is called it will run at
      least a single open() call to open the file. On top of that it appears
      to run 2 stat() calls for every call to Rugged::Repository#attributes.
      In other words, if you call it a 100 times you will end up with 201 IO
      calls:
      
      * 200 stat() calls
      * 1 open() call
      
      == Rugged IO Overhead
      
      To confirm the IO overhead of Rugged I created the following script
      (saved as "confirm.rb"):
      
          require 'rugged'
      
          path = '/tmp/test/.git'
          repo = Rugged::Repository.new(path)
      
          10.times do
            repo.attributes('README.md')['gitlab-language']
          end
      
      I then ran this as follows:
      
          strace -f ruby confirm.rb 2>&1 | grep -i 'info/attributes' | wc -l
      
      This counts the number of instances an IO call refers to the
      "$GIT_DIR/info/attributes" file. The output is "21", meaning 21 IO calls
      were executed.
      
      While this may not be a big problem when using physical storage (even
      less so when using SSDs), this _will_ be a problem when using network
      storage. For example, say every operation takes 2 milliseconds to
      complete. This would result in _at least_ 400 milliseconds being spent
      in _just_ the IO operations.
      
      The Ruby parser on the other hand only uses a single open() IO call.
      
      == Benchmarking
      
      To measure the performance of this code I wrote the following benchmark:
      
          require 'rugged'
          require 'benchmark/ips'
      
          require_relative 'lib/gitlab_git/attributes'
      
          repo = Rugged::Repository.new('/tmp/test/.git')
          attr = Gitlab::Git::Attributes.new(repo.path)
      
          Benchmark.ips(time: 10) do |bench|
            bench.report 'Rugged' do
              repo.attributes('test.haml.html')['gitlab-language']
            end
      
            bench.report 'gitlab_git' do
              attr.attributes('test.haml.html')['gitlab-language']
            end
      
            bench.compare!
          end
      
      The contents of /tmp/test/.git/info/attributes are as follows:
      
          # This is a comment, it should be ignored.
      
          *.txt     text
          *.jpg     -text
          *.sh      eol=lf gitlab-language=shell
          *.haml.*  gitlab-language=haml
          foo/bar.* foo
          *.cgi     key=value?p1=v1&p2=v2
      
          # This uses a tab instead of spaces to ensure the parser also supports this.
          *.md	gitlab-language=markdown
      
      Running this benchmark on my development environment produces the
      following output:
      
          Warming up --------------------------------------
                        Rugged     9.543k i/100ms
                    gitlab_git    43.277k i/100ms
          Calculating -------------------------------------
                        Rugged    100.261k (± 2.0%) i/s -      1.012M in  10.093380s
                    gitlab_git    482.186k (± 1.7%) i/s -      4.847M in  10.055286s
      
          Comparison:
                    gitlab_git:   482185.6 i/s
                        Rugged:   100260.6 i/s - 4.81x  slower
      
      The exact output differs on system load but usually the new Ruby based
      parser is between 4 and 6 times faster than Rugged.
      
      To further test this I wrote the following benchmark:
      
          require 'benchmark'
      
          amount = 5000
          rugged = Rugged::Repository.new('/var/opt/gitlab/git-data-ceph/repositories/gitlab-org/gitlab-ce.git')
          attrs = Gitlab::Git::Attributes.new(rugged.path)
      
          rugged = amount.times.map do
            timing = Benchmark.measure do
              rugged.attributes('README.md').to_h
            end
      
            timing.real * 1000.0
          end
      
          ruby = amount.times.map do
            timing = Benchmark.measure do
              attrs.attributes('README.md')
            end
      
            timing.real * 1000.0
          end
      
          puts "Rugged: #{rugged.inject(:+)} ms"
          puts "Ruby: #{ruby.inject(:+)} ms"
      
      This script uses Rugged and the new attributes parser, parses the same
      attributes file 5000 times, and then counts the total processing time.
      Running this script on worker1 produced the following output:
      
          Rugged: 131.95287296548486 ms
          Ruby: 30.17003694549203 ms
      
      Here the Ruby based solution is around ~4.5 times faster than Rugged.
      
      == Further Improvements
      
      GitLab may decide to at some point cache the parsed data structures in
      for example Redis, which is now possible due to them being proper Ruby
      data structures. Note that this is only really beneficial in cases where
      Git attributes are requested for the same file path in different
      requests. This also requires careful cache invalidation. For example, we
      don't want to invalidate the entire cache when modifying some unrelated
      file.
      
      Because of the complexity involved it's best to leave this for later and
      only implement it once we're certain it will actually be beneficial.
      Unverified
      340e111e
  17. Aug 29, 2016
  18. Aug 27, 2016
Loading