  1. Aug 31, 2016
    • Merge branch 'ruby-gitattributes-parser' into 'master' · 62927165
      Douwe Maan authored
      Parse Git attribute files using Ruby
      
      Commit 340e111e contains all the details. It's quite the read so the short summary is:
      
      > Rugged is slow as heck because it runs multiple IO calls every time you request a set of Git attributes. gitlab_git now provides a pure Ruby parser that avoids this and is between 4 and 6 times faster.
      
      Here's a Grafana screenshot to show how bad it can get:
      
      ![timings](/uploads/39f7b6b7b6a8d97f2b11a20a088988e4/timings.jpg)
      
      See https://gitlab.com/gitlab-org/gitlab-ce/issues/10785 for more information.
      
      See merge request !121
    • Parse Git attribute files using Ruby · 340e111e
      Yorick Peterse authored
      Rugged provides a way of parsing Git attribute files such as the one
      located in $GIT_DIR/info/attributes. According to GitLab's performance
      monitoring tools, quite a lot of time can be spent parsing and
      retrieving attributes.
      
      This commit introduces a pure Ruby parser for gitlab_git that performs
      drastically better than the one provided by Rugged.
      
      == Production Timings
      
      As an example, take the commit https://gitlab.com/nrclark/dummy_project/commit/81ebdea5df2fb42e59257cb3eaad671a5c53ca36
      (as taken from https://gitlab.com/gitlab-org/gitlab-ce/issues/10785).
      When loading this commit we spend between 4 and 6 seconds in
      Rugged::Repository#fetch_attributes. This method is called around 1100
      times. This is the result of two problems:
      
      1. For every diff we call Gitlab::Git::Repository#diffable? and pass it
         a blob. This method in turn returns a boolean (based on the Git
         attributes for the blob's path) indicating if the content is
         diffable.
      
      2. For every diff we use the GitLab class Gitlab::Highlight which calls
         Repository#gitattribute in the #custom_language method. This is used
         to determine what language to use for highlighting a diff.
      
      As a result, in the worst case we end up with 2 calls to
      Gitlab::Git::Repository#attributes for every diff (previously
      delegated to Rugged::Repository#attributes).
      
      == Rugged Implementation
      
      Rugged in turn implements the "attributes" method in a rather
      inefficient way. The first time this method is called it will run at
      least a single open() call to open the file. On top of that it appears
      to run 2 stat() calls for every call to Rugged::Repository#attributes.
      In other words, if you call it 100 times you will end up with 201 IO
      calls:
      
      * 200 stat() calls
      * 1 open() call
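
      The arithmetic above can be written down as a small sketch. The
      helper name rugged_io_calls is hypothetical and only encodes the
      observed pattern (1 open() call plus 2 stat() calls per lookup); it
      is not part of Rugged or gitlab_git.

      ```ruby
      # Hypothetical helper: estimated number of IO calls Rugged makes for
      # `lookups` attribute lookups, assuming 1 open() call in total plus
      # 2 stat() calls per lookup (the pattern described above).
      def rugged_io_calls(lookups)
        1 + (2 * lookups)
      end

      puts rugged_io_calls(100)   # 201
      puts rugged_io_calls(1100)  # 2201
      ```

      The second line uses the roughly 1100 calls mentioned under
      "Production Timings"; the result is an estimate, not a measurement.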
      
      == Rugged IO Overhead
      
      To confirm the IO overhead of Rugged I created the following script
      (saved as "confirm.rb"):
      
          require 'rugged'
      
          path = '/tmp/test/.git'
          repo = Rugged::Repository.new(path)
      
          10.times do
            repo.attributes('README.md')['gitlab-language']
          end
      
      I then ran this as follows:
      
          strace -f ruby confirm.rb 2>&1 | grep -i 'info/attributes' | wc -l
      
      This counts the number of IO calls that refer to the
      "$GIT_DIR/info/attributes" file. The output is "21", meaning 21 IO calls
      were executed for 10 lookups. This is consistent with 2 stat() calls
      per lookup plus a single open() call.
      
      While this may not be a big problem when using physical storage (even
      less so when using SSDs), this _will_ be a problem when using network
      storage. For example, say every operation takes 2 milliseconds to
      complete. The 201 IO calls needed for 100 lookups would then result in
      _at least_ 400 milliseconds being spent in _just_ the IO operations.
      
      The Ruby parser on the other hand only uses a single open() IO call.
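
      To illustrate the idea of reading the file once and answering every
      later lookup from memory, here is a minimal sketch. It is _not_ the
      gitlab_git implementation: the class name SimpleAttributes is
      hypothetical, and matching with File.fnmatch? is a simplification
      that does not cover all of Git's pattern rules.

      ```ruby
      require 'tempfile'

      # Hypothetical, simplified attributes parser (not the gitlab_git code).
      # The file is read on the first lookup only; the parsed rules are kept
      # in memory for all later lookups.
      class SimpleAttributes
        def initialize(path)
          @path = path
          @rules = nil
        end

        # Returns a Hash with the attributes that apply to `file_path`.
        # Later lines in the file override earlier ones.
        def attributes(file_path)
          rules.each_with_object({}) do |(pattern, attrs), result|
            result.merge!(attrs) if File.fnmatch?(pattern, file_path)
          end
        end

        private

        def rules
          @rules ||= parse
        end

        # Reads the file line by line, skipping blank lines and comments.
        def parse
          parsed = []

          File.foreach(@path) do |line|
            line = line.strip
            next if line.empty? || line.start_with?('#')

            pattern, *pairs = line.split(/\s+/)
            parsed << [pattern, parse_pairs(pairs)]
          end

          parsed
        rescue Errno::ENOENT
          # A missing attributes file simply means there are no rules.
          []
        end

        # "name" => true, "-name" => false, "name=value" => "value".
        def parse_pairs(pairs)
          pairs.each_with_object({}) do |pair, hash|
            if pair.start_with?('-')
              hash[pair[1..-1]] = false
            elsif pair.include?('=')
              key, value = pair.split('=', 2)
              hash[key] = value
            else
              hash[pair] = true
            end
          end
        end
      end

      # Usage with a temporary file standing in for info/attributes.
      file = Tempfile.new('attributes')
      file.write("# comment\n*.jpg -text\n*.haml.* gitlab-language=haml\n")
      file.close

      attrs = SimpleAttributes.new(file.path)
      puts attrs.attributes('test.haml.html')['gitlab-language'] # haml
      ```

      Memoizing the parsed rules is what keeps the IO cost constant
      regardless of how many paths are looked up.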
      
      == Benchmarking
      
      To measure the performance of this code I wrote the following benchmark:
      
          require 'rugged'
          require 'benchmark/ips'
      
          require_relative 'lib/gitlab_git/attributes'
      
          repo = Rugged::Repository.new('/tmp/test/.git')
          attr = Gitlab::Git::Attributes.new(repo.path)
      
          Benchmark.ips(time: 10) do |bench|
            bench.report 'Rugged' do
              repo.attributes('test.haml.html')['gitlab-language']
            end
      
            bench.report 'gitlab_git' do
              attr.attributes('test.haml.html')['gitlab-language']
            end
      
            bench.compare!
          end
      
      The contents of /tmp/test/.git/info/attributes are as follows:
      
          # This is a comment, it should be ignored.
      
          *.txt     text
          *.jpg     -text
          *.sh      eol=lf gitlab-language=shell
          *.haml.*  gitlab-language=haml
          foo/bar.* foo
          *.cgi     key=value?p1=v1&p2=v2
      
          # This uses a tab instead of spaces to ensure the parser also supports this.
          *.md	gitlab-language=markdown
      
      Running this benchmark on my development environment produces the
      following output:
      
          Warming up --------------------------------------
                        Rugged     9.543k i/100ms
                    gitlab_git    43.277k i/100ms
          Calculating -------------------------------------
                        Rugged    100.261k (± 2.0%) i/s -      1.012M in  10.093380s
                    gitlab_git    482.186k (± 1.7%) i/s -      4.847M in  10.055286s
      
          Comparison:
                    gitlab_git:   482185.6 i/s
                        Rugged:   100260.6 i/s - 4.81x  slower
      
      The exact output depends on system load, but usually the new Ruby-based
      parser is between 4 and 6 times faster than Rugged.
      
      To further test this I wrote the following benchmark:
      
          require 'benchmark'
          require 'rugged'

          # Same require as in the first benchmark; adjust the path as needed.
          require_relative 'lib/gitlab_git/attributes'

          amount = 5000
          repo = Rugged::Repository.new('/var/opt/gitlab/git-data-ceph/repositories/gitlab-org/gitlab-ce.git')
          attrs = Gitlab::Git::Attributes.new(repo.path)

          # Time every lookup individually, in milliseconds.
          rugged_timings = amount.times.map do
            timing = Benchmark.measure do
              repo.attributes('README.md').to_h
            end

            timing.real * 1000.0
          end

          ruby_timings = amount.times.map do
            timing = Benchmark.measure do
              attrs.attributes('README.md')
            end

            timing.real * 1000.0
          end

          puts "Rugged: #{rugged_timings.inject(:+)} ms"
          puts "Ruby: #{ruby_timings.inject(:+)} ms"
      
      This script looks up the attributes for the same path 5000 times, once
      using Rugged and once using the new attributes parser, and then sums
      the time spent in each.
      Running this script on worker1 produced the following output:
      
          Rugged: 131.95287296548486 ms
          Ruby: 30.17003694549203 ms
      
      Here the Ruby-based solution is around 4.4 times faster than Rugged.
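
      For reference, the ratios can be recomputed from the numbers printed
      by the two benchmarks. This is only a sketch of the arithmetic; the
      variable names are made up for illustration and the values are copied
      from the outputs above.

      ```ruby
      # Values copied from the benchmark outputs above.
      ips_gitlab_git = 482_185.6           # iterations per second
      ips_rugged     = 100_260.6           # iterations per second
      ms_rugged      = 131.95287296548486  # total ms for 5000 lookups
      ms_ruby        = 30.17003694549203   # total ms for 5000 lookups

      ips_speedup  = (ips_gitlab_git / ips_rugged).round(2)
      time_speedup = (ms_rugged / ms_ruby).round(2)

      puts ips_speedup   # 4.81
      puts time_speedup  # 4.37
      ```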
      
      == Further Improvements
      
      GitLab may at some point decide to cache the parsed data structures in,
      for example, Redis, which is now possible because they are plain Ruby
      data structures. Note that this is only really beneficial in cases where
      Git attributes are requested for the same file path in different
      requests. This also requires careful cache invalidation. For example, we
      don't want to invalidate the entire cache when modifying some unrelated
      file.
      
      Because of the complexity involved it's best to leave this for later and
      only implement it once we're certain it will actually be beneficial.