The latter is more difficult to build against, but a little faster. Since we're likely moving to gitaly to pull repo data at some point (#4), it makes sense to use the first, but I'll keep !2 (closed) up to date.
We have two choices of encode-to-UTF8 support:
github.com/saintfish/chardet is in !1 (merged) and is a pure-Go implementation of some of libicu.
Again, the latter needs cgo, but is a bit faster. It's also more universally available. The former is buggy and needs us to also pull in pure-Go text conversion from golang.org/x/text. I don't like it much.
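For anyone unfamiliar with what the encode-to-UTF8 step has to do, here is a minimal stdlib-only sketch. It is not what either MR does: a real detector (chardet or ICU) guesses among many charsets, whereas this stand-in only checks for valid UTF-8 and otherwise assumes ISO-8859-1. The `toUTF8` name and the Latin-1 fallback are my own assumptions for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// toUTF8 returns data unchanged when it is already valid UTF-8.
// Otherwise it ASSUMES the bytes are ISO-8859-1 (a hypothetical fallback,
// standing in for real charset detection) and maps each byte to the
// Unicode code point with the same value.
func toUTF8(data []byte) string {
	if utf8.Valid(data) {
		return string(data)
	}
	runes := make([]rune, len(data))
	for i, b := range data {
		runes[i] = rune(b)
	}
	return string(runes)
}

func main() {
	latin1 := []byte{0x63, 0x61, 0x66, 0xE9} // "café" encoded as ISO-8859-1
	fmt.Printf("%q\n", toUTF8(latin1))
	fmt.Printf("%q\n", toUTF8([]byte("already UTF-8")))
}
```

The cost shows up because every blob has to be scanned at least once before it is sent to Elasticsearch, and a real detector does considerably more work per blob than the single validity check here.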
gitlab-mbp:gitlab lupine$ RAILS_ENV=development ELASTIC_CONNECTION_INFO='{"url":["http://10.0.1.3:9200"]}' gtime -v bundle exec bin/elastic_repo_indexer 3000 ~/dev/gitlab.com/gitlab-org/gdk-ee/gitlab
    Command being timed: "bundle exec bin/elastic_repo_indexer 3000 /Users/lupine/dev/gitlab.com/gitlab-org/gdk-ee/gitlab"
    User time (seconds): 26.87
    System time (seconds): 1.49
    Percent of CPU this job got: 44%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:04.06
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1446969344
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 219
    Minor (reclaiming a frame) page faults: 143069
    Voluntary context switches: 52598
    Involuntary context switches: 35672
    Swaps: 0
    File system inputs: 0
    File system outputs: 2
    Socket messages sent: 386
    Socket messages received: 626
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
gitlab-mbp:es-git-go lupine$ make && RAILS_ENV=development ELASTIC_CONNECTION_INFO='{"url":["http://10.0.1.3:9200"]}' gtime -v bin/es-git-go 3001 ~/dev/gitlab.com/gitlab-org/gdk-ee/gitlab
2017/03/22 20:13:24 Indexing from 0000000000000000000000000000000000000000 to 84ea4fa34fbcda996cc7c3fd970c47e90ae47aaf
2017/03/22 20:13:24 Index: gitlab-development, Project ID: 3001
    Command being timed: "bin/es-git-go 3001 /Users/lupine/dev/gitlab.com/gitlab-org/gdk-ee/gitlab"
    User time (seconds): 16.45
    System time (seconds): 4.08
    Percent of CPU this job got: 83%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.52
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1279475712
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 78215
    Voluntary context switches: 226291
    Involuntary context switches: 187294
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 18997
    Socket messages received: 802
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
!2 (closed) is the libgit2 / git2go implementation:
gitlab-mbp:es-git-go lupine$ make && RAILS_ENV=development ELASTIC_CONNECTION_INFO='{"url":["http://10.0.1.3:9200"]}' gtime -v bin/es-git-go 3002 ~/dev/gitlab.com/gitlab-org/gdk-ee/gitlab
2017/03/22 20:11:37 Indexing from 0000000000000000000000000000000000000000 to 84ea4fa34fbcda996cc7c3fd970c47e90ae47aaf
2017/03/22 20:11:37 Index: gitlab-development, Project ID: 3002
    Command being timed: "bin/es-git-go 3002 /Users/lupine/dev/gitlab.com/gitlab-org/gdk-ee/gitlab"
    User time (seconds): 4.28
    System time (seconds): 0.87
    Percent of CPU this job got: 34%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.10
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1292369920
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 935
    Minor (reclaiming a frame) page faults: 79895
    Voluntary context switches: 55177
    Involuntary context switches: 74360
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 17780
    Socket messages received: 805
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
Given that we already bundle a libgit2 in omnibus-gitlab, and given go-git is (relatively) slow, I'm starting to think we should be using libgit2 here. The presence of at least one bug affecting me in go-git - https://github.com/src-d/go-git/issues/249 - also pushes me in that direction.
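To put rough numbers on "relatively slow", here is a small sketch that converts the gtime elapsed strings from the three runs on the gitlab repository into seconds and prints each run's speedup over the Ruby indexer. The `parseElapsed` helper is hypothetical, and labelling the 0:24.52 run (project 3001) as the go-git build is my assumption from context.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseElapsed converts a GNU time wall-clock string in "h:mm:ss" or
// "m:ss" form (for example "1:04.06") into seconds.
func parseElapsed(s string) (float64, error) {
	parts := strings.Split(strings.TrimSpace(s), ":")
	if len(parts) < 2 || len(parts) > 3 {
		return 0, fmt.Errorf("unexpected elapsed format %q", s)
	}
	total := 0.0
	for _, p := range parts {
		v, err := strconv.ParseFloat(p, 64)
		if err != nil {
			return 0, fmt.Errorf("parsing %q: %w", s, err)
		}
		// Each field to the left is worth 60x the one to its right.
		total = total*60 + v
	}
	return total, nil
}

func main() {
	// Elapsed values copied from the gtime output for the gitlab repository.
	runs := []struct{ name, elapsed string }{
		{"Ruby indexer", "1:04.06"},
		{"es-git-go, go-git (assumed)", "0:24.52"},
		{"es-git-go, libgit2 (!2)", "0:15.10"},
	}
	var baseline float64
	for i, r := range runs {
		secs, err := parseElapsed(r.elapsed)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		if i == 0 {
			baseline = secs // the first entry is the reference run
		}
		fmt.Printf("%-30s %7.2fs  %.2fx vs Ruby\n", r.name, secs, baseline/secs)
	}
}
```

These are single runs on one machine against a remote Elasticsearch, so the ratios are indicative only.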
The only thing that's really making me hesitate at this point is that this will be the first dynamically linked Go binary that we ship. But we'll need to cross that bridge soon enough with gitaly.
Do you envision needing to ship a Go binary that dynamically links with libgit2 as problematic right now @twk3 (or someone else on the build team)?
Ach, sorry @vsizov, I've got my terminology confused. What I'm calling git2go is actually https://github.com/src-d/go-git, which is what I'm using in !1 (merged). git2go is of course the name of the libgit2 binding, which is what I'm using in !2 (closed). I've updated the preceding comments accordingly.
gitlab-mbp:es-git-go lupine$ RAILS_ENV=development ELASTIC_CONNECTION_INFO='{"url":["http://10.0.1.3:9200"]}' gtime -v bin/es-git-go 3002 ../gdk-ee/repositories/platform/hardware/bsp/kernel/common/v4.4.git/
2017/03/24 13:24:58 Indexing from 0000000000000000000000000000000000000000 to 4ecd03a43349a6dab58bc7f964de13698a5cef0e
2017/03/24 13:24:58 Index: gitlab-development, Project ID: 3002
    Command being timed: "bin/es-git-go 3002 ../gdk-ee/repositories/platform/hardware/bsp/kernel/common/v4.4.git/"
    User time (seconds): 185.50
    System time (seconds): 47.31
    Percent of CPU this job got: 65%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 5:55.45
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 7023476736
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 7
    Minor (reclaiming a frame) page faults: 432163
    Voluntary context switches: 2990752
    Involuntary context switches: 2009519
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 391416
    Socket messages received: 8113
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
gitlab-mbp:es-git-go lupine$ RAILS_ENV=development ELASTIC_CONNECTION_INFO='{"url":["http://10.0.1.3:9200"]}' gtime -v bin/es-git-go 3000 ../gdk-ee/repositories/platform/hardware/bsp/kernel/common/v4.4.git/
2017/03/24 13:08:28 Indexing from 0000000000000000000000000000000000000000 to 4ecd03a43349a6dab58bc7f964de13698a5cef0e
2017/03/24 13:08:28 Index: gitlab-development, Project ID: 3000
    Command being timed: "bin/es-git-go 3000 ../gdk-ee/repositories/platform/hardware/bsp/kernel/common/v4.4.git/"
    User time (seconds): 165.09
    System time (seconds): 44.40
    Percent of CPU this job got: 64%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 5:27.32
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 6985236480
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 297
    Minor (reclaiming a frame) page faults: 440700
    Voluntary context switches: 3451844
    Involuntary context switches: 1808225
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 378519
    Socket messages received: 8897
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
There's definitely a difference, but given how much of an improvement either one is, and given #4, I think we should stick with go-git here for now. It's just going to be easier.
Adding encoding slows things down as well. I've added a pure-Go solution to !1 (merged):
gitlab-mbp:es-git-go lupine$ make && RAILS_ENV=development ELASTIC_CONNECTION_INFO='{"url":["http://10.0.1.3:9200"]}' gtime -v bin/es-git-go 6002 ../gdk-ee/repositories/platform/hardware/bsp/kernel/common/v4.4.git/
go install -v -ldflags='-X "main.Version=47d39a8-dev" -X "main.BuildTime=2017-03-24-1624 UTC"' gitlab.com/gitlab-org/es-git-go
gitlab.com/gitlab-org/es-git-go/indexer
gitlab.com/gitlab-org/es-git-go
2017/03/24 16:24:56 Indexing from 0000000000000000000000000000000000000000 to 4ecd03a43349a6dab58bc7f964de13698a5cef0e
2017/03/24 16:24:56 Index: gitlab-development, Project ID: 6002
    Command being timed: "bin/es-git-go 6002 ../gdk-ee/repositories/platform/hardware/bsp/kernel/common/v4.4.git/"
    User time (seconds): 968.36
    System time (seconds): 116.69
    Percent of CPU this job got: 159%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 11:20.83
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 6918733824
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 313
    Minor (reclaiming a frame) page faults: 430653
    Voluntary context switches: 9903601
    Involuntary context switches: 11999421
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 373865
    Socket messages received: 8618
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
And here's an icu4c implementation in !3 (merged):
gitlab-mbp:es-git-go lupine$ make && RAILS_ENV=development ELASTIC_CONNECTION_INFO='{"url":["http://10.0.1.3:9200"]}' gtime -v bin/es-git-go 6002 ../gdk-ee/repositories/platform/hardware/bsp/kernel/common/v4.4.git/
go install -v -ldflags='-X "main.Version=f04f4f3-dev" -X "main.BuildTime=2017-03-24-1652 UTC"' gitlab.com/gitlab-org/es-git-go
2017/03/24 16:53:01 Indexing from 0000000000000000000000000000000000000000 to 4ecd03a43349a6dab58bc7f964de13698a5cef0e
2017/03/24 16:53:01 Index: gitlab-development, Project ID: 6002
    Command being timed: "bin/es-git-go 6002 ../gdk-ee/repositories/platform/hardware/bsp/kernel/common/v4.4.git/"
    User time (seconds): 463.40
    System time (seconds): 64.32
    Percent of CPU this job got: 94%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 9:19.82
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 6866763776
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 334
    Minor (reclaiming a frame) page faults: 421445
    Voluntary context switches: 3390023
    Involuntary context switches: 3017168
    Swaps: 0
    File system inputs: 1
    File system outputs: 0
    Socket messages sent: 399814
    Socket messages received: 8872
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0