Initial implementation of an elasticsearch indexer in Go
Closes #1 (closed)
Merge request reports
Activity
OK, I've created this MR so we can have a proper review and approval process for es-git-go.
Lots of previous context on https://gitlab.com/gitlab-org/gitlab-ee/issues/1606
added 35 commits
-
be84ab1b - 1 commit from branch
master
- 6214ea34 - Starting point: list all changes and commits that need indexing
- ad4c9c9b - Initial vendoring
- 1b218c84 - Initial blob-reading and json-encoding stubs
- 4b7a937a - Reorganise code into two subpackages
- cdbf89c7 - Add a simple Makefile
- 217bf97d - Start CI configuration
- 00e99e32 - Add .gitignore
- c41f8d09 - Introduce an elasticsearch client
- 0c57274e - .gitlab-ci.yml: Put es-git-go into $GOPATH
- 25346fb7 - Makefile: add coverage
- 4986d042 - Exit with a non-zero error code if communicating with elasticsearch fails
- 90393cc7 - Introduce an explicit indexer
- 8fe01d9b - Vendor in the elasticsearch dependencies
- 415dc9ea - Fix listing changed files
- ff5856a7 - Import Cloudflare Makefile
- 4d4ea305 - go fmt
- 88a7459f - Remove two unused structs
- f27c5260 - Indexer should refer blobs, not files
- 64244977 - Add the index name to the indexer
- eecba7d6 - Numerous changes to get the first integration test ready
- 6d9542bd - Remove support for Go < v1.7
- f0a3b562 - Move CreateIndex() and DeleteIndex() out of the Indexer
- 05c4a49b - Add types and interfaces for commits
- ae645523 - Move indexer.Person to its own file
- 60ca8cb8 - Add types and interfaces for blobs
- 5d02dc60 - First attempt at actually indexing blobs and commits
- 2f35c7f8 - go-git has problems with concurrent use of git.Repo, so avoid it
- 3a2127a4 - Skip too-large blobs
- 69a85e50 - Increase the number of bulk workers from 2 to 10
- d125b96c - Pass commit dates to elastisearch
- 14cb59a5 - Fix elasticsearch date format
- b4706f27 - Commits should not index their id
- 84370c64 - Tag errors coming out of WalkCommitHistory to aid debugging
- ec2d0762 - Implement the 1MiB text-only blob indexing limits
Toggle commit list-
be84ab1b - 1 commit from branch
mentioned in issue #1 (closed)
Language detection is more primitive than the Ruby version, but it's present now, at least.
Should it be kept in a separate library perhaps? In time it could become a complete implementation of github-linguist.
Binary vs. text detection is still very primitive, especially compared to what we're doing in Ruby
The only completely missing part is the encoding magic (added now)
I think the existing schema is wrong - commits need to go inside a
commits
parent and blobs into ablobs
parent (fixed now)Edited by Nick Thomas- Resolved by Nick Thomas
added 5 commits
Toggle commit listadded 2 commits
go-git has a bug preventing it from working with long commit messages, including one in the gitlab-ee repository!
PR at https://github.com/src-d/go-git/pull/314
I'll update the dependencies once this is merged.
added 1 commit
- 258774eb - Introduce our own types into the git package
added 2 commits
added 1 commit
- 79e50fee - Add integration tests for charset conversions
added 1 commit
- d854d074 - Add an initial set of tests for git.Repository
added 1 commit
- e0206373 - SHA ranges are x..y - add tests and emulate this with go-git
added 1 commit
- a31aabc8 - fixup! SHA ranges are x..y - add tests and emulate this with go-git
added 1 commit
- c7185297 - Add some more tests to git/repository_test.go
added 1 commit
- 6057d2de - Improve (non-integration) test coverage in elastic
IMO this is ready.
Coverage report:
gitlab.com/gitlab-org/es-git-go/elastic/client.go:37: FromEnv 90.9% gitlab.com/gitlab-org/es-git-go/elastic/client.go:57: NewClient 85.0% gitlab.com/gitlab-org/es-git-go/elastic/client.go:104: ParentID 100.0% gitlab.com/gitlab-org/es-git-go/elastic/client.go:108: Flush 100.0% gitlab.com/gitlab-org/es-git-go/elastic/client.go:112: Close 100.0% gitlab.com/gitlab-org/es-git-go/elastic/client.go:116: Index 100.0% gitlab.com/gitlab-org/es-git-go/elastic/client.go:128: Get 100.0% gitlab.com/gitlab-org/es-git-go/elastic/client.go:137: GetCommit 100.0% gitlab.com/gitlab-org/es-git-go/elastic/client.go:141: GetBlob 100.0% gitlab.com/gitlab-org/es-git-go/elastic/client.go:145: Remove 100.0% gitlab.com/gitlab-org/es-git-go/elastic/elastic.go:18: ReadConfig 75.0% gitlab.com/gitlab-org/es-git-go/elastic/index.go:649: CreateIndex 66.7% gitlab.com/gitlab-org/es-git-go/elastic/index.go:662: DeleteIndex 66.7% gitlab.com/gitlab-org/es-git-go/git/go-git.go:27: NewGoGitRepository 82.6% gitlab.com/gitlab-org/es-git-go/git/go-git.go:70: diff 80.0% gitlab.com/gitlab-org/es-git-go/git/go-git.go:90: goGitBuildSignature 100.0% gitlab.com/gitlab-org/es-git-go/git/go-git.go:98: goGitBuildFile 100.0% gitlab.com/gitlab-org/es-git-go/git/go-git.go:107: EachFileChange 68.2% gitlab.com/gitlab-org/es-git-go/git/go-git.go:155: EachCommit 80.0% gitlab.com/gitlab-org/es-git-go/indexer/blob.go:22: isSkipBlobErr 100.0% gitlab.com/gitlab-org/es-git-go/indexer/blob.go:57: GenerateBlobID 100.0% gitlab.com/gitlab-org/es-git-go/indexer/blob.go:61: BuildBlob 92.9% gitlab.com/gitlab-org/es-git-go/indexer/blob.go:104: DetectLanguage 100.0% gitlab.com/gitlab-org/es-git-go/indexer/blob.go:116: DetectBinary 100.0% gitlab.com/gitlab-org/es-git-go/indexer/commit.go:19: GenerateCommitID 100.0% gitlab.com/gitlab-org/es-git-go/indexer/commit.go:23: BuildCommit 100.0% gitlab.com/gitlab-org/es-git-go/indexer/encoding.go:15: init 75.0% gitlab.com/gitlab-org/es-git-go/indexer/encoding.go:23: tryEncodeString 60.0% gitlab.com/gitlab-org/es-git-go/indexer/encoding.go:33: tryEncodeBytes 50.0% gitlab.com/gitlab-org/es-git-go/indexer/encoding.go:44: encodeString 100.0% gitlab.com/gitlab-org/es-git-go/indexer/encoding.go:49: encodeBytes 70.0% gitlab.com/gitlab-org/es-git-go/indexer/indexer.go:24: SubmitCommit 100.0% gitlab.com/gitlab-org/es-git-go/indexer/indexer.go:31: SubmitBlob 100.0% gitlab.com/gitlab-org/es-git-go/indexer/indexer.go:45: RemoveBlob 100.0% gitlab.com/gitlab-org/es-git-go/indexer/indexer.go:52: IndexCommits 100.0% gitlab.com/gitlab-org/es-git-go/indexer/indexer.go:56: IndexBlobs 100.0% gitlab.com/gitlab-org/es-git-go/indexer/indexer.go:60: Index 66.7% gitlab.com/gitlab-org/es-git-go/indexer/person.go:19: GenerateDate 100.0% gitlab.com/gitlab-org/es-git-go/indexer/person.go:23: BuildPerson 100.0% gitlab.com/gitlab-org/es-git-go/linguist/language.go:31: init 100.0% gitlab.com/gitlab-org/es-git-go/linguist/language.go:48: and 0.0% gitlab.com/gitlab-org/es-git-go/linguist/language.go:62: DetectLanguageByFilename 100.0% gitlab.com/gitlab-org/es-git-go/linguist/language.go:66: DetectLanguageByExtension 100.0% gitlab.com/gitlab-org/es-git-go/linguist/language.go:70: DetectLanguage 88.9% gitlab.com/gitlab-org/es-git-go/main.go:12: main 0.0% total: (statements) 75.4%
(this doesn't take the integration tests into account)
I've manually verified that we create equivalent documents in gitlab-test. We also get the same number of documents in gitlab-ee and v4 (kernel) repositories.
added 1 commit
- 86bf5779 - Avoid conflicts between tests on the same ES cluster
@jacobvosmaer-gitlab I know you’re busy, but could I prevail on you to do a quick review of https://gitlab.com/gitlab-org/es-git-go/merge_requests/1 ?
Happy for you to just focus on the git implementation part if you’ve got time: https://gitlab.com/gitlab-org/es-git-go/merge_requests/1/diffs#74845996aab6212584b15423c7efc294526f81ac
It's replacing https://gitlab.com/gitlab-org/gitlab-ee/blob/master/bin/elastic_repo_indexer https://gitlab.com/gitlab-org/gitlab-elasticsearch-git/blob/master/lib/elasticsearch/git/repository.rb
Edited by Nick ThomasFirst of all, wow.
Second: it's hard for me to absorb all this. I don't see anything that looks terribly wrong but there is my favorite subject when it comes to this: freeing resources.
If I understand correctly this is meant as a sort of one-off process so memory leaks have a limited scope (everything gets cleaned up once the single job is done). But ideally there would be some sort of way to tell git-go when we don't need things anymore so that the underlying resources may be freed. Maybe I missed it but I don't see anything like this.
What I expect is going on beneath the surface (not clear from the documentation) is that go-git creates zlib readers (one for each loose object we access), memory-mapped (? is that even possible in pure Go?) references to the pack files, and an in-memory copy of the reference database (constructed from
packed-refs
and whatever is in therefs
directory).But perhaps I am now speaking more as a Gitaly maintainer wondering whether I can use git-go at some point then as a reviewer of es-git-go. :) Still, even within the scope of just one index job, it is in our interest not to let memory balloon. What do you think @nick.thomas ?
@jacobvosmaer-gitlab I agree, it would be bad if this was packed with memory leaks. Per #3, I'm hoping to turn this into a long-lived process in the near future, so that would be death.
I've done some checking of the memory characteristics of the process as-is, using eyeballs on RSS and some more detailed checking with Go's built-in memory profiler.
RSS is dominated by those mmap()ed files, and is something the kernel controls rather than us (it'll evict unused pages if there's memory pressure, leave them alone otherwise).
I've misplaced my memprofile runs, so I'm going to generate some more for you to observe. I didn't see anything indicating memory leaks in the run, but things have changed since so it's worth double-checking!
valgrind is happy enough,although I don't know how clever it is about Golang:
=1380== HEAP SUMMARY: ==1380== in use at exit: 43,670 bytes in 116 blocks ==1380== total heap usage: 1,189,216 allocs, 1,189,100 frees, 210,092,133 bytes allocated ... ==1380== LEAK SUMMARY: ==1380== definitely lost: 0 bytes in 0 blocks ==1380== indirectly lost: 0 bytes in 0 blocks ==1380== possibly lost: 3,952 bytes in 13 blocks ==1380== still reachable: 39,718 bytes in 103 blocks ==1380== suppressed: 0 bytes in 0 blocks
@nick.thomas naively/optimistically I am willing to trust the kernel to evict memory on memory-mapped files. But what is up to us, not the kernel, is to close those files when we're done with them.
At the very very least we need a way to 'close' a repository. And to close a 'blob'.
Any statistics on open file descriptors?
I see it opening and closing packfiles, but it doesn't keep them alive for extended periods, and we don't accumulate them either.
We have 2-10 TCP connections open to the elasticsearch server, depending on how busy we are.
root@241dd6f38d58:/go# lsof -p 1447 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME es-git-go 1447 root cwd DIR 0,41 714 16957335 /es-git-go es-git-go 1447 root rtd DIR 0,48 4096 2 / es-git-go 1447 root txt REG 0,41 11345672 19713645 /es-git-go/bin/es-git-go es-git-go 1447 root mem REG 0,48 90096 2560 /lib/x86_64-linux-gnu/libgcc_s.so.1 es-git-go 1447 root mem REG 0,48 1051056 110 /lib/x86_64-linux-gnu/libm-2.19.so es-git-go 1447 root mem REG 0,48 1008120 2559 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20 es-git-go 1447 root mem REG 0,48 14664 36 /lib/x86_64-linux-gnu/libdl-2.19.so es-git-go 1447 root mem REG 0,48 1738176 38 /lib/x86_64-linux-gnu/libc-2.19.so es-git-go 1447 root mem REG 0,48 137384 89 /lib/x86_64-linux-gnu/libpthread-2.19.so es-git-go 1447 root mem REG 0,48 23512848 5994 /usr/lib/x86_64-linux-gnu/libicudata.so.52.1 es-git-go 1447 root mem REG 0,48 1546256 6047 /usr/lib/x86_64-linux-gnu/libicuuc.so.52.1 es-git-go 1447 root mem REG 0,48 2166128 6012 /usr/lib/x86_64-linux-gnu/libicui18n.so.52.1 es-git-go 1447 root mem REG 0,48 140928 29 /lib/x86_64-linux-gnu/ld-2.19.so es-git-go 1447 root 0u CHR 136,0 0t0 3 /0 es-git-go 1447 root 1u CHR 136,0 0t0 3 /0 es-git-go 1447 root 2u CHR 136,0 0t0 3 /0 es-git-go 1447 root 3r REG 0,41 243930017 17153242 /gitlab/.git/objects/pack/pack-d47e09b8e58c4726dcae30174469fc69150b5acb.pack es-git-go 1447 root 6u IPv4 48524 0t0 TCP 241dd6f38d58:60144->10.0.1.3:9200 (ESTABLISHED) es-git-go 1447 root 9u IPv4 47993 0t0 TCP 241dd6f38d58:60134->10.0.1.3:9200 (ESTABLISHED) es-git-go 1447 root 10u 0000 0,11 0 8770 anon_inode
root@241dd6f38d58:/go# lsof -p 1447 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME es-git-go 1447 root cwd DIR 0,41 714 16957335 /es-git-go es-git-go 1447 root rtd DIR 0,48 4096 2 / es-git-go 1447 root txt REG 0,41 11345672 19713645 /es-git-go/bin/es-git-go es-git-go 1447 root mem REG 0,48 90096 2560 /lib/x86_64-linux-gnu/libgcc_s.so.1 es-git-go 1447 root mem REG 0,48 1051056 110 /lib/x86_64-linux-gnu/libm-2.19.so es-git-go 1447 root mem REG 0,48 1008120 2559 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20 es-git-go 1447 root mem REG 0,48 14664 36 /lib/x86_64-linux-gnu/libdl-2.19.so es-git-go 1447 root mem REG 0,48 1738176 38 /lib/x86_64-linux-gnu/libc-2.19.so es-git-go 1447 root mem REG 0,48 137384 89 /lib/x86_64-linux-gnu/libpthread-2.19.so es-git-go 1447 root mem REG 0,48 23512848 5994 /usr/lib/x86_64-linux-gnu/libicudata.so.52.1 es-git-go 1447 root mem REG 0,48 1546256 6047 /usr/lib/x86_64-linux-gnu/libicuuc.so.52.1 es-git-go 1447 root mem REG 0,48 2166128 6012 /usr/lib/x86_64-linux-gnu/libicui18n.so.52.1 es-git-go 1447 root mem REG 0,48 140928 29 /lib/x86_64-linux-gnu/ld-2.19.so es-git-go 1447 root 0u CHR 136,0 0t0 3 /0 es-git-go 1447 root 1u CHR 136,0 0t0 3 /0 es-git-go 1447 root 2u CHR 136,0 0t0 3 /0 es-git-go 1447 root 4u IPv4 49270 0t0 TCP 241dd6f38d58:60164->10.0.1.3:9200 (ESTABLISHED) es-git-go 1447 root 5u IPv4 48855 0t0 TCP 241dd6f38d58:60166->10.0.1.3:9200 (ESTABLISHED) es-git-go 1447 root 10u 0000 0,11 0 8770 anon_inode
The resources in a blob are freed when they go out of scope; we only have one blob open at a time, and we make a copy of all the information it has that we want before going back for another. Those copies are, in turn, freed once they've been turned into a
[]byte
of JSON data - and that's freed once the elasticsearch server comes back with 200 OKThe repository is a bit more difficult. There's no explicit close, and it does hold onto some data for a map. That will be freed when the repo goes out of scope, which only happens when the program terminates as-is.
If you want to see it working, I can do a little refactor so there's no Repository in
func main()
, and take a memprofile before and after it goes out of scope?For blobs specifically, we get an
io.ReadCloser
out of go-git, which is closed here: https://gitlab.com/gitlab-org/es-git-go/blob/1-initial-implementation/indexer/blob.go#L71Memory profiling looks very good, actually!
diff --git a/main.go b/main.go index 43a299d..d251ff1 100644 --- a/main.go +++ b/main.go @@ -3,12 +3,26 @@ package main import ( "log" "os" + "runtime" + "runtime/pprof" "gitlab.com/gitlab-org/es-git-go/elastic" "gitlab.com/gitlab-org/es-git-go/git" "gitlab.com/gitlab-org/es-git-go/indexer" ) +func memprofile(name string) { + f, err := os.Create(name) + if err != nil { + log.Fatal("could not create memory profile: ", err) + } + runtime.GC() // get up-to-date statistics + if err := pprof.WriteHeapProfile(f); err != nil { + log.Fatal("could not write memory profile: ", err) + } + f.Close() +} + func main() { if len(os.Args) != 3 { log.Fatalf("Usage: %s <project-id> <project-path>", os.Args[0]) @@ -37,7 +51,13 @@ func main() { log.Printf("Indexing from %s to %s", repo.FromHash, repo.ToHash) log.Printf("Index: %s, Project ID: %s", esClient.IndexName, esClient.ParentID()) + // First profile: before doing anything + memprofile("1.memprofile") + if err := idx.Index(); err != nil { log.Fatalln("Indexing error: ", err) } + + // Second profile: after running everything + memprofile("2.memprofile") }
Edited by Nick Thomasmentioned in commit 9417aaa4