Skip to content
Snippets Groups Projects

Initial implementation of an elasticsearch indexer in Go

Merged Nick Thomas requested to merge 1-initial-implementation into master
All threads resolved!

Closes #1 (closed)

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Nick Thomas added 2 commits

    added 2 commits

    • 5d4c8267 - Commits and blobs go in a subdocument
    • 9a572119 - Committers actually go in a correctly-spelled subdocument

    Compare with previous version

  • Nick Thomas resolved all discussions

    resolved all discussions

  • Nick Thomas added 1 commit

    added 1 commit

    • f04f4f3e - Initial encode-to-UTF8 support

    Compare with previous version

  • Nick Thomas added 5 commits

    added 5 commits

    • 96c0fde1 - Use icu4c for encoding to UTF-8
    • c50e3925 - Add libicu-dev to the .gitlab-ci.yml
    • c6ed3c57 - Finish the gitlab-ci run by building the binary
    • 038f1a77 - run make before make test so integration tests will run
    • e9abdd57 - Merge branch '1-initial-icu' into '1-initial-implementation'

    Compare with previous version

  • Nick Thomas added 1 commit

    added 1 commit

    • 6f149bec - Add analysers to the Elasticsearch index

    Compare with previous version

  • Nick Thomas added 1 commit

    added 1 commit

    Compare with previous version

  • Nick Thomas added 2 commits

    added 2 commits

    Compare with previous version

  • Author Maintainer

    go-git has a bug preventing it from working with long commit messages, including one in the gitlab-ee repository!

    PR at https://github.com/src-d/go-git/pull/314

    I'll update the dependencies once this is merged.

  • Nick Thomas added 1 commit

    added 1 commit

    Compare with previous version

  • Nick Thomas added 1 commit

    added 1 commit

    Compare with previous version

  • Nick Thomas added 1 commit

    added 1 commit

    • e7f1644c - Fill out the first integration test

    Compare with previous version

  • Nick Thomas added 1 commit

    added 1 commit

    • 258774eb - Introduce our own types into the git package

    Compare with previous version

  • Nick Thomas added 2 commits

    added 2 commits

    • 977c0195 - Rename Repo to Repository
    • 29a2bf4f - Add an integration case for timezone preservation

    Compare with previous version

  • Nick Thomas added 1 commit

    added 1 commit

    Compare with previous version

  • Nick Thomas added 1 commit

    added 1 commit

    • 79e50fee - Add integration tests for charset conversions

    Compare with previous version

  • Nick Thomas added 1 commit

    added 1 commit

    • 3fb9011c - Fork icu to fix error handling

    Compare with previous version

  • Nick Thomas added 1 commit

    added 1 commit

    • d854d074 - Add an initial set of tests for git.Repository

    Compare with previous version

  • Nick Thomas added 2 commits

    added 2 commits

    • 15732fa0 - Return BuildBlob() to function from method
    • ba45ea9a - Fix a bug in langauge detection by filename

    Compare with previous version

  • Nick Thomas added 4 commits

    added 4 commits

    Compare with previous version

  • Nick Thomas added 1 commit

    added 1 commit

    • e0206373 - SHA ranges are x..y - add tests and emulate this with go-git

    Compare with previous version

  • Nick Thomas added 1 commit

    added 1 commit

    • a31aabc8 - fixup! SHA ranges are x..y - add tests and emulate this with go-git

    Compare with previous version

  • Nick Thomas added 1 commit

    added 1 commit

    • c7185297 - Add some more tests to git/repository_test.go

    Compare with previous version

  • Nick Thomas added 1 commit

    added 1 commit

    • 39b434b0 - elasticsearch: Test AWS signing

    Compare with previous version

  • Nick Thomas added 1 commit

    added 1 commit

    • 6057d2de - Improve (non-integration) test coverage in elastic

    Compare with previous version

  • Author Maintainer

    IMO this is ready.

    Coverage report:

    gitlab.com/gitlab-org/es-git-go/elastic/client.go:37:		FromEnv				90.9%
    gitlab.com/gitlab-org/es-git-go/elastic/client.go:57:		NewClient			85.0%
    gitlab.com/gitlab-org/es-git-go/elastic/client.go:104:		ParentID			100.0%
    gitlab.com/gitlab-org/es-git-go/elastic/client.go:108:		Flush				100.0%
    gitlab.com/gitlab-org/es-git-go/elastic/client.go:112:		Close				100.0%
    gitlab.com/gitlab-org/es-git-go/elastic/client.go:116:		Index				100.0%
    gitlab.com/gitlab-org/es-git-go/elastic/client.go:128:		Get				100.0%
    gitlab.com/gitlab-org/es-git-go/elastic/client.go:137:		GetCommit			100.0%
    gitlab.com/gitlab-org/es-git-go/elastic/client.go:141:		GetBlob				100.0%
    gitlab.com/gitlab-org/es-git-go/elastic/client.go:145:		Remove				100.0%
    gitlab.com/gitlab-org/es-git-go/elastic/elastic.go:18:		ReadConfig			75.0%
    gitlab.com/gitlab-org/es-git-go/elastic/index.go:649:		CreateIndex			66.7%
    gitlab.com/gitlab-org/es-git-go/elastic/index.go:662:		DeleteIndex			66.7%
    gitlab.com/gitlab-org/es-git-go/git/go-git.go:27:		NewGoGitRepository		82.6%
    gitlab.com/gitlab-org/es-git-go/git/go-git.go:70:		diff				80.0%
    gitlab.com/gitlab-org/es-git-go/git/go-git.go:90:		goGitBuildSignature		100.0%
    gitlab.com/gitlab-org/es-git-go/git/go-git.go:98:		goGitBuildFile			100.0%
    gitlab.com/gitlab-org/es-git-go/git/go-git.go:107:		EachFileChange			68.2%
    gitlab.com/gitlab-org/es-git-go/git/go-git.go:155:		EachCommit			80.0%
    gitlab.com/gitlab-org/es-git-go/indexer/blob.go:22:		isSkipBlobErr			100.0%
    gitlab.com/gitlab-org/es-git-go/indexer/blob.go:57:		GenerateBlobID			100.0%
    gitlab.com/gitlab-org/es-git-go/indexer/blob.go:61:		BuildBlob			92.9%
    gitlab.com/gitlab-org/es-git-go/indexer/blob.go:104:		DetectLanguage			100.0%
    gitlab.com/gitlab-org/es-git-go/indexer/blob.go:116:		DetectBinary			100.0%
    gitlab.com/gitlab-org/es-git-go/indexer/commit.go:19:		GenerateCommitID		100.0%
    gitlab.com/gitlab-org/es-git-go/indexer/commit.go:23:		BuildCommit			100.0%
    gitlab.com/gitlab-org/es-git-go/indexer/encoding.go:15:		init				75.0%
    gitlab.com/gitlab-org/es-git-go/indexer/encoding.go:23:		tryEncodeString			60.0%
    gitlab.com/gitlab-org/es-git-go/indexer/encoding.go:33:		tryEncodeBytes			50.0%
    gitlab.com/gitlab-org/es-git-go/indexer/encoding.go:44:		encodeString			100.0%
    gitlab.com/gitlab-org/es-git-go/indexer/encoding.go:49:		encodeBytes			70.0%
    gitlab.com/gitlab-org/es-git-go/indexer/indexer.go:24:		SubmitCommit			100.0%
    gitlab.com/gitlab-org/es-git-go/indexer/indexer.go:31:		SubmitBlob			100.0%
    gitlab.com/gitlab-org/es-git-go/indexer/indexer.go:45:		RemoveBlob			100.0%
    gitlab.com/gitlab-org/es-git-go/indexer/indexer.go:52:		IndexCommits			100.0%
    gitlab.com/gitlab-org/es-git-go/indexer/indexer.go:56:		IndexBlobs			100.0%
    gitlab.com/gitlab-org/es-git-go/indexer/indexer.go:60:		Index				66.7%
    gitlab.com/gitlab-org/es-git-go/indexer/person.go:19:		GenerateDate			100.0%
    gitlab.com/gitlab-org/es-git-go/indexer/person.go:23:		BuildPerson			100.0%
    gitlab.com/gitlab-org/es-git-go/linguist/language.go:31:	init				100.0%
    gitlab.com/gitlab-org/es-git-go/linguist/language.go:48:	and				0.0%
    gitlab.com/gitlab-org/es-git-go/linguist/language.go:62:	DetectLanguageByFilename	100.0%
    gitlab.com/gitlab-org/es-git-go/linguist/language.go:66:	DetectLanguageByExtension	100.0%
    gitlab.com/gitlab-org/es-git-go/linguist/language.go:70:	DetectLanguage			88.9%
    gitlab.com/gitlab-org/es-git-go/main.go:12:			main				0.0%
    total:								(statements)			75.4%

    (this doesn't take the integration tests into account)

    I've manually verified that we create equivalent documents in gitlab-test. We also get the same number of documents in gitlab-ee and v4 (kernel) repositories.

  • Nick Thomas added 1 commit

    added 1 commit

    • 86bf5779 - Avoid conflicts between tests on the same ES cluster

    Compare with previous version

  • Nick Thomas unmarked as a Work In Progress

    unmarked as a Work In Progress

  • First of all, wow.

    Second: it's hard for me to absorb all this. I don't see anything that looks terribly wrong but there is my favorite subject when it comes to this: freeing resources.

    If I understand correctly this is meant as a sort of one-off process so memory leaks have a limited scope (everything gets cleaned up once the single job is done). But ideally there would be some sort of way to tell git-go when we don't need things anymore so that the underlying resources may be freed. Maybe I missed it but I don't see anything like this.

    What I expect is going on beneath the surface (not clear from the documentation) is that go-git creates zlib readers (one for each loose object we access), memory-mapped (? is that even possible in pure Go?) references to the pack files, and an in-memory copy of the reference database (constructed from packed-refs and whatever is in the refs directory).

    But perhaps I am now speaking more as a Gitaly maintainer wondering whether I can use git-go at some point then as a reviewer of es-git-go. :) Still, even within the scope of just one index job, it is in our interest not to let memory balloon. What do you think @nick.thomas ?

  • Author Maintainer

    @jacobvosmaer-gitlab I agree, it would be bad if this was packed with memory leaks. Per #3, I'm hoping to turn this into a long-lived process in the near future, so that would be death.

    I've done some checking of the memory characteristics of the process as-is, using eyeballs on RSS and some more detailed checking with Go's built-in memory profiler.

    RSS is dominated by those mmap()ed files, and is something the kernel controls rather than us (it'll evict unused pages if there's memory pressure, leave them alone otherwise).

    I've misplaced my memprofile runs, so I'm going to generate some more for you to observe. I didn't see anything indicating memory leaks in the run, but things have changed since so it's worth double-checking!

  • Author Maintainer

    valgrind is happy enough,although I don't know how clever it is about Golang:

    =1380== HEAP SUMMARY:
    ==1380==     in use at exit: 43,670 bytes in 116 blocks
    ==1380==   total heap usage: 1,189,216 allocs, 1,189,100 frees, 210,092,133 bytes allocated
    ...
    ==1380== LEAK SUMMARY:
    ==1380==    definitely lost: 0 bytes in 0 blocks
    ==1380==    indirectly lost: 0 bytes in 0 blocks
    ==1380==      possibly lost: 3,952 bytes in 13 blocks
    ==1380==    still reachable: 39,718 bytes in 103 blocks
    ==1380==         suppressed: 0 bytes in 0 blocks
  • @nick.thomas naively/optimistically I am willing to trust the kernel to evict memory on memory-mapped files. But what is up to us, not the kernel, is to close those files when we're done with them.

    At the very very least we need a way to 'close' a repository. And to close a 'blob'.

    Any statistics on open file descriptors?

  • Author Maintainer

    I see it opening and closing packfiles, but it doesn't keep them alive for extended periods, and we don't accumulate them either.

    We have 2-10 TCP connections open to the elasticsearch server, depending on how busy we are.

    root@241dd6f38d58:/go# lsof -p 1447
    COMMAND    PID USER   FD   TYPE DEVICE  SIZE/OFF     NODE NAME
    es-git-go 1447 root  cwd    DIR   0,41       714 16957335 /es-git-go
    es-git-go 1447 root  rtd    DIR   0,48      4096        2 /
    es-git-go 1447 root  txt    REG   0,41  11345672 19713645 /es-git-go/bin/es-git-go
    es-git-go 1447 root  mem    REG   0,48     90096     2560 /lib/x86_64-linux-gnu/libgcc_s.so.1
    es-git-go 1447 root  mem    REG   0,48   1051056      110 /lib/x86_64-linux-gnu/libm-2.19.so
    es-git-go 1447 root  mem    REG   0,48   1008120     2559 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20
    es-git-go 1447 root  mem    REG   0,48     14664       36 /lib/x86_64-linux-gnu/libdl-2.19.so
    es-git-go 1447 root  mem    REG   0,48   1738176       38 /lib/x86_64-linux-gnu/libc-2.19.so
    es-git-go 1447 root  mem    REG   0,48    137384       89 /lib/x86_64-linux-gnu/libpthread-2.19.so
    es-git-go 1447 root  mem    REG   0,48  23512848     5994 /usr/lib/x86_64-linux-gnu/libicudata.so.52.1
    es-git-go 1447 root  mem    REG   0,48   1546256     6047 /usr/lib/x86_64-linux-gnu/libicuuc.so.52.1
    es-git-go 1447 root  mem    REG   0,48   2166128     6012 /usr/lib/x86_64-linux-gnu/libicui18n.so.52.1
    es-git-go 1447 root  mem    REG   0,48    140928       29 /lib/x86_64-linux-gnu/ld-2.19.so
    es-git-go 1447 root    0u   CHR  136,0       0t0        3 /0
    es-git-go 1447 root    1u   CHR  136,0       0t0        3 /0
    es-git-go 1447 root    2u   CHR  136,0       0t0        3 /0
    es-git-go 1447 root    3r   REG   0,41 243930017 17153242 /gitlab/.git/objects/pack/pack-d47e09b8e58c4726dcae30174469fc69150b5acb.pack
    es-git-go 1447 root    6u  IPv4  48524       0t0      TCP 241dd6f38d58:60144->10.0.1.3:9200 (ESTABLISHED)
    es-git-go 1447 root    9u  IPv4  47993       0t0      TCP 241dd6f38d58:60134->10.0.1.3:9200 (ESTABLISHED)
    es-git-go 1447 root   10u  0000   0,11         0     8770 anon_inode
    root@241dd6f38d58:/go# lsof -p 1447
    COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
    es-git-go 1447 root  cwd    DIR   0,41      714 16957335 /es-git-go
    es-git-go 1447 root  rtd    DIR   0,48     4096        2 /
    es-git-go 1447 root  txt    REG   0,41 11345672 19713645 /es-git-go/bin/es-git-go
    es-git-go 1447 root  mem    REG   0,48    90096     2560 /lib/x86_64-linux-gnu/libgcc_s.so.1
    es-git-go 1447 root  mem    REG   0,48  1051056      110 /lib/x86_64-linux-gnu/libm-2.19.so
    es-git-go 1447 root  mem    REG   0,48  1008120     2559 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20
    es-git-go 1447 root  mem    REG   0,48    14664       36 /lib/x86_64-linux-gnu/libdl-2.19.so
    es-git-go 1447 root  mem    REG   0,48  1738176       38 /lib/x86_64-linux-gnu/libc-2.19.so
    es-git-go 1447 root  mem    REG   0,48   137384       89 /lib/x86_64-linux-gnu/libpthread-2.19.so
    es-git-go 1447 root  mem    REG   0,48 23512848     5994 /usr/lib/x86_64-linux-gnu/libicudata.so.52.1
    es-git-go 1447 root  mem    REG   0,48  1546256     6047 /usr/lib/x86_64-linux-gnu/libicuuc.so.52.1
    es-git-go 1447 root  mem    REG   0,48  2166128     6012 /usr/lib/x86_64-linux-gnu/libicui18n.so.52.1
    es-git-go 1447 root  mem    REG   0,48   140928       29 /lib/x86_64-linux-gnu/ld-2.19.so
    es-git-go 1447 root    0u   CHR  136,0      0t0        3 /0
    es-git-go 1447 root    1u   CHR  136,0      0t0        3 /0
    es-git-go 1447 root    2u   CHR  136,0      0t0        3 /0
    es-git-go 1447 root    4u  IPv4  49270      0t0      TCP 241dd6f38d58:60164->10.0.1.3:9200 (ESTABLISHED)
    es-git-go 1447 root    5u  IPv4  48855      0t0      TCP 241dd6f38d58:60166->10.0.1.3:9200 (ESTABLISHED)
    es-git-go 1447 root   10u  0000   0,11        0     8770 anon_inode

    The resources in a blob are freed when they go out of scope; we only have one blob open at a time, and we make a copy of all the information it has that we want before going back for another. Those copies are, in turn, freed once they've been turned into a []byte of JSON data - and that's freed once the elasticsearch server comes back with 200 OK

    The repository is a bit more difficult. There's no explicit close, and it does hold onto some data for a map. That will be freed when the repo goes out of scope, which only happens when the program terminates as-is.

    If you want to see it working, I can do a little refactor so there's no Repository in func main(), and take a memprofile before and after it goes out of scope?

  • Author Maintainer

    For blobs specifically, we get an io.ReadCloser out of go-git, which is closed here: https://gitlab.com/gitlab-org/es-git-go/blob/1-initial-implementation/indexer/blob.go#L71

  • Author Maintainer

    Memory profiling looks very good, actually!

    diff --git a/main.go b/main.go
    index 43a299d..d251ff1 100644
    --- a/main.go
    +++ b/main.go
    @@ -3,12 +3,26 @@ package main
     import (
     	"log"
     	"os"
    +	"runtime"
    +	"runtime/pprof"
    
     	"gitlab.com/gitlab-org/es-git-go/elastic"
     	"gitlab.com/gitlab-org/es-git-go/git"
     	"gitlab.com/gitlab-org/es-git-go/indexer"
     )
    
    +func memprofile(name string) {
    +	f, err := os.Create(name)
    +	if err != nil {
    +		log.Fatal("could not create memory profile: ", err)
    +	}
    +	runtime.GC() // get up-to-date statistics
    +	if err := pprof.WriteHeapProfile(f); err != nil {
    +		log.Fatal("could not write memory profile: ", err)
    +	}
    +	f.Close()
    +}
    +
     func main() {
     	if len(os.Args) != 3 {
     		log.Fatalf("Usage: %s <project-id> <project-path>", os.Args[0])
    @@ -37,7 +51,13 @@ func main() {
     	log.Printf("Indexing from %s to %s", repo.FromHash, repo.ToHash)
     	log.Printf("Index: %s, Project ID: %s", esClient.IndexName, esClient.ParentID())
    
    +	// First profile: before doing anything
    +	memprofile("1.memprofile")
    +
     	if err := idx.Index(); err != nil {
     		log.Fatalln("Indexing error: ", err)
     	}
    +
    +	// Second profile: after running everything
    +	memprofile("2.memprofile")
     }

    profile001

    profile002

    Edited by Nick Thomas
  • Author Maintainer

    ICU holds 8MiB for the lifetime of the process, which is presumably lookup tables, etc.

    Our use of go-git isn't scattering unreclaimed objects, and contrary to my expectations, the Repository doesn't have to go out of scope for the 32MiB it allocates on startup to be cleaned!

  • merged

  • Nick Thomas mentioned in commit 9417aaa4

    mentioned in commit 9417aaa4

  • Please register or sign in to reply
    Loading