WIP Loose blob cache
Closes https://gitlab.com/gitlab-org/gitlab-workhorse/issues/80
Use loose Git blobs, when present, to make /raw/ blob requests faster. Create loose blobs on the fly if they do not already exist.
Still needs tests, sanity checks, metrics.
@chriscool any thoughts on what would be a simple test to see if a loose object file created by this code is valid?
- Resolved by Jacob Vosmaer (GitLab)
- Resolved by Jacob Vosmaer (GitLab)
@jacobvosmaer-gitlab maybe you can use `git hash-object -t blob FILE` and check that it gives the same loose object file as what your blob writer creates.

@chriscool would that detect corruption?
I found the following. If I have a loose object that also exists in a packfile, and I corrupt the packfile, then `git fsck` runs without errors :(. But. If I move the packfile out of `objects/pack`, run `git unpack-objects < packfile`, and then run `git fsck`, the corrupt loose object makes `git fsck` fail. I think this is because unpack-objects skips existing loose object files without checking if they are valid.

So my thinking for testing the validity now is:
- run `git gc` in test repo
- verify loose object does not exist
- trigger loose object creation
- verify loose object exists (i.e. was created)
- move packfile out of `objects/pack`
- unpack packfile
- run `git fsck`
Sanity checks I would like:

- blob ID is a 40-character hex string (so no `/` etc.) (see the sketch below)
- BlobWriter checks the number of bytes written and errors out in `Finalize()` if the size does not match the object header.
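A minimal sketch of that blob ID check, assuming a standalone helper; `validBlobID` and its placement are illustrative, not code from this MR:

```go
package looseblob

import "regexp"

// A SHA-1 blob ID is 40 lowercase hex characters. Rejecting anything else
// (including "/" or "..") keeps the ID safe to splice into a filesystem path.
// Note: validBlobID is a hypothetical helper for illustration only.
var blobIDPattern = regexp.MustCompile(`\A[0-9a-f]{40}\z`)

func validBlobID(id string) bool {
	return blobIDPattern.MatchString(id)
}
```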
@jacobvosmaer-gitlab yes I think what I suggest would detect corruption unless your zlib compression is different than the one used by Git. But you should be able to use the same zlib compression.
About fsck behavior, I think that it uses objects from loose object files if there are loose object files. That is similar to what other commands do. But I don't think you need to use fsck to detect corruption. In my opinion just comparing the bytes of a loose object file generated by your BlobWriter with the bytes generated by `git hash-object` is simpler and more thorough.

If you don't like using `git hash-object`, you could also, after unpacking a packfile in a separate directory, just compare the loose object files generated by unpacking the packfile with the loose object files generated by your BlobWriter.

@chriscool my impression so far has been that the code in BlobWriter does not create the exact same object file, probably because it uses a different zlib implementation (Golang stdlib in this MR vs zlib in Git, I presume). It doesn't feel like a good idea to test for equality of the compressed bytes when what matters is that:
- Git can read the object file
- the uncompressed bytes are equal
Right. I could:

- start with `git gc`
- trigger creation of a known loose blob by BlobWriter
- read the zlib-decompressed contents of that blob (see the sketch after this list)
- compare with a precomputed value (no need to do a `git unpack-objects` dance on each test run to retrieve the same object each time)
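A minimal sketch of the read-and-verify step, assuming a test helper; `readLooseBlob` and the object path handling are illustrative, not code from this MR:

```go
package looseblob

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io/ioutil"
	"os"
	"path"
)

// readLooseBlob zlib-decompresses a loose object file and returns the payload
// after the "blob <size>\x00" header, verifying the header along the way.
func readLooseBlob(repoPath, blobID string) ([]byte, error) {
	objPath := path.Join(repoPath, "objects", blobID[:2], blobID[2:])
	f, err := os.Open(objPath)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	zr, err := zlib.NewReader(f)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	raw, err := ioutil.ReadAll(zr)
	if err != nil {
		return nil, err
	}

	nul := bytes.IndexByte(raw, 0)
	if nul < 0 {
		return nil, fmt.Errorf("no NUL byte after object header")
	}
	header, body := string(raw[:nul]), raw[nul+1:]
	if expected := fmt.Sprintf("blob %d", len(body)); header != expected {
		return nil, fmt.Errorf("bad object header %q, expected %q", header, expected)
	}
	return body, nil
}
```

The test would then compare the returned bytes against the precomputed blob contents.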
Thanks, that sounds better. :)
changed milestone to %8.15
assigned to @nick.thomas
OK, so `git gc` removes the loose object files being created by this MR, but this will increase disc usage for repos with lots of large binary blobs being accessed. In the worst-case scenario, a project could use double its allowance.

Do we have any numbers on how this affects CPU and RAM usage?
- internal/git/looseblob/writer.go 0 → 100644
```go
sum := b.hasher.Sum([]byte{})
for i := range sum {
	if sum[i] != b.blobSha[i] {
		return fmt.Errorf("gitBlobWriter: SHA1 mismatch: expected %x, got %x", b.blobSha, sum)
	}
}

if err := b.zlibWriter.Close(); err != nil {
	return fmt.Errorf("gitBlobWriter: close zlib writer: %v", err)
}

if err := b.file.Close(); err != nil {
	return fmt.Errorf("gitBlobWriter: close tempfile: %v", err)
}

if err := os.Link(b.file.Name(), b.Path()); err != nil && !os.IsExist(err) {
```

OK, so this is our concurrency control. It creates a hard link from the expected loose blob filename to the temporary file, and the tempfile is removed in the `Writer#Close()` call.

The order of calling is `Finalize()`, then `Close()`. The error messages (if any) from `Close()` are thrown away. The `Finalize()` ones will result in an error being logged to raven, but won't interrupt completion of the client request.

It looks like the worst effect of highly concurrent downloads here would be that disc space will be consumed to a maximum of n*blobsize for a short period of time, and if there's a race on the `Link` call, some harmless errors will go to the log.
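For reference, a minimal sketch of how the `Close()` half of that pairing could look; this illustrates the behaviour described above and is not the MR's actual writer.go code:

```go
// Illustrative sketch only; the struct is reduced to the one field needed here.
package looseblob

import "os"

type gitBlobWriter struct {
	file *os.File // tempfile that Finalize() hard-links to the loose blob path
}

// Close always removes the tempfile. If Finalize() succeeded, the hard link
// it created keeps the loose blob in place; callers treat errors from Close()
// as non-fatal to the client request.
func (b *gitBlobWriter) Close() error {
	if b.file == nil {
		return nil
	}
	b.file.Close() // harmless if the file was already closed in Finalize()
	return os.Remove(b.file.Name())
}
```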
- Resolved by Nick Thomas
```diff
 	}
 	defer helper.CleanUpProcessGroup(gitShowCmd)
 
-	w.Header().Set("Content-Length", strings.TrimSpace(string(sizeOutput)))
-	if _, err := io.Copy(w, stdout); err != nil {
+	blobWriter, err := looseblob.NewWriter(params.RepoPath, params.BlobId, sizeInt64)
```

But if it doesn't exist, we unconditionally write one. Even if 100 concurrent requests are already doing the same thing.
When downloads of a large, new blob start, all the requests that start before the first request completes will write a temporary file of the same size as the compressed blob. This could result in large disc usage spikes, and is a lot of unnecessary work.
Can we skip this step for all requests except for the first one? We just need to keep track of which looseblobs are in the process of being written, and skip the writing step (or swap in a NullWriter) if our looseblob is being written at the moment.
> Can we skip this step for all requests except for the first one?
We could but only within a single gitlab-workhorse process; so on gitlab.com with 20 gitlab-workhorse processes the blob may end up being written 20x concurrently. That may still be worth it.
I like the idea of the NullWriter.
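A minimal sketch of that combination (in-process tracking plus a NullWriter), assuming the `NewWriter` signature visible in this MR; `writerFor`, `inProgress`, and the use of `ioutil.Discard` as the NullWriter are illustrative only:

```go
package looseblob

import (
	"io"
	"io/ioutil"
	"sync"
)

var (
	inProgressMu sync.Mutex
	inProgress   = make(map[string]bool) // blob IDs currently being written by this process
)

// writerFor hands the first concurrent request the real blob writer and gives
// everyone else a no-op writer, so only one copy of the blob is written per process.
func writerFor(repoPath, blobID string, size int64) (io.Writer, func(), error) {
	inProgressMu.Lock()
	busy := inProgress[blobID]
	if !busy {
		inProgress[blobID] = true
	}
	inProgressMu.Unlock()

	if busy {
		// Someone in this process is already writing this blob: skip the duplicate work.
		return ioutil.Discard, func() {}, nil
	}

	done := func() {
		inProgressMu.Lock()
		delete(inProgress, blobID)
		inProgressMu.Unlock()
	}

	w, err := NewWriter(repoPath, blobID, size) // the real writer from this MR, assumed to satisfy io.Writer
	if err != nil {
		done()
		return nil, nil, err
	}
	return w, done, nil
}
```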
Ah of course, it's a shared filesystem and there's a workhorse per server. Limiting the maximum disc usage to n_{workers} rather than n_{requests} is better, but still quite scary to me.
My pathological case is someone with a git repository containing just one blob, that takes up the whole allocated space they have, abusing GitLab.com as a CDN for it. They make a new release, a few hundred subscribers immediately begin downloading it. It's a little far-fetched, but it shouldn't result in excessive disc use in temporary files.
We could get it back down to 1 instead of n_{workers} by enhancing the per-request locks inside workhorse with an advisory `fcntl()` lock. It works on NFS, and it works on ceph, and would allow the workhorse processes on different machines to ensure that at-most-one looseblob is being created across the whole cluster for a given blob.

If a workhorse gets into a bad state and leaves the lock hanging, it will break caching for that blob, but it won't result in the blob being unavailable to clients.
Now that you mention it, we could `syscall.Flock` indeed. I think it would even be fewer LOC than a mutex-protected hash tracking blobs being written out.

I am mentioning flock because we already rely on it in gitlab-shell/lib/gitlab_keys.rb, and because it would simultaneously provide a lock inside the process (between goroutines) and between processes (on different NFS clients). If we use fcntl (which locks per PID) we would also need an in-process locking mechanism.
BTW I appreciate having this discussion!
Hah, I'm out of date!

> In Linux kernels up to 2.6.11, flock() does not lock files over NFS

It's supported by ceph too: http://tracker.ceph.com/issues/23

However, to my knowledge, `flock()` is still a per-process, not per-thread, lock: http://stackoverflow.com/questions/9462532/multiple-threads-able-to-get-flock-at-the-same-time

Workhorse will default to running its goroutines across four OS threads, so we'll still need a mutex around acquiring `flock()`.

First of all, flock's interface uses file descriptors. From what I read flock applies to 'file descriptors modulo dup'. So if you open a file, flock the corresponding file descriptor, then dup the fd, then you cannot lock the second fd because it points to the same open file. But if you open the file a second time you get an fd you can flock / unlock independently from an existing flock.
As long as each goroutine does its own `file := os.Open(theLockFile)`, they should be able to synchronize via `syscall.Flock(file.Fd(), options)` without needing any mutexes in the process. A mutex would probably be more efficient though, but I would not like the duplication of effort of having flock and a mutex. :)

Relevant parts of the manpage:
> Flock() applies or removes an advisory lock on the file associated with the file descriptor fd
>
> Locks are on files, not file descriptors. That is, file descriptors duplicated through dup(2) or fork(2) do not result in multiple instances of a lock, but rather multiple references to a single lock.
>
> If a process holding a lock on a file forks and the child explicitly unlocks the file, the parent will lose its lock.

and

> LOCK_EX Place an exclusive lock. Only one process may hold an exclusive lock for a given file at a given time.
From this: locks are per-file, and per-process. A process, not a thread, holds the lock for the file. I'll put together an example.
Interesting. Looks like I'm wrong:
```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"sync"
	"syscall"
)

func tryFlock(file *os.File) error {
	return syscall.Flock(int(file.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
}

func tryLocks(file1, file2 *os.File, where string) {
	if err := tryFlock(file1); err != nil {
		fmt.Printf("Nick is wrong: locking again on the same fd on %s raised error: %s\n", where, err)
	}
	if err := tryFlock(file2); err != nil {
		fmt.Printf("Nick is wrong: locking on a different fd on %s raised error: %s\n", where, err)
	}
}

func main() {
	// Lock the main() func to a specific OS thread, for sanity
	runtime.LockOSThread()

	created, _ := os.Create("test")
	created.Close()

	// Create a file to lock
	file1, err := os.OpenFile("test", os.O_RDWR, 0644)
	if err != nil {
		fmt.Printf("Weird: %s\n", err)
		os.Exit(1)
	}

	// Get a second file descriptor for it
	file2, err := os.OpenFile("test", os.O_RDWR, 0644)
	if err != nil {
		fmt.Printf("Weird: %s\n", err)
		os.Exit(1)
	}

	// Acquire a per-process exclusive lock. This should succeed
	if err := tryFlock(file1); err != nil {
		fmt.Printf("Weird: %s\n", err)
		os.Exit(1)
	}

	// Now let's try to acquire the lock again. This will succeed because it's a
	// per-process lock. Doesn't matter which fd we use.
	tryLocks(file1, file2, "the same OS thread")

	// Now let's try locking on a different OS thread. The flock calls will
	// succeed there too, because it's a per-process lock
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		runtime.LockOSThread()
		tryLocks(file1, file2, "a different OS thread")
		wg.Done()
	}()
	wg.Wait()
}
```
Gives the output:
```
Nick is wrong: locking on a different fd on the same OS thread raised error: resource temporarily unavailable
Nick is wrong: locking on a different fd on a different OS thread raised error: resource temporarily unavailable
```
It does this on mac and linux both, although I've not tested it on top of nfs or ceph.
Clearly I need to go away and read the manpage and/or kernel source more carefully!
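For reference, a minimal sketch of the per-open-file flock approach this discussion converged on; `lockBlob` and the lock file path handling are illustrative, not part of this MR:

```go
package looseblob

import (
	"os"
	"syscall"
)

// lockBlob takes an exclusive, non-blocking flock for a given loose blob.
// Because each caller opens its own file descriptor, this coordinates both
// goroutines in one workhorse and workhorse processes on other NFS clients.
// It returns ok=false if another writer already holds the lock.
func lockBlob(lockPath string) (unlock func(), ok bool, err error) {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0644)
	if err != nil {
		return nil, false, err
	}

	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		if err == syscall.EWOULDBLOCK {
			return nil, false, nil // someone else is writing this blob
		}
		return nil, false, err
	}

	unlock = func() {
		syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
		f.Close()
	}
	return unlock, true, nil
}
```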
- Resolved by Jacob Vosmaer (GitLab)
- Resolved by Jacob Vosmaer (GitLab)
- Resolved by Jacob Vosmaer (GitLab)
@jacobvosmaer-gitlab first review done. If we can get the disc space worst case down to double the allocated repository size, I think this should be fine
assigned to @jacobvosmaer-gitlab
> In the worst-case scenario, a project could use double its allowance.
Actually, if I take the gitlab-ce repo as an example: in a fully (pretty much optimally) packed state it uses 160MB of disk space, and unpacking all its blobs into loose objects adds 1.5G (on top of the packfile, which the code in this MR would not remove).
So let's call it 10x more disk space in the worst case.
Regarding loose blob clutter: for repositories with regular push activity this is 'fixed' in the sense that those will incur `git repack` and `git gc` from gitlab-ce, which remove all loose blobs. The only problem is with repositories with infrequent or no push activity. These could be unpacked completely by successive requests.

My idea for handling this was:
- when handling blob read requests randomly spawn a 'check job' with probability 1:1000 (so most blob read requests will not spawn the job)
- the 'check job' counts the number of (regular) files in the objects/ directory; if it finds more than say 1000 it runs `git repack`
In essence, treat 1/1000 of read requests as a Git push and clean up accordingly.
Now that I write this, perhaps the simplest thing we can do is to increment the 'pushes since gc' counter in lib/gitlab/workhorse.rb when preparing a 'send blob' response for 1 in 1000 requests.
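A minimal sketch of the 'check job' idea, assuming a helper invoked from the blob-serving path; the name `maybeRepack`, the 1:1000 probability, the 1000-file threshold, and the `git repack` flags are all illustrative:

```go
package looseblob

import (
	"io/ioutil"
	"math/rand"
	"os/exec"
	"path"
)

// maybeRepack spawns the 'check job' on roughly 1 in 1000 calls: count the
// loose objects and run 'git repack' if too many have accumulated.
func maybeRepack(repoPath string) error {
	if rand.Intn(1000) != 0 {
		return nil // most blob read requests do nothing extra
	}

	looseObjects := 0
	objectsDir := path.Join(repoPath, "objects")
	subdirs, err := ioutil.ReadDir(objectsDir)
	if err != nil {
		return err
	}
	for _, d := range subdirs {
		// Loose objects live in two-character fan-out directories (objects/ab/...).
		if !d.IsDir() || len(d.Name()) != 2 {
			continue
		}
		entries, err := ioutil.ReadDir(path.Join(objectsDir, d.Name()))
		if err != nil {
			return err
		}
		looseObjects += len(entries)
	}

	if looseObjects > 1000 {
		return exec.Command("git", "-C", repoPath, "repack", "-A", "-d").Run()
	}
	return nil
}
```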
added 2 commits
On second thought, a system where reads slowly increment the 'pushes since gc' counter would show unwanted behavior when a single blob in a repository is requested excessively often (which is one of the cases we want to deal with). So I think we cannot get around the 'measured' response of first counting files and only clearing the blob cache if it has too many entries.
So we have two fundamental improvements that this MR currently lacks:
- avoiding massively concurrent looseblob.Writer instances
- avoiding unrestricted loose blob accumulation on repos with not enough push activity to trigger `git repack`

The question is, do we want either or both of these from the start, or can we ship without them in the first iteration?
cc @pcarranza
I agree @jacobvosmaer-gitlab, let's learn about this first.
changed milestone to %8.16
mentioned in issue gitaly#14 (closed)