
WIP Loose blob cache

Closed Jacob Vosmaer (GitLab) requested to merge cat-file-cache into master
2 unresolved threads

Closes https://gitlab.com/gitlab-org/gitlab-workhorse/issues/80

Use loose Git blobs, when present, to make /raw/ blob requests faster. Create loose blobs on the fly if they do not already exist.

Activity

  • @jacobvosmaer-gitlab maybe you can use git hash-object -t blob FILE and check that it gives the same loose object file as what your blob writer creates.

    Edited by username-removed-138401
  • @chriscool would that detect corruption?

    I found the following. If I have a loose object that also exists in a packfile, and I corrupt the loose object, then git fsck runs without errors :(. But if I move the packfile out of objects/pack, run git unpack-objects < packfile, and then run git fsck, the corrupt loose object makes git fsck fail. I think this is because unpack-objects skips existing loose object files without checking whether they are valid.

    So my thinking for testing the validity now is:

    • run git gc in test repo
    • verify loose object does not exist
    • trigger loose object creation
    • verify loose object exists (i.e. was created)
    • move packfile out of objects/pack
    • unpack packfile
    • run git fsck
  • Sanity checks I would like:

    • blob ID is a 40-character hex string (so no / etc).
    • BlobWriter checks the number of bytes written and errors out in Finalize() if the size does not match the object header (see the sketch after this comment).
    Edited by Jacob Vosmaer (GitLab)
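
    A minimal sketch of those two checks, assuming a hypothetical looseblob helper (the names validateBlobId, checkSize and the error messages are illustrative, not the code in this MR):

      package looseblob

      import (
      	"fmt"
      	"regexp"
      )

      // A blob ID must be a full 40-character SHA-1 hex string, so it can
      // never smuggle path separators or ".." into the loose object path.
      var blobIdRegex = regexp.MustCompile(`\A[0-9a-f]{40}\z`)

      func validateBlobId(blobId string) error {
      	if !blobIdRegex.MatchString(blobId) {
      		return fmt.Errorf("looseblob: invalid blob ID %q", blobId)
      	}
      	return nil
      }

      // checkSize would run in Finalize(): the number of bytes actually
      // written must match the size announced in the "blob <size>\x00" header.
      func checkSize(expected, written int64) error {
      	if written != expected {
      		return fmt.Errorf("looseblob: size mismatch: header says %d, wrote %d bytes", expected, written)
      	}
      	return nil
      }
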
  • @jacobvosmaer-gitlab yes I think what I suggest would detect corruption unless your zlib compression is different from the one used by Git. But you should be able to use the same zlib compression.

    About fsck behavior, I think that it uses objects from loose object files if there are loose object files. That is similar to what other commands do. But I don't think you need to use fsck to detect corruption. In my opinion, just comparing bytes from a loose object file generated by your BlobWriter with bytes generated by git hash-object is simpler and more thorough.

    If you don't like using git hash-object, you could also unpack a packfile in a separate directory and compare the loose object files generated by unpacking with the loose object files generated by your BlobWriter.

    Edited by username-removed-138401
  • @chriscool my impression so far has been that the code in BlobWriter does not create the exact same object file, probably because it uses a different zlib implementation (the Go standard library in this MR vs the zlib C library in Git, I presume). It doesn't feel like a good idea to test for equality of the compressed bytes when what matters is that:

    • Git can read the object file
    • the uncompressed bytes are equal
  • Maybe you could compare the uncompressed files. On Debian-based distributions, you can use zlib-flate to uncompress (-uncompress) or compress (-compress) zlib files. If zlib-flate can properly uncompress the files, git should be able to do it too.

  • Right. I could (see the sketch below):

    • start with git gc
    • trigger creation of a known loose blob by BlobWriter
    • read the zlib-decompressed contents of that blob
    • compare with a precomputed value (no need to do a git unpack-objects dance on each test run to retrieve the same object each time)

    Thanks, that sounds better. :)
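
    A minimal sketch of that comparison, using compress/zlib from the Go standard library (the helper name and the repo path handling are illustrative, not the MR's actual test code):

      package looseblob

      import (
      	"bytes"
      	"compress/zlib"
      	"fmt"
      	"io/ioutil"
      	"os"
      	"path/filepath"
      )

      // checkLooseBlob zlib-decompresses the loose object for blobId and
      // compares it with the expected blob contents.
      func checkLooseBlob(repoPath, blobId string, expected []byte) error {
      	// Loose objects live at objects/<first 2 hex chars>/<remaining 38 chars>.
      	objectPath := filepath.Join(repoPath, "objects", blobId[0:2], blobId[2:])

      	file, err := os.Open(objectPath)
      	if err != nil {
      		return fmt.Errorf("open loose object: %v", err)
      	}
      	defer file.Close()

      	zr, err := zlib.NewReader(file)
      	if err != nil {
      		return fmt.Errorf("zlib reader: %v", err)
      	}
      	defer zr.Close()

      	raw, err := ioutil.ReadAll(zr)
      	if err != nil {
      		return fmt.Errorf("decompress: %v", err)
      	}

      	// The decompressed stream is "blob <size>\x00" followed by the contents.
      	want := append([]byte(fmt.Sprintf("blob %d\x00", len(expected))), expected...)
      	if !bytes.Equal(raw, want) {
      		return fmt.Errorf("loose object %s does not match expected contents", blobId)
      	}
      	return nil
      }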

  • added 1 commit

    Compare with previous version

  • added 1 commit

    • 3254171a - Push blob code into looseblob.go

    Compare with previous version

  • added 1 commit

    Compare with previous version

  • added 1 commit

    • 2294ad31 - Create internal/git/looseblob package

    Compare with previous version

  • Jacob Vosmaer (GitLab) resolved all discussions

  • added 1 commit

    • 850d8824 - Add size and SHA1 checks in Finalize()

    Compare with previous version

  • Jacob Vosmaer (GitLab) unmarked as a Work In Progress

  • Jacob Vosmaer (GitLab) changed title from WIP Prototype loose blob cache to Loose blob cache

  • changed milestone to %8.15

  • OK, so git gc removes the loose object files being created by this MR, but this will increase disc usage for repos with lots of large binary blobs being accessed. In the worst-case scenario, a project could use double its allowance.

    Do we have any numbers on how this affects CPU and RAM usage?

  • Excerpt from the diff (the new checks in Finalize()):

      sum := b.hasher.Sum([]byte{})
      for i := range sum {
      	if sum[i] != b.blobSha[i] {
      		return fmt.Errorf("gitBlobWriter: SHA1 mismatch: expected %x, got %x", b.blobSha, sum)
      	}
      }

      if err := b.zlibWriter.Close(); err != nil {
      	return fmt.Errorf("gitBlobWriter: close zlib writer: %v", err)
      }

      if err := b.file.Close(); err != nil {
      	return fmt.Errorf("gitBlobWriter: close tempfile: %v", err)
      }

      if err := os.Link(b.file.Name(), b.Path()); err != nil && !os.IsExist(err) {
    • OK, so this is our concurrency control. It creates a hard link from the expected loose blob filename to the temporary file, and the tempfile is removed in the Writer#Close() call.

      The order of calling is Finalize(), then Close().

      The error messages (if any) from Close() are thrown away. The Finalize() ones will result in an error being logged to raven, but won't interrupt completion of the client request.

      It looks like the worst effect of highly concurrent downloads here would be that disc space will be consumed to a maximum of n*blobsize for a short period of time, and if there's a race on the Link call, some harmless errors will go to the log.

    • if there's a race on the Link call, some harmless errors will go to the log.

      Actually, this is a known problem, we use the same mechanism when caching git archive requests. Note the && !os.IsExist(err) on the if. :)

    • The way we dealt with concurrent cache writes (which race each other in a safe way but waste disk write IO) in the git archive case is to just let it happen. We don't have metrics on how often this scenario occurs.

    • These cases have strong analogies with the NGINX temporarily-buffer-requests-to-disc behaviour, which is probably why I'm picking up on them at all! Assuming disc space is unlimited can cause problems for us and our customers.

  • Nick Thomas
  • Excerpt from the diff:

        }
        defer helper.CleanUpProcessGroup(gitShowCmd)

    -   w.Header().Set("Content-Length", strings.TrimSpace(string(sizeOutput)))
    -   if _, err := io.Copy(w, stdout); err != nil {
    +   blobWriter, err := looseblob.NewWriter(params.RepoPath, params.BlobId, sizeInt64)
    • But if it doesn't exist, we unconditionally write one. Even if 100 concurrent requests are already doing the same thing.

      When downloads of a large, new blob start, all the requests that start before the first request completes will write a temporary file of the same size as the compressed blob. This could result in large disc usage spikes, and is a lot of unnecessary work.

      Can we skip this step for all requests except for the first one? We just need to keep track of which looseblobs are in the process of being written, and skip the writing step (or swap in a NullWriter) if our looseblob is being written at the moment.

    • Can we skip this step for all requests except for the first one?

      We could but only within a single gitlab-workhorse process; so on gitlab.com with 20 gitlab-workhorse processes the blob may end up being written 20x concurrently. That may still be worth it.

      I like the idea of the NullWriter.
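
      A minimal per-process sketch of that idea (the names inFlight, writerFor and nopWriteCloser are made up for illustration; this is not code from the MR):

        package looseblob

        import (
        	"io"
        	"io/ioutil"
        	"sync"
        )

        var (
        	inFlightMutex sync.Mutex
        	inFlight      = make(map[string]bool)
        )

        // writerFor hands a real writer to the first request for a given blob
        // and a discarding writer to everyone else. The returned done() must be
        // called when the write finishes (or fails) so later requests can retry.
        func writerFor(blobId string, newWriter func() (io.WriteCloser, error)) (io.WriteCloser, func(), error) {
        	inFlightMutex.Lock()
        	defer inFlightMutex.Unlock()

        	if inFlight[blobId] {
        		// Another goroutine in this process is already writing this blob.
        		return nopWriteCloser{ioutil.Discard}, func() {}, nil
        	}

        	w, err := newWriter()
        	if err != nil {
        		return nil, nil, err
        	}
        	inFlight[blobId] = true
        	done := func() {
        		inFlightMutex.Lock()
        		delete(inFlight, blobId)
        		inFlightMutex.Unlock()
        	}
        	return w, done, nil
        }

        type nopWriteCloser struct{ io.Writer }

        func (nopWriteCloser) Close() error { return nil }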

    • Ah of course, it's a shared filesystem and there's a workhorse per server. Limiting the maximum disc usage to n_{workers} rather than n_{requests} is better, but still quite scary to me.

      My pathological case is someone with a git repository containing just one blob, that takes up the whole allocated space they have, abusing GitLab.com as a CDN for it. They make a new release, a few hundred subscribers immediately begin downloading it. It's a little far-fetched, but it shouldn't result in excessive disc use in temporary files.

      We could get it back down to 1 instead of n_{workers} by enhancing the per-request locks inside workhorse with an advisory fcntl() lock. It works on NFS, and it works on Ceph, and would allow the workhorse processes on different machines to ensure that at most one looseblob is being created across the whole cluster for a given blob.

      If a workhorse gets into a bad state and leaves the lock hanging, it will break caching for that blob, but it won't result in the blob being unavailable to clients.

    • Now that you mention it, we could syscall.Flock indeed. I think it would even be fewer LOC than a mutex-protected hash tracking blobs being written out.

      I am mentioning flock because we already rely on it in gitlab-shell/lib/gitlab_keys.rb, and because it would simultaneously provide a lock inside the process (between goroutines) and between processes (on different NFS clients). If we use fcntl (which locks per PID) we would also need an in-process locking mechanism.

      BTW I appreciate having this discussion!

    • Hah, I'm out of date! flock() does not lock files over NFS only in Linux kernels up to 2.6.11; since 2.6.12, NFS clients emulate it with byte-range locks. It's supported by Ceph too: http://tracker.ceph.com/issues/23

      However, to my knowledge, flock() is still a per-process, not per-thread, lock: http://stackoverflow.com/questions/9462532/multiple-threads-able-to-get-flock-at-the-same-time

      Workhorse will default to running its goroutines across four OS threads, so we'll still need a mutex around acquiring flock().

    • First of all, flock's interface uses file descriptors. From what I read flock applies to 'file descriptors modulo dup'. So if you open a file, flock the corresponding file descriptor, then dup the fd, then you cannot lock the second fd because it points to the same open file. But if you open the file a second time you get an fd you can flock / unlock independently from an existing flock.

      As long as each goroutine does its own file := os.Open(theLockFile), they should be able to synchronize via syscall.Flock(file.Fd(), options) without needing any mutexes in the process. A mutex would probably be more efficient, but I would not like the duplication of effort of having both flock and a mutex. :)

    • Relevant parts of the manpage:

      Flock() applies or removes an advisory lock on the file associated with the file descriptor fd

      Locks are on files, not file descriptors. That is, file descriptors duplicated through dup(2) or fork(2) do not result in multiple instances of a lock, but rather multiple references to a single lock.

      If a process holding a lock on a file forks and the child explicitly unlocks the file, the parent will lose its lock.

      and

      LOCK_EX Place an exclusive lock. Only one process may hold an exclusive lock for a given file at a given time.

      From this: locks are per-file, and per-process. A process, not a thread, holds the lock for the file. I'll put together an example.

    • Interesting. Looks like I'm wrong:

      package main
      
      import (
      	"fmt"
      	"os"
      	"runtime"
      	"sync"
      	"syscall"
      )
      
      func tryFlock(file *os.File) error {
      	return syscall.Flock(int(file.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
      }
      
      func tryLocks(file1, file2 *os.File, where string) {
      	if err := tryFlock(file1); err != nil {
      		fmt.Printf("Nick is wrong: locking again on the same fd on %s raised error: %s\n", where, err)
      	}
      
      	if err := tryFlock(file2); err != nil {
      		fmt.Printf("Nick is wrong: locking on a different fd on %s raised error: %s\n", where, err)
      	}
      }
      
      func main() {
      	// Lock the main() func to a specific OS thread, for sanity
      	runtime.LockOSThread()
      
      	// Create the file we are going to lock
      	created, _ := os.Create("test")
      	created.Close()

      	// Open it a first time
      	file1, err := os.OpenFile("test", os.O_RDWR, 0644)
      	if err != nil {
      		fmt.Printf("Weird: %s\n", err)
      		os.Exit(1)
      	}
      
      	// Get a second file descriptor for it
      	file2, err := os.OpenFile("test", os.O_RDWR, 0644)
      	if err != nil {
      		fmt.Printf("Weird: %s\n", err)
      		os.Exit(1)
      	}
      
      	// Acquire a per-process exclusive lock. This should succeed
      	if err := tryFlock(file1); err != nil {
      		fmt.Printf("Weird: %s\n", err)
      		os.Exit(1)
      	}
      
      	// Now let's try to acquire the lock again. This will succeed because it's a
      	// per-process lock. Doesn't matter which fd we use.
      	tryLocks(file1, file2, "the same OS thread")
      
      	// Now let's try locking on a different OS thread. The flock calls will
      	// succeed there too, because it's a per-process lock
      	var wg sync.WaitGroup
      	wg.Add(1)
      	go func() {
      		runtime.LockOSThread()
      		tryLocks(file1, file2, "a different OS thread")
      		wg.Done()
      	}()
      	wg.Wait()
      }

      Gives the output:

      Nick is wrong: locking on a different fd on the same OS thread raised error: resource temporarily unavailable
      Nick is wrong: locking on a different fd on a different OS thread raised error: resource temporarily unavailable

      It does this on both macOS and Linux, although I've not tested it on top of NFS or Ceph.

      Clearly I need to go away and read the manpage and/or kernel source more carefully!

    • I guess when they say 'file' they're referring to the structure in the kernel that the fd refers to, rather than the file on disc.

    • Lol at the 'Nick is wrong' log messages :)
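
      Given the demonstration above (flock on separately opened descriptors excludes both other goroutines and other processes sharing the filesystem), a per-blob guard could look roughly like this. This is only a sketch; lockBlob, the lock-file path and the skip-on-contention behaviour are illustrative, not code from this MR:

        package looseblob

        import (
        	"os"
        	"syscall"
        )

        // lockBlob opens a lock file next to the loose blob and takes a
        // non-blocking exclusive flock on it. If the lock is already held, by
        // another goroutine or by a workhorse process on another machine, it
        // returns ok=false and the caller should skip writing the loose blob.
        func lockBlob(lockPath string) (unlock func(), ok bool, err error) {
        	file, err := os.OpenFile(lockPath, os.O_RDWR|os.O_CREATE, 0644)
        	if err != nil {
        		return nil, false, err
        	}

        	if err := syscall.Flock(int(file.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
        		file.Close()
        		if err == syscall.EWOULDBLOCK {
        			return nil, false, nil // someone else is writing this blob
        		}
        		return nil, false, err
        	}

        	unlock = func() {
        		// Closing the file releases the flock; the leftover lock file is harmless.
        		file.Close()
        	}
        	return unlock, true, nil
        }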

  • Nick Thomas
  • @jacobvosmaer-gitlab first review done. If we can get the disc space worst case down to double the allocated repository size, I think this should be fine :thumbsup:

  • Thanks for the review Nick! I will respond to your comments.

  • In the worst-case scenario, a project could use double its allowance.

    Actually, if I take the gitlab-ce repo as an example: in a fully (pretty much optimally) packed state it uses 160MB of disk space, and unpacking all its blobs into loose objects adds 1.5G (on top of the packfile, which the code in this MR would not remove).

    So let's call it 10x more disk space in the worst case.

  • Regarding loose blob clutter: for repositories with regular push activity this is 'fixed' in the sense that those will incur git repack and git gc from gitlab-ce, which remove all loose blobs. The only problem is repositories with infrequent or no push activity. These could be unpacked completely by successive requests.

    My idea for handling this was:

    • when handling blob read requests randomly spawn a 'check job' with probability 1:1000 (so most blob read requests will not spawn the job)
    • the 'check job' counts the number of (regular) files in the objects/ directory; if it finds more than say 1000 it runs git repack

    In essence, treat 1/1000 of read requests as a Git push and clean up accordingly.

    Now that I write this, perhaps the simplest thing we can do is to increment the 'pushes since gc' counter in lib/gitlab/workhorse.rb when preparing a 'send blob' response, for 1 in 1000 requests.

  • The downside of re-using the 'pushes since gc' mechanism is that you automatically clear the loose blob cache whether that's warranted or not, and that once in a while you trigger a full git gc which is relatively expensive. Still, it is appealing because it is so simple.

  • added 2 commits

    Compare with previous version

  • On second thought, a system where reads slowly increment the 'pushes since gc' counter would show unwanted behavior when a single blob in a repository is requested excessively often (which is one of the cases we want to deal with). So I think we cannot get around the 'measured' response of first counting files and only clearing the blob cache if it has too many entries.
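
    A minimal sketch of that 'measured' check job (the probability, threshold and helper name are illustrative; none of this is in the MR):

      package looseblob

      import (
      	"math/rand"
      	"os"
      	"os/exec"
      	"path/filepath"
      )

      const (
      	checkProbability = 0.001 // roughly 1 in 1000 blob read requests
      	maxLooseObjects  = 1000
      )

      // maybeCleanup lets only a small fraction of read requests pay the cost
      // of counting, and repacks only when the loose object count is large.
      func maybeCleanup(repoPath string) error {
      	if rand.Float64() >= checkProbability {
      		return nil
      	}

      	count := 0
      	objectsDir := filepath.Join(repoPath, "objects")
      	filepath.Walk(objectsDir, func(p string, info os.FileInfo, err error) error {
      		// Count regular files outside objects/pack as loose objects.
      		if err == nil && info.Mode().IsRegular() && filepath.Base(filepath.Dir(p)) != "pack" {
      			count++
      		}
      		return nil
      	})

      	if count <= maxLooseObjects {
      		return nil
      	}

      	return exec.Command("git", "--git-dir="+repoPath, "repack", "-d").Run()
      }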

  • So we have two fundamental improvements that this MR currently lacks:

    • concurrency control, so a given loose blob is written at most once at a time
    • a cleanup mechanism, so loose blobs do not accumulate without bound

    The question is, do we want either or both of these from the start, or can we ship without them in the first iteration.

    cc @pcarranza

    Edited by Jacob Vosmaer (GitLab)
  • My proposed approach is to add metrics (cache hits, misses, bytes written to disk, (??) bytes read from disk) and not cover all our bases in the first iteration.

  • Yeah, I think we should really add metrics first. It would help make sure that using loose object files as cache is worth it in the first place.

  • I agree @jacobvosmaer-gitlab let's learn about this first.

  • If we declare it an experiment we should maybe also have an on/off switch, defaulting to off.

  • I don't think we can have this in 8.15.

  • changed milestone to %8.16

  • Jacob Vosmaer (GitLab) marked as a Work In Progress

  • We will build this in gitaly instead. No milestone, it will come up when it comes up.
