
WIP: Prune unreferenced Git LFS objects

James EJ requested to merge jej/prune_unreferenced_git_lfs_objects into master

What

Tracks LFS pointer blobs and uses these to remove LFS objects which are no longer referenced
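
For reference, an LFS pointer blob is the small text file Git stores in place of the real content, per the Git LFS spec (example values):

```
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
```

The `oid` names the LFS object in storage, so finding every pointer blob tells us exactly which LfsObjects are still referenced.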

Why

When all references to a large file have been removed, the corresponding LFS object shouldn't be kept around

ReferenceChange approach

On push:

  1. Store the name and newrev of the updated ref (e.g. `000->abc master`)
  2. Schedule a worker to find new LFS pointers from that change (sketched below)
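
A minimal sketch of these two steps. `RecentLfsPush` and the method shape are assumptions based on the todo list below; `UpdateLfsPointersWorker` is named in the open discussions:

```ruby
# Sketch only: runs as part of post-receive handling. `RecentLfsPush`
# is the hypothetical model backing the stored reference changes.
def process_lfs_ref_change(project, ref, newrev)
  # 1. Record the ref name and its new head so the worker can later
  #    compute "what is new" with rev-list.
  RecentLfsPush.create!(project: project, ref: ref, newrev: newrev)

  # 2. Kick off asynchronous LFS pointer discovery for this project.
  UpdateLfsPointersWorker.perform_async(project.id)
end
```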

Update LfsPointer worker:

  • Find new blobs by using rev-list to identify blobs in the branch which are not included in already-processed refs (see the sketch after this list).
    • Limit the list of processed refs to the latest for each ref name, and to the N latest overall, to avoid handling thousands of refs.
    • Clean up entries past the 100 most recent to avoid the table becoming too large
  • Find new LFS pointers from those blobs and store them in the database
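
A sketch of that discovery step. It shells out to `git rev-list` for clarity, whereas the real code would go through Gitaly; `RecentLfsPush`, the `unprocessed`/`processed` scopes, and the `batch_lfs_metadata` return shape are assumptions based on this MR's terminology:

```ruby
class UpdateLfsPointersWorker
  include Sidekiq::Worker

  PROCESSED_REFS_LIMIT = 100

  def perform(project_id)
    project = Project.find(project_id)

    project.recent_lfs_pushes.unprocessed.find_each do |push|
      # Objects reachable from the pushed rev but from none of the
      # already-processed revs: shared history is eliminated, so each
      # blob is inspected at most once.
      exclusions = processed_revs(project).flat_map { |rev| %W[--not #{rev}] }
      output, _status = Gitlab::Popen.popen(
        %W[git rev-list --objects #{push.newrev}] + exclusions,
        project.repository.path_to_repo
      )

      # `rev-list --objects` emits commits, trees and blobs; a real
      # implementation would filter to blobs (e.g. via cat-file).
      candidate_ids = output.lines.map { |line| line.split(' ').first }

      # Keep only the blobs that parse as LFS pointers and persist them.
      Gitlab::Git::Blob.batch_lfs_metadata(project.repository, candidate_ids).each do |ptr|
        project.lfs_pointers.find_or_create_by!(blob_oid: ptr.id, lfs_oid: ptr.lfs_oid)
      end

      push.update!(processed: true)
    end
  end

  private

  # Latest processed newrev per ref name, capped at the N most recent.
  def processed_revs(project)
    project.recent_lfs_pushes.processed
           .order(id: :desc).limit(PROCESSED_REFS_LIMIT).pluck(:newrev)
  end
end
```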

On GC:

  1. Ignore projects with reference changes to process
  2. Ignore projects which haven't had existing pointers processed
  3. Remove LfsPointers which no longer exist in the project
  4. Remove LfsObjectProjects/LfsObjects which are no longer referenced by pointers (see the sketch below)
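
A sketch of those GC-time checks. Model names follow this MR; the scopes, `existing_lfs_pointers_processed?`, and the Rugged call are assumptions:

```ruby
# Runs only for projects with no pending pointer-discovery work.
def prune_unreferenced_lfs_objects(project)
  # 1 & 2: skip projects that still have unprocessed reference changes,
  # or whose pre-existing pointers have not been scanned yet.
  return if project.recent_lfs_pushes.unprocessed.exists?
  return unless project.existing_lfs_pointers_processed?

  # 3: drop pointer records whose blobs no longer exist in the repo
  # (e.g. after branch deletion and git gc).
  stale = project.lfs_pointers.reject do |pointer|
    project.repository.rugged.exists?(pointer.blob_oid)
  end
  project.lfs_pointers.where(id: stale.map(&:id)).delete_all

  # 4: unlink LfsObjects that no remaining pointer references; objects
  # linked to no project at all can then be removed from storage.
  referenced_oids = project.lfs_pointers.pluck(:lfs_oid)
  project.lfs_objects_projects
         .joins(:lfs_object)
         .where.not(lfs_objects: { oid: referenced_oids })
         .destroy_all
end
```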

Rationale

The main trade-off is memory and database space vs. extra blob lookups on the NFS disk.

Storing the list of processed refs allows us to eliminate blobs in commits which have already been checked for LFS pointers. This also works for objects introduced by similar commits, as any objects introduced by both C and C' can be eliminated by `rev-list C' --not C --objects`. When a new branch is pushed, only new objects are checked for the same reason.

This approach guarantees both that blob lookups are kept to a minimum and that, by default, all pointers have been found. It holds that if all pushes / ReferenceChanges have been processed, then all LFS pointers have been found, making it safe to delete those which are in the database but no longer on disk. Without this guarantee there could be LfsObjects in a project for which we have found one pointer but not another, and we could end up deleting LfsObjects which are still referenced by unfound pointers.

OldRev NewRev approach

Instead of storing reference changes we could schedule the worker by passing `oldrev` and `newrev`. On the initial push of a branch we'd check whether it was the `default_branch`, and scan all objects reachable from `newrev` for that branch, or all objects reachable but not included in that branch for other pushes. We might need to track whether the default branch has been scanned in this approach, as well as being extremely careful with changes to the default branch. Failure to do so could result in additional pointers not having been added to the database, and consequently cause the LfsObject to be deleted when one known pointer is removed.

This approach would be simpler, but can end up rescanning blobs for every push in some cases. See https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14479#note_42252165

Additionally, a mechanism would be needed to prevent garbage collection / cleanup while there are unprocessed pushes. This could be an exclusive lock around the push, or temporarily storing `oldrev`/`newrev` in the database in a similar manner to the `ReferenceChange` approach. The latter uses the database as a queue and would delete old records once processed. Either way, we'd need to be 100% sure that these actually get processed, otherwise we'd end up deleting LFS objects that are still referenced.
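
For contrast, a sketch of how scheduling might look under this alternative (hypothetical, not what this MR implements; `Gitlab::Git::BLANK_SHA` is the all-zero sha Git sends for a newly created ref):

```ruby
# Pass the revs straight to the worker instead of storing a ReferenceChange.
UpdateLfsPointersWorker.perform_async(project.id, oldrev, newrev)

# Inside the worker, the first push of the default branch would need a
# full scan, while any other push scans only objects not already in it:
revs =
  if oldrev == Gitlab::Git::BLANK_SHA && ref == project.default_branch
    [newrev]                                    # everything reachable
  else
    [newrev, '--not', project.default_branch]   # only new objects
  end
```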

Are there points in the code the reviewer needs to double check?

Todo

  • Find all pointers of first push
  • Clean up RecentLfsPush entries past the 100 most recent to avoid the table becoming too large
  • MySQL support for finding the 100 most recent refs
  • Add indices for columns used in queries
  • Add database checklist
  • Add performance testing plan to description (towards making a case why this won’t perform badly on production)
  • Ping someone for Gitaly review
  • Ping someone for database review
  • Performance review

Things I'll open MR discussions for

  • What happens if multiple pushes occur before/during first run? Would we have multiple workers scanning the whole project? Could we benefit from a lock around UpdateLfsPointersWorker per project?
  • `Gitlab::Git::Blob.batch_lfs_metadata` should bypass Gitaly
  • Better way to get all blobs? Possible to get all blobs within size range?

Performance

TODO

Ideas:

  • Generate 10,000s of LfsObjectProject, etc and test
  • Set up test instance and find memory characteristics
  • Add one LFS object to linux project and test it
  • Lookup current count of LfsObjectProject items to find current scale

Benchmarks

TODO

Database checklist

For added migrations:

  • Updated db/schema.rb
  • Added a down method so the migration can be reverted
  • Added the output of the migration(s) to the MR body
  • Added the execution time of the migration(s) to the MR body
  • Added tests for the migration in spec/migrations if necessary (e.g. when migrating data)
  • Made sure the migration won't interfere with a running GitLab cluster, for example by disabling transactions for long running migrations

For added tables:

  • Ordered columns based on their type sizes in descending order
  • Added foreign keys if necessary
  • Added indexes if necessary
    • Described the need for these indexes in the MR body
    • Made sure existing indexes can not be reused instead

For potentially slow queries:

  • Included the raw SQL queries of the relevant queries
  • Included the output of EXPLAIN ANALYZE and execution timings of the relevant queries

Acceptance criteria

Related

Closes https://gitlab.com/gitlab-org/gitlab-ce/issues/30639
