WIP: Prune unreferenced git LFS objects

James EJ requested to merge jej/prune_unreferenced_git_lfs_objects into master Sep 25, 2017

What

Tracks LFS pointer blobs and uses these to remove LFS objects which are no longer referenced

Why

Once all references to a large file have been removed, the LFS object shouldn't be kept around

ReferenceChange approach

On push:

Store the name and newrev of the updated ref (000->abc master)
Schedule worker to find new LFS pointers from that change
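
A minimal plain-Ruby sketch of the push-time bookkeeping described above. It is not the code in this MR: `ReferenceChangeStore`, `record_push`, and the in-memory array standing in for the job queue are hypothetical names used only for illustration.

```ruby
# Hypothetical in-memory stand-ins for the database table and the job queue.
ReferenceChange = Struct.new(:project_id, :ref, :newrev, :processed)

class ReferenceChangeStore
  attr_reader :changes, :scheduled_jobs

  def initialize
    @changes = []
    @scheduled_jobs = []
  end

  # Called once per updated ref in a push, e.g. "000 -> abc master".
  def record_push(project_id, ref, newrev)
    change = ReferenceChange.new(project_id, ref, newrev, false)
    @changes << change
    # Schedule the (hypothetical) pointer-update worker for this project.
    @scheduled_jobs << [:update_lfs_pointers, project_id]
    change
  end

  # True while the project has reference changes that were not processed yet.
  def pending?(project_id)
    @changes.any? { |c| c.project_id == project_id && !c.processed }
  end
end

store = ReferenceChangeStore.new
store.record_push(1, 'refs/heads/master', 'abc')
```

Recording the change before scheduling the worker means a project with unprocessed pushes can always be detected later, for example by Gc.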

Update LfsPointer worker:

Find new blobs by using rev-list to identify blobs in the branch which are not included in already processed refs.
- Limit list of processed refs to the latest for each ref name and to N latest overall to avoid handling thousands of refs.
- Clean up entries past the 100 most recent to avoid the table becoming too large
Find new LFS pointers from those blobs and store them in the database
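
The ref selection and the resulting `rev-list` invocation could look roughly like this. This is a sketch under assumptions: `ProcessedRef`, `refs_to_exclude`, and `rev_list_args` are hypothetical helpers, and the arguments are only built here, not executed.

```ruby
# Hypothetical record of an already processed ref update.
ProcessedRef = Struct.new(:ref, :rev, :processed_at)

# Keep only the latest entry per ref name, then the `limit` most recent overall.
def refs_to_exclude(processed, limit)
  latest_per_ref = processed
    .group_by(&:ref)
    .map { |_ref, entries| entries.max_by(&:processed_at) }

  latest_per_ref.sort_by(&:processed_at).last(limit).map(&:rev)
end

# Arguments for `git rev-list`: objects reachable from newrev but not from
# any of the already processed revs.
def rev_list_args(newrev, excluded_revs)
  args = ['rev-list', '--objects', newrev]
  args += ['--not', *excluded_revs] unless excluded_revs.empty?
  args
end

processed = [
  ProcessedRef.new('master', 'aaa', 1),
  ProcessedRef.new('master', 'bbb', 3),
  ProcessedRef.new('feature', 'ccc', 2)
]
excluded = refs_to_exclude(processed, 100)
args = rev_list_args('ddd', excluded)
# args => ["rev-list", "--objects", "ddd", "--not", "ccc", "bbb"]
```

Only the newest rev per ref name is kept because older revs of the same ref are normally ancestors of it, so excluding them again adds nothing.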

On Gc:

Skip projects which still have reference changes to process
Skip projects which haven't had existing pointers processed
Remove LfsPointers which no longer exist in the project
Remove LfsObjectProjects/LfsObjects which are no longer referenced by pointers
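
The Gc steps above can be sketched with plain sets. Everything here is a hypothetical stand-in for database and repository state, and it assumes that a pointer "no longer exists in the project" when its blob is gone from the repository.

```ruby
require 'set'

# Returns the LFS oids that are safe to remove, or nil when Gc must be skipped.
# All parameters are hypothetical plain-Ruby stand-ins:
#   pending_changes   - true if reference changes are still waiting to be processed
#   initial_scan_done - true if existing pointers have been processed at least once
#   stored_pointers   - { pointer_blob_sha => lfs_oid } as recorded in the database
#   blobs_in_repo     - Set of blob shas that still exist in the repository
#   project_oids      - Set of LFS oids linked to the project
def prunable_lfs_oids(pending_changes:, initial_scan_done:, stored_pointers:, blobs_in_repo:, project_oids:)
  return nil if pending_changes || !initial_scan_done

  # Drop pointers whose blob no longer exists in the repository.
  live_pointers = stored_pointers.select { |blob_sha, _oid| blobs_in_repo.include?(blob_sha) }

  # Anything not referenced by a surviving pointer can go.
  referenced = live_pointers.values.to_set
  project_oids - referenced
end

prunable = prunable_lfs_oids(
  pending_changes: false,
  initial_scan_done: true,
  stored_pointers: { 'p1' => 'oidA', 'p2' => 'oidA', 'p3' => 'oidB' },
  blobs_in_repo: Set['p2'],
  project_oids: Set['oidA', 'oidB']
)
# prunable => #<Set: {"oidB"}>
```

`oidA` survives because one of its two pointers (`p2`) still exists, which is why pointers are tracked individually rather than per LFS object.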

Rationale

The main trade-off is memory + database space vs extra blob lookups on the NFS disk.

Storing the list of processed refs allows us to eliminate blobs in commits which have already been checked for LFS pointers. This also works for objects introduced by similar commits, as any objects introduced by both C and C' can be eliminated by `rev-list C' --not C --objects`. When a new branch is pushed, only new objects are checked, for the same reason.
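
To illustrate the C / C' case, here is a toy model of what `rev-list C' --not C --objects` computes. The commit graph, `reachable_blobs`, and `new_blobs` are made up for this sketch and only model blobs, not trees or commits.

```ruby
require 'set'

# Toy commit graph: each commit lists its parents and the blobs in its tree.
COMMITS = {
  'A'  => { parents: [],    blobs: %w[readme] },
  'C'  => { parents: ['A'], blobs: %w[readme pointer1] },
  "C'" => { parents: ['A'], blobs: %w[readme pointer1 pointer2] }
}.freeze

# All blobs reachable from rev, walking parents depth-first.
def reachable_blobs(rev, commits)
  seen = Set.new
  blobs = Set.new
  stack = [rev]
  until stack.empty?
    sha = stack.pop
    next unless seen.add?(sha) # nil when the commit was already visited

    commit = commits.fetch(sha)
    blobs.merge(commit[:blobs])
    stack.concat(commit[:parents])
  end
  blobs
end

# Blobs reachable from newrev but from none of the processed revs.
def new_blobs(newrev, processed_revs, commits)
  excluded = processed_revs.reduce(Set.new) { |acc, rev| acc | reachable_blobs(rev, commits) }
  reachable_blobs(newrev, commits) - excluded
end

only_new = new_blobs("C'", ['C'], COMMITS)
# only_new => #<Set: {"pointer2"}>
```

`pointer1` is introduced by both C and C', so once C has been processed it is not looked up again when C' is pushed.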

This approach guarantees both that blob lookups are kept to a minimum, and that all pointers have been found by default. If all pushes / ReferenceChanges have been processed, then all LFS pointers have been found, making it safe to delete those which are in the database but no longer on disk. Without this there could be LfsObjects in a project for which we have found one pointer but not another, and we could end up deleting LfsObjects which are still referenced by unfound pointers.

OldRev NewRev approach

Instead of storing reference changes we could schedule the worker by passing `oldrev` and `newrev`. On initial push of a branch we'd check whether it was the `default_branch`: if so, we'd scan all objects reachable from `newrev`; for other pushes, we'd scan all objects reachable from `newrev` but not included in the default branch. We might need to track if the default branch has been scanned in this approach, as well as being extremely careful with changes to the default branch. Failure to do so could result in additional pointers not having been added to the db, and consequently cause the LfsObject to be deleted when one known pointer is removed.

~~This approach would be simpler but can end up rescanning blobs for every push in some cases. See https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14479#note_42252165~~

Additionally a mechanism would be needed to prevent garbage collection / cleanup while there are unprocessed pushes. This could be an exclusive lock around the push, or storing oldrev/newrev in the database temporarily in a similar manner to the `ReferenceChange` approach. The second scenario uses the database as a queue and would delete old records once processed. Either way, we’d need to be 100% sure that these actually get processed otherwise we’d end up deleting LFS objects.
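
One possible reading of the scan rules for this approach, as a sketch only. `scan_args` is a hypothetical helper, and also excluding `oldrev` on updates to an existing ref is an assumption that the text above does not spell out.

```ruby
# Git's all-zero sha, which signals that the ref is newly created.
BLANK_SHA = '0' * 40

# Builds `git rev-list` arguments for a single ref update.
def scan_args(oldrev:, newrev:, ref:, default_branch:)
  args = ['rev-list', '--objects', newrev]
  initial_push = oldrev == BLANK_SHA

  # First push of the default branch: scan everything reachable from newrev.
  return args if initial_push && ref == default_branch

  excluded = []
  excluded << oldrev unless initial_push              # assumption: skip what the ref already had
  excluded << default_branch unless ref == default_branch
  args + ['--not', *excluded]
end

first_master = scan_args(oldrev: BLANK_SHA, newrev: 'abc', ref: 'master', default_branch: 'master')
new_feature  = scan_args(oldrev: BLANK_SHA, newrev: 'def', ref: 'feature', default_branch: 'master')
```

This shows why the default branch needs special care: every other scan relies on it already having been fully scanned.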

Are there points in the code the reviewer needs to double check?

Todo

Find all pointers of first push
Clean up RecentLfsPush entries past the 100 most recent to avoid the table becoming too large
MySQL support for finding the 100 most recent refs
Add indices for columns used in queries
Add database checklist
Add performance testing plan to description (towards making a case why this won’t perform badly on production)
Ping someone for Gitaly review
Ping someone for database review
Performance review

Things I'll open MR discussions for

What happens if multiple pushes occur before/during first run? Would we have multiple workers scanning the whole project? Could we benefit from a lock around UpdateLfsPointersWorker per project?
Gitlab::Git::Blob.batch_lfs_metadata should bypass Gitaly
Better way to get all blobs? Possible to get all blobs within size range?
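
On the size-range question: LFS pointer files are small text blobs, so large blobs could be skipped without reading them. A rough sketch, where `parse_lfs_pointer` is a hypothetical helper and the 1024-byte cutoff is an assumption based on the Git LFS pointer spec:

```ruby
# Assumed upper bound on pointer size; anything at or above it is skipped.
MAX_POINTER_SIZE = 1024
POINTER_VERSION_LINE = "version https://git-lfs.github.com/spec/v1\n"

# Returns { oid:, size: } for an LFS pointer blob, or nil for anything else.
def parse_lfs_pointer(blob_data)
  return nil if blob_data.bytesize >= MAX_POINTER_SIZE
  return nil unless blob_data.start_with?(POINTER_VERSION_LINE)

  # Pointer lines have the form "key value"; ignore lines that do not.
  fields = blob_data.each_line
    .map { |line| line.chomp.split(' ', 2) }
    .select { |pair| pair.size == 2 }
    .to_h

  oid = fields['oid'].to_s[/\Asha256:(\h{64})\z/, 1]
  size = fields['size'].to_s[/\A\d+\z/]
  return nil unless oid && size

  { oid: oid, size: size.to_i }
end

example_oid = 'a' * 64
pointer = parse_lfs_pointer("#{POINTER_VERSION_LINE}oid sha256:#{example_oid}\nsize 12345\n")
```

With a cutoff like this, a blob listing that includes sizes would let the worker read only candidates below the limit.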

Performance

TODO

Ideas:

Generate 10,000s of LfsObjectProject, etc and test
Set up test instance and find memory characteristics
Add one LFS object to linux project and test it
Look up current count of LfsObjectProject items to find current scale

Benchmarks

TODO

Database checklist

For added migrations:

Updated db/schema.rb
Added a down method so the migration can be reverted
Added the output of the migration(s) to the MR body
Added the execution time of the migration(s) to the MR body
Added tests for the migration in spec/migrations if necessary (e.g. when migrating data)
Made sure the migration won't interfere with a running GitLab cluster, for example by disabling transactions for long running migrations

For added tables:

Ordered columns based on their type sizes in descending order
Added foreign keys if necessary
Added indexes if necessary
- Described the need for these indexes in the MR body
- Made sure existing indexes can not be reused instead

For potentially slow queries:

Included the raw SQL queries of the relevant queries
Included the output of EXPLAIN ANALYZE and execution timings of the relevant queries

Acceptance criteria

Changelog entry added, if necessary
Documentation created/updated
Tests added for this feature/bug
Review
- Has been reviewed by UX
- Has been reviewed by Frontend
- Has been reviewed by Backend
- Has been reviewed by Database
Conforms to the merge request performance guides

Closes https://gitlab.com/gitlab-org/gitlab-ce/issues/30639

Edited Oct 06, 2017 by James EJ
