Skip to content

Add table for files in merge request diffs

username-removed-443319 requested to merge merge-request-diffs-table into master

This is an experiment based off https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/11821#note_31155127. I don't expect this to actually work beyond my happy-path tests, or to be merged soon.

Shiny features

As it stands, this enables no new features, and works exactly as before - except for the things it breaks. Things we could consider (but shouldn't necessarily do) in future are:

  1. Paginate MR diffs.
  2. Search for MRs that touched a file path.
  3. More generally, allow this content to be indexed by Elasticsearch, so we could even search for MR diffs with particular content.
  4. If we stored the size of each diff file, we could even implement 'collapsing' in the DB, but I'm really not sure that's worth it.
  5. We could store the diffstats for each file to help with https://gitlab.com/gitlab-org/gitlab-ce/issues/20282.

Table design

  1. It doesn't have any ID column, because we have 2.4 million rows in merge_request_diffs in staging (I'm sure it's many more in production). As this table will have many more rows than that table, we'd need bigint IDs if we did use them here, or we'd run out of ints very quickly.
  2. The relative_order column is only used for preserving the order of diff files. It's possible we could do this by indexing the paths instead. I'm not sure which is better, this was just easier to implement.
  3. I thought about using a bit string column for the modes, but ActiveRecord doesn't seem to support it, and we will never query on it.
  4. There is only one index, which is kind of a cheat because it's a multi-column. I don't know if this would be better broken into two indices.

Why do this?

We don't like serialised columns: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/11821

In profiles like http://profiler.gitlap.com/20170524/4516541a-0ff2-4cf7-90f7-92c1c97b32cc.html.gz, we see two separate entries for ActiveRecord::Coders::YAMLColumn#load, taking 20% and 13% of the total time respectively. I have done some tests locally on a huge MR and not seen a huge difference, though.

On staging, merge_request_diffs takes up 126 GB. Of that, 95% is the two serialised columns: 23 GB of commits, and 96 GB of diffs. While this won't make the table smaller (I have no plans to migrate the existing diffs), it will stop it growing quite so fast. In future, we could perhaps consider moving st_diffs to their own table, to remove the need to query the MR diffs table without that column, which we do now in some cases.

Edited by username-removed-443319

Merge request reports