We currently serialise the commits present in an MR version in the merge_request_diffs.st_commits column. This is fine for storing and rendering the commit objects for a single MR, but makes it hard to find, for example, which MRs contain a given commit, or the comments on commits in the context of an MR.
@smcgivern I believe we had a lot of discussion in the past about this. Not sure which issue it was. In my opinion, we seriously need to do this at some point, but we also need to be very careful about using database resources for this. We could overload the commits table very quickly if we're not careful enough.
There's been a similar issue on my mind for quite some time, too. There's no easy way to find the merge requests for a given pipeline. The current approach is: load all merge requests with the same branch from the project, load the pipeline which has the same branch and diff_head_sha, and compare whether they're the same. This is definitely expensive... and we need this in the pipeline emails, to link to the merge requests people are looking for. This is probably unrelated to commits at all, but I wish we could also keep this in mind and fix it at some point.
@godfat it is related, isn't it? If you have a table linking commits to MRs, then you can look up the commit in that table first to find the MRs, so that you don't need to just go by the branch name?
@yorickpeterse can help with what's reasonable to store in the DB, but this makes sense to me. If we assume that the notes table has X entries, will the commit-MR relationship really be more than one order of magnitude bigger? The table itself can also be a lot narrower, as it's just a join between a DB object and a repository object.
@grzesiek That's what I was referring to. Thanks! I am so bad at searching for issues... I tried to search against "commits" and "database" or "store" or "save" -_-
I'd like to start small with something that would solve the problems of knowing if a commit was in an MR, or if an MR contained a commit.
We have an ID-less table, merge_request_diff_commits:
merge_request_diff_id
This is a foreign key to merge_request_diffs.id, so that we can delete when the MR versions are deleted (typically when the project is deleted).
commit_sha
I think we'll need indexes on both of these, because we need to query on both. We will never update rows in this table, just insert and delete.
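A rough sketch of the migration that could create this table (plain Rails helpers; the exact column and index options are assumptions to be reviewed):

class CreateMergeRequestDiffCommits < ActiveRecord::Migration
  def change
    create_table :merge_request_diff_commits, id: false do |t|
      t.integer :merge_request_diff_id, null: false # points at merge_request_diffs.id
      t.string :commit_sha, limit: 40, null: false  # the commit SHA itself
    end

    # We query on both columns, so index both.
    add_index :merge_request_diff_commits, :merge_request_diff_id
    add_index :merge_request_diff_commits, :commit_sha

    # Rows should go away when their MR versions are deleted (typically with the project).
    add_foreign_key :merge_request_diff_commits, :merge_request_diffs, on_delete: :cascade
  end
end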
We can then just use the merge_request_diffs table like an ordinary join table, as well as storing the diffs themselves, to:
Get the MRs containing a commit.
Get the commits in an MR.
Get the comments on commits in an MR.
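For illustration, assuming a hypothetical MergeRequestDiffCommit model and the obvious has_many/belongs_to associations, those lookups could look roughly like this (sha and merge_request are whatever objects we already have in hand):

# MRs containing a given commit:
MergeRequest.
  joins(merge_request_diffs: :merge_request_diff_commits).
  where(merge_request_diff_commits: { commit_sha: sha })

# Commits in an MR, across all of its versions:
shas = MergeRequestDiffCommit.
  joins(:merge_request_diff).
  where(merge_request_diffs: { merge_request_id: merge_request.id }).
  pluck(:commit_sha)

# Comments on those commits, assuming commit notes keep their SHA in notes.commit_id:
Note.where(noteable_type: 'Commit', commit_id: shas)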
Because we will be enabling #24323 (closed) with this, I don't think we need the notion of a 'latest' MR diff, either in the merge_request_diffs or merge_request_diff_comments table: we will always want all comments on all commits in all versions in the MR. If we do want this, I would suggest adding a latest column to the merge_request_diffs table, instead of the existing approach where we do ORDER BY created_at DESC LIMIT 1.
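For comparison, the two approaches to 'latest' would look roughly like this (the latest column is hypothetical):

# Existing approach: order on the fly every time we need the latest version.
merge_request.merge_request_diffs.order(created_at: :desc).first

# With a latest flag maintained on write, we can read it directly.
merge_request.merge_request_diffs.find_by(latest: true)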
If we do the above, then the big problem is backfilling. We already have a serialised merge_request_diffs.st_commits column, which this will be duplicating. Ideally, we would migrate them all at once, but we have literally millions of these in staging (2,412,774 to be precise), and presumably more in production.
My initial proposal is to do something like this:
Write the commits for new MR diffs to this new table.
Support collecting commits from st_commits OR the new table. (We can just join to the new table, and if there are no entries, fall back to the serialised field.)
Add code to use this as we have time, as in the above-linked issues.
Create a worker that runs every hour.
That worker picks, say, 1,000 MR diffs that haven't been migrated (SELECT * FROM merge_request_diffs WHERE st_commits IS NOT NULL) and converts their st_commits column to entries in this new table, nulling out the existing serialised column.
Leave this worker in place for a long time (a year's worth of releases? Until 11.0?). When we're ready to remove it, create a post-deployment migration to clean up any remaining problem diffs (there shouldn't be any on GitLab.com at this stage, but we want to be kind to administrators).
In the next release, remove the st_commits column.
I realise that this is a very long-term project, but I think if we don't start, we'll never finish it. Any thoughts?
@smcgivern The proposal for table structure looks reasonable to me!
I have some questions / suggestions to the backfilling procedure:
Can we use this new table for these new features before it's fully backfilled? I'd say that it may give surprising results.
I think that at the beginning we should write both to st_commits and this new table, but always read from st_commits.
Once all the data is backfilled we can switch to reading from this new table.
I'm not sure if I like this periodical worker approach. I understand that we can't do it in a single run, but I'm still thinking if we have any better alternative.
@smcgivern My only concern is: is it possible that merge_request_diff_commits would grow to a point where we need to shard it? I don't know, just asking.
@adamniedzielski I think we could also consider that whenever a merge request is viewed, we backfill for that particular merge request first if it's not done yet. I don't really like the idea of having to write both sets of data, though, because I won't be sure when we could really make the transition. I would prefer a clearer path forward, or at least progress we could track so we know when we can stop.
@adamniedzielski if we wait until the table is backfilled, we will wait a long time. If we have to support both, I don't really see the point in writing to the legacy column, and that way we know that if the legacy column is NULL, we have data in the new format.
I'm not sure if I like this periodical worker approach. I understand that we can't do it in a single run, but I'm still thinking if we have any better alternative.
I agree. This is basically a proposal to see what's wrong, but I don't think we can necessarily use it as-is.
One extra column I thought of: it might be useful to know if the commit was the HEAD of its MR version, for finding commits with keep-around refs. I think this could go in the merge_request_diff_commits table.
@smcgivern Your proposed database design makes sense, and using a latest column is definitely better than constantly ordering data on the fly.
Storage wise this table will indeed be quite large, but we can anticipate the size/growth. Each row contains two columns: an integer (= 4 bytes) referring to merge_request_diffs.id, and a commit_sha, which is a 40-byte string. This means that for every row we need at least 44 bytes. At this size we can store 244 032 232 rows before consuming 10 GB of data, which isn't all that much (e.g. our events table is 134 GB). If we ever reach a point where we need much more space for this table, we can just configure PostgreSQL to store it on a separate disk, giving us about 32 TB (= the Azure limit per disk, if I'm not mistaken) of available space.
In other words, I don't think storage will be a big issue (even with indexes) as long as we are careful about not adding endless amounts of columns to this table (e.g. like we do with the users table).
Backfilling on the other hand is a much bigger issue. Doing this in a regular migration will more or less take forever. Doing this in the background in parallel (e.g. 1 job per MR) would be much faster, at the cost of requiring a worker just for this migration (something I'm not a fan of). Another option is to fill the data on the fly whenever you request an MR, but this means that:
We need to keep st_commits around forever
It's possible we never migrate all data as some MRs may never be viewed again
Another problem with a background process is that while the migration is running the commits of an MR may not be visible, since the code is using this new table but data may not be there yet. This could disrupt operations of a GitLab instance.
I'm not sure what the best solution here is, so I think we need to list down the pros and cons of all approaches along with an estimated amount of hours/days needed for the migration to run.
@yorickpeterse thanks. I also just realised I can take the size of the serialised column on staging as a reasonable estimate, because this table is doing essentially the same thing, but in a form that we can query. On staging, the st_commits column is about 22 GB. (This is partly because it serialises other information about the commit, like the author and date, which we never use ... because we just load the commit from the repo by SHA when we need an object.)
I'm listing some backfilling approaches here, with pros and cons. I'm sure there are other approaches, and other pros and cons, so please add them.
Assumptions:
Once we add this new table, we will stop writing to st_commits and only write to the new table, effective immediately. This is so that our backfill process, whatever it is, is always working on a decreasing set of rows (those with non-null st_commits).
We will handle data in both places identically. (If there are no results from joining to merge_request_diff_commits, we read from st_commits. If there are no results there, the MR diff is empty.)
Whether or not we add any new features to take advantage of this immediately, or only once all data has been migrated, is a side question for now.
Backfill options:
Migrate this data in a post-deployment migration. Whatever other method we use, we probably need to do this as well at some point, in case people have not let whatever mechanism we use backfill the entire table.
Pros: very simple.
Cons: even if we could do this without loading models into Ruby, this would probably be too slow to run on GitLab.com. As it is, we do need to load each object we're migrating in Ruby to deserialise the value. (We can do a single bulk INSERT per N objects loaded in Ruby, though. Reading is the bottleneck.)
Estimate: very rough, but on staging it takes 1-90 seconds to load and deserialise a sample of 1,000 rows from this table. (Some MR diffs have very few commits, some have lots.) If we say it took 45 seconds per 1,000 rows, then it would take 30 hours just to read all of these - let alone perform the inserts, or account for the load this would place on the table.
Migrate data on read.
Pros: we never have to read more than we would have anyway (because we are only doing this when reading as a result of user action).
Cons: I would guess that most merged MRs, and their associated diffs, are never loaded again. Even if 20% of the existing MR diffs were ever viewed (which seems really high to me, given that MRs have multiple versions), that would leave us with around 2 million rows still to migrate. Also, we can't write to Geo secondaries, so there's an EE consideration.
Estimate: well, if we assume we just schedule a Sidekiq job on read, then this could be pretty fast. But it would basically never get to a point where we could remove the old column - not on GitLab.com, and not on other installations - so we'd have to combine it with another method.
A background worker that is 'first-class' in the app until some future time (I suggested 11.0 earlier, but it could be any release).
Pros: will backfill everything (eventually).
Cons: we have to support this for a long time, and it may take a long time to backfill all the data. Picking the rate of backfilling is hard.
Estimate: if we did 1,000 MR diffs an hour, which seems OK based on the rough samples in option 1, we would do around 700,000 a month. That would take just under four months. If we did 5,000 an hour, we would therefore do around 3,500,000 a month, so 10,000 an hour would basically guarantee this was done in less than a month.
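To make option 2 a bit more concrete, the 'migrate on read' hook could be as small as this (the worker name and call site are made up; the job itself would copy st_commits into the new table for just that row and then null the column out):

class MergeRequestDiff < ActiveRecord::Base
  has_many :merge_request_diff_commits

  # Called from wherever we load commits for display: queue a one-off
  # backfill for this row if it still has serialised data.
  def schedule_commits_migration
    MigrateDiffCommitsWorker.perform_async(id) if st_commits.present?
  end
end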
@smcgivern Considering the amount of data that needs to be migrated I think this might be the rare case where we have to perform the migration using Sidekiq. On production for example it takes around 4 seconds alone just to get 1000 st_commits values and parse the JSON.
Since we probably need this more often I'm thinking of the following setup:
We add a worker called BackgroundMigrationWorker. This worker takes 2 arguments:
The class (including the full namespace) to run
The arguments (as an Array) to pass to the class' perform method
We define a migration class for this work, and call it something like Gitlab::BackgroundMigration::MigrateMergeRequestStCommitsBlaBla
We use a regular post-deployment migration to schedule jobs for this worker. This migration should not automatically parse any JSON, instead it should select the blob as a String. This ensures scheduling data is as fast as possible. To further speed things up we need to select multiple rows (instead of 1 by 1), and schedule multiple jobs using Sidekiq's bulk API
We just keep the migration class around like a regular migration.
We add tests for the migration class like any other regular class
This particular setup is not too complex, and allows us to re-use things in the future. The DB migration would look something like this:
class MergeRequest < ActiveRecord::Base
  # redefine MergeRequest so we can use it below without ActiveRecord automatically parsing JSON
end

def up
  MergeRequest.select([:id, :st_commits]).in_batches_whatever_it_was_called do |batch|
    jobs = batch.map do |mr|
      ['Gitlab::BackgroundMigration::MigrateStCommitsWhatever', [mr.id, mr.st_commits]]
    end

    Sidekiq::Client.push_bulk(
      'class' => 'BackgroundMigrationWorker',
      'queue' => 'background_migration',
      'args'  => jobs
    )
  end
end
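For completeness, a minimal sketch of the BackgroundMigrationWorker described above might be (the constantize-and-call shape is an assumption based on the two arguments listed):

class BackgroundMigrationWorker
  include Sidekiq::Worker

  sidekiq_options queue: :background_migration

  # class_name - the migration class (including the full namespace) to run.
  # arguments  - the arguments to pass to that class' perform method.
  def perform(class_name, arguments)
    class_name.constantize.new.perform(*arguments)
  end
end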
@yorickpeterse that sounds like a really good idea, and it would be useful in other cases, too. I've got a couple of questions:
How much storage will this use in Redis? Do we have enough? Do we expect users to have enough, if they have a lot of MR diffs but maybe not a particularly beefy Redis setup? We'll be pushing (for the arguments) at least 22 GB of data.
Should we make this worker create a queue for each migration? At the moment we have only one, but we may have more running concurrently in future.
Are we concerned about these jobs blocking other jobs - should we use 0.5 as the weight for this queue?
This doesn't (yet) solve the problem of dropping that column, but I think we can just schedule that for a major version in the future?
How much storage will this use in Redis? Do we have enough? Do we expect users to have enough, if they have a lot of MR diffs but maybe not a particularly beefy Redis setup? We'll be pushing (for the arguments) at least 22 GB of data.
Good point, perhaps it's better to just schedule the MR IDs and let the worker query the corresponding row.
Should we make this worker create a queue for each migration? At the moment we have only one, but we may have more running concurrently in future.
No, there should be 1 queue for all migrations. Since jobs can be processed in parallel there's no particular advantage to having multiple queues.
Are we concerned about these jobs blocking other jobs - should we use 0.5 as the weight for this queue?
Yes, migrations should have a low weight so Sidekiq will prioritise other queues.
This doesn't (yet) solve the problem of dropping that column, but I think we can just schedule that for a major version in the future?
I think we can best approach this as follows:
The background migration will clear st_commits for a migrated row once it has migrated the data (and is 100% sure the new data is in place so we don't lose anything)
We remove the column in 10.0, giving people plenty of time to run the migrations
This ensures we don't keep GBs of data around until 10.0.
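Putting those pieces together, the migration class itself might look roughly like this - taking only the row ID, as suggested above, and only clearing st_commits once the new rows are in place (class name, hash keys, and the bulk-insert helper are all assumptions):

module Gitlab
  module BackgroundMigration
    class MigrateMergeRequestDiffCommits
      def perform(merge_request_diff_id)
        diff = ::MergeRequestDiff.find_by(id: merge_request_diff_id)
        return unless diff && diff.st_commits.present?

        rows = diff.st_commits.map do |commit|
          { merge_request_diff_id: diff.id, commit_sha: commit[:id] } # assuming the SHA lives under :id
        end

        ::MergeRequestDiff.transaction do
          Gitlab::Database.bulk_insert(:merge_request_diff_commits, rows) # hypothetical bulk-insert helper
          diff.update_column(:st_commits, nil) # safe to clear now that the data is copied
        end
      end
    end
  end
end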
We don't strictly need a latest column after all, as we have a head_commit_sha column on merge_request_diffs, which should always match the latest commit in the table.
We do load the hash today, and we won't have ordering for the commits in this new table, but we do use the ordering when loading the commits, because we just map them to real commits in the repository.
Maybe we need an integer order column that represents the index of a commit in its MR diff.
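If we do end up needing that, the sketch is small (relative_order is just a placeholder name):

class AddRelativeOrderToMergeRequestDiffCommits < ActiveRecord::Migration
  def change
    # Record each commit's position within its MR version at insert time.
    add_column :merge_request_diff_commits, :relative_order, :integer
  end
end

# Reading commits back in their original order:
MergeRequestDiffCommit.where(merge_request_diff_id: diff.id).order(:relative_order)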
Also, it looks like the benefit of storing the commit as a hash is that we don't need to hit the repo at all for the commits. I just had a chat with @ayufan because we both believed it required FS access (https://gitlab.slack.com/archives/C02PF508L/p1479820954006919), but it turns out that it doesn't.
I think the trade-off is worth it, because as far as I can see, we only use the actual commit objects on the commits tab, in which case we could just as well get them from the repo. (We already load this tab async.)
Most other places, we just use the SHAs, which will now be more efficient to get (no deserialisation penalty). If we only need the ordering when getting the commits from the repo, we could also just let git handle that, and get the commits between the start and head SHAs.
The problem comes with old MRs, I guess, where the commits may not have been kept around in the repository, and so we can only get them from the DB. @DouweM @dzaporozhets do you know if that will be a problem?
@DouweM thanks. Then I think part of this should be to try to display MRs where we can't fetch those refs sensibly, even if we can't show the full data.
Should we store the SHAs in their own commits table?
Advantages:
We save on storage size for the columns that reference SHAs (we're storing two integers in this proposed table, instead of a 40-byte string and an integer). This isn't a benefit when only one table references this commits table, but is everywhere else.
If we do need to store some commit metadata in the DB, we can do this for everything that references that commit.
Helps with the 'really remove this commit from the repo' case.
Disadvantages:
The SHA is already the natural key for commits.
It's an extra JOIN in a lot of places.
This table would initially just have one column, the SHA. Other tables would reference it and use foreign keys.
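A sketch of what that might look like, with merge_request_diff_commits referencing it instead of storing the SHA string directly (all names are assumptions):

class CreateCommitsTable < ActiveRecord::Migration
  def change
    # Just the SHA for now; any metadata could come later.
    create_table :commits do |t|
      t.string :sha, limit: 40, null: false
    end
    add_index :commits, :sha, unique: true

    # Other tables then reference commits.id (an integer) instead of a 40-byte string.
    add_column :merge_request_diff_commits, :commit_id, :integer
    add_foreign_key :merge_request_diff_commits, :commits
  end
end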
@smcgivern It's worth mentioning that for metadata we eventually want to use Gitaly, so storing metadata also in the DB will result in duplication of data. Apart from these MR diffs, what would this SHAs table be used for?
@yorickpeterse for the metadata, I have no idea, but the Gitaly point is a good one.
In general, the table as proposed only works for MRs, so doesn't help with comments or todos on commits, or pipelines (which also refer to commits).
I think migrating this across in future would also be possible, though, if we did decide we wanted a commits table with id and sha columns, so we can probably ignore this for now.