# Test higher diff size limits
## What are we going to do?
https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/11875 adds a feature flag to increase the diff size limits for displaying and collapsing the diffs respectively. We are going to test those new limits for a period of 4 hours.
For this we will enable the `gitlab_git_diff_size_limit_increase` feature flag by setting it to `true`.
## Why are we doing it?
We never actually tested the previous limits; we just picked ones that seemed reasonable. These new limits also seem reasonable, but we should experiment first.
We think we can increase the limits because we have since added other limits based on the size of the entire diff. Those overall limits should bound the impact, while being more user-friendly for diffs where few files changed, but those files changed a great deal.
## When are we going to do it?
- Start time: 2017-06-27 10:00 UTC
- Duration: 4 hours
- Estimated end time: 2017-06-27 14:00 UTC
## How are we going to do it?
- Someone with admin access: set the `gitlab_git_diff_size_limit_increase` feature flag to `true`.
- Me: monitor these dashboards:
- https://performance.gitlab.net/dashboard/db/rails-controllers?var-action=Projects::MergeRequestsController%23diffs.json
- https://performance.gitlab.net/dashboard/db/rails-controllers?var-action=Projects::MergeRequestsController%23new_diffs
- https://performance.gitlab.net/dashboard/db/rails-controllers?var-action=Projects::MergeRequestsController%23new_diffs.json
- https://performance.gitlab.net/dashboard/db/rails-controllers?var-action=Projects::MergeRequestsController%23diff_for_path.json
- https://performance.gitlab.net/dashboard/db/rails-controllers?var-action=Projects::CompareController%23show
- https://performance.gitlab.net/dashboard/db/rails-controllers?var-action=Projects::CompareController%23diff_for_path.json
- https://performance.gitlab.net/dashboard/db/rails-controllers?var-action=Projects::CommitController%23show
- https://performance.gitlab.net/dashboard/db/rails-controllers?var-action=Projects::CommitController%23diff_for_path.json
- Someone with admin access: set the `gitlab_git_diff_size_limit_increase` feature flag to `false`.
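The two admin steps above amount to flipping a single boolean flag on and off. As a minimal self-contained sketch of that lifecycle, here is an in-memory flag store; GitLab's real helper (`Feature`) persists flags and differs in detail, so this is illustrative only:

```ruby
# Minimal in-memory stand-in for a feature-flag store. Flags default to
# disabled; enable/disable mirror the start and end of the test window.
class FeatureFlags
  def initialize
    @flags = Hash.new(false) # every flag is off unless explicitly enabled
  end

  def enable(name)
    @flags[name] = true
  end

  def disable(name)
    @flags[name] = false
  end

  def enabled?(name)
    @flags[name]
  end
end

flags = FeatureFlags.new
flags.enable(:gitlab_git_diff_size_limit_increase)   # start of the 4-hour window
flags.enabled?(:gitlab_git_diff_size_limit_increase) # => true
flags.disable(:gitlab_git_diff_size_limit_increase)  # end of the window, or rollback
flags.enabled?(:gitlab_git_diff_size_limit_increase) # => false
```

Because disabling is just another write to the same store, rollback is immediate and needs no deploy.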
## How are we preparing for it?
Once 9.3.0-rc2 is deployed, we just need to pick a convenient time (any fires from the deploy put out, and no subsequent deploy planned).
## What can we check before starting?
The timings for these controller actions are very noisy, but we can pick specific public URLs - ideally on GitLab projects - and use those as our baselines. We'll specifically pick these:
- The current CE -> EE merge. This is typically a large MR with lots of changes.
- The most recent release post. This normally hits the existing collapse limit, but the overall diff isn't very big.
- https://gitlab.com/nrclark/dummy_project/commit/81ebdea5df2fb42e59257cb3eaad671a5c53ca36, which is a common 'stress test' of diffing.
The first and last shouldn't get slower. The middle one may get slightly slower - I'd accept a 5% increase in timings.
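The 5% tolerance can be checked mechanically against the baselines. A small sketch (the helper name and the sample timings are illustrative, not real measurements):

```ruby
# Returns true if the current timing exceeds the baseline by more than the
# tolerance (default 5%, matching the threshold stated above).
def regression?(baseline_ms, current_ms, tolerance: 0.05)
  current_ms > baseline_ms * (1 + tolerance)
end

regression?(400.0, 415.0) # => false (3.75% slower: within tolerance)
regression?(400.0, 430.0) # => true  (7.5% slower: investigate)
```

Recording the baseline timings for the three URLs before flipping the flag makes this comparison trivial afterwards.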
## What can we check afterwards to ensure that it's working?
We can check the release post MR for an example. We should also pay close attention to the response timings on the charts above. We expect most response timings to increase slightly - though the largest diffs should not slow down at all, because they are already capped by the limits on the overall diff size.
The `diff_for_path` actions should see fewer, but slower, transactions, as they will now only be called for larger diffs. These calls are asynchronous anyway.
## Impact
- Type of impact: client facing.
- What will happen: people will see more diffs rendered by default, at some cost in load times, which we hope will be small.
- Do we expect downtime? (set the override in pagerduty): no.
## How are we communicating this to our customers?
- Tweet before and after the change.
- Do we need to set a broadcast banner?: no.
## What is the rollback plan?
We will disable the feature flag at the end of the period. If we see problems caused by this during the period, we can disable it at any time.
## Monitoring
- Graphs to check for failures:
- See the above list of links.
- Graphs to check for improvements:
- See the above list of links.
- Alerts that may trigger:
- None.
## Scheduling
Schedule a downtime window in the production calendar twice as long as the worst-case duration estimate; be pessimistic (better safe than sorry).
## When things go wrong (downtime or service degradation)
- Label the change issue as outage
- Perform a blameless post mortem