Database Outage on 2016/11/28 when project_authorizations had too much bloat

For those reading: the project authorisations problem is solved by https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/7733 which will be released in 8.14.2.

Thanks @yorickpeterse

I would like to move the conversation into understanding what happened to get all that table bloat and how it flew undetected for so long

Marked the task Add availability graphs to the fleet overview dashboard for clarity as completed

@pcarranza Prior to https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/7733 the code would use a serializable transaction. The idea was to prevent the data from the same user being updated concurrently as this could potentially result in an inconsistent state. This worked as follows:

loop until success
  start serializable transaction
    delete existing entries for user X
    insert new entries for user X
    if commit failed
      retry

While the removal worked as inspected the queries would block each other on the insert. This meant that if data had to be refreshed for user A, B, and C all of this would happen in serial. Because this code would try to run until success (much like your typical compare-and-swap loop) this could lead to a lot of queries being executed (and failing). I suspect that due to the DELETE succeeding this would create a lot of dead tuples, though I expect that not to happen until a transaction commits (but I'm not familiar enough with these internals to be certain). This would explain the VACUUM overhead, the space usage, the dead tuples, etc.

The new setup is basically:

lock entries for user X
start regular transaction
  delete entries for user X
  insert new entries for user X
  commit
unlock entries for user X

This ensures there's ever only 3 queries being executed (DELETE, SELECT, and an INSERT), and there's no busy loop needed anymore.

@yorickpeterse I was writing an answer and started wondering.

This graph is showing the spike in dead tuples:

But it doesn't explain why we had so much storage used:

Taking a second look at dead tuples in the same timespan look what I found:

So an 8M dead tuple peak is not a peak, it's barely an annoyance.

There you go... we need to alert whenever we reach 100M dead rows, or we need to have a way of alerting whenever we see a huge increase in dead rows and storage usage for a given time. I've no idea how to do this yet.

Good thing that we fixed the Tuple Texas Chainsaw Massacre feature, @yorickpeterse

Support received a request (handled by @athar) from a GitLab.com user on 2016/11/23 regarding an issue "not seeing all created projects on the dashboard page". As @athar did not have staging/production access he was unable to debug the issue (https://gitlab.com/gitlab-com/infrastructure/issues/801 + improvements to SE onboarding). This was escalated and debugged on 2016/11/29. Steps taken:

Impersonate the user and review dashboard - missing a number of created projects
Load the rails console and review current_user.authorized_projects - same result
Check the users authorized_projects_populated attribute = true
Updated authorized_projects_populated to false and refreshed the users dashboard - issue resolved

We attempted to first debug this issue on staging.gitlab.com (before screwing with production), however it was borked - https://gitlab.com/gitlab-com/infrastructure/issues/802

We should work to identify these occasional issues when first reported (via support). This will involve having cross-team communication from support to dev or infra if we receive a report from a GitLab.com user describing unusual behaviour.

@MrChrisW I fail to see how this relates to the outage specifically. But I agree with the comms feedback.

@pcarranza I'm not saying it relates to this outage specifically, I'm making the point that we saw the symptoms related to https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/7733 early (-1day) before it started causing problems in production.

Makes sense, thanks @MrChrisW

We have another tuple chainsaw massacre going on right now

I'll clean up the pile of dead bodies when the dust settles.

Mentioned in issue support-forum#1313 (closed)

Another event today morning with the same behavior and resolution

Again, vacuum verbose project_authorizations and vacuum full project_authorizations to reclaim space.

What do we have to do to prevent this from happening again?

@stanhu Stop generating so much bloat in the project_authorizations table, or start using pg_repack on this table as it seems to be the main bloat generator available.

@yorickpeterse also mentioned reviewing the way we are updating this table as it seems that we are being too aggressive by basically blowing all the rows and generating them again.

From the alerting standpoint, I just added an alert for db high load with a lot of information, and we should probably alert whenever we see that there is a high dead tuple generation rate to get us ready. It will not prevent this from happening, but it will at least warn with some time to be ready.

@pcarranza Now that we essentially use a lock instead of serializable transactions it should be easier to perform a diff based update. I'll see if I can piece this together next week.

Thanks @yorickpeterse Issue you created is: https://gitlab.com/gitlab-org/gitlab-ce/issues/25257

marked the task Add alerts for database high load as completed

marked the task Add alert for high number of slow queries in the database as completed

marked the task Add alert for high rate of dead tuples per minute (10k for more than 5 minutes is a lot and will bring up issues like this again) as completed

marked the task Add low level host metrics to the database dashboard as completed

marked the task Add database helper scripts to the chef recipe so we can always find the scripts and we don't need to build them while in the middle of an outage. as completed

I'm closing this issue for now as I'm not clear the next step is to use pg_repack. I think we are going in a different direction here.

closed

Database Outage on 2016/11/28 when project_authorizations had too much bloat

TL;DR

Timeline

Graphs

How storage dropped when the outage was resolved

General view of the timeline

DB1 host metrics

Wide view of DB1 storage usage

What went wrong

What can be better

Designs

Child items 0

Activity

Admin message

Admin message

Database Outage on 2016/11/28 when project_authorizations had too much bloat

TL;DR

Timeline

Graphs

How storage dropped when the outage was resolved

General view of the timeline

DB1 host metrics

Wide view of DB1 storage usage

What went wrong

What can be better

Activity