Site degradation week 25/09/2017 to 28/09/2017

This issue is a collection of what was worked on during the week.

The working theory is that the original issue was Redis being overtaxed and unable to respond in time.

This was resolved by splitting Redis:

https://gitlab.com/gitlab-com/infrastructure/issues/2855

However, since the site needed to rebuild its cache, the extra load on the frontends started slowing them down. The number of web and API frontends was increased.

web: https://gitlab.com/gitlab-com/infrastructure/issues/2858
api: https://gitlab.com/gitlab-com/infrastructure/issues/2861

The last issue to be fixed was HAProxy: https://gitlab.com/gitlab-com/infrastructure/issues/2880

Since the last HAProxy change, the site has been very responsive.

Corrective actions

The following are the corrective actions that were opened as a direct result of diagnosing and working through the site issues during this week:

https://gitlab.com/gitlab-com/infrastructure/issues/2895 - add cpu alerts
https://gitlab.com/gitlab-com/infrastructure/issues/2889 - improve terraform load balancing management
https://gitlab.com/gitlab-com/infrastructure/issues/2860 - better optics/alarms for redis failovers
https://gitlab.com/gitlab-com/infrastructure/issues/2859 - sentry went down when redis was down
https://gitlab.com/gitlab-com/infrastructure/issues/2861 - increase the api fleet capacity (done)
https://gitlab.com/gitlab-com/infrastructure/issues/2894 - increase the pullmirror capacity (done)

Edited Oct 02, 2017 by John Jarvis
