Site degradation week 25/09/2017 to 28/09/2017
This issue collects the work done during the week.
The working theory is that Redis was originally overtaxed and could not respond in time.
This was resolved by splitting Redis:
https://gitlab.com/gitlab-com/infrastructure/issues/2855
However, because the site then had to rebuild its cache, the extra load slowed the frontends down. The number of web and API frontends was increased.
web: https://gitlab.com/gitlab-com/infrastructure/issues/2858
api: https://gitlab.com/gitlab-com/infrastructure/issues/2861
The last issue fixed was HAProxy: https://gitlab.com/gitlab-com/infrastructure/issues/2880
Since the last HAProxy change, the site has been very responsive.
Corrective actions
The following corrective actions were opened as a direct result of diagnosing and working through the site issues this week:
- https://gitlab.com/gitlab-com/infrastructure/issues/2895 - add CPU alerts
- https://gitlab.com/gitlab-com/infrastructure/issues/2889 - improve Terraform load balancing management
- https://gitlab.com/gitlab-com/infrastructure/issues/2860 - better optics/alarms for Redis failovers
- https://gitlab.com/gitlab-com/infrastructure/issues/2859 - Sentry went down when Redis was down
- https://gitlab.com/gitlab-com/infrastructure/issues/2861 - increase the API fleet capacity (done)
- https://gitlab.com/gitlab-com/infrastructure/issues/2894 - increase the pullmirror capacity (done)