Degraded Performance on GitLab.com 2016-08-12
At around 10:40 CDT (16:40 UTC), alerts began firing, indicating that the Sidekiq processes on 15 of the worker nodes had gone away. After a lengthy investigation, it became evident that these 15 workers could not reach Postgres via the Azure load balancer IP. To work around the issue, we updated the Chef attributes so that GitLab talks directly to DB4 instead of going through the load balancer. This resolved the degraded performance.
We currently have a ticket open with Azure to troubleshoot why the load balancer is dropping traffic from some worker nodes and not others.
The workers that CAN access the database via the load balancer IP are:
- worker3
- worker10
- worker11
- worker12
- worker13
The workers that CANNOT access the database via the load balancer IP are:
- worker1
- worker2
- worker4
- worker5
- worker6
- worker7
- worker8
- worker9
- worker14
- worker15
- worker16
- worker17
- worker18
- worker19
- worker20
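
A split like the one above can be reproduced with a simple TCP reachability check run from each worker. The following is a minimal sketch, not the procedure we actually used: the host name `db-lb.example.internal` is a hypothetical placeholder for the load balancer IP, and 5432 is assumed because it is the default Postgres port. A successful TCP connect only shows that the path is open at the network level; it does not prove that Postgres itself is answering queries.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def split_by_reachability(results: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Split {worker_name: reachable} into (can_access, cannot_access) lists."""
    can_access = [name for name, ok in results.items() if ok]
    cannot_access = [name for name, ok in results.items() if not ok]
    return can_access, cannot_access


if __name__ == "__main__":
    # Hypothetical values: replace with the real load balancer address.
    LB_HOST = "db-lb.example.internal"
    PG_PORT = 5432  # default Postgres port (assumed)

    reachable = can_connect(LB_HOST, PG_PORT)
    print(f"{socket.gethostname()}: load balancer reachable = {reachable}")
```

Running the script on every worker and collecting the per-host results into a dictionary lets `split_by_reachability` produce the two lists directly.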