Staging web01 and web02: 100% CPU load by unicorns
Staging was reporting down for few days, returning 50x, so after today's deploy we have figured out the reason, and it turned out to be prometheus metrics. After setting prometheus_metrics_enabled
to f
in database and restarting unicorns the load went back to normal and staging is again with us, alive and well.
Note that his didn't prevent a deployment, its just that with 100% cpu load the service was usable for very limited time after unicorn restart, after which unicorn processes were killed for exceeding timeout and server started returning 50x all the time. Thanks @stanhu for pointing out the reason for increased load.
Start of slack discussion: https://gitlab.slack.com/archives/C101F3796/p1503958498000231
Here's the web01.stg
host stats before and after disabling the metrics: https://performance.gitlab.net/dashboard/db/host-stats?orgId=1&var-node=web-01.sv.stg.gitlab.com&from=1503962155504&to=1503965640563