Staging web01 and web02: 100% CPU load by unicorns

Staging had been reporting down for a few days, returning 50x errors. After today's deploy we figured out the reason, and it turned out to be Prometheus metrics. After setting prometheus_metrics_enabled to f in the database and restarting the unicorns, the load went back to normal and staging is with us again, alive and well.

Note that this didn't prevent a deployment. It's just that with 100% CPU load the service was usable only for a very limited time after a unicorn restart, after which the unicorn processes were killed for exceeding the timeout and the server started returning 50x all the time. Thanks @stanhu for pointing out the reason for the increased load.

Start of the Slack discussion: https://gitlab.slack.com/archives/C101F3796/p1503958498000231

Here are the web01.stg host stats before and after disabling the metrics: https://performance.gitlab.net/dashboard/db/host-stats?orgId=1&var-node=web-01.sv.stg.gitlab.com&from=1503962155504&to=1503965640563

/cc @gl-infra @jivanvl @bjk-gitlab

Edited Aug 29, 2017 by Ilya Frolov
