Azure decreasing business impact
This morning nfs-file01 was shut down because of an issue between the hypervisor and the VHD storage (meaning the root filesystem was getting into an inconsistent state). This is perfectly legitimate since it's a cloud server, and it's our responsibility to make this SPOF redundant.
However, Azure didn't inform us in any way, neither before nor after the event. I had to open an issue with critical business impact (GitLab.com was down during the outage) and ask them to explain what happened.
At first I received a call from an Azure support tech asking how I knew that the server had been rebooted and wasn't just unreachable over the network (duh!). When I explained that running uptime reported 0 minutes, he hung up, saying he'd investigate further.
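To be clear about what that check shows: a machine that was merely cut off from the network keeps counting time since boot, while a power-cycled one starts from zero again. A quick sketch of the check (illustrative output, not the actual log from nfs-file01):

    $ uptime
     08:31:02 up 0 min,  1 user,  load average: 0.52, 0.14, 0.05
    $ cat /proc/uptime        # seconds since boot, followed by aggregate idle time
    43.18 301.72

Then their RCA came in: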
We identified that your VM became unavailable at 8:20:29 (UTC) and availability was restored at 8:29:00 (UTC). This unexpected occurrence was caused by an Azure initiated temporary VM shutdown.
The temporary VM shutdown was triggered by our Azure monitoring systems detecting failed IO transaction between the physical host node where your VM was running, and the Azure Storage services where your VHDs reside. As designed, this action was taken to preserve data integrity of your VM. Once the node detected that conditions had improved, the VM was restarted. RDP connections to the VM, or requests to any other services running inside the VM may have failed during this time.
To ensure an increased level of protection and redundancy for your application in Azure, it is recommended that you group two or more virtual machines in an availability set.
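Their closing recommendation is the standard one: put redundant VMs in an availability set so they land in different fault and update domains. For reference, with today's az CLI that looks roughly like the sketch below (resource group, VM names and image are placeholders, not our actual setup):

    # Create an availability set spread across fault and update domains
    az vm availability-set create \
        --resource-group example-rg \
        --name nfs-avset \
        --platform-fault-domain-count 2 \
        --platform-update-domain-count 5

    # Create (or recreate) the VMs inside that availability set
    az vm create \
        --resource-group example-rg \
        --name nfs-file01 \
        --image Ubuntu2204 \
        --availability-set nfs-avset

Note that an existing VM generally can't be moved into an availability set after the fact; it has to be recreated, which is part of why this redundancy is something we have to plan for rather than something Azure can retrofit.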
Then I logged into the Azure support panel, only to find that they had changed the business impact to B, meaning "moderate impact".
I'm not sure on what basis they decided to do so. Should we open another issue to ask about this, or should we keep updating the existing one?