Azure decreasing business impact
This morning nfs-file01 was shut down because of an issue between the hypervisor and the VHD storage (meaning the root filesystem was getting into an inconsistent state). This is perfectly legitimate since it's a cloud server, and it's our responsibility to make this SPOF redundant.
However, Azure didn't inform us in any way, neither before nor after the event. I had to open an issue with critical business impact (GitLab.com was down during the outage) and ask them to explain what happened.
At first I received a call from an Azure support tech asking how I knew that the server had been rebooted and wasn't just unreachable over the network (duh!). When I explained that running uptime reported 0 minutes, he hung up, saying he'd investigate further.
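To be clear about what that check shows: a machine that was merely cut off from the network keeps counting time since boot, while a power-cycled one starts from zero again. A quick sketch of the check (illustrative output, not the actual log from nfs-file01):

    $ uptime
     08:31:02 up 0 min,  1 user,  load average: 0.52, 0.14, 0.05
    $ cat /proc/uptime        # seconds since boot, followed by aggregate idle time
    43.18 301.72

Then their RCA came in: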
We identified that your VM became unavailable at 8:20:29 (UTC) and availability was restored at 8:29:00 (UTC). This unexpected occurrence was caused by an Azure initiated temporary VM shutdown.
The temporary VM shutdown was triggered by our Azure monitoring systems detecting failed IO transaction between the physical host node where your VM was running, and the Azure Storage services where your VHDs reside. As designed, this action was taken to preserve data integrity of your VM. Once the node detected that conditions had improved, the VM was restarted. RDP connections to the VM, or requests to any other services running inside the VM may have failed during this time.
To ensure an increased level of protection and redundancy for your application in Azure, it is recommended that you group two or more virtual machines in an availability set.
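Their closing recommendation is the standard one: put redundant VMs in an availability set so they land in different fault and update domains. For reference, with today's az CLI that looks roughly like the sketch below (resource group, VM names and image are placeholders, not our actual setup):

    # Create an availability set spread across fault and update domains
    az vm availability-set create \
        --resource-group example-rg \
        --name nfs-avset \
        --platform-fault-domain-count 2 \
        --platform-update-domain-count 5

    # Create (or recreate) the VMs inside that availability set
    az vm create \
        --resource-group example-rg \
        --name nfs-file01 \
        --image Ubuntu2204 \
        --availability-set nfs-avset

Note that an existing VM generally can't be moved into an availability set after the fact; it has to be recreated, which is part of why this redundancy is something we have to plan for rather than something Azure can retrofit.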
Then I logged into the Azure support panel, only to find that they had changed the business impact to B, meaning "moderate impact".
I'm not sure on what basis they decided to do so. Should we open another issue to ask about this, or should we keep updating the existing one?