Azure SLA for Virtual Machines

Rocco Scaramuzzi
Published in rocco.tech
4 min read · Dec 31, 2019

Virtual Machines (VMs) are the infrastructure as a service (IaaS) offering in Azure. You can deploy VMs in Azure with different operating systems (Windows Server, Ubuntu Linux, and Windows Desktop), and you don’t need to manage the underlying physical server.

Why are we concerned about Azure Virtual Machines in the era of serverless services such as Azure Functions or fully managed services such as App Service? For distributed systems with a microservices-like architecture, App Service and Azure Functions don’t give us container orchestration functionality. A better choice there is AKS (Azure Kubernetes Service). AKS simplifies the deployment, management and operation of Kubernetes, an open-source container-orchestration system for automating application deployment, scaling and management. It makes it quick and easy to deploy and manage containerized applications without container orchestration expertise. When you set up an AKS cluster, the AKS nodes run on Azure Virtual Machines.

Azure Kubernetes Service

Because of AKS’s growing popularity, I think it’s important to understand the SLA (Service Level Agreement) provided for the VMs, which in turn affects the SLA of your application when it is hosted on AKS.

In Azure, there are different SLAs for Virtual Machines.

  • 99.9%: This is the SLA for a single-instance Virtual Machine. Azure guarantees connectivity of at least 99.9% for a single VM that uses premium storage for all operating system disks and data disks.
  • 99.95%: Azure guarantees connectivity of at least 99.95% for all virtual machines that have two or more instances deployed in the same Availability Set.
  • 99.99%: Azure guarantees connectivity of at least 99.99% for all virtual machines that have two or more instances deployed across two or more Availability Zones in the same Azure Region.
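To make these percentages more tangible, here is a minimal Python sketch that converts each SLA into a downtime budget. It assumes a 30-day month and a 365-day year, and the function name `downtime_budget` is only illustrative; how Azure actually measures downtime is defined in the SLA document itself.

```python
# Downtime budget implied by an availability percentage.
# Assumptions: a 30-day month and a 365-day year (no leap years).
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes
HOURS_PER_YEAR = 365 * 24          # 8,760 hours


def downtime_budget(sla_percent):
    """Return (minutes per month, hours per year) of allowed downtime."""
    unavailable = 1 - sla_percent / 100
    return unavailable * MINUTES_PER_MONTH, unavailable * HOURS_PER_YEAR


for sla in (99.9, 99.95, 99.99):
    minutes, hours = downtime_budget(sla)
    print(f"{sla}%: about {minutes:.1f} min/month, {hours:.2f} h/year")
```

Under these assumptions, 99.9% leaves roughly 43 minutes of downtime per month, 99.95% roughly 22 minutes, and 99.99% just over 4 minutes.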

Availability Set and Availability Zone configurations reduce the VMs’ downtime as much as possible. A VM’s downtime can be caused by reboots (planned or unplanned, for example to install patches) or by incidents such as power supply, network or storage issues.

Let’s look at Availability Sets and Availability Zones in more detail.

Availability Sets — 99.95% SLA

Azure places the VMs of an Availability Set in “Fault Domains” and “Update Domains”. A fault domain is essentially a rack of servers, and each fault domain is connected to a different network and power supply. An update domain, instead, is a logical unit of deployment. Fault domains mitigate hardware failures, and update domains mitigate downtime during VM maintenance. An Availability Set can have a maximum of 3 fault domains (the limit depends on the region) and a maximum of 20 update domains.

Fault Domains and Update Domains

How does Azure distribute the VMs inside an Availability Set?

Let’s assume we want to deploy our application in an Availability Set of 2 VMs. One VM will be deployed in fault domain 0, the other in fault domain 1. Each VM will also be assigned to a different update domain.
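The distribution described above can be sketched in a few lines of Python. This is only an illustration: the round-robin assignment, the function name `place_vms`, the VM names and the default counts of 2 fault domains and 5 update domains are assumptions made for the example, not a description of Azure's actual placement logic.

```python
def place_vms(vm_names, fault_domains=2, update_domains=5):
    """Assign each VM a fault domain and an update domain, round-robin.

    The round-robin strategy and the default domain counts are
    assumptions for illustration only.
    """
    placement = {}
    for index, name in enumerate(vm_names):
        placement[name] = {
            "fault_domain": index % fault_domains,
            "update_domain": index % update_domains,
        }
    return placement


# Two VMs, as in the example: each lands in its own fault and update domain.
for name, domains in place_vms(["vm-a", "vm-b"]).items():
    print(name, domains)
```

With more VMs than domains, a scheme like this starts reusing domains, so several VMs end up sharing a fault domain or an update domain.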

Availability Sets — VMs distribution

Let’s see what happens in the case of a hardware failure and a reboot due to VM maintenance from Microsoft.

Hardware failure: if there is an issue with the hardware in fault domain 0, it only impacts that domain, and our application keeps running because there is a replica VM in fault domain 1.

Availability Sets — FD0 failure

VM reboot: if Microsoft needs to reboot the VMs to install a security patch, it reboots one update domain at a time. It first reboots all the VMs in update domain 0; once they are all up and running again, it reboots the VMs in update domain 1, and so on.

So with the combination of up to 3 fault domains and up to 20 update domains, Microsoft tries to minimize the downtime caused by hardware failures or security patch installation.
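Both scenarios can be walked through with a small simulation. The sketch below assumes the two-VM placement from the example (hypothetical names `vm-a` and `vm-b`) and models a reboot simply as "every VM in that update domain is briefly unavailable"; it is not how Azure schedules maintenance internally.

```python
# Hypothetical placement: VM name -> (fault_domain, update_domain)
PLACEMENT = {
    "vm-a": (0, 0),
    "vm-b": (1, 1),
}


def survivors_of_fault(placement, failed_fault_domain):
    """VMs still running after every VM in one fault domain is lost."""
    return sorted(
        vm for vm, (fd, _) in placement.items() if fd != failed_fault_domain
    )


def rolling_reboot(placement):
    """Return (update_domain, VMs still serving) for each reboot step."""
    steps = []
    for ud in sorted({ud for _, ud in placement.values()}):
        serving = sorted(
            vm for vm, (_, vm_ud) in placement.items() if vm_ud != ud
        )
        steps.append((ud, serving))
    return steps


print("fault domain 0 fails, still running:", survivors_of_fault(PLACEMENT, 0))
for ud, serving in rolling_reboot(PLACEMENT):
    print(f"rebooting update domain {ud}, still serving: {serving}")
```

In both cases at least one VM keeps serving traffic, which is the property the Availability Set is meant to provide.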

Availability Zones — 99.99% SLA

With Availability Zones, we can deploy our VMs across multiple data centers in the same region. For example, if our application runs on 3 VMs, we can deploy one VM in zone 1, one in zone 2 and another in zone 3.

Availability Zones

In this way, we make our application more resilient if one of the data centers goes down.
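As a rough illustration of the three-VM example, the following Python sketch spreads VMs across zones and computes how much capacity remains when one zone is lost. The round-robin spread, the function names and the VM names are assumptions for the example, not part of any Azure API.

```python
from collections import Counter


def spread_across_zones(vm_count, zones=("1", "2", "3")):
    """Assign VMs to zones in round-robin order (an illustrative strategy)."""
    return {f"vm-{i}": zones[i % len(zones)] for i in range(vm_count)}


def capacity_after_zone_loss(assignment, lost_zone):
    """Fraction of VMs still running if one zone becomes unavailable."""
    per_zone = Counter(assignment.values())
    return 1 - per_zone[lost_zone] / len(assignment)


assignment = spread_across_zones(3)
print(assignment)
print(f"capacity left if zone 2 fails: {capacity_after_zone_loss(assignment, '2'):.0%}")
```

With one VM per zone, losing any single zone still leaves two of the three VMs running.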

Availability Zones can also provide a higher level of availability to an application deployed in AKS, where the cluster's nodes can be distributed across availability zones. When the cluster components are distributed across multiple zones, your AKS cluster is able to tolerate a failure in one of those zones. In this way, our applications and management operations continue to be available even if one entire data center has a problem.

