Azure Availability Sets vs Availability Zones vs Paired-regions!!

local, zonal & regional redundancy…

Sreeram Garlapati
6 min read · May 21, 2020

What is Cloud!? Azure or AWS or GCP — at the core of it, Cloud is nothing but limitless computing infrastructure!

Over the course of time, any infrastructure is bound to fail. Drives could be damaged. Cooling equipment, like fans, could break down. Power cables or network cables could burn out. The datacenter housing the infrastructure could lose its power supply. Natural disasters could take down an entire datacenter or could wipe a city off the map. The possible types of failure are endless!

Now, how do customers of Azure Cloud, who rely on this infrastructure, guarantee that their data will not be lost despite all these possible failures? How do they ensure that their services will be up-and-running when these failures happen?

How can Azure customers be certain that the IT infrastructure powering their business will run forever (of course, NOT in the case when the entire Earth is wiped out from the Universe!!)!?

Answer: Azure categorized all these failures into 3 buckets. It then built REDUNDANCY into the infrastructure in such a way that customers can run their systems spanning these redundant layers. To facilitate that, this redundancy is exposed to customers as programmable knobs.

Microsoft Azure provides 3 infrastructure-level knobs of redundancy, and thereby High-Availability, to safeguard against 3 different levels of failure: intra-datacenter, inter-datacenter and inter-regional. When an Azure region is built out, these knobs are natively baked into the infrastructure design. They are:

  1. Availability Sets — for intra-datacenter failures
  2. Availability Zones — for inter-datacenter failures
  3. Paired Regions — for inter-regional failures

This blog narrates STORIES about a fake Customer of Azure — “EINSTEIN BANK” — to explain how the bank used these high availability knobs and defied the ODDS & managed to STAY UP-AND-RUNNING during outages!

1. RACK FAILURE

Hypothetically, this bank hosted all of its banking software in an AZURE datacenter in NY. Due to a RARE cooling problem, one of the RACKs in the NY datacenter was BURNT. Unfortunately, some of the infrastructure of EINSTEIN bank was on the SAME RACK where the failure happened!

Around the same time, several customers across the world were trying to access their accounts. But their Data was present on this BURNT RACK!!

Despite the RACK failure, all of the customers of EINSTEIN bank were able to access their accounts Normally!

No IT Engineer of EINSTEIN BANK was involved.

None of the Customers saw any outage! Isn’t that AWESOME!

This is because, EINSTEIN BANK leveraged the Azure feature — Availability Sets!

Availability Sets safeguard against failures happening within a datacenter.

PAAS (Platform As A Service) services like VMSS, Service Fabric etc. expose this knob in the form of Fault Domains, & Data services (like Storage, Event Hubs, SQL etc.) hook into it in the form of replication within the datacenter (for ex: LRS, locally redundant storage).
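To make the Fault Domain idea concrete, here is a minimal sketch in Python. It is NOT the Azure API: the function names, the VM names and the round-robin placement are all assumptions made for illustration. It only shows why spreading instances across racks means one burnt rack cannot take down every instance.

```python
from collections import defaultdict


def place_round_robin(vm_names, fault_domain_count):
    """Assign each VM to a fault domain (think: rack) in round-robin order.

    Hypothetical placement policy, used here only to illustrate spreading.
    """
    placement = defaultdict(list)
    for index, vm in enumerate(vm_names):
        placement[index % fault_domain_count].append(vm)
    return dict(placement)


def survivors(placement, failed_domain):
    """Return the VMs still running after one fault domain is lost."""
    return sorted(
        vm
        for domain, vms in placement.items()
        if domain != failed_domain
        for vm in vms
    )


# Six hypothetical web servers of EINSTEIN BANK spread over 3 fault domains.
vms = [f"bank-web-{i}" for i in range(6)]
placement = place_round_robin(vms, fault_domain_count=3)

# The rack backing fault domain 0 burns down.
remaining = survivors(placement, failed_domain=0)
print(remaining)
# → ['bank-web-1', 'bank-web-2', 'bank-web-4', 'bank-web-5']
```

Four of the six servers keep serving customers, which is why nobody at the bank had to do anything.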

2. DATACENTER FAILURES

Now, in the course of time, just like every other THING that could go wrong, the power supply to the Data Center was busted. The entire Data Center went down in NY!

Several customers who were accessing their Account Information were trying to hit this Data Center!!

None of the IT engineers from EINSTEIN BANK lost their sleep — trying to resolve the issue.

In fact, neither the BANK nor any of its Customers even noticed an outage! HOW!!

Availability Zones provide resiliency to inter-datacenter failures.

One Azure Region is made up of multiple Zones, & each Zone is made up of one or more datacenters!

& EACH ZONE HAS ITS OWN INDEPENDENT POWER SUPPLY, COOLING & N/W FOR DATA & COMPUTE.

IAAS services like Virtual Machines expose this in the form of pinning the VMs to Zones.

PAAS services like VMSS expose this in the form of SPANNING ACROSS Zones.

Data services like SQL, Event Hubs, Cosmos DB or Storage expose this knob in the form of replication, typically referred to as Zone Redundancy.
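Here is a small sketch of what replication across Zones buys you. It assumes a write is copied to every Zone before it is acknowledged; the class name, the zone names and the account key are hypothetical and this is not how any particular Azure service is implemented.

```python
class ZoneRedundantStore:
    """Toy key-value store that keeps one replica per zone (illustrative only)."""

    def __init__(self, zones):
        self.replicas = {zone: {} for zone in zones}

    def write(self, key, value):
        # Assumption: the write is acknowledged only after every
        # healthy zone has stored it (synchronous replication).
        for replica in self.replicas.values():
            replica[key] = value
        return True

    def fail_zone(self, zone):
        # Simulate losing a whole zone, for example to a power failure.
        del self.replicas[zone]

    def read(self, key):
        # Any surviving zone can answer the read.
        for replica in self.replicas.values():
            if key in replica:
                return replica[key]
        raise KeyError(key)


store = ZoneRedundantStore(["zone-1", "zone-2", "zone-3"])
store.write("account-42", 100)

store.fail_zone("zone-1")          # the NY datacenter loses power
balance = store.read("account-42")
print(balance)
# → 100
```

Because every acknowledged write already lives in the other Zones, losing one Zone loses no data and needs no manual step.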

3. REGION-WIDE FAILURES

Now, in the timeless eternity, one day, a big disaster happened! All Data Centers in the entire NY region were struck by that disaster and collapsed. No data was recoverable from these data centers. Everything was LOST!

Several requests from customers to access their accounts were landing on these datacenters.

There was no way for customers to reach their accounts there. All the infrastructure holding information about the customers' accounts was hosted in these Data Centers!

While building their banking solution, EINSTEIN BANK used AZURE PAIRED-REGIONS. To leverage Azure Paired-Regions support, EINSTEIN BANK deployed its banking solution across 2 Azure Regions (the region pair). Since only one of those regions went down, and the data and infrastructure were replicated to the Paired-region, EINSTEIN BANK manually failed over to the other region & recovered from the failure.

Since these paired regions are typically hundreds of miles apart, N/W latency between them is high.

Data services like SQL, Storage, Cosmos DB, Event Hubs etc. support the notion of Paired regions in the form of a Geo-Redundancy flag. Due to the higher n/w latency, data replication is asynchronous and lags behind the primary by a small delay. The most recent writes may not have reached the Paired-region when disaster strikes, and the amount of data that can be lost this way is known as the Recovery Point Objective (RPO). The time it takes to fail over and be back up is the Recovery Time Objective (RTO).
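The effect of that replication delay can be sketched in a few lines of Python. Everything here is hypothetical (class name, method names, transaction keys); it is a toy model of asynchronous replication plus a manual fail-over, not an Azure SDK call.

```python
from collections import deque


class GeoReplicatedStore:
    """Toy model: primary region acknowledges writes, secondary catches up later."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.pending = deque()  # writes not yet shipped to the paired region

    def write(self, key, value):
        # Acknowledged as soon as the primary has it (asynchronous replication).
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate(self, max_items):
        """Ship up to max_items pending writes to the paired region."""
        shipped = 0
        while self.pending and shipped < max_items:
            key, value = self.pending.popleft()
            self.secondary[key] = value
            shipped += 1
        return shipped

    def fail_over(self):
        """Promote the paired region; return the keys that never made it across."""
        lost = [key for key, _ in self.pending]
        self.pending.clear()
        self.primary = self.secondary
        return lost


store = GeoReplicatedStore()
store.write("txn-1", "deposit 10")
store.write("txn-2", "deposit 20")
store.write("txn-3", "deposit 30")

store.replicate(max_items=2)   # only two writes cross before the disaster
lost = store.fail_over()
print(lost)
# → ['txn-3']
```

After the fail-over the bank is running again from the other region, but the one write that was still in flight is gone. That gap is the price paid for not making every write wait on a far-away region.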

Overall, Microsoft Azure is building out regions in such a way that customers have these 3 redundancy knobs across almost all Azure Services, empowering them to handle most infrastructure failures. Customers can opt into the right knob based on their business type and plan for business continuity during these failures. For ex: a local business might be happy with Availability Zones, while more demanding customers like international banks will need Geo-Pairs.
