Do I Need a Multi-Region AWS Deployment?

Ed Eastwood
Published in Version 1
5 min read · Apr 5, 2022

Every so often something goes wrong in AWS. It hosts big-name services like Netflix and Disney+, so when it breaks it makes the news. The next morning people in IT start wondering whether their own high availability arrangements are quite highly available enough. The answer depends on the service you’re providing, but if you’re not sure how your current architecture measures up against requirements, it’s definitely worth some thought. You may be over-provisioned and able to save some cash, you might be running at risk and need to make changes, or you might have it just right. Even if it is just right, being able to show your workings to demonstrate an informed decision will be helpful next time questions are asked.

This article reviews published AWS incidents that have affected availability and explores what their impact would have been on single availability zone, multi-availability zone and multi-region systems. It’s business-focused, so not particularly technical.

Availability Zones and Regions

An availability zone (AZ) is roughly equivalent to a data centre, although in reality it may be several. Each AZ has redundant power supplies and network connectivity, but there are still single points of failure. Latency within an AZ is low, as it would be within a data centre.

A region is a physical location that contains at least three availability zones. The majority of AWS services are regional, which means that they’re administered at a regional level so presumably share a common management plane behind the scenes. There’s lower latency within a region than between regions and data transfer between regions is often chargeable.

The approximation of availability zone to data centre has led organisations to multi-AZ deployments if high availability is needed. “We had two data centres before we moved to the cloud so we need two availability zones”. Multi-region didn’t really have an equivalent on-premises. There might have been data centres on different continents, but would there be distinct management and operational infrastructure? Probably not. As a result, the conversation stopped there. A precedent was established and greenfield deployments followed the same pattern.

Multi-AZ is the de facto enterprise deployment pattern. It’s also the right one for most use cases, but it’s worth considering all of the options. Now might be a good time to stop and re-evaluate performance against availability requirements, and maybe even to think about reducing costs if solutions are over-engineered. Next time there’s a catastrophic failure in the region hosting your services, at least you’ll know you made an informed decision based on cost and risk.

Summary of Events

AWS publish a summary and root cause analysis of major service-affecting events on their post-event summary page. Not all major incidents are published here. For example, only one of the three incidents reported in December 2021 is listed. Maybe the others are still being investigated or didn’t meet AWS’s criteria for reporting. The data available is sufficient to give an order of magnitude estimate, but the lack of complete data is another challenge in predicting availability.

Here’s a review of all the incidents reported, together with their likely impact on regional or availability zone availability — column two shows whether regional availability was affected, and column three shows the number of availability zones affected. Regional outages are highlighted.

The cause of each event is likely to have been addressed, reducing failures over time. This may be offset by innovative new features bringing unexpected problems.

So, What Availability Does AWS Offer?

My initial intent was to work out what availability has been achieved over the last few years by, for example, a generic system based on core AWS services. After investigating for a while, I realised how complicated this is and now have an appreciation of why AWS doesn’t report it. There are so many variables that any generalisation is meaningless as anything more than a rough estimate. Some of the complications are:

  • Every system is different and relies on different AWS services. We might have been able to assume a ‘core’ set of services to work around that (EC2, Lambda, DynamoDB, RDS, etc.) whilst brushing niche services under the carpet (Quantum Ledger Database, Ground Station, etc.).
  • Where an AZ or regional outage was reported it’s often partial, 10% of RDS in a region, for example. We could factor the percentage into the calculation, but it often isn’t known and the impact of a partial loss will differ from implementation to implementation.
  • Failures of control plane and management services will affect different customers differently. CodePipeline being down would be an inconvenience to some people but may be more serious if your business process relies on frequent releases.
  • Not all services are available in all regions.
  • Some regions have suffered from more failures than others.

Go on then, so what’s the number?

Hopefully, it’s clear now that this is nothing more than a finger in the air, but here we go… As of March 2022, there have been around 43 AZ failures and 7 regional failures since the post-event summary page started in 2011. There are 22 regions (excluding GovCloud and China) and 72 AZs. Spreading those failures evenly across every AZ and region over the roughly 11 years covered, any given AZ can expect a failure about once every 19 years, whereas any given region can expect one about once every 35 years.

Some notes:

  • Failure at a regional level also means the availability zones in it are down.
  • The numbers of regions and AZs used for the estimates are the counts at the time of writing, not when each incident occurred.
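
For anyone who wants to rerun the estimate with their own figures, here is a minimal sketch of the arithmetic in Python. The failure and unit counts come from the text above; the 11-year observation window and the function name are my assumptions, and the calculation treats failures as evenly spread across all AZs and regions, which the complications listed earlier suggest they are not.

```python
def years_between_failures(units: int, years_observed: float, failures: int) -> float:
    """Average number of years between failures for any single unit (AZ or region).

    Total exposure is units * years_observed "unit-years"; dividing by the
    number of failures gives the mean interval per unit.
    """
    unit_years = units * years_observed
    return unit_years / failures


# Assumption: roughly 11 years between 2011 and March 2022.
YEARS_OBSERVED = 11.0

# Counts quoted in the article (AZ and region totals are as of writing).
az_interval = years_between_failures(units=72, years_observed=YEARS_OBSERVED, failures=43)
region_interval = years_between_failures(units=22, years_observed=YEARS_OBSERVED, failures=7)

print(f"AZ: roughly one failure every {az_interval:.1f} years")
print(f"Region: roughly one failure every {region_interval:.1f} years")
```

With an 11-year window this lands within about a year of the figures quoted above. The small gap comes from the assumed window length and rounding, which is a useful reminder of how rough the estimate is.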

What Options Are There?

So you want to make your service more resilient - what options do you have?

There are loads of deployment patterns to meet differing high availability and disaster recovery requirements, from single-AZ deployments for path-to-live environments all the way through to active/active multi-region deployments. This AWS whitepaper is a good starting point.

Alternatively, Amazon partners like Version 1 can help you with a Well-Architected review or with optimising your deployment.

About the Author:

Ed Eastwood is an AWS Architect here at Version 1.
