Lumigo

Amazon Builders’ Library in focus #5: Static stability using availability zones

Next in our series on the Amazon Builders’ Library, Yan Cui picks out the key insights from the article, Static stability using availability zones, by AWS Senior Principal Engineer Becky Weiss and AWS Principal Engineer Mike Furr.

About the Amazon Builders’ Library

The Amazon Builders’ Library is a collection of articles written by principal engineers at Amazon that explain how Amazon builds scalable and resilient systems.

Disclaimer: some nuances might be lost in this shortened form. If you want to learn about the topic in more detail then check out the original article.

Static stability using availability zones

Control plane = changes to a system (e.g. adding resources) and propagating the changes.

Data plane = daily business of those resources — what it takes for them to function.

Static stability

Separate the data plane and control plane, because:

  • Data plane availability is typically more important to customers than control plane.
  • Data plane typically operates at a higher volume (often by orders of magnitude) than control plane. So it’s good to scale them on their own scaling dimensions.
  • Control plane typically has more moving parts, so it’s more likely to be impaired.

The data plane usually receives data from the control plane but maintains its own state so it can continue working even when the control plane is impaired.
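As a rough illustration of that idea, here is a minimal Python sketch of a data plane node that keeps the last configuration it received and keeps serving from it when the control plane cannot be reached. The names `DataPlaneNode`, `ControlPlaneUnavailable` and `make_flaky_control_plane` are hypothetical and exist only for this example; they are not from the original article or any AWS API.

```python
class ControlPlaneUnavailable(Exception):
    """Raised by the (hypothetical) control plane client when it is impaired."""


class DataPlaneNode:
    """Serves requests from its own copy of the configuration."""

    def __init__(self, fetch_config):
        # fetch_config is any callable that returns the current config dict
        # or raises ControlPlaneUnavailable.
        self._fetch_config = fetch_config
        self._last_known_good = None

    def refresh(self) -> bool:
        """Try to pull new config; keep the old copy if the control plane fails."""
        try:
            self._last_known_good = self._fetch_config()
            return True
        except ControlPlaneUnavailable:
            return False

    def lookup(self, key):
        """Answer from local state only; never calls the control plane."""
        if self._last_known_good is None:
            raise RuntimeError("no configuration has been received yet")
        return self._last_known_good[key]


def make_flaky_control_plane():
    """Fake control plane: succeeds once, then is impaired forever."""
    calls = {"count": 0}

    def fetch_config():
        calls["count"] += 1
        if calls["count"] > 1:
            raise ControlPlaneUnavailable("control plane is impaired")
        return {"orders-service": "10.0.1.5"}

    return fetch_config


node = DataPlaneNode(make_flaky_control_plane())
first = node.refresh()    # control plane healthy, config stored locally
second = node.refresh()   # control plane impaired, old config is kept
target = node.lookup("orders-service")  # still answered from local state
```

The point of the sketch is that `lookup` depends only on local state, so an impaired control plane delays configuration changes but does not stop the node from doing its job.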

One lesson Amazon learned is to expect impairments before they happen. A statically stable service would continue to function in the face of partial impairment (e.g. losing an AZ) or impairment to its dependencies.

Reacting to impairments as they happen (e.g. if one AZ fails, the other AZs scale up to take over the load) is less effective because the response requires actions from the control plane. Control planes are typically more complex and more likely to misbehave when the overall system is impaired. A statically stable service is instead over-provisioned to the point where it doesn't need to launch any EC2 instances even if one AZ is impaired.
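To make the over-provisioning concrete, the arithmetic can be sketched in a few lines of Python. The function name `instances_per_az` and its parameters are made up for this example, and it assumes load is spread evenly across AZs and that instances are interchangeable.

```python
import math


def instances_per_az(peak_instances: int, az_count: int,
                     az_failures_tolerated: int = 1) -> int:
    """Instances to run in each AZ so the surviving AZs can carry peak load.

    peak_instances: instances needed in total to serve peak load.
    az_count: number of AZs the service is deployed to.
    az_failures_tolerated: how many AZs may be lost without scaling up.
    """
    surviving_azs = az_count - az_failures_tolerated
    if surviving_azs < 1:
        raise ValueError("at least one AZ must survive")
    # Each surviving AZ must carry an equal share of the full peak load.
    return math.ceil(peak_instances / surviving_azs)


per_az = instances_per_az(peak_instances=12, az_count=3)
total = per_az * 3
```

With a peak load that needs 12 instances across three AZs, each AZ runs 6 instances, for 18 in total. That is 50% more than strictly needed when every AZ is healthy, and it is the price of not depending on the control plane during an impairment.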

Static stability patterns

  • Active-active on Availability Zones: deploy a service in an Auto Scaling Group and load balance across three or more AZs. Each AZ is over-provisioned so that if one AZ is impaired, the remaining AZs can still carry the load without needing to scale.
  • Active-standby on Availability Zones: some services are stateful and require a leader node to coordinate the work. All writes go to the leader and are replicated to a standby node in another AZ. In the case of RDS, failover is handled automatically by RDS.

Under the hood: Static stability inside of Amazon EC2

The rest of the article then goes deeper into how static stability is applied in EC2:

  • Deployments follow a zonal deployment calendar: deploy to AZs in the same region on different days.
  • Network traffic is kept local to the AZ.
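The article summary does not describe how Amazon's deployment calendar is actually built, so the following is only a toy Python sketch of the general idea: each AZ in a region gets its own deployment day, so a bad change reaches one AZ at a time. The function `zonal_deployment_calendar` and the `days_between` parameter are assumptions made for this example.

```python
from datetime import date, timedelta


def zonal_deployment_calendar(azs, start, days_between=1):
    """Assign each AZ a distinct deployment date, in the order given."""
    if days_between < 1:
        raise ValueError("AZs must be deployed on different days")
    return {
        az: start + timedelta(days=index * days_between)
        for index, az in enumerate(azs)
    }


calendar = zonal_deployment_calendar(
    ["us-east-1a", "us-east-1b", "us-east-1c"],
    start=date(2020, 1, 20),
)
```

Because no two AZs share a date, a faulty deployment can be caught and rolled back while the other AZs are still running the previous version.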

You can use the active-active pattern described above to build highly available regional services. You can then stack these services on top of each other. This regional-calls-regional pattern is one Amazon uses for many of its services, both external-facing and internal.

But for foundational services — services that are building blocks for other services such as EC2 — Amazon designs them to be AZ independent instead.

This is why EC2 NAT Gateway is a zonal resource. AZ independence is important here because NAT Gateway sits in the path of internet connectivity and is, therefore, part of the data plane for any EC2 instance in the VPC.

To allow customers to build highly available regional services, Amazon needs to ensure that AZ impairments are contained and do not spread to other AZs. This is why foundational components such as NAT Gateway need to stay within an AZ.

The tradeoff for this design decision is the additional complexity of managing zonal (rather than regional) service configurations, e.g. multiple NAT gateways and route tables.
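A small Python sketch can show what that zonal configuration looks like in practice: every private subnet gets a default route to the NAT gateway in its own AZ, never to one in another AZ. This is plain data manipulation and does not call any AWS API. The function name `zonal_default_routes` and all resource IDs are hypothetical; the route keys mirror the field names EC2 uses for routes, but the structure as a whole is an illustration only.

```python
def zonal_default_routes(nat_gateway_by_az, private_subnets):
    """Build one default route per private subnet, pinned to the same-AZ NAT gateway.

    nat_gateway_by_az: dict mapping AZ name to a NAT gateway ID.
    private_subnets: iterable of (subnet_id, az) pairs.
    """
    routes = {}
    for subnet_id, az in private_subnets:
        if az not in nat_gateway_by_az:
            # Refuse to fall back to another AZ: that would couple the AZs.
            raise ValueError(f"no NAT gateway in {az} for subnet {subnet_id}")
        routes[subnet_id] = {
            "DestinationCidrBlock": "0.0.0.0/0",
            "NatGatewayId": nat_gateway_by_az[az],
        }
    return routes


routes = zonal_default_routes(
    nat_gateway_by_az={"us-east-1a": "nat-aaa", "us-east-1b": "nat-bbb"},
    private_subnets=[("subnet-1", "us-east-1a"), ("subnet-2", "us-east-1b")],
)
```

The extra bookkeeping (one gateway and one route table per AZ) is the complexity cost mentioned above. In return, losing `us-east-1a` cannot take away internet connectivity from instances in `us-east-1b`.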

Amazon also periodically stores database backups in S3 and keeps read replicas across multiple AZs. This ensures that customer or business-critical data is stored durably.

Originally published at https://lumigo.io on January 23, 2020.
