Amazon Builders’ Library in focus #5: Static stability using availability zones
Next in our series on the Amazon Builders’ Library, Yan Cui picks out the key insights from the article, Static stability using availability zones, by AWS Senior Principal Engineer Becky Weiss and AWS Principal Engineer Mike Furr.
About the Amazon Builders’ Library
The Amazon Builders’ Library is a collection of articles written by principal engineers at Amazon that explain how Amazon builds scalable and resilient systems.
Disclaimer: some nuances might be lost in this shortened form. If you want to learn about the topic in more detail then check out the original article.
Static stability using availability zones
Control plane = changes to a system (e.g. adding resources) and propagating the changes.
Data plane = daily business of those resources — what it takes for them to function.
Separate the data plane and control plane, because:
- Data plane availability is typically more important to customers than control plane.
- Data plane typically operates at a higher volume (often by orders of magnitude) than control plane, so it is better to scale each of them along its own dimensions.
- Control plane typically has more moving parts, so it’s more likely to be impaired.
The data plane usually receives data from the control plane but maintains its own state so it can continue working even when the control plane is impaired.
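As a rough illustration of that idea, here is a minimal Python sketch of a data plane component that keeps its last known good state locally. The class name, the `fetch_config` callable, and the routing-table shape are all hypothetical and are not taken from the original article.

```python
from typing import Callable, Dict


class DataPlane:
    """Serves lookups from locally held state; the control plane only updates it."""

    def __init__(self, fetch_config: Callable[[], Dict[str, str]]) -> None:
        # fetch_config stands in for a call to the control plane (assumption).
        self._fetch_config = fetch_config
        # Last known good state, kept locally by the data plane.
        self._routes: Dict[str, str] = {}

    def refresh(self) -> bool:
        """Try to pull fresh state; keep the old state if the control plane fails."""
        try:
            self._routes = dict(self._fetch_config())
            return True
        except Exception:
            # Control plane is impaired: keep serving from the existing state.
            return False

    def route(self, key: str) -> str:
        """Answer from local state only; never calls the control plane."""
        return self._routes[key]
```

The point of the design is that `route` has no dependency on the control plane at request time, so a failed `refresh` degrades freshness rather than availability.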
One lesson Amazon learned is to expect impairments before they happen. A statically stable service would continue to function in the face of partial impairment (e.g. losing an AZ) or impairment to its dependencies.
Reacting to impairments as they happen (e.g. if one AZ fails other AZs would scale up to take over the load) is less effective because the response to impairment requires actions from the control plane. Control planes are typically more complex and more likely to misbehave when the overall system is impaired. A statically stable service would over-provision to the point where it doesn’t need to launch any EC2 instances even if one AZ is impaired.
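The amount of over-provisioning this implies can be worked out with simple arithmetic. The sketch below is an illustration under stated assumptions (the function name and the example numbers are made up here): each AZ must hold enough instances that the surviving AZs alone can carry peak load.

```python
import math


def instances_per_az(peak_instances: int, az_count: int, az_failures: int = 1) -> int:
    """Instances to run in each AZ so that peak load still fits
    after `az_failures` AZs are lost, without launching anything new."""
    surviving = az_count - az_failures
    if surviving < 1:
        raise ValueError("at least one AZ must survive")
    return math.ceil(peak_instances / surviving)


# Hypothetical workload that needs 12 instances at peak.
three_az = instances_per_az(12, az_count=3)  # 6 per AZ, 18 in total
four_az = instances_per_az(12, az_count=4)   # 4 per AZ, 16 in total
```

With these numbers, three AZs need 50% more capacity than peak (18 versus 12), while four AZs need about 33% more (16 versus 12), so spreading across more AZs lowers the cost of being statically stable.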
Static stability patterns
- Active-active on Availability Zones: deploy a service in an Auto Scaling Group and load balance across three or more AZs. Each AZ is over-provisioned so that if an AZ is impaired the rest of the AZs can still carry the load without needing to scale.
- Active-standby on Availability Zones: some services are stateful and require a leader node to coordinate the work. All writes go to the leader and are replicated to a standby node in another AZ. In the case of RDS, the failover is handled automatically by RDS.
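The active-standby pattern can be pictured with a toy in-memory model. This is only a sketch: the class and method names are hypothetical, the leader is called the primary here, replication is assumed to be synchronous, and none of it reflects how RDS is actually implemented.

```python
from typing import Dict


class ReplicatedPair:
    """Toy model of a primary node and a standby node in two different AZs."""

    def __init__(self, primary_az: str, standby_az: str) -> None:
        self.primary_az = primary_az
        self.standby_az = standby_az
        # One key-value store per AZ.
        self._data: Dict[str, Dict[str, str]] = {primary_az: {}, standby_az: {}}

    def write(self, key: str, value: str) -> None:
        # Writes go to the primary and are copied to the standby
        # (synchronous replication is an assumption of this sketch).
        self._data[self.primary_az][key] = value
        self._data[self.standby_az][key] = value

    def read(self, key: str) -> str:
        return self._data[self.primary_az][key]

    def fail_over(self) -> None:
        # The standby already exists and already holds the data,
        # so failover is a role swap, not a provisioning step.
        self.primary_az, self.standby_az = self.standby_az, self.primary_az
```

Because the standby is provisioned ahead of time, nothing has to be created at the moment of failure, which is what makes the pattern statically stable.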
Under the hood: Static stability inside of Amazon EC2
The rest of the article then goes deeper into how static stability is applied in EC2:
- Deployments follow a zonal deployment calendar: deploy to AZs in the same region on different days.
- Network traffic is kept local to the AZ.
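A zonal deployment calendar can be sketched in a few lines. The function name, the AZ names, and the one-day gap are assumptions made for illustration; the article does not describe Amazon's actual tooling.

```python
from datetime import date, timedelta
from typing import Dict, List


def zonal_calendar(azs: List[str], start: date, gap_days: int = 1) -> Dict[str, date]:
    """Assign each AZ in a region its own deployment day, `gap_days` apart."""
    if gap_days < 1:
        raise ValueError("AZs must be deployed on different days")
    return {az: start + timedelta(days=i * gap_days) for i, az in enumerate(azs)}


# Hypothetical region with three AZs.
calendar = zonal_calendar(["az-a", "az-b", "az-c"], start=date(2020, 1, 20))
```

Since no two AZs share a deployment day, a bad change should affect at most one AZ before it is noticed and rolled back.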
You can use the aforementioned active-active pattern to build highly available regional services. You can then stack these services on top of each other. This regional-calls-regional pattern is one Amazon uses for many of its services, both external-facing and internal.
But for foundational services — services that are building blocks for other services such as EC2 — Amazon designs them to be AZ independent instead.
This is why EC2 NAT Gateway is a zonal resource. AZ independence is important here because NAT Gateway sits in the path of internet connectivity and is, therefore, part of the data plane for any EC2 instance in the VPC.
To allow customers to build highly available regional services, Amazon needs to ensure AZ impairments are contained and do not spread to other AZs. This is why foundational components such as NAT Gateway need to stay within an AZ.
The tradeoff for this design decision is the additional complexity involved in managing zonal (rather than regional) service configurations. E.g. multiple NAT gateways and routing tables.
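To make that extra configuration concrete, the sketch below builds one default route per private subnet, always pointing at the NAT gateway in the same AZ. The function name and all resource IDs are hypothetical placeholders, and this models the routing decision only; it does not call any AWS API.

```python
from typing import Dict


def build_default_routes(subnet_az: Dict[str, str], nat_by_az: Dict[str, str]) -> Dict[str, str]:
    """Map each private subnet to the NAT gateway that lives in its own AZ."""
    routes: Dict[str, str] = {}
    for subnet, az in subnet_az.items():
        if az not in nat_by_az:
            # Falling back to another AZ's gateway would couple the two AZs.
            raise ValueError(f"no NAT gateway in {az} for {subnet}")
        routes[subnet] = nat_by_az[az]
    return routes


# Hypothetical IDs: one NAT gateway and one private subnet per AZ.
routes = build_default_routes(
    subnet_az={"subnet-1": "az-a", "subnet-2": "az-b"},
    nat_by_az={"az-a": "nat-a", "az-b": "nat-b"},
)
```

Refusing to fall back to a gateway in another AZ is deliberate in this sketch: it keeps an impairment in one AZ from affecting internet connectivity for instances elsewhere.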
Amazon also periodically stores database backups in S3 and keeps read replicas across multiple AZs. This ensures that customer or business-critical data is stored durably.
Originally published at https://lumigo.io on January 23, 2020.