How to implement the perfect failover strategy using Amazon Route53
Using Route53 Health Checks to fail fast
Amazon Route53 is the go-to option for DNS if you’re using AWS. Many users, including us, have only scratched the surface with the power that Route53 can provide — it’s much more than a simple nameserver.
Every application should run in at least two regions for high availability and fault tolerance. It should also use an active-active architecture rather than active-passive: in an active-active system, all regions serve traffic at the same time, whereas in active-passive a backup region is only used in a failure scenario.
So, let’s draw an architecture diagram to show what we intend to achieve.
We have our application running in any number of AWS regions, with a load balancer as the entry-point. We want Route53 to work as a kind of “pre-load balancer” to distribute requests and point users to the correct region.
Route53 Routing Policies
We have two (or more) regions running in active-active. Great! But how can Route53 distribute the load between these regions? Enter, Route53 Routing Policies. AWS offers many different policies:
- Simple routing policy: Point a domain to a single, simple resource.
- Failover routing policy: Designed for active-passive failover. Routes to a primary resource while it's healthy, and to a backup resource when it isn't.
- Geolocation routing policy: Route traffic based on the country or continent of your users.
- Geoproximity routing policy: Route traffic based on the physical distance between the region and your users.
- Latency routing policy: Route traffic to the AWS region that provides the best latency.
- Multivalue answer routing policy: Respond to DNS queries with up to eight healthy records selected at random.
- Weighted routing policy: Route traffic to resources in proportions that you specify.
For an active-active system, Geolocation, Geoproximity, Latency, Multivalue, or Weighted policies would work. So which one should we choose?
So long as our application scales horizontally, latency routing will provide the best experience for users. DNS queries will return the lowest-latency healthy region based on the users’ IP address. Geoproximity routing will have higher latencies as it only takes physical distances into account. Multivalue routing can be used to slightly improve availability and add some basic load-balancing, but DNS load-balancing isn’t reliable and another policy is almost always better. Weighted routing is great for testing new versions and allows blue-green deployments. If there are geographic requirements, for example, users in the UK need to be routed to a region in the UK, geolocation routing is a viable option.
The simple routing policy is the only one that doesn't support Route53 Health Checks, so even if an application is unable to use latency routing, we can still implement a great failover strategy with one of the other policies.
The application we’re building has no geographic requirements, so we can use latency routing. So how do we set it up?
To use latency-based routing, a record set needs to be created for each region the application is hosted in. Here’s how to set up basic latency routing using Terraform:
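A minimal sketch, assuming two regions behind Application Load Balancers (the zone, domain, and resource names are hypothetical):

```hcl
# One latency record per region, each pointing at that region's load balancer.
resource "aws_route53_record" "eu_central_1" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "eu-central-1"

  latency_routing_policy {
    region = "eu-central-1"
  }

  alias {
    name                   = aws_lb.eu_central_1.dns_name
    zone_id                = aws_lb.eu_central_1.zone_id
    # Let Route53 take the load balancer's own health into account
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "us_east_1" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "us-east-1"

  latency_routing_policy {
    region = "us-east-1"
  }

  alias {
    name                   = aws_lb.us_east_1.dns_name
    zone_id                = aws_lb.us_east_1.zone_id
    evaluate_target_health = true
  }
}
```

Both records share the same name; Route53 answers each DNS query with the record for the lowest-latency healthy region.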
We were quickly able to verify that Route53 would direct users to the best region based on their latency. Great! We’re all set up, so let’s run some load tests.
There’s no need to load test Route53 itself — it handles some of the largest sites in the world (Instagram, Amazon, Netflix). However, we wanted to prove that latency routing itself would work as expected with millions of requests. Specifically, we wanted to test failover scenarios to see how Route53 would direct traffic.
We’re testing our WebSocket service. WebSocket services are incredibly sensitive to instability due to the nature of persistent connections — a single user is connected to a single host in a single region for a long time. If a host (or region) fails, all users connected to that host need to reconnect. This sensitivity makes the service a great candidate for testing Route53 routing: if any connections are dropped, we know something has gone wrong.
After verifying our service could handle millions of concurrent open connections, we moved on to the failover tests.
We decided to run a load test of 3 million connections, with eu-central-1 limited to ~2 million connections. Beyond that point, the hosts would become overloaded: they’d serve requests with increased latency, and ultimately become unhealthy. To keep things simple, we’re running in just two AWS regions. Here’s what we expected to happen:
- eu-central-1 hits about 2m open connections
- eu-central-1 becomes overloaded and starts serving requests slowly
- Route53 detects this increased latency and starts routing new requests to us-east-1
- Existing connections to eu-central-1 are kept open as, without the incoming requests, CPU usage is reduced
- us-east-1 handles the remaining 1m connections, reaching a total of 3m
Here’s how we expected the number of open connections to change during the load test.
However, when we ran the load test, the results were nothing like what we expected.
When eu-central-1 hit capacity, Route53 continued to direct requests to it. In fact, it kept on sending requests until every host in eu-central-1 was detected as unhealthy by our load balancer health checks and shut down, despite the fact that latency had increased. All users were disconnected from the service before us-east-1 started receiving requests. We’d expected users to be routed to us-east-1 as soon as eu-central-1’s latency increased. How could it go so wrong?
Latency-based routing uses the approximate latency between the user and the AWS region — NOT between the user and your service!
For failover to occur, the whole region must be considered unhealthy, by which time users will have experienced degraded performance.
Why does this happen? Route53 will continue to route requests until the region is detected as unhealthy, regardless of service latency. In our case, that only happened when all hosts in our load balancer were unhealthy, because evaluate_target_health was set to true. The results would have been even worse without evaluating the target health: eu-central-1 would never have been detected as unhealthy by Route53, causing a major outage.
Route53 provides a fantastic solution to this problem. Health checks allow DNS failover to be triggered by CloudWatch alarms. This means that we have full control over when Route53 fails over to another region — we can fine-tune it to our needs.
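As a sketch, an alarm-backed health check might look like this in Terraform (the alarm, threshold, and resource names are hypothetical):

```hcl
# Hypothetical CloudWatch alarm on the load balancer's response time
resource "aws_cloudwatch_metric_alarm" "high_latency" {
  alarm_name          = "eu-central-1-high-latency"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 2 # seconds
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    LoadBalancer = aws_lb.eu_central_1.arn_suffix
  }
}

# Route53 health check driven by the alarm above
resource "aws_route53_health_check" "eu_central_1_latency" {
  type                            = "CLOUDWATCH_METRIC"
  cloudwatch_alarm_name           = aws_cloudwatch_metric_alarm.high_latency.alarm_name
  cloudwatch_alarm_region         = "eu-central-1"
  insufficient_data_health_status = "LastKnownStatus"
}
```

The health check's id can then be attached to the region's latency record via the record's health_check_id argument, so the alarm firing takes that region out of DNS rotation.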
Health checks can also be configured to monitor an endpoint by polling it, but your service could be performing badly and still respond to the health check requests.
After adding the new health check, we re-ran the same load test again, this time with much better results.
The graph shows a small difference compared to our expectations. We quickly worked out that this is due to DNS TTL. Route53 alias records which point to load balancers always have a TTL of 60 seconds, so it takes around that long before users start being directed to us-east-1. Given that we’re simulating an unlikely failure scenario, we decided that a delay of 60 seconds is acceptable.
Unfortunately, whilst configuring the health checks, we found two major limitations:
No support for percentiles
Measuring latency without using percentiles is basically useless. We initially wanted to have a health check monitoring P99 connection latency and fail over if it becomes too high. Unfortunately, Route53 cannot use CloudWatch alarms with extended statistics so we can only trigger a failover based on minimum, average, or maximum latency.
No support for expressions
Expressions allow us to perform calculations on metrics, which means we can have CloudWatch alarms that trigger when the error rate exceeds a certain percentage, rather than simply monitoring the total number of errors.
When attempting to create a Route53 Health Check which is triggered by a CloudWatch alarm using expressions, the AWS API returns an internal server error.
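For illustration, this is the kind of metric-math alarm we wanted to use (resource names are hypothetical). It's a valid CloudWatch alarm on its own, but referencing it from a Route53 Health Check triggers the error above:

```hcl
# Hypothetical metric-math alarm: trigger when 5xx responses exceed
# 5% of all requests, rather than a raw error count
resource "aws_cloudwatch_metric_alarm" "error_rate" {
  alarm_name          = "eu-central-1-high-5xx-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  threshold           = 5

  metric_query {
    id          = "rate"
    expression  = "(errors / requests) * 100"
    label       = "5xx error rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "HTTPCode_Target_5XX_Count"
      stat        = "Sum"
      period      = 60
      dimensions = {
        LoadBalancer = aws_lb.eu_central_1.arn_suffix
      }
    }
  }

  metric_query {
    id = "requests"
    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "RequestCount"
      stat        = "Sum"
      period      = 60
      dimensions = {
        LoadBalancer = aws_lb.eu_central_1.arn_suffix
      }
    }
  }
}
```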
Calculated Health Checks
Route53 allows you to fail over based on any number of health checks. By creating a calculated — or parent — health check, we can fail based on any number of child health checks. You can also set a threshold for how many child health checks can fail before the parent is considered unhealthy. We’ve expanded on the simple latency alarm and configured our service to fail over if any of these conditions are met:
- CPU is critically high
- High number of 5xx errors
- High number of target connection errors
- Maximum latency is high
For a single region, the hierarchy looks as follows:
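In Terraform, the parent health check for a region might be sketched like this (the four child health checks are hypothetical, each backed by one of the CloudWatch alarms above):

```hcl
# Parent health check for the region: with a threshold equal to the
# number of children, it becomes unhealthy if ANY single child fails
resource "aws_route53_health_check" "eu_central_1" {
  type                   = "CALCULATED"
  child_health_threshold = 4
  child_healthchecks = [
    aws_route53_health_check.cpu.id,
    aws_route53_health_check.errors_5xx.id,
    aws_route53_health_check.connection_errors.id,
    aws_route53_health_check.latency.id,
  ]
}
```

The parent's id is what gets attached to the region's latency record via health_check_id; the children never reference DNS records directly.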
The health check is configured to fail if any of the alarms are triggered — it’s important for us to fail fast. We’d rather have a region incorrectly or prematurely detected as unhealthy than have downtime in that region. If all records are detected as unhealthy, Route53 reverts to using all of the records — so your service won’t stop serving requests.
Hopefully, AWS will add support for percentile and expression CloudWatch alarms soon. Once they do, we can change our health checks to suit our requirements perfectly:
- CPU is critically high
- High % of 5xx errors
- High % of target connection errors
- High P99 latency
There’s also potential to use custom CloudWatch metrics to fail on more conditions — allowing us to fail even faster.
So, we can implement a great failover strategy using Amazon Route53 today but, in the future, we’ll be able to achieve the perfect strategy.
Are you using Route53 Health Checks? We’d love to hear your thoughts in the comments.