How to implement the perfect failover strategy using Amazon Route53

Using Route53 Health Checks to fail fast

Simon Tabor
Jun 12

The setup

Every application should run in at least two regions for high availability and fault tolerance. The architecture should also be active-active rather than active-passive. In an active-active system, all regions serve traffic at the same time, whereas in an active-passive system the backup region is used only in a failure scenario.

Route53 routing requests across any number of load balancers in different AWS regions

Route53 Routing Policies

We have two (or more) regions running in active-active. Great! But how can Route53 distribute the load between these regions? Enter Route53 routing policies. AWS offers many different policies:

  • Failover routing policy: Designed for active-passive failover. Routes traffic to a primary resource while it is healthy, and to the backup when it is not.
  • Geolocation routing policy: Route traffic based on the country or continent of your users.
  • Geoproximity routing policy: Route traffic based on the physical distance between the region and your users.
  • Latency routing policy: Route traffic to the AWS region that provides the best latency.
  • Multivalue answer routing policy: Respond to DNS queries with up to eight healthy records selected at random.
  • Weighted routing policy: Route traffic to resources in proportions that you specify.

Latency-based routing

The application we’re building has no geographic requirements, so we can use latency routing. So how do we set it up?

Example of Terraform for latency-routing across load-balancers in two regions
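
As a rough illustration of what such a configuration can look like, here is a minimal sketch, not the original code. It assumes one load balancer per region and creates one latency record per region under the same name. The hosted zone variable, the `api.example.com` domain, and the `load_balancers` map are hypothetical placeholders.

```hcl
# Hypothetical inputs: none of these names come from the original setup.
variable "zone_id" {
  type = string # ID of the Route53 hosted zone for example.com
}

variable "load_balancers" {
  # Keyed by AWS region name, e.g. "eu-central-1" and "us-east-1".
  type = map(object({
    dns_name = string # DNS name of the regional load balancer
    zone_id  = string # canonical hosted zone ID of that load balancer
  }))
}

# One latency record per region, all sharing the same name and type.
# Route53 answers each query with the record whose region offers the
# lowest latency to the resolver.
resource "aws_route53_record" "api" {
  for_each = var.load_balancers

  zone_id        = var.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = each.key # must be unique among records with this name

  latency_routing_policy {
    region = each.key
  }

  alias {
    name                   = each.value.dns_name
    zone_id                = each.value.zone_id
    evaluate_target_health = true
  }
}
```

Adding a third region would then only require another entry in the `load_balancers` map.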

Load tests

There’s no need to load test Route53 itself: it handles some of the largest sites in the world (Instagram, Amazon, Netflix). However, we wanted to prove that latency routing would work as expected with millions of requests. Specifically, we wanted to test failover scenarios to see how Route53 would direct traffic. We expected a failover to play out as follows:

  • eu-central-1 becomes overloaded and starts serving requests slowly
  • Route53 detects this increased latency and starts routing new requests to us-east-1
  • Existing connections to eu-central-1 are kept open because, without the incoming requests, CPU usage drops
  • us-east-1 handles the remaining 1m connections, reaching a total of 3m

Graph showing the expected number of open connections during our load test

Graph showing the actual number of open connections during our load test

The solution

Route53 provides a fantastic solution to this problem. Health checks allow DNS failover to be triggered by CloudWatch alarms. This gives us full control over when Route53 fails over to another region, so we can fine-tune it to our needs.

Terraform code provisioning a simple Route53 Health Check for a CloudWatch metric
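
A minimal sketch of such a health check, under stated assumptions rather than the original configuration: the load balancer is assumed to be an Application Load Balancer in eu-central-1, the provider is assumed to be configured for that region, and the alarm name, metric, and threshold are illustrative.

```hcl
# Hypothetical input: the ARN suffix of the regional load balancer.
variable "alb_arn_suffix" {
  type = string
}

# CloudWatch alarm on the maximum target response time (in seconds).
# The threshold and periods are placeholders to tune per service.
resource "aws_cloudwatch_metric_alarm" "latency" {
  alarm_name          = "api-eu-central-1-high-latency"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  statistic           = "Maximum"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 1
  period              = 60
  evaluation_periods  = 2

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
  }
}

# Route53 health check that mirrors the state of the alarm above.
resource "aws_route53_health_check" "latency" {
  type                    = "CLOUDWATCH_METRIC"
  cloudwatch_alarm_name   = aws_cloudwatch_metric_alarm.latency.alarm_name
  cloudwatch_alarm_region = "eu-central-1"

  # Treat missing data as healthy so a quiet region is not failed over.
  insufficient_data_health_status = "Healthy"
}
```

To take effect, the health check has to be attached to the region's DNS record through the `health_check_id` argument of `aws_route53_record`.
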
Graph showing the actual number of open connections during our load test after adding CloudWatch health checks

Limitations

Unfortunately, whilst configuring the health checks, we found two major limitations:

No support for percentiles

Measuring latency without percentiles tells us very little: an average hides the slow tail, and a single outlier dominates the maximum. We initially wanted a health check that monitors P99 connection latency and fails over if it becomes too high. Unfortunately, Route53 cannot use CloudWatch alarms with extended statistics, so we can only trigger a failover based on minimum, average, or maximum latency.
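
For contrast, this is roughly what the alarm we wanted would look like. It is a sketch only: the metric and names assume an Application Load Balancer and are not taken from the original setup. CloudWatch accepts this alarm, but because it uses `extended_statistic` instead of `statistic`, it cannot back a Route53 health check.

```hcl
# Hypothetical input: the ARN suffix of the regional load balancer.
variable "alb_arn_suffix" {
  type = string
}

# P99 latency alarm. Valid in CloudWatch, but not usable as the source
# of a Route53 CLOUDWATCH_METRIC health check (the limitation above).
resource "aws_cloudwatch_metric_alarm" "latency_p99" {
  alarm_name          = "api-eu-central-1-p99-latency"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  extended_statistic  = "p99" # percentile instead of a standard statistic
  comparison_operator = "GreaterThanThreshold"
  threshold           = 1     # seconds, illustrative
  period              = 60
  evaluation_periods  = 2

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
  }
}
```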

No support for expressions

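Expressions allow us to perform calculations on metrics, which means we can have CloudWatch alarms that trigger when the error rate exceeds a certain percentage, rather than simply monitoring the total number of errors. Route53 health checks cannot be based on alarms that use expressions, so we are limited to alarms on a single raw metric.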

Terraform code showing a CloudWatch alarm metric expression
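
A minimal sketch of an alarm built on a metric expression, assuming an Application Load Balancer. The alarm name, the 5% threshold, and the choice of metrics are illustrative and not the original code.

```hcl
# Hypothetical input: the ARN suffix of the regional load balancer.
variable "alb_arn_suffix" {
  type = string
}

# Alarm on the 5xx error rate as a percentage of all requests.
resource "aws_cloudwatch_metric_alarm" "error_rate" {
  alarm_name          = "api-eu-central-1-5xx-rate"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 5 # percent, illustrative
  evaluation_periods  = 2

  # The expression is the only query that returns data to the alarm.
  metric_query {
    id          = "error_rate"
    expression  = "100 * errors / requests"
    label       = "5xx error rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"

    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "HTTPCode_Target_5XX_Count"
      stat        = "Sum"
      period      = 60

      dimensions = {
        LoadBalancer = var.alb_arn_suffix
      }
    }
  }

  metric_query {
    id = "requests"

    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "RequestCount"
      stat        = "Sum"
      period      = 60

      dimensions = {
        LoadBalancer = var.alb_arn_suffix
      }
    }
  }
}
```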

Calculated Health Checks

Route53 allows you to fail over on any number of health checks. By creating a calculated (or parent) health check, we can fail on any number of child health checks. You can also set a threshold for the number of child health checks that can fail before the parent is considered unhealthy. We’ve expanded on a simple latency alarm, and configured our service to fail over if any of these conditions are met:

  • High number of 5xx errors
  • High number of target connection errors
  • Maximum latency is high
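
The parent health check for conditions like these can be sketched as follows. The variable holding the child health check IDs is a hypothetical placeholder. In practice it would hold the IDs of one CloudWatch-backed health check per condition.

```hcl
# Hypothetical input: IDs of the child health checks, one per condition
# (5xx errors, target connection errors, maximum latency).
variable "child_health_check_ids" {
  type = set(string)
}

# Calculated (parent) health check. Route53 considers it healthy only
# while at least `child_health_threshold` children are healthy, so
# requiring all of them means any single failing child fails the parent.
resource "aws_route53_health_check" "region" {
  type                   = "CALCULATED"
  child_healthchecks     = var.child_health_check_ids
  child_health_threshold = length(var.child_health_check_ids)
}
```

Lowering `child_health_threshold` would instead tolerate some failing children before the region is taken out of rotation.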

Conclusions

Hopefully, AWS will soon add Route53 health check support for CloudWatch alarms that use percentiles and expressions. Once that’s available, we can change our health checks to suit our requirements perfectly:

  • High % of 5xx errors
  • High % of target connection errors
  • High P99 latency

DAZN Engineering

Revolutionising the sport industry

Thanks to Yan Cui.
