Automatic Multi-Region API Failover and Geo-proximity Routing on AWS

Kevin Barresi
FTS Engineering
Jan 3, 2018

We can all agree that resiliency and reliability are hugely important when designing and building a service. If you’re unable to send a selfie, it’s a minor annoyance. But what if you’re unable to execute a trade? Download a research report? Access email? No matter the industry or use case, service availability is critical.

Misconfigurations, software bugs, network outages, unavailable file storage: all it takes is one and you could be looking at partial or complete downtime. The clear solution is redundancy, having more than one instance of your service available at the same time.

The second piece of the puzzle is how to make use of that backup system. If you have to wait until somebody realizes there’s an issue, apply a configuration fix, and wait for it to propagate, your downtime can be upwards of 10 minutes (and that’s if you’re quick on the keyboard). Obviously, this is an operational nightmare that delivers inadequate results at the same time. The answer is automatic failover. Depending on your sensitivity to downtime (milliseconds? seconds? minutes?), you could opt for a load balancer or proxy with health checks, or a DNS-based option.

At FinTech Studios, our REST APIs are used by direct API customers as well as any users of our web, mobile, and desktop applications and their variants. Clearly, this is a service that requires the highest uptime possible.

Our solution was simple: create clusters of API instances across several geographic regions. Our API instances are written in Node.js and scale easily with incoming traffic using a service like AWS Elastic Beanstalk. We also decided to provision each region with its own database services, clustered with each other for database-level redundancy. The end result is that we essentially have a fully standalone system in each geographic region we deploy to. Layer on some DNS routing magic, and you have automated failover and active-active load balancing behind a single API domain name.

This was something we were excited to get up and running, until we ran into a few implementation issues.

Being a primarily AWS shop, we use the API Gateway service as the external face of our API services. As some of you may know, API Gateway uses CloudFront behind the scenes. You may also remember that CloudFront does not allow you to re-use domain names across distributions (I’ll touch on the Nov 2017 update shortly). What’s the big issue with that? Well, say we want api.mysite.com to dynamically resolve to api-apac.mysite.com or api-us.mysite.com depending on the client’s geographic location and the health of the APAC and US API instances.

Prior to the Nov 2017 CloudFront/API Gateway updates, these were some of your working and non-working options:

Option 1: Create a CloudFront distribution in each of the geographic regions with the same domain name. Nope, that breaks the “no domain repeats” rule for CloudFront.

Option 2: Put a CloudFront distribution in front of the API Gateway. Sorta. That could technically work, but it has two issues: it adds a redundant CloudFront distribution to the routing path, and CloudFront does not provide automatic failover.

Option 3: Have a domain name for each region, and select one in the API client. Ew. That sounds like a headache, and the failover would have to be done on the front end. Really defeats the purpose of what we’re trying to do.

Option 4: Set up something like HAProxy in each region as a proxy in front of the API Gateway instances. Ew again. To be truly redundant, we would need at least 2x HAProxy instances in each region in different availability zones, with a way to balance traffic between the instances. At this point, we’re almost defeating the purpose of having the API Gateway!

Until recently, the only option you had was Option 4, which involves using HAProxy to proxy traffic to the correct API Gateway behind the scenes.

Example implementation of the HAProxy approach. Route 53 resolves using geoproximity to localize requests to a region, then uses health checks to route around unhealthy API instances. This post does a great job of going into detail on how to actually implement this architecture.

That’s a lot of moving parts! Thankfully, you don’t need to do this anymore. In November 2017, AWS announced support for Regional API Endpoints. Now we’re in business!

So now the approach will be:

  1. Create an API Gateway in each Region.
  2. For each API Gateway, create a custom domain name that is set to the “global” API domain you’d like to use. In our example, that would be api.example.com.
  3. Create a Traffic Policy in Route 53 with rules that determine what api.example.com resolves to.

Let’s step through this in a bit more detail:

The first step is creating your API Gateway in each region. This is pretty straightforward, so we won’t go through it.
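If you’d rather script this step than click through the console, a minimal sketch using boto3 might look like the following. The API name and region list are placeholder assumptions, not values from this post.

```python
# Minimal sketch: create a REST API with a regional endpoint in each region.
# The API name and region list below are placeholders.
import boto3

REGIONS = ["us-east-1", "ap-southeast-1"]

for region in REGIONS:
    apigw = boto3.client("apigateway", region_name=region)
    api = apigw.create_rest_api(
        name="my-api",
        endpointConfiguration={"types": ["REGIONAL"]},
    )
    print(f"{region}: created REST API {api['id']}")
```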

The next step is to create the custom domain name for that API Gateway, with a few new settings:

  1. Navigate to the “Custom Domain Names” section in the API Gateway console.
  2. Select “Create Custom Domain Name”.
  3. Enter api.example.com (replace with your own domain name, obviously) in the “Domain Name” field.
  4. Select “Regional” for the Endpoint Configuration option. This is an important step: a regional endpoint isn’t fronted by the edge-optimized CloudFront distribution that API Gateway normally manages for you, which means you can re-use the same custom domain name across multiple regions. This is the key piece of the puzzle!
  5. Select an ACM certificate that is valid for api.example.com.
  6. Add a mapping that will route requests to the proper API and API stage.
  7. Hit “Save”.
  8. Now do Steps 1–7 in the other region(s) you want the API to serve.

A (slightly redacted) view of how to create the custom domain name for your API Gateway instances. You’ll need to do this for every API Gateway instance you want to route traffic to.
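The same configuration can be scripted with boto3 as well. Below is a minimal sketch for one region; the certificate ARN, REST API ID, and stage name are placeholder assumptions, and you would repeat it in each region.

```python
# Minimal sketch: create a regional custom domain name and map it to an API
# stage. The certificate ARN, REST API ID, and stage name are placeholders.
import boto3

apigw = boto3.client("apigateway", region_name="us-east-1")

domain = apigw.create_domain_name(
    domainName="api.example.com",
    regionalCertificateArn="arn:aws:acm:us-east-1:123456789012:certificate/abc-123",
    endpointConfiguration={"types": ["REGIONAL"]},  # the key setting
)

# Route 53 will alias to this regional domain name / hosted zone pair later.
print(domain["regionalDomainName"], domain["regionalHostedZoneId"])

# Map the root of api.example.com to the "prod" stage of the REST API.
apigw.create_base_path_mapping(
    domainName="api.example.com",
    restApiId="abc123xyz0",
    stage="prod",
)
```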

Next, you’ll need to open Route 53 and create a Traffic Policy that will determine how api.example.com is routed to the backend API Gateway instances.

There are a few approaches for how you can do this, but we took the following approach: route to the closest API instance (determined by latency), with a fallback to another region’s API Gateway if the closest instance fails health checks. In practice, this means that clients will get the best response times, with automatic failover to a different region if something breaks. That’s exactly what we’re looking for!

In the Route 53 policy generator, create a single latency rule. Set your primary region first, and then your second region. For each region listed here, route to a Failover Rule. The Failover Rule should primarily route to the API Gateway instance in that region; if that instance fails its health check, it should route to the other region’s API Gateway instance instead. See our diagram for doing this in a simple two-Region configuration.
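If you prefer to manage this outside the Traffic Policy editor, the same behavior can be expressed with plain Route 53 record sets: a failover pair per region, plus latency records on the public name that alias to them. The sketch below shows one region’s “leg” (the second region mirrors it); the hosted zone ID, regional endpoint values, and the /health path are placeholder assumptions, not values from this post.

```python
# Minimal sketch: latency-based routing with per-region failover, using plain
# Route 53 record sets instead of a Traffic Policy document. All IDs, domain
# names, and the health check path are placeholders.
import uuid

import boto3

route53 = boto3.client("route53")
ZONE_ID = "ZEXAMPLE123456"  # hosted zone for example.com

# Regional values returned by create_domain_name() in each region.
US = {"dns": "d-abc123.execute-api.us-east-1.amazonaws.com", "zone": "ZEXAMPLEUSEAST1"}
AP = {"dns": "d-def456.execute-api.ap-southeast-1.amazonaws.com", "zone": "ZEXAMPLEAPSE1"}

# Health check against the US regional endpoint.
us_check = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": US["dns"],
        "ResourcePath": "/health",
        "Port": 443,
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)["HealthCheck"]["Id"]

def alias(zone_id, dns_name):
    return {"HostedZoneId": zone_id, "DNSName": dns_name, "EvaluateTargetHealth": True}

changes = [
    # Failover pair for the US leg: primary is the local gateway,
    # secondary is the other region's gateway.
    {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "api-us.example.com", "Type": "A", "SetIdentifier": "us-primary",
        "Failover": "PRIMARY", "HealthCheckId": us_check,
        "AliasTarget": alias(US["zone"], US["dns"])}},
    {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "api-us.example.com", "Type": "A", "SetIdentifier": "us-secondary",
        "Failover": "SECONDARY", "AliasTarget": alias(AP["zone"], AP["dns"])}},
    # Latency record pointing the public name at the US leg; a mirrored record
    # for ap-southeast-1 would point api.example.com at api-apac.example.com.
    {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "api.example.com", "Type": "A", "SetIdentifier": "us-latency",
        "Region": "us-east-1",
        "AliasTarget": alias(ZONE_ID, "api-us.example.com")}},
]

route53.change_resource_record_sets(HostedZoneId=ZONE_ID, ChangeBatch={"Changes": changes})
```

The Traffic Policy editor builds essentially this same record tree for you, with the added benefits of versioning and reuse across record names, so either route gets you to the same place.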

And that does it! Now it’s extremely simple to provide automatic multi-region failover and routing for API Gateways on AWS.

At FTS, these changes resulted in a marked improvement for our customers: in addition to improving service availability, this DNS-load-balanced API system improved access speeds as well. If you’re in Asia, you’ll now be transparently routed to an instance located in Asia, rather than being routed back to Europe or the United States.

It also reduced the operational overhead of managing all those HAProxy instances and DNS routing rules. As we continue to expand into more and more regions, we can rest easy knowing the process is much simpler.

Until next time!

- Kevin Barresi
