How We Set Up Cross-Region Failover

Wei Wei
Flux | Engineering
Sep 30, 2021
[Cover image: Colorado River confluence]

Why Another Region

Until recently, we deployed our system in multiple AZs (Availability Zones) within one AWS region. It was a practical and economical choice with a good ROI:

  • A decent uptime (~99.9%) and durability SLA from AWS for a multi-AZ deployment within one region, which already meets our business needs.
  • Very little complexity overhead. Most AWS services support multiple AZs with simple configuration.

But our needs are growing beyond what a single-region setup can offer, to name a few:

  • Better uptime, or a better RTO (Recovery Time Objective). In the rare case that an entire AWS region, or some of the services we use in it, goes down for hours, we’d like to have options instead of being sitting ducks.
  • Better durability. All of our data is backed up to S3, which claims a famous eleven 9's of durability. However, there are two catches: 1. it doesn’t completely prevent human error, even with versioning turned on for every bucket; 2. what if your data happens to be the “lucky” one while millions of other people’s data are still intact? In other words, “your nines are not my nines”.
  • A better RPO (Recovery Point Objective). Some of our data stores were backed up only hourly or daily. If a source store were destroyed without notice, how much data would we lose? It should be much less than an hour’s worth.

Our Solutions

After setting our goals, we determined that our strategy would be to make the system able to fail over to another region, i.e. disaster recovery. We don’t need to run our system simultaneously in multiple regions as a “hot-hot” setup; that would be overkill for our needs and take a lot more effort. It’s important to pick a strategy based on your business needs to balance return and effort.

The next step was to work out the actual move and failover method, component by component. I will go over a few main components of our system and the solution we chose for each of them.

Front-End

Our front-end is an SPA (Single-Page Application) static site hosted on AWS S3 + CloudFront CDN. It turned out to be the easiest piece to deploy cross-region. We just need to put the static files into an S3 bucket in the secondary region, configure that bucket as a backup origin, and turn on CloudFront origin failover. CloudFront itself is a global service: if the S3 service in the primary region goes down, CloudFront automatically uses the secondary region’s S3 bucket as the origin. We also set up cross-region replication from the primary S3 bucket to the secondary bucket to keep the content in sync.
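
A minimal sketch of the origin-failover piece in CDK (TypeScript, CDK v2), assuming the two buckets already exist; the bucket names and fallback status codes here are placeholders, not our real values:

```ts
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';

export class SpaFailoverStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Buckets holding the SPA assets; the secondary bucket is kept in sync
    // via S3 cross-region replication (configured separately).
    const primaryBucket = s3.Bucket.fromBucketName(
      this, 'PrimaryBucket', 'spa-assets-us-east-1');   // hypothetical names
    const secondaryBucket = s3.Bucket.fromBucketName(
      this, 'SecondaryBucket', 'spa-assets-us-west-2');

    // Origin group: CloudFront retries against the secondary bucket when
    // the primary origin returns one of the listed status codes.
    const originGroup = new origins.OriginGroup({
      primaryOrigin: new origins.S3Origin(primaryBucket),
      fallbackOrigin: new origins.S3Origin(secondaryBucket),
      fallbackStatusCodes: [500, 502, 503, 504],
    });

    new cloudfront.Distribution(this, 'SpaDistribution', {
      defaultBehavior: { origin: originGroup },
      defaultRootObject: 'index.html',
    });
  }
}
```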

Web Servers

Our web servers are 100% stateless, so it is fairly easy to run them in multiple regions. They just need to be configured to interact with the services and resources in their own region, e.g. SNS topics, SQS queues, and S3 buckets.

Web server failover will be a manual DNS change that aliases a global subdomain to a regional subdomain. This old-school method has its drawbacks, e.g. DNS caching, so we also looked into AWS Global Accelerator. But eventually we deemed the DNS method very simple and good enough for our use case.
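
For illustration, the cutover could be scripted against the Route 53 API along these lines (AWS SDK for JavaScript v3; the hosted zone ID and domain names are hypothetical):

```ts
import {
  Route53Client,
  ChangeResourceRecordSetsCommand,
} from '@aws-sdk/client-route-53';

// Repoint the global subdomain at the secondary region's regional subdomain.
async function failoverDns(): Promise<void> {
  const route53 = new Route53Client({});
  await route53.send(new ChangeResourceRecordSetsCommand({
    HostedZoneId: 'Z0123456789EXAMPLE',
    ChangeBatch: {
      Comment: 'Failover: point api at the secondary region',
      Changes: [{
        Action: 'UPSERT',
        ResourceRecordSet: {
          Name: 'api.example.com',
          Type: 'CNAME',
          TTL: 60, // a short TTL limits how long stale answers stay cached
          ResourceRecords: [{ Value: 'api.us-west-2.example.com' }],
        },
      }],
    },
  }));
}

failoverDns().catch(console.error);
```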

Backend Jobs

Flux also has various backend jobs, e.g. machine learning data pipelines, and Lambdas that work like cron jobs or consume SQS/SNS messages. These are similar to the web servers: straightforward to port over to another region. We just need to make sure their configurations correctly point to the secondary region.
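
For example, a job can stay region-agnostic by reading the region and any regional ARNs from its environment. The environment variable name below is hypothetical; see the Configurations section later for how such values get injected:

```ts
import { SNSClient, PublishCommand } from '@aws-sdk/client-sns';

// The topic ARN is injected per region at deploy time, so the same job code
// runs unchanged in either region.
const sns = new SNSClient({ region: process.env.AWS_REGION });

export async function notifyJobDone(jobId: string): Promise<void> {
  await sns.send(new PublishCommand({
    TopicArn: process.env.JOB_EVENTS_TOPIC_ARN, // hypothetical env var
    Message: JSON.stringify({ jobId, status: 'done' }),
  }));
}
```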

OLTP Database

OLTP is generally the hard part of deploying globally, as it involves write operations. Flux uses AWS Aurora Postgres, so we leveraged its “global database” feature to convert our existing cluster to span two regions.

Aurora global database helps us achieve a pretty good RPO: in a disaster where the primary cluster is shut down unexpectedly, the data loss will be within seconds. It also allows us to run a “headless” cluster in the secondary region, so data is continuously replicated there, but we don’t need to run or pay for a database instance in the secondary region until we use it. This is a benefit of separating compute and storage.
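
Putting the last two paragraphs together, a rough sketch of the conversion plus the headless secondary through the RDS API could look like this (AWS SDK for JavaScript v3; identifiers, regions, and account ID are made up, not our actual setup):

```ts
import {
  RDSClient,
  CreateGlobalClusterCommand,
  CreateDBClusterCommand,
} from '@aws-sdk/client-rds';

async function setUpGlobalDatabase(): Promise<void> {
  // 1) Create the global database on top of the existing primary cluster.
  const primaryRds = new RDSClient({ region: 'us-east-1' });
  await primaryRds.send(new CreateGlobalClusterCommand({
    GlobalClusterIdentifier: 'flux-global',
    SourceDBClusterIdentifier:
      'arn:aws:rds:us-east-1:123456789012:cluster:flux-primary',
  }));

  // 2) Add a secondary cluster in the other region with storage only and no
  //    DB instances, so data replicates continuously without idle compute.
  const secondaryRds = new RDSClient({ region: 'us-west-2' });
  await secondaryRds.send(new CreateDBClusterCommand({
    DBClusterIdentifier: 'flux-secondary',
    Engine: 'aurora-postgresql',
    GlobalClusterIdentifier: 'flux-global',
  }));
}

setUpGlobalDatabase().catch(console.error);
```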

Aurora global database also has its limits. The most prominent one is that the Aurora Postgres version supports only one write endpoint, meaning that even if we run a DB instance in the secondary region, it will be read-only. The Aurora MySQL version solves this with “write forwarding” to the primary region, which Aurora Postgres doesn’t support. This is probably the biggest reason we cannot easily support a multi-region hot-hot deployment with our current stack.

Aurora global database can fail over automatically within a region. Cross-region failover needs to be done manually by promoting the secondary region’s cluster to be the new primary. A DNS cutover will also be needed.
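
A sketch of what the manual promotion could look like via the RDS API (AWS SDK for JavaScript v3; identifiers are hypothetical):

```ts
import {
  RDSClient,
  RemoveFromGlobalClusterCommand,
} from '@aws-sdk/client-rds';

// Detach the secondary cluster from the global database, which promotes it
// to a standalone, writable cluster. If the secondary was running headless,
// a DB instance still has to be added, and the application DNS still has to
// be cut over to the promoted cluster's endpoint.
async function promoteSecondary(): Promise<void> {
  const rds = new RDSClient({ region: 'us-west-2' });
  await rds.send(new RemoveFromGlobalClusterCommand({
    GlobalClusterIdentifier: 'flux-global',
    DbClusterIdentifier:
      'arn:aws:rds:us-west-2:123456789012:cluster:flux-secondary',
  }));
}

promoteSecondary().catch(console.error);
```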

Data Warehouse, Object Storage, Docker Repository

Flux uses Snowflake for its data warehouse, S3 for object storage, and AWS ECR for Docker images. These long-term storage systems share a similar approach:

  • Set up corresponding storage (databases/buckets/repositories) in the secondary region.
  • Set up replication jobs from the primary storage to the secondary. S3/ECR replication runs continuously, with latency generally within seconds to a few minutes. Snowflake replication jobs run at the interval we specify.

If a failover happens on these services, syncing data back to the original primary storage afterwards might be a complicated process.
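
As one example, ECR registry-level replication can be declared in CDK roughly like this (TypeScript, using the L1 construct; the destination region and account ID are placeholders):

```ts
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ecr from 'aws-cdk-lib/aws-ecr';

export class EcrReplicationStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Registry-level replication: every repository in this account/region is
    // continuously copied to the secondary region.
    new ecr.CfnReplicationConfiguration(this, 'Replication', {
      replicationConfiguration: {
        rules: [{
          destinations: [{
            region: 'us-west-2',
            registryId: '123456789012',
          }],
        }],
      },
    });
  }
}
```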

Configurations

Last but not least, configurations. We use AWS Parameter Store to store most configurations and allow overrides via environment variables. Somewhat surprisingly, Parameter Store is a regional service, so we had to copy the configuration over to the secondary region. We then found that some configuration values were themselves regional, e.g. the ARNs of some AWS resources. Instead of maintaining those by hand, we converted them to environment variables defined in our CDK stacks, so we don’t need to worry about keeping separate configuration values in sync across regions.
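
A simplified sketch of that pattern in a CDK stack (TypeScript, CDK v2; the resource names and asset path are hypothetical): the ARN resolves to whichever region the stack is deployed in and is handed to the Lambda as an environment variable, so no per-region Parameter Store entry is needed.

```ts
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as lambda from 'aws-cdk-lib/aws-lambda';

export class JobsStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Created by this stack, so its ARN is regional by construction.
    const jobEvents = new sns.Topic(this, 'JobEvents');

    new lambda.Function(this, 'JobWorker', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/job-worker'), // hypothetical path
      environment: {
        JOB_EVENTS_TOPIC_ARN: jobEvents.topicArn, // regional ARN injected here
      },
    });
  }
}
```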

What We Learned

There are some lessons we learned from this process. To list a few:

It pays off to do infrastructure as code

We have used CloudFormation from the beginning to define most infra configurations (RDS was the only exception), and we recently switched to CDK because it was easy to migrate to and it is nice to use JavaScript to generate templates. It would be unimaginable to manually set up all these systems in different regions and keep them in sync. Infrastructure-as-code is the way to go.

Your system only works in another region after it works in another region

We always had this cross-region goal in mind and tried to avoid any hardcoded regional resources. But over the years, some code that only works in one region still slipped into the system. We uncovered several such issues during the process and fixed all of them. It would have been impractical to uncover them without doing an actual deployment.

Also, we identified a bunch of AWS resources we had manually created in the primary region in the early days and never thought about until now, like an SSL certificate or a Lambda layer from a vendor. Moving to another region is a great exercise for making the infra code more thorough.

Keep it running there

After we stood up a whole environment in the secondary region, we decided to leave it there, continuously running as our preview environment. We also updated all deployment pipelines to point this environment at the secondary region and tore down the preview environment in the primary region. This is in the same spirit as CI/CD: it ensures the system still works in the other region when we actually need it.

This sums up our experience of setting up cross-region failover. Thanks for reading. Any suggestions or comments are greatly welcome!
