Disaster Tolerance Patterns Using AWS Serverless Services

Details for practical implementation and uncovering gotchas so you don’t have to

Ken Robbins
CloudPegboard
May 3, 2019 · 22 min read


Major AWS or network faults are rare, but do happen.

In my previous post (Disaster Recovery for Cloud Solutions is Obsolete) I asserted that you should design your cloud architectures for Disaster Tolerance from the start (even if it is counterintuitive to do so by lean principles). I also argued that you should do this because it’s easy if you do it now, and it will help your business even if there is never a disaster. The problem is that while that’s all true, in practice there are enough gotchas that what should be easy can take you down a lot of rabbit holes before you get to where you need to be. I’ve recently gone through the exercise for my current startup (Cloud Pegboard) and would like to share those learnings so that you get the benefits of what’s possible without having to go down and back out of dead ends in the maze.

[While there is a synopsis below of all of the key insights, this is a long post. I considered splitting this, but that seemed artificial and less useful. The goal is to make this a useful reference (it’s really more of a white paper than a blog post I suppose). I guess you’ll let me know if I made the right choice.]

Ready to ride along?

Okay, here’s our challenge: create a new SaaS service on AWS that delights users, make it highly available even if there is a disaster or failure of a scale that knocks out an entire AWS region or an entire service within the region, and do all this with minimal extra effort and expense to create and operate the service. We’re a startup, so we need to focus most of our attention on delivering user value, but we are confident enough in our future success that we know we don’t want to create a heap of technical debt that could have been readily avoided with just a little foresight.

Our general architectural approach is to exclusively use “high-in-the-stack” serverless technologies from AWS and fully automate our infrastructure (both as motivated by Cloud Architecture Principles). These decisions buy us a lot of resilience out of the gate and set us up to readily handle disasters by making relatively minor architectural enhancements, leveraging other built-in AWS capabilities. The other strategic choice that frames the rest of this discussion is the decision to embrace a “Disaster Tolerance” (DT) approach instead of traditional Disaster Recovery (DR). Since Disaster Tolerance is not really (yet) a term of art, here’s my brief definition:

Disaster Tolerance is the characteristic of a complete operational solution to withstand large scale faults without requiring any (or at least any significant) manual intervention. Disaster Tolerance is fault tolerance expanded to cover disaster level (e.g., region failure) faults. Disaster Tolerance contrasts with Disaster Recovery which is an approach that reacts to a disaster incident by executing a set of one-time “recovery” procedures to restore service.

Roadmap to this post

In the remainder of this post, I’ll describe a collection of discrete patterns used to add DT capabilities to the Cloud Pegboard service (a tool for AWS practitioners to keep up with the rapid change rate of AWS services). Without investing time in describing the full service architecture, for our purposes it’s enough to say that we have a website powered by CloudFront, S3, Cognito, and Route 53, and a backend (APIs and data pipelines) formed from API Gateway, Lambda functions, Step Functions, and DynamoDB.

For each point of disaster risk in the technology stack, I’ll describe the pattern used and most importantly, the nuances, gotchas and other learnings that I think and hope will save you time by reading this instead of retracing wrong turns that we’ve already taken.

Since this discussion is based on an actual practical implementation, it certainly does not touch on all AWS services (e.g., I don’t talk about RDS, EC2, etc. at all). And of course, there are many ways to solve problems; maybe you have better patterns or see flaws in mine. If so, speak up and share!

To gain full context and understanding, you’ll likely need to read the details, but to help guide you to the relevant sections and provide a reusable reference, following is a synopsis of the patterns and the key highlights.

Patterns and practices synopsis

S3 resilience

  • Use versioning and cross region replication for S3 buckets
  • Use CloudFront origin failover for read access to replicated S3 buckets

DynamoDB resilience

  • Use global tables for DynamoDB tables

API Gateway and Lambda resilience

  • Use a regional API Gateway and associated Lambda functions in each region
  • Use Route 53 latency or failover routing with health checks in front of API Gateways

Cognito User Pools resilience

  • Create custom sync solution for now

Tips, gotchas, and other useful observations synopsis

AWS regions are not symmetric

  • The services you rely on may not be available in your chosen failover region. Review coverage before selecting regions

Amazon CloudFront origin failover

  • Origin Groups do not support POST (or PUT, DELETE). Therefore, origin failover cannot be used as a front-end interface to API Gateway (or other writable API interfaces) in a failover pattern
  • Origin Groups are not currently supported in CloudFormation (they are supported by CLI, SDKs, Terraform)

Amazon CloudFront and Authorization headers

  • Authorization header is not passed by default. Must add it to the headers whitelist if using CloudFront in front of an API Gateway or other endpoint that needs the Authorization header.

Amazon DynamoDB global tables

UPDATE: In November of 2019, AWS updated how global tables work in DynamoDB. That update mitigates the two caveats below. See this AWS blog post for details.

  • Tables must be empty when you configure a global table or add a new table to an existing global table group
  • When global tables are enabled, 3 special attributes are added to the table. Review to ensure they do not negatively impact your application

Amazon DynamoDB backup restores

  • Restoring must be to a new table (whether a regular or PITR restore)
  • This has particular implications when trying to restore a global table

AWS Route 53 health checks have some us-east-1 dependencies

  • Route 53 metrics (used by health checks) are only available in us-east-1
  • Can only send alarm notifications to SNS topics in us-east-1
  • You therefore may not get failover notifications if there is a major impairment of us-east-1 or one of the services required to detect failures and send notifications

Amazon Simple Email Service (Amazon SES) regional considerations

UPDATE: SES is now available in 6 regions: ap-south-1, ap-southeast-2, and eu-central-1 were added after this was first written.

  • SES is only available in us-east-1, us-west-2, and eu-west-1
  • Domains and email addresses must be verified in each region you’ll use SES
  • Moving out of the SES sandbox and setting sending limits must be done (via a support ticket) for each region you plan to send from
  • For high volume senders, you may get deliverability issues if you suddenly ramp up traffic from an alternate region since the sending IPs will be different

Amazon Cognito has no backup/restore for User Pools and no cross-region synchronization

  • You’ll need to devise a custom replication solution (and likely have users do a password reset in a failover scenario)
  • Or wait out any outage, or use a different authentication service such as AWS Directory Service

Patterns and considerations explained

In this section I’ll explain the details behind the above summaries. Our goal is to develop Disaster Tolerance to either a full AWS region failure (highly unlikely, but possible) or the failure of a service within a region (relatively rare, but this does happen). We’d like our RTO (Recovery Time Objective) and RPO (Recovery Point Objective) to be between 0 and 5 minutes. Our business does not necessarily require such aggressive targets, but these goals force us to architect in a way that follows a Disaster Tolerance mindset instead of a Disaster Recovery mindset — and as I’ve asserted previously, if done by design, this shouldn’t really cost more and has a much greater chance of working if actually needed.

If we said, sure, RTO can be an hour or more, then we would likely have ended up with a plan that just says to re-run all of our CloudFormation and manual steps, and to test this once or twice a year. The problem is that with infrastructure design drift, application evolution, and other variables (including subtleties such as long DNS TTLs that may be hard to test), would you have high confidence that if you needed to execute your DR plan 5 months after your last test, it would work within the allotted hour? Plus, how much time does the entire team dedicate every 6 months to testing, repairing, and retesting the DR plan? This is exactly the same reason that for application development we’ve all moved from infrequent releases to CI/CD.

One last bit of context before we dive into the specifics. The patterns described below can refer to any two or more regions. In our specific case, we consider us-east-1 (N. Virginia) to be our primary site and us-west-2 (Oregon) to be our secondary or failover site (though for some aspects both are concurrently active as opposed to active-passive). As you know, AWS regions are not symmetric and not all services exist in all regions. For example, we use Amazon SES, which is only available in three regions around the world. Therefore, where services are available should be considered in your initial planning. If you want to use a particular region, but it’s missing a required service, you could consider adding an additional region just for the missing services (though that starts getting complicated).

S3 resilience

Since we rely on S3 for our website hosting and part of our back-end data, we need to protect against an entire region failure or the failure of S3 in a given region (unlikely, but it did occur in February of 2017). We’ll do this using these two patterns:

  • Use versioning and cross region replication for S3 buckets
  • Use CloudFront origin failover for read access to replicated S3 buckets

S3 cross-region replication is an easy, straightforward way to automatically ensure that you have a complete copy of all of your data in an alternate region. Versioning is a good data protection practice in its own right, and since replication requires it, you’ll need to turn it on anyway. Given the extreme ease and effectiveness of this, it’s one of your biggest bangs for the buck. It’s serverless (meaning AWS manages everything for you) and just works with no further attention by you.
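
As a concrete illustration, here is a minimal boto3 sketch of enabling versioning and cross-region replication; the bucket names and replication role ARN are hypothetical, and in our case this is actually expressed in CloudFormation rather than scripted.

```python
import boto3

s3 = boto3.client("s3")

# Versioning must be enabled on both the source and destination buckets
# before replication can be configured.
for bucket in ("my-site-us-east-1", "my-site-us-west-2"):  # hypothetical names
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate the entire source bucket to the failover region's bucket. The
# role must allow s3:GetReplicationConfiguration/s3:ListBucket on the
# source and s3:Replicate* on the destination.
s3.put_bucket_replication(
    Bucket="my-site-us-east-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",  # hypothetical
        "Rules": [
            {
                "ID": "replicate-all",
                "Status": "Enabled",
                "Prefix": "",  # empty prefix = whole bucket
                "Destination": {"Bucket": "arn:aws:s3:::my-site-us-west-2"},
            }
        ],
    },
)
```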

If you have a web presence as part of your solution like we do, likely and hopefully you are already using CloudFront. Since CloudFront edge locations are globally distributed and AWS uses smart DNS decisions to route your users to the best edge location, you get a lot of resilience just by using CloudFront in front of your S3. However, if you are relying on an S3 origin in a single region, an S3 regional failure will cause your service to fail. Only last fall, AWS announced a new feature for CloudFront called Origin Failover. This feature allows you to create a group of origins. If CloudFront determines that the primary has failed (based on configurable return statuses or a timeout) it uses the secondary instead. Since we have S3 replication configured, this works great to provide some peace of mind and automatic handling of a region failure. This is a good example of Disaster Tolerance instead of Disaster Recovery; it’s dead simple to configure, doesn’t really cost anything significant (websites don’t typically have Terabytes of data), and requires no action by a human operator to remediate.

Did I say dead simple? Well, yeah, I did but… All of our infrastructure is configured by code (we use CloudFormation with a little shell and Python on the edges). However, at the moment, Origin Groups cannot be configured by CloudFormation. Therefore, we need to use documentation and the console to configure Origin Groups (which are supported by CLI, SDK, and Terraform — I’d expect CloudFormation support will come soon enough).
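
Until that support arrives, the Origin Group can be scripted. Here is a rough boto3 sketch; the distribution ID and origin IDs are hypothetical, and it assumes the distribution already contains both replicated S3 buckets as origins.

```python
import boto3

cf = boto3.client("cloudfront")
dist_id = "E1ABCDEFGHIJK"  # hypothetical distribution ID

# update_distribution requires the full current config plus its ETag.
resp = cf.get_distribution_config(Id=dist_id)
config, etag = resp["DistributionConfig"], resp["ETag"]

# Group the two existing S3 origins and fail over on 500/502/503/504.
config["OriginGroups"] = {
    "Quantity": 1,
    "Items": [
        {
            "Id": "site-origin-group",
            "FailoverCriteria": {
                "StatusCodes": {"Quantity": 4, "Items": [500, 502, 503, 504]}
            },
            "Members": {
                "Quantity": 2,
                "Items": [
                    {"OriginId": "s3-primary-us-east-1"},   # hypothetical IDs
                    {"OriginId": "s3-secondary-us-west-2"},
                ],
            },
        }
    ],
}

# Point the default behavior at the group instead of a single origin.
config["DefaultCacheBehavior"]["TargetOriginId"] = "site-origin-group"

cf.update_distribution(Id=dist_id, DistributionConfig=config, IfMatch=etag)
```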

DynamoDB resilience

The other place that we persist data is in DynamoDB. Therefore, here again we need to protect against a regional failure of the service. DynamoDB has had a very strong track record, but there was a significant outage in September 2015. In late 2017, AWS announced a powerful DynamoDB feature, “global tables”. We do enable daily backups (via AWS Backup, though these could be native DynamoDB on-demand backups too) as well as PITR (Point-In-Time Recovery), but that’s just part of normal good practices and not actually part of our Disaster Tolerance design. For DT, our pattern is to use global tables for all of our production DynamoDB tables.

  • Use global tables for DynamoDB tables

Without any maintenance or attention, DynamoDB will continuously replicate your table to one or more alternative regions. This is really quite incredible. Note that while S3 cross-region replication pushes unidirectionally from a source bucket to a destination bucket in a different region, DynamoDB globally synchronizes all changes to all tables in a global tables group (last writer wins). Depending on your application design, you may need to take the multi-master attribute into account. In our case, it’s acceptable to have any region write to a particular global table, but we still created an environment variable that can turn off (in a given region) the write to the DB in case we ever want to inhibit this behavior (e.g., for testing), and another environment variable for the table name so we can also just write to a separate table if needed.

UPDATE: As of November 2019, creating global table configuration is different and easier. Please review updated documentation. One relevant post is here.

While this is a remarkable capability, there are a few important caveats that you’ll need to take into account. The most important challenge is that in order to create a group of tables as a “global table,” all member tables must be empty! Also, if you want to add a table to an existing group in the future, all tables again must be emptied first. Therefore, it is much easier to set up global tables when you first create your infrastructure than to try to add them later.

If you do need to create global tables after the fact, then you have two options. One is to create a new set of tables (one in each region) and copy your original data to one of the new tables (the others will get the data automatically from the replication). If this is going to be your approach, you will likely do this multiple times. Therefore, you’ll be happy if you pick a reasonable naming scheme that adds a semantic suffix to your table names and makes it easy for your application software and infrastructure code to handle this table versioning. One possible gotcha here is that if you use good practices for IAM policy statements and give specific resource ARNs, then changing the names of your tables might break your authorization unless you either update all relevant IAM policy statements or make sure that your table resource attributes have an appropriate wildcard. This is all pretty simple, but you can see how if you don’t plan it out a little from the start, it might get nasty to debug and retrofit.

Another approach (which may require a service interruption depending on your data volume) is to empty your tables in place, apply the global table configuration, and then reload the tables. Yes, with infrastructure as code it should be easy to just create new tables, but sometimes there are just a lot of dependencies (different application software using a specific name, triggers, IAM permissions, etc.). So, if you are just trying to add a new region (whether you previously had no global table configuration or are adding to an existing configuration), then a dump-empty-configure-load pattern may be appropriate. AWS has several tools that can help here: AWS Database Migration Service, AWS Data Pipeline, or custom CLI- or API-based code. In my case, I felt like I was yak shaving to configure and use DMS or Data Pipeline just to complete this very narrow task of enabling global tables (and our data volume is low). I decided to just whip up a library of DynamoDB utilities in Python using Boto3 (I should have done this years ago, actually). I don’t know if this was the best choice or not (it took me a few hours longer than I thought it would), but I have gotten a lot of reuse from my little library and it makes it straightforward to code various DynamoDB automations for other purposes as well.

[I have not posted my library to GitHub since I haven’t tested it (and don’t plan to test it) on configurations and data beyond my use case, and it would need some work to make it publicly useful. If there is interest, I might be peer pressured into putting in the time to make it public-worthy. Also, I later found, but did not use, dynamodump, which seems to have much of the functionality of my library (minus the global table configuration).]

Example signature of utility function to create a global DynamoDB table if table is not empty
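
The code embed from the original post isn't reproduced here, but a minimal sketch of that utility (using the legacy 2017.11.29 global tables API, with hypothetical dump/empty/load helpers from our library) looks roughly like this:

```python
import boto3

def make_global_table(table_name, regions, dump_items, empty_table, load_items):
    """Create a legacy (2017.11.29) global table even if the source table
    already has data, via dump-empty-configure-load.

    dump_items, empty_table, and load_items are hypothetical helpers that
    scan a table into a list, delete every item, and batch-write a list
    back, respectively.
    """
    primary = regions[0]
    ddb = boto3.client("dynamodb", region_name=primary)

    # 1. Save the existing data, then delete it so every member is empty.
    items = dump_items(table_name, primary)
    empty_table(table_name, primary)

    # 2. Configure the global table group. Every member table must already
    #    exist in its region, be empty, share the same key schema, and have
    #    streams (NEW_AND_OLD_IMAGES) enabled.
    ddb.create_global_table(
        GlobalTableName=table_name,
        ReplicationGroup=[{"RegionName": r} for r in regions],
    )

    # 3. Reload into one region; global table replication fans it out.
    load_items(table_name, primary, items)
```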

DynamoDB Decimal strikes again

Great, now we have global tables. When I first tested with my new configuration, even in my primary region with known working code, I got a Python exception. Why? Well, once you configure global tables, AWS adds three attributes to each global table in the group:

Example of AWS-added attributes in global DynamoDB tables
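
The original screenshot isn't reproduced here, but with the legacy global tables version an item comes back looking something like this (attribute names per the 2017.11.29 documentation; the values are illustrative):

```python
from decimal import Decimal

# Shape of an item returned by a boto3 Table resource once legacy
# (2017.11.29) global tables are enabled.
item = {
    "service": "lambda",                             # your own attributes
    "region": "us-east-1",
    "aws:rep:deleting": False,                       # added by global tables
    "aws:rep:updateregion": "us-east-1",             # region of the last write
    "aws:rep:updatetime": Decimal("1556884801.39"),  # Number -> Decimal in Python
}
```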

This is fine, but I had multiple places where I pass back the entire item. However, I also use Python and in particular the json.dumps() function, which does not have a Decimal type encoder. Since I had otherwise avoided using Number types in my schema, aws:rep:updatetime was the first Number my code encountered, and it caused an exception. This won’t bite everyone, but knowing that this is happening behind the scenes might save you some debug time. If you happen to be using Python, here’s how I solved this for the json.dumps() case.
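
The gist embedded in the original post isn't shown here, but the fix is the usual JSONEncoder subclass; a minimal sketch (adjust the int/float handling to your data):

```python
import json
from decimal import Decimal

class DecimalEncoder(json.JSONEncoder):
    """Teach json.dumps() to serialize the Decimal values that the boto3
    DynamoDB resource returns for Number attributes."""
    def default(self, obj):
        if isinstance(obj, Decimal):
            # Preserve whole numbers as ints, everything else as floats.
            return int(obj) if obj % 1 == 0 else float(obj)
        return super().default(obj)

# Usage: json.dumps(item, cls=DecimalEncoder)
```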

Keep this around. You’ll use it a lot if working with Python and DynamoDB.

At other times I don’t want the attributes at all and therefore used item.pop("aws:rep:updatetime", None) to get rid of the field entirely.

UPDATE: The November 2019 update to global tables means that the aws:rep attributes are no longer in your table. Note that there is the notion of a version on global tables now. The newer version that does not insert aws:rep attributes is 2019.11.21. The legacy global table version is 2017.11.29.

DynamoDB restores only to new tables

Most everyone knows this, but it’s relevant to any discussion on disaster recovery or tolerance. When restoring from a DynamoDB backup (whether produced from an on-demand or PITR backup), you must restore to a new table. If you are using global tables, then you get a dependency deadlock since to create a global table it must be empty but restoring can’t restore into an existing empty table. Here’s where I was again glad for my little bag o’ DynamoDB utilities. To restore, I just restore to a new temporary table (not global) and then use my utility to dump from the table to an S3 file. Then I use the other utilities to empty the existing global tables and reload from the S3 file that I created from the restored table.
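
Sketched with boto3 (the dump/empty/load helpers are the same hypothetical utilities mentioned above), the flow looks roughly like this:

```python
import boto3

def restore_into_global_table(table_name, backup_arn, region,
                              dump_items, empty_table, load_items):
    """Restore a backup into an existing global table: restore to a new
    temporary table, dump it, empty the live table, and reload."""
    ddb = boto3.client("dynamodb", region_name=region)

    # 1. Backups can only be restored to a NEW table, never in place.
    temp_table = f"{table_name}-restore-tmp"
    ddb.restore_table_from_backup(TargetTableName=temp_table, BackupArn=backup_arn)
    ddb.get_waiter("table_exists").wait(TableName=temp_table)

    # 2. Dump the restored data, empty the live global table in one region
    #    (replication propagates the deletes), and reload the saved items.
    items = dump_items(temp_table, region)
    empty_table(table_name, region)
    load_items(table_name, region, items)
```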

API Gateway and Lambda resilience

For APIs, we use Amazon API Gateway that invokes AWS Lambda functions. To provide resilience to these services, after some dead ends, we ended up with the following pattern:

  • Use a regional API Gateway and associated Lambda functions in each region
  • Use Route 53 latency (or failover) routing with health checks in front of API Gateways

Since we had good success with the CloudFront origin failover strategy mentioned earlier, and since an API Gateway endpoint is a valid origin type, it seemed clean and reasonable to simply put our API Gateways behind CloudFront and rely on origin failover in the same way as we do for S3. We went down this path for a while, fighting typical debugging and design challenges (which can take a very long time with CloudFront since even small changes can take 20–40 minutes to complete due to CloudFront’s worldwide presence and corresponding update latencies). Eventually, we got this working for our simple GET REST calls. I was happy. Having achieved that milestone, we of course then tested a POST API, fought with this for a while, and then finally noticed the doc (which wasn’t subtle, I just missed it) that noted that Origin Groups only work for GET, HEAD, and OPTIONS methods. I was now sad.

One of the tangents in the original debugging still did yield some reusable learning. In attempting to get the API Gateway authorizer to work, after much tricky debugging (visibility is limited for this), we realized that CloudFront does not pass the Authorization header by default. If you do happen to use CloudFront in front of API Gateway and need the Authorization header, then you need to explicitly whitelist the Authorization header in the CloudFront configuration. Note that this applies to any backend endpoint that needs the Authorization header even if you aren’t using API Gateway in particular.
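
If you do go this route, the whitelist is a small change to the cache behavior. A hedged boto3 sketch, assuming the classic ForwardedValues configuration and a hypothetical distribution ID:

```python
import boto3

cf = boto3.client("cloudfront")
dist_id = "E1ABCDEFGHIJK"  # hypothetical

resp = cf.get_distribution_config(Id=dist_id)
config, etag = resp["DistributionConfig"], resp["ETag"]

# Forward the Authorization header to the origin (API Gateway or any other
# endpoint that authenticates requests). Note this also varies the cache
# on that header.
config["DefaultCacheBehavior"]["ForwardedValues"]["Headers"] = {
    "Quantity": 1,
    "Items": ["Authorization"],
}

cf.update_distribution(Id=dist_id, DistributionConfig=config, IfMatch=etag)
```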

Where did we end up?

Now that you know what not to do, here is our final design for being resilient to failures of the API Gateway or Lambda services in a given region (note that most AWS services use other AWS services under the hood, so a failure in some other regional service could indirectly cause an outage in one of these services). Our pattern is to deploy a “regional” API Gateway in each redundant region, create health checks in each region, and then use Amazon Route 53 to route to the best region using latency routing. If a given API Gateway fails, or any underlying service including our health check endpoint fails, Route 53 (a globally distributed service) will simply route traffic to the alternate region. We use latency routing and an active-active configuration (since S3 and DynamoDB are replicated, it doesn’t matter which region the Lambda runs in). The same pattern holds if you want to run in an active-passive mode and use a failover routing policy for Route 53.
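
Here is a hedged boto3 sketch of that routing layer. The hosted zone ID, domain names, and health check path are hypothetical, and a real deployment would also attach a regional custom domain name to each API rather than CNAMEing directly to the execute-api hostnames.

```python
import uuid
import boto3

r53 = boto3.client("route53")
zone_id = "Z0000000000000"  # hypothetical hosted zone

regional_apis = {  # hypothetical regional API Gateway endpoints
    "us-east-1": "abc123.execute-api.us-east-1.amazonaws.com",
    "us-west-2": "def456.execute-api.us-west-2.amazonaws.com",
}

changes = []
for region, api_domain in regional_apis.items():
    # One health check per region, probing a cheap health endpoint.
    hc = r53.create_health_check(
        CallerReference=str(uuid.uuid4()),
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": api_domain,
            "ResourcePath": "/prod/health",  # hypothetical stage and route
            "Port": 443,
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )
    # Latency records: Route 53 answers with the lowest-latency healthy region.
    changes.append({
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "api.example.com",
            "Type": "CNAME",
            "SetIdentifier": f"api-{region}",
            "Region": region,
            "TTL": 60,
            "HealthCheckId": hc["HealthCheck"]["Id"],
            "ResourceRecords": [{"Value": api_domain}],
        },
    })

r53.change_resource_record_sets(HostedZoneId=zone_id, ChangeBatch={"Changes": changes})
```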

Note that in this pattern we’re using regional API Gateways since we want control over where our resources are, given that we are designing against a possible regional failure. However, regional API Gateways do not have CloudFront automatically deployed on your behalf (as “edge optimized” API Gateways do). Since we abandoned the original CloudFront failover pattern, we now have no CloudFront optimization (which would only provide network optimization anyway, since we don’t cache our APIs) and only a little performance optimization from latency routing. Our APIs are fast, and this is not a problem for now. However, if in the future we (or you) assess that a boost is needed, an improved pattern is to deploy a unique CloudFront distribution (with no origin groups) in front of each regional API Gateway. I don’t describe this further, since we haven’t actually tested adding CloudFront in this way just yet.

We strategically chose a serverless architecture for Cloud Pegboard for various reasons, Disaster Tolerance being just one. It’s worth taking a moment to appreciate how a simple decision gives us a benefit so significant, yet so simple, that it’s easy to overlook. By relying on AWS Lambda functions, we have no static overhead or cost related to the ability to run our compute instantly in a different region. We have no servers and images to patch, sync, store, or maintain. As long as we have our data local (which we readily solved with S3 cross-region replication and global tables for DynamoDB), then using AWS Lambda allows us to have no-cost, no-maintenance active-active or active-passive compute capacity with no warm-up time or effort.

Eyes open on Route 53 health checks

As described above, Route 53 will route to the proper region based on latency during normal operations. In some failure modes, latency will go high (or infinite) and this will cause traffic to route away from the bad region. Mostly, however, we are relying on Route 53 health checks to probe a representative endpoint in each region and to use that to determine if it’s okay to send traffic to that region. It turns out that the Route 53 metrics (used by health checks) are only available in us-east-1 (even though Route 53 itself is not a region-specific service). Additionally, while these health checks allow you to send notifications to an SNS topic of your choice, that topic must be in us-east-1. Therefore, even if the failover operates as expected, and you are relying on the notification to be aware of the event, you will not get notified if the failure happens to impair us-east-1 SNS or Route 53 metrics. There are many other ways to set up notifications for when you have experienced a failover event, so this is not a critical flaw. However, you should be aware of it so that you don’t overly rely on getting the notification in case that’s an important part of your remediation and operational workflow.
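
One workable pattern is to alarm on the health check's HealthCheckStatus metric from us-east-1; a minimal sketch (the health check ID and SNS topic ARN are hypothetical):

```python
import boto3

# Route 53 health check metrics only exist in us-east-1, and the alarm's
# SNS topic must live there as well.
cw = boto3.client("cloudwatch", region_name="us-east-1")

cw.put_metric_alarm(
    AlarmName="api-us-west-2-unhealthy",  # hypothetical
    Namespace="AWS/Route53",
    MetricName="HealthCheckStatus",
    Dimensions=[{"Name": "HealthCheckId",
                 "Value": "11111111-2222-3333-4444-555555555555"}],  # hypothetical
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",  # status below 1 means unhealthy
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:failover-alerts"],  # hypothetical
)
```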

Simple Email Service considerations

Amazon SES is certainly less commonly used compared to the other services that we’ve been discussing, but if you do use it, there are several disaster tolerance/recovery issues of which you should be aware.

UPDATE: SES is now available in 6 regions: ap-south-1, ap-southeast-2, and eu-central-1 were added after this was first written.

Prominent among these regional considerations is that SES is only available in us-east-1, us-west-2, and eu-west-1. This may affect your region choices. Alternatively, you can set it up in a region separate from your other regions and use Amazon SQS or cross-region API calls to send your outbound emails.

SES requires sending domains (or email addresses) to be verified in each region where you use SES. Not a big deal, but you don’t want to be surprised by this if you suddenly need to send to external emails from a new region. It would be easy to miss since you would only be restricted when sending emails to external email addresses; if you were sending test emails to your own domain, they would go through even if your domain had not been verified (sandbox mode). On a related topic, “moving out of the sandbox” and establishing appropriate sending limits requires a support ticket for each region that you use, as SES sandbox status and sending limits are regional attributes.
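
Repeating the verification per region is straightforward to script; a small boto3 sketch (the domain and region pair are hypothetical):

```python
import boto3

# Domain verification (like sandbox status and sending limits) is a
# per-region attribute, so repeat it in every region you might send from.
for region in ("us-east-1", "us-west-2"):  # hypothetical failover pair
    ses = boto3.client("ses", region_name=region)
    resp = ses.verify_domain_identity(Domain="example.com")
    # Publish the returned token as a TXT record at _amazonses.example.com
    # to complete verification for this region.
    print(region, resp["VerificationToken"])
```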

Finally, for high volume senders, you are likely to get deliverability issues if you suddenly ramp up traffic from an alternate region since the sending IPs will be different. This is because ISPs and mailbox providers use your “IP reputation” as part of their algorithms for determining whether to accept or throttle your traffic. In a failover scenario, if you start to send a high rate from previously low volume IP addresses, you will likely get blocked or throttled. IP reputation therefore may be a reason to ensure that your architecture balances load across your SES regions instead of using multiple SES regions in a failover mode. [Note that this last paragraph is based on my experience building high-volume email sending solutions, but not with SES. I am not 100% confident in my understanding of how AWS manages IPs in the backend of SES, and do not yet have experience using SES at high volume.]

Cognito User Pools resilience

For user identity management, early on, we selected Amazon Cognito User Pools. This got us up quickly and effectively, and like our other serverless choices, it requires no maintenance. However, we had not considered DR or DT factors during the initial design (which is one reason why this article stresses the value of doing so). When we got to the point of hopefully just tweaking the architecture for tolerance to a Cognito or region failure, only then did we recognize some challenging limitations. Amazon Cognito currently has no native support for developer-accessible backup and restore of User Pools. Additionally, there is no notion of cross-region replication or synchronization. Also, be aware that if you use AWS Amplify, Amazon Cognito User Pools is the service that provides authorization for that framework.

To address this challenge, we have an interim solution and are considering future options. At present, we have a Lambda function that is triggered on events that change the Cognito User Pool. This function keeps a shadow copy of the user pool data in a DynamoDB table (with some additional parameters that we use for other application purposes). These data can then be inserted via API into our alternate region’s user pool instance. In a disaster mode we can fail over to the alternate user pool. The Achilles heel of this approach is that we do not have access to password hashes. Therefore, in a full failover where we cannot access Cognito (as determined by client application logic, as opposed to the central routing type of solutions described for our other services), we need to force users to go through a “forgot password” flow. This is clearly not a good experience for users and not a suitable long-term solution.
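
For illustration only, here is a rough sketch of the shape of that trigger function; the environment variables, table name, shadow key, and failover pool ID are hypothetical, and this is not our exact code.

```python
import os
import boto3

# Hypothetical configuration.
SHADOW_TABLE = os.environ.get("USER_SHADOW_TABLE", "user-shadow")
FAILOVER_REGION = os.environ.get("FAILOVER_REGION", "us-west-2")
FAILOVER_POOL_ID = os.environ.get("FAILOVER_USER_POOL_ID", "us-west-2_XXXXXXXXX")

shadow = boto3.resource("dynamodb").Table(SHADOW_TABLE)
idp = boto3.client("cognito-idp", region_name=FAILOVER_REGION)

def handler(event, context):
    """Cognito PostConfirmation trigger: shadow the new user into a DynamoDB
    global table and pre-create the user in the failover pool. Password
    hashes are not accessible, so a real failover still needs a reset."""
    attrs = event["request"]["userAttributes"]

    # 1. Shadow copy; the global table replicates this to the other region.
    shadow.put_item(Item={"username": event["userName"], **attrs})

    # 2. Pre-create the user in the failover pool without sending an invite.
    idp.admin_create_user(
        UserPoolId=FAILOVER_POOL_ID,
        Username=event["userName"],
        UserAttributes=[
            {"Name": "email", "Value": attrs["email"]},
            {"Name": "email_verified", "Value": "true"},
        ],
        MessageAction="SUPPRESS",
    )
    return event  # Cognito triggers must return the event object
```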

We are clearly not satisfied with our current design. Fortunately, we are talking about disaster recovery/tolerance, and therefore there is very low probability that we’d need to engage this failure mode in the next several years. By the time that we might see a Cognito failure in us-east-1 (which might be never), I suspect that AWS will add the necessary capabilities to Cognito. If we get surprised, then we at least have something, even if odious. It does sully the elegance of the rest of the Disaster Tolerance design, but if this is the only corner of inelegance, then that does not diminish the value of the rest of the architecture, since the probability of a failure of a single service in a region is much greater than a failure of the entire region. Of course, another possibility is that if we really do hit that extremely low probability of a Cognito User Pools failure in us-east-1, we also have the option of just waiting it out: while we desire and have generally designed for a 0–5 minute RTO, our business can tolerate hours if need be.

An alternative approach that I’ll mention but not promote until we’ve actually implemented and tested it, would be to use the AWS Directory Service (also known as AWS Managed Microsoft AD) as the core of our authentication and authorization capabilities.

Conclusions

Architecting for Disaster Tolerance instead of Disaster Recovery is much more likely to result in solutions that actually work when major regional outages occur (which they will). Designing for tolerance instead of remediation plans also saves a great deal of the effort required to test and repair DR plans for years to come. By making initial strategic design decisions to use serverless technologies and applicable patterns that allow for tolerance of regional outages, for very little extra initial effort and small incremental cost, we can build solutions that are tolerant of disasters as a natural attribute of the solution, one that remains continuously in place even as infrastructure and application code evolve on a daily basis.

All that said, details matter and architectures that look good on napkins aren’t always realized as intended when specific service capabilities and limitations are taken into account. In this article we have shared our learnings on these details so that you can get more directly to workable solutions and benefit from our path finding.

Resources

Rather than linking throughout the above (seemed messy and hard to use for future reference), I’ve collected the various services mentioned here for easier reference. For AWS services, I’ve provided a link to the AWS main product page as well as a link to the Cloud Pegboard “datasheet” detail page that corresponds to each service and contains the essential information about each service on one page.

About Cloud Pegboard

Cloud Pegboard helps AWS practitioners be highly effective and efficient even in the face of the tremendous complexity and rate of change of AWS services. We want to keep AWS fun and feel more like dancing in a sprinkler on a hot summer day and less like being blasted with a firehose. We make the AWS information that you need amazingly easy to access, and with personalization features, we help you stay up to date on the AWS services and capabilities that are important to you and your specific projects.
