Disaster Recovery strategies using AWS Serverless Services

Harshavardhan Ghorpade
9 min read · Jul 4, 2020


Disaster Recovery (DR) is often an afterthought: the web service is about to reach maturity and is getting ready for release, and only then do we realize, oh! there are words like resiliency and high availability. At that point we usually end up making the expensive choice or the less performance-efficient choice.

It's always better to factor Disaster Recovery into the cloud architecture design early, and I will try to cover some of the Disaster Recovery topics shown in the following mind-map.

High Availability, Fault Tolerance, and Disaster Recovery are closely related terms; however, there are distinct differences between them. I would rather say that making a web service Highly Available or Fault Tolerant is part and parcel of the overall DR strategy for any given service.

What is Disaster Recovery in the cloud computing world?

It means your web service/application should continue to operate normally even if a cloud service, an Availability Zone, or even an entire region (which your service makes use of) goes down.

In my previous blog I explained the Batch Job Processor serverless service pattern. Let's try to create a DR strategy for that same service. Following is the service design (from the previous blog).

Let's make the above architecture Highly Available (HA)

AWS API Gateway and Lambda come with built-in capabilities for handling load and auto-scaling, and these services are designed to sustain some Availability Zone (AZ) downtime within a region. But what happens when an entire region (say US-East-1) goes down (however unlikely that scenario is)? It will definitely hit our service: the service will be down, with no access to the API Gateway URL of our front-end service, which serves the HTTP requests.

The straightforward solution is to replicate the service infrastructure into another (fail-over) region and put it behind an AWS Route 53 fail-over routing policy. The modified architecture diagram would look like this.

We have replicated the AWS resources in the DR region (US-East-2) and created a fail-over routing policy. A Route 53 health check monitors the endpoint in the primary region, and if the health check fails (say, because US-East-1 goes down), Route 53 sends traffic to the fail-over region.
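For reference, here is a minimal sketch (using boto3) of what that fail-over setup could look like. The hosted zone ID, domain names, and health-check path are placeholders, not values from the actual service.

```python
# Sketch: Route 53 failover routing across the two regional API endpoints.
# Hosted zone ID, domain names, and endpoint hostnames are placeholders.
import boto3

route53 = boto3.client("route53")

# Health check against the primary region's API endpoint.
health_check = route53.create_health_check(
    CallerReference="batch-api-primary-check",
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "api-us-east-1.example.com",
        "ResourcePath": "/health",
        "Port": 443,
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)["HealthCheck"]

def failover_record(set_id, role, target, health_check_id=None):
    record = {
        "Name": "api.example.com",
        "Type": "CNAME",
        "SetIdentifier": set_id,
        "Failover": role,  # PRIMARY or SECONDARY
        "TTL": 60,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId="ZXXXXXXXXXXXXX",
    ChangeBatch={"Changes": [
        failover_record("primary", "PRIMARY",
                        "api-us-east-1.example.com", health_check["Id"]),
        failover_record("secondary", "SECONDARY",
                        "api-us-east-2.example.com"),
    ]},
)
```

With this in place, Route 53 answers queries for the primary endpoint as long as the health check passes, and flips to the secondary record when it does not.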

At a glance, the above design does not look cost efficient, because we are directly replicating all the AWS resources into the secondary region. However, if you observe carefully, most of the services we are using are serverless. The front-end micro-service uses API Gateway + Lambda, which are completely serverless, and the scheduling service uses SNS + Lambda + SQS, which are also entirely serverless. That means we will not be charged just for provisioning those resources in the DR region; we are charged only when we use them (i.e., when the primary region goes down and fail-over happens).

There is still some cost associated with this design, because the back-end service uses a cluster of EC2 instances, which is not serverless, and if we keep those instances running idle in the DR region we will pay for them. We can easily improve this by automatically launching the back-end EC2 instances only when there is a message in the queue in US-East-2. We can do that with a Lambda trigger: whenever a message arrives in the queue, it invokes a Lambda function that checks whether the back-end service is up and running and, if not, spins up the required EC2 instances. Obviously this approach introduces some latency into the processing time in the DR region because of the EC2 instance startup time. I will talk later about how to improve on this cold-start latency.
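A minimal sketch of such a wake-up function is shown below; the Role tag, region, and instance layout are my own illustrative assumptions. One note: a Lambda wired directly to the SQS queue consumes the messages it is invoked with, so in practice you may prefer to drive this function from a CloudWatch alarm on the queue depth, leaving the messages in the queue for the back-end workers.

```python
# Hypothetical "wake up the back-end" function for the DR region (US-East-2).
# The Role tag, region, and overall wiring are illustrative assumptions.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

def lambda_handler(event, context):
    # Look for back-end worker instances that are currently stopped.
    response = ec2.describe_instances(Filters=[
        {"Name": "tag:Role", "Values": ["batch-backend"]},
        {"Name": "instance-state-name", "Values": ["stopped"]},
    ])
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        # Cold start: bring the workers online; they begin polling the
        # DR-region queue once they finish booting.
        ec2.start_instances(InstanceIds=instance_ids)
    return {"started": instance_ids}
```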

Now what happens to jobs that are in progress? They will definitely fail, because we are not handling them in the DR region. Let's handle the ongoing jobs properly and make our architecture more reliable.

Improve the Reliability of the above Design

Although I have not shown it in the architecture diagram, a database is needed to track the submitted batch jobs. The best choice in this case is a fast, serverless NoSQL database, and AWS DynamoDB is the answer. We can create the DynamoDB tables in the primary region and enable the DynamoDB Global Tables (multi-region replication) feature, so that all the batch job data gets copied automatically to the table replica in our DR region. Now suppose we have job records that were in progress in the primary region when that region went down. We need some piece of code that goes through the in-progress jobs in the DB table and relaunches them in the DR region (or in the primary region once it comes back up). And the favorite choice for writing this piece of code is AWS Lambda 🙂
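A rough sketch of that relaunch function could look like the following; the table name, status attribute, key name, and queue URL are assumptions for illustration only.

```python
# Sketch of the "replay in-progress jobs" function in the DR region.
# Table name, status attribute, jobId key, and queue URL are placeholders.
import json
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb", region_name="us-east-2")
sqs = boto3.client("sqs", region_name="us-east-2")

TABLE_NAME = "BatchJobs"
QUEUE_URL = "https://sqs.us-east-2.amazonaws.com/123456789012/batch-jobs"

def lambda_handler(event, context):
    table = dynamodb.Table(TABLE_NAME)
    relaunched = 0
    scan_kwargs = {"FilterExpression": Attr("status").eq("IN_PROGRESS")}
    while True:
        page = table.scan(**scan_kwargs)
        for job in page["Items"]:
            # Re-enqueue the job in the DR region's queue so the back-end
            # workers pick it up again from the replicated job record.
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps({"jobId": job["jobId"]}),
            )
            relaunched += 1
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    return {"relaunched": relaunched}
```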

This way we make sure that submitted jobs are processed even in the disastrous situation of a region outage, which improves the overall reliability of our service.

Fault Tolerance

There are multiple flavors of fault tolerance. Up to now we have successfully tolerated a region failure by diverting traffic to a passive region (US-East-2 in this case) and kept our service operating even when our primary region (US-East-1) goes down. Note that the fail-over switch happens only when the endpoint in the primary region is not reachable. But what happens if one of the AWS services used internally (behind the front-end service) fails? The answer in that case is: our service will fail to serve the request. This is not an easy problem to solve; we need to handle individual service failures separately. Let's start with Availability Zone (AZ) failures.

Availability Zone (AZ) failures:

Most of the AWS service components we are using are serverless, so as consumers of AWS we do not need to worry about AZ failures; those are taken care of by AWS. Serverless offerings like SQS, SNS, etc. are replicated across multiple Availability Zones within a region to mitigate the failure of individual AZs. However, that is not the case with EC2 instances, which are not a serverless offering; we have to manage AZ failures ourselves. The easy solution is to replicate the EC2 instances (for the back-end service) across multiple AZs in a given region. For example, say our service needs two EC2 instances running at any point in time; then we need to keep at least four EC2 instances, two in each AZ, to mitigate the failure of a single AZ. We are already operating in two regions to mitigate region failure, so we need to replicate the same structure in our fail-over region, which leaves us with eight running EC2 instances (as shown below).
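As a sketch, the per-region layout could be expressed as an Auto Scaling group pinned to four instances across two AZs; the group name, launch template, and subnet IDs below are placeholders.

```python
# Sketch: two back-end instances per AZ (four per region) via an
# Auto Scaling group spanning two Availability Zones.
# Names, subnet IDs, and the launch template are placeholder assumptions.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="batch-backend-us-east-1",
    LaunchTemplate={"LaunchTemplateName": "batch-backend", "Version": "$Latest"},
    MinSize=4,
    MaxSize=4,
    DesiredCapacity=4,
    # Two subnets in different AZs; the group spreads instances across them.
    VPCZoneIdentifier="subnet-0aaa1111,subnet-0bbb2222",
)
```

Replicated into the fail-over region, that is eight always-on instances.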

Eight always-on instances look pretty bad, don't they? Yes, this design is not at all cost efficient: we are keeping at least six EC2 instances running idle all the time, just waiting for a disaster to occur. There are multiple ways to solve this problem, but I believe containerization of the back-end service is the most appropriate one. Once we containerize the back-end service, it becomes easy to launch it on demand, which lets us use AWS serverless compute for containers, such as Fargate, and eliminate the idle-time issue. There are multiple ways to deploy containerized services (Fargate, a Kubernetes cluster, etc.), but that is a big topic and maybe I will cover it in a separate post. For now we will use AWS Fargate to launch the back-end service on demand. The final architecture diagram with the Fargate changes is shown below. Note: the grayed-out Fargate icons in the passive region denote Fargate tasks that are not present (not running idle) but will be launched when needed.
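As a sketch of that on-demand launch, the wake-up Lambda from earlier could call ECS run_task instead of starting EC2 instances; the cluster, task definition, subnets, and security group below are assumed names.

```python
# Sketch: launching the containerized back-end on demand with Fargate.
# Cluster, task definition, subnets, and security group are assumed names.
import boto3

ecs = boto3.client("ecs", region_name="us-east-2")

def launch_backend_task():
    return ecs.run_task(
        cluster="batch-backend",
        launchType="FARGATE",
        taskDefinition="batch-worker:1",  # containerized back-end service
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0aaa1111", "subnet-0bbb2222"],
                "securityGroups": ["sg-0123456789abcdef0"],
                "assignPublicIp": "DISABLED",
            }
        },
    )
```

Because the task runs only while there is work to do, nothing sits idle in the passive region.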

Individual Service Failures:

This is more challenging than the above two (region and AZ) failure scenarios, as it certainly needs some logic to handle internal service failures. Let's assume the front-end service (Lambda) is not able to send a request to the scheduling service due to unavailability of SNS. In such scenarios a programmatic retry mechanism is one option, but it only handles temporary outages: for example, if an SNS publish call fails, we can write retry logic that waits for some time and tries to publish the same message again, with a bounded number of attempts. For longer outages we need a different strategy. We can think of a more sophisticated solution that gives such requests (the ones that failed due to internal service outages) a distinct state like “unprocessed” or “paused”, so that a kind of watchdog process can periodically check for “unprocessed” requests and resume their processing. Design patterns for retry mechanisms in distributed micro-service systems might be another topic for a detailed discussion.
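Here is a minimal sketch of such a retry-then-park approach; the topic ARN, attempt count, and back-off values are illustrative choices, and mark_unprocessed is a hypothetical hook into the job table.

```python
# Sketch: retry an SNS publish with exponential backoff, then hand the
# request over to a watchdog by marking it "unprocessed".
# Topic ARN and retry/backoff numbers are illustrative assumptions.
import time
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:batch-job-requests"

def publish_with_retry(message, max_attempts=5, base_delay=0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            return sns.publish(TopicArn=TOPIC_ARN, Message=message)
        except (ClientError, EndpointConnectionError):
            if attempt == max_attempts:
                # Longer outage: record the request as "unprocessed" so a
                # scheduled watchdog can resume it, instead of failing hard.
                mark_unprocessed(message)
                raise
            # Exponential backoff: 0.5s, 1s, 2s, 4s, ...
            time.sleep(base_delay * (2 ** (attempt - 1)))

def mark_unprocessed(message):
    # Hypothetical hook: e.g. write the request to the DynamoDB job table
    # with status "unprocessed" for the watchdog to pick up later.
    pass
```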

Cost Effective Disaster Recovery Solution

By now you must have observed that we have thoughtfully converted our entire solution to serverless. Just configuring API Gateway + Lambda + SNS + SQS and Fargate in our passive region won't get us any bill from Amazon; these services are charged only when we use them, which is essentially only in disastrous situations. So we can fairly and confidently say that our system design is pretty cost efficient. Obviously we can always improve on cost, as that is an ongoing process; with new services and options from AWS there will always be a new or better way to do the same thing.

Implementation and Testing of Disaster Recovery Solution

Implementation will mostly differ from service to service and depend on the situation. As we discussed, most of the changes we need are configuration or service-choice changes, so we hardly need to do anything programmatic except for the retry mechanism discussed above. So there is not much more to say about the implementation aspect of Disaster Recovery.

Designing and implementing a fault-tolerant architecture is not enough. We have to constantly test our ability to actually survive these “once in a blue moon” failures, and the best way to test a Disaster Recovery solution is to introduce dependency failures, as well as node, rack, data-center/Availability Zone, and even region failures. Here are some of the practices.

1] Planned Game Days: we simulate region or AZ failures and then check how our system responds to such events: how the DR solution kicks in and how much delay it introduces in serving requests (compared with normal working conditions). A small sketch of one way to force the fail-over is shown after this list.

2] Introducing Unplanned/Random Failures: Chaos Monkey is one such practice, introduced by Netflix, which randomly disables production instances to make sure the system survives this common type of failure without any customer impact. I would recommend this as a more sophisticated strategy for when your service reaches a maturity level where you are confident enough to play with your production environment.
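As a small sketch of a game-day drill for the fail-over path from earlier: inverting the primary Route 53 health check makes a healthy endpoint count as unhealthy, which forces the fail-over routing without touching any real infrastructure. The health-check ID below is a placeholder.

```python
# Sketch: force a fail-over drill by inverting the primary health check,
# then restore it once measurements are done. The ID is a placeholder.
import boto3

route53 = boto3.client("route53")
PRIMARY_HEALTH_CHECK_ID = "abcdef01-2345-6789-abcd-ef0123456789"

def simulate_primary_region_failure(failed=True):
    # Inverted=True makes a healthy endpoint report as unhealthy,
    # triggering the fail-over routing policy.
    route53.update_health_check(
        HealthCheckId=PRIMARY_HEALTH_CHECK_ID,
        Inverted=failed,
    )

# Start the drill ...
simulate_primary_region_failure(True)
# ... run test traffic, record response times, verify jobs complete,
# then restore normal routing.
simulate_primary_region_failure(False)
```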

Conclusion

In the end, cloud technology is all about redundancy and fault tolerance. Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. In effect, we have to be stronger than our weakest link!
