How to build a custom Disaster Recovery Process for AWS applications
And whether it is worth it
Preamble
Have you ever considered a scenario in which you lose half of the information that you stored in a cloud folder for your company’s new project because someone mistyped a command?¹ And what about losing the last 2 hours of your team’s chat history because of a lightning strike 100 miles away from you?²
Users rarely think about outages in the cloud scenario. Yet, like our home computers, servers can suffer from problems like power grid failure, software issues, or human errors.
[Figure: number of AWS outages reported in a single day]
Lucky for us, these issues don’t always affect the end-user application. However, outages can sometimes propagate to multiple services and make the cloud provider unstable, creating a disaster scenario.
The objective of this post is to introduce you to Disaster Recovery (DR) processes and discuss cost-effective implementations, both native and custom.
Introduction
A DR process consists of two aspects: data redundancy and service high availability.
The methods to avoid data loss vary depending on the amount of data that you can afford to lose. There are two main strategies: asynchronous and synchronous replication.
Asynchronous replication can happen anytime. We usually set a timer to update a replica with newly received data. If the gap between updates is too big, you will be more susceptible to data loss.
Synchronous replication keeps the data up to date at all times. We usually do this through methods that replicate the operations as they happen in the source.
Synchronous replication is usually much more expensive, and if you don’t mind losing some data, it’s better to stay away from it when possible.
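To make the difference concrete, here is a toy, in-memory sketch of both strategies. The `ReplicatedStore` class and its method names are hypothetical and only illustrate the idea; no AWS service is involved.

```python
class ReplicatedStore:
    """Toy key-value store with a single replica, for illustration only."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes not yet shipped to the replica

    def write(self, key, value):
        self.primary[key] = value
        if self.synchronous:
            # Synchronous: the replica is updated as part of the write itself.
            self.replica[key] = value
        else:
            # Asynchronous: the write waits for the next scheduled sync.
            self.pending.append((key, value))

    def sync(self):
        """Ship queued writes to the replica (what the timer would trigger)."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def lost_if_primary_fails_now(self):
        """Keys whose latest value exists only on the primary."""
        return sorted(
            key for key, value in self.primary.items()
            if self.replica.get(key) != value
        )
```

In the asynchronous mode, everything written since the last `sync()` call is what you would lose in a disaster, which is why the gap between updates matters so much.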
Talking about losing data, in the DR world, you will see two acronyms, Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
- RPO indicates how much data we can lose. If you have an RPO of 12 hours, it means that the maximum gap between the last backup and the last usable state is 12 hours.
- RTO represents how long it takes for the service to be up and running again. For example, if you have an RTO of 4 hours, the maximum gap between the disaster event and the recovered state is 4 hours.
These values will vary depending on the application use cases and the business needs. We won’t get too much into details about this process because it’s out of our scope.
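As a quick worked example of the two definitions, the sketch below checks a single incident against an RPO and an RTO. The function name and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup, disaster, recovered, rpo, rto):
    """Compare one incident against the RPO and RTO targets.

    The data loss window is the time between the last backup and the
    disaster; the downtime is the time between the disaster and recovery.
    """
    data_loss_window = disaster - last_backup
    downtime = recovered - disaster
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical incident: backup at 02:00, disaster at 11:30, back up at 16:00.
report = meets_objectives(
    last_backup=datetime(2021, 3, 9, 2, 0),
    disaster=datetime(2021, 3, 9, 11, 30),
    recovered=datetime(2021, 3, 9, 16, 0),
    rpo=timedelta(hours=12),
    rto=timedelta(hours=4),
)
```

In this made-up incident, 9.5 hours of data are lost, which is within the 12-hour RPO, but the service is down for 4.5 hours, which misses the 4-hour RTO.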
Service high availability is also an enormous challenge. The most common strategy is to move services to more stable regions when a disaster occurs. This requires a high degree of automation in the deployment process.
Cloud services have been providing solutions for the problems above for some time now. They usually build these solutions to be generic and fit most of the applications (as expected), making compromises in the process. These compromises usually involve expensive methods, like synchronous replication.
Take DynamoDB Global Tables⁴, for example: you pay for every replica as if it were another normal instance. On top of that, replicated write operations cost 50% more than usual.
These solutions often lack flexibility and have many requirements. That’s why we are going to focus on creating a custom solution.
Context
Since it’s extremely complicated to investigate cloud-agnostic solutions for this problem, we will be focusing on AWS.
Among the hundreds of services available in AWS, we chose to investigate S3, RDS, and DynamoDB because of their popularity.
To deal with deployment and infrastructure automation, Jenkins and Terraform are pretty common and will be our choice here.
Strategies
The plan to implement a custom DR solution involves three separate steps:
- Improve automation
- Implement service replication
- Implement data redundancy
DR processes usually use a primary and a secondary region. It’s a good idea to choose physically distant regions to reduce the chance of both regions being affected by outages at the same time.
To be effective, your team must ensure that the services are as region-independent as possible. That’s why we highly recommend the use of an infrastructure-as-code tool like Terraform to make this task easier.
Deployment automation
It’s a good practice to have decoupled services, but things don’t usually go as planned. Unfortunately, service dependency can make deployment automation harder, mostly because you have to invest in fail-safe mechanisms for every service.
In our experience, a good way to reduce the impact of multiple dependencies is to use Jenkins to organize services into groups, from low to high level, making sure that each service can only depend on lower-level ones.
This strategy can reduce the need for manual intervention in the build and release process, making it easier to implement an efficient service replication method.
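One way to derive such groups is to compute a level for each service from its dependency graph. The sketch below is plain Python and does not talk to Jenkins; the service names are hypothetical, and mapping each group to a Jenkins stage or job is an assumption about how you might use the result.

```python
def assign_levels(dependencies):
    """Map each service to a level: 0 for services without dependencies,
    otherwise one more than the highest level among its dependencies."""
    levels = {}
    visiting = set()

    def level_of(service):
        if service in levels:
            return levels[service]
        if service in visiting:
            raise ValueError(f"dependency cycle involving {service!r}")
        visiting.add(service)
        deps = dependencies.get(service, ())
        levels[service] = 1 + max((level_of(dep) for dep in deps), default=-1)
        visiting.discard(service)
        return levels[service]

    for service in dependencies:
        level_of(service)
    return levels


def deployment_groups(dependencies):
    """Return services grouped by level, lowest level first."""
    groups = {}
    for service, level in assign_levels(dependencies).items():
        groups.setdefault(level, []).append(service)
    return [sorted(groups[level]) for level in sorted(groups)]


# Hypothetical services: each one lists the services it depends on.
services = {
    "database-migrations": set(),
    "auth": {"database-migrations"},
    "orders": {"auth", "database-migrations"},
    "billing": {"auth"},
    "web": {"orders", "auth"},
}
groups = deployment_groups(services)
```

Services in the same group have no dependencies on each other, so they can be deployed in parallel, while the groups themselves run in order. A cycle in the graph raises an error, which is a useful signal that the services are not layered yet.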
Service Replication
The plan in the service replication aspect is to make it easy and seamless to move between regions. This means that services need to run and behave in the same way in any region without too much hassle. This provides the ability to switch between regions to escape instability.
To achieve this, the first step is to make sure that we share as much infra-code between regions as possible. We ended up creating a separate Terraform module to keep most of the common configuration and vars (usually a main.tf and a tfvars file).
To handle the actual switch between regions, you can use Route 53 together with what we call an inter-region router to redirect requests to the correct region. It’s also important to make sure that the secondary region is up and running before proceeding.
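The DNS part of the switch could look roughly like the sketch below. It only covers repointing a Route 53 record and is not the inter-region router itself. The record type, TTL, function names, and the `is_healthy` callback are assumptions; `change_resource_record_sets` is the Boto3 Route 53 call, and the client is passed in so the logic can be tried without AWS credentials.

```python
def build_failover_change(record_name, target_dns_name, ttl=60):
    """Build a Route 53 change batch that repoints a CNAME record."""
    return {
        "Comment": "DR switch to the secondary region",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target_dns_name}],
                },
            }
        ],
    }


def switch_region(route53, hosted_zone_id, record_name, target_dns_name, is_healthy):
    """Repoint the record only if the target region answers the health check.

    `route53` is expected to be a boto3 client, e.g. boto3.client("route53").
    `is_healthy` is a hypothetical callable that takes the target DNS name
    and returns True when the secondary region is ready to serve traffic.
    """
    if not is_healthy(target_dns_name):
        raise RuntimeError(f"{target_dns_name} is not healthy, refusing to switch")
    return route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=build_failover_change(record_name, target_dns_name),
    )
```

A short TTL keeps the switch fast, at the price of more DNS queries. Route 53 also offers failover routing policies with health checks, which may be a better fit if you prefer a managed approach.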
Data Replication
Services have a very specific life cycle: they are brought up, perform tasks, and then cease to exist. Data is different, and we have to propagate changes to other regions frequently to keep the RPO down.
The objective here is to create backups at a reasonable rate without causing the AWS billing to skyrocket.
As we mentioned before, our focus here will be S3, RDS, and DynamoDB. Each of them handles a different type of information and needs different methods to ensure that data loss is below the threshold.
S3
The cross-region replication tool for S3 has three major issues:
- Costs
- Requires versioning enabled
- Doesn’t replicate objects retroactively
The last one is still an open issue, but you can get around it using a copy-in-place operation on pre-existing objects or sending a request to AWS Support.
For the cost and versioning problems, we can create our own replication tool that uses S3 events to trigger a Lambda function, which copies new data to another bucket.
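A minimal sketch of such a function, assuming Python with Boto3, is shown here. It is not the original implementation: the environment variable names `DEST_BUCKET` and `DEST_REGION` and the function names are hypothetical. The S3 client is passed in to the copy logic so it can be tried without AWS access.

```python
import os
from urllib.parse import unquote_plus


def objects_in_event(event):
    """Yield (bucket, key) for every object-created record in an S3 event."""
    for record in event.get("Records", []):
        if record.get("eventName", "").startswith("ObjectCreated"):
            s3_info = record["s3"]
            # Object keys arrive URL-encoded in S3 event notifications.
            yield s3_info["bucket"]["name"], unquote_plus(s3_info["object"]["key"])


def replicate_objects(event, s3, dest_bucket):
    """Copy every newly created object to the destination bucket."""
    copied = []
    for bucket, key in objects_in_event(event):
        s3.copy_object(
            Bucket=dest_bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        copied.append(key)
    return copied


def handler(event, context):
    """Lambda entry point; the environment variable names are assumptions."""
    import boto3  # imported here so the helpers above work without boto3

    s3 = boto3.client("s3", region_name=os.environ["DEST_REGION"])
    return replicate_objects(event, s3, os.environ["DEST_BUCKET"])
```

Keep in mind that `copy_object` handles objects up to 5 GB in a single call, so larger objects would need a multipart copy. Deletions are also ignored here, so you would need to decide whether removed objects should disappear from the backup bucket as well.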
But how can this compare to the default implementation when it comes to cost and ease of use?
For cross-region replication, you pay for:
- Destination bucket storage
- Primary copy
- Replication PUT requests
- Storage retrieval
- Inter-region Data Transfer OUT from S3 to the destination region
Using the default method you also pay for:
- Replication Time Control Data Transfer fee
- S3 Replication Metrics charges (billed as three CloudWatch custom metrics per rule)
To better understand these costs, let’s create a hypothetical scenario:
- Average of 50 KB of new data per PUT request
- Number of requests per month varying from 1,000 to more than a billion
- Each request takes 200 ms on average
We are going to compare only the costs related to the DR solution, not the common ones (like requests, storage, and so on), so keep that in mind.
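The Lambda part of the custom solution's bill can be estimated with a small model like the one below. This is a sketch, not the calculation behind our numbers: the function names are hypothetical, the memory size is something you have to choose yourself, and the prices are parameters on purpose, so take them from the current AWS pricing pages.

```python
def lambda_monthly_cost(requests, avg_duration_ms, memory_mb,
                        price_per_million_requests, price_per_gb_second):
    """Estimate the monthly Lambda bill for the replication function.

    Lambda charges per request plus per GB-second of execution time,
    where GB-seconds = duration in seconds * configured memory in GB.
    """
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    request_cost = requests / 1_000_000 * price_per_million_requests
    return request_cost + gb_seconds * price_per_gb_second


def replicated_data_gb(requests, avg_object_kb):
    """Amount of data copied to the secondary region per month, in GB."""
    return requests * avg_object_kb / (1024 * 1024)
```

For example, one million requests per month at 200 ms each with a 128 MB function add up to 25,000 GB-seconds, and at 50 KB per request they move a little under 48 GB to the secondary region.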
The custom solution is cheaper in every scenario we compared. We are not even counting the savings from getting rid of versioning, so you can expect a bigger gap in practice.
The ease of use is a completely different story. AWS’s default replication method is almost a one-click solution.
RDS
If you use Aurora, Amazon provides a cross-region solution called Global Database, which automatically creates one or more copies of the original instance and handles the failover.
The default option for MySQL, Oracle, SQL Server, and PostgreSQL is to use read replicas.
These replicas replicate every operation as it happens in the source. This behavior is useful if you need a small RPO, but it can more than double the costs.
If a slightly larger RPO is not an issue, you are better off with a custom solution, like leveraging Lambda functions with CloudWatch Events to create incremental backups.
The implementation looks a lot like the custom S3 replication script, as we use Python with Boto3 to create, copy and delete snapshots.
The actual trigger for the copy lives in the secondary region, which guarantees that the region is up before copying data. We can also log information and start another procedure if the snapshot fails because the primary region is down.
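The copy and cleanup steps could be sketched as follows. This is an illustration under assumptions, not our exact script: the function names, the `dr-` prefix, and the retention count are hypothetical, while `describe_db_snapshots`, `copy_db_snapshot`, and `delete_db_snapshot` are Boto3 RDS client calls. Both clients are passed in, with `source_rds` created for the primary region and `target_rds` for the secondary one.

```python
def latest_snapshot(snapshots):
    """Return the most recent snapshot that is ready to be copied, if any."""
    available = [s for s in snapshots if s.get("Status") == "available"]
    if not available:
        return None
    return max(available, key=lambda s: s["SnapshotCreateTime"])


def copy_latest_snapshot(source_rds, target_rds, instance_id, source_region):
    """Copy the newest automated snapshot into the secondary region."""
    response = source_rds.describe_db_snapshots(
        DBInstanceIdentifier=instance_id, SnapshotType="automated"
    )
    snapshot = latest_snapshot(response["DBSnapshots"])
    if snapshot is None:
        raise RuntimeError("no available snapshot found in the primary region")
    # Automated snapshot names contain ':', which manual copies cannot use.
    target_id = "dr-" + snapshot["DBSnapshotIdentifier"].replace(":", "-")
    target_rds.copy_db_snapshot(
        SourceDBSnapshotIdentifier=snapshot["DBSnapshotArn"],
        TargetDBSnapshotIdentifier=target_id,
        SourceRegion=source_region,
    )
    return target_id


def prune_copies(target_rds, instance_id, keep=3):
    """Delete all but the newest `keep` copied snapshots in the secondary region."""
    response = target_rds.describe_db_snapshots(
        DBInstanceIdentifier=instance_id, SnapshotType="manual"
    )
    copies = [s for s in response["DBSnapshots"] if s.get("Status") == "available"]
    copies.sort(key=lambda s: s["SnapshotCreateTime"], reverse=True)
    deleted = []
    for old in copies[keep:]:
        target_rds.delete_db_snapshot(DBSnapshotIdentifier=old["DBSnapshotIdentifier"])
        deleted.append(old["DBSnapshotIdentifier"])
    return deleted
```

In a Lambda handler you would create the clients with `boto3.client("rds", region_name=...)` and wrap `copy_latest_snapshot` in a `try` block, so that a failure to reach the primary region can be logged and used to start the follow-up procedure. Pagination is left out for brevity, and encrypted snapshots need a KMS key from the destination region passed to the copy call.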
The major disadvantage of this solution is a larger RPO.
You may also have heard of Multi-AZ deployment. To clear up the confusion, it’s important to understand that AZ comes from availability zones, which are isolated locations within a region. They’re useful for redundancy but can’t protect you from disasters that impact an entire region.
DynamoDB
If data safety is a must, the native solution is your go-to option. It provides real-time replication using global tables. However, the costs grow fast if you plan to use more than one region to replicate your data.
To cut down on costs, we tried to apply the same Lambda and snapshot strategy. However, AWS only provides full snapshots, making it expensive to create, transfer and store regular backups because of the size.
Another option is to leverage Lambdas and Streams to replicate every mutation. While much more effective, it’s also more complicated and expensive, due to its synchronous nature.
With this, you can replicate directly to your secondary DynamoDB table, maintaining an RPO on par with the native solution. It also gives us the possibility to store this data somewhere else or even filter some information, which could help with costs.
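A minimal sketch of the Streams approach is shown here, assuming the stream is configured to include new images. The function names and the `REPLICA_TABLE` and `REPLICA_REGION` environment variables are hypothetical. The low-level DynamoDB client accepts the same attribute-value format that stream records carry, so items can be forwarded as they are.

```python
import os


def apply_stream_record(record, dynamodb, replica_table):
    """Replay a single DynamoDB Streams record against the replica table."""
    event_name = record["eventName"]
    change = record["dynamodb"]
    if event_name in ("INSERT", "MODIFY"):
        # NewImage holds the full item after the write.
        dynamodb.put_item(TableName=replica_table, Item=change["NewImage"])
    elif event_name == "REMOVE":
        dynamodb.delete_item(TableName=replica_table, Key=change["Keys"])
    else:
        raise ValueError(f"unexpected event name: {event_name!r}")


def replicate_mutations(event, dynamodb, replica_table):
    """Apply every record of a stream batch in order; return how many."""
    records = event.get("Records", [])
    for record in records:
        apply_stream_record(record, dynamodb, replica_table)
    return len(records)


def handler(event, context):
    """Lambda entry point; the environment variable names are assumptions."""
    import boto3  # imported here so the helpers above work without boto3

    dynamodb = boto3.client("dynamodb", region_name=os.environ["REPLICA_REGION"])
    return replicate_mutations(event, dynamodb, os.environ["REPLICA_TABLE"])
```

Filtering is a matter of skipping records or dropping attributes inside `apply_stream_record` before the write, and sending the data somewhere else only means swapping the two client calls.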
As with S3, the cost comparison only covers DR-related costs.
This solution is cheaper, but the difference in cost depends heavily on the amount of data that needs to be copied and on how long the Lambda function runs. In this case, we are considering a run time of 100 ms and 50 KB of data, which seems reasonable.
Ease of use is again a negative point: global tables are by far the easiest solution available. With them, it’s also much easier to add more replicas in different regions if needed.
Result
This post shows that we can cut down on costs by creating custom solutions. But the question that all this investigation raised is: is it worth increasing complexity in exchange for lower costs?
And what about RPO and RTO? The RDS custom solution is the only one that has a larger RPO than its native counterpart.
To be fair, AWS doesn’t even provide an effective RDS DR solution for non-Aurora engines, since Multi-AZ doesn’t support cross-region replication (and probably never will), making it hard to justify its use in a disaster recovery process⁷.
As a final conclusion, it’s really important to ask yourself if these costs actually represent a large percentage of your monthly bill and also if you have billions of writes in your data storage every month. If the answers to these questions are negative, you probably should stick to the native solutions.
Key Takeaways
- You can reduce cost using alternative methods
- Alternative methods are more complex than the default solutions
- Asynchronous data replication is usually complicated and costly because of AWS quirks
Things to consider
We dealt with moving services and data between regions and falling back to the secondary region. However, we still need to return to the primary region when possible.
It’s also important to think about data validation, reconciliation, automation level, RTO, and many other aspects.
Acknowledgment
Special thanks to Frankson Teotonho de Sousa, Marcio Luis Miranda de Macedo, Vitor Branco de Miranda, Marcelo Russi Mergulhão, Isac Sacchi Souza, João Augusto Caleffi, and Thalles Santos Silva for their insights and contributions.
This piece was written by João Pedro São Gregorio Silva from the Innovation Team at Daitan.
References
- [1] “AWS Outage that Broke the Internet Caused by Mistyped Command ….” 2 Mar. 2017, https://www.datacenterknowledge.com/archives/2017/03/02/aws-outage-that-broke-the-internet-caused-by-mistyped-command. Accessed 9 Mar. 2021.
- [2] “Microsoft Explains Sept. 4 Service Outage at … — Redmondmag.com.” 11 Sep. 2018, https://redmondmag.com/articles/2018/09/11/microsoft-explains-sept-4-outage.aspx. Accessed 9 Mar. 2021.
- [3] “AWS live status. Problems and outages for Amazon Web Services ….” https://downdetector.com/status/aws-amazon-web-services/. Accessed 9 Mar. 2021.
- [4] “Amazon DynamoDB global tables replicate your Amazon … — AWS.” https://aws.amazon.com/dynamodb/global-tables/. Accessed 12 Mar. 2021.
- [5] “Replicating objects — Amazon Simple Storage Service — AWS ….” https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html. Accessed 4 Mar. 2021.
- [6] “Working with read replicas — Amazon Relational Database Service.” https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.html. Accessed 5 Mar. 2021.
- [7] “Implementing a disaster recovery strategy with Amazon RDS | AWS ….” https://aws.amazon.com/blogs/database/implementing-a-disaster-recovery-strategy-with-amazon-rds/. Accessed 12 Mar. 2021.
- [8] “Giant Hurricane Space” https://unsplash.com/photos/5477L9Z5eqI. Accessed 19 Mar. 2021.