Day 5 of the 30-Day Cloud Challenge

Mar 6, 2024

Challenge: Set up a disaster recovery plan and illustrate how to conduct a failover test to ensure resilience.

Hey everyone!

Back at it again!

First off, let’s talk about why a disaster recovery plan (DRP) is crucial. Think of it like insurance for your organization’s IT infrastructure. Just as you wouldn’t drive a car without insurance, you shouldn’t operate without a DRP. It’s all about being prepared for the “what ifs” and ensuring that you can bounce back from any disasters or disruptions.

A disaster recovery plan is essentially a playbook that outlines how your organization will respond to and recover from catastrophic events. Its main goal is to minimize downtime and data loss, ensuring that critical business functions can be restored swiftly. It’s important to have a DRP for every part of your cloud infrastructure that supports business-critical functions (storage, compute, database, networking).

Now here are the common DR strategies:

Backup and Restore: This involves regularly backing up your data and infrastructure configuration to a separate location. If a failure occurs, you restore from these backups. While this approach is the most cost-effective, it typically has the highest Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
Pilot Light: In this strategy, a minimal version of your environment is always running in the cloud: your data is kept replicated and the core components are in place, while everything else stays switched off until it's needed. This gives a lower RPO and RTO than backup and restore.
Warm Standby: Here, a full system is up and running, but at minimum size. In the event of a failure or disaster, you scale the system up to handle production load.
Multi-Site: This strategy involves having a full, production-scale system running in a second location, ready to take over in case of a failure. It offers the lowest RTO and RPO but is also the most expensive option.

Here’s how to develop a DRP:

1. Define RTO and RPO

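Recovery Time Objective (RTO) is the maximum acceptable downtime for your service.

Recovery Point Objective (RPO) is the maximum acceptable data loss, measured as the amount of time between your last recoverable copy of the data and the disruption.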

If you need a low RTO (under 15 minutes), you can't afford to rebuild your infrastructure from scratch, so backup and restore is out. You'll need something that can build and deploy your infrastructure quickly.
A low RPO means losing as little data as possible, which matters when data loss can lead to financial losses, regulatory compliance issues, or threats to public safety.
Fields that typically require low RTO and low RPO include healthcare, government, and financial services.
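To make the trade-off concrete, here's a minimal Python sketch that picks the cheapest strategy that still meets your RTO and RPO targets. The numbers attached to each strategy are illustrative placeholders I made up for the example, not AWS figures, and the names `Strategy`, `STRATEGIES`, and `cheapest_strategy` are hypothetical. Swap in values you've measured in your own environment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float   # assumed typical time to restore service
    rpo_minutes: float   # assumed typical window of lost data
    relative_cost: int   # 1 = cheapest, 4 = most expensive


# Placeholder values for illustration only; measure your own in a failover test.
STRATEGIES = [
    Strategy("Backup and Restore", rto_minutes=24 * 60, rpo_minutes=24 * 60, relative_cost=1),
    Strategy("Pilot Light", rto_minutes=60, rpo_minutes=15, relative_cost=2),
    Strategy("Warm Standby", rto_minutes=15, rpo_minutes=5, relative_cost=3),
    Strategy("Multi-Site", rto_minutes=1, rpo_minutes=1, relative_cost=4),
]


def cheapest_strategy(target_rto_minutes: float, target_rpo_minutes: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy that meets both targets, or None if none does."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= target_rto_minutes and s.rpo_minutes <= target_rpo_minutes
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.relative_cost)


choice = cheapest_strategy(target_rto_minutes=15, target_rpo_minutes=15)
print(choice.name if choice else "No strategy meets these targets")
```

With these placeholder numbers, a 15-minute RTO rules out both backup and restore and pilot light, which matches the point above: the tighter your targets, the more you pay.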

You should have a DRP for every department that relies on cloud infrastructure (HR, finance, marketing, etc.). On AWS, a DRP usually needs to cover four core service categories. Here’s how a DRP could look for each one:

Storage: For a low Recovery Time Objective (RTO) where you need immediate access to your data, S3 Standard is ideal. If you can tolerate a higher RTO, consider S3 Glacier Deep Archive, which is cheaper but takes hours to retrieve data.
Compute: If you can tolerate a high Recovery Point Objective (RPO) and RTO, you can keep a pool of instance snapshots, manually recreate instances from them, and deploy in a single Availability Zone (AZ). This essentially follows a backup and restore approach. For low RPO and RTO, put your instances in an Auto Scaling group behind an Elastic Load Balancer so failed instances are replaced automatically, deploy them across multiple AZs, and replicate data continuously to prevent data loss (Step Functions can help orchestrate the recovery steps).
Database: If a high RTO and RPO are acceptable, deploying in a single AZ or region and manually backing up the database to S3 is suitable. For low RTO and RPO, deploy across multiple AZs, enable read replicas, automate frequent snapshots, enable point-in-time recovery (PITR), use AWS Backup, and use cloning to create copies for testing or as standby instances.
Networking: If a high RTO and RPO are acceptable, deploy in a single region. For low RTO and RPO, deploy across multiple regions to ensure redundancy and minimize downtime.

2. Identify Mission Critical Services and figure out your DRP

It’s good to remember that not all services are created equal. To avoid unnecessary cost, identify your mission-critical services and prioritize them in your DRP. According to an AWS whitepaper, “it's important to note that recovery objectives should not be made in isolation; the probability of disruption and cost of recovery are key factors that help to inform the business value of providing disaster recovery for a workload.”
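One rough way to act on that quote is to rank services by how much an outage is expected to cost you per year. The sketch below multiplies probability of disruption by downtime and cost per hour. The service names and all the numbers are made up for illustration, and the `Service` class and `prioritize` function are hypothetical helpers, not part of any AWS tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Service:
    name: str
    outage_probability_per_year: float   # assumed estimate, between 0 and 1
    downtime_hours_without_dr: float     # assumed outage length with no DR in place
    cost_per_downtime_hour: float        # assumed business cost of one hour down

    def expected_annual_loss(self) -> float:
        """Probability of disruption times the cost of that disruption."""
        return (
            self.outage_probability_per_year
            * self.downtime_hours_without_dr
            * self.cost_per_downtime_hour
        )


def prioritize(services: List[Service]) -> List[Service]:
    """Sort services so the ones with the highest expected loss come first."""
    return sorted(services, key=lambda s: s.expected_annual_loss(), reverse=True)


# Made-up example inventory.
services = [
    Service("payments-api", 0.25, 8, 20_000),
    Service("internal-wiki", 0.5, 8, 100),
    Service("customer-db", 0.125, 24, 20_000),
]

for service in prioritize(services):
    print(f"{service.name}: expected annual loss {service.expected_annual_loss():,.0f}")
```

If the yearly cost of a DR strategy for a service is higher than the loss it prevents, that service probably doesn't need the expensive option.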

3. Test your DRP

What’s the point of creating a plan if you never test it to see if it works? Here’s how you should test your DRP:

Plan your test (yes, it's best to write out another plan).
Outline your objectives, hypothesis, and procedures (try to write down what-if scenarios).
Create success criteria.
Document results, lessons, any deviations, recommendations, vulnerabilities/gaps, and potential threats.
Test your DRP by switching traffic to your standby services, hiring a penetration tester (an ethical hacker) to simulate a system failure or attack, etc.
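The test plan above can be captured in a small record so the success criteria are written down before the test starts. This is only a sketch: `FailoverDrill` and its fields are hypothetical names, and the targets in the example are placeholders. The measured values would come from your own stopwatch and logs during the failover.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailoverDrill:
    """A hypothetical record for one failover test run."""
    objective: str
    hypothesis: str
    target_rto_minutes: float   # success criterion for downtime
    target_rpo_minutes: float   # success criterion for data loss
    findings: List[str] = field(default_factory=list)

    def evaluate(self, measured_rto_minutes: float, measured_rpo_minutes: float) -> bool:
        """Compare measured values to the success criteria and record any misses."""
        passed = True
        if measured_rto_minutes > self.target_rto_minutes:
            passed = False
            self.findings.append(
                f"RTO missed: took {measured_rto_minutes} min, target {self.target_rto_minutes} min"
            )
        if measured_rpo_minutes > self.target_rpo_minutes:
            passed = False
            self.findings.append(
                f"RPO missed: lost {measured_rpo_minutes} min of data, target {self.target_rpo_minutes} min"
            )
        return passed


drill = FailoverDrill(
    objective="Switch web traffic to the standby environment",
    hypothesis="Standby can take production traffic within the RTO",
    target_rto_minutes=15,
    target_rpo_minutes=5,
)
print("Passed:", drill.evaluate(measured_rto_minutes=12, measured_rpo_minutes=3))
print("Findings:", drill.findings)
```

The `findings` list doubles as the start of your test documentation: anything that lands there is a gap to fix before the next run.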

4. Alert your stakeholders of the test

Keeping your stakeholders in the loop is critical: it builds trust, gathers support, and gives them a chance to offer feedback. They also likely have authority over the project, so it's best to keep them informed.

When alerting your stakeholders, make sure to provide information on:

Test schedules
Potential impacts
Results
Procedures

5. Monitor and set up alerts

For your DRP to work, you need to monitor your systems constantly. Once you observe an anomaly, you’ll know to trigger your DRP. This proactive approach helps you detect failures quickly and put your DRP into action.
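As a rough illustration of "observe an anomaly, then trigger the DRP," here's a small Python sketch that only raises the alarm after several health checks fail in a row, so a single blip doesn't cause a failover. The `FailoverMonitor` class and the threshold of 3 are assumptions for the example. In practice the health-check results would come from your monitoring service.

```python
class FailoverMonitor:
    """Hypothetical helper: signal a failover after N consecutive failed health checks."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one health-check result; return True when the DRP should be triggered."""
        if healthy:
            # Any successful check resets the streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


monitor = FailoverMonitor(failure_threshold=3)
checks = [True, False, False, True, False, False, False]
for healthy in checks:
    if monitor.record(healthy):
        print("Threshold reached: trigger the DRP and alert the on-call team")
```

A higher threshold means fewer false alarms but slower detection, and that detection time counts against your RTO.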

In conclusion, a well-thought-out disaster recovery plan is essential for ensuring business continuity and minimizing the impact of disruptions. By choosing the right DR strategy and regularly testing and updating your plan, you can be prepared for whatever challenges come your way.

Here’s the daily prayer, as promised :)

Heavenly Father,

Grant me strength for today, Guidance for tomorrow, And peace for the days ahead.

Amen

Written by Brittany Washington