AWS disaster recovery

Having a Disaster Recovery plan in place helps keep your data safe when a catastrophe strikes.

Sathvik
Code Direct
4 min read · May 4, 2021

Disaster Recovery (DR) is one of the most important aspects of architecting a software solution. Having a Disaster Recovery system in place helps keep the data safe when a catastrophe strikes.

According to Gartner estimates, the average cost of IT downtime is $5,600 per minute.

With AWS, services come with an SLA (Service Level Agreement). An SLA is an agreement between the provider (AWS) and the client (user) about the level of service the client can expect.

For instance, consider the Simple Storage Service (S3), whose Standard storage class is designed for a durability of 99.999999999% and an availability of 99.99% over a given year.

An availability of 99.99% translates to roughly 52 minutes of downtime over a year, which is quite impressive compared to what most teams achieve when maintaining infrastructure on their own. If an application requires more availability than AWS offers, an alternate solution is needed, and this is where DR can help.
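The downtime figure above is simple arithmetic, and a minimal sketch makes it easy to try other availability levels. The function name `annual_downtime_minutes` is illustrative (it is not part of any AWS tooling), and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, implied by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    # The unavailable fraction of the year, converted to minutes.
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


print(round(annual_downtime_minutes(99.99), 2))  # 52.56
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why small differences in the percentage matter so much.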

The main concepts of Disaster Recovery revolve around two objectives: the Recovery Point Objective and the Recovery Time Objective. Both are discussed below.

A business continuity plan defines the process to follow when a disruption of services occurs. It includes how much data loss is acceptable, the maximum time allowed to recover the lost data, and so on. These definitions change with the criticality of the business. For financial services, data loss is unacceptable, while the time allowed to recover all the data up to the point of the disaster can vary by service.

Recovery Point Objective (RPO)

RPO is the point in time to which data can be restored after a disaster disrupts the service. In simple words: when a disaster strikes, how far back in time are the backups from which the service is recovered? The RPO therefore expresses, as a duration, how much data loss is acceptable.

Example: If the data is backed up every 12 hours, then after a service disruption it can be restored to a state at most 12 hours old, which implies the recovery point for this service is 12 hours. All data written after the last backup is lost and has to be re-entered or redone. The chosen RPO has to go hand in hand with the business continuity plan.

Recovery Time Objective (RTO)

RTO is the maximum acceptable time between the disruption and the moment the service is restored from the latest backup. Example: If a backup is made every hour and the service fails 50 minutes after the last backup, the time it takes to bring the service back from that backup is the recovery time, and the RTO is the upper limit the plan sets on it. RTO is related to RPO, and the acceptable time to recover varies with the business continuity plan. Critical applications can take frequent backups, which leaves less data to restore or redo and can reduce the overall downtime significantly.
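To make the two objectives concrete, here is a minimal sketch that checks a backup schedule and a measured restore time against the targets in a business continuity plan. The function names and the example figures are hypothetical; the only assumption is that, with periodic backups, the worst-case data loss equals the backup interval.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, a failure just before the next backup loses one full interval of data."""
    return backup_interval


def meets_objectives(
    backup_interval: timedelta,
    measured_restore_time: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> bool:
    """Return True when both the RPO and the RTO targets are satisfied."""
    rpo_ok = worst_case_data_loss(backup_interval) <= rpo
    rto_ok = measured_restore_time <= rto
    return rpo_ok and rto_ok


# Hypothetical plan: lose at most 1 hour of data, be back online within 30 minutes.
ok = meets_objectives(
    backup_interval=timedelta(hours=1),
    measured_restore_time=timedelta(minutes=40),
    rpo=timedelta(hours=1),
    rto=timedelta(minutes=30),
)
print(ok)  # False: the RPO is met, but the 40-minute restore exceeds the 30-minute RTO
```

The restore time has to come from an actual recovery drill; it cannot be derived from the backup schedule alone.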

Calculating the Availability of a Solution

Now that we know about the availability of individual services, we can calculate the overall availability of a given solution. If the availability of a service is not known, it can be estimated from the Mean Time Between Failures (MTBF) and the Mean Time to Recover (MTR).

Availability ≈ MTBF / (MTBF + MTR)

Example: If a service has an MTBF of 220 days and an MTR of 10 hours, the availability of that service is 220 / (220 + (10 / 24)) = 99.81%. Both values must be in the same unit, so the 10 hours are converted to days.
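The same estimate can be sketched in a few lines of code. The function name `availability_from_mtbf` is illustrative, and both inputs are assumed to be given in hours so that the units match.

```python
def availability_from_mtbf(mtbf_hours: float, mtr_hours: float) -> float:
    """Estimate availability as a fraction from MTBF and MTR, both in hours."""
    if mtbf_hours <= 0 or mtr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mtr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mtr_hours)


# MTBF of 220 days (converted to hours) and MTR of 10 hours.
availability = availability_from_mtbf(mtbf_hours=220 * 24, mtr_hours=10)
print(f"{availability:.2%}")  # 99.81%
```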

The availability of a combination of services in a solution can be calculated by multiplying their availabilities. Example: Consider two services, Service A and Service B, which depend on each other and have availabilities of 99.99% and 99.99% respectively. The availability of the total solution is then

99.99% * 99.99% = 99.98%

Availability of the solution = product of the dependent services' availabilities
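As a minimal sketch of this rule, the snippet below multiplies the availabilities of services that all have to be up for the solution to work. The function name `serial_availability` is illustrative, and availabilities are passed as fractions (0.9999 for 99.99%).

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain of dependent services: every one of them must be up."""
    return math.prod(availabilities)


# Service A and Service B, each at 99.99%.
combined = serial_availability([0.9999, 0.9999])
print(f"{combined:.4%}")  # 99.9800%
```

Every dependency added to the chain lowers the overall figure, so a solution is never more available than its least available dependent service.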

If an application has redundant services, the calculation differs from the one for dependent services. A redundant group is down only when all of its members are down at the same time, so the downtime percentages of the services are multiplied and the result is subtracted from 100%.

Availability = 100% - product of the redundant services' downtime percentages

Example: If two EC2 instances (SLA 99.99%) of the same application are deployed in different availability zones, each has a downtime percentage of 0.01%, so the availability is 100% - (0.01% * 0.01%) = 99.999999%.
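The redundant case can be sketched the same way. The function name `redundant_availability` is illustrative, availabilities are passed as fractions, and the calculation assumes the instances fail independently of each other, which real deployments only approximate.

```python
import math
from typing import Iterable


def redundant_availability(availabilities: Iterable[float]) -> float:
    """Availability of redundant services: the group is down only if all of them are down."""
    # Multiply the individual downtime fractions, then subtract from 1.
    all_down = math.prod(1 - a for a in availabilities)
    return 1 - all_down


# Two instances at 99.99% each, deployed in different availability zones.
combined = redundant_availability([0.9999, 0.9999])
print(f"{combined:.6%}")  # 99.999999%
```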

On a closing note, AWS and other cloud providers offer highly available services, and their storage and compute offerings keep getting better over time. Based on the business continuity plan and the availability of the cloud services in use, a DR plan can make a solution better by reducing the risk of data loss.
