Why do we need IT disaster recovery? In the IT industry, stories of data loss and hardware failure are common. Without a proper architecture for disaster recovery, a business will suffer losses when such a failure occurs. Most organisations are vulnerable to a range of outages and disasters.
A disaster can be caused by computer viruses, application vulnerabilities, disk drive failures, data corruption or human error. It can also result from events such as fires, floods, power failures or weather-related outages that lead to IT failure.
No business is invulnerable to IT disasters, but a speedy recovery from a well-crafted IT disaster recovery plan is expected by today’s ever-demanding customers.
Develop a solid IT disaster recovery plan. Save money, save your customers, save your business.
Amazon Web Services provides disaster recovery solutions that let customers develop robust, cost-effective, targeted and well-tested recovery plans. There are two key metrics to remember:
1. Recovery Time Objective (RTO): the maximum acceptable length of time that your application can be offline after a disaster.
2. Recovery Point Objective (RPO): the maximum acceptable period of time for which data might be lost from an application due to a disaster. RPO describes only the interval of time and doesn’t address the amount or quality of the data lost.
Lowering the RTO and RPO values raises the cost of running the application.
Depending on these metrics, AWS offers four basic techniques for backup and disaster recovery.
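To make the two metrics concrete, here is a minimal sketch that measures one incident against RTO and RPO targets. The timestamps, the target values and the function name `meets_objectives` are all assumptions chosen for illustration, not figures from AWS.

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup, failure, restored, rto, rpo):
    """Return (downtime, data_loss_window, rto_met, rpo_met) for one incident."""
    downtime = restored - failure              # how long the application was offline
    data_loss_window = failure - last_backup   # work done since the last recoverable copy
    return downtime, data_loss_window, downtime <= rto, data_loss_window <= rpo


# Hypothetical incident: backup at 02:00, failure at 14:30, service restored at 16:00.
last_backup = datetime(2024, 1, 10, 2, 0)
failure = datetime(2024, 1, 10, 14, 30)
restored = datetime(2024, 1, 10, 16, 0)

downtime, loss, rto_met, rpo_met = meets_objectives(
    last_backup,
    failure,
    restored,
    rto=timedelta(hours=2),  # assumed target
    rpo=timedelta(hours=4),  # assumed target
)
print(downtime, loss, rto_met, rpo_met)
```

In this made-up incident the application is back within the RTO, but more than twelve hours of data sit outside the last backup, so the RPO is missed. Meeting it would require more frequent backups or continuous replication, which is where the cost increases.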
1. Backup and Recovery
Traditionally, data is backed up to tape and sent off-site regularly. With this technique, Amazon S3 is the destination for backup data instead. For long-term storage, we use Amazon Glacier, which has the same durability as Amazon S3 at a lower cost.
Restoring from Amazon S3 is fast compared to Amazon Glacier. Because retrieval takes longer from Amazon Glacier, it is used to store older backup files.
AWS Storage Gateway takes snapshots of on-site data volumes and copies them to Amazon S3 for backup. The following figure shows data backup options to Amazon S3, from either on-site infrastructure or from AWS.
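One way to express the "recent backups in S3, older backups in Glacier" split is an S3 lifecycle rule. The sketch below only builds the rule as a plain Python dictionary in the shape used by S3 lifecycle configurations. The prefix `backups/`, the 30-day and 365-day values and the helper name `backup_lifecycle` are assumptions for illustration.

```python
def backup_lifecycle(prefix, days_before_glacier, days_before_expiry=None):
    """Build a lifecycle configuration that archives objects under `prefix`."""
    rule = {
        "ID": f"archive-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # After this many days, objects move to the cheaper, slower archive tier.
        "Transitions": [{"Days": days_before_glacier, "StorageClass": "GLACIER"}],
    }
    if days_before_expiry is not None:
        # Optionally delete very old backups altogether.
        rule["Expiration"] = {"Days": days_before_expiry}
    return {"Rules": [rule]}


# Assumed policy: keep backups in S3 for 30 days, archive them, delete after a year.
config = backup_lifecycle("backups/", days_before_glacier=30, days_before_expiry=365)
print(config)
```

With boto3 installed and credentials configured, a dictionary like this could be passed as the `LifecycleConfiguration` argument of the S3 client's `put_bucket_lifecycle_configuration` call. That step is left out here so the sketch runs without an AWS account.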
If a disaster occurs, we need to recover the data very quickly and reliably. The following diagram shows how to quickly restore a system from Amazon S3 backups to Amazon EC2.
This technique is simple and cost-effective. However, the RPO will be large, and there will be downtime before restoration completes.
2. Pilot Light
In the Pilot Light method, recovery time is shorter than in the backup-and-recovery method, because the core piece of the system, such as a database, is already running and up to date in AWS.
The database is always running so that data can be replicated to it. For the other layers, server images are created and updated periodically.
In Pilot Light, the RTO and RPO are low, and recovery takes just a few minutes. AWS CloudFormation can be used to automate the provisioning of these services.
During recovery, instances are launched from the backed-up AMIs. We can configure load balancing and auto-scaling so that the service scales up automatically when traffic increases. DNS records need to be updated in parallel to point to the AWS environment.
The following figure shows the recovery phase of the pilot light scenario.
3. Warm Standby
Warm Standby is an extended version of Pilot Light. It reduces the recovery time further because part of the service is always running.
In warm standby, the recovery time is reduced to almost zero by always running a scaled-down version of a fully functional environment. If the production system fails, the standby infrastructure is scaled up to the level of the production environment, and DNS records are updated to route all traffic to the AWS environment. This approach reduces RTO and RPO, but the cost is higher because an alternate system runs 24/7.
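The scale-up step can be sketched as a simple capacity calculation: how many instances must be added to the standby fleet before it can carry the production load. The request rates, the per-instance capacity and the function name `instances_to_add` are hypothetical, and real sizing would also account for headroom and instance types.

```python
import math


def instances_to_add(standby_instances, production_load_rps, rps_per_instance):
    """Return how many extra instances the standby fleet needs at failover."""
    # Round up: a partially used instance still has to exist.
    required = math.ceil(production_load_rps / rps_per_instance)
    # Never return a negative number if the standby is already large enough.
    return max(0, required - standby_instances)


# Assumed numbers: 2 standby instances, 4500 requests/s in production,
# and each instance able to serve 500 requests/s.
extra = instances_to_add(standby_instances=2, production_load_rps=4500, rps_per_instance=500)
print(extra)
```

The smaller the standby fleet, the cheaper it is to run 24/7, but the more instances have to be launched during recovery, which lengthens the time before the environment can take full load.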
The following figure shows the preparation phase for a warm standby solution, in which an on-site solution and an AWS solution run side-by-side.
In case of failure of the production system:
4. Hot Standby (Multi-Site)
In Multi-Site, the application runs in AWS as well as on the existing infrastructure. The DNS service supports weighted routing, so traffic goes to both the AWS infrastructure and the existing infrastructure.
If a disaster occurs on the existing system, all traffic is routed to the AWS environment. With auto-scaling, the capacity of the services increases rapidly to handle the full production load. Here you can achieve near-zero RTO and RPO, but the cost is high.
The following figure shows how you can use the weighted routing policy of the Amazon Route 53 DNS to route a portion of your traffic to the AWS site. The application on AWS might access data sources on the on-site production system. Data is replicated or mirrored to the AWS infrastructure.
The following figure shows the change in traffic routing in the event of an on-site disaster. Traffic is cut over to the AWS infrastructure by updating DNS, and all traffic and supporting data queries are supported by the AWS infrastructure.
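The effect of weighted routing and of the cutover can be sketched with a little arithmetic. With a weighted policy, each record receives a share of traffic equal to its weight divided by the sum of all weights. The site names, the 90/10 split and the helper names below are assumptions for illustration, and the code only models the arithmetic rather than calling Route 53.

```python
def traffic_share(weights):
    """Return the fraction of traffic each site receives under weighted routing."""
    total = sum(weights.values())
    if total == 0:
        # Simplification: this sketch rejects the all-zero case instead of modelling it.
        raise ValueError("at least one record needs a non-zero weight")
    return {name: weight / total for name, weight in weights.items()}


def fail_over(weights, failed_site):
    """Return new weights with the failed site's weight set to zero."""
    return {name: (0 if name == failed_site else weight) for name, weight in weights.items()}


# Assumed split: 90% of traffic to the on-site system, 10% to AWS.
weights = {"on-site": 90, "aws": 10}
normal = traffic_share(weights)
after_disaster = traffic_share(fail_over(weights, "on-site"))
print(normal)
print(after_disaster)
```

Setting the on-site weight to zero is one way to model the DNS update described above. After it, the AWS site receives all traffic and must be scaled to handle the full production load.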