Disaster Recovery in Action
--
If there’s one thing that’s certain, it’s uncertainty.
We can’t control uncertainty, but a good Disaster Recovery Plan (DRP) gives you the confidence to look disaster in the face and say, “Not today.”
Let’s start with the basic question: do you even need a DRP?
Before answering, picture a scenario: what happens if a natural calamity strikes your only data center and destroys your infrastructure? The answer is obvious: you would be out of business in an instant.
What is Disaster Recovery Planning (DRP) in the IT industry?
It is the ability to restore the critical data and applications that let the organization operate normally. The DRP outlines what needs to be done after a disaster to restore essential services so that the business can keep operating smoothly.
Below mentioned are some important aspects of DRP:
Let’s assume that we have defined the scope, identified the risks with a proper impact analysis, and informed the stakeholders. We are now ready to proceed with our alternate setup in the cloud, let’s say on AWS.
Alternate Setup at AWS
The architecture is quite simple. In case of a disaster, traffic is shifted to a Temporary Server until our alternate setup is ready to serve, which must happen within the Recovery Time Objective (RTO). Until then, the Temporary Server shows a maintenance page with a message for the users. As soon as the alternate setup at AWS is ready, traffic is shifted to AWS.
The broad workflow of a request would be:
Client → Akamai → AWS → ELB → Web Server → Web Application → Database
Traffic management can be done with Akamai GTM (Global Traffic Management), which provides a centralized approach to global load balancing aimed at increasing site availability, responsiveness, and reliability.
The following scenarios can be managed through Akamai GTM:
a) Regular Day Traffic: On a regular operational day, all traffic is served from the Original Data Center via Akamai.
b) Disaster Acknowledged: This is the point at which the disaster has been acknowledged, meaning our Original Data Center is down or unable to serve traffic. In this scenario, the maintenance page is shown from the Temporary Server via Akamai. The switch can be made manually or through an automated process that detects that the Original Data Center has not responded for a pre-defined period of time.
c) Disaster Recovered at AWS: As soon as the recovery process completes at AWS (the alternate data center), all traffic is served from AWS via Akamai.
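The three scenarios can be read as a small decision rule. The sketch below is only an illustration of that rule in Python, not Akamai GTM configuration or its API: the names `Target`, `Health`, and `choose_target`, and the 300-second threshold, are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    ORIGINAL_DC = "original-data-center"
    TEMPORARY_SERVER = "temporary-server"  # serves the maintenance page
    AWS = "aws"


@dataclass
class Health:
    seconds_since_last_ok: float  # time since the Original Data Center last answered a probe
    aws_ready: bool               # True once the recovery at AWS has completed


def choose_target(health: Health, failure_threshold_s: float = 300.0) -> Target:
    """Pick where traffic should go, following scenarios (a), (b) and (c)."""
    if health.seconds_since_last_ok < failure_threshold_s:
        return Target.ORIGINAL_DC       # (a) regular day traffic
    if health.aws_ready:
        return Target.AWS               # (c) disaster recovered at AWS
    return Target.TEMPORARY_SERVER      # (b) disaster acknowledged, AWS not ready yet
```

In practice the threshold would be derived from how quickly you are willing to declare a disaster, and a manual override would sit alongside the automated detection.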
Data Synchronization
Data synchronization is the heart of our recovery process. If it is not done correctly, the entire recovery may fail.
Data synchronization (or replication) is the process of copying data from one DB instance to another, in this case to an instance in our alternate data center. To achieve this, a backup copy of the database should be created at AWS and kept up to date by continuous replication from the Original Data Center, so that it can be used in case of a disaster.
At any point in time, the replication lag should not exceed the Recovery Point Objective (RPO), and monitoring alerts should be in place to enforce this. Meeting the RPO is crucial because it has been calculated and documented after discussion with the stakeholders.
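A monitoring check for this rule can be quite small. The following is a minimal sketch, assuming you can obtain the timestamp of the last change applied on the AWS replica; the function names and the 80% warning ratio are hypothetical choices for illustration, not part of any particular monitoring tool.

```python
from datetime import datetime, timedelta


def replication_lag(last_replicated_at: datetime, now: datetime) -> timedelta:
    """Lag between now and the last change applied on the replica (never negative)."""
    return max(now - last_replicated_at, timedelta(0))


def lag_status(lag: timedelta, rpo: timedelta, warn_ratio: float = 0.8) -> str:
    """Classify the lag against the RPO so alerts fire before the RPO is breached."""
    if lag > rpo:
        return "CRITICAL"   # RPO already violated
    if lag >= rpo * warn_ratio:
        return "WARNING"    # approaching the RPO
    return "OK"
```

Raising a warning before the RPO is actually breached gives the team time to investigate a stalled replication stream while the data loss window is still acceptable.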
Application Setup
You need to consider the following while setting up your applications on AWS:
1. Server Setup
Your applications could be one of the following types:
a) Containerized: For such applications you need a Kubernetes setup, which requires the following components:
i) Kube API server: It validates and configures the data for Kubernetes API objects such as pods and services, and it is the front end through which the rest of the cluster is managed.
ii) Kube tools on each node server: You need to install and set up kubelet, kube-proxy, kube-dns, Docker, etc.
A good thing about this kind of setup is that it is done only once, after which you can deploy any sort of Dockerized web application, because the container image takes care of your application’s dependencies. You then simply deploy your web application into the Kubernetes cluster.
b) Non-Containerized: Prepare the server by installing all application dependencies, e.g. PHP and its modules, Java, etc. This has to be done on every server, according to what each application demands. Then you are ready to deploy the required applications.
2. Routing
Web request routing & load balancing: You need to set up a load balancer and a web server according to your applications’ requirements. Since the setup is on AWS, ELB is a natural choice for load balancing.
Internal DNS discovery: DNS discovery is crucial to get right, as most of your applications’ internal interactions depend on it. To achieve it, you can set up your own DNS server or, in this case, use Route 53, which is highly reliable. List all of your internal DNS records and add them to it.
3. Deployment
You need to ensure seamless deployment into AWS, both automated and on-demand.
Automated: Every production deployment of your application at the Original Data Center should be automatically deployed to AWS too.
On-demand: If, during disaster recovery, you find that an application version at AWS is not in sync with what was deployed in the Original Data Center, you can manually re-deploy that application with the correct version.
Recording application versions: It’s crucial to record the version of each application deployed in the Original Data Center in some kind of DB resource that is replicated to AWS. This record is the reference point for the correct application version during a disaster.
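The version record makes it straightforward to detect which applications need an on-demand re-deploy. Here is a minimal sketch in Python, assuming the recorded versions and the versions currently running at AWS have each been loaded into a dictionary keyed by application name; `find_version_drift` is a hypothetical helper, not part of any deployment tool.

```python
from typing import Dict, Optional, Tuple


def find_version_drift(
    recorded: Dict[str, str],
    deployed: Dict[str, str],
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {app: (expected_version, actual_version)} for every app out of sync.

    `recorded` is the reference replicated from the Original Data Center;
    `deployed` is what is running at AWS. actual_version is None when the
    application is missing from AWS altogether.
    """
    drift = {}
    for app, expected in recorded.items():
        actual = deployed.get(app)
        if actual != expected:
            drift[app] = (expected, actual)
    return drift
```

Every entry in the returned dictionary is a candidate for a manual re-deploy with the expected version before traffic is switched to AWS.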
4. Physical Dependencies (DB Resources, Elastic etc.)
Install the database engines, Elasticsearch, etc., and maintain synchronization between the DR resources and the Original Data Center resources.
Mailers: Either set up your own SMTP server or, in this case, simply use SES (Simple Email Service).
Cost Optimized Scaling at AWS
To optimize cost, we need to size the acquired infra to match what the DR setup is doing at any given time. There can be three hypothetical states:
1) Dead State: The purpose of this state is to maintain a ready setup with minimum infra, since building a setup from scratch during a disaster would be very difficult. Switching from this state to another is easy, as you just need to scale up the infra.
2) Test Active State: This state is for periodically testing the applications to check that they work without issues. Whether the testing is done manually or via automation scripts, it requires more infra than the Dead State.
3) DR Active State: This state is meant to serve actual client traffic. The infra acquired is at its maximum here, and this is the state the setup stays in during disaster recovery.
You can use Ansible & Terraform scripts to acquire and set up the infra.
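One way to think about the three states is as a lookup from state to desired capacity, which the provisioning scripts then apply. The sketch below shows that idea in Python; the node counts are made-up numbers for illustration only, and `DRState`, `Capacity`, and `scaling_delta` are hypothetical names, not Ansible or Terraform constructs.

```python
from dataclasses import dataclass
from enum import Enum


class DRState(Enum):
    DEAD = "dead"
    TEST_ACTIVE = "test-active"
    DR_ACTIVE = "dr-active"


@dataclass(frozen=True)
class Capacity:
    web_nodes: int
    db_replicas: int


# Assumed sizes for illustration; real values depend on your traffic and budget.
CAPACITY = {
    DRState.DEAD: Capacity(web_nodes=1, db_replicas=1),
    DRState.TEST_ACTIVE: Capacity(web_nodes=2, db_replicas=1),
    DRState.DR_ACTIVE: Capacity(web_nodes=6, db_replicas=2),
}


def scaling_delta(current: DRState, target: DRState) -> Capacity:
    """Resources to add (positive) or release (negative) when switching states."""
    a, b = CAPACITY[current], CAPACITY[target]
    return Capacity(b.web_nodes - a.web_nodes, b.db_replicas - a.db_replicas)
```

Keeping the sizes in one table makes the cost of each state explicit and lets the same table drive whatever tool actually provisions the infra.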
Conclusion
A well documented and detailed Disaster Recovery Plan is critical to the survival of an organization. There should be clear processes and expectations in the event of a disaster in order to respond in a timely and effective manner. There is a lot of work involved and ultimately this is not a task that can be completed by any one person; we’ll need to assemble a team of people to accurately assess the needs and priorities of the organization.
Risk always exists, whether you plan for it or not. If you don’t, then you are accepting that risk, whether you like it or not.