How to plan for Disaster Recovery in the Cloud

Published in The Cloud Builders Guild

8 min read, Feb 16, 2020

Operating your application services in the cloud as opposed to in your own datacenter is where we are all headed. What does that mean for the old DR plan?

“The Cloud is just someone else’s hardware”

That was a true enough statement in the early days of the Cloud, but not any more. Even if your migration plans initially focus on “lift-and-shift”, “like-for-like” or “Hybrid-cloud”, you will soon learn that you are entering a new era of infrastructure, application and data management.

Once it comes time to refresh those Disaster Recovery plans, your SaaS, IaaS and PaaS solutions will make it hard to come up with a meaningful way to design your approach. In some cases you may have to ask yourself: do we still need a DR plan, or is it something else we need now?

SaaS, IaaS and PaaS, what is the difference?

Most organisations are using a combination of all three modes but each one requires a different way to think about how you ensure your ability to recover from adverse events. In the case of IaaS you sometimes have all the same responsibilities and approaches as before but SaaS and PaaS solutions will always require a different approach than on-prem DR.

Infrastructure as a Service (IaaS)

Using Cloud services to run your environment the same way as you did on-prem is IaaS. You open an account with AWS or Azure, provision Windows or Linux servers on a network you have set up, and migrate your workloads over to them. You may even keep some servers in your on-prem data centre and set up a Hybrid-Cloud, with servers talking to each other over a secure link.

Your applications will still run on servers managed by your team and the way they are operated will look the same or very similar to how they were managed before.

If this is your scenario then the DR approach will be the same as it was before. You just have to update the procedures for how services are accessed and where your backups are now being stored, and you're done. Your RTO/RPO objectives will be the same; your plans will need updating but no major overhaul.

Platform as a Service (PaaS) and Serverless

The next level up in handing responsibility for running your services over to a third party is using “serverless” technologies and PaaS solutions. The two are similar in nature. The difference between them can best be described as PaaS being the mature version of Serverless. AWS DynamoDB and AWS Lambda are Serverless technologies, whereas products such as AWS Elastic Beanstalk and Azure App Service are Platforms as a Service.

In a PaaS world you don’t run or control the servers that your applications and data run on, but you control the platform's configuration and the data it handles. You are responsible for making sure the applications and data are running correctly and reliably, but in the case of failure you have no access to the servers to make changes. This is where your DR strategy starts changing.

As your services will be dependent on the PaaS services that they run on, your DR plans will have to take into account that any redundancy solution will require the same PaaS services to operate the shadow environment or the restored backups. Your DR plans will have to articulate where the services are running in your production environment, where they are being backed up or replicated to, and where and how you will restore them in case of a Disaster.
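One way to keep the "where does it run, where is it backed up, where is it restored" question honest is to hold it as structured data rather than free text. The following is only a minimal sketch: the class, field names, service names and regions are all hypothetical and not taken from any real plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRecoveryEntry:
    """One row of a DR inventory. All field names are illustrative assumptions."""
    service: str            # name of the service being planned for
    platform: str           # PaaS product the service depends on
    production_region: str  # where it runs today
    backup_region: str      # where backups or replicas are kept
    restore_region: str     # where it would be stood up after a disaster


def find_gaps(entries):
    """Return warnings for entries whose backup or restore location is the production region."""
    warnings = []
    for entry in entries:
        if entry.backup_region == entry.production_region:
            warnings.append(f"{entry.service}: backups share the production region")
        if entry.restore_region == entry.production_region:
            warnings.append(f"{entry.service}: restore target is the production region")
    return warnings


# Hypothetical inventory used purely for illustration.
inventory = [
    ServiceRecoveryEntry("orders-api", "AWS Lambda", "eu-west-1", "eu-central-1", "eu-central-1"),
    ServiceRecoveryEntry("orders-db", "AWS DynamoDB", "eu-west-1", "eu-west-1", "eu-central-1"),
]

for warning in find_gaps(inventory):
    print(warning)
```

A check like this will not tell you whether a restore actually works, but it does surface entries where a regional disaster would take the backups down along with production.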

The important thing here is to consider the options available for recovery. If your service is running on DynamoDB and Lambda in AWS, you will be able to design for redundancy across multiple data centres, which will increase your resiliency. In the case of a Disaster that wipes out AWS in your region, however, you will need a plan for how to recover the services in a different region, taking into account your RPO.

Furthermore, your DR plans will have to make it clear that the services cannot be restored in the Microsoft Azure Cloud if there is a problem with AWS globally or if your company has a commercial issue with AWS. PaaS limits your recovery options to the platform the services run on.
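To make "taking into account your RPO" concrete, here is a small sketch that compares the age of the most recent recovery point in the other region with the RPO. The function names, timestamps and the 15 minute target are assumptions chosen for illustration, not recommendations.

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(last_recovery_point, disaster_at):
    """Span of writes that would be lost when restoring from the last recovery point."""
    return max(disaster_at - last_recovery_point, timedelta(0))


def meets_rpo(last_recovery_point, disaster_at, rpo):
    """True if restoring from the last recovery point keeps data loss within the RPO."""
    return data_loss_window(last_recovery_point, disaster_at) <= rpo


# Hypothetical numbers: a 15 minute RPO and a regional outage at 12:10 UTC.
rpo = timedelta(minutes=15)
disaster_at = datetime(2020, 2, 16, 12, 10, tzinfo=timezone.utc)
last_replica_sync = datetime(2020, 2, 16, 12, 0, tzinfo=timezone.utc)
last_nightly_backup = datetime(2020, 2, 16, 0, 0, tzinfo=timezone.utc)

print("cross-region replica meets RPO:", meets_rpo(last_replica_sync, disaster_at, rpo))
print("nightly backup meets RPO:", meets_rpo(last_nightly_backup, disaster_at, rpo))
```

In this made-up scenario the replica that synced ten minutes before the outage is inside the target, while the nightly backup is not. That is the kind of gap the plan should call out before a disaster rather than after.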

Software as a Service (SaaS)

The website you are reading this on is a good example of true Software as a Service. These words are written in a web-based editor and saved on Medium.com. There is no application- or data-level backup or restore that I am able to perform to recover from a potential disaster that Medium.com might go through. Everything I do depends on Medium providing adequate confirmation of its ability to keep the platform secure, robust and redundant.

In this case it is important for your DR plan that you review the credentials the provider has achieved to that effect. This type of software is very easy to consume and anyone can sign up to use it. To what degree organisations manage the use of SaaS depends on the software and the nature of the organisation, but any business-critical SaaS application will require due diligence to validate the provider's ability to recover from adverse events.

Not all services that are called SaaS are true SaaS though. Many software providers have moved their applications to the Cloud but in reality they are just running a hosted application service in the background. In that case it will depend on the architecture of the solution and the details of the hosted service contract to what degree you can plan your Disaster Recovery.

If you are able to get access to, or have a copy of, the data sets and application artefacts, you may want to explore storing them in your own cloud instance and having the ability to stand them up in case of a Disaster. This would allow you to have an independent DR plan that will look fairly similar to a traditional plan.

So, what now then?

Although DR plans are still relevant, it is important to go back to the driver behind the requirement for having a recovery plan. Generally this would be found in a Business Continuity Plan or the Organisation Risk Register. Wherever the business documents the level of disruption and risk it is willing to accept for the business processes that your services support, that is where you will find the true drivers and reasons for having a DR plan.

In the new world of Cloud operated services we may sometimes find more value focussing on Redundancy and levels of Resilience than backup processes and Recovery plans. The drivers and tolerances your organisation has set will help you determine which areas to focus on.

For SaaS and PaaS services you want to focus on reviewing the service provider's compliance statements and audits, as well as its SLAs, to satisfy the business's defined continuity targets in the case of disasters. ISO 27001, SOC 1, 2 and 3, and PCI DSS are common accreditations to look for in a mature SaaS solution. These accreditations confirm that internal processes have been externally validated against the standards stipulated in those definitions. They are only a confirmation that the processes are in place, not a confirmation that they have been tested and verified by the auditors.

You will also want to get confirmation of the DR testing that the provider has done to validate that its processes work. This will often take the form of an auditor's report.
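If you track several providers, the review described above can be reduced to a simple gap check between the evidence you require and the evidence each provider has supplied. This is a rough sketch only: the required set, the vendor names and the function name are hypothetical, and your own list should come from your business continuity targets.

```python
# Evidence this (hypothetical) organisation requires from business-critical providers.
REQUIRED_EVIDENCE = {"ISO 27001", "SOC 2", "DR test report"}


def missing_evidence(provided, required=REQUIRED_EVIDENCE):
    """Return the required items that the provider has not supplied, sorted by name."""
    return sorted(set(required) - set(provided))


# Made-up vendors and the evidence each has supplied so far.
vendors = {
    "example-crm": {"ISO 27001", "SOC 2"},
    "example-wiki": {"ISO 27001", "SOC 2", "PCI DSS", "DR test report"},
}

for name, provided in vendors.items():
    gaps = missing_evidence(provided)
    status = "OK" if not gaps else "missing: " + ", ".join(gaps)
    print(f"{name}: {status}")
```

The output is only as good as the list you feed it, but it makes it obvious which providers still owe you evidence of an actual DR test rather than just a certificate.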

Remember that even the best organisations have blind spots in their DR plans, and real-life testing is the only way that they get found and addressed. For a very insightful read about an incident and subsequent recovery by an open and transparent SaaS/PaaS company, read GitLab's postmortem of the January 31, 2017 database outage.

The big Cloud platforms have focused heavily on attaining compliance with relevant standards. This, together with their focus on security practices, has been a winning formula for gaining the trust required for enterprises to move their applications and data onto the platforms' cloud data centres.

Your Disaster Recovery plan will focus on providing certainty about what the business can expect in the event of a disaster and on describing where to find the relevant contact details and contracts to enforce these expectations when required.

Office 365 and OneDrive

The big application platforms pose a particularly poignant dilemma. If the platform you are using for your core business documents has a long list of certifications that your organisation would never be able to achieve on its own, and a number of data centres in a number of regions and countries with up-time track records that are close to 100%, then what is the point of planning for a Disaster?

Would it not be a better use of time and money to design for redundancy and leave any potential Disaster Recovery to the platform vendor to deal with? After all they have such plans and the certificates to prove that they have been audited.
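When weighing up-time track records that are "close to 100%", it helps to translate the percentage into hours of downtime per year. The sketch below does that arithmetic. The percentages are illustrative only and are not any vendor's published SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def yearly_downtime_hours(availability_percent):
    """Hours per year a service may be down while still meeting the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Illustrative availability levels.
for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% available -> about {yearly_downtime_hours(pct):.2f} hours of downtime per year")
```

At 99.9% availability the allowance is a little under nine hours a year. Whether that is tolerable is exactly the kind of question your business continuity targets should answer.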

For most small and even medium-sized organisations that may actually be sufficient. As long as you set retention policies for all Microsoft Office 365 documents, you will be protected against accidental or deliberate deletion of documents, effectively replacing the need to back up your documents to ensure their availability for the retention period your organisation requires.

However, if you run PaaS-type services in Office 365, such as Microsoft Dynamics 365, you will have databases that you must back up. This can be done using the platform itself, saving backups that you can restore in locations separate from the region where the database is running. Your DR plan will then capture how you would restore the database and associated artefacts in a new instance of Dynamics 365 in the case of corruption or another type of disaster.

Medium and large organisations, as well as organisations with increased legal responsibility, will need a more robust DR plan than a small organisation. When planning for DR in a large organisation you may want to consider backing up all your artefacts, including the Office 365 documents, emails and SharePoint websites, on a different cloud provider's disks. You may not be able to restore services to their previous level if Office 365 stopped working altogether or Microsoft closed all your company's accounts, but at least you will have an independent copy of your data that you can restore to some extent using various tools.

In conclusion

This article has covered the considerations for how DR planning changes in the Cloud. When writing the plan, however, the details of how you approach it will depend on the particular services you are planning for. Office 365 and related services are particularly thorny, and there is a lot of material to cover to understand your position and desired state. Please leave a comment if you are interested in a particular cloud service and we will give you more advice on that service.