Building Your Disaster Recovery Solution in the Cloud

Jan 14, 2019

Over the years, I’ve spoken with various users about different use cases for the public cloud. Frequently, these users ask how they can use Amazon Web Services (AWS) or Microsoft Azure as the disaster recovery (DR) target for their on-premises workloads. In this blog post, I will provide a high-level overview of the different DR options using the public cloud.

Why Use Public Cloud for DR?

The traditional approach to DR requires significant investment of time and resources. At minimum, users must consider how they would replicate their primary infrastructure to a secondary site. That secondary site needs to be procured, installed, and maintained. During normal operations, the secondary site will typically be under-utilized or over-provisioned.

The cost of such an investment is beyond the means of many companies. Even for companies with the means, DR is seen as a sunk cost that delivers little return quarter over quarter. However, not having an adequate DR strategy is also something no company can afford.

The public cloud offers a way for companies of all sizes to build DR environments with little upfront cost through a pay-as-you-go model.

Options for Disaster Recovery in the Cloud

The major public cloud vendors offer multiple options for building a DR site using their clouds. AWS, for example, highlights four options, or scenarios, in a white paper published in 2014. Each of these options, all of which are also available with the other public cloud vendors, comes in at a different price point and delivers a different Recovery Time Objective (RTO) and a different Recovery Point Objective (RPO).

RTO is the maximum acceptable time to resume business after a disaster, and RPO is the maximum amount of data, usually expressed as a window of time, that a company can afford to lose. Companies can choose the option that best meets their RTO and RPO requirements and budget. In general, the public cloud enables customers to build solutions with better RTO and RPO at a lower cost than a secondary DR site.
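To make the two objectives concrete, here is a minimal Python sketch that treats RTO and RPO as time budgets and checks whether an estimated recovery fits inside them. The names `RecoveryObjectives` and `meets_objectives` are hypothetical and chosen for illustration only; the numbers in the usage lines are made up.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Business targets for one workload (hypothetical helper type)."""
    rto: timedelta  # longest acceptable time to resume business
    rpo: timedelta  # largest acceptable window of lost data


def meets_objectives(
    targets: RecoveryObjectives,
    estimated_recovery_time: timedelta,
    estimated_data_loss_window: timedelta,
) -> bool:
    """Return True only if both estimates stay within the targets."""
    return (
        estimated_recovery_time <= targets.rto
        and estimated_data_loss_window <= targets.rpo
    )


# Illustrative targets: back in business within 4 hours, lose at most 1 hour of data.
targets = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
print(meets_objectives(targets, timedelta(hours=2), timedelta(minutes=15)))
print(meets_objectives(targets, timedelta(hours=2), timedelta(hours=24)))
```

The second call fails the check on RPO alone: recovering quickly does not help if the most recent recoverable copy of the data is a day old.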

Backup and Restore

Traditionally, companies have used off-site backup tapes as their primary means of restoring data in the event of a disaster. This typically involves retrieving tapes from cold storage and recovering the data either once the primary facility has been restored or after the tapes have been sent to a cold secondary site that is only turned on when a disaster occurs.

Companies have started to use public cloud storage services such as Amazon Simple Storage Service (S3) and Azure Blob Storage as alternatives to archiving tape to an off-site facility. Not only is this more cost-effective than tape, it also delivers better RTO and RPO, since the data is already in the cloud where it can be used to launch a DR site on demand.

Source: AWS white paper, “Using Amazon Web Services for Disaster Recovery” (2014)

There are various approaches for transferring data from the user’s on-premises infrastructure to the public cloud. These include migration tools specific to a particular cloud vendor, as well as third-party migration, backup, and restore tools.

When a disaster is declared, new instances/virtual machines (VMs) can be launched using machine images created from on-premises production servers. If needed, application data is restored from object storage. If an application needs very low RPO and RTO, a replication solution may need to be used in conjunction with the backup and restore option.

This option has the lowest cost of the four, but also the longest RTO and RPO.
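As a rough illustration of why Backup and Restore sits at the slow end, the sketch below estimates its worst-case data loss and its recovery time. It rests on simplifying assumptions that are not from the white paper: backups run on a fixed schedule, a backup only counts once its upload has finished, and restore time scales linearly with data size. The function names and figures are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(
    backup_interval: timedelta, upload_duration: timedelta
) -> timedelta:
    """Worst case: disaster strikes just before the next backup finishes
    uploading, so everything since the previous backup started is lost."""
    return backup_interval + upload_duration


def estimated_recovery_time(
    launch_time: timedelta, data_gb: float, restore_gb_per_hour: float
) -> timedelta:
    """Time to launch instances/VMs from images, then restore data
    from object storage at an assumed constant throughput."""
    if restore_gb_per_hour <= 0:
        raise ValueError("restore_gb_per_hour must be positive")
    return launch_time + timedelta(hours=data_gb / restore_gb_per_hour)


# Illustrative numbers only: nightly backups that take 2 hours to upload,
# 500 GB to restore at 250 GB/hour after 30 minutes of instance launches.
print(worst_case_data_loss(timedelta(hours=24), timedelta(hours=2)))
print(estimated_recovery_time(timedelta(minutes=30), 500, 250))
```

Comparing these two estimates against a workload's RTO and RPO targets is one way to decide whether backup and restore alone is enough or whether replication is also needed.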

Pilot Light

The Pilot Light option is named after the always-on pilot light in a gas heater, which is used to quickly light the furnace. With this approach, a minimal copy of the production environment is maintained in the cloud. Core components whose state must be maintained and updated, such as a production database, run continuously in the cloud and are synced regularly with production. Servers in the cloud can be provisioned but kept turned off until a disaster is declared. Alternatively, server images can be maintained for launching instances/VMs when needed.

Compared to the Backup and Restore option, the Pilot Light scenario offers a better RTO since the core components are already running in the cloud and servers are already provisioned or ready to be provisioned. It also offers better RPO since core services are regularly updated and synced with production. However, the cost will be higher.

Warm Standby

The Warm Standby option requires a scaled-down copy of production to be provisioned and run continuously in the cloud. As with Pilot Light, stateful core components are updated and synced regularly with production. In addition, a subset of the production servers runs continuously as instances/VMs in the cloud and can be scaled up as needed.

Compared to the previous two options, the Warm Standby scenario offers a better RTO since the core components are already running in the cloud and critical servers are already provisioned and running. In a disaster, production traffic for critical workloads can be redirected to the cloud while additional instances/VMs are launched to take on additional workloads. The Warm Standby option also offers better RPO since core services are being regularly updated and synced with production. The cost is higher than the earlier two options since more resources are provisioned and continuously running.

Hot Site

As with the Warm Standby option, a copy of the production environment runs continuously in the cloud. In the Hot Site scenario, however, that copy is the full production environment rather than a scaled-down subset. This allows for immediate failover during a disaster, with the cloud provisioned to run the same amount of workload as production. In addition, if core components are being updated synchronously, then the cloud can be used for production, along with the user’s on-premises infrastructure, in an active-active setup.

This option has the best RTO and RPO since the user is running an exact replica of the on-premises infrastructure in the cloud. As expected, it also has the highest cost, particularly if core components for both the on-premises and cloud environments are being completely synced.

No one approach is the only or best way to create a disaster recovery solution in the cloud. Typically, the best approach is to combine multiple options and to apply them to workloads based on their importance. The key is understanding the value of your workloads, knowing your options, and making the right trade-offs between business requirements, technical requirements, and cost.
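One way to picture combining options by workload importance is the Python sketch below. It walks the four options from cheapest to most expensive, in the order described above, and picks the first one that satisfies a workload's targets. The RTO and RPO figures attached to each option are placeholders invented for the example, not numbers from AWS or any other vendor, and `DrOption` and `cheapest_option` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class DrOption:
    name: str
    typical_rto: timedelta  # placeholder value, for illustration only
    typical_rpo: timedelta  # placeholder value, for illustration only


# Ordered from lowest to highest cost, matching the article's ranking.
OPTIONS = [
    DrOption("Backup and Restore", timedelta(hours=24), timedelta(hours=24)),
    DrOption("Pilot Light", timedelta(hours=4), timedelta(hours=1)),
    DrOption("Warm Standby", timedelta(minutes=30), timedelta(minutes=15)),
    DrOption("Hot Site", timedelta(minutes=1), timedelta(0)),
]


def cheapest_option(
    rto_target: timedelta, rpo_target: timedelta
) -> Optional[DrOption]:
    """Return the lowest-cost option whose RTO and RPO meet the targets,
    or None if even the most expensive option falls short."""
    for option in OPTIONS:
        if option.typical_rto <= rto_target and option.typical_rpo <= rpo_target:
            return option
    return None


# Hypothetical workloads with (RTO target, RPO target) pairs.
workloads = {
    "payments": (timedelta(minutes=5), timedelta(minutes=1)),
    "intranet wiki": (timedelta(hours=6), timedelta(hours=2)),
    "monthly reporting": (timedelta(hours=48), timedelta(hours=24)),
}
for name, (rto, rpo) in workloads.items():
    choice = cheapest_option(rto, rpo)
    print(name, "->", choice.name if choice else "no option meets the targets")
```

With these placeholder figures, the three workloads land on three different options, which is the point: only the most critical workloads pay for the most expensive setup.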