Let’s talk about Disaster Recovery to the Cloud

Andres Vigil
Google Cloud - Community
5 min read · Mar 25, 2024

One of the most common use cases for the cloud is using it as a disaster recovery site. These conversations with customers often go down a tooling and technology rabbit hole, because for years customers have relied on tools to provide the illusion of a well-built disaster recovery plan. I went looking for a definition of disaster recovery to confirm that it’s never defined as a set of tools or technologies. That’s because, at the end of the day, disaster recovery is a business term, not a technical one.

Now that we’ve established that disaster recovery is an umbrella business term for regaining access to, and functionality of, your IT infrastructure, let’s clarify one more thing. A disaster recovery plan is only initiated if and when there is a loss of that IT infrastructure. That means if you design the infrastructure in a highly available fashion, you are less likely to ever need to initiate the DR plan. I bring this up to make sure we’re not conflating HA (high availability) with DR (disaster recovery): your environment should be as highly available as your budget permits, but you should also have a DR plan for if and when your infrastructure fails… because things happen.

I’ve spoken with countless customers, in my 4+ years at Google Cloud, about how they can successfully deploy a disaster recovery plan from their on-prem data center to Google Cloud. I’ve learned countless things during these conversations, but arguably the most important is to identify how they would like their workloads to recover in the event of a disaster. Put another way: would you want to recover your workloads into a VMware environment in the cloud (Google Cloud VMware Engine), or onto the cloud’s native hypervisor (Compute Engine)? I strongly recommend against trying to modernize during a DR event, because it adds unnecessary complexity. Along the same lines, be mindful that changing your operational model in the middle of a DR event may not be best for your organization. I subscribe to the KISS principle, especially when designing a DR plan. Once we’ve defined the target environment, we can narrow down which Google Cloud capabilities are on the table and avoid boiling the ocean.

The second most important data point: what is the customer’s current DR plan, and what are the recovery point objective (RPO) and recovery time objective (RTO) described in that plan? Understanding both the RPO and RTO helps identify which tools we should use to build the “in-cloud DR” architecture. For example, if the customer’s RPO is “in the hours” (1–10 hours) and their RTO is “in the tens of hours” (24 or 48 hours), taking backups and shipping those backup copies to the cloud can be a cost-effective way to meet the plan.
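To make that concrete, here is a back-of-the-envelope sketch (in Python) of how you might sanity-check whether a backup-and-ship approach meets a given set of objectives. Every number below is a hypothetical placeholder, not a measurement:

```python
# Rough feasibility check for a backup-and-ship DR approach.
# All numbers are hypothetical placeholders; plug in your own measurements.

backup_interval_h = 4.0    # how often backups run
copy_to_cloud_h = 2.0      # time to ship a backup copy to cloud storage
restore_and_boot_h = 20.0  # time to restore images and bring workloads up

# Worst case, disaster strikes just as the next backup would have landed,
# so you can lose up to one backup interval plus the copy time.
achievable_rpo_h = backup_interval_h + copy_to_cloud_h
achievable_rto_h = restore_and_boot_h

target_rpo_h, target_rto_h = 10.0, 48.0  # "in the hours" / "in the tens of hours"

print(f"RPO: {achievable_rpo_h}h achievable vs {target_rpo_h}h target "
      f"-> {'OK' if achievable_rpo_h <= target_rpo_h else 'MISS'}")
print(f"RTO: {achievable_rto_h}h achievable vs {target_rto_h}h target "
      f"-> {'OK' if achievable_rto_h <= target_rto_h else 'MISS'}")
```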

Now that we’ve defined some terms and the data points we want to collect, let’s talk about the options. When IT teams build DR plans for their workloads, they typically use two general types of technology: 1) replication and 2) backup. Within replication we see three further forms: 1) SAN replication, 2) VM replication, and 3) application replication. We can use all four of these tools to build a DR plan that meets the customer’s RPO and RTO objectives while also fitting their budget. Let me give you an example below.
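As a rough illustration of how these four families trade off, the sketch below encodes RPO/RTO bands for each and picks the loosest (typically least expensive) one that still meets the objectives. The bands are my own illustrative assumptions, not vendor guarantees:

```python
# Illustrative RPO/RTO bands (in hours) for the four technology families,
# ordered from tightest to loosest recovery characteristics.
# These numbers are assumptions for the sake of the example.
TECHNOLOGIES = [
    # (name, typical_rpo_h, typical_rto_h)
    ("application replication", 0.0, 0.5),   # e.g. database-native replicas
    ("SAN replication",         0.25, 2.0),  # array-to-array
    ("VM replication",          0.5, 4.0),   # hypervisor-level
    ("backup + restore",        12.0, 24.0), # backup copies shipped to cloud
]

def loosest_fit(target_rpo_h: float, target_rto_h: float) -> str:
    """Pick the loosest technology that still meets both objectives."""
    for name, rpo_h, rto_h in reversed(TECHNOLOGIES):  # loosest first
        if rpo_h <= target_rpo_h and rto_h <= target_rto_h:
            return name
    return "no single technology fits; revisit the objectives"

print(loosest_fit(1, 4))    # prod tier     -> VM replication
print(loosest_fit(12, 24))  # non-prod tier -> backup + restore
```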

A customer has the following on-premises inventory:

  1. 400 VMs on VMware
     - 300 Prod, needing 1hr RPO / 4hr RTO
     - 100 Non-Prod, needing 12hr RPO / 24hr RTO
  2. 10 SQL Server databases running on bare metal (non-virtualized)
  3. 100TB of NFS/SMB data on a NetApp array

In a very common scenario like the one above, we’d start the solutioning by identifying the Google Cloud services that best fit the customer’s requirements. The VMs are running on VMware, and the majority of them require fairly tight RPO and RTO, so we’d look to Google Cloud VMware Engine for the VM workloads. The database workloads are currently non-virtualized, and because SQL Server has several ways to build HA/replication pairs (like Always On Availability Groups), we can look to Compute Engine for those. Lastly, we have a significant amount of file storage data hosted on a NetApp array today. Thanks to Google’s partnership with NetApp, we can support live replication (SnapMirror) into Google Cloud using NetApp Cloud Volumes ONTAP (BlueXP).
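If it helps to see that decision laid out, here is the same platform mapping expressed as data. The workload labels and groupings are purely illustrative:

```python
# Hypothetical mapping of the inventory above onto Google Cloud services.
# The landing zone is driven by what the workload runs on today;
# the DR tooling choice comes afterwards.
LANDING_ZONES = {
    "vmware_vm":      "Google Cloud VMware Engine (GCVE)",
    "bare_metal_sql": "Compute Engine (SQL Server Always On replica)",
    "netapp_nas":     "Cloud Volumes ONTAP via BlueXP (SnapMirror target)",
}

inventory = [
    ("300 prod VMs",      "vmware_vm"),
    ("100 non-prod VMs",  "vmware_vm"),
    ("10 SQL Server DBs", "bare_metal_sql"),
    ("100TB NFS/SMB",     "netapp_nas"),
]

for workload, kind in inventory:
    print(f"{workload:<20} -> {LANDING_ZONES[kind]}")
```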

We’ve identified candidate Google Cloud services in the first pass (VMware Engine, Compute Engine, BlueXP); now let’s talk about the tools we’d use to facilitate the DR plan. The production VMs with the tighter RPO and RTO will land on GCVE, so we can use VM replication technology from VMware SRM, Zerto, or Veeam. For the non-prod VMs with the less stringent RPO and RTO, we can leverage backup tooling such as Veeam, Dell EMC, Rubrik, or Commvault to protect them into Google Cloud Storage. These workloads would be restored from backup at DR time, but because the RTO is “in the days” this is not a problem, and it reduces the overall cost of the solution. For the SQL Server databases, we’d leverage SQL Server replication with Always On (as an example) and run a replica of each database on Compute Engine. Lastly, for the NetApp filer data, as mentioned above, we’d use SnapMirror into a BlueXP (Cloud Volumes ONTAP) deployment in Google Cloud.
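To give a flavor of the restore path for that non-prod tier, here is a minimal Python sketch that pulls a backup copy out of a Cloud Storage bucket at DR time using the google-cloud-storage client library. The bucket and object names are hypothetical, and in practice your backup product orchestrates this step for you:

```python
# Minimal sketch: fetch a backup image from Cloud Storage at restore time.
# Bucket/object/file names are hypothetical; a real restore is driven by
# the backup product (Veeam, Rubrik, Commvault, ...), not hand-rolled code.
from google.cloud import storage  # pip install google-cloud-storage

def fetch_backup(bucket_name: str, backup_object: str, dest_path: str) -> None:
    client = storage.Client()  # uses ambient application-default credentials
    blob = client.bucket(bucket_name).blob(backup_object)
    blob.download_to_filename(dest_path)
    print(f"restored {backup_object} from gs://{bucket_name} to {dest_path}")

fetch_backup("dr-backups-nonprod", "vms/app01/2024-03-25.vbk", "/restore/app01.vbk")
```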

Before closing I’d like to add that this is merely one example of how a customer might protect their on-premises workloads and establish a DR plan in Google Cloud. Hopefully, though, it gives you an idea of the many ways you can go about building a DR plan in the cloud for workloads that live on-prem today.

I am a Customer Engineer at Google Cloud. I help customers Migrate, Protect and Modernize their workloads.