Disaster Recovery on Google Cloud: An Overview

Get Cooking in Cloud

Priyanka Vergadia

Published in Google Cloud - Community

6 min read · Oct 28, 2019

Introduction

“Get Cooking in Cloud” is a blog and video series to help enterprises and developers build business solutions on Google Cloud. In this series, I identify specific topics that developers are looking to architect on Google Cloud, and then create a miniseries on each topic. If you are interested in the prior miniseries, check out this.

In this second miniseries, I will cover Disaster Recovery on Google Cloud. Disasters can be pretty hard to deal with when you have an online presence. In the next few blogs, we will elaborate on how to deal with disasters like earthquakes, power outages, floods, and fires. Here is the plan for the series.

In this article, we will define some Disaster Recovery terms that are instrumental to DR planning. So, read on!

What you’ll learn

Important DR terms
RTO — Recovery Time Objective
RPO — Recovery Point Objective

Prerequisites

Basic concepts and constructs of Google Cloud so you can recognize the names of the products.

Check out the video

Disaster Recovery Overview Video

What is Disaster Recovery (DR)?

When a natural disaster happens, you need to make sure the impact on your business is minimal, and for that, you need a robust disaster recovery (DR) plan!

A disaster typically means a service-interrupting event. Disaster recovery, then, is about how much impact a business can take during such an event and how it gets back to normal afterwards. It is a subset of business continuity planning.

Key DR terms

RTO, or recovery time objective, is the maximum time your application can be offline. This usually depends on the SLAs (service level agreements) you offer to your customers. An SLA is a promise made by you, as a service provider, to your consumers about the availability of your service and the ramifications of failing to deliver the agreed-upon level of service.

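RPO, or recovery point objective, is the maximum length of time during which data might be lost. For example, with an RPO of one hour, you accept losing at most the last hour of data written before the disaster.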

Typically, smaller RTO and RPO values mean that the application must recover from an interruption more quickly and with less data loss.

How quickly a system can recover after a disaster is determined by the high availability (HA) and disaster recovery (DR) patterns you choose.
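To make RTO and RPO a little more concrete, here is a minimal Python sketch. It assumes a simple setup that is not described in this article: data is protected by periodic backups, so the worst-case data loss equals the time between two backups, and recovery means restoring from the latest backup. The function names and all the numbers are hypothetical and only illustrate the comparison.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, a failure just before the next backup
    loses everything written since the previous one."""
    return backup_interval


def meets_objectives(backup_interval, estimated_restore_time, rpo, rto):
    """Return (rpo_ok, rto_ok) for a simple backup-and-restore setup."""
    rpo_ok = worst_case_data_loss(backup_interval) <= rpo
    rto_ok = estimated_restore_time <= rto
    return rpo_ok, rto_ok


# Hypothetical numbers, for illustration only.
result = meets_objectives(
    backup_interval=timedelta(hours=6),
    estimated_restore_time=timedelta(minutes=45),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=2),
)
print(result)  # (False, True)
```

In this made-up case the restore is fast enough for the RTO, but backing up every six hours cannot satisfy a one-hour RPO, so the backups would need to run more often.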

Understanding DR Patterns

Let’s consider a scenario. I am making some cakes and cookies for a party, which requires a mixer. I am on my first batch of cookies when the mixer starts to make a weird noise. The manual says that this noise means the mixer is about to fail, so I need to do something to carry on with the party preparations. I have three options:

Option 1: I can call the mixer company to come fix it, which is obviously going to take time and won’t work well given that the party is today.

Option 2: I can try to fix it myself based on the instructions in the manual. This will mean a small pause in my preparation, but it will get me back on track a little quicker than waiting for the mixer repair person.

Option 3: I could keep going at a slow mix setting where I can’t hear the warning noise. In this case, I have to slow down, but the mixer is still working, so I can continue and get it fixed later. This option would definitely have less impact on my party preparations.

With that understanding, let’s review our options in DR terminology:

If we call the mixer company for a fix, we have to stop making cake until they turn up and fix it. It will be slow and time-consuming, but eventually we will be back and working. This scenario is closest to what we would call a cold DR pattern.
If we try to fix it ourselves, it’s a bit faster than the cold pattern, but we still need to pause making cake to fix the problem, so it is slower than normal. This scenario is closest to what we would call a warm DR pattern.
And the option where we continue working at slow speed and fix the mixer later is the most efficient, because we keep making cakes without stopping or pausing. This scenario is closest to what we would call a hot DR pattern.

Moral of the story: we pick a DR pattern that makes sense for the business at hand. In this case, it was the really serious business of hosting a party!
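As a rough illustration of how recovery objectives could point to one of the three patterns, here is a small Python sketch. The cut-off values and the rule of using the tighter of the two objectives are assumptions made up for this example, not guidance from Google Cloud. Real decisions also weigh cost and complexity.

```python
from datetime import timedelta
from enum import Enum


class DRPattern(Enum):
    COLD = "cold"  # call the repair person and wait
    WARM = "warm"  # pause briefly and fix it yourself
    HOT = "hot"    # keep running, fix it later


# Hypothetical cut-offs, chosen only for illustration.
HOT_LIMIT = timedelta(minutes=5)
WARM_LIMIT = timedelta(hours=4)


def suggest_pattern(rto: timedelta, rpo: timedelta) -> DRPattern:
    """Suggest a pattern from the tighter of the two objectives
    (a deliberate simplification)."""
    tightest = min(rto, rpo)
    if tightest <= HOT_LIMIT:
        return DRPattern.HOT
    if tightest <= WARM_LIMIT:
        return DRPattern.WARM
    return DRPattern.COLD


print(suggest_pattern(timedelta(hours=1), timedelta(hours=2)).value)  # warm
```

The tighter the objectives, the closer you move to the hot pattern, just like the party that cannot wait for the repair person.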

Using Google Cloud for Disaster Recovery

Using Google Cloud for DR can greatly reduce the costs associated with achieving both your RTO and RPO values, compared to fulfilling those requirements on-premises. For example, traditional DR planning requires you to account for a number of requirements, including capacity, security, network infrastructure, support, and bandwidth.

Google Cloud has several features that help bypass most of these complicated factors and reduce the cost of managing a DR solution. Its global network, redundancy, scalability, security, and compliance are a few such features. Keep following the series for more on this!

Best Practices for Planning DR Strategy

First, and most important, define your RTO and RPO values, because they indicate an appropriate DR pattern.
Then, make sure that you have a full end-to-end recovery plan. Just backing up and archiving data won’t be enough.
Make the tasks as specific as possible, so that when the time comes to execute the plan, it does not just say “Run the restore script”. From where? What is the command? Such information should be in the instructions.
Implement control measures: monitor and send alerts when something destructive happens, like spikes in traffic or deletion of data.
Prepare your software: verify that you can install your software from source or from a preconfigured image, and make sure you have the appropriate licenses for deployment.
Your continuous deployment toolset is an integral part of your application deployment. Make sure it is available at the time of a disaster so that you can recover your environment.
Do not forget about security and compliance! Configure security the same way for the DR and production environments.
And the last, but equally important, step: make sure your DR plan works! Maintain multiple paths for data recovery and test the plan regularly.
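The advice about making tasks specific can be sketched in code. The example below is a hypothetical Python check that flags recovery steps that don't say where to run them or which command to use. The `RunbookStep` structure, its field names, and the sample command are assumptions made up for this illustration. They are not part of any Google Cloud tool.

```python
from dataclasses import dataclass


@dataclass
class RunbookStep:
    description: str
    run_from: str = ""  # machine or directory to run the step from
    command: str = ""   # exact command to execute


def vague_steps(steps):
    """Return descriptions of steps that don't say where to run or what to run."""
    return [
        step.description
        for step in steps
        if not step.run_from.strip() or not step.command.strip()
    ]


# Hypothetical plan: the first step is too vague, the second is specific.
plan = [
    RunbookStep("Run the restore script"),
    RunbookStep(
        "Restore the database from the latest backup",
        run_from="bastion host, /opt/dr",
        command="./restore.sh --target db-replica",
    ),
]
print(vague_steps(plan))  # ['Run the restore script']
```

A check like this could run whenever the plan changes, so that vague steps are caught long before anyone has to follow them during a real disaster.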

Conclusion

Whether you are a small blogger looking to grow your community or you run a huge, multi-scale application, you need to protect your application from going down during a disaster. Hopefully this has been a helpful overview of Disaster Recovery. Whether your application is deployed on-premises or on Google Cloud, and whether you have small or big RTO and RPO values, stay tuned for upcoming articles, where you will learn to set up a DR pattern that makes sense for your business.

Next steps

Follow this blog series on Google Cloud Platform Medium.
Reference the DR solutions.
Follow the Get Cooking in Cloud video series and subscribe to the Google Cloud Platform YouTube channel.
Want more stories? Check my Medium and follow me on Twitter.
Enjoy the ride with us through this miniseries and learn about more Google Cloud solutions like this :)