Cloud SQL: Recovering from a regional failure in 10 minutes or less (MySQL & PostgreSQL)
UPDATE: You can now conduct a cross-region failover in 2 minutes or less! Check out my update to this article here, which details how to use new Cloud SQL features to achieve this recovery time!
Introduction
Google Cloud SQL natively provides a fully managed High Availability service that, when enabled on an instance, provides data redundancy across more than one zone in a single region. This means that if your primary instance goes down, Cloud SQL will automatically fail over to your standby instance without any intervention, and your applications will continue to function as intended.
But what happens if the entire region goes down, taking both your primary and standby instances with it?!
Don’t get me wrong: regional failures are rare in GCP, but they are possible and they do happen on occasion. Even more likely than a regional failure is a service failure, a situation where a specific service experiences an issue in a region while the region itself remains up.
To protect against this type of failure, a higher level of redundancy is required: one that spans more than a single Google Cloud Region. This configuration is called Cross-Region High Availability.
This article details how to enable Cross-Region High Availability, how to know when to utilize it, and how to orchestrate a cross-region failover. I’ve also provided an automation script that guides you through failover steps, allowing you to recover from a regional incident in less than 10 minutes. Give it a try!
Cloud SQL’s Native High Availability
With a High Availability Cloud SQL instance, the configuration is such that a primary instance exists in one GCP Zone and a standby instance exists in another. Through synchronous replication to each zone’s persistent disk, all writes made to the primary instance are also made to the standby instance (see diagram below).
In the event of an instance or zonal failure, this configuration provides a graceful failover to the standby instance, meaning your data continues to be available to client applications. The IP of the instance does not change, so there is no need to update the application or make any changes.
It’s pretty slick! But if the entire region goes down, it won’t prevent you from being impacted.
Enabling Cross-Region High Availability
Cloud SQL’s native High Availability can be “upgraded” to Cross-Region High Availability simply by provisioning a read replica in another region. It is this replica that provides the capability to recover in the region it resides in.
In order to have the ability to fail over to a different GCP Region in a Disaster Recovery scenario, at least one Cloud SQL read replica must be provisioned in a separate GCP Region prior to an incident (see diagram below).
To configure an instance with Cross-Region High Availability when creating it, follow the steps provided below:
- Enable High Availability: be sure to set your backup windows and maintenance windows during this step as well.
- Create a replica in a different region: by default, the replica is created in the same region as the primary instance, so be sure to specify the new region of your choice when creating the replica. You can also create read replicas within your primary region if you need to offload read requests during normal operations. (It is a best practice to have no more than 5 replicas for a single instance.)
- Set up automated alerts: configure Cloud Monitoring alerts on instances to notify owners when instances shut down or are offline.
For existing instances, enable High Availability on the instance (instructions here) and then continue with steps 2 and 3 above.
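The setup above can be sketched with the gcloud CLI. This is a minimal sketch, not a definitive implementation: the instance name, regions, database version, tier, and backup window below are all hypothetical placeholders, and the commands are echoed rather than executed so you can review them first.

```shell
# Sketch only: names, regions, tier, and backup window are hypothetical placeholders.
PRIMARY="orders-db"            # hypothetical primary instance name
PRIMARY_REGION="us-central1"   # hypothetical primary region
DR_REGION="us-east1"           # hypothetical DR region for the cross-region replica

# Step 1: create the primary with High Availability (REGIONAL availability type),
# automated backups, and point-in-time recovery:
CREATE_PRIMARY="gcloud sql instances create ${PRIMARY} \
  --database-version=POSTGRES_14 --tier=db-custom-2-7680 \
  --region=${PRIMARY_REGION} --availability-type=REGIONAL \
  --backup-start-time=02:00 --enable-point-in-time-recovery"

# Step 2: create a read replica in a different region (this is what "upgrades"
# the instance to Cross-Region High Availability):
CREATE_REPLICA="gcloud sql instances create ${PRIMARY}-dr-replica \
  --master-instance-name=${PRIMARY} --region=${DR_REGION}"

# Commands are echoed here for illustration; run them directly in a real setup.
echo "${CREATE_PRIMARY}"
echo "${CREATE_REPLICA}"
```

Note that the replica inherits its configuration from the primary, so only the replica's name, source instance, and target region need to be specified.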
Deciding when to conduct a Cross-Region Failover
The decision of whether and when to fail over during a DR event should be driven by Recovery Time Objective (RTO) and Recovery Point Objective (RPO) service level objectives. The RTO represents the acceptable downtime for a given system in a specified period. The RPO represents the acceptable amount of data loss within that same period. Each application, or class of applications, should have its own Service Level Objectives (SLOs) for these two metrics, and those SLOs should serve as the trigger for whether and when to conduct a cross-region failover.
Metrics for both SLOs can be captured and observed in Cloud Monitoring using downtime alerts and replication lag metrics.
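As a rough sketch of how the RPO side of this could be monitored, the snippet below builds an alerting policy on Cloud SQL's replication lag metric and shows the command to create it. The policy structure is a simplified, hypothetical example (the display names, 300-second threshold, and duration are placeholders you would tune to your own SLOs), and the gcloud command is echoed rather than executed:

```shell
# Sketch only: display names, threshold, and duration are hypothetical placeholders.
# The policy watches Cloud SQL's replication lag metric as a proxy for RPO risk.
cat > replica-lag-policy.json <<'EOF'
{
  "displayName": "Cloud SQL replica lag exceeds RPO budget",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "replica_lag above 300s for 5 minutes",
      "conditionThreshold": {
        "filter": "metric.type = \"cloudsql.googleapis.com/database/replication/replica_lag\" AND resource.type = \"cloudsql_database\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 300,
        "duration": "300s"
      }
    }
  ]
}
EOF

# Create the alerting policy from the file (echoed for illustration):
echo "gcloud alpha monitoring policies create --policy-from-file=replica-lag-policy.json"
```

A similar policy on the `cloudsql.googleapis.com/database/up` metric can cover the downtime (RTO) side.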
But what happens when one of these thresholds has been reached while the other has not? A good example would be hitting your RTO threshold (say, 1 hour of downtime) while failing over immediately would violate your RPO (say, zero data loss).
To deal with this possibility, decide in advance, based on business requirements and technical capability, which of these metrics outweighs the other in your failover decision criteria.
Orchestrating a Cross-Region Failover
Conducting a Cross-Region failover involves taking the following actions in the order they are presented below. Unfortunately, this is a fairly cumbersome process — one that when conducted under the pressure of an outage could result in costly errors.
So, to avoid this, I’ve automated all of it into a simple-to-deploy bash script! You can read how to deploy the script here. My testing with the script has shown that it can enable application recovery (steps 1–3 below) in less than 10 minutes!
Below are the steps to manually conduct a Cross-Region failover. I’ve added diagrams after every step to visualize what is actually happening in GCP.
1. Promote the Read Replica in the DR Region: promote the read replica in the target GCP Region to a standalone instance via the steps here. Note: This is an irreversible action — once promoted, an instance cannot be converted back into a replica.
2. Configure the newly promoted instance: for stability purposes, it is important to configure this instance with High Availability, automated backups, and point-in-time recovery. You can do this by following the steps here. Note: This will cause the instance to restart.
3. Connect consuming applications to the new instance: using the instance’s Connection Name and IP address, reconnect consuming applications to the newly promoted instance.
At this point, your applications have recovered and can begin serving traffic again!
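Steps 1–3 above could be sketched with gcloud as follows. This is a hedged sketch, not the article's script itself: the instance name is a hypothetical placeholder, and the commands are echoed rather than executed so they can be reviewed before running in a real incident.

```shell
# Sketch only: the instance name is a hypothetical placeholder.
DR_REPLICA="orders-db-dr-replica"   # the read replica in the DR region

# Step 1: promote the DR replica to a standalone instance (irreversible):
PROMOTE_CMD="gcloud sql instances promote-replica ${DR_REPLICA}"

# Step 2: re-enable HA, automated backups, and point-in-time recovery
# (this patch restarts the instance):
PATCH_CMD="gcloud sql instances patch ${DR_REPLICA} \
  --availability-type=REGIONAL \
  --backup-start-time=02:00 \
  --enable-point-in-time-recovery"

# Step 3: fetch the connection name and IP address so applications can reconnect:
DESCRIBE_CMD="gcloud sql instances describe ${DR_REPLICA} \
  --format='value(connectionName, ipAddresses[0].ipAddress)'"

# Commands are echoed here for illustration; run them directly in a real failover.
echo "${PROMOTE_CMD}"
echo "${PATCH_CMD}"
echo "${DESCRIBE_CMD}"
```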
4. Replace the old primary instance and its replicas in the Primary Region: Create a read replica in the original GCP Zone that the primary instance was located in — this enables an eventual failover back to your primary GCP Region. It is also recommended to replace other replicas that were provisioned. The script will automatically identify and replace replicas.
5. Delete the old replicas and primary instance: To avoid instance sprawl and unnecessary costs, delete all replicas of the old primary instance. Once the replicas are deleted, the old primary instance can be deleted as well. Note: If you prefer not to delete instances during the failover process, the sqlFailoverSansDeletion.sh script skips the deletion step.
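Steps 4 and 5 could likewise be sketched with gcloud. Again, all instance names and the region below are hypothetical placeholders, and the commands are echoed rather than executed; note that the replacement replica's source is the newly promoted instance, not the old primary.

```shell
# Sketch only: instance names and region are hypothetical placeholders.
NEW_PRIMARY="orders-db-dr-replica"    # the instance promoted during failover
OLD_PRIMARY="orders-db"               # the old primary in the impacted region
PRIMARY_REGION="us-central1"          # the original primary region

# Step 4: create a read replica of the new primary back in the original region,
# which is what enables an eventual failback:
FAILBACK_CMD="gcloud sql instances create ${OLD_PRIMARY}-failback \
  --master-instance-name=${NEW_PRIMARY} --region=${PRIMARY_REGION}"

# Step 5: delete the old primary's remaining replicas first, then the old
# primary itself (a hypothetical in-region read replica is shown here):
DELETE_REPLICA_CMD="gcloud sql instances delete ${OLD_PRIMARY}-read-replica-1 --quiet"
DELETE_PRIMARY_CMD="gcloud sql instances delete ${OLD_PRIMARY} --quiet"

echo "${FAILBACK_CMD}"
echo "${DELETE_REPLICA_CMD}"
echo "${DELETE_PRIMARY_CMD}"
```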
Failing back to your Primary Region
Once the regional issue is resolved, migrating back to your primary GCP Region is as simple as rerunning the steps above manually or via the script. In this scenario, instead of promoting a DR replica, you will be promoting the replica in your Primary GCP Zone — this is why it was important to replace that primary instance with a replica during the DR failover.
The script is parameterized — it allows you to specify any primary instance and a failover target (any replica of the primary instance). Thus, you can use the same script to fail back over to your primary region.
Note: It is recommended that this failover be conducted during a controlled maintenance window since it is no longer an emergency situation.
Conclusion
Although Google Cloud SQL offers a fully managed High Availability option, this feature is limited to the GCP Region that the instance resides in. For mission-critical applications, a higher level of resiliency that takes advantage of the global nature of Google Cloud Platform is recommended.
With a little forethought, and use of the automation script provided in this article, an organization can enable its mission-critical applications using Cloud SQL to fully recover from a regional incident in 10 minutes or less.