Cloud SQL: Cross-Region HA just got easier… and a whole lot faster!

Published in

Google Cloud - Community

8 min read · Sep 29, 2022

TL;DR

This article serves as an update to my April 2020 post. It describes how to incorporate 2 new Cloud SQL features, HA Replicas and Cascading Replication, to better architect for Cross-Region High Availability. I also provide an updated automation script to facilitate the failover process during an incident. My testing suggests that these updates will enable you to recover from a regional incident 5X faster than my previous approach — in other words, in 2 minutes or less! Give it a try!

Introduction

In April 2020, I wrote an article explaining how to architect Cloud SQL instances for cross-region high availability, and how to orchestrate a cross-region failover in an automated and controlled fashion. My original claim was that with the right Cloud SQL architecture and automation in place, you could fully recover from a regional outage in a different GCP region within 10 minutes of starting cutover procedures.

However, Google recently, and quietly, rolled out 2 powerful new Cloud SQL features that have invalidated my article’s claim. These new features further simplify the cross-region failover process, and my testing suggests that they reduce failover time from 10 minutes to just 2 minutes or less!

Note: This article is focused mostly on the new features and updated cross-region architecture, as it compares to my previous blueprint. Please read my former article to understand the limitations of Cloud SQL’s native high availability and why cross-region HA is important, and for a framework for assessing when to actually initiate a cross-region failover.

Quickly Revisiting Cloud SQL’s Native High Availability

In my previous article, I provided a detailed overview of Cloud SQL’s high availability features and architecture. To recap:

Yes: Cloud SQL does have native high availability features. Cloud SQL provides an option to deploy a highly available instance at creation time. Google will manage the replication and failover of that instance without any interaction from you or impact to your applications. Pretty sweet! You can see how this failover works in Google’s documentation here.
But: that high availability architecture protects against zonal failures, not regional ones, because the primary and standby instances sit in different zones of the same region. This means if an entire region is affected by an outage, both instances of your database are down with it.

Thus, for mission-critical Cloud SQL workloads, I recommend that you also architect for the ability to fail over across GCP regions. Regional outages don’t occur often, but when they do, you’ll be up and running in minutes while your competitors are sitting idle.

The Killer New Features Improving Cross-Region Failover

In the summer of 2022, Cloud SQL quietly launched 2 incredibly important features for improving high availability for MySQL and PostgreSQL workloads: HA Replicas and Cascading Replication.

Killer Feature #1 — HA Replicas

This feature lets you configure your Cloud SQL read replicas as highly available instances. As with a Primary HA instance, Google will manage the zonal replication and failover of the replica without changing its IP.

Why this matters: After a DR failover to another region, the new Primary instance, which is a promoted replica, needs to be enabled for high availability to match the original cluster topology. This process alone accounted for 5–6 minutes of the 10 minute failover time in my original article! By configuring your failover replica with HA prior to failover, you eliminate this process, and downtime, from your failover procedure.
How to use the feature: You can enable this setting on the read replica, during or after creation, in the Cloud Console the same way you would for a Primary instance.
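If you prefer the command line to the Cloud Console, the same change can likely be made with gcloud. Below is a minimal sketch, assuming the `--availability-type=REGIONAL` flag of `gcloud sql instances patch` applies to read replicas the same way it does to primary instances. The helper name, the instance name "dr-replica", and the project name "my-project" are placeholders of mine, not anything from Google's tooling. The function only builds the command so you can review it before running it.

```python
from typing import List


def enable_ha_on_replica_cmd(replica_name: str, project: str) -> List[str]:
    """Build (but do not run) a gcloud command that switches an existing
    read replica to a highly available (regional) configuration."""
    return [
        "gcloud", "sql", "instances", "patch", replica_name,
        # REGIONAL = primary plus standby in different zones of one region.
        "--availability-type=REGIONAL",
        f"--project={project}",
    ]


if __name__ == "__main__":
    # "dr-replica" and "my-project" are hypothetical names for illustration.
    print(" ".join(enable_ha_on_replica_cmd("dr-replica", "my-project")))
```

Patching an instance can trigger a restart, so this is best done ahead of time rather than during an incident.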

Comparing legacy replicas vs HA Replicas

Killer Feature #2 — Cascading Replication

This feature lets you add read replicas to any existing replicas, essentially letting one read replica act as a source of replication for another read replica. Said differently, it allows replicas to be daisy chained together for replication purposes.

Why this matters: Prior to this feature, all read replicas would be dependent on the Primary Instance for replication. Thus, after conducting a regional failover, which orphans the original Primary instance, any additional read replicas beyond your cross-region replica would need to be recreated, and any applications accessing those replicas would need to be updated with new connection strings. Now you can configure your downstream replicas to “cascade” off of your cross-region read replica, so replication remains intact after a cross-region failover.
How to use the feature: You can create a Cascading Replica by navigating to your cross-region DR Read Replica in the Cloud Console, selecting “Read Replicas” on the left hand menu, and choosing “Create Read Replica.”
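The Console steps above have a rough gcloud equivalent: creating a replica whose source is another replica rather than the primary. Here is a minimal sketch, assuming `gcloud sql instances create` with `--master-instance-name` pointing at the DR replica. The helper and instance names are hypothetical, and there may be prerequisites on the parent replica (for example, binary logging for MySQL), so check the current documentation before relying on it.

```python
from typing import List


def create_cascading_replica_cmd(
    new_replica: str, parent_replica: str, region: str
) -> List[str]:
    """Build (but do not run) a gcloud command that creates a read replica
    whose replication source is another replica (a cascading replica)."""
    return [
        "gcloud", "sql", "instances", "create", new_replica,
        # The "master" here is the cross-region DR replica, not the primary.
        f"--master-instance-name={parent_replica}",
        f"--region={region}",
    ]


if __name__ == "__main__":
    # All names below are placeholders for illustration.
    cmd = create_cascading_replica_cmd("read-replica", "dr-replica", "us-east1")
    print(" ".join(cmd))
```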

Comparing the impact of cross-region failover on downstream replicas using Traditional Replicas vs Cascading Replicas

Incorporating These New Features Into An Existing Architecture

When I wrote my original article, I used a common Cloud SQL architecture with the following components:

an HA-enabled Cloud SQL Primary instance for read/write
a read replica in a separate region for downstream read-only applications (note: this replica could be in the DR region, as shown, or in a separate region, but it cannot reside back in the original primary region due to cascading replication limitations.)
a cross-region read replica for Disaster Recovery

Conveniently, the new Cloud SQL features can be incorporated without any major changes to the architecture. Here’s a look at the two architectures side by side — take note of the subtle differences.

Legacy Cross-Region DR Architecture for Cloud SQL (left) vs New DR Architecture with HA and Cascading Replicas (right)

To achieve the updated architecture on the right, I’ve simply done the following:

I updated the existing DR Read Replica to be a Highly Available instance (using the HA Read Replica feature).
I recreated the read replica for read-only applications as a Cascading Replica of the DR Read Replica — notice the change in the way replication flows in the new architecture versus the old one.
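To make the change in replication flow concrete, here is a small illustrative model (not part of any Cloud SQL API). Each instance maps to the instance it replicates from, and a helper works out which replicas lose their source when the primary is lost and the DR replica is promoted. The instance names and the function are my own placeholders.

```python
from typing import Dict, List, Optional

# Each instance maps to its replication source (None means it is the primary).
Topology = Dict[str, Optional[str]]

LEGACY: Topology = {
    "primary": None,
    "dr-replica": "primary",
    "read-replica": "primary",      # replicates straight from the primary
}
UPDATED: Topology = {
    "primary": None,
    "dr-replica": "primary",
    "read-replica": "dr-replica",   # cascades off the DR replica
}


def replicas_to_recreate(
    topology: Topology, failed_primary: str, promoted: str
) -> List[str]:
    """Return replicas whose replication chain reaches the failed primary
    without passing through the promoted instance."""
    orphaned = []
    for name in topology:
        if name in (failed_primary, promoted):
            continue
        source = topology[name]
        # Walk up the chain until we hit the promoted instance or the top.
        while source is not None and source != promoted:
            if source == failed_primary:
                orphaned.append(name)
                break
            source = topology[source]
    return orphaned


if __name__ == "__main__":
    print(replicas_to_recreate(LEGACY, "primary", "dr-replica"))
    print(replicas_to_recreate(UPDATED, "primary", "dr-replica"))
```

In the legacy layout the read replica ends up in the list to recreate, while in the updated layout the list is empty, which is exactly the difference the two diagrams show.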

The New Features in Action

You’re probably wondering what this process actually looks like and how these new features speed up recovery — let’s walk through a regional failover scenario to reveal the answer!

As a baseline, let’s start with familiarizing ourselves with how the cross-region failover process worked prior to HA and Cascading replicas:

Region 1 is affected by a Cloud SQL Outage which takes the primary instance offline.
During failover, the DR replica is promoted to Primary, which severs the replication from the original primary instance — (1–2 minutes).
The DR instance is then configured for High Availability, which requires a restart — (5–6 minutes).
Once complete, the Read/Write applications are updated to connect to the DR instance, completing our cutover.
Since the original Primary Instance is no longer the source for new writes, a new read replica connected to the DR instance must be created for our read-only applications — (3–5 minutes). Once complete, read only applications are updated to query the new replica.

Total Recovery Time: 9–13 minutes

Now that we have a baseline, let’s see how incorporating HA and Cascading replicas into our architecture can greatly simplify our cutover and improve our recovery time. Below is the same scenario as above, but using our updated architecture:

Region 1 is affected by a Cloud SQL Outage which takes the primary instance offline.
During failover, the DR replica is promoted to Primary, which severs the replication from the original primary instance — (1–2 minutes).
Since the DR instance is already configured for high availability via the HA Replica feature, there is no need to upgrade and restart it. Thus, Read/Write applications can be connected to the DR instance immediately after promotion.

Lastly, since the read replica for our read-only applications was already configured as a Cascading Replica of the DR instance, it is already using that instance as its source for replication. Thus, the replication flow is uninterrupted by the promotion of the DR instance to primary: the replica will continue to replicate writes made to the new primary after failover without intervention.

Total Recovery Time: 1–2 minutes
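The two totals are just the sums of the per-step estimates listed in each walkthrough, assuming the steps run one after another. A quick sketch of that arithmetic, using the step durations quoted above (the step labels are my own shorthand):

```python
from typing import List, Tuple

# (step label, minimum minutes, maximum minutes)
Step = Tuple[str, int, int]

LEGACY_STEPS: List[Step] = [
    ("promote DR replica", 1, 2),
    ("enable HA on promoted instance", 5, 6),
    ("recreate read replica", 3, 5),
]
UPDATED_STEPS: List[Step] = [
    ("promote DR replica", 1, 2),
]


def total_range(steps: List[Step]) -> Tuple[int, int]:
    """Sum the best-case and worst-case durations of sequential steps."""
    return (sum(lo for _, lo, _ in steps), sum(hi for _, _, hi in steps))


if __name__ == "__main__":
    print(total_range(LEGACY_STEPS))   # (9, 13)
    print(total_range(UPDATED_STEPS))  # (1, 2)
```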

Don’t Forget About Automation!

In addition to incorporating the new Cloud SQL features, you must have some automation in place to run the cutover efficiently if you want to hit the benchmark of 2 minutes or less. Outages are stressful situations. Automation will not only expedite the failover process, it will also help prevent costly mistakes.

At a minimum, this automation should handle promoting the DR instance, providing its connection details, and recreating a replica in the primary region for when you want to move back. However, your automation could also incorporate your RTO and RPO policies to automatically trigger failovers and then leverage your application pipelines to automatically update connection strings. That would be extra sweet!

Luckily for you, I’ve written an automation script that you can use as a starting point! I’ve also included a second script to help you fail back to your primary region once you’re ready to do so. I highly recommend that you test these scripts often and run practice scenarios in test environments to improve your execution and recovery time.
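The scripts themselves are not reproduced here. As a rough illustration of the three minimum steps described above, here is a sketch that shells out to gcloud. It assumes `gcloud sql instances promote-replica`, `describe --format=json`, and `create --master-instance-name`; the function names, instance names, and the injectable `run` parameter are my own placeholders rather than the contents of the actual scripts.

```python
import json
import subprocess
from typing import Callable, Dict, List

Runner = Callable[[List[str]], str]


def run_gcloud(args: List[str]) -> str:
    """Run a gcloud command and return its stdout; raises on a non-zero exit."""
    result = subprocess.run(
        ["gcloud", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def fail_over(
    dr_instance: str,
    failback_replica: str,
    original_region: str,
    run: Runner = run_gcloud,
) -> Dict[str, object]:
    """Promote the DR replica, collect its connection details, and request
    a new replica in the original region for a later fail back."""
    # 1. Promote the DR replica to a standalone primary (--quiet skips the prompt).
    run(["sql", "instances", "promote-replica", dr_instance, "--quiet"])

    # 2. Fetch connection details to hand to the application teams.
    info = json.loads(
        run(["sql", "instances", "describe", dr_instance, "--format=json"])
    )
    details = {
        "connection_name": info.get("connectionName"),
        "ip_addresses": [a.get("ipAddress") for a in info.get("ipAddresses", [])],
    }

    # 3. Recreate a replica in the original region, sourced from the new primary.
    run([
        "sql", "instances", "create", failback_replica,
        f"--master-instance-name={dr_instance}",
        f"--region={original_region}",
    ])
    return details
```

Passing the runner in as a parameter lets you rehearse the flow in a test environment with a stub before pointing it at real instances. Step 3 may fail while the original region is still unavailable, so in practice you might run it separately once the region recovers.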

Conclusion

Cloud SQL recently rolled out 2 exciting new features for MySQL and Postgres that make cross-region failover significantly faster and easier! By incorporating these features, HA Replicas and Cascading Replicas, into your Cloud SQL architecture and adding a little automation, you can now reduce your failover time in a regional disaster to just 2 minutes or less.

I highly recommend that you revisit your architectures and failover plans for your high priority Cloud SQL workloads to incorporate these 2 new features and my updated automation scripts, so that you can ensure you’ll be up and running when your customers need you and while your competitors are sitting idle.