Cloud SQL: Recovering from a regional failure in 10 minutes or less (MySQL & PostgreSQL)
UPDATE: You can now conduct a cross-region failover in 2 minutes or less! Check out my update to this article here, which details how to use new Cloud SQL features to achieve this recovery time!
Introduction
Google Cloud SQL natively provides a fully managed High Availability service that, when enabled on an instance, provides data redundancy across more than one zone in a single region. This means that if your primary instance goes down, Cloud SQL will automatically fail over to your standby instance without any intervention, and your applications will continue to function as intended.
But what happens if the entire region goes down, taking both your primary and standby instances with it?!
Don’t get me wrong: regional failures are rare in GCP, but they are possible and they do happen on occasion. Even more likely than a regional failure is a service failure, a situation where a specific service experiences an issue in a region while the region itself remains up.
To protect against this type of failure, a higher level of redundancy is required: one that spans more than a single Google Cloud Region. This configuration is called Cross-Region High Availability.
This article details how to enable Cross-Region High Availability, how to know when to utilize it, and how to orchestrate a cross-region failover. I’ve also provided an automation script that guides you through failover steps, allowing you to recover from a regional incident in less than 10 minutes. Give it a try!
Cloud SQL’s Native High Availability
With a High Availability Cloud SQL instance, the configuration is such that a primary instance exists in one GCP Zone and a standby instance exists in another. Through synchronous replication to each zone’s persistent disk, all writes made to the primary instance are also made to the standby instance (see diagram below).
In the event of an instance or zonal failure, this configuration provides a graceful failover to the standby instance, meaning your data continues to be available to client applications. The IP of the instance does not change, so there is no need to update the application or make any changes.
It’s pretty slick! But if the entire region goes down, it won’t prevent you from being impacted.
Enabling Cross-Region High Availability
Cloud SQL’s native High Availability can be “upgraded” to Cross-Region High Availability simply by provisioning a read replica in another region. It is this replica that provides the capability to recover in the region it resides in.
In order to have the ability to fail over to a different GCP Region in a Disaster Recovery scenario, at least one Cloud SQL read replica must be provisioned in a separate GCP Region prior to an incident (see diagram below).
To configure an instance with Cross-Region High Availability when creating it, follow the steps provided below:
- Enable High Availability: be sure to set your backup windows and maintenance windows during this step as well.
- Create a replica in a different region: by default, the replica is created in the same region as the primary instance, so be sure to specify the new region of your choice when creating the replica. You can also create read replicas within your primary region if you need to offload read requests during normal operations. (It is a best practice to have no more than 5 replicas for a single instance.)
- Set up automated alerts: configure Cloud Monitoring alerts on instances to notify owners when instances shut down or are offline.
For existing instances, enable High Availability on the instance (instructions here) and then continue with steps 2 and 3 above.
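The setup above can be sketched with the gcloud CLI. This is a minimal sketch, not a definitive implementation: the instance name, regions, database version, tier, and backup window below are all hypothetical placeholders, and the commands are echoed rather than executed so you can review them first.

```shell
# Sketch only: names, regions, tier, and backup window are hypothetical placeholders.
PRIMARY="orders-db"            # hypothetical primary instance name
PRIMARY_REGION="us-central1"   # hypothetical primary region
DR_REGION="us-east1"           # hypothetical DR region for the cross-region replica

# Step 1: create the primary with High Availability (REGIONAL availability type),
# automated backups, and point-in-time recovery:
CREATE_PRIMARY="gcloud sql instances create ${PRIMARY} \
  --database-version=POSTGRES_14 --tier=db-custom-2-7680 \
  --region=${PRIMARY_REGION} --availability-type=REGIONAL \
  --backup-start-time=02:00 --enable-point-in-time-recovery"

# Step 2: create a read replica in a different region (this is what "upgrades"
# the instance to Cross-Region High Availability):
CREATE_REPLICA="gcloud sql instances create ${PRIMARY}-dr-replica \
  --master-instance-name=${PRIMARY} --region=${DR_REGION}"

# Commands are echoed here for illustration; run them directly in a real setup.
echo "${CREATE_PRIMARY}"
echo "${CREATE_REPLICA}"
```

Note that the replica inherits its configuration from the primary, so only the replica's name, source instance, and target region need to be specified.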
Deciding when to conduct a Cross-Region Failover
The decision of whether and when to fail over during a DR event should be driven by Recovery Time Objective (RTO) and Recovery Point Objective (RPO) service level objectives. The RTO represents the acceptable downtime for a given system in a specified period. The RPO represents the acceptable amount of data loss within that same period. Each application, or class of applications, should have its own Service Level Objectives (SLOs) for these two metrics, and those SLOs should serve as the trigger for whether and when to conduct a cross-region failover.
Metrics for both SLOs can be captured and observed in Cloud Monitoring using downtime alerts and replication lag metrics.
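As a rough sketch of how the RPO side of this could be monitored, the snippet below builds an alerting policy on Cloud SQL's replication lag metric and shows the command to create it. The policy structure is a simplified, hypothetical example (the display names, 300-second threshold, and duration are placeholders you would tune to your own SLOs), and the gcloud command is echoed rather than executed:

```shell
# Sketch only: display names, threshold, and duration are hypothetical placeholders.
# The policy watches Cloud SQL's replication lag metric as a proxy for RPO risk.
cat > replica-lag-policy.json <<'EOF'
{
  "displayName": "Cloud SQL replica lag exceeds RPO budget",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "replica_lag above 300s for 5 minutes",
      "conditionThreshold": {
        "filter": "metric.type = \"cloudsql.googleapis.com/database/replication/replica_lag\" AND resource.type = \"cloudsql_database\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 300,
        "duration": "300s"
      }
    }
  ]
}
EOF

# Create the alerting policy from the file (echoed for illustration):
echo "gcloud alpha monitoring policies create --policy-from-file=replica-lag-policy.json"
```

A similar policy on the `cloudsql.googleapis.com/database/up` metric can cover the downtime (RTO) side.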
But what happens when one of these thresholds has been reached while the other has not? A good example would be hitting your RTO threshold (say, 1 hour of downtime) while failing over immediately would violate your RPO (say, zero data loss).
To deal with this possibility, decide in advance, based on business requirements and technical capability, which of these metrics outweighs the other in your failover decision criteria.
Orchestrating a Cross-Region Failover
Conducting a Cross-Region failover involves taking the following actions in the order they are presented below. Unfortunately, this is a fairly cumbersome process — one that when conducted under the pressure of an outage could result in costly errors.
So, to avoid this, I’ve automated all of it into a simple-to-deploy bash script! You can read how to deploy the script here. My testing with the script has shown that it can enable application recovery (steps 1–3 below) in less than 10 minutes!
Below are the steps to manually conduct a Cross-Region failover. I’ve added diagrams after every step to visualize what is actually happening in GCP.
1. Promote the Read Replica in the DR Region: promote the read replica in the target GCP Region to a standalone instance via the steps here. Note: This is an irreversible action — once promoted, an instance cannot be converted back into a replica.
2. Configure the newly promoted instance: for stability purposes, it is important to configure this instance with High Availability, automated backups, and point-in-time recovery. You can do this by following the steps here. Note: This will cause the instance to restart.
3. Connect consuming applications to the new instance: using the instance’s Connection Name and IP address, reconnect consuming applications to the newly promoted instance.
At this point, your applications have recovered and can begin serving traffic again!
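Steps 1–3 above could be sketched with gcloud as follows. This is a hedged sketch, not the article's script itself: the instance name is a hypothetical placeholder, and the commands are echoed rather than executed so they can be reviewed before running in a real incident.

```shell
# Sketch only: the instance name is a hypothetical placeholder.
DR_REPLICA="orders-db-dr-replica"   # the read replica in the DR region

# Step 1: promote the DR replica to a standalone instance (irreversible):
PROMOTE_CMD="gcloud sql instances promote-replica ${DR_REPLICA}"

# Step 2: re-enable HA, automated backups, and point-in-time recovery
# (this patch restarts the instance):
PATCH_CMD="gcloud sql instances patch ${DR_REPLICA} \
  --availability-type=REGIONAL \
  --backup-start-time=02:00 \
  --enable-point-in-time-recovery"

# Step 3: fetch the connection name and IP address so applications can reconnect:
DESCRIBE_CMD="gcloud sql instances describe ${DR_REPLICA} \
  --format='value(connectionName, ipAddresses[0].ipAddress)'"

# Commands are echoed here for illustration; run them directly in a real failover.
echo "${PROMOTE_CMD}"
echo "${PATCH_CMD}"
echo "${DESCRIBE_CMD}"
```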
4. Replace the old primary instance and its replicas in the Primary Region: Create a read replica in the original GCP Zone that the primary instance was located in — this enables an eventual failover back to your primary GCP Region. It is also recommended to replace other replicas that were provisioned. The script will automatically identify and replace replicas.
5. Delete the old replicas and primary instance: To avoid instance sprawl and unnecessary costs, delete all replicas of the old primary instance. Once the replicas are deleted, the old primary instance can be deleted as well. Note: If you prefer not to delete instances during the failover process, the sqlFailoverSansDeletion.sh script skips the deletion step.
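Steps 4 and 5 could likewise be sketched with gcloud. Again, all instance names and the region below are hypothetical placeholders, and the commands are echoed rather than executed; note that the replacement replica's source is the newly promoted instance, not the old primary.

```shell
# Sketch only: instance names and region are hypothetical placeholders.
NEW_PRIMARY="orders-db-dr-replica"    # the instance promoted during failover
OLD_PRIMARY="orders-db"               # the old primary in the impacted region
PRIMARY_REGION="us-central1"          # the original primary region

# Step 4: create a read replica of the new primary back in the original region,
# which is what enables an eventual failback:
FAILBACK_CMD="gcloud sql instances create ${OLD_PRIMARY}-failback \
  --master-instance-name=${NEW_PRIMARY} --region=${PRIMARY_REGION}"

# Step 5: delete the old primary's remaining replicas first, then the old
# primary itself (a hypothetical in-region read replica is shown here):
DELETE_REPLICA_CMD="gcloud sql instances delete ${OLD_PRIMARY}-read-replica-1 --quiet"
DELETE_PRIMARY_CMD="gcloud sql instances delete ${OLD_PRIMARY} --quiet"

echo "${FAILBACK_CMD}"
echo "${DELETE_REPLICA_CMD}"
echo "${DELETE_PRIMARY_CMD}"
```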
Failing back to your Primary Region
Once the regional issue is resolved, migrating back to your primary GCP Region is as simple as rerunning the steps above manually or via the script. In this scenario, instead of promoting a DR replica, you will be promoting the replica in your Primary GCP Zone — this is why it was important to replace that primary instance with a replica during the DR failover.
The script is parameterized — it allows you to specify any primary instance and a failover target (any replica of the primary instance). Thus, you can use the same script to fail back over to your primary region.
Note: It is recommended that this failover be conducted during a controlled maintenance window since it is no longer an emergency situation.
Conclusion
Although Google Cloud SQL offers a fully managed High Availability option, this feature is limited to the GCP Region that the instance resides in. For mission-critical applications, a higher level of resiliency that takes advantage of the global nature of Google Cloud Platform is recommended.
With a little forethought, and use of the automation script provided in this article, an organization can enable its mission-critical applications using Cloud SQL to fully recover from a regional incident in 10 minutes or less.