Implementing Disaster Recovery on Snowflake

Any business-critical function in your enterprise needs a strategy for ensuring that operations can continue following an unforeseen event. This strategy is a combination of people, process, and technology and is commonly referred to as a Disaster Recovery (DR) or Business Continuity plan. The events in question can be the result of human error, cyber attacks, or any natural or man-made disaster. This is a broad area of study with many complexities and dimensions, but from a technology perspective it is crucial that any DR plan be as simple as possible and require a minimal amount of manual intervention. Disasters, by their very nature, are unpredictable, chaotic events, and we cannot assume that the expertise needed to recover will be available when needed.

The purpose of this article is to describe how Snowflake effectively supports DR scenarios and how cost and complexity are removed from the process.

For background, it’s important to distinguish between High Availability (HA) and Disaster Recovery (DR), as they are often considered together (HA/DR). HA is typically viewed as the set of processes and technology features that ensure uninterrupted, normal operation to a specified level of service. These can include internal redundancies and failover mechanisms intended to help recover quickly from localized or isolated failures. This contrasts with DR, which represents a set of technologies and processes designed to help systems recover from more widespread and high-impact failures. In short, HA tries to shield normal operations from the impact of failures, while DR focuses on recovering from failures of a greater scale.

There are numerous HA scenarios that are handled automatically within Snowflake and its underlying infrastructure that do not rise to the level of a true disaster. These are mainly local failures that can be handled automatically as a function of the database or through built-in features provided by the cloud provider. Examples include the failure of individual compute nodes in a virtual warehouse, which Snowflake replaces transparently, and the loss of a single availability zone, which is absorbed because the underlying cloud storage is replicated across multiple availability zones within a region.

Disaster Recovery plans come into play when widespread failures occur that impact an entire region, multiple regions, or (in an extreme case) an entire cloud provider. In all cases the goals are the same: return to normal operations as soon as possible to minimize the impact to critical business functions. In some industries, particularly highly regulated ones, a DR plan must also consider the possibility that an entire cloud provider is lost or becomes unavailable.

Requirements for Disaster Recovery solutions are often expressed in terms of RPO and RTO. RPO (Recovery Point Objective) refers to the state of the data at the time of system recovery and determines the level of data loss that is acceptable to the business. It is a function of how frequently changes are copied or replicated to an alternate location: the changes made between the last replication and the disaster event are the ones at risk of loss. In a perfect world, changes would be applied to two or more locations synchronously for an RPO of zero, resulting in no data loss and no additional processing upon recovery. However, this is largely impractical due to the operational and performance burden it would place on the primary system, since every change would have to wait until it was committed to all systems. A better approach is to use asynchronous replication, where the RPO is set to an acceptable but practical interval at which replication can be executed (e.g., every hour). Changes made between the last recovery point and the outage must then be reapplied when service is restored.

The second requirement is RTO, or Recovery Time Objective. This is the time interval between the failure and the point at which the system returns to normal operations, including the availability of data in its current state. RTO is driven by how long it takes the DR system to assume operational readiness, and it represents the outage window as perceived by the business community.

The two concepts of RPO and RTO are summarized in the following graphic:

(figure 1)

To illustrate how DR works in Snowflake, let’s take two Snowflake accounts. We’ll call one Primary (AcctA) and one Secondary (AcctB). These accounts should exist in separate cloud regions and can even exist in different cloud providers, as long as they belong to the same organization. An organization is a set of Snowflake accounts that can be managed from a central point of control, both for cost management and for replication/failover. An organization is administered through the ORGADMIN role, which is also used to enable replication for accounts within the organization. The procedures for enabling replication are described here:
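As a minimal sketch (the identifiers myorg, AcctA, and AcctB are hypothetical placeholders), enabling replication for the two accounts with the ORGADMIN role might look like this:

-- Run as ORGADMIN in any account of the organization
USE ROLE ORGADMIN;
SELECT SYSTEM$GLOBAL_ACCOUNT_SET_PARAMETER('myorg.AcctA', 'ENABLE_ACCOUNT_DATABASE_REPLICATION', 'true');
SELECT SYSTEM$GLOBAL_ACCOUNT_SET_PARAMETER('myorg.AcctB', 'ENABLE_ACCOUNT_DATABASE_REPLICATION', 'true');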

During normal operation, the Primary account handles the processing of the critical workloads and houses all of our data. Periodically, at an interval determined by our RPO (Recovery Point Objective), we replicate changes from the Primary to the Secondary account. The setup can be accomplished via the Snowsight UI or with a few simple SQL commands as shown in figure 2. In addition to replication, business functions and applications need to be redirected to the Secondary account when a disaster event is declared. This occurs by way of the Failover and Client Redirect features.
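As a sketch of what that setup SQL might look like (SnowDB, myorg, AcctA, and AcctB are hypothetical names, and enabling failover requires the Business Critical edition or higher):

-- On the Primary account (AcctA): allow SnowDB to be replicated and failed over to AcctB
ALTER DATABASE SnowDB ENABLE REPLICATION TO ACCOUNTS myorg.AcctB;
ALTER DATABASE SnowDB ENABLE FAILOVER TO ACCOUNTS myorg.AcctB;

-- On the Secondary account (AcctB): create a local, read-only replica of SnowDB
CREATE DATABASE SnowDB AS REPLICA OF myorg.AcctA.SnowDB;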

Client Redirect uses a new object type called a “Connection”. A Connection specifies a URL that is referenced by all clients and applications accessing Snowflake. When a disaster is declared, the Connection URL is re-pointed to the Secondary account where all subsequent logon attempts will be sent.
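A sketch of setting up such a connection, again using the hypothetical names from above (Client Redirect likewise requires the Business Critical edition or higher):

-- On the Primary account (AcctA): create the connection and allow it to fail over to AcctB
CREATE CONNECTION prod_conn;
ALTER CONNECTION prod_conn ENABLE FAILOVER TO ACCOUNTS myorg.AcctB;

-- On the Secondary account (AcctB): create a secondary connection linked to the primary one
CREATE CONNECTION prod_conn AS REPLICA OF myorg.AcctA.prod_conn;

Clients and applications then connect through the connection URL (of the form myorg-prod_conn.snowflakecomputing.com) rather than an account-specific URL.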

In normal “business-as-usual” mode, workloads execute on the Primary account (AcctA) and clients are pointed to the connection URL. When the command ALTER DATABASE… REFRESH is executed on AcctB, internal processes on AcctA check to see what data has changed since the last refresh. Specifically, this relies on Time Travel to identify new micro-partitions. Since micro-partitions are immutable, the new ones are easy to identify. They are then copied over to AcctB and applied. Customers will typically execute this REFRESH command periodically in a task, for example:

-- On the Secondary account (AcctB): refresh the replica every hour with a serverless task
CREATE OR REPLACE TASK RefreshSnowDB
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
  SCHEDULE = '60 MINUTE'
AS
  ALTER DATABASE SnowDB REFRESH;

-- Tasks are created in a suspended state, so resume it to start the schedule
ALTER TASK RefreshSnowDB RESUME;

If nothing has changed since the last execution, the refresh is a quick, in-and-out operation. It’s also important to note that the target database in AcctB is read-only, since it must be guaranteed to stay in sync with the primary.
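One way to keep an eye on these refreshes is to query the DATABASE_REFRESH_PROGRESS and DATABASE_REFRESH_HISTORY table functions from the Secondary account (SnowDB is the hypothetical database name used above):

-- On the Secondary account (AcctB): check the status of the current or most recent refresh
SELECT * FROM TABLE(INFORMATION_SCHEMA.DATABASE_REFRESH_PROGRESS('SnowDB'));

-- Review recent refresh runs and how long each one took
SELECT * FROM TABLE(INFORMATION_SCHEMA.DATABASE_REFRESH_HISTORY('SnowDB'));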

So now trouble happens. There is some incident, event, or failure in the primary region that makes continued operations impossible. It is typically an executive management decision to label such an event a true disaster rather than waiting it out through troubleshooting and remediation. But if it does rise to the level of a declared “disaster”, all it takes is the execution of two ALTER statements: one to promote the secondary database and one to promote the secondary connection. When this happens, all clients pointing to the connection URL are directed over to the Secondary account, AcctB.
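In terms of the hypothetical objects defined earlier, that failover sketch looks like this:

-- On the Secondary account (AcctB): promote the replica database to become the new primary
ALTER DATABASE SnowDB PRIMARY;

-- Promote the secondary connection so the connection URL now resolves to AcctB
ALTER CONNECTION prod_conn PRIMARY;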

At the time of writing, replication occurs at the database level. That is, only select schema-level objects within the database (tables, views, policies, tags) are replicated. Other objects, including account-level objects and settings, must be brought over via other methods. Full account replication is currently on Snowflake’s product roadmap.

The examples above also highlight that the primary and secondary accounts can reside on different cloud providers, which immediately addresses a major concern for some customers.

So, what does this all cost?

There are four dimensions to the DR cost picture that should be considered: the storage consumed by the replica in the Secondary account, the compute used to perform each replication refresh, the data transfer between regions or clouds, and any compute consumed on the Secondary account for DR testing or post-failover workloads.
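As one way to keep tabs on the replication portion of that cost, the SNOWFLAKE.ACCOUNT_USAGE.REPLICATION_USAGE_HISTORY view reports both credits consumed and bytes transferred (the 30-day window below is an arbitrary choice):

-- Replication credits consumed and bytes transferred over the last 30 days
SELECT START_TIME, END_TIME, DATABASE_NAME, CREDITS_USED, BYTES_TRANSFERRED
FROM SNOWFLAKE.ACCOUNT_USAGE.REPLICATION_USAGE_HISTORY
WHERE START_TIME >= DATEADD(day, -30, CURRENT_TIMESTAMP())
ORDER BY START_TIME;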

Please visit the following document for other factors to consider when implementing replication on Snowflake:

In closing, Snowflake has built an architecture and feature set that supports a Disaster Recovery strategy that is not only cost-effective, but also simple to implement and manage without reliance on third parties or hard-to-find specialized skill sets. With the upcoming release of full account replication, the story only gets better!

--

Tom Manfredi
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Scalable Data Warehouse/Lake Architectures, Snowflake, Teradata, Performance Optimization, Snowflake SnowPro Core Certification, Teradata Certified Master