Replication/failover in Snowflake and best practices to optimize associated spend

This blog post provides an overview of Snowflake functionality that ensures business continuity in the face of regional or cloud-provider-wide availability disruptions. It then details the components of the associated costs, along with best practices for applying Snowflake’s cost optimization framework and FinOps practices to this domain.

Overview of Snowflake’s replication functionality

Snowflake enables customers to replicate objects such as databases and keep them synchronized across multiple accounts in different regions and/or cloud providers. Replication occurs at a customer-stipulated time granularity.
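For illustration, a minimal sketch of setting up database replication between two accounts follows; the group, database, and account names (myrg, mydb, myorg.source_account, myorg.target_account) are placeholders, and the 30-minute schedule is an arbitrary choice.

-- On the source account: create a replication group containing one database.
CREATE REPLICATION GROUP myrg
  OBJECT_TYPES = DATABASES
  ALLOWED_DATABASES = mydb
  ALLOWED_ACCOUNTS = myorg.target_account
  REPLICATION_SCHEDULE = '30 MINUTE';

-- On the target account: create the secondary group and run an initial refresh.
CREATE REPLICATION GROUP myrg
  AS REPLICA OF myorg.source_account.myrg;
ALTER REPLICATION GROUP myrg REFRESH;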

Replication and failover can help ensure business continuity in the face of outages

Customers can also fail over to another region and/or cloud provider to continue business operations during outages or availability disruptions. Snowflake provides a connection URL that automatically redirects clients to the target account during a failover.
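As a sketch of how that redirecting URL is set up (this uses Snowflake’s Client Redirect feature, which generally requires Business Critical Edition or higher; myconn and the account names are placeholders):

-- On the source account: create a connection and enable failover to the target account.
CREATE CONNECTION myconn;
ALTER CONNECTION myconn ENABLE FAILOVER TO ACCOUNTS myorg.target_account;

-- On the target account: create the secondary connection.
CREATE CONNECTION myconn AS REPLICA OF myorg.source_account.myconn;

-- During an outage, promote the secondary connection; clients using the
-- connection URL are then redirected to the target account.
ALTER CONNECTION myconn PRIMARY;

Clients connect via a URL of the form <organization_name>-<connection_name>.snowflakecomputing.com, which always follows the current primary.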

Snowflake replication and failover workflow

Snowflake decouples compute from persistent storage so that each can be scaled independently of the other. Virtual warehouses, the primary compute primitive in Snowflake, provide elastic, scale-to-zero, instant-on, stateless, and fully managed mechanisms to run a variety of customer workloads on multiple cloud service providers. Each virtual warehouse includes independent compute clusters that do not share resources with other virtual warehouses. The VMs that comprise a virtual warehouse are dedicated solely to that warehouse, which results in strong performance, resource, and security isolation. Each job runs on exactly one virtual warehouse.

Snowflake capabilities to ensure business continuity based on the blast radius of disruptions

As a result, if data is replicated across cloud regions or providers, then when a disaster occurs in one region or cloud provider, customers can immediately access and start querying data in a different region or cloud provider by spinning up new stateless, instant-on, and scale-to-zero virtual warehouses in the target region(s).
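For example, a warehouse can be stood up on the target account with a single statement; the name and sizing below (recovery_wh, MEDIUM) are illustrative choices.

CREATE WAREHOUSE IF NOT EXISTS recovery_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60           -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE          -- resume automatically when the next query arrives
  INITIALLY_SUSPENDED = TRUE; -- consume no credits until first use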

Components of replication costs

Replication costs include data transfer, compute, and storage costs. These are billed to the target account (i.e., secondary/failover site).

Snowflake passes on cloud providers’ charges for data transfers associated with the initial replication and subsequent synchronization between regions.

Snowflake-provided compute used to copy data between accounts across regions is also billed to the target account. The target account additionally incurs costs for the background processes that maintain materialized views and search optimization. Re-clustering is not performed again on secondary databases, since clustered tables are replicated to the secondary database in their already-clustered state by refresh operations.

Furthermore, the target account incurs storage costs for the data in each secondary database. If the initial replication or a subsequent refresh fails, the next attempt (if performed within 14 days) can reuse data that was already copied over.
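To keep an eye on the storage side, a query along the following lines (run on the target account) surfaces per-database storage over the last 30 days using the DATABASE_STORAGE_USAGE_HISTORY view:

SELECT usage_date,
       database_name,
       average_database_bytes / POWER(1024, 3) AS avg_database_gb
FROM snowflake.account_usage.database_storage_usage_history
WHERE usage_date >= DATEADD('day', -30, CURRENT_DATE())
ORDER BY usage_date DESC, avg_database_gb DESC;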

Best practices to optimize Snowflake replication costs

1/ Gain visibility into Snowflake replication usage

The amount of replication data transferred within a specified date range can be explored using the REPLICATION_GROUP_USAGE_HISTORY table function and/or the REPLICATION_GROUP_USAGE_HISTORY view.

The following query uses the REPLICATION_GROUP_USAGE_HISTORY table function to view credits used for account replication in the last 7 days:

SELECT start_time, end_time, replication_group_name, credits_used, bytes_transferred
FROM table(information_schema.replication_group_usage_history(date_range_start=>dateadd('day', -7, current_date())));

The following query uses the REPLICATION_GROUP_USAGE_HISTORY view to examine credits used by each replication or failover group in the current month:

SELECT start_time, 
end_time,
replication_group_name,
credits_used,
bytes_transferred
FROM snowflake.account_usage.replication_group_usage_history
WHERE start_time >= DATE_TRUNC('month', CURRENT_DATE());

The amount of data transferred and the associated spend for database replication within a specified date range can be explored using the DATABASE_REPLICATION_USAGE_HISTORY table function or the DATABASE_REPLICATION_USAGE_HISTORY view.

The following query provides a full list of replicated databases and credits consumed by replication over the last 30 days, broken out by day. Any irregularities in the credit consumption or consistently high consumption are flags for additional investigation.

SELECT TO_DATE(start_time) AS date, database_name, SUM(credits_used) AS credits_used 
FROM snowflake.account_usage.database_replication_usage_history
WHERE start_time >= DATEADD(month,-1,CURRENT_TIMESTAMP())
GROUP BY 1,2 ORDER BY 3 DESC;

The following query shows the average daily credits consumed by database replication grouped by week over the last year. This helps identify any anomalies in the daily average so you can investigate any spikes or changes in consumption.

WITH credits_by_day AS (
  SELECT TO_DATE(start_time) AS date, SUM(credits_used) AS credits_used
  FROM snowflake.account_usage.database_replication_usage_history
  WHERE start_time >= DATEADD(year,-1,CURRENT_TIMESTAMP())
  GROUP BY 1
)
SELECT DATE_TRUNC('week', date) AS week, AVG(credits_used) AS avg_daily_credits
FROM credits_by_day GROUP BY 1 ORDER BY 1;

2/ Gain visibility into replication costs of individual databases

The cost for replication for an individual database in a replication or failover group can be calculated using the following methodology:

(<database_bytes_transferred> / <total_bytes_transferred_for_replication>) * <credits_used>

The following query calculates the bytes replicated for databases in a replication group in the last 30 days:

SELECT SUM(value:totalBytesToReplicate) AS sum_database_bytes
FROM snowflake.account_usage.replication_group_refresh_history rh,
     LATERAL FLATTEN(input => rh.total_bytes:databases)
WHERE rh.replication_group_name = 'MYRG'
  AND rh.start_time >= CURRENT_DATE - INTERVAL '30 days';

The following query calculates the credits used and bytes transferred for replication of the replication group in the last 30 days:

SELECT SUM(credits_used) AS credits_used, SUM(bytes_transferred) AS bytes_transferred
FROM snowflake.account_usage.replication_group_usage_history
WHERE replication_group_name = 'MYRG'
  AND start_time >= CURRENT_DATE - INTERVAL '30 days';
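The two results above can be combined into a single statement that applies the methodology shown earlier. This is a sketch under the same assumptions (replication group 'MYRG', 30-day window); note that the first CTE sums bytes across all databases in the group, so to isolate a single database you would additionally filter the flattened per-database entries.

WITH db_bytes AS (
  SELECT SUM(value:totalBytesToReplicate) AS database_bytes
  FROM snowflake.account_usage.replication_group_refresh_history rh,
       LATERAL FLATTEN(input => rh.total_bytes:databases)
  WHERE rh.replication_group_name = 'MYRG'
    AND rh.start_time >= CURRENT_DATE - INTERVAL '30 days'
),
group_usage AS (
  SELECT SUM(credits_used) AS credits_used,
         SUM(bytes_transferred) AS bytes_transferred
  FROM snowflake.account_usage.replication_group_usage_history
  WHERE replication_group_name = 'MYRG'
    AND start_time >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT (db.database_bytes / gu.bytes_transferred) * gu.credits_used
         AS estimated_database_credits
FROM db_bytes db, group_usage gu;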

3/ Periodically review replication history

Furthermore, the replication history for a specific replication or failover group can be analyzed using the REPLICATION_GROUP_REFRESH_HISTORY table function or the REPLICATION_GROUP_REFRESH_HISTORY view. The following query shows the account replication history of a failover group over the last 7 days:

SELECT PHASE_NAME, START_TIME, END_TIME, TOTAL_BYTES, OBJECT_COUNT
FROM TABLE(information_schema.replication_group_refresh_history('myfg'))
WHERE START_TIME >= current_date - interval '7 days';

The following query uses the REPLICATION_GROUP_REFRESH_HISTORY table function to view the bytes copied for database replication for a replication group over the last 14 days:

SELECT SUM(value:totalBytesToReplicate)
FROM TABLE(information_schema.replication_group_refresh_history('myrg')) AS rh,
     LATERAL FLATTEN(input => total_bytes:databases)
WHERE rh.phase_name = 'COMPLETED'
  AND rh.start_time >= CURRENT_DATE - INTERVAL '14 days';

The following query uses the REPLICATION_GROUP_USAGE_HISTORY table function to view the sum of credits used and the sum of bytes transferred for replication of a replication group over the last 14 days:

SELECT SUM(credits_used), SUM(bytes_transferred)
FROM TABLE(information_schema.replication_group_usage_history(
  date_range_start => DATEADD('day', -14, CURRENT_DATE()),
  replication_group_name => 'myrg'));

The REPLICATION_GROUP_REFRESH_PROGRESS or REPLICATION_GROUP_REFRESH_PROGRESS_BY_JOB table function can be used to understand the progress of a replication or failover group refresh, as illustrated in the following query:

SELECT PHASE_NAME, START_TIME, END_TIME, PROGRESS, DETAILS
FROM TABLE(information_schema.replication_group_refresh_progress('myfg'));

4/ Estimate replication and failover spend

In general, the spend for replication is a function of:

  • Amount of table data in primary databases belonging to a replication/failover group (changes due to data loading or DML operations).
  • Frequency of secondary database or replication/failover group refreshes.

You can extrapolate from historical values of these parameters to estimate future spend and avoid surprises.
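One way to do this in SQL is to average the daily replication credits over a recent window and project them forward. The 30-day window and flat extrapolation below are simplifying assumptions; adjust them if your data volumes or refresh schedules are trending up or down.

WITH daily AS (
  SELECT TO_DATE(start_time) AS date, SUM(credits_used) AS credits_used
  FROM snowflake.account_usage.database_replication_usage_history
  WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
  GROUP BY 1
)
SELECT AVG(credits_used) AS avg_daily_credits,
       AVG(credits_used) * 30 AS projected_monthly_credits
FROM daily;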

5/ Use replication and failover groups with appropriate policies

To help replicate groups of objects with point-in-time consistency from a source account to one or more target accounts, Snowflake provides two constructs: replication groups and failover groups.

A replication group allows you to specify which objects to replicate, which regions or cloud providers to replicate to, and how often to replicate. A failover group additionally enables failover of the account objects in the group. You can choose to fail over all failover groups or only select ones. Each replication and failover group has its own schedule. Also, data transfer costs are a function of cloud provider pricing based on the source and target region/cloud provider.
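A minimal sketch of creating a failover group follows; the names, object types, and schedule are illustrative choices, not a recommendation.

CREATE FAILOVER GROUP myfg
  OBJECT_TYPES = USERS, ROLES, WAREHOUSES, RESOURCE MONITORS, DATABASES
  ALLOWED_DATABASES = db1, db2
  ALLOWED_ACCOUNTS = myorg.target_account
  REPLICATION_SCHEDULE = '60 MINUTE';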

In the case of an outage in a region or on a cloud platform, we recommend promoting the replicated secondary account objects and databases to serve as read-write primary objects.
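Promotion is a single statement run on the target account (myfg is a placeholder group name):

ALTER FAILOVER GROUP myfg PRIMARY;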

6/ Carefully choose which Snowflake objects are replicated and failed over to control costs

You can optimize replication costs by carefully choosing which objects, such as databases, to replicate. Ideally, replicate only the objects needed for your disaster-recovery strategy. Additionally, you can further reduce costs by optimizing the frequency of secondary database refreshes.

Snowflake allows you to set the replication frequency at different intervals for different groups of objects. You can control costs by scheduling database replication at a time interval based on the needs of your use case, for example, every 30 minutes or every hour.
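For example, the refresh schedule of an existing group can be adjusted with ALTER (myrg is a placeholder; pick the interval your recovery point objective actually requires):

-- Refresh every 60 minutes instead of every 30.
ALTER REPLICATION GROUP myrg SET REPLICATION_SCHEDULE = '60 MINUTE';

-- Or pin refreshes to a cron schedule, e.g., hourly on the hour (UTC).
ALTER REPLICATION GROUP myrg SET REPLICATION_SCHEDULE = 'USING CRON 0 * * * * UTC';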

7/ Choose test failover cadence based on the level of preparedness needed for your use case

We recommend running planned failovers: disaster recovery drills to test preparedness and to measure recovery points and time to recovery. The frequency of these test failovers depends on the criticality of your workloads.

You can also set up custom automation to artificially inject faults and simulate disasters and other failure conditions during “game days” as part of a chaos engineering practice. Test failovers during such game days help teams rehearse challenging operational situations ahead of actual events by creating potential failure conditions and monitoring how effectively the team and system respond.

Such exercises can also help you form and test hypotheses about what you expect from the experiment, such as “if X happens, <business/technical metric> should not exceed <level>” (e.g., if a cloud region fails during a business-critical period, the number of query failures should stay under 10% over the next 2 hours). After the experiment completes, evaluate whether your workload’s resilience, delivered through Snowflake’s replication and failover capabilities, is in line with business and/or technical expectations.
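As a sketch of checking such a hypothesis after a drill (note that ACCOUNT_USAGE views can lag by up to 45 minutes; the INFORMATION_SCHEMA.QUERY_HISTORY table function provides fresher data):

SELECT COUNT_IF(execution_status <> 'SUCCESS') / NULLIF(COUNT(*), 0)
         AS failure_rate
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('hour', -2, CURRENT_TIMESTAMP());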

8/ Replication/failover can be used to migrate your account during M&A activity or changes in cloud strategy

We recommend using Snowflake’s replication/failover functionality to move your Snowflake account to a different region or cloud platform (for reasons such as mergers and acquisitions, or changes in cloud vendor strategy such as embracing multi-cloud) without disrupting your business.

Conclusion and recap

Snowflake’s FinOps framework includes three primary pillars: Visibility, Control, and Optimization. These have been applied to the replication and failover capability to formulate the best practices to optimize spend covered in this post:

  • 1/ Gain visibility into Snowflake replication usage
  • 2/ Gain visibility into replication costs of individual databases
  • 3/ Periodically review replication history
  • 4/ Estimate replication and failover spend
  • 5/ Use replication and failover groups with appropriate policies
  • 6/ Carefully choose which Snowflake objects are replicated and failed over to control costs
  • 7/ Choose test failover cadence based on the level of preparedness needed for your use case
  • 8/ Use replication/failover to migrate your account during M&A activity or changes in cloud strategy
