Ensuring High Usage Readiness and Disaster Preparedness with Snowflake

Published in

Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

6 min readMar 18, 2024

Achieving High Availability and Disaster Recovery Readiness in Snowflake

As product owners, we are constantly challenged with ensuring that our systems can handle peak usage periods, such as holidays, without encountering crashes or performance degradation. Additionally, we must be prepared for unforeseen disasters, whether from human error, like accidental database deletions, or infrastructure failures, such as zone outages. Let’s explore each scenario and how we can effectively manage them.

High Availability

During peak periods, such as holidays or special promotions, the demand on our systems can skyrocket. Our websites and reports must remain responsive and accessible to users, even under heavy loads. To achieve this:

— Scalability: Implement auto-scaling capabilities to adjust resources dynamically based on demand. This ensures that our systems can handle sudden spikes in traffic without compromising performance.

-- Automatic scaling is only available in the Enterprise Edition or higher
CREATE OR REPLACE WAREHOUSE my_wh WAREHOUSE_SIZE=XSMALL 
INITIALLY_SUSPENDED=TRUE;

ALTER WAREHOUSE my_wh
SET MIN_CLUSTER_COUNT = 1,
    MAX_CLUSTER_COUNT = 2;

Remember that autoscaling could help with HA but also incurs costs based on the number of clusters used. Monitor your usage and costs to optimize your setup.

— Performance Monitoring: Utilize robust monitoring tools to continuously monitor system performance and identify potential bottlenecks or areas of concern. Proactive monitoring allows us to address issues before they escalate.

-- Top 50 longest-running queries
SELECT query_id,
  ROW_NUMBER() OVER(ORDER BY partitions_scanned DESC) AS query_id_int,
  query_text,
  total_elapsed_time/1000 AS query_execution_time_seconds,
  partitions_scanned,
  partitions_total
FROM snowflake.account_usage.query_history Q
WHERE TO_DATE(Q.start_time) > DATEADD(day,-1,TO_DATE(CURRENT_TIMESTAMP()))
  AND error_code IS NULL
  AND partitions_scanned IS NOT NULL
ORDER BY total_elapsed_time desc
LIMIT 50;

— Load Testing: Conduct comprehensive load testing exercises to simulate high-traffic scenarios and assess the performance of our systems under stress. This enables us to identify and rectify performance limitations before they impact users.

— Caching Strategies: Implement caching and aggregating mechanisms to reduce the load on backend systems and improve response times. It’s important to recognize that the Snowflake cache is utilized when the query remains unchanged, the data persists without alteration, and the warehouse remains unsuspended.

Disaster Recovery

Disasters can strike at any moment, whether it’s due to human error or infrastructure failures. It’s essential to have robust disaster recovery measures to minimize downtime and data loss in a catastrophe. Key strategies include:
— Regular Backups: Implement automated backup processes to regularly back up critical data and configurations. Ensure that backups are stored securely and efficiently restored during data loss.
— Data Replication: Utilize data replication techniques to replicate data across multiple geographic regions or availability zones. This ensures redundancy and resilience against infrastructure failures or regional outages.
— Role-Based Access Control (RBAC): Implement RBAC policies to restrict access to sensitive systems and data, reducing the likelihood of accidental deletions or unauthorized modifications.
— Disaster Recovery Planning: Develop comprehensive disaster recovery plans outlining procedures for restoring systems and data during a disaster. Conduct regular drills and rehearsals to validate the effectiveness of these plans and ensure readiness.

Snowflake Edition

To access the disaster recovery features, you must be subscribed to the Business Critical Edition. If you’re currently on a lower edition, upgrading is straightforward. Create a support ticket in Snowflake, and our team will assist you in upgrading your subscription. Consider that you can upgrade for the high season and downgrade back just after.

After receiving this email, it took approximately three hours for the account to be upgraded to Business Critical. Since the ticket I created was of low severity, it’s possible that the process could be expedited if it’s urgent for you.

Note: When you upgrade an account to Business Critical Edition (or higher), it might take up to 12 hours for failover capabilities to become available.

Downgrading back to Enterprise after the holidays

If you choose to upgrade your account just for the hypes, consider that downgrading might take 1–3 days.

You will get an email to make sure you would like to downgrade and loose the Business Critical only features:

Before downgrading the account, kindly let us know if any of the following features were activated in the account.
1) tri-secret which is available only for Business Critical accounts.
2) Private link which is available for Business critical and above editions.
If enabled, we have to remove these features from the account, and then we shall proceed with downgrading the account to Enterprise.

Also, as you can guess, the Failover feature will stop working once you downgrade.

Setup the DR

Note: The organization administrator (ORGADMIN role) must enable replication for the source and target accounts.
Log into an ORGADMIN account to enable replication for each source and target account in your organization.

USE ROLE ORGADMIN;

-- View the list of the accounts in your organization
-- Note the organization name and account name for each account for which you are enabling replication
-- With this you would also find your organization name
SHOW ORGANIZATION ACCOUNTS;

-- Enable replication by executing this statement for each source and target account in your organization
SELECT SYSTEM$GLOBAL_ACCOUNT_SET_PARAMETER('<organization_name>.<account_name>', 'ENABLE_ACCOUNT_DATABASE_REPLICATION', 'true');

Setting up disaster recovery (DR) involves three essential steps:

1. Replication: This step involves replicating Snowflake objects to ensure redundancy and availability in case of a disaster.
2. Failover: Configuring the failover account is crucial, as it determines the backup environment that will be activated in the event of a disaster.
3. Client Redirect: Redirecting clients to the failover account ensures seamless continuity of operations, minimizing downtime and disruptions.

Replication

This feature facilitates the replication of objects from a source account to one or more target accounts within the same organization. Objects replicated in each target account, termed secondary objects, mirror the primary objects in the source account. Replication is supported across regions and cloud platforms, ensuring robust redundancy and data availability.

The steps for Replication can be found in Snowflake documentation

Failing over account objects

In a disaster, such as a power outage in a specific region, you may need to initiate your backup to maintain operations. To do this, follow Snowflake’s instructions for setting up a failover account, which can be found in the Snowflake documentation here. These instructions will guide you through configuring failover to ensure seamless continuity during disruptions.

Client redirection

Client Redirect functionality empowers the seamless redirection of client connections to Snowflake accounts situated in different regions. This capability serves multiple purposes, including ensuring business continuity and disaster recovery measures, as well as facilitating smooth migration of accounts to alternative regions or cloud platforms.

Instructions for setting up the Client redirection can be found here.

Summary

This blog post explored the distinction between disaster recovery and high availability, examining how we can effectively prepare our Snowflake account to accommodate both scenarios.

To stay updated on more Snowflake-related posts, follow me at my Medium profile: Eylon’s Snowflake Articles.

I’m Eylon Steiner, Engineering Manager for Infostrux Solutions. You can follow me on LinkedIn.

Subscribe to Infostrux Medium Blog at https://blog.infostrux.com for the most interesting Data Engineering and Snowflake news. Follow Infostrux’s open-source efforts through GitHub.