Hashmap has been focusing on delivering the simplest, most cost-efficient data solutions for the better part of a decade. Because of that focus on outcomes, we have lately found ourselves doing more and more work on the Snowflake Cloud Data Warehouse.
It is tempting to think that Snowflake is simply a better version of some previous warehouse solution and treat it like you are used to treating other solutions. I know I was guilty of thinking:
Oh! Snowflake is like if Apache Hive worked for BI and wasn’t a massive pain and wasn’t expensive to maintain!
While this is true, it completely misses how incredibly innovative and disruptive Snowflake is. This way of thinking is too limiting.
If you want to get the most out of your Snowflake instance (and your analytics engineering teams), read on to learn the 5 things I wish I knew before learning Snowflake.
1: Warehouses are Cattle, not Pets
I often see new Snowflake users set up their virtual warehouses to mimic an existing environment or predefined set of activities (
DATASCIENCE_WAREHOUSE). These warehouses are defined by a central authority and sized by an admin for peak demand. All users hit the same warehouse, even in production environments, and must ensure enough horsepower to keep the prod pipelines going in addition to adhoc querying.
This way of thinking is limiting and is costing you money. By having such broadly defined warehouses, you’re limiting your ability to right-size a warehouse for a granular set of activities, you lose visibility into the main drivers of cost on your Snowflake instance, and (worst of all) you make production work vulnerable to running out of resources.
This older way of thinking about compute resources made sense when your server was a precious commodity installed lovingly in your data center. It had only the resources it had and both production and development workloads needed to be supported. You likely named this pet server something clever (I’ve seen everything from Greek Gods to Game of Thrones Families) and hoped no one would bring the service down with a poorly written regex or greedy dashboard refresh configuration.
Those days are gone with Snowflake. You no longer have warehouse pets, you have warehouse cattle.
Cattle do not need specialized attention or hand-delivered care. You can create and destroy warehouse clusters as often as you need to (the cattle analogy gets a little medieval at this point).
Multi-terabyte ingestion job? Grab a 4XL monster and get that ingestion done now! Just playing around with the Python SDK? Better make that a cheap dedicated XSmall warehouse that you can experiment with.
You have the flexibility to grant everyone and each pipeline their own series of warehouses to ensure you’re always using the right tool for the job. Embrace the freedom and say goodbye to your goofy
TARGARYEN dev cluster!
2: Storage is cheaper than Compute
Let’s take a quick field trip to the incredibly useful Snowflake Pricing Page.
For the enterprise customers I work with, the $3 compute cost and $23 storage costs are the most common.
While it’s difficult (or impossible) to make a credible comparison between storage and compute, I have two ways to put these costs into context.
- The Hashmap Snowflake instance spent about $750 in July 2019 on compute ($2 rate) and about $0.17 in storage ($40 rate).
- For typical enterprise deployments that I see, the cost to run the smallest possible virtual warehouse for an hour is 100 times as expensive as storing 1TB of data for an hour ($3.00 vs $0.03)
Of course, Hashmap will always have much more compute activity than storage activity because our instance is for training and development, not production warehousing (so we clean up our large data sets pretty quickly). We’ve discussed our cost monitoring in detail in a previous blog:
However, these numbers should give you a sense of the scale we’re discussing. When working with Snowflake, your first and last concern should be how to conserve compute credits.
In practical terms, this means you should:
- denormalize on write, not on read (no complex views for BI)
- avoid pointing consumption to continuously-updated tables to maximize cache (periodically cloning your table and consuming the clone is a good trick for handling this)
- Use Table Streams on raw ingestion tables to reduce expensive CDC computation (which saves warehouse costs by running for shorter times)
In general, always ask yourself if you can save a future computation by storing your data in a new way. If the answer is yes, you might be able to save some money.
3: SQL Everything
As a consultant, I often provide guided instruction for Hashmap’s various clients. For most technologies we work with, this means extensive screenshots and documentation (see my guide for setting up SSO in Azure for Enterprise Applications as an example).
Graphical interfaces can obviously make setting up SSO (or any number of tasks) easy to do, but they become difficult to automate or repeat. Luckily, Snowflake doesn’t make us choose between easy and repeatable thanks to its support of SQL configuration.
Most things I can do quickly in the Snowflake UI can also be done repeatably with SQL. I actually cannot think of a single thing that I can do in the UI that isn’t doable in a SQL script.
This seems small at first, but it is huge when you:
- setup CI/CD pipelines for code deployment
- establish templates for repeatable processes (like I have for user creation or BI tool setup)
- lose your only DBA to another company and are left wondering how the production permissions were set up (I see this all the time)
Having full SQL control of your Snowflake environment is a game-changer that you’ll miss when moving to other solutions
4: Cache Money
Everyone stay calm while we take a quick (simplified) look under the ol’ Snowflake hood:
This picture has proven particularly handy when teaching Snowflake to new users. The major parts to concern yourselves with are below:
- storage and compute are separate
- the Raw Data Cache (in orange) stays fresh as long as underlying data doesn’t change and your warehouse stays on and unmodified.
- the Result Set Cache (in vibrant cyan) stays fresh as long as the underlying data doesn’t change AND you submit a word-for-word identical query as a previous query within 24 hours of the original query (this resets every 31 days, but no big deal).
Once you understand the value of these two caching mechanisms, you’ll find yourself designing your Virtual Warehouses, BI-tables, and client connections differently.
For highly interactive queries that vary unpredictably, you’ll find that you’ll get better performance by configuring your virtual warehouse to delay suspending for longer (keeping that raw data cache fresh). This is common for data science work or other exploratory analyses.
For production BI and other highly-predictable querying, focus on hitting the Result Set Cache as often as possible. This cache is so powerful because it is lightning fast and entirely free (it will not turn on any warehouses). The cache does not have a published size limitation and works across users.
If your target data has high churn, I also recommend working off of a clone of the data to prevent changing values from invalidating either cache. For BI, use a scheduled task to periodically clone a continuously-updated source table and point your connections to this clone. For more exploratory work, cloning on-demand works perfectly fine.
Read more about cloning here to boost your cache performance. Clones are nearly instant and do not incur additional storage costs.
5: Never Hesitate to Automate
This may be the most powerful consequence of Snowflake’s commitment to SQL-based configuration. When everything you can do is expressable as SQL, you can automate the most mundane tasks of working with a data warehouse.
Let’s take a quick look at all the different ways to programmatically access Snowflake:
These connectors cover the most popular programming languages. This gives you the freedom to use any orchestration system you like to submit SQL strings to your Snowflake instance.
For example, instead of manually creating new user accounts, create a Slack app with Python that allows users to request their own account creation. Rather than hounding your users yourself about naming conventions, build a simple Python Function in AWS Lambda that emails the creators of objects that violate your conventions (
If you’d prefer to stick to native Snowflake, you can do some data pipeline automation using Tasks and Snowpipes as well.
A Snowflake Task allows you to schedule a SQL DML statement to run with a specified Virtual Warehouse. You could also execute a stored procedure for any DDL. This is useful for processing incoming raw data (like JSON files) into consumable tables for analysis.
A Snowflake Snowpipe is an internal object for continuously ingesting data from external stages (S3 / Blob / GCS). It responds to files as they arrive in the stage for automatic ingestion using Snowflake-managed resources (no explicit warehouse definition required).
You can build flexible automated processing pipelines directly in Snowflake by combining a Snowpipe for raw data ingestion with a Task for processing transformations (maybe check out Snowflake Streams to add CDC to your source tables).
These five things took time for me to appreciate but are now core to how I build solutions. There are still plenty of things I didn’t cover here that I encourage you to explore on your own (did someone say data sharing?).
Randy on Snowflake
Below are a few of my other recent Snowflake stories as well…
VIDEO: Artisanal Data Pipelines | Best Practices for Snowflake External Staging in AWS and Azure
Ingestion is the lifeblood of your Snowflake Cloud Data Warehouse. Make sure that you’re spending less and doing more…
30 Second Snowflake Cloud Data Warehouse Cheat Sheet
At Hashmap, we’re constantly looking for ways to provide our clients with a simpler and quicker path to business value…
10 Minute Beginner’s Guide to Snowflake Cloud Data Warehouse
We kicked off the OKC Snowflake User Group Meetup a couple of months ago in order to assist a rapidly growing group of…
Snowflake Partner Connect — 3 Steps for Connecting to Fivetran for Data Pipelining
A Quick Start Guide for Building Snowflake Connections from Scratch
How to Improve Cloud Cost Monitoring — Snowflake + Tableau
A Guide for Using Snowflake‘s Account Usage Dataset to Create Automatically-Updating Cost Monitoring Dashboards
Amplifying Outcomes with Snowflake
Quickly Build and Use a Snowflake Cloud Data Warehouse with Stitch, PowerBI, and a Kaggle Dataset in AWS
Feel free to share on other channels and be sure and keep up with all new content from Hashmap here.
Randy Pitcher is a Cloud and Data Engineer (and OKC-based Regional Technical Expert) with Hashmap providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high value business outcomes for our customers.
Be sure and connect with Randy on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes or to schedule a hands-on workshop to help you go from Zero to Snowflake.