Snowflake Cost Optimization: Part I
The Snowflake Data Cloud has a pay-as-you-go pricing model. There are no upfront costs for licenses or hardware purchases, and no costs for infrastructure management and maintenance. You simply pay for storage and compute as you go. But that does not mean you can't be cost-effective. Honestly, you would be silly to ignore the cost side. There are optimization techniques and basic rules for how to operate your accounts. Snowflake offers a scalable solution, but your costs can also scale to very unpleasant levels.
Let's have a look at a few features that can impact costs and which you should keep under control. We'll start with the basic ones and call them the basic building blocks of every project built on top of the Snowflake Data Cloud: a basic mantra for every Data Engineer and Architect, if you will. Then we'll look at other Snowflake features that can also have a significant impact on costs. Maybe you had no idea that features like Secure Data Sharing, External Tables, or the COPY command might affect your bill in a positive way.
All published recommendations and observations are based on my own experience. Not all of them may be relevant for every project, but I believe they are worth knowing about.
I have divided the blog post into two parts. Today, we are going to focus on optimizing compute costs. In Part 2, coming soon, we will have a look at storage cost optimization.
Now let’s start and try to save you some 💰!
Snowflake virtual warehouses, the compute clusters in Snowflake, are often the components of your Snowflake account with the biggest impact on costs. To execute SQL statements or to load and unload data from tables, you need a running virtual warehouse. More about the virtual warehouse concept can be found in the documentation here: Virtual Warehouses — Snowflake Documentation
You are billed in Snowflake credits for the running time of each virtual warehouse, with per-second billing. The number of credits billed depends on running time and warehouse size. There are basic configuration options for virtual warehouses which you should apply every time in order to spend your credits and processing time effectively.
Auto suspend and auto resume parameters
This is the alpha and omega. Because you are billed for run time, don't leave warehouses running 24/7. Use them only when they are needed; the rest of the time they can be suspended. With the auto-resume parameter, the warehouse is automatically started when a new query is triggered. And obviously, the auto-suspend parameter shuts the warehouse down after a defined idle time. The right idle time depends on your workload; you can start with 1 second and go up to hours. You should be aware of some important things when choosing the right auto-suspend time:
- When a warehouse is started, you are always billed for one full minute (60 seconds). After that, per-second billing applies. This means that if you have small gaps between your queries, up to roughly 3 minutes, it does not make sense to set auto-suspend to a very low value.
- The warehouse cache is cleared when you suspend a virtual warehouse. Using the cache can have a significant impact on your cost and performance, so keep in mind to use all types of Snowflake caches as much as possible. We will talk more about caches later.
Virtual warehouse start/stop is a near-instantaneous operation. There is no meaningful delay or waiting time, which gives you a great opportunity to design your compute layer around your workloads. Utilize the elasticity!
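In practice, both parameters can be set when a warehouse is created or changed later with an ALTER statement. A minimal sketch (the warehouse name and values are just examples):

```sql
-- Create a warehouse that suspends itself after 60 seconds of idle time
-- and wakes up automatically when a new query arrives.
CREATE WAREHOUSE IF NOT EXISTS my_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60      -- seconds of idle time before suspension
  AUTO_RESUME    = TRUE;

-- The same parameters can be adjusted on an existing warehouse:
ALTER WAREHOUSE my_wh SET AUTO_SUSPEND = 120;
```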
Experiment with different warehouse sizes
You might be wondering how to find the right warehouse size so that you have enough computation power without being over-provisioned. There is no exact rule or guidance. What I would recommend is experimenting. Try running a typical query on top of a representative amount of data and compare the performance. You need to find the right balance between time and cost, which may differ between use cases. For running ETL/ELT jobs you would probably go for a bigger warehouse to finish the task as quickly as possible and make the data available to consumers. On the other hand, a virtual warehouse used by developers may not need such huge computation power, and developers can wait a couple more seconds for their queries to finish.
Just be aware that a bigger warehouse does not necessarily mean a more expensive one. Have a look at the chart below. It shows how long it takes to load 1 billion records with different warehouse sizes and what such an operation costs.
You can see that each step up in warehouse size doubles the number of servers and, for a well-parallelized job, roughly halves the run time. Because you pay per second of compute, the price stays the same. In other words, you can get your answer 8x faster but for the same cost. Keep this in mind when designing your warehouses.
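A quick way to run such an experiment is to resize one warehouse between benchmark runs and compare elapsed time against credits burned. A sketch, with an illustrative warehouse name and Snowflake's standard credit rates (XSMALL = 1 credit/hour, LARGE = 8 credits/hour):

```sql
-- Run the benchmark query on a small warehouse first...
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XSMALL';
-- ...execute the test query here and note the elapsed time...

-- ...then repeat on a larger size and compare.
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';
-- If a job that takes 64 minutes on XSMALL (~1.07 credits) finishes
-- in 8 minutes on LARGE (also ~1.07 credits), the larger warehouse
-- is 8x faster at essentially the same cost.
```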
Effective data ingestion
When you are building a data pipeline, there are simple things that can affect the performance and thus the cost as well. Snowflake offers so-called clustering keys, a mechanism for organizing data in micro-partitions in an efficient way. This helps performance when data is retrieved with a SELECT statement. Clustering keys are not intended for every table; only really big tables, in terms of row count, can benefit from them, and obviously there is a higher cost behind them. There is a little workaround that lets you "simulate" (in some sense) what clustering keys do. When you insert data into tables, insert it sorted by the dimension that is usually used in joins or for filtering (the WHERE clause). Typically these are date and time columns, various IDs, etc. This little trick can significantly help query pruning and micro-partition scanning when retrieving data from Snowflake. I have been able to lower query run times by about 15–20% thanks to that sorting.
Personally, I haven't used clustering keys yet. We have always run tests to verify whether they bring any benefit, but so far, for our data volumes (up to billions of records per table), there was no increase in performance significant enough to justify the higher cost. Storing ordered data works for us.
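As a sketch, the sorted-insert trick is nothing more than an ORDER BY on the load statement (the table and column names below are made up for illustration):

```sql
-- Inserting pre-sorted data keeps related rows together in the same
-- micro-partitions, which improves partition pruning on later queries.
INSERT INTO sales_fact
SELECT *
FROM staging_sales
ORDER BY sale_date, customer_id;  -- the columns you typically filter or join on
```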
Monitor your costs
You do not know you are wasting money if you don't monitor your spending. Have a look at resource monitors and define quotas and alerts for your warehouses. If you have a hard limit on spending, you can even have the warehouse shut down when the limit is reached. Long-term monitoring of warehouse activity and spending will give you a better understanding of your setup and provide input for the next round of cost/performance optimization.
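A minimal resource monitor setup might look like this (the monitor name, quota, and warehouse name are example values):

```sql
-- Notify at 90% of the monthly credit quota and suspend the
-- attached warehouse when 100% is reached.
CREATE RESOURCE MONITOR IF NOT EXISTS monthly_limit
  WITH CREDIT_QUOTA = 100          -- credits per month (example value)
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 90  PERCENT DO NOTIFY
    ON 100 PERCENT DO SUSPEND;

-- Attach the monitor to a warehouse:
ALTER WAREHOUSE my_wh SET RESOURCE_MONITOR = monthly_limit;
```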
If you are a little bit nerdy, or you would like to see some dashboards and charts on top of your spending and overall account activity, go and check the account usage views in the SNOWFLAKE database. It contains a lot of system views with valuable data that can give you more insight into overall account usage at different levels (queries, tasks, pipes, warehouses, MVs, etc.).
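For example, the WAREHOUSE_METERING_HISTORY view in the ACCOUNT_USAGE schema answers "which warehouse is burning my credits?" directly:

```sql
-- Credits consumed per warehouse over the last 30 days.
SELECT warehouse_name,
       SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits DESC;
```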
Materialized views — how to use them effectively
You probably know the concept of materialized views from other DBMS systems, so let's just highlight what you should know before using them in Snowflake. Materialized views are refreshed automatically. This means that if the underlying data of an MV changes very often, you can pay a significant number of credits just for the refreshes. That's why I would recommend using MVs only when the base table does not change frequently. If you are considering MVs, please go through the documentation first to be aware of all the other aspects.
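A sketch of a typical candidate, an aggregation over a slowly changing base table (table and view names are illustrative; note that materialized views require Enterprise Edition or higher):

```sql
-- Snowflake keeps this view up to date automatically; every change
-- to daily_sales consumes maintenance credits in the background,
-- so the base table should change infrequently.
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_sales_by_region AS
SELECT region,
       SUM(amount) AS total_amount
FROM daily_sales
GROUP BY region;
```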
Caching and impact on cost
Snowflake has several types of caches. All of them can have an impact on your costs and performance. If you want to be efficient, you should know how to use each of them in the best way.
Snowflake stores a lot of metadata about various objects (tables, views, staged files, micro-partitions, etc.) and events (COPY command history). This metadata can help you in certain situations, for instance with queries like SELECT COUNT(*) FROM my_table. You may notice that no virtual warehouse is shown for such a query in the history tab. That information is retrieved from the metadata cache, and as such it does not need any running virtual warehouse!
Local cache or SSD cache or virtual warehouse cache
You may see different names for this type of cache; all of them refer to a cache linked to a particular virtual warehouse instance. Data is stored locally on each warehouse. Suppose you have a warehouse called MY_WH. When you run a query for the first time, the data is cached locally. The next time you run a query that accesses some of the cached data, MY_WH can retrieve it directly from the local cache and save some time. You can see what has been retrieved from this cache in the query profile. Just be aware that the local cache is purged when you suspend the warehouse. This is where proper planning of the auto-suspend parameter comes into play: if you shut the warehouse down and start it again a minute or two later, you no longer have any data in the cache, and you will also be billed for the first full minute again.
Query Result Cache
The last type of cache is the query result cache. Every time you run a query, Snowflake stores the result. If you run exactly the same query within 24 hours, you will get the result from the query result cache (within milliseconds). There are more rules that need to be fulfilled for the query result cache to be used:
- you can't use functions like CURRENT_TIMESTAMP or CURRENT_DATE
- you can't use UDFs in your query
- the underlying data can't have changed since the last execution
Any of these would prevent the query result cache from being used. If you re-run the same query within 24 hours, Snowflake resets the internal clock, and the cached result will be available for another 24 hours. This can repeat for up to 31 days.
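When benchmarking, the result cache can hide the "real" run time of a query. A session parameter lets you switch it off temporarily, sketched below:

```sql
-- Disable result-cache reuse in the current session so the next run
-- reflects actual compute time, then re-enable it afterwards.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
-- ...run your benchmark query here and note the elapsed time...
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
-- Re-running the identical query now returns from the result cache
-- in milliseconds and does not consume warehouse compute.
```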
That would be all for the first part. Let's continue in the next blog post, which will be dedicated to saving storage costs and to performance tuning tips. See you there. 🙋♂️
And if you have any other tips on how to save Snowflake credits on compute resources, let me know!