Snowflake Cost Optimization Part II

This is the second part of a series on saving Snowflake credits and cost in general. You can find the first part, dedicated to compute cost savings, here.

Storage cost optimization

Even though the biggest part of your bill will be for compute, it is worth thinking about your storage cost as well. Snowflake offers several features that can help you keep storage at a reasonable level.

Zero-copy cloning

No more CTAS statements for creating development environments with production data or for making backup/testing tables! Thanks to Snowflake's architecture of immutable micro-partitions, you can create a clone of an object that is just a logical pointer to the same micro-partitions as the original, so no additional storage is needed for the cloned object. Suppose you have a table 10TB+ in size; copying that data from one place to another would also require a running warehouse for some time. With cloning you keep a single copy of the data until the moment you modify it in either the original or the cloned object. Only then does the changed data need to be stored again. This is a really effective, fast and cheap way to spin up development or testing environments, and it is purely a metadata operation. Not all Snowflake objects can be cloned, though. For example, external tables or internal stages can't be cloned. You can find more on cloning here: Cloning considerations
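
As a minimal sketch, cloning is a single DDL statement; the database and table names below are only placeholders:

    -- clone a single table for testing (no extra storage until the data diverges)
    CREATE TABLE dev_db.public.orders_clone CLONE prod_db.public.orders;

    -- or clone a whole database to spin up a development environment
    CREATE DATABASE dev_db CLONE prod_db;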

Time travel

This is a nice feature that lets you look at data at any point in time in the past. You can go up to 90 days back (Enterprise edition); on Standard edition it is just one day. The default value is 1 and you can modify it according to your needs. There is an object parameter called DATA_RETENTION_TIME_IN_DAYS which allows you to define Time Travel at the database, schema or table level. Logically, more days available for time travel means higher storage cost. For me the default of 1 day works. More on Time Travel here: Understanding & Using Time Travel (Snowflake Documentation)
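
Retention can be tuned per object; a small sketch with placeholder object names:

    -- keep 1 day of time travel on a production table (the default)
    ALTER TABLE prod_db.sales.orders SET DATA_RETENTION_TIME_IN_DAYS = 1;

    -- turn time travel off completely for a scratch schema
    ALTER SCHEMA dev_db.scratch SET DATA_RETENTION_TIME_IN_DAYS = 0;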

Data lifecycle in Snowflake

Use transient and temporary DB objects

This is another feature that can help with storage cost optimization. Try to use transient tables, schemas or databases for lower environments. Transient objects have no fail-safe period, which saves cost, and Time Travel can be at most 1 day. If you define a database or schema as transient, all tables created inside it will be transient by default. If you only need to store temporary data available within a session (e.g. ETL working data), you can go even further and use temporary tables, which exist only within the session in which they were created.

Generally, the cost saving behind these non-permanent object types comes from omitting the fail-safe storage and reducing the time travel storage.
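
Creating these object types is just a keyword away; a minimal sketch, with hypothetical names:

    -- everything created in a transient schema is transient by default
    CREATE TRANSIENT SCHEMA dev_db.staging;

    -- session-scoped working table for an ETL step, dropped automatically at session end
    CREATE TEMPORARY TABLE etl_working_orders (
        order_id  NUMBER,
        loaded_at TIMESTAMP_NTZ
    );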

Utilize external tables

External tables make sense in some scenarios and can save you both storage and compute cost. I use external tables for use cases where:

  • there is a lot of data (billions of records per day/week/etc.)
  • the data is not used frequently

It does not make sense to load billions of records every day if they will only be used on an ad hoc basis or a couple of times per month. If the data is already offloaded to some external stage (S3, Azure Blob Storage), it is better to define an external table on top of that stage. You save the storage that would be needed to hold the data again in Snowflake, and you also save the compute that would be used for ingesting it. Thanks to external tables you can still use the power of Snowflake's computation engine directly on external storage.
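
A sketch of an external table defined over an existing stage; the stage, table and column names, as well as the file layout, are assumptions for illustration:

    -- external table on top of Parquet files already sitting in an external stage;
    -- assumes files are laid out as .../events/<YYYY-MM-DD>/file.parquet
    CREATE OR REPLACE EXTERNAL TABLE analytics.events_ext (
        event_date DATE    AS TO_DATE(SPLIT_PART(METADATA$FILENAME, '/', 3)),
        payload    VARIANT AS (VALUE)
    )
    PARTITION BY (event_date)
    LOCATION = @analytics.events_stage/events/
    FILE_FORMAT = (TYPE = PARQUET)
    AUTO_REFRESH = FALSE;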

Utilize secure data sharing

The last tip for saving storage cost is related to data sharing. I am sure everyone remembers how data has been, and probably still is, shared in data projects. You export the data to some external storage and then transfer it via SFTP to your customers, partners or coworkers. Or you take a different approach and maintain sharing tables/schemas or data marts holding the portion of the data related to a particular partner account. In all such scenarios you end up with multiple copies of the same data in different places.

We can do this more wisely with Snowflake and its Secure Data Sharing feature. This is one of my favourite features because it can be used in many different situations, including to increase the security of your application.

Thanks to data sharing you have a single copy of your data which you can share with anyone. If the data consumer does not have a Snowflake account, you can even create a reader account for them. Secure Data Sharing makes maintenance of shared data easy: you maintain everything from a single place. No more logic scattered between the database, external storage, transfer scripts, etc. Granting or revoking access is a matter of a single SQL statement, and auditing the access can again be done easily with SQL.
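
As a sketch, with hypothetical database, table, share and account names:

    -- create a share and expose one table through it
    CREATE SHARE partner_share;
    GRANT USAGE ON DATABASE analytics TO SHARE partner_share;
    GRANT USAGE ON SCHEMA analytics.reporting TO SHARE partner_share;
    GRANT SELECT ON TABLE analytics.reporting.daily_kpis TO SHARE partner_share;

    -- make the share visible to the consumer account
    ALTER SHARE partner_share ADD ACCOUNTS = partner_org.partner_account;

    -- revoking access is just as simple
    REVOKE SELECT ON TABLE analytics.reporting.daily_kpis FROM SHARE partner_share;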

If you do not use Secure Data Sharing yet, I really recommend trying it out, not only because of the possible cost savings but because it makes so many things easier.

Data Sharing at high level

Query performance tuning — is it necessary?

You might know this from other databases. I have done plenty of it together with DBAs on Netezza or Teradata: improving table skew, defining indexes and lots of other work that is not fun at all, all of it just to improve performance and utilize the hardware as well as we could. Is something like this necessary in Snowflake? Short answer: no. 🙂 I have been working with Snowflake since 2019 across several projects and it has never been needed. If you are an experienced data/SQL developer and you think about what you are doing (e.g. not using Cartesian joins and similar anti-patterns), you should be fine. But of course there might be situations when you are not satisfied with the performance or think it could be better. Let's have a look at what you can do to find out what is wrong.

As anywhere else, you should start by inspecting the query profile.

Query profile example

Pruning

This is one of the important query profile metrics. It tells you how many micro-partitions the table consists of and how many had to be scanned in order to produce the result of your query. Effective pruning means you access as few partitions as possible. If all or the majority of the partitions have to be scanned, it could mean that pruning does not work effectively or that your query is not optimal. You can then start thinking about how to improve the way data is stored for the given table, for example sorting the data on insert or defining clustering keys.
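
Besides the query profile UI, pruning can also be inspected in bulk from the query history view; a sketch, assuming access to the SNOWFLAKE.ACCOUNT_USAGE schema:

    -- find recent queries that scanned a large share of a table's micro-partitions
    SELECT query_id,
           partitions_scanned,
           partitions_total,
           partitions_scanned / NULLIF(partitions_total, 0) AS scan_ratio
    FROM snowflake.account_usage.query_history
    WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
      AND partitions_total > 0
    ORDER BY scan_ratio DESC
    LIMIT 20;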

A detailed explanation of micro-partitions and clustering is out of scope for this blog post, but if you want to know more you should check the following documentation page: Micro-partitions & Data Clustering.

Effective pruning example from query profile

Monitor the workload on WH — spilling

If you have noticed worse performance of your queries, it can also make sense to take a look at how your warehouse is utilized. There is another attribute in the query profile that can help you here: spilling, which can happen to local storage or to remote storage. It means the available memory of the virtual warehouse is not sufficient to hold the intermediate results, so the engine starts writing data first to the local disk and, if that is not enough, to remote storage. All of that slows down processing because it adds I/O operations. When you spot spilling in the query profile it is a good indicator that you need a bigger warehouse with more memory and a larger local disk, and you will most probably get your results faster.
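
Spilling can also be spotted across the whole workload from the query history; a sketch, again assuming access to the ACCOUNT_USAGE schema:

    -- queries from the last week that spilled to local or remote storage
    SELECT query_id,
           warehouse_name,
           bytes_spilled_to_local_storage,
           bytes_spilled_to_remote_storage
    FROM snowflake.account_usage.query_history
    WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
      AND (bytes_spilled_to_local_storage > 0
           OR bytes_spilled_to_remote_storage > 0)
    ORDER BY bytes_spilled_to_remote_storage DESC;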

Another indicator of warehouse utilization is query queueing, which means there are not enough resources to run another query. You can't solve this with a bigger warehouse; instead you need to define a multi-cluster warehouse (there is a sketch after the list below). There are also two parameters which can help you with query processing and concurrency:

  1. STATEMENT_QUEUED_TIMEOUT_IN_SECONDS

This sets how long a query can stay queued before it times out. The default value is no timeout at all.

  2. STATEMENT_TIMEOUT_IN_SECONDS

This parameter says how long a query can run before it is killed. The default value is two days. I would recommend changing this value especially for virtual warehouses used by people, but it can also be worth changing for warehouses used by downstream/upstream apps, so it cannot happen that a user triggers some query or dashboard refresh and lets it run for hours. I consider 1 hour a reasonable limit for a running query, but it could be totally different for you; just review your use cases and set the parameter accordingly. You can set it at different levels: for the whole account, a session or a user, but it can also be set for an individual warehouse.
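
A sketch of both ideas, with placeholder names and limits:

    -- a multi-cluster warehouse deals with queueing by adding clusters, not by resizing
    CREATE WAREHOUSE IF NOT EXISTS etl_wh WITH
        WAREHOUSE_SIZE = 'MEDIUM'
        MIN_CLUSTER_COUNT = 1
        MAX_CLUSTER_COUNT = 3
        SCALING_POLICY = 'STANDARD';

    -- timeouts: give up after 10 minutes in the queue, kill queries running over 1 hour
    ALTER WAREHOUSE etl_wh SET
        STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 600,
        STATEMENT_TIMEOUT_IN_SECONDS = 3600;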

What else is worth to think about

I have two more things in mind which you should not miss. The first one is virtual warehouse configuration for users. You surely have many roles participating in a project, from business analysts through data engineers up to data scientists and others. All of them need to interact with data and all of them need some virtual warehouse for their queries. My point here is: do not give them freedom when it comes to virtual warehouse scaling and configuration options. Prepare one or two warehouses for them, in different sizes, define all the important parameters like auto-suspend, statement timeout and multi-cluster strategy (if needed), and hand them over for use. Users should not have the possibility to update the warehouse configuration, otherwise they will just scale the warehouse up or use a bigger one because it is faster. Been there, done that 😃 Strict rules are necessary.
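
A sketch of a locked-down user warehouse; the warehouse name, role and values are placeholders:

    -- pre-configured warehouse handed over to analysts
    CREATE WAREHOUSE analyst_wh WITH
        WAREHOUSE_SIZE = 'SMALL'
        AUTO_SUSPEND = 60                        -- suspend after 60 seconds of inactivity
        AUTO_RESUME = TRUE
        INITIALLY_SUSPENDED = TRUE
        STATEMENT_TIMEOUT_IN_SECONDS = 3600;

    -- grant only USAGE (not MODIFY or OPERATE) so users cannot resize or reconfigure it
    GRANT USAGE ON WAREHOUSE analyst_wh TO ROLE analyst_role;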

Another point is more related to the overall architecture and data ingestion strategy. Consider doing delta loads; it might significantly improve both performance and cost. This depends heavily on the concrete use case, your source systems and the overall architecture and technology stack. There are use cases where it does not make sense or is not possible, but you should always at least think about it. Change data capture (CDC) is a complex discipline which requires cooperation between multiple layers. Snowflake offers table streams for this purpose: thanks to table streams you are able to track which data has changed in a table since the last processing.
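
A minimal sketch of change tracking with a stream; the table and column names are only placeholders:

    -- the stream records what has changed in the source table since it was last consumed
    CREATE STREAM orders_changes ON TABLE raw.orders;

    -- later, process only the delta; reading the stream in a DML statement advances its offset
    INSERT INTO staging.orders_delta (order_id, amount)
    SELECT order_id, amount
    FROM orders_changes
    WHERE METADATA$ACTION = 'INSERT';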

Summary

That's all! I have tried to cover the most important things related to saving Snowflake credits and storage costs. I hope you liked both parts and learnt something useful; all of it comes from my own real experience. Cost optimization is a crucial part of modern data projects based on SaaS solutions like Snowflake. All those unlimited, immediate scaling options and the practically infinite storage are nice and open the door to new use cases, but you need to know the other side too: everything has its price. Keep this in mind when designing solutions. Every customer wants an effective solution at a reasonable price.

Do you have any other cost-saving tips or recommendations? Share them with me! 🙋‍♂️
