Optimize data storage costs by 70% using Databricks, Snowflake & AWS S3
Key Takeaways
- Our current Snowflake storage rate is $23/TB/month; based on the trends in our current data volume, we will end up paying Snowflake around $850K for the year, with the monthly bill upwards of $90K by the end of December 2024
- Estimated annual savings of ~70%, or around $150K (on 750 TB of data), in storage costs when we offload the data from Snowflake into Databricks’ Delta UniForm table format, which gives us the ability to read the data back into Snowflake as Iceberg tables
- ZStandard compression combined with sorting reduces the data size by around 75%, with no data loss and nearly the same query performance as the managed Snowflake tables
- Next step: build jobs to offload the ongoing data as well and read it back as external tables, which can add significantly to the cost savings
Background
- The data warehouse has grown rapidly over the last few years, both in the number of users accessing the data and in data volume. We currently store around 2.8 PB, and our Snowflake storage is growing around 5% per month on average
- We have copious amounts of data in Snowflake that is infrequently used (roughly once a year for most tables); this data can be archived in AWS S3 to save costs and also to build out our data lake, which is likely our future state
- The current Snowflake storage cost is $23/TB/month (https://www.snowflake.com/pricing/pricing-guide/), i.e., we will end up paying around $850K/year, or upwards of $90K per month, by the end of December 2024
Current storage (petabytes) within Snowflake
Future state by end of 2024 (projected based on 5% avg increase)
Storage efficiency initiatives
Option 1: (Not recommended) Offload to S3 as parquet and read as an external table
High Level Design
- Move the historical data from Snowflake into AWS S3 bucket
- Create external table in Snowflake on the archived s3 data
- Use a UNION ALL query if there is a need to query current + historical data in any ad hoc analysis, as sketched below
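Here is a minimal sketch of what that could look like in Snowflake. The stage, integration, table, and column names are placeholders for illustration, not our actual objects:
-- External stage pointing at the archived Parquet files (placeholder names)
CREATE OR REPLACE STAGE archive_stage
  URL = 's3://<archive_bucket>/user_daily/'
  STORAGE_INTEGRATION = <s3_integration>;
-- External table over the archived data, exposing a couple of virtual columns
CREATE OR REPLACE EXTERNAL TABLE user_daily_archive
(
  DATE DATE AS (value:DATE::date),
  USER_ID STRING AS (value:USER_ID::string)
)
LOCATION = @archive_stage
FILE_FORMAT = (TYPE = PARQUET)
AUTO_REFRESH = FALSE;
-- Current + historical data in a single ad hoc query
SELECT DATE, USER_ID FROM user_daily_current
UNION ALL
SELECT DATE, USER_ID FROM user_daily_archive;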
Who will access and use the data?
- Data & Analytics team for Tableau reporting and ad hoc analysis
- External stakeholders who want to query the data from multiple sources
We pay $23/TB per month in Snowflake, whereas S3 storage has different pricing tiers
(S3 costs depend on the storage class and the volume of GET/PUT/LIST requests we expect for our current needs. More details at https://aws.amazon.com/s3/storage-classes/)
Storage analysis — user daily data
- One of the largest tables in our Snowflake environment holds our user daily metrics. It contains product usage information and gives us insight into customer adoption, which is a key part of our internal reporting
- The table is around 750 TB (and growing at ~1.5 TB per day) as of March 2024. We do not query the historical data frequently, except for the occasional historical restatement, which is very unlikely
Let us consider the costs to offload around 750 TB of data from Snowflake
Snowflake — 750 TB * $23/TB/month = $17,250/mo or $207,000/year
AWS S3 (Glacier tier) — 750 TB * $4.3/TB/month = $3,225/mo or $38,700/year
Savings — $168,300/year
- We see around an ~81% reduction in storage costs for the user daily historical data, which is queried very rarely (per existing Snowflake query history patterns)
- Sounds pretty exciting, however there is a big BUT, which we shall see in the disadvantages
Considerations
There is a one-time cost to move the data from Snowflake to S3: the existing virtual warehouse is used for the offload, which incurs Snowflake credits based on how long it takes. For big tables this cost will be a bit higher, but we break even and eventually see cost savings after about a month of the data being archived in S3
External tables on Databricks and reading using spark
- We can use the same data to create external tables in both Snowflake and Databricks; Databricks can query multiple data sources and can be more powerful for data analysis going forward
- Since the data is stored in the Parquet file format, it is a good source for building any ML-related data frames, as most modern data infrastructure reads Parquet efficiently and pushes down filters before further processing
Disadvantages
- If we read this archived data from the Glacier Instant Retrieval tier, or if the data has moved into Glacier Deep Archive based on the lifecycle rules we configured, it becomes very expensive at scale to restore it and query it for any quarterly/yearly analysis
- AWS S3 charges $30/TB to read data back from the Glacier tier, which can negate all our savings if we read a large volume of data, and it also takes 3–12 hours for AWS to restore the data
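To put a number on it: a single full restore of the 750 TB table at $30/TB works out to roughly $22,500, about a month and a half of the storage savings calculated above, on top of the 3–12 hour wait.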
Option 2: Databricks, Snowflake and Delta tables
Leverage the Delta UniForm table format by processing the data in Databricks using ZStandard compression and sorting on Parquet. Sounds too technical? I know, right!! But it's easy to understand once we get the hang of it
High Level Design
- Move the historical data from Snowflake into AWS S3 bucket
- Spin up Databricks clusters to load this data from S3 into the Delta lake, leveraging Databricks’ UniForm format and ZSTD compression with sorting on Parquet (https://docs.databricks.com/en/delta/uniform.html)
- Generate the Iceberg catalog from the Delta UniForm table created in the step above
- Read the external Iceberg table back in Snowflake, with significantly better query execution efficiency than reading an external Parquet table (https://docs.snowflake.com/en/sql-reference/sql/create-iceberg-table)
Storage analysis — user daily data
- We will use the same user daily table for this analysis too and see how this approach is far more efficient than option 1
- Let's take one day's worth of data, June 1st 2023, for this analysis. We exported that day into a new table for a better understanding of the data; the table has 22.9 billion rows and occupies 1.3 TB in Snowflake
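For reference, a minimal sketch of how that one-day extract could be materialized; the source table name is a placeholder, while AUD_OFFLOAD is the staging table used in the export below:
CREATE TABLE <db>.<schema>.AUD_OFFLOAD AS
SELECT *
FROM <db>.<schema>.<user_daily_source_table>
WHERE DATE = '2023-06-01';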
Step 1 — Export the data from Snowflake into S3 as parquet
This can be done using Snowflake's COPY INTO command:
COPY INTO <s3_stage>
FROM
(
SELECT * FROM <db>.<schema>.AUD_OFFLOAD
WHERE DATE = '2023-06-01'
)
FILE_FORMAT =
(
TYPE = PARQUET
COMPRESSION = SNAPPY
)
MAX_FILE_SIZE = 245000000
HEADER = TRUE
OVERWRITE = TRUE;
Step 2 — Understand the size of the offloaded data in S3
An interesting observation here is that the size increased from 1.3 TB to 2.4 TB. Snowflake applies its own compression techniques under the hood, but when we export the data out as Parquet it simply bloats up. This is another disadvantage of storing the data as plain Parquet for archival purposes: we end up paying more than expected even though the data is archived in the S3 Glacier tier
Step 3: Convert the parquet data into a delta table in Databricks
Spin up a Databricks cluster to read the Parquet data from S3 and load it into a Databricks table that has the Delta UniForm property enabled
We run the cluster with the compression specified as Snappy; Spark then writes the underlying Parquet files for the Delta format with Snappy compression
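As a minimal sketch of what that codec setting looks like, assuming the standard Spark property (shown here as a session-level SET, which is equivalent to putting it in the cluster's Spark config):
-- Set in the cluster's Spark config as: spark.sql.parquet.compression.codec snappy
-- or at the session level before the load:
SET spark.sql.parquet.compression.codec = snappy;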
Step 4 — Create a delta uniform enabled table in Databricks
Notice the TBLPROPERTIES below. This enables us to generate an Iceberg catalog alongside the Delta catalog, so we can read the data from Snowflake or any other compute engine of our choice. More details at https://docs.databricks.com/en/delta/uniform.html
CREATE TABLE <catalog>.<schema>.user_daily_snappy_load
(
  col_1,
  col_2,
  .
  .
  .
  col_n
)
USING DELTA
PARTITIONED BY (DATE)
LOCATION 's3://some_path'
TBLPROPERTIES (
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
We can see above that creating the Delta table generated two catalogs within the S3 path, one for the Delta format and the other for Iceberg
Snappy compression for loading the data
We are leveraging Databricks Workflows to run a simple task that takes the data from our S3 path, reads it into Spark memory, and writes it back into the Delta table using Snappy compression
insert into user_daily_snappy_load
select * from <s3_parquet_data>
The resulting data set is around 2.7 TB. Even though the data is exactly the same, this actually led to a size increase compared to the original Parquet data that we exported out of Snowflake
ZSTD compression for loading the data
- ZSTD (ZStandard) is a compression algorithm developed by Meta. More details at https://facebook.github.io/zstd/. We are not going into the details of how the compression works under the hood, but TL;DR: it compresses more effectively than Snappy
- Let's try the same exercise with the compression set to ZSTD instead of Snappy when creating the Databricks cluster, as shown below
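The only change from the Snappy run is the codec; again a minimal sketch assuming the standard Spark property (cluster-level or session-level):
-- Set in the cluster's Spark config as: spark.sql.parquet.compression.codec zstd
SET spark.sql.parquet.compression.codec = zstd;
With that in place, the load itself is unchanged: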
insert into user_daily_zstd_load
select * from <s3_parquet_data>
We can see that the size is reduced by 44% vs Snappy, but it is still larger than the original size in Snowflake
ZSTD with sorting to further compress the data
- Our user daily table contains a few low-cardinality fields, but it still has the 22.9B rows that we ingested
- Parquet uses run-length encoding (RLE) to store data very compactly when similar values sit next to each other, so providing the data in sorted order groups repeated values together and lets RLE do its work. This article gives a pretty good explanation of RLE in Parquet and how it works
- To leverage this property, we needed to work out the right column ordering for the sort. After much trial and error, we stumbled upon a column order with high-cardinality columns at the beginning of the ORDER BY and low-cardinality columns towards the end. This final reordering and sorting gave us the best benchmark out of all the compression tests we ran
- All the job settings were exactly the same as the ZSTD run, except for adding an ORDER BY clause while inserting the data into the Delta table
insert into user_daily_zstd_sorting_load
select * from <s3_parquet_data>
order by col_1, col_2, col_3, ..., col_n
The results seemed too good to be true, but the sorting plus ZSTD compression reduced the data size by 78%, to 282.5 GB
We quickly ran tests to verify this, and our queries yielded exactly the same results as the original data in Snowflake
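The checks were simple spot checks along these lines (column names are placeholders), run against the Delta table in Databricks and against the original table in Snowflake, with the outputs compared:
-- Row count and a couple of aggregates; column names are hypothetical
select count(*) as row_count,
       count(distinct user_id) as distinct_users,
       sum(some_metric) as metric_total
from user_daily_zstd_sorting_load
where DATE = '2023-06-01';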
Read the data back into Snowflake using the iceberg catalog
We won't go into the steps required to establish connectivity between the Delta table and Snowflake in this blog post. An article from Databricks nicely outlines all the relevant information required for the setup, which we followed in order to read the data into Snowflake
Assuming the integration is done, we can start querying the data.
Get the iceberg catalog metadata location from Databricks
Create a table in Snowflake using the iceberg metadata location
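For reference, the statement is roughly of the following shape; the external volume, catalog integration, and metadata file path are placeholders that come from the integration setup described in the Databricks article:
CREATE ICEBERG TABLE user_daily_iceberg
  EXTERNAL_VOLUME = '<s3_external_volume>'
  CATALOG = '<object_store_catalog_integration>'
  METADATA_FILE_PATH = '<relative/path/to/vN.metadata.json>';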
Querying the data gives the exact same results as expected
Let's create a table in Snowflake using this external Iceberg table
We can see here that the table was created with the ZSTD-sorted table size but contains exactly the same data as the original Snowflake table
This is an even better scenario because we can bring the data back into Snowflake when needed, but with a ~70% reduction in size, and the performance of downstream queries stayed within a 10% threshold of using the original Snowflake-managed table
1 year cost savings for storing the data in compressed format
- We offloaded the data from Jan 2023 to Apr 2023. While we do not see any savings in the first month due to the Databricks and Snowflake charges for the offload and conversion, the real benefit starts from the second month: we save around 74% in storage costs overall with the above compression
- This also gives us the ability to read the data in Databricks as well, which in turn helps us migrate some of our processes into Databricks
- The user daily table is around 750 TB in Snowflake currently; once we offload the full table into Delta, we will save around $150K/year in storage costs
- We have identified a few more tables, with an estimated size of 340 TB, that can be offloaded in the same manner, giving us additional savings of around $68K/year
A few considerations
- We tried running the Snowflake export COPY command with an ORDER BY clause added. This resulted in a runtime of ~50 minutes to export one day of data, vs 10–11 minutes on average per day without the ORDER BY. While the export size came down from ~2.4 TB to ~450 GB per day (tested for a sample date of 2023-09-01), the extra compute cost negates the space savings, since storage is cheap
- Writing the Delta table with Snappy compression plus ORDER BY, instead of ZSTD plus ORDER BY, gave us 499 GB vs 285 GB, i.e. ZSTD is roughly 40% smaller. While Snappy gives excellent read performance vs ZSTD-compressed data, this is still fine with us because this is historical data that we consume only for quarterly/annual analysis. The storage savings outweigh the cost of the one or two queries that happen in a year.