Optimize data storage costs by 70% using Databricks, Snowflake & AWS S3
Key Takeaways
- Our current Snowflake storage rate is $23/TB/month; based on the trends in our current data volume, we will end up paying Snowflake around $850K for the year, with the monthly bill upwards of $90K by the end of December 2024
- Estimated annual savings of ~70%, or around $150K (on 750 TB of data), in storage costs when we offload the data from Snowflake into Databricks’ Delta UniForm table format, which gives us the ability to read the data back into Snowflake as Iceberg tables
- ZStandard compression combined with sorting reduces the data size by around 75%, with no data loss and nearly the same query performance as the managed Snowflake tables
- Next step: build jobs to offload the ongoing data as well and read it back as external tables, which can add significantly to the cost savings
Background
- The data warehouse has grown rapidly over the last few years, both in the number of users accessing the data and in data volume. We currently store around 2.8 PB, and our Snowflake storage is growing around 5% per month on average
- We have copious amounts of data in Snowflake that is infrequently used (roughly once a year for most tables); this data can be archived in AWS S3 to save costs and also to build out our data lake, which is likely our future state
- The current Snowflake storage cost is $23/TB/month (https://www.snowflake.com/pricing/pricing-guide/), i.e., we will end up paying around $850K/year, or upwards of $90K per month, by the end of December 2024
Current storage (petabytes) within Snowflake
Future state by end of 2024 (projected based on 5% avg increase)
Storage efficiency initiatives
Option 1: (Not recommended) Offload to S3 as parquet and read as an external table
High Level Design
- Move the historical data from Snowflake into AWS S3 bucket
- Create external table in Snowflake on the archived s3 data
- Use a UNION ALL query if there is a need to query current + historical data in any ad hoc analysis, as sketched below
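Here is a minimal sketch of what that could look like in Snowflake. The stage, integration, table, and column names are placeholders for illustration, not our actual objects:
-- External stage pointing at the archived Parquet files (placeholder names)
CREATE OR REPLACE STAGE archive_stage
  URL = 's3://<archive_bucket>/user_daily/'
  STORAGE_INTEGRATION = <s3_integration>;
-- External table over the archived data, exposing a couple of virtual columns
CREATE OR REPLACE EXTERNAL TABLE user_daily_archive
(
  DATE DATE AS (value:DATE::date),
  USER_ID STRING AS (value:USER_ID::string)
)
LOCATION = @archive_stage
FILE_FORMAT = (TYPE = PARQUET)
AUTO_REFRESH = FALSE;
-- Current + historical data in a single ad hoc query
SELECT DATE, USER_ID FROM user_daily_current
UNION ALL
SELECT DATE, USER_ID FROM user_daily_archive;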
Who will access and use the data?
- Data & Analytics team for Tableau reporting and ad hoc analysis
- External stakeholders who want to query the data from multiple sources
We pay $23/TB per month in Snowflake, whereas S3 storage has different pricing tiers
(S3 costs depend on the storage class and the volume of GET/PUT/LIST requests we expect for our current needs. More details at https://aws.amazon.com/s3/storage-classes/)
Storage analysis — user daily data
- One of the largest tables in our Snowflake environment holds our user daily metrics. It contains product usage information and gives us insight into customer adoption, which is a key part of our internal reporting
- The table is around 750 TB (and growing at ~1.5 TB per day) as of March 2024. We do not query the historical data frequently, except for the occasional historical restatement, which is very unlikely
Let us consider the costs to offload around 750 TB of data from Snowflake
Snowflake — 750 TB * $23/TB/month = $17,250/mo or $207,000/year
AWS S3 (Glacier tier) — 750 TB * $4.3/TB/month = $3,225/mo or $38,700/year
Savings — $168,300/year
- We see around an ~81% reduction in storage costs for the user daily historical data, which is queried very rarely (per existing Snowflake query history patterns)
- Sounds pretty exciting, however there is a big BUT, which we shall see in the disadvantages
Considerations
There is a one-time cost to move the data from Snowflake to S3: the existing virtual warehouse is used for the offload, which incurs Snowflake credits based on how long it takes. For big tables this cost will be a bit higher, but we break even and eventually see cost savings after about a month of the data being archived in S3
External tables on Databricks and reading using spark
- We can use the same data to create external tables in both Snowflake and Databricks; Databricks can query multiple data sources and can be more powerful for data analysis going forward
- Since the data is stored in the Parquet file format, it is a good source for building any ML-related data frames, as most modern data infrastructure reads Parquet efficiently and pushes down filters before further processing
Disadvantages
- If we read this archived data from the Glacier Instant Retrieval tier, or if the data has moved into Glacier Deep Archive based on the lifecycle rules we configured, it becomes very expensive at scale to restore it and query it for any quarterly/yearly analysis
- AWS S3 charges $30/TB to read data back from the Glacier tier, which can negate all our savings if we read a large volume of data, and it also takes 3–12 hours for AWS to restore the data
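To put a number on it: a single full restore of the 750 TB table at $30/TB works out to roughly $22,500, about a month and a half of the storage savings calculated above, on top of the 3–12 hour wait.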
Option 2: Databricks, Snowflake and Delta tables
Leverage the Delta UniForm table format by processing the data in Databricks using ZStandard compression and sorting on Parquet. Sounds too technical? I know, right!! But it's easy to understand once we get the hang of it
High Level Design
- Move the historical data from Snowflake into AWS S3 bucket
- Spin up Databricks clusters to load this data from S3 into the Delta lake, leveraging Databricks’ UniForm format and ZSTD compression with sorting on Parquet (https://docs.databricks.com/en/delta/uniform.html)
- Generate the Iceberg catalog from the Delta UniForm table created in the step above
- Read the external Iceberg table back in Snowflake, with significantly better query execution efficiency than reading an external Parquet table (https://docs.snowflake.com/en/sql-reference/sql/create-iceberg-table)
Storage analysis — user daily data
- We will use the same user daily table for this analysis too and see how this approach is far more efficient than option 1
- Let's take one day's worth of data, June 1st 2023, for this analysis. We exported that day into a new table for a better understanding of the data; the table has 22.9 billion rows and occupies 1.3 TB in Snowflake
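For reference, a minimal sketch of how that one-day extract could be materialized; the source table name is a placeholder, while AUD_OFFLOAD is the staging table used in the export below:
CREATE TABLE <db>.<schema>.AUD_OFFLOAD AS
SELECT *
FROM <db>.<schema>.<user_daily_source_table>
WHERE DATE = '2023-06-01';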
Step 1 — Export the data from Snowflake into S3 as parquet
This can be done using Snowflake's COPY INTO command:
COPY INTO <s3_stage>
FROM
(
SELECT * FROM <db>.<schema>.AUD_OFFLOAD
WHERE DATE = '2023-06-01'
)
FILE_FORMAT =
(
TYPE = PARQUET
COMPRESSION = SNAPPY
)
MAX_FILE_SIZE = 245000000
HEADER = TRUE
OVERWRITE = TRUE;
Step 2 — Understand the size of the offloaded data in S3
An interesting observation here is that the size increased from 1.3 TB to 2.4 TB. Snowflake applies its own compression techniques under the hood, but when we export the data out as Parquet it simply bloats up. This is another disadvantage of storing the data as plain Parquet for archival purposes: we end up paying more than expected even though the data is archived in the S3 Glacier tier
Step 3: Convert the parquet data into a delta table in Databricks
Spin up a Databricks cluster to read the Parquet data from S3 and load it into a Databricks table that has the Delta UniForm property enabled
We run the cluster with the compression specified as Snappy; Spark then writes the underlying Parquet files for the Delta format with Snappy compression
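As a minimal sketch of what that codec setting looks like, assuming the standard Spark property (shown here as a session-level SET, which is equivalent to putting it in the cluster's Spark config):
-- Set in the cluster's Spark config as: spark.sql.parquet.compression.codec snappy
-- or at the session level before the load:
SET spark.sql.parquet.compression.codec = snappy;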
Step 4 — Create a delta uniform enabled table in Databricks
Notice the TBLPROPERTIES below. This enables us to generate an Iceberg catalog alongside the Delta catalog, so we can read the data from Snowflake or any other compute engine of our choice. More details at https://docs.databricks.com/en/delta/uniform.html
CREATE TABLE <catalog>.<schema>.user_daily_snappy_load
(
  col_1,
  col_2,
  .
  .
  .
  col_n
)
USING DELTA
PARTITIONED BY (DATE)
LOCATION 's3://some_path'
TBLPROPERTIES (
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
We can see above that creating the Delta table generated two catalogs within the S3 path, one for the Delta format and the other for Iceberg
Snappy compression for loading the data
We are leveraging Databricks Workflows to run a simple task that takes the data from our S3 path, reads it into Spark memory, and writes it back into the Delta table using Snappy compression
insert into user_daily_snappy_load
select * from <s3_parquet_data>
The resulting data set is around 2.7 TB. Even though the data is exactly the same, this actually led to a size increase compared to the original Parquet data that we exported out of Snowflake
ZSTD compression for loading the data
- ZSTD (ZStandard) is a compression algorithm developed by Meta. More details at https://facebook.github.io/zstd/. We are not going into the details of how the compression works under the hood, but TL;DR: it compresses more effectively than Snappy
- Let's try the same exercise with the compression set to ZSTD instead of Snappy when creating the Databricks cluster, as shown below
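The only change from the Snappy run is the codec; again a minimal sketch assuming the standard Spark property (cluster-level or session-level):
-- Set in the cluster's Spark config as: spark.sql.parquet.compression.codec zstd
SET spark.sql.parquet.compression.codec = zstd;
With that in place, the load itself is unchanged: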
insert into user_daily_zstd_load
select * from <s3_parquet_data>
We can see that the size is reduced by 44% vs Snappy, but it is still larger than the original size in Snowflake
ZSTD with sorting to further compress the data
- Our user daily table contains a few low-cardinality fields, but it still has the 22.9B rows that we ingested
- Parquet uses run-length encoding (RLE) to store data very compactly when similar values sit next to each other, so providing the data in sorted order groups repeated values together and lets RLE do its work. This article gives a pretty good explanation of RLE in Parquet and how it works
- To leverage this property, we needed to work out the right column ordering for the sort. After much trial and error, we stumbled upon a column order with high-cardinality columns at the beginning of the ORDER BY and low-cardinality columns towards the end. This final reordering and sorting gave us the best benchmark out of all the compression tests we ran
- All the job settings were exactly the same as the ZSTD run, except for adding an ORDER BY clause while inserting the data into the Delta table
insert into user_daily_zstd_sorting_load
select * from <s3_parquet_data>
order by col_1, col_2, col_3, ..., col_n
The results seemed too good to be true, but the sorting plus ZSTD compression reduced the data size by 78%, to 282.5 GB
We quickly ran tests to verify this, and our queries yielded exactly the same results as the original data in Snowflake
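The checks were simple spot checks along these lines (column names are placeholders), run against the Delta table in Databricks and against the original table in Snowflake, with the outputs compared:
-- Row count and a couple of aggregates; column names are hypothetical
select count(*) as row_count,
       count(distinct user_id) as distinct_users,
       sum(some_metric) as metric_total
from user_daily_zstd_sorting_load
where DATE = '2023-06-01';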
Read the data back into Snowflake using the iceberg catalog
We won't go into the steps required to establish connectivity between the Delta table and Snowflake in this blog post. An article from Databricks nicely outlines all the relevant information required for the setup, which we followed in order to read the data into Snowflake
Assuming the integration is done, we can start querying the data.
Get the iceberg catalog metadata location from Databricks
Create a table in Snowflake using the iceberg metadata location
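For reference, the statement is roughly of the following shape; the external volume, catalog integration, and metadata file path are placeholders that come from the integration setup described in the Databricks article:
CREATE ICEBERG TABLE user_daily_iceberg
  EXTERNAL_VOLUME = '<s3_external_volume>'
  CATALOG = '<object_store_catalog_integration>'
  METADATA_FILE_PATH = '<relative/path/to/vN.metadata.json>';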
Querying the data gives the exact same results as expected
Let's create a table in Snowflake using this external Iceberg table
We can see here that the table was created with the ZSTD-sorted table size but contains exactly the same data as the original Snowflake table
This is an even better scenario because we can bring the data back into Snowflake when needed, but with a ~70% reduction in size, and the performance of downstream queries stayed within a 10% threshold of using the original Snowflake-managed table
1 year cost savings for storing the data in compressed format
- We offloaded the data from Jan 2023 to Apr 2023. While we do not see any savings in the first month due to the Databricks and Snowflake charges for the offload and conversion, the real benefit starts from the second month: we save around 74% in storage costs overall with the above compression
- This also gives us the ability to read the data in Databricks as well, which in turn helps us migrate some of our processes into Databricks
- The user daily table is around 750 TB in Snowflake currently; once we offload the full table into Delta, we will save around $150K/year in storage costs
- We have identified a few more tables, with an estimated size of 340 TB, that can be offloaded in the same manner, giving us additional savings of around $68K/year
A few considerations
- We tried running the Snowflake export COPY command with an ORDER BY clause added. This resulted in a runtime of ~50 minutes to export one day of data, vs 10–11 minutes on average per day without the ORDER BY. While the export size came down from ~2.4 TB to ~450 GB per day (tested for a sample date of 2023-09-01), the extra compute cost negates the space savings, since storage is cheap
- Writing the Delta table with Snappy compression plus ORDER BY, instead of ZSTD plus ORDER BY, gave us 499 GB vs 285 GB, i.e. ZSTD is roughly 40% smaller. While Snappy gives excellent read performance vs ZSTD-compressed data, this is still fine with us because this is historical data that we consume only for quarterly/annual analysis. The storage savings outweigh the cost of the one or two queries that happen in a year.