Performance of Querying UniForm-Iceberg Tables Written by Databricks in Snowflake

Jason Drew
DBSQL SME Engineering
6 min read · Feb 8, 2024
How external Iceberg tables created in Databricks perform in Snowflake

Background

This new architecture, in which Databricks writes data in the UniForm table format and Snowflake queries it through external Iceberg tables, is a powerful tool: it gives users the flexibility of not being locked into one system while drastically reducing manual ETL costs across systems with a much simpler architecture. You no longer need to “dump” data into Snowflake’s proprietary table format, creating extra steps and data movement costs. Instead, you can write to UniForm in Databricks and query the table directly in Snowflake, with no added complexity in between!

In mid-December I wrote a blog on how to query external Iceberg tables created and managed by Databricks UniForm with Snowflake. This push method is one of two ways to keep external Iceberg metadata updated in Snowflake. The other, a pull method in which Snowflake fetches the latest Iceberg metadata location from Unity Catalog and executes a table refresh, has been written about by my colleague Usman Zubair and can be found here.

While both of these methods provide tremendous flexibility in a mixed Databricks/Snowflake environment, the next natural question for determining the usability of this pattern was how well these externally managed Iceberg tables, created by Databricks in UniForm, would perform in Snowflake compared to native tables. In this blog, we will run some TPC-DS benchmark queries to get a feel for how much performance we trade for the immense flexibility of using UniForm as the back-end table format, written by Databricks and read by Snowflake.

Testing Method

To test this, I used the TPC-DS query that comes directly from the Snowflake documentation. I compared the query on the native Snowflake sample data tables provided in the TPCDS_SF10TCL schema against those exact same tables in external Iceberg format created by Databricks UniForm. I also ran the same query against the UniForm tables in Databricks. Each query was run three times in a “cold” state (no cache, fresh warehouse startup) followed by three “warm” runs (no result cache, but an active disk cache). I averaged the results of the three runs, using both L and M sized warehouses in both Snowflake and Databricks.
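As a rough illustration of the method, here is a minimal timing sketch in PySpark; it assumes a SparkSession named spark and the query text in scope, and the same three-run averaging applies on both platforms:

import time

def avg_runtime(spark, query, n_runs=3):
    # Execute the query n_runs times and return the average wall-clock seconds
    timings = []
    for _ in range(n_runs):
        start = time.time()
        spark.sql(query).collect()  # force full execution of the query
        timings.append(time.time() - start)
    return sum(timings) / len(timings)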

TPCDS_SF10TCL tables used:

1. CALL_CENTER: 54 rows, clustered by cc_call_center_sk

2. CATALOG_SALES: 14.4 billion rows, clustered by cs_sold_date_sk

3. DATE_DIM: 73K rows, clustered by d_date_sk

4. ITEM: 402K rows, clustered by i_item_sk

TPCDS_SF10TCL query:

alter session set use_cached_result = false; --(with each Snowflake Query)

set use_cached_result = false; --(with each Databricks Query)

with v1 as(
select i_category, i_brand, cc_name, d_year, d_moy,
sum(cs_sales_price) sum_sales,
avg(sum(cs_sales_price)) over
(partition by i_category, i_brand,
cc_name, d_year)
avg_monthly_sales,
rank() over
(partition by i_category, i_brand,
cc_name
order by d_year, d_moy) rn
from item_ice, catalog_sales_ice, date_dim_ice, call_center_ice
where cs_item_sk = i_item_sk and
cs_sold_date_sk = d_date_sk and
cc_call_center_sk= cs_call_center_sk and
(
d_year = 1999 or
( d_year = 1999-1 and d_moy = 12) or
( d_year = 1999+1 and d_moy = 1)
)
group by i_category, i_brand,
cc_name , d_year, d_moy),
v2 as(
select v1.i_category ,v1.d_year, v1.d_moy ,v1.avg_monthly_sales
,v1.sum_sales, v1_lag.sum_sales psum, v1_lead.sum_sales nsum
from v1, v1 v1_lag, v1 v1_lead
where v1.i_category = v1_lag.i_category and
v1.i_category = v1_lead.i_category and
v1.i_brand = v1_lag.i_brand and
v1.i_brand = v1_lead.i_brand and
v1.cc_name = v1_lag.cc_name and
v1.cc_name = v1_lead.cc_name and
v1.rn = v1_lag.rn + 1 and
v1.rn = v1_lead.rn - 1)
select *
from v2
where d_year = 1999 and
avg_monthly_sales > 0 and
case when avg_monthly_sales > 0 then abs(sum_sales - avg_monthly_sales) / avg_monthly_sales else null end > 0.1
order by sum_sales - avg_monthly_sales, 3
limit 100;

Bringing Data into Databricks:

First, I exported the Snowflake tables into Delta tables as-is. I then created Iceberg-compliant tables from the Delta tables, clustering on the same keys used by the original Snowflake tables. Those tables were then queried by Snowflake as external Iceberg tables.

NOTE: For the purposes of this blog, I use the terms UniForm and Iceberg interchangeably.

Using the CALL_CENTER table as an example, I did the following:

1. Set my connection options:
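Roughly like the following, assuming credentials stored in a hypothetical secret scope named snowflake; the account URL and warehouse name are placeholders:

# Snowflake connection options for the Spark connector. The account URL,
# secret scope, and warehouse name here are placeholders.
sf_options = {
    "sfUrl": "<account_identifier>.snowflakecomputing.com",
    "sfUser": dbutils.secrets.get(scope="snowflake", key="user"),
    "sfPassword": dbutils.secrets.get(scope="snowflake", key="password"),
    "sfDatabase": "SNOWFLAKE_SAMPLE_DATA",
    "sfSchema": "TPCDS_SF10TCL",
    "sfWarehouse": "<warehouse_name>",
}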

2. Loaded the Snowflake Table into a Dataframe:
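Using the Snowflake connector built into Databricks, something along these lines:

# Read the Snowflake table into a Spark DataFrame via the native connector
df = (
    spark.read.format("snowflake")
    .options(**sf_options)
    .option("dbtable", "CALL_CENTER")
    .load()
)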

3. Wrote the Dataframe to a Delta table:
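For example, with a placeholder catalog and schema (main.tpcds):

# Persist the DataFrame as a managed Delta table in Unity Catalog
(
    df.write.format("delta")
    .mode("overwrite")
    .saveAsTable("main.tpcds.call_center_delta")
)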

4. Created an Iceberg compliant table (UniForm) from the Delta table with matching clustering:
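A sketch of the DDL, wrapped in spark.sql and using the table names above; the two table properties enable UniForm’s Iceberg metadata generation, and CLUSTER BY applies clustering on the original Snowflake cluster key:

# CTAS into a UniForm-enabled Delta table, clustered on the same key
# as the original Snowflake table
spark.sql("""
    CREATE TABLE main.tpcds.call_center_ice
    CLUSTER BY (cc_call_center_sk)
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
    AS SELECT * FROM main.tpcds.call_center_delta
""")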

Note: Databricks Runtime 14.3 was used.

5. Optimized the table to ensure clustering was activated. Note: In the future, this will be taken care of automatically with Predictive Optimization.
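For example:

# OPTIMIZE triggers clustering on the newly written table
spark.sql("OPTIMIZE main.tpcds.call_center_ice")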

6. Ran the above SQL query in three scenarios: on native Snowflake tables, on Snowflake external tables using UniForm-Iceberg, and on Databricks SQL using UniForm-Iceberg.

Results

I ran the query in three ways: against Snowflake native tables, against Snowflake external Iceberg tables with UniForm, and lastly on Databricks SQL with UniForm-Iceberg.

Query Price Results on Snowflake Native vs Uniform (2 cluster sizes)
Query Runtimes on Snowflake Native vs Uniform (hot and cold, 2 cluster sizes)

*Note: On a relative sizing scale, a Databricks SQL warehouse Medium is roughly equivalent to a Snowflake Large. This is why we use price per query to normalize against sizing; what matters is runtime and total price per query.

In the above results, a few key points stand out:

  1. When running the query on the same Snowflake cluster, UniForm-Iceberg and native table performance are pretty close! UniForm is actually 5% cheaper on a Medium Snowflake warehouse.
  2. Databricks price/performance is quite a bit better even when using UniForm-Iceberg to run the query (~25% more price-performant, ~40% faster in raw runtime).
  3. On a raw runtime basis, the difference between native tables in Snowflake and UniForm-Iceberg is pretty negligible, no matter which cluster size.

The results are pretty great! They show that when using Snowflake to query the data, the query price is very similar, and even a little bit cheaper using UniForm, suggesting virtually no performance tradeoff. This validates that building your pipelines with UniForm, no matter which query engine you use, is a very strong design pattern that gives users maximum flexibility and ETL simplicity for multi-platform architectures.

Conclusion

Snowflake queries against external Iceberg tables created by Databricks UniForm performed as well as or better than the same tables in native Snowflake format. Furthermore, those same tables performed even better in Databricks, which also appears to make better use of the disk cache for warm queries.

These results bode well for using Databricks as the data engineering engine for Snowflake (typically saving 70% or more on ETL workloads), managing only a single Gold-level table that both Snowflake and Databricks can read. Furthermore, this leaves open the option of further platform consolidation and cost savings by eventually transitioning to Databricks SQL without any pipeline reengineering.
