Query external Iceberg tables created and managed by Databricks with Snowflake
Author: Jason Drew
Intro — Leveraging UniForm on Databricks as the unified storage layer in your Data Architecture
Databricks is a very powerful platform that can support your entire data strategy. It gives you the ability to store all your data (structured, semi-structured, and unstructured) in one place, the Delta Lakehouse. Then, using the power and flexibility of Spark, Databricks can handle not only all your Data Engineering and ML/AI needs but also serve as the most performant and cost-effective Data Warehouse on the market.
Having one data store and one platform simplifies your architecture, lowers your costs, and boosts your productivity. As great as all this is, not every company is ready to unify everything onto Databricks all at once. In fact, many organizations have multi-year commitments to Snowflake as their Enterprise Data Warehouse. Luckily, you can still use Databricks to save 70% or more on your Data Engineering costs, augment your data with state-of-the-art ML/AI capabilities, and continue to use Snowflake as your serving layer. There are many ways to accomplish this, but today we will walk through using the new Databricks UniForm capability to create Gold tables with Iceberg metadata that Snowflake can query directly.
Universal Format (UniForm) for Iceberg compatibility with Delta tables
Per Databricks documentation: “UniForm takes advantage of the fact that both Delta Lake and Iceberg consist of Parquet data files and a metadata layer. UniForm automatically generates Iceberg metadata asynchronously, without rewriting data, so that Iceberg clients can read Delta tables as if they were Iceberg tables. A single copy of the data files serves both formats.”
Some companies have concerns about duplicating data, and this is one way to avoid that. It also gives you the option of querying Gold tables directly with either Snowflake or Databricks SQL endpoints without any re-engineering. More importantly, Delta UniForm gives users more flexibility and optionality in choosing the best tool for each use case without being locked into a single tool for all use cases.
High Level Overview
Now let's dive into the how. We will walk through the following steps:
- Create Table (Databricks) — Create a table with Uniform with Iceberg Format Enabled on Databricks
- Create Volume (Snowflake) — Create an External Volume on Snowflake that points to the cloud storage location of the table created by Databricks
- Grant Storage Permissions to Snowflake (Azure/AWS/GCP) — Grant read/contributor access on object storage to Snowflake app (depends on the cloud — this article covers Azure)
- Create Catalog Integration (Snowflake) — Create Snowflake Catalog Integration and External Table
- Query Table (Snowflake) — Query the table built by Databricks in Snowflake
- Update Data (Databricks) — Run the ETL pipeline to update the data in Databricks and refresh the metadata in Snowflake automatically from Databricks
- Query Updated Table (Snowflake)
Step 1 (Databricks): Create an external table with Iceberg format enabled
NOTE: You need Databricks Runtime 13.2 or above to work with UniForm. In my example I used DBR 14.2.
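Here is a minimal sketch of this step, run from a Databricks notebook. The catalog, schema, table name, columns, and storage path are placeholders I made up for illustration; the key piece is the delta.universalFormat.enabledFormats table property (newer runtimes may also expect delta.enableIcebergCompatV2 to be set).

```python
# Run in a Databricks notebook (DBR 13.2+), where `spark` is already defined.
# Catalog, schema, table name, and storage path below are placeholders.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.gold.customer_agg (
    customer_id  BIGINT,
    total_orders BIGINT,
    total_spend  DOUBLE
  )
  USING DELTA
  LOCATION 'abfss://gold@mystorageaccount.dfs.core.windows.net/customer_agg'
  TBLPROPERTIES (
    'delta.columnMapping.mode' = 'name',
    'delta.universalFormat.enabledFormats' = 'iceberg'
  )
""")
```

With that property set, UniForm writes Iceberg metadata into the table's metadata/ folder alongside the Delta log, which is exactly what Snowflake will read in the following steps.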
Step 2 (Snowflake): Create an External Volume pointing to the external table's storage location and request permission to add the Snowflake app to your cloud account
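Something like the following works here. I show it through the Snowflake Python Connector for consistency with the rest of the walkthrough, but the same SQL can be run in a Snowflake worksheet. The account, credentials, storage account, container, and tenant ID are all placeholders.

```python
import snowflake.connector

# Placeholder connection details; use your own account and credentials.
conn = snowflake.connector.connect(
    account="<my_account>", user="<my_user>", password="<my_password>",
    role="ACCOUNTADMIN", warehouse="<my_wh>",
)
cur = conn.cursor()

# External volume pointing at the storage location of the Databricks table.
cur.execute("""
    CREATE OR REPLACE EXTERNAL VOLUME databricks_gold_vol
      STORAGE_LOCATIONS = (
        (
          NAME = 'gold-azure'
          STORAGE_PROVIDER = 'AZURE'
          STORAGE_BASE_URL = 'azure://mystorageaccount.blob.core.windows.net/gold/customer_agg/'
          AZURE_TENANT_ID = '<azure_tenant_id>'
        )
      )
""")

# DESC EXTERNAL VOLUME returns the AZURE_CONSENT_URL used to approve the
# Snowflake multi-tenant app in your Azure tenant (needed for Step 3).
cur.execute("DESC EXTERNAL VOLUME databricks_gold_vol")
for row in cur.fetchall():
    print(row)
```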
Step 3 (Azure): Grant the Storage Blob Data Contributor role to the Snowflake Tenant App
Open the AZURE_CONSENT_URL returned by DESC EXTERNAL VOLUME and accept the consent prompt, which creates a Snowflake service principal in your tenant. Then, in the Azure Portal, assign the Storage Blob Data Contributor role on the storage account (or container) to that service principal.
Step 4 (Snowflake): Create a catalog integration and an external Iceberg table
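A sketch of this step, again via the Python Connector. The database, schema, and metadata file name are placeholders; point METADATA_FILE_PATH at the latest *.metadata.json under the table's metadata/ folder, relative to the external volume's STORAGE_BASE_URL.

```python
import snowflake.connector

# Placeholder connection details, as in Step 2.
cur = snowflake.connector.connect(
    account="<my_account>", user="<my_user>", password="<my_password>",
    warehouse="<my_wh>", database="ANALYTICS", schema="GOLD",
).cursor()

# Object-store catalog integration for Iceberg tables.
cur.execute("""
    CREATE OR REPLACE CATALOG INTEGRATION databricks_uniform_cat
      CATALOG_SOURCE = OBJECT_STORE
      TABLE_FORMAT = ICEBERG
      ENABLED = TRUE
""")

# External Iceberg table pointing at the metadata UniForm generated.
# The metadata file name below is illustrative; use your latest one.
cur.execute("""
    CREATE OR REPLACE ICEBERG TABLE customer_agg
      EXTERNAL_VOLUME = 'databricks_gold_vol'
      CATALOG = 'databricks_uniform_cat'
      METADATA_FILE_PATH = 'metadata/v2.metadata.json'
""")
```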
Step 5 (Snowflake): Query the External Iceberg Table
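Querying it then looks like any other Snowflake table (same placeholder connection details as before).

```python
import snowflake.connector

cur = snowflake.connector.connect(
    account="<my_account>", user="<my_user>", password="<my_password>",
    warehouse="<my_wh>", database="ANALYTICS", schema="GOLD",
).cursor()

# The external Iceberg table reads the Parquet files Databricks wrote.
cur.execute("SELECT * FROM customer_agg LIMIT 10")
for row in cur.fetchall():
    print(row)
```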
Step 6 (Databricks): Update the table, then push a REFRESH statement to Snowflake to point it at the new metadata file
Update: Starting in Databricks Runtime 14.3, DESCRIBE EXTENDED shows the Iceberg metadata file and its location.
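Here is a hedged sketch of what that notebook cell could look like, reusing the same placeholder table and connection details as above. The way I pull the metadata location out of DESCRIBE EXTENDED is an assumption (the exact row label can vary by runtime); listing the table's metadata/ folder and taking the newest *.metadata.json works as a fallback.

```python
import snowflake.connector

TABLE = "main.gold.customer_agg"  # placeholder table from Step 1

# 1) Update the data in Databricks as part of the normal ETL pipeline.
#    UniForm regenerates the Iceberg metadata after the commit.
spark.sql(f"""
    UPDATE {TABLE}
    SET total_spend = total_spend + 100
    WHERE customer_id = 42
""")

# 2) Find the latest Iceberg metadata file. On DBR 14.3+ it appears in
#    DESCRIBE EXTENDED output; the filter below is an assumption, so adjust
#    it (or list the table's metadata/ folder) if the label differs.
rows = spark.sql(f"DESCRIBE EXTENDED {TABLE}").collect()
metadata_location = next(
    r.data_type
    for r in rows
    if "metadata" in r.col_name.lower() and "location" in r.col_name.lower()
)

# Snowflake expects a path relative to the external volume's
# STORAGE_BASE_URL, e.g. 'metadata/v3.metadata.json'.
relative_path = "metadata/" + metadata_location.split("/metadata/")[-1]

# 3) Point the Snowflake external Iceberg table at the new metadata file.
conn = snowflake.connector.connect(
    account="<my_account>", user="<my_user>", password="<my_password>",
    warehouse="<my_wh>", database="ANALYTICS", schema="GOLD",
)
conn.cursor().execute(
    f"ALTER ICEBERG TABLE customer_agg REFRESH '{relative_path}'"
)
```

Dropping this at the end of the ETL notebook is the "little Python code" mentioned in the wrap-up: every Databricks run finishes by telling Snowflake where the fresh Iceberg metadata lives.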
Step 7 (Snowflake): Query External Iceberg table again, see updates
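Back in Snowflake, the refreshed table now reflects the update made in Databricks (same placeholder connection as before).

```python
import snowflake.connector

cur = snowflake.connector.connect(
    account="<my_account>", user="<my_user>", password="<my_password>",
    warehouse="<my_wh>", database="ANALYTICS", schema="GOLD",
).cursor()

# The row updated in Databricks in Step 6 shows the new value here.
cur.execute("SELECT customer_id, total_spend FROM customer_agg WHERE customer_id = 42")
print(cur.fetchall())
```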
Wrap Up
So with the flexibility of Databricks UniForm, the Snowflake Python Connector, and a little Python code, we can create and maintain our Gold tables within Databricks and allow Snowflake to query those tables as external Iceberg tables. This lets us use the most efficient platform for Data Engineering while customers continue to use Snowflake as the presentation layer, without maintaining duplicate copies of the data in Databricks and Snowflake. Hopefully Snowflake will add more REST Catalog integrations in the future, but for now we can include a little extra code at the end of our Databricks notebooks to dynamically sync the latest metadata to Snowflake.