Query external Iceberg tables created and managed by Databricks with Snowflake

Jason Drew
DBSQL SME Engineering
5 min read · Dec 18, 2023


Author Screenshot: Build on Databricks, Serve on Snowflake with Delta UniForm

Intro: Leveraging UniForm on Databricks as the unified storage layer in your data architecture

Databricks is a very powerful platform that can support your entire data strategy. It gives you the ability to store all your data (structured, semi-structured, and unstructured) in one place: the Delta Lakehouse. Then, using the power and flexibility of Spark, Databricks can handle not only all your Data Engineering and ML/AI needs, but can also serve as the most performant and cost-effective Data Warehouse on the market.

Having one data store and one platform simplifies your architecture, lowers your costs, and boosts your productivity. As great as all this is, not every company is ready to unify everything on Databricks all at once. In fact, many organizations have multi-year commitments to Snowflake as their enterprise data warehouse. Luckily, you can still use Databricks to save 70% or more on your Data Engineering, augment your data with state-of-the-art ML/AI capabilities, and continue to use Snowflake as your serving layer. There are many ways to accomplish this, but today we will go over using the new Databricks UniForm capability to create Gold tables with Iceberg metadata that Snowflake can query directly.

Universal Format (UniForm) for Iceberg compatibility with Delta tables

Per the Databricks documentation: “UniForm takes advantage of the fact that both Delta Lake and Iceberg consist of Parquet data files and a metadata layer. UniForm automatically generates Iceberg metadata asynchronously, without rewriting data, so that Iceberg clients can read Delta tables as if they were Iceberg tables. A single copy of the data files serves both formats.”

Some companies have concerns about duplicating data, and this is one way to avoid that. It also gives you the option of querying Gold tables directly with either Snowflake or Databricks SQL endpoints without any reengineering. More importantly, Delta UniForm gives users more flexibility and optionality in choosing the best tool for each use case without being locked into a single tool for all use cases.
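For an existing Delta table, turning this on is just a matter of table properties. A minimal sketch is below; the table name is hypothetical, and depending on your runtime you may also need to upgrade the table protocol versions.

    -- Hypothetical existing Delta table. UniForm requires column mapping,
    -- so both properties are set together. Depending on the runtime, you may
    -- also need to raise delta.minReaderVersion / delta.minWriterVersion.
    ALTER TABLE main.gold.orders SET TBLPROPERTIES (
      'delta.columnMapping.mode' = 'name',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    );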

High Level Overview

Author Screenshot: High-level architecture of the Iceberg setup

Now let's dive into the how. We will walk through the following steps:


  1. Create Table (Databricks) — Create a table with UniForm Iceberg format enabled on Databricks
  2. Create Volume (Snowflake) — Create an External Volume in Snowflake that points to the cloud storage location of the table created by Databricks
  3. Grant Storage Permissions to Snowflake (Azure/AWS/GCP) — Grant read/contributor access on object storage to the Snowflake app (this depends on the cloud; this article covers Azure)
  4. Create Catalog Integration (Snowflake) — Create a Snowflake Catalog Integration and an External Iceberg Table
  5. Query Table (Snowflake) — Query the table built by Databricks from Snowflake
  6. Update Data (Databricks) — Update data (ETL pipeline) in Databricks and refresh the metadata in Snowflake automatically from Databricks
  7. Query Updated Table (Snowflake) — Query the updated External Iceberg Table and confirm the new data is visible

Step 1 (Databricks): Create an external table with Iceberg format enabled

NOTE: You need to use Databricks Runtime 13.2 or above to work with UniForm. In my example I used DBR 14.2.

For demonstration purposes, I used a CTAS statement from an existing table, specifying the table property 'delta.universalFormat.enabledFormats' = 'iceberg'. I also specified an external location for the table.
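A minimal sketch of that statement is below. The catalog, schema, source table, and storage path are placeholders for illustration rather than the exact names from my environment.

    CREATE TABLE main.gold.delta_barbarian
    TBLPROPERTIES (
      'delta.columnMapping.mode' = 'name',                -- required by UniForm
      'delta.universalFormat.enabledFormats' = 'iceberg'  -- generate Iceberg metadata
    )
    LOCATION 'abfss://<container>@<storageaccount>.dfs.core.windows.net/delta_barbarian'
    AS SELECT * FROM main.bronze.orders;

    -- quick sanity check on the starting row count
    SELECT COUNT(*) FROM main.gold.delta_barbarian;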
We can see we are starting off with 7,500,000 rows.
This creates a slightly different subfolder structure than a regular Delta table: in my case the Parquet data files landed in subfolders named eu, lr, and ry. The _delta_log folder is unchanged, so you keep all the usual Delta format capabilities.
Going into the metadata folder, you'll see a metadata.json file, which tells Iceberg which data files to use.

Step 2 (Snowflake): Create an External Volume to the external table location, request permission to add app to cloud account

First I created a DATABASE named ICEBERG_DB and a SCHEMA named ICEBERG. Then I created an external volume named delta_barbarian_vol. The Snowflake documentation on creating an external volume (linked in the Step 2 title) explains the process for all three clouds; I used Azure.
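The statements looked roughly like the sketch below; the storage URL and tenant ID are placeholders for my environment.

    CREATE DATABASE IF NOT EXISTS ICEBERG_DB;
    CREATE SCHEMA IF NOT EXISTS ICEBERG_DB.ICEBERG;

    -- External volume pointing at the table's storage location written by Databricks
    CREATE OR REPLACE EXTERNAL VOLUME delta_barbarian_vol
      STORAGE_LOCATIONS = (
        (
          NAME = 'delta-barbarian-azure'
          STORAGE_PROVIDER = 'AZURE'
          STORAGE_BASE_URL = 'azure://<storageaccount>.blob.core.windows.net/<container>/delta_barbarian/'
          AZURE_TENANT_ID = '<azure-tenant-id>'
        )
      );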
Once the volume was created, I executed “desc external volume delta_barbarian_vol”. In the property_value column of the output, I took note of the AZURE_MULTI_TENANT_APP_NAME (highlighted in green in the screenshot) and the AZURE_CONSENT_URL (highlighted in yellow).
I pasted the AZURE_CONSENT_URL into my browser and clicked Accept.

Step 3 (Azure): Grant Storage Blob Data Contributor to the Snowflake Tenant App

Go to your storage account, click Access Control (IAM), then click Add role assignment.
Search for and select the Storage Blob Data Contributor role, then click Next.
Click Select members, then search for the AZURE_MULTI_TENANT_APP_NAME highlighted in green earlier. NOTE: You only want the part BEFORE the underscore. Select the member, click Select, then click Review + Assign.

Step 4 (Snowflake): Create catalog integration and create external iceberg table

After you create the external volume, you'll need to create a catalog integration. Snowflake can integrate with AWS Glue as a catalog, but not currently with Unity Catalog or any other external REST catalog. This means we need to set CATALOG_SOURCE = OBJECT_STORE, which makes this an externally managed table and means we have to tell Snowflake explicitly which metadata.json file to use. Since we just created this table, there is only one, located in the metadata subfolder we showed earlier.
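A sketch of the two statements is below; the Snowflake table name and the metadata file name are placeholders (the file name will be whatever UniForm generated in the metadata folder).

    CREATE OR REPLACE CATALOG INTEGRATION delta_catalog_int
      CATALOG_SOURCE = OBJECT_STORE
      TABLE_FORMAT = ICEBERG
      ENABLED = TRUE;

    -- The metadata path is relative to the external volume's STORAGE_BASE_URL
    CREATE OR REPLACE ICEBERG TABLE ICEBERG_DB.ICEBERG.DELTA_BARBARIAN
      EXTERNAL_VOLUME = 'delta_barbarian_vol'
      CATALOG = 'delta_catalog_int'
      METADATA_FILE_PATH = 'metadata/<generated-name>.metadata.json';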

Step 5 (Snowflake): Query the External Iceberg Table

Notice this number matches the row count from the same query in Databricks.
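The check itself is just a count against the external Iceberg table created above (names as in my sketch):

    SELECT COUNT(*) FROM ICEBERG_DB.ICEBERG.DELTA_BARBARIAN;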

Step 6 (Databricks): Update table, push REFRESH statement to Snowflake to update metadata file

I added a new row with values I knew wouldn't be in the original orders table. As you can see, we now have one more row in our source table.
This generates a new metadata.json file.
To avoid having to manually look up the new metadata file each time I update the table in Databricks, I include a small function to get the latest metadata.json file.
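A minimal sketch of such a helper, run from a Databricks notebook, is below; the storage path is a placeholder for the table's external location.

    # Sketch: find the most recently written Iceberg metadata.json for the UniForm table.
    # METADATA_DIR is a placeholder for the table's external location.
    METADATA_DIR = "abfss://<container>@<storageaccount>.dfs.core.windows.net/delta_barbarian/metadata/"

    def get_latest_metadata_file():
        files = [f for f in dbutils.fs.ls(METADATA_DIR) if f.name.endswith(".metadata.json")]
        newest = max(files, key=lambda f: f.modificationTime)  # newest by write time
        # Return the path relative to the external volume's base URL,
        # which is what Snowflake's REFRESH command expects.
        return f"metadata/{newest.name}"

    latest = get_latest_metadata_file()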

Update: Starting in Databricks Runtime 14.3, DESCRIBE EXTENDED will show the Iceberg metadata file and location.

This is a slightly cleaner way to populate the {latest} variable on Runtime 14.3+.
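Roughly, that could look like the sketch below; I filter the DESCRIBE EXTENDED output loosely because the exact row label carrying the location may differ by runtime version, and the table name is again a placeholder.

    # Sketch for DBR 14.3+: pull the Iceberg metadata location from DESCRIBE EXTENDED.
    rows = spark.sql("DESCRIBE EXTENDED main.gold.delta_barbarian").collect()
    full_path = next(
        (r.data_type for r in rows
         if isinstance(r.data_type, str) and r.data_type.endswith(".metadata.json")),
        None,
    )
    # Convert the full cloud path into the volume-relative path Snowflake expects.
    latest = ("metadata/" + full_path.split("/metadata/")[-1]) if full_path else None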
I then installed the Snowflake Python connector and set up the connection to my Snowflake account.
Using the connector and the {latest} variable populated by my function, I can now dynamically send the REFRESH command to Snowflake to make sure it accesses the latest metadata.json file and data.
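Put together, that looks roughly like the sketch below; the account, user, warehouse, and secret scope are placeholders, and in practice the credentials should come from a secret store rather than being hard-coded.

    # %pip install snowflake-connector-python
    import snowflake.connector

    # Placeholder connection details; keep credentials in a Databricks secret scope.
    conn = snowflake.connector.connect(
        account="<account_identifier>",
        user="<user>",
        password=dbutils.secrets.get("snowflake", "password"),
        warehouse="<warehouse>",
        database="ICEBERG_DB",
        schema="ICEBERG",
    )

    # `latest` comes from the helper above; point the external Iceberg table
    # at the newest metadata.json produced by UniForm.
    conn.cursor().execute(f"ALTER ICEBERG TABLE DELTA_BARBARIAN REFRESH '{latest}'")
    conn.close()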

Step 7 (Snowflake): Query External Iceberg table again, see updates

I can see my row count matches.
The new row I added from Databricks is there.

Wrap Up

So with the flexibility of Databricks UniForm, the Snowflake Python Connector, and a little Python code, we can create and maintain our Gold tables within Databricks and allow Snowflake to query those tables as external Iceberg tables. This allows us to use the most efficient platform for Data Engineering while customers continue to use Snowflake as the presentation layer, without maintaining copies of the data in both Databricks and Snowflake. Hopefully in the future Snowflake will add more REST catalog integrations, but for now we can just include a little extra code at the end of our Databricks notebooks to dynamically sync the latest metadata to Snowflake.
