The duck escapes with the credits.

Quack, Quack, Ka-Ching: Cut Costs by Querying Snowflake from DuckDB

How to leverage Snowflake’s support for interoperable open lakehouse technology — Iceberg — to save money.

Jonathan Merlevede
datamindedbe

--

Snowflake recently released extensive support for the open table format Iceberg. Using open formats enhances data agility and reduces lock-in. This post explores leveraging this flexibility to decrease Snowflake’s high compute costs by using DuckDB to query Snowflake-managed data.

What is Apache Iceberg?

Apache Iceberg is a table format specification created by Netflix in 2017. In 2018, Netflix open-sourced the Iceberg project and donated it to the Apache Software Foundation.

Netflix designed Iceberg to overcome the limitations of data lakes that contain plain partitioned data files with minimal metadata — also known as Hive-formatted tables. These included performance problems (many file listings, many partitions, limited pruning) and the absence of features that had become common in data warehouses, such as time travel, schema evolution, and ACID transactions.

Table format specification

A table format specification is a standard way of writing metadata to define a table. Metadata allows tools to know what is in a dataset without having to read all of the data inside; it can also assign different meanings to that data, for example by marking files as no longer current.

Apache Iceberg is not a storage format. You can store your Iceberg table’s data in formats such as Parquet, ORC, or Avro; Iceberg is a standard way to organize metadata next to those data files.

Open toolbox and interoperability

Many engines and tools implement the Iceberg spec. Tools implementing the same spec can all interact with the same Iceberg tables, which is why Apache Iceberg is “multi-engine.” Most major engines, such as AWS Athena, Trino (Starburst), DuckDB, and Snowflake, support Iceberg.

This interoperable approach fundamentally differs from what was common in the past. Databases like Oracle, Vertica, and BigQuery store metadata and data in proprietary formats, which hampers interoperability, forces lots of data copying, and can lead to vendor lock-in.

Paradigm shift

Because this format is centrally accessible and independent of any single compute engine, compute engines become interchangeable. This allows us to use the engine best suited to a particular task without having to move data around. Data written by one tool can immediately be read by another.

This architecture results in a paradigm shift, favoring data sharing over redundant data duplication across different computing engines.

Image adapted from https://www.youtube.com/watch?v=_GW3GYZK66U

Featureful lakehouses

In addition to facilitating interoperability, Apache Iceberg enables an ever-growing number of features that close the feature gap between data lakes and data warehouses, giving rise to what is now known as the lakehouse. These include time travel, ACID transactions, partition evolution, hidden partitioning, schema evolution, lower object storage costs, and more. This blog post focuses only on interoperability.

Apache Iceberg and Snowflake

On December 4, 2023, Snowflake published a blog post announcing that their Apache Iceberg integration had entered public preview.

Snowflake now offers two ways to work with Iceberg tables:

  • External catalog. These tables are written to your object store by an external tool such as Apache Spark, Apache Flink, or Trino, and registered in an external catalog such as the Hive Metastore, the AWS Glue Data Catalog, or Nessie. In this mode, tables are read-only from Snowflake (a sketch of registering such a table follows this list).
  • Snowflake catalog. These tables are read-write from Snowflake and read-only externally.
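
To make the external-catalog flavor concrete, below is a rough sketch of registering a Glue-cataloged Iceberg table in Snowflake. All names, the role ARN, and the catalog ID are placeholders, and the exact parameters may differ for your setup; check Snowflake’s documentation.

CREATE CATALOG INTEGRATION glue_catalog_int
  CATALOG_SOURCE = GLUE
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'my_glue_database'
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-access'
  GLUE_CATALOG_ID = '123456789012'
  ENABLED = TRUE;

-- read-only in Snowflake: the table itself is written and maintained externally
CREATE ICEBERG TABLE my_external_table
  EXTERNAL_VOLUME = 'my_external_volume'
  CATALOG = 'glue_catalog_int'
  CATALOG_TABLE_NAME = 'my_table';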

In both cases, Snowflake stores all data and Iceberg metadata in your own (cloud) object storage. Both ways of working with Iceberg have merits; which one is most appropriate depends on your situation.

All data and Iceberg metadata are on your own (cloud) object storage.

When using Iceberg tables with the Snowflake catalog, Snowflake behaves like it always does; it remains a “zero-ops” warehouse, and you can remain carefree while Snowflake performs storage maintenance operations like compaction, expiring snapshots, and cleaning up orphaned files. Iceberg tables behave nearly identically to Snowflake-native tables, although there are some limitations that you may want to check out.

This post assumes that your data lives and breathes in Snowflake and that Snowflake is where your large-scale processing happens. Using the Snowflake catalog is then the right choice.

Image from https://www.snowflake.com/blog/unifying-iceberg-tables/

Iceberg Catalog

When using Iceberg tables with the Snowflake catalog, the “catalog” remains on Snowflake’s side. To determine whether this impedes our ability to interact with data directly, we should know what the metadata catalog does; after all, is a table’s metadata not stored in Iceberg’s metadata files? Catalogs bring at least two things to the table (pun intended):

  • Database abstraction. Iceberg is a specification for technical metadata at the table level, and Iceberg metadata files are stored next to your data files. The table specification is unaware of concepts such as table names, schemas, and databases or collections. A metadata catalog allows you to treat your “bag of tables” as a database by introducing hierarchy and storing a mapping from table names to storage prefixes.
  • Pointer to the current table version. When mutating an Iceberg table, new data and metadata files are added and stored next to the old ones. The catalog keeps track of table prefixes but must also know which metadata files are “current.”

TL;DR: You need access to the catalog to know which table version is current and to access tables by name, so you can write queries as you are used to.
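
Conceptually, the catalog is therefore little more than a small lookup; an illustrative and much-simplified view, with made-up names and paths:

my_database.my_schema.line_item
  -> table location:  s3://my-bucket/warehouse/line_item/
  -> current version: metadata/v7.metadata.json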

Bergs need careful filing in a metadata catalog.

Iceberg Catalog SDK

If you want to read your Iceberg tables using Spark, you’re in luck! Snowflake released an Iceberg Catalog SDK for Spark, which implements Spark’s catalog interface using an (otherwise undocumented) Snowflake Catalog API. Currently, this Snowflake functionality is free and does not require a running warehouse, cost “serverless credits,” or incur “cloud services” charges.

Snowflake’s announcement provides readily usable sample code and confirms that Spark reads Iceberg metadata and Parquet files directly from the customer-managed storage account:

After making an initial connection to Snowflake via the Iceberg Catalog SDK, Spark can read Iceberg metadata and Parquet files directly from the customer-managed storage account. With this configuration, multiple engines can consistently read from a single copy of data.

Unfortunately, this is not immediately helpful for querying from DuckDB. There is no Snowflake catalog SDK available for DuckDB. Luckily, we can use the file system directly to read our data.

Image slightly adapted from https://www.snowflake.com/blog/unifying-iceberg-tables/

Iceberg Filesystem Catalog

If it seems possible to implement a catalog on top of a filesystem or object store through straightforward naming conventions, that is because it is! Indeed, Iceberg’s Hadoop catalog is just that. Its class documentation reads:

HadoopCatalog […] uses a specified directory under a specified filesystem as the warehouse directory, and organizes multiple levels directories that mapped to the database, namespace and the table respectively. The HadoopCatalog takes a location as the warehouse directory. When creating a table such as $db.$tbl, it creates $db/$tbl directory under the warehouse directory, and put the table metadata into that directory.

For Iceberg to know which metadata file is the latest, the Hadoop catalog expects metadata file names to encode monotonically increasing version numbers (v1.metadata.json, v2.metadata.json, and so on). It also looks for an optional version-hint.text file pointing to the newest version.

Note: Writers maintain consistency and monotonically increasing versions by implementing the scheme documented here. Unfortunately, this requires the storage system to support atomic renames, which many object stores, notably S3, Google Cloud Storage, and Azure Blob Storage, do not. This is one of the reasons why one of Iceberg’s original authors, Ryan Blue, has referred to the creation of Hadoop tables as “one of his biggest mistakes.” Even on storage systems that do support atomic renames, you may see lower performance than with a “proper” metadata catalog. The use of HadoopCatalog is generally discouraged in production.

Snowflake presumably uses a proprietary, highly performant catalog implementation in its backend. However, it is nice enough to materialize data and metadata on the customer-managed object storage in a way compatible with the Hadoop catalog — they even maintain a current version-hint.text file! This compatibility means that any reader with support for the Iceberg Hadoop catalog can read Snowflake data directly by pointing it to the root of the Iceberg warehouse on the object storage system.
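
Concretely, a table’s prefix on the object store then looks roughly like this (the exact file names vary):

s3://<bucket>/<base_location>/
  data/          <- Parquet data files
  metadata/      <- *.metadata.json versions, manifest (list) files, and version-hint.text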

DuckDB

DuckDB has partial support for the Iceberg Hadoop catalog and filesystem tables. While DuckDB unfortunately does not (yet?) support reading an entire warehouse, you can point it to a table prefix. DuckDB will then pick up on the version-hint.text file and read the latest version of the table.

Creating an Iceberg table

Getting Snowflake to create an Iceberg table on your cloud requires some configuration. The example below uses S3 as a storage layer, but Snowflake also supports Google Cloud Storage and Azure Storage. You can find a playbook for S3 here.

At a high level, this is what needs to happen:

  • Provision storage: Create an S3 bucket and an IAM role for Snowflake and ensure that the IAM role has the necessary permissions to access the bucket.
  • Connect Snowflake to storage: Create a Snowflake external volume. In S3’s case, creating the external volume creates an IAM user in Snowflake’s account; you then need to set up a trust relationship so that this IAM user can assume the role with access to your S3 bucket (a sketch follows this list).
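
A rough sketch of the second step, with placeholder names and a made-up role ARN; the exact parameters are described in Snowflake’s documentation:

CREATE EXTERNAL VOLUME iceberg_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'iceberg-s3'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://chapter-platform-iceberg/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-iceberg-access'
    )
  );

-- shows the IAM user and external ID created on Snowflake's side; add both to
-- the IAM role's trust policy so that Snowflake can assume it
DESC EXTERNAL VOLUME iceberg_volume;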

We can then create Snowflake-managed Iceberg tables with CREATE ICEBERG TABLE, after which the Parquet data and Iceberg metadata files appear in the S3 bucket, as sketched below.
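
The statement below is a minimal sketch; the column list is illustrative, and the external volume and base location match the other examples in this post:

CREATE ICEBERG TABLE line_item (
  l_orderkey INTEGER,
  l_quantity DECIMAL(12, 2),
  l_shipdate DATE
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'iceberg_volume'
  BASE_LOCATION = 'icebergs/line_item';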

Reading data from DuckDB

Having established a secure connection between S3 and Snowflake and created Iceberg tables in Snowflake, let’s — finally — see how DuckDB facilitates querying them.

We use DuckDB’s iceberg extension to read the Iceberg tables created in Snowflake directly from S3. Again, you can find the playbook here. The main functionality is provided by the iceberg_scan table function:

select * from iceberg_scan('s3://chapter-platform-iceberg/icebergs/line_item');

The iceberg_scan function reads the table directly from S3. You don’t have to point it to the current metadata file explicitly, because version-hint.text identifies the current version of the table.
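
Putting it all together, a minimal DuckDB session looks roughly like this; the secret parameters are placeholders for your own AWS credentials and region:

INSTALL iceberg;
LOAD iceberg;
INSTALL httpfs;
LOAD httpfs;

-- give DuckDB read access to the bucket (DuckDB 0.10+; older versions use
-- SET s3_access_key_id, s3_secret_access_key, and s3_region instead)
CREATE SECRET (
  TYPE S3,
  KEY_ID 'my_key_id',
  SECRET 'my_secret_key',
  REGION 'eu-west-1'
);

SELECT count(*)
FROM iceberg_scan('s3://chapter-platform-iceberg/icebergs/line_item');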

We have now unlocked the real power of open table formats: we have the convenience of Snowflake and its catalog but can save costs by performing single-node queries on DuckDB.

As of now, DuckDB only supports reading Iceberg tables, not writing them. You can still write query results out to Parquet, for example to S3 with COPY <table_name> TO 's3://bucket/file.parquet';. And even if DuckDB did support Iceberg writes, Snowflake-managed Iceberg tables are read-only for external writers, although you could register DuckDB’s output in Snowflake as an Iceberg table with an external catalog.
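
For example, the result of a DuckDB query over the Snowflake-managed table can be materialized as Parquet on S3, reusing the connection set up above; the output path is illustrative:

COPY (
  SELECT *
  FROM iceberg_scan('s3://chapter-platform-iceberg/icebergs/line_item')
) TO 's3://chapter-platform-iceberg/exports/line_item.parquet' (FORMAT PARQUET);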

Why is Snowflake doing this?

If using Iceberg tables on Snowflake is a bit like having your cake and eating it, with Snowflake footing the bill, then why did Snowflake build this integration? The move makes sense in the context of fierce competition from Databricks. Both behemoths are trying to open up their systems to attract customers.

Snowflake sends the message to its (prospective) customers that choosing Snowflake does not tie them to one vendor and that there is no risk of lock-in; with them, you always have the option to switch compute engines when you want. Databricks is behaving similarly by opening up its Delta Lake format and better supporting Hudi and Iceberg through UniForm.

Snowflake still wants to keep as much compute as possible on its systems. There is a clear path for moving externally cataloged tables into Snowflake’s Iceberg catalog, but going in the other direction is much more challenging. By owning the metadata catalog, Snowflake remains the compute engine of choice and the only writer. If Snowflake had not opened up its systems, it would likely have lost many customers who were afraid of lock-in.

Conclusion

Open table formats like Iceberg enable a true separation of compute from storage. By using Snowflake’s Iceberg tables, you can continue to enjoy Snowflake’s powerful, operations-free capabilities while keeping the option to occasionally escape its “walled garden.” Because Iceberg tables backed by Parquet offer characteristics and features very similar to those of native Snowflake tables (efficient compression, partition pruning, schema evolution, and so on), and because Snowflake has implemented solid support for them, you should be able to use Iceberg tables instead of native tables without a significant impact on performance or functionality. We therefore suggest defaulting to Iceberg tables when using Snowflake.

This post demonstrated how easy it is to run a query on DuckDB instead of on expensive Snowflake compute by pointing it directly at Snowflake-managed data in your own object storage. In DuckDB, you can even combine that data with data that is not available in your Snowflake warehouse. Knowing that you can operate DuckDB on instances that cost around 10% of a comparably powered Snowflake warehouse, such an approach can come with significant cost savings. Of course, we do not mean to suggest that DuckDB is a replacement for Snowflake, but we do think this is a good demonstration of the power of interoperability.

This post is the result of a collaborative effort by Jelle De Vleminck, Robbert, Moenes Bensoussia, and Jonathan Merlevede.

  • 👏 If you liked this article, don’t forget to clap
  • 🗣️ Share your insights in the comments; we will try to respond
  • 🗞️ Follow me and subscribe to datamindedbe for more posts on cloud, platform, data, and software engineering.
  • 👀 For more about Data Minded, visit our website.
