When To Use Iceberg Tables in Snowflake

Snowflake made a lot of cool announcements at their Summit 2023 event; the one I found most interesting was the maturation of Iceberg tables within the platform. The exciting part to me was not necessarily the performance improvements, though those are certainly welcome!

What excites me is how far the catalog integration has come. Previously it was difficult for other tools to read Iceberg tables that Snowflake created, or for Snowflake to consume tables created by others. It was possible, but it involved a lot of effort around snapshot maintenance, which goes away once a catalog is truly supported; that was the other big change we saw at Summit. New to Iceberg tables? Let's take a look at how they work and what makes them cool.

What is Iceberg?

Moving away from on-premises Hadoop, modern data lakes typically host data at rest in cloud object storage (like AWS S3), usually as compressed columnar Parquet files. Those files don't support in-place updates, and neither does object storage itself. Iceberg is a table format that brings ACID semantics to this immutable storage: it supports inserts, updates, deletes, and schema alterations against tables backed by the underlying Parquet files. Essentially, every time changes are committed to a table, a new snapshot of the table state is recorded.
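As a minimal sketch of that behavior (the catalog, namespace, and table names below are made up, and this assumes a plain local Hadoop catalog rather than anything Snowflake-specific), each statement after the CREATE commits a new snapshot instead of rewriting files in place:

// Assumes spark-shell was started with the Iceberg runtime and a local "demo" catalog, e.g.:
//   --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
//   --conf spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog
//   --conf spark.sql.catalog.demo.type=hadoop
//   --conf spark.sql.catalog.demo.warehouse=/tmp/iceberg-warehouse
spark.sql("CREATE TABLE demo.db.orders (id BIGINT, status STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.orders VALUES (1, 'open'), (2, 'open')")  // snapshot 1
spark.sql("UPDATE demo.db.orders SET status = 'shipped' WHERE id = 1")   // snapshot 2
spark.sql("DELETE FROM demo.db.orders WHERE id = 2")                     // snapshot 3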

What is an Iceberg Manifest?

A manifest consists of metadata that points to the data files that make up a particular snapshot: in effect, "this is what the table looks like as of this commit timestamp." As updates, deletes, alters, etc. are applied to the table, additional snapshots capture the state of the table as it evolves. These snapshots are tracked in a catalog.
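Iceberg exposes this metadata through system tables, so you can inspect it directly. Continuing with the hypothetical demo.db.orders table from the sketch above:

// Each committed change shows up as a snapshot...
spark.sql("SELECT committed_at, snapshot_id, operation FROM demo.db.orders.snapshots").show(false)
// ...and each snapshot is described by manifests that list the data files behind it
spark.sql("SELECT path, added_data_files_count FROM demo.db.orders.manifests").show(false)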

What is an Iceberg Catalog?

The catalog defines the table and determines which snapshot (and therefore which manifest) to use for any given query. Most queries will use the latest snapshot, i.e. the current state of the table; however, it is possible to ask in SQL for the table as of a certain version or point in time (Iceberg time travel). Essentially we use the catalog to pick a snapshot of the table, then use the manifest associated with that snapshot to find the underlying data files that represent it, which the SQL engine then reads.
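In Spark SQL that choice looks roughly like this (again using the hypothetical table from earlier; the snapshot id and timestamp are placeholders):

spark.sql("SELECT * FROM demo.db.orders").show()                                        // latest snapshot
spark.sql("SELECT * FROM demo.db.orders VERSION AS OF 1234567890123456789").show()      // a specific snapshot id
spark.sql("SELECT * FROM demo.db.orders TIMESTAMP AS OF '2023-07-01 00:00:00'").show()  // state as of a point in time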

Iceberg table structure from Dremio.

Should Iceberg Be Used for Time Travel?

The catalog keeps track of each manifest snapshot, and tables can be queried AS OF a commit version or a certain timestamp. Should this be done? Is it the same as Snowflake Time Travel? I'd argue no on both counts; I believe the timestamp effectively needs to line up with a commit, which is a bit of a pain to use, and it would need to be specified for each table involved in a join as well. Snowflake's AT is a bit more forgiving around time, although it comes with a limited retention period.

I would not actually encourage this for point-in-time reporting use cases; note that Snowflake discourages this for its own Time Travel as well. As a best practice I would advise accomplishing point-in-time reporting through data modeling, i.e. Slowly Changing Dimensions, as sketched below.
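For example, a Type 2 dimension with validity ranges lets every fact row join to the dimension attributes that were in effect at the time of the event. The table and column names below are invented purely for illustration:

// Point-in-time join against a hypothetical Type 2 slowly changing dimension
spark.sql("""
  SELECT f.order_id, f.order_ts, d.customer_tier
  FROM fact_orders f
  JOIN dim_customer_scd2 d
    ON f.customer_id = d.customer_id
   AND f.order_ts >= d.valid_from
   AND f.order_ts <  d.valid_to
""").show()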

Time travel is also popular as core functionality of feature stores; it is necessary for building accurate historical datasets for machine learning training. However, that form of time travel is not table level but row level. You are better off leveraging a feature platform like Tecton, or a feature store like Feast, to facilitate what amount to point-in-time joins at arbitrary timestamps on a row-by-row basis.

What is Iceberg time travel good for? I think the best use case is something like a data audit, where the requirement is simply to show the state of a table at a given point in time.

What Makes Iceberg Catalogs So Special in Snowflake?

Previously one could mount an Iceberg table in Snowflake, but it wasn't very interoperable: maintenance was needed in Snowflake to update the mapping to the latest manifest every time a new one was committed, and a similar effort was required for an external system to read an Iceberg table Snowflake had created. Thankfully this has been addressed; Snowflake's own catalog is now available for external use, and Snowflake can use external catalogs as well!

What makes Snowflake the ultimate Iceberg consumer is that this process is both (1) relatively easy to do and (2) done at the table level. By contrast, when reading a Snowflake Iceberg table in Spark, the catalog is specified at the level of the Spark session itself. The Spark code below, from Snowflake's blog, consumes a Snowflake Managed Iceberg table:

# Launch spark-shell with the Iceberg Spark runtime and the Snowflake JDBC driver, configuring:
#  - a catalog named "snowflake_catalog" using the standard Iceberg SparkCatalog adapter
#  - Snowflake's SnowflakeCatalog as the implementation behind that named catalog
#  - a Snowflake JDBC URI with which the catalog performs low-level communication with
#    Snowflake services (account identifier omitted)
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.13:1.2.0,net.snowflake:snowflake-jdbc:3.13.22 \
  --conf spark.sql.catalog.snowflake_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.snowflake_catalog.catalog-impl=org.apache.iceberg.snowflake.SnowflakeCatalog \
  --conf spark.sql.catalog.snowflake_catalog.uri='jdbc:snowflake://.snowflakecomputing.com'

// Inside the spark-shell REPL: make the Snowflake catalog current and explore it
spark.sessionState.catalogManager.setCurrentCatalog("snowflake_catalog")
spark.sql("SHOW NAMESPACES").show()
spark.sql("SHOW NAMESPACES IN my_database").show()
spark.sql("USE my_database.my_schema")
spark.sql("SHOW TABLES").show()

Note in particular that the catalog is specified at the session level. This is in contrast to Snowflake consuming a Snowflake Unmanaged Iceberg table (i.e. the reverse direction, Snowflake reading a Spark-created table), which in Snowflake is treated simply as any other table. Snowflake doesn't have this concept of switching catalogs; everything is simply made available in one database-level catalog, so such a table is easily joinable with anything else and feels just like any other table inside Snowflake.

When Should You Consider Using an Iceberg Table?

You now have several options when creating a table:

  • Native — the best for standard use cases
  • Snowflake Managed Iceberg — when data needs to be in an open format, consumable by external processes, and where Snowflake is maintaining the table (and catalog)
  • Snowflake Unmanaged Iceberg — when Snowflake needs to read open format Iceberg data, but is just a consumer; where it is referencing an external catalog (AWS Glue Data Catalog for example).
  • External Tables — read-only data hosted in object storage; CSV files or vanilla Parquet files, for example.

Which to use will ultimately depend on the use case. Your default should always be Snowflake native tables, to take advantage of the micro-partitioning and clustering options native to the platform.

I suspect the most likely pattern is an external process that writes Iceberg data which Snowflake simply reads. That squarely falls under Unmanaged Iceberg tables.

Should a table default to Managed Iceberg just because another tool needs to consume Snowflake-maintained data? Not necessarily. Spark can read Iceberg, but Spark can also run queries in Snowflake. Is the consuming process or developer able to apply the necessary logic in Snowflake via SQL? Perhaps the best workflow is still to have Snowflake apply compute to the data and simply pull the results out to Spark, rather than having Spark read all the raw data and then apply the logic itself, as sketched below. Maybe this table should be Managed Iceberg; maybe not.
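A rough sketch of that pull-the-results pattern, using the Snowflake Connector for Spark (this assumes the spark-snowflake connector package is on the classpath; the connection values, table, and query are all placeholders):

// Push the aggregation to Snowflake and pull back only the result set
val sfOptions = Map(
  "sfURL"       -> "<account>.snowflakecomputing.com",
  "sfUser"      -> "<user>",
  "sfPassword"  -> "<password>",
  "sfDatabase"  -> "my_database",
  "sfSchema"    -> "my_schema",
  "sfWarehouse" -> "my_warehouse"
)
val results = spark.read
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("query", "SELECT region, SUM(amount) AS total FROM orders GROUP BY region")
  .load()
results.show()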

How about other tools, like Dremio, Trino, or AWS Athena (a Presto-based service)? These federate SQL, and it makes more sense for them to have access to the raw table data. Here a Managed Iceberg table does make more sense, although these tools will need to understand the Snowflake catalog to be able to consume the data. Keep this dependency in mind: today the Snowflake Catalog SDK must be leveraged by these consuming tools. Snowflake is working to additionally support a catalog REST API, which should make integration easier for a broad set of consuming tools.

External Tables should strictly be for CSVs or regular Parquet files, where all of the files are consumed as the current state of a table. Note that materialization should be considered for these, or partitioning, so that read pruning is employed to avoid full table scans.

Ready to Try Them Out?

Iceberg tables are currently in private preview, but you can reach out to your account team to have them enabled and try them out for your use cases.

Written by Mike Taveirne

Consultant & Solution Architect. Enjoy DW/BI platforms Netezza and Snowflake, AutoML platform DataRobot, and currently at Feature Store category creator Tecton.
