Iceberg Tables & Snowflake
The importance of table formats and when to use Iceberg Tables on Snowflake
Table formats and why they matter
Data lakes store large amounts of data across many files in blob storage because blob storage is inexpensive, durable, and easy to add files to. However, analyzing data by directly accessing the file structure in blob storage is inefficient and slow. Say you have 1,000 Parquet files in blob storage. How will a query engine know whether those files represent one table with 1,000 files or two tables with 500 files each?
As shown in the figure above, a table format abstracts a collection of data files as a table so that you can query it with SQL and process it with better performance. There are several different table formats available, including Snowflake’s native table format and open source options like Apache Iceberg, Apache Hudi, and Delta Lake. Each is designed for different use cases, so engineers must consider their specific needs when choosing a table format. In this article, we’ll review Iceberg tables and compare them to Snowflake native tables.
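To make the 1,000-file question concrete, here is a minimal Python sketch of the idea behind a table format: a metadata layer that records which files belong to which table. The `TableMetadata` class, the `files_for_table` helper, the table names, and the file paths are all hypothetical and exist only for illustration; no real table format stores its metadata this way.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Toy stand-in for a table format's metadata (hypothetical, not a real format)."""
    name: str
    data_files: list = field(default_factory=list)


def files_for_table(catalog: dict, table: str) -> list:
    """Return the data files a query engine should read for one table."""
    return catalog[table].data_files


# 1,000 Parquet files in blob storage: the paths alone do not say
# which table each file belongs to.
all_files = [f"bucket/data/part-{i:04d}.parquet" for i in range(1000)]

# The metadata layer records the grouping explicitly.
# Here the same files are declared to be two tables of 500 files each.
catalog = {
    "orders": TableMetadata("orders", all_files[:500]),
    "customers": TableMetadata("customers", all_files[500:]),
}

print(len(files_for_table(catalog, "orders")))
print(len(files_for_table(catalog, "customers")))
```

With the grouping recorded in metadata, an engine can plan a query against a named table without having to guess from the directory listing.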
Apache Iceberg Table Format
Apache Iceberg is a table format explicitly built for working with large amounts of data and brings the reliability and simplicity of SQL tables to datasets stored across many files in a data lake. Iceberg addresses long-standing consistency and performance challenges of earlier table formats such as Apache Hive. In particular, it supports SQL-like guarantees such as ACID transactions and safe, reliable schema evolution, as well as time travel. Importantly, Iceberg tables are also engine agnostic, so multiple query engines can simultaneously process data on the same tables reliably.
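Time travel becomes easier to picture once you know that Iceberg tracks table state as a series of snapshots. The sketch below is a simplified, in-memory illustration of that idea in Python, not the Iceberg implementation or its API: `Snapshot`, `ToyTable`, `commit_append`, and `scan` are hypothetical names, and real Iceberg keeps this information in metadata files managed through a catalog.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One immutable version of the table: an id plus the files visible in it."""
    snapshot_id: int
    data_files: Tuple[str, ...]


class ToyTable:
    """Hypothetical in-memory model of a snapshot-based table."""

    def __init__(self) -> None:
        # Snapshot 0 is the empty table.
        self.snapshots: List[Snapshot] = [Snapshot(0, ())]

    def commit_append(self, new_files: List[str]) -> int:
        """Publish a new snapshot that adds files; older snapshots are untouched."""
        current = self.snapshots[-1]
        snapshot = Snapshot(
            current.snapshot_id + 1,
            current.data_files + tuple(new_files),
        )
        self.snapshots.append(snapshot)
        return snapshot.snapshot_id

    def scan(self, snapshot_id: Optional[int] = None) -> List[str]:
        """List files for the latest snapshot, or for an older one (time travel)."""
        if snapshot_id is None:
            return list(self.snapshots[-1].data_files)
        for snapshot in self.snapshots:
            if snapshot.snapshot_id == snapshot_id:
                return list(snapshot.data_files)
        raise KeyError(f"unknown snapshot {snapshot_id}")


table = ToyTable()
first = table.commit_append(["part-0001.parquet"])
table.commit_append(["part-0002.parquet"])

print(table.scan())       # files in the current version
print(table.scan(first))  # files as of the first commit
```

Because each commit publishes a complete new snapshot and never edits an old one, a reader sees the table either before or after a change, never halfway through it.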
Developed initially at Netflix, Iceberg is now an open source project hosted by the Apache Software Foundation, with a growing ecosystem of contributors from companies including Apple, Netflix, Tabular, AWS, Alibaba, Dremio, and Tencent.
Snowflake Native Table Format
Snowflake’s native table format was built with performance, security, and ease of use in mind. Its table metadata and storage layout are designed to work efficiently with Snowflake’s cloud architecture, which translates to better performance. To put it another way, because Snowflake designs and implements both the processing engine and the table format, it can improve both, tuning each to work better with the other.
Iceberg Tables on Snowflake
Snowflake makes it easy to use Iceberg tables with direct support to create, modify, and query tables in Iceberg format. In fact, you can transact with multiple Iceberg tables with Snowflake. Snowflake makes working with Iceberg tables as seamless as possible by supporting Snowflake’s feature set, such as encryption, replication, governance, and marketplace, while also supporting the interoperability of the Iceberg open-source format.
With that said, there are three main differences between the Snowflake native table format and Iceberg tables, as illustrated below.
These differences need to be considered when deciding on which table format to choose for your next project.
Table Types Supported by Snowflake and How to Decide
Snowflake supports a broad set of table options. In addition to the Iceberg tables and native Snowflake tables we’ve been discussing, Snowflake also supports S3-compatible external tables and Delta tables. Snowflake can also support Iceberg tables as external tables for read-only use cases, and as regular Iceberg tables when full DML is required. So how do you decide which table type to use for your use case? The following diagram explains how.
In short, there are three key factors to consider:
- Storage location: if the data is on-premises or in a private cloud managed by the customer, use S3-compatible external tables.
- DML requirements: if the data is in the public cloud and your use case does not require full DML (i.e., it’s a read-only use case), use Apache Iceberg or Delta tables as external tables.
- Open formats: if the data is in the public cloud and full DML is required, ask yourself whether the data must be stored in open formats so it can be processed by other query engines. If so, use regular Iceberg tables (not external tables). If open formats are not required, use standard Snowflake tables to take full advantage of Snowflake capabilities.
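The three factors above can be sketched as a small decision function. This Python example only encodes the rules from the list; the function name `choose_table_type`, its parameters, and the returned labels are hypothetical and are not part of any Snowflake API.

```python
def choose_table_type(
    in_public_cloud: bool,
    needs_full_dml: bool,
    needs_open_format: bool,
) -> str:
    """Pick a table type by applying the three factors in order."""
    # Factor 1: storage location. On-premises or private cloud data
    # points to S3-compatible external tables.
    if not in_public_cloud:
        return "S3-compatible external table"

    # Factor 2: DML requirements. Read-only use cases can use
    # Iceberg or Delta tables as external tables.
    if not needs_full_dml:
        return "Iceberg or Delta external table"

    # Factor 3: open formats. Full DML plus interoperability with
    # other query engines points to regular Iceberg tables.
    if needs_open_format:
        return "Iceberg table"

    # Otherwise the native format takes the most advantage of Snowflake.
    return "standard Snowflake table"


print(choose_table_type(in_public_cloud=True, needs_full_dml=True, needs_open_format=True))
```

The order of the checks matters: storage location is evaluated first, so the DML and open-format questions are only asked for data that already lives in the public cloud.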
Conclusion
In general, you should use Snowflake’s native table format to fully unlock the benefits of performance, security, and automatic management from the Snowflake platform. When you have specific interoperability requirements, for example when you’re migrating from a legacy platform or legacy tools to Snowflake, Iceberg is a great option. The best part is that you retain the flexibility to design architectures and storage patterns that best suit your use cases; there’s no one-size-fits-all pattern.