Apache Iceberg or Snowflake Table format?

Without question, one of the most frequent questions my Smart Storage Team and I get asked is which storage option to choose. Usually, customers ask one or more of us a few questions along these lines:

  1. Should we choose Apache Iceberg Tables or Snowflake format Tables?
  2. Do we pick one globally or on a per-use case basis?
  3. Why should we pick one over the other, and how do we know?

These are all good questions and, to be up-front, the best answer is unfortunately the complicated one: it depends on your use cases. I do want to explain some of our guidance, however, so it is easier to understand which one to choose, when, and why. Just keep in mind that this is not a one-size-fits-all situation.

Easter egg — on the trail to Snow Lake, my original codename for Iceberg at Snowflake.

A little background

Before I launch into a discussion of which storage can make sense for different use cases, I first want to talk about what storage we have, and what it means.

What is Snowflake storage?

Snowflake launched with an innovative storage system that has a few parts, the most important being:

  1. Snowflake-managed storage (CSP comparable — buckets)
  2. Snowflake native storage format (OSS comparable — Apache Parquet)
  3. Snowflake native metadata (OSS comparable — Apache Iceberg)

These three items are bundled together into the product that customers see as “Snowflake Storage”, which has been used at scale by thousands of customers over a lot of data. We continue to innovate across all three axes of Snowflake storage.

Iceberg, ahoy!

We also announced support for Apache Iceberg in Snowflake a while back, and we have been running some (really massive) private previews to test this new functionality and to get honest customer feedback. There are a few key differences with “Iceberg storage” in Snowflake, and they come down to the three items below:

  1. Customers bring their own storage (buckets)
  2. Apache Parquet is used instead of the Snowflake format
  3. Apache Iceberg format is used as a table format

Since the start, our goal has been to make Iceberg (and Parquet) fast and functional inside of Snowflake.

Okay, but which do I choose?

There are several factors, and items discussed here are not in any specific order because different criteria will carry different weight with different people and use cases.

Do you want to manage physical storage?

With Iceberg Tables, by design, you have to bring the physical storage. As we have shown in our demos, there is a new concept of an external volume and every Iceberg table must specify one. This means that with Iceberg, there is no storage of table data in Snowflake-managed storage. Whether this is a pro or con can depend on your use case.

For those building one of the common patterns where storage is pooled in one (or a few) central locations, such as a data lake, bring-your-own storage is desirable. This effectively “plugs” Snowflake into your central data storage and can eliminate the need to move, copy, or import data. Likewise, some customers want storage costs to be allocated to their cloud provider bill, and Iceberg Tables move storage costs there.

For other customers, managing storage can be a headache or a risk. For example, some teams do not have the resources to actively manage storage, and instead opt for Snowflake storage to save time and reduce complexity. In other cases, customers want Snowflake to be responsible for storage to reduce the risk of accidental or intentional data leaks.

Choose Iceberg Tables if…

  • You need, or want, to be in the position of managing your own storage location(s)
  • Other tools need access to the underlying data storage files

Choose Snowflake storage if…

  • Managing your own physical storage (e.g. buckets) is not a value-add or will be a headache
  • You do not need, or want, other systems to have access to the underlying raw data files

Why you might want to choose open format storage

One of the neat things about Apache Iceberg is that it supports three file formats: Apache Avro, Apache ORC, and Apache Parquet. Based on what we have seen, Parquet is by far the most used format for new data, while customers have a lot of historical ORC and Avro. Snowflake has announced Parquet support first, to satisfy the most common use cases.

Customers have told us they want an open file format usually for one of two reasons — they have or will have tools that can also interoperate with that format, and/or they philosophically like having an open format. Both are very valid.

In the case of interoperability, the list of tools that can work with Parquet is very long. By using an open format, other tools can work with these data files without having to import or export. There are cloud provider products, SaaS products, open source libraries, and many more applications that use an open format by default because it is interoperable.

Why you might want to choose Snowflake storage

In the case of philosophy (eg: I like open formats so I know I cannot get locked in), it is more of a personal (or company) opinion, and a very valid one for many. This route is not free of headaches, however. For example, remember the note above that a lot of historical data is not in Parquet? While not the end of the world, when you use open formats you might run into cases where past decisions make life just a bit harder. As a comparison, with Snowflake storage we can seamlessly upgrade your storage to our newest format (at no cost), whereas if you have a lot of ORC, migrating it may cost a pretty penny.

Choose Iceberg Tables if…

  • You have use cases, now or in the future, where a specific file format would be needed or useful
  • You are OK accepting that a file format decision today may have cost implications (performance, re-writing files) in the future

Choose Snowflake storage if…

  • You would like seamless upgrades to new storage formats when they are available
  • Your tools are consuming data through Snowflake (eg: JDBC, SQL API) and not through the storage layer (eg: S3 bucket)

Do you have a monolithic data footprint / architecture?

My team and I will be upfront: data storage patterns come and go, and have for a long time. We don’t think that telling customers to adopt a single one-size-fits-all pattern really works. We base this thinking on our industry experience of seeing Conway's Law play out far more often than not. So, we are not going to tell you to always use a data warehouse, or a data lake, or a data mesh, or … you get the idea.

What we will say is this: use what works based on your organization and how it really operates. For some, this means that a single pattern (eg: a data lake) will work across an entire company or large organization. For others, a mix of designs will work because it more flexibly reduces friction which, in turn, lowers cost and complexity (albeit sometimes unexpectedly).

If you have a monolithic architecture across your entire organization (eg: you want or have a data lake that spans a significant footprint of the company), then Iceberg Tables may be a good choice. This is because teams often want to bring their own tools, and open formats can be something everyone agrees on.

If you have an organization that wants strong platform cohesion and fewer tools working ad-hoc on the same data, then Snowflake formats may be a better option. For example, we have found that organizations that have lived through data governance with Apache Hadoop and Apache Spark are a bit more skeptical of a patchwork design.

We anticipate that the majority of customers will use a mix of Iceberg Tables and Snowflake format tables. We are designing both with an eye towards interoperability and flexibility. For example, you can join across both of them, or even have one transaction span updates on both. As a fun fact, the Snowflake platform basically treats Iceberg Tables and Snowflake Tables the same (same query engine, same optimizations, and so on), but that is fodder for another post.

Choose Iceberg Tables if…

  • Your organization wants the storage location to be a common denominator between teams
  • You understand the complexities and nuances of governance and security where storage is shared (note: a future topic I want to cover is the Iceberg catalog and our approach to it)

Choose Snowflake storage if…

  • Having multiple teams work against a more centralized and gated storage layer better matches how you operate
  • You’ve tried governance and security on open storage (lakes, meshes, etc) and have found it did not work for your use cases

Choose both if…

  • Some teams will want a distributed storage layer for data across the maturity spectrum while others will not or do not care
  • You have some use cases that are well aligned with data lakes and other use cases where data lake needs are not a concern
  • You want to mix and match so your data storage reflects your organizational structure
  • You want to ride the innovations in both open source and Snowflake as two groups of super talented people innovate
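
The checklists above can be restated as a small decision helper. This is only a minimal sketch in Python, not an official tool: the `UseCase` fields, the function names, and the idea of simply counting signals are all assumptions made for illustration. As noted earlier, different criteria carry different weight for different people, so treat the output as a conversation starter rather than an answer.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Hypothetical flags, one per bullet in the checklists above."""
    # Signals that point towards Iceberg Tables
    manage_own_storage: bool = False            # you need or want to own the buckets
    external_file_access: bool = False          # other tools read the underlying files
    needs_open_file_format: bool = False        # a specific open format is needed or useful
    shared_storage_across_teams: bool = False   # storage is the common denominator
    # Signals that point towards Snowflake storage
    wants_seamless_format_upgrades: bool = False
    consumes_only_via_snowflake: bool = False   # eg: JDBC, SQL API
    prefers_gated_central_storage: bool = False


def recommend_for_use_case(uc: UseCase) -> str:
    """Return 'iceberg', 'snowflake', or 'either' by naive signal counting.

    Assumption: every signal weighs the same. Real decisions will not.
    """
    iceberg = sum([
        uc.manage_own_storage,
        uc.external_file_access,
        uc.needs_open_file_format,
        uc.shared_storage_across_teams,
    ])
    snowflake = sum([
        uc.wants_seamless_format_upgrades,
        uc.consumes_only_via_snowflake,
        uc.prefers_gated_central_storage,
    ])
    if iceberg > snowflake:
        return "iceberg"
    if snowflake > iceberg:
        return "snowflake"
    return "either"


def recommend_for_org(use_cases: list[UseCase]) -> str:
    """Return 'both' when different use cases lean in different directions."""
    picks = {recommend_for_use_case(uc) for uc in use_cases} - {"either"}
    if len(picks) == 2:
        return "both"
    if len(picks) == 1:
        return picks.pop()
    return "either"


if __name__ == "__main__":
    data_lake = UseCase(manage_own_storage=True, external_file_access=True)
    reporting = UseCase(consumes_only_via_snowflake=True,
                        wants_seamless_format_upgrades=True)
    print(recommend_for_org([data_lake, reporting]))
```

Evaluating each use case separately and only then looking across the organization mirrors the "pick per use case, expect a mix" guidance in this post.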

Looking into the future

Snowflake Summit is in 4 days, and while I clearly cannot spill the beans before then, I will say we have some exciting Iceberg announcements lined up. To say Iceberg is a priority in Snowflake is an understatement.

If you want to learn more, the following sessions at Summit will be all about Iceberg, in addition to mentions elsewhere:

I will clearly be at Summit, along with Ron Ortloff, Scott Teal, and some of the crew from Tabular. If you are interested in connecting, please just shoot me a message on LinkedIn.

Finally, I want to emphatically thank everyone at Snowflake, in the OSS community, and all of our customers who have contributed, and continue to contribute, to our Iceberg support. All of you are awesome.
