Will Apache Iceberg Win Over Delta Lake?

Gilles Philippart
Confluent
12 min read · Jun 6, 2024

Update: This blog post was written just before Databricks’ acquisition of Tabular. As a result, the information below reflects the landscape prior to this development.

Databricks’ acquisition of Tabular raises a couple of questions about the future of open data formats, particularly Iceberg. Will Databricks prioritize continued Iceberg support? How will the open standard for interoperability evolve to allow seamless interchange between Delta Lake and Iceberg across processing engines?

Given that Databricks acknowledges this process will take years, it’s critical to carefully evaluate both formats and choose the most open one until the situation becomes clearer.

Predicting the future in tech is a notoriously difficult task, and it’s one area where everyone seems to struggle equally. Despite this, in this post, I’m going to take my chances and try to forecast who will be the winner of the ongoing open table format (OTF) debate: Apache Iceberg or Delta Lake?

You might wonder why I’m focusing only on these two options, and not including Apache Hudi and Apache Paimon. Although Apache Hudi offers impressive ingestion performance, it hasn’t garnered as much traction as the other two. As for Apache Paimon, it’s still too new to be considered; it hasn’t released a 1.0 version, and its official specification was only recently published. If the situation changes by 2025, I’ll gladly write a new blog post to revisit the topic.

Apache Iceberg and Delta Lake are two competing table formats that abstract the physical data storage layer from users, and present the data as a set of tables instead of raw files.

Both technologies aim to transform a basic data lake into an open “data lakehouse” by combining the ability to store a massive amount of data in object stores with letting you choose the best processing engine to update and query it.
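
To make the idea concrete, here’s a minimal sketch using PySpark with Iceberg’s Spark integration. The catalog name (“demo”), warehouse path, and table are made up for illustration, and you’d also need the iceberg-spark-runtime jar on the classpath:

from pyspark.sql import SparkSession

# Register a local Iceberg catalog named "demo" (a made-up name).
spark = (
    SparkSession.builder
    .appName("lakehouse-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/warehouse")
    .getOrCreate()
)

# You work with tables; the format takes care of the files and metadata.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp())")
spark.sql("SELECT * FROM demo.db.events").show()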

You may be tempted to base your choice solely on the feature sets, but superior technology alone doesn’t guarantee success.

With both projects being open source, boasting similar features, having strong communities, and being under active development, will this be a case where neither will prevail, like the classic Nintendo vs. Sega or Xbox vs. PlayStation rivalries?

I don’t think so. Beyond their technical merits, I believe success will ultimately hinge on these dimensions:

  • Community and vendor adoption
  • Ease of use and integration
  • Development stewardship
  • Technical foundations
  • Processing engine compatibility

Community and vendor adoption

Many tech companies have adopted Apache Iceberg today: Apple, Alibaba, Tencent, Bloomberg, Pinterest, Netflix, ByteDance, Adobe — to name just a few.

We’ve also seen a flurry of recent announcements from major vendors in the data landscape about their support for Apache Iceberg.

More than just supporting it, many vendors actively contribute to Iceberg’s development, including Tabular, Dremio, Imply, and Starburst.

In the Delta Lake camp, there’s also a long list of supporting companies, such as Microsoft and Alibaba, but also businesses that aren’t Databricks customers, such as DoorDash. It turns out that businesses that are heavy users of Apache Spark for large-scale data processing often adopt Delta Lake, which is built on top of Spark and developed by the same creators.

There are also those who support both, such as AWS and Google. But vendor support isn’t enough: a healthy and active community is also a cornerstone of any successful open-source project.

Most Iceberg development discussions happen on the very active dev mailing list, while users can chat in the Slack community, which boasts over 5,000 members. I find it useful that the community is on the Slack Pro plan, which grants everyone access to the full history of discussions.

Delta Lake has put together a Google Group for developers and users, but it’s surprisingly quiet, with only 20 discussion threads in the last six months. There are some comments on GitHub pull requests, but they’re mostly code reviews, rarely discussions about designing new features. It has a Slack workspace too, with an impressive roster of 10,000 people. Unfortunately, it’s on the free plan, so unless there’s an archive somewhere (I couldn’t find one), you’ll only ever see messages from the last 90 days. Another thing that bothers me is that quite a few channels focus on Databricks-specific topics. Isn’t there an obvious conflict of interest here?

Winner: Apache Iceberg

Ease of use and integration

Documentation can often make or break a project. Without it, users can’t leverage the project’s capabilities, troubleshoot issues, or find answers to their questions. Thankfully, Delta Lake and Apache Iceberg both have very compelling documentation, and they both have decided to version it alongside the code, which is a good thing.

If you want to get started with Iceberg and are more of a book person, there’s a freely available one from May 2024 called Apache Iceberg: The Definitive Guide. Funnily enough, Delta Lake also has a freely available book, with an eerily similar name, also published in May 2024, called Delta Lake: The Definitive Guide.

Now, you probably want all your strategic workloads to access your data assets, so language support and processing engine integration are crucial.

Apache Iceberg currently supports SQL, Scala, Java, and Python, and was designed from the start to be engine-agnostic. This means you can use open-source tools such as Apache Flink, Apache Spark, and Apache Hive, but also Presto and Trino, to read and write Iceberg tables.
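
As a small illustration of that Python support, here’s a sketch using the PyIceberg library to read a table with no JVM engine at all. The catalog and table names are placeholders, and the connection details would normally live in a .pyiceberg.yaml config file:

from pyiceberg.catalog import load_catalog

# "default" is a placeholder; connection details come from ~/.pyiceberg.yaml.
catalog = load_catalog("default")
table = catalog.load_table("db.events")

# Scan with a filter and materialize the result as a PyArrow table.
arrow_table = table.scan(row_filter="id > 100").to_arrow()
print(arrow_table.num_rows)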

Although Delta Lake supports the same languages as Iceberg (SQL, Scala, Java, Python), it was built with Apache Spark in mind (hence the nickname Delta Spark). It really shines when you use the two together to write to Delta tables, unlocking some nice optimization perks. However, those benefits aren’t always available from other engines, since Delta relies on connectors to bridge the integration gap.

Winner: Apache Iceberg

Development stewardship

Let’s get into the meat of this blog post.

Apache Iceberg was created at Netflix to overcome the shortcomings of Hive, the first table format of the Big Data era. It was donated to the Apache Foundation in 2018 and has been developed ever since — outside of the influence of any single for-profit organization. The project clearly identifies the development team, including the PMC chair, the PMC members, and code committers. The contribution process is publicly documented, and the adoption of proposals follows the Apache Software Foundation code modification model.

The Delta Lake project was originally created by Databricks, a company founded by the co-creators of Apache Spark, and was donated to the Linux Foundation in October 2019.

The main issue is that, up until early 2022, all three members of the Technical Steering Committee for Delta Lake were affiliated with Databricks. Ironically, Databricks then removed this list from their contributing guide, in direct contradiction of the obligation stated in the charter they claim to follow. This concern about the lack of diversity was raised nearly a year ago, yet Databricks has not taken any steps to address it.

If we dig a bit deeper and look at the code committers since 2022, we can see that the top 20 contributors are all Databricks employees, despite the project having been open source since 2019:

$ git clone https://github.com/delta-io/delta && cd delta
$ git shortlog --since=2022 --summary --numbered --all --no-merges | head -20
195 Venki K. (Databricks)
161 Allison P. (Databricks)
141 Scott S. (Databricks)
84 Paddy X. (Databricks)
77 Prakhar J. (Databricks)
72 Johan L. (Databricks)
63 Lars K. (Databricks)
46 Jackie Z. (Databricks)
45 Dhruv A. (Databricks)
45 Ryan J. (Databricks)
44 Christos S. (Databricks)
42 Fredrik K. (Databricks)
41 lzlfred (Databricks)
35 Hao J. (Databricks)
34 Tom v. B. (Databricks)
31 Tathagata D. (Databricks)
30 Andreas C. (Databricks)
28 Ming D. (Databricks)
27 Fred S. L. (Databricks)
26 Shixiong Z. (Databricks)
$ # ^^ last names truncated and org name added

Maybe it’s just that the Databricks folks are very productive, and we should look at the long tail. Well, even if we extend the table above to include the top 40 committers, we get similar results: 37 of them work for Databricks. It’s definitely not diverse, and the risk is that features get implemented to serve a single company’s agenda rather than to solve the community’s problems.

Actually, the brand-new Liquid Clustering feature is an example of behind-closed-doors development. It was worked on privately for months by Databricks, despite the community asking for clarification early on. The design document was eventually shared with the community several months later, with comment-only access. Since then, comments have been made about some design choices, but so far, the dev team hasn’t been responsive.

Let’s see how Apache Iceberg fares in comparison:

$ git clone https://github.com/apache/iceberg && cd iceberg
$ git shortlog --since=2022 --summary --numbered --all --no-merges | head -20
350 Fokko D. (Tabular)
184 Anton O. (Apple)
179 Eduard T. (Tabular)
142 Eduard T. (Tabular)
123 Ajantha B. (Dremio)
68 Ryan B. (Tabular)
58 Steven Z. W. (Apple)
56 Amogh J. (Tabular)
51 Bryan K. (Tabular)
49 Manu Z. (Unknown)
48 Amogh J. (Tabular)
43 Prashant S. (Amazon)
40 Xianyang L. (Tencent)
37 Szehon H. (Apple)
34 Robert S. (Dremio)
33 pvary (Apple)
30 Daniel W. (Tabular)
30 Kyle B. (Tabular)
29 Yufei G. (Apple)
28 Hongyue/Steve Z. (Apple)
$ # ^^ last names truncated and org name added

Just by looking at this top 20 list, we can see that there’s a greater diversity of contributors to Apache Iceberg: from Tabular, founded by the co-creators of Iceberg and the project’s main driving force, but also from Apple, Dremio, Amazon, and Tencent. If we extend it to the top 40, it’s even more diverse, with committers from Google, Starburst, Oracle, Netflix, Cloudera, and Huazhong University of Science and Technology.

As I was writing those lines, it dawned on me that it would certainly be useful to create a score, from zero to 10, to distinguish open source projects that are strongly closed from those that are strongly open.

History has shown many times that a community-driven approach to software development will always beat a vendor-driven one when it comes to standards, and will often lead to faster development and more innovation.

Winner: Apache Iceberg.

Technical foundations

From a user’s perspective, both formats share a lot of features, but there are a few differences worth noting.

Language

Iceberg is implemented in Java, making it easy to onboard contributors. Delta Lake was built on top of the Apache Spark project and is implemented in Scala, a functional programming language with a fairly steep learning curve. Although Scala is sometimes considered ill-suited for open-source projects, it can attract very experienced developers who like it, so it’s not necessarily a disadvantage.

File formats

Iceberg supports several file formats such as Parquet, Avro, and ORC, whereas Delta Lake only supports Parquet files. This might be a problem if you’re a long-time big data practitioner and already have petabytes of data stored in older file formats that you need to process.
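
For instance, Iceberg lets you choose the data file format per table via a table property. A small sketch, reusing the Spark session from earlier (the table name is made up, and ‘write.format.default’ is the standard Iceberg property as I understand it):

# Store new data files as ORC instead of the Parquet default.
spark.sql("""
    CREATE TABLE demo.db.legacy_events (id BIGINT, payload STRING)
    USING iceberg
    TBLPROPERTIES ('write.format.default' = 'orc')
""")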

Data Catalog

Another important piece in any table format is the data catalog, which tracks all tables and knows how to address and update them in a transactional way. Without a catalog, managing tables would be cumbersome and error-prone, due to the lack of a centralized system for tracking table locations and associated metadata.

Creating a table within a specific catalog locks you in — you can’t write to it from another catalog. This control position grants the catalog vendor significant strategic power. They essentially hold the keys to the data kingdom, able to restrict how other platforms work with your data (limiting interoperability), or to favor specific integrations that benefit them.

Iceberg doesn’t impose a specific catalog for managing tables, but comes with a set of ready-to-use options: REST Catalog, Hive Metastore, JDBC database, and the open-source Project Nessie, which brings Git-like data versioning and cross-table transactions.
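
Pointing Spark at one of these catalogs is just configuration. A minimal sketch, assuming a REST catalog endpoint (the catalog name and URI below are made up):

from pyspark.sql import SparkSession

# Register an Iceberg REST catalog under the name "mycat" (hypothetical).
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.mycat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.mycat.type", "rest")
    .config("spark.sql.catalog.mycat.uri", "https://catalog.example.com/api")
    .getOrCreate()
)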

In a move to promote vendor neutrality, Snowflake announced Polaris, a new open-source data catalog which will be available soon. Compatible with Apache Iceberg’s REST catalog protocol, Polaris lets you seamlessly connect various processing engines like Apache Flink, Apache Spark, and Trino for data management.

If you want a more integrated experience with your existing data architecture, you can also decide to pick a vendor catalog, for example, Snowflake catalog, AWS Glue and DynamoDB catalogs, Google BigLake Metastore, or more specialized ones (and cloud provider agnostic too), such as Dremio or Starburst. Using a shared Iceberg catalog allows various processing engines to share a common data layer.

Delta Lake, on its own, does not offer any data cataloging functionality, but it can be used with Hive Metastore (via Spark), AWS Glue, and others. Databricks recommends using Unity Catalog, which brings governance and security capabilities, and considers Hive Metastore legacy.

Partitioning

Another cool feature of Iceberg is Partition Evolution, which allows modifying partition layouts in a table without rewriting the entire table. This is particularly useful when data volume or query patterns change.
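
Here’s roughly what that looks like in Spark SQL (the table name is made up; note that existing data keeps its old layout, and only newly written data uses the new partition spec):

# Partition by day initially.
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD days(ts)")

# Later, as query patterns shift, move to hourly partitions;
# no table rewrite required.
spark.sql("ALTER TABLE demo.db.events DROP PARTITION FIELD days(ts)")
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD hours(ts)")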

Delta Lake has something similar called Liquid Clustering, briefly mentioned above: a sort of fancy Z-ordering that provides the flexibility to redefine clustering keys without rewriting existing data.

Core Library

Since Iceberg was built with a core-first approach, it provides a Java library implementing all protocols from the specification. As a result, it’s quicker and less error-prone for processing engines to add support for Iceberg.

Despite its maturity and robustness, Delta Lake faces a hurdle: it relies on Apache Spark for read/write access. This is changing with delta-rs, a Rust implementation of the Delta Lake protocol that enables access without Spark. However, delta-rs relies primarily on open-source contributions, with minimal backing from Databricks, the company behind Delta Lake. Note that a similar effort has been started for Iceberg with iceberg-rust, to further extend its reach.
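
To give you an idea, here’s a minimal sketch with the deltalake Python package (the delta-rs bindings); the path and data are made up:

import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Write and read a Delta table from plain Python: no Spark, no JVM.
write_deltalake("/tmp/delta/events", pd.DataFrame({"id": [1, 2, 3]}))

dt = DeltaTable("/tmp/delta/events")
print(dt.version())    # current table version
print(dt.to_pandas())  # read it back as a pandas DataFrame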

Another weak point, acknowledged by the dev team, is the fragmented Delta connector ecosystem, with too many independent protocol implementations. It complicates the creation of new connectors, increases the risk of bugs as the specification evolves, and slows down the adoption of new protocol features. Databricks has heard the criticism and began bridging the gap in 2023 with the Delta Kernel (in Java), but the effort is still ongoing, and Delta Kernel still lacks several features compared to Delta Spark.

On the bright side, Delta Lake has recently added an interesting feature called UniForm that could help beat its opponent. UniForm, short for “Delta Universal Format,” is a kind of Rosetta stone that automatically generates metadata for the Apache Iceberg and Apache Hudi formats.
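
Enabling it is a matter of table properties. A hedged sketch based on my reading of the Databricks documentation (the exact properties required may vary with your Delta version):

# Create a Delta table whose Iceberg metadata is generated automatically.
spark.sql("""
    CREATE TABLE events_uniform (id BIGINT)
    USING delta
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")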

Be aware that it has a few limitations, though.

From a technical perspective, Iceberg and Delta have been playing a game of cat and mouse, each trying to outsmart the other. This fierce competition pushes both to keep coming up with better features and innovations, which is excellent news for developers.

Winner: Tie.

Processing engine compatibility

The switch from raw data lakes to table formats and headless storage is a big change from old-school monolithic data warehouses, and has led to more flexible, scalable, and compatible data architectures.

Now, with storage decoupled from compute, you can choose whatever processing engine you prefer. Let’s check out the options and their support for both Iceberg and Delta Lake.

Keep in mind that not all engines support everything, so double-check each engine’s documentation before making your choice if the following features matter to you (a quick Iceberg example follows the list):

  • Rollback
  • Time travel
  • Concurrent writes
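
As a taste of what these look like in Iceberg with Spark SQL (the snapshot ID and timestamp below are made-up placeholders):

# Time travel: query the table as of a timestamp or a snapshot ID.
spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-05-01 00:00:00'")
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 4912783085344829607")

# Rollback: restore an earlier snapshot with a stored procedure.
spark.sql("CALL demo.system.rollback_to_snapshot('db.events', 4912783085344829607)")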

Apache Iceberg has the edge once again, with more processing engines supporting both reads and writes.

Winner: Apache Iceberg

Conclusion

Of course, success depends not only on the current state of affairs, but also on what lies ahead. In the coming months and years, with the explosion of GenAI use cases, and the projected increase of real-time streaming data, Apache Iceberg and Delta Lake will need to swiftly tackle upcoming technical challenges to adapt and prevail. It will be interesting to see if Hudi can make a comeback under new circumstances, given that it was designed precisely for real-time data access and analytics. With Iceberg and Delta joining forces (see update at the top of this post), it now seems unlikely.

The roadmaps for both projects are available on GitHub, and they look very promising.

Apache Iceberg is on track to become the de facto standard in data management within the next few years. The technical foundations are sound, it’s well integrated with diverse tools in the data ecosystem, and it’s largely supported by vendors and the community.

With vendor lock-in a top concern for many organizations, Iceberg’s open architecture could prove a significant advantage over Delta Lake, especially for those seeking to future-proof their data infrastructure.

Global winner: Apache Iceberg

The views expressed in this article are those of the author and do not necessarily reflect the position of Confluent.

Engineer turned content creator, writing about data streaming. I work for Confluent.