How Iceberg Became the Industry Standard for Data Lakehouse Platforms

Tomer Shiran
Data, Analytics & AI with Dremio
6 min read · May 24, 2024


In recent years, the data analytics landscape has witnessed a significant shift. Open table formats, such as Apache Iceberg, have enabled scale-out data warehousing directly on a data lake, an architecture that has become known as the data lakehouse. Since the creation of enterprise data warehouses such as Teradata and Oracle, customers have been locked in by proprietary data storage (i.e., table formats and metadata catalogs). Data lakes were open, but couldn’t deliver the same functionality and performance as data warehouses. Customers have adopted open table formats to combine a vendor-agnostic data representation with data warehouse capabilities and performance. In this blog post, we explore the journey of open table formats and why the industry has selected Apache Iceberg as the open standard.

The Rise of Iceberg and Dremio’s Vision

The story begins in 2019, when Netflix contributed Iceberg to the Apache Software Foundation. Netflix was already using Iceberg internally, and once it became an open source project, it was quickly adopted by other large tech companies such as Apple, Adobe, and Airbnb. This move marked a strategic shift towards an open, vendor-agnostic solution for representing tables in object storage.

While Delta Lake emerged around the same time, it was not (and still is not) an open format. Over time, some of its source code was released, but key elements remained proprietary. For example, a Delta Lake-based lakehouse with the Databricks Unity Catalog cannot accept writes from non-Databricks compute engines, and other catalogs impose serious restrictions, such as not supporting concurrent writes. Iceberg clearly stood out: it is vendor agnostic, and dozens of leading companies collaborate on its roadmap and actively contribute to the project. Iceberg quickly gained traction due to its open nature and compatibility with modern data teams’ needs.

Industry Adoption and Validation

Throughout 2021–2022, Apache Iceberg and Delta Lake were locked in a two-horse race. The tide began to turn in 2023 as most of the major players in the data lake/lakehouse market rallied behind Iceberg.

  • AWS adopted Iceberg as the table format for its data services, such as Athena, Redshift, and Glue.
  • Snowflake adopted Iceberg as a new native format alongside its existing table format.
  • Google adopted Iceberg as its table format for its data services, such as BigQuery and BigLake.
  • Confluent adopted Iceberg as the format behind Tableflow, a technology that feeds Apache Kafka data into the lake.
  • Microsoft adopted Iceberg as a way to share data across Snowflake and Fabric.
  • Ryan Blue and several other Netflix engineers started a company called Tabular to build an Iceberg catalog service.
  • Dozens of open source and commercial projects have adopted Iceberg as their native table format.

Developer Community

Iceberg distinguishes itself not only through its extensive ecosystem of technologies and products but also by the wide array of companies contributing to the project. This diversity brings several advantages, including rapid innovation and assurance that the project will continue to evolve independently of any single company. The following diagram illustrates the variety of contributors to the Iceberg project compared to those of the Delta Lake project:

Dremio’s Iceberg Journey

It was clear that table formats such as Apache Iceberg and Delta Lake would elevate the data lake architecture to a new level by simplifying data management and enabling data warehouse workloads on the lake. But to achieve these objectives, it was critical to ensure that the industry would embrace an open format with an open and diverse ecosystem that was not controlled by a single vendor. After all, data lakes had become popular in the prior decade in large part due to the emergence of Apache Parquet, an open file format that the industry coalesced around.

Recognizing Iceberg’s potential to revolutionize data management, Dremio became the first technology provider to embrace it in early 2021. As the creators of Apache Arrow, we were well aware of the potential of open source-driven standardization, so we embarked on a series of initiatives aimed at evangelizing Iceberg and, in parallel, integrating it seamlessly into our platform.

Awareness and Community Building

We took a variety of steps to raise awareness of Iceberg:

  • Evangelizing Iceberg through blog posts, videos, and industry events.
  • Hosting the Subsurface conference and dedicating it to Iceberg, featuring talks from project committers and data teams at Apple, Netflix, Airbnb, and other companies.
  • Authoring the O’Reilly book on Iceberg (released at Subsurface 2024 in May).

Product Innovation and Integration

Dremio’s commitment to delivering an exceptional Iceberg experience led to significant product enhancements:

  • Creating a simple and flexible data ingestion capability (COPY INTO) for Iceberg.
  • Implementing full DML (e.g., INSERT, UPDATE, DELETE) capabilities for Iceberg in the Dremio query engine (see the sketch after this list).
  • Rearchitecting Dremio’s internals to use Iceberg throughout the engine. For example, Reflections (Dremio’s query acceleration technology) are materialized as Iceberg tables, and the metadata of all data sources in Dremio is cached as Iceberg metadata.
  • Creating Nessie, an open source Iceberg catalog designed for modern data teams, enabling data-as-code practices such as branching, tagging, and version control that can be used by any engine (e.g., Dremio, Spark, Trino).
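
For readers who have not worked with Iceberg DML before, the sketch below shows what these operations look like in practice. It uses Apache Spark with the Iceberg runtime rather than Dremio’s engine, and the catalog name, warehouse path, and table are placeholder assumptions; the DML statements themselves are standard SQL that Iceberg-aware engines, including Dremio, support.

```python
from pyspark.sql import SparkSession

# Minimal local setup; assumes the iceberg-spark-runtime package is on the classpath.
# The catalog name ("demo") and warehouse path are placeholder assumptions.
spark = (
    SparkSession.builder
    .appName("iceberg-dml-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create an Iceberg table and exercise full DML against it.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.orders (id BIGINT, status STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.orders VALUES (1, 'open'), (2, 'open')")
spark.sql("UPDATE demo.db.orders SET status = 'shipped' WHERE id = 1")
spark.sql("DELETE FROM demo.db.orders WHERE id = 2")
spark.sql("SELECT * FROM demo.db.orders").show()
```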

Iceberg’s Triumph and Future Outlook

With widespread adoption and support from Dremio and other key players in the industry (AWS, Confluent, Google, Snowflake, etc.), it is now clear that Iceberg has become the de facto standard table representation for data lakes/lakehouses.

The industry shift towards Iceberg does, of course, raise questions about the future of competing formats like Delta Lake and Hudi. The few vendors that have built their stacks on those formats have been introducing abstraction layers, such as UniForm and XTable, so that their customers can at least partially benefit from the Iceberg ecosystem. However, abstraction layers always come with downsides. For example, UniForm is limited to read-only access to data (as of today) and introduces the risk of inconsistent query results because it is not ACID compliant. Ultimately, as the Iceberg ecosystem continues to expand, these vendors will likely need to deprecate competing formats and embrace Iceberg as a native format in order to remain competitive.

Dremio’s Commitment to Truly Open Table and Data Formats: Iceberg and Beyond

Dremio remains committed to supporting Apache Iceberg as a vendor-agnostic table format, ensuring interoperability across catalogs and engines.

  • Dremio’s Iceberg-native query engine already works with a variety of Iceberg catalogs, and we intend to support every major Iceberg catalog so that users can query tables no matter where they are. In addition, this allows users to query data with high-performance parallelism across systems (e.g., imagine a join between Iceberg tables in Snowflake, Confluent, and Glue, and perhaps external sources such as Oracle and MongoDB).
  • Dremio’s Iceberg catalog, backed by the open source Nessie project, is the most advanced Iceberg catalog, providing game-changing capabilities such as the ability to work with data as code and automated data optimization (see the data-as-code sketch below).
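
To give a flavor of the data-as-code workflow mentioned above, here is a minimal sketch using Spark with Nessie’s SQL extensions (any Nessie-aware engine can follow the same pattern). The Nessie endpoint, warehouse path, branch names, and table are assumptions made for the example, not a description of Dremio’s managed service.

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime and nessie-spark-extensions packages are available.
# The Nessie URI, warehouse path, branch names, and table are placeholder assumptions.
spark = (
    SparkSession.builder
    .appName("nessie-data-as-code-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "/tmp/nessie-warehouse")
    .getOrCreate()
)

# Branch the catalog, make isolated changes, then merge them back (like git for data).
spark.sql("CREATE BRANCH IF NOT EXISTS etl IN nessie FROM main")
spark.sql("USE REFERENCE etl IN nessie")
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.db")
spark.sql("CREATE TABLE IF NOT EXISTS nessie.db.orders (id BIGINT, status STRING) USING iceberg")
spark.sql("INSERT INTO nessie.db.orders VALUES (1, 'open')")  # visible only on the etl branch
spark.sql("MERGE BRANCH etl INTO main IN nessie")             # publish the change atomically
```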

What’s Next?

With Iceberg now established as the industry’s standard table format, the focus will increasingly shift to the metadata catalog. The metadata catalog works alongside the table format to provide a data platform that supports a variety of specialized, best-in-class compute engines. As with the table format, we believe companies will want to avoid being locked into a proprietary catalog and will opt for open source solutions. Therefore, Dremio will continue to invest in Nessie and collaborate with the community to deliver the most advanced open source catalog, one that any company can download and deploy without concerns about vendor lock-in. To learn more about Apache Iceberg, visit our resource center.
