Comparison of Data Lake table formats

Priyanka Srivastava
4 min read · Sep 19, 2022


Table Formats

Table formats allow us to interact with data lakes as easily as we interact with databases, using our favourite tools and languages. Data in a data lake is often spread across many files; a table format lets us treat those files as a single dataset, a table.

What features are expected of a data lake?

Comparison between Iceberg, Hudi and Delta Lake

Apache Iceberg, Apache Hudi, and Delta Lake all take a similar approach: they leverage metadata to handle the heavy lifting. Metadata structures are used to define:

  • What is the table?
  • What is the table’s schema?
  • How is the table partitioned?
  • What data files make up the table?

While starting from a similar premise, each format has many differences, which may make one table format more compelling than another when it comes to enabling analytics on your data lake.

Figure: comparison of Apache Iceberg, Apache Hudi and Delta Lake

ACID (Atomicity, Consistency, Isolation, Durability)

Apache Iceberg

Apache Iceberg’s approach is to define the table through three categories of metadata. These categories are:

  • “metadata files” that define the table
  • “manifest lists” that define a snapshot of the table
  • “manifests” that define groups of data files that may be part of one or more snapshots

Query optimization and all of Iceberg’s features are enabled by the data in these three layers of metadata. Iceberg provides snapshot isolation and ACID support through the metadata tree (i.e., metadata files, manifest lists, and manifests). When a query is run, Iceberg uses the latest snapshot unless told otherwise. Each write to a table creates a new snapshot, which does not affect concurrent queries. Concurrent writes are handled through optimistic concurrency: whichever writer commits its new snapshot first succeeds, and the other writes are reattempted. Beyond the typical creates, inserts, and merges, row-level updates and deletes are also possible with Apache Iceberg. All of these transactions can be expressed as SQL commands.
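To make the snapshot and optimistic-concurrency idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Iceberg's API: the names `Snapshot`, `ToyTable`, `compare_and_swap` and `append_files` are hypothetical, and a lock stands in for whatever atomic swap a real catalog provides.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # immutable set of file paths visible in this snapshot


class ToyTable:
    """Toy model (hypothetical): a table is a pointer to its current snapshot."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the catalog's atomic swap
        self.current = Snapshot(0, ())

    def compare_and_swap(self, expected_id, new_snapshot):
        # The commit succeeds only if nobody committed since `expected_id` was read.
        with self._lock:
            if self.current.snapshot_id != expected_id:
                return False
            self.current = new_snapshot
            return True


def append_files(table, new_files, max_retries=5):
    """Optimistic append: read the base snapshot, build a new one, retry on conflict."""
    for _ in range(max_retries):
        base = table.current
        candidate = Snapshot(base.snapshot_id + 1, base.data_files + tuple(new_files))
        if table.compare_and_swap(base.snapshot_id, candidate):
            return candidate
    raise RuntimeError("commit failed after retries")


table = ToyTable()
reader_view = table.current              # a reader pins snapshot 0
append_files(table, ["a.parquet"])       # writers create snapshots 1 and 2
append_files(table, ["b.parquet"])
# A writer still holding snapshot 0 as its base loses the race and must retry.
stale_commit_ok = table.compare_and_swap(0, Snapshot(1, ("c.parquet",)))
```

The reader that pinned snapshot 0 keeps seeing an empty table while the writers move the pointer forward, which is the isolation property described above.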

Apache Hudi

Apache Hudi’s approach is to group all transactions into different types of actions that occur along a timeline. Hudi uses a directory-based approach with timestamped data files and log files that track changes to the records in each data file. Hudi also offers a metadata table for query optimization (enabled by default starting in version 0.11.0). This table tracks a list of files that can be used for query planning instead of file-listing operations, avoiding a potential bottleneck for large datasets. Apache Hudi also has atomic transactions and SQL support for CREATE TABLE, INSERT, UPDATE, DELETE, and queries.
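The timeline idea can be sketched as a list of timestamped actions, where readers only consider actions that have completed. This is a toy model in plain Python, not Hudi's API: `Instant` and `visible_instants` are hypothetical names, and the timestamps are made-up sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instant:
    time: str    # timestamp string that sorts chronologically
    action: str  # e.g. "commit", "deltacommit", "compaction", "clean"
    state: str   # "requested", "inflight", or "completed"


def visible_instants(timeline, as_of=None):
    """Return completed instants in time order, optionally up to `as_of`."""
    done = [
        i for i in timeline
        if i.state == "completed" and (as_of is None or i.time <= as_of)
    ]
    return sorted(done, key=lambda i: i.time)


timeline = [
    Instant("20220919100000", "commit", "completed"),
    Instant("20220919110000", "deltacommit", "completed"),
    Instant("20220919120000", "commit", "inflight"),  # still running, not visible
]

latest = visible_instants(timeline)
earlier = visible_instants(timeline, as_of="20220919100000")
```

Because an in-flight action is ignored until it completes, a reader never sees a half-written commit, and passing `as_of` gives a simple form of reading the table as of an earlier point on the timeline.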

Delta Lake

Delta Lake’s approach is to track metadata in two types of files:

  1. Delta Logs sequentially track changes to the table.
  2. Checkpoints summarize all changes to the table up to that point minus transactions that cancel each other out.

Delta Lake also supports ACID transactions and includes SQL support for creates, inserts, merges, updates, and deletes.
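The relationship between the log and checkpoints can be sketched as log replay. The following is a toy model in plain Python, not Delta Lake's actual file format or API: `replay`, `make_checkpoint` and `read_table` are hypothetical helpers, and each commit is reduced to a list of add/remove file actions.

```python
def replay(actions, state=None):
    """Apply add/remove file actions on top of an existing set of live files."""
    files = set(state or ())
    for op, path in actions:
        if op == "add":
            files.add(path)
        elif op == "remove":
            files.discard(path)
        else:
            raise ValueError(f"unknown action: {op}")
    return files


# One list of actions per table version (version number = list index).
log = [
    [("add", "part-0.parquet")],
    [("add", "part-1.parquet")],
    [("remove", "part-0.parquet"), ("add", "part-2.parquet")],
    [("add", "part-3.parquet")],
]


def make_checkpoint(log, version):
    """Summarize versions 0..version; cancelled-out files disappear."""
    state = set()
    for commit in log[: version + 1]:
        state = replay(commit, state)
    return version, frozenset(state)


def read_table(log, checkpoint=None):
    """Start from the checkpoint if there is one, then replay only newer commits."""
    if checkpoint is None:
        start, state = 0, set()
    else:
        start, state = checkpoint[0] + 1, set(checkpoint[1])
    for commit in log[start:]:
        state = replay(commit, state)
    return state


checkpoint = make_checkpoint(log, 2)
files_via_checkpoint = read_table(log, checkpoint)
files_via_full_replay = read_table(log)
```

Both read paths end with the same set of live files; the checkpoint only saves the reader from replaying every earlier log entry.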

Partition evolution

Partition evolution allows us to update the partition scheme of a table without having to rewrite all the previous data. At the time of writing, Apache Iceberg is the only table format with partition evolution support. Partitions are tracked based on the partition column and the transform applied to it (such as transforming a timestamp into a day or year).
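A rough way to picture this is a table that keeps a list of partition specs and records, for every file, which spec it was written under. The sketch below is a toy model in plain Python, not Iceberg's API: `ToyPartitionedTable`, `evolve` and `write` are hypothetical names, and partition values are shown as readable strings for illustration only.

```python
from datetime import datetime

# Transforms applied to the partition column (a timestamp in this example).
TRANSFORMS = {
    "year": lambda ts: ts.strftime("%Y"),
    "month": lambda ts: ts.strftime("%Y-%m"),
    "day": lambda ts: ts.strftime("%Y-%m-%d"),
}


class ToyPartitionedTable:
    def __init__(self, column, transform):
        self.specs = [(column, transform)]  # spec id = index in this list
        self.files = []                     # (spec_id, partition_value, row)

    def evolve(self, column, transform):
        # Only metadata changes; files already written are left untouched.
        self.specs.append((column, transform))

    def write(self, row):
        spec_id = len(self.specs) - 1       # new data uses the latest spec
        column, transform = self.specs[spec_id]
        value = TRANSFORMS[transform](row[column])
        self.files.append((spec_id, value, row))


events = ToyPartitionedTable("ts", "year")
events.write({"id": 1, "ts": datetime(2022, 9, 19)})
events.evolve("ts", "day")                  # switch from yearly to daily partitions
events.write({"id": 2, "ts": datetime(2022, 9, 20)})
```

The first file stays under the yearly spec while the second lands in a daily partition, so no historical data had to be rewritten when the scheme changed.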

Schema Evolution

As data evolves over time, so does the table schema: columns may need to be renamed, types changed, columns added, and so forth. All three table formats support different levels of schema evolution.
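One common way to make renames and added columns cheap is to identify columns by a stable ID rather than by name, so that schema changes only touch metadata. The sketch below illustrates that idea in plain Python; it is a toy model under that assumption, and `ToySchema` with its methods is hypothetical rather than any format's real API.

```python
class ToySchema:
    def __init__(self):
        self._next_id = 1
        self.columns = {}  # field id -> current column name

    def add_column(self, name):
        field_id = self._next_id
        self._next_id += 1
        self.columns[field_id] = name
        return field_id

    def rename_column(self, old, new):
        for field_id, name in self.columns.items():
            if name == old:
                self.columns[field_id] = new  # metadata-only change
                return
        raise KeyError(old)

    def read(self, stored_row):
        # Data files store values by field id; columns added later read as None.
        return {name: stored_row.get(fid) for fid, name in self.columns.items()}


schema = ToySchema()
schema.add_column("id")                      # field id 1
schema.add_column("ts")                      # field id 2
old_row = {1: 7, 2: "2022-09-19"}            # written before the schema changed

schema.rename_column("ts", "event_time")
schema.add_column("country")                 # field id 3
row_under_new_schema = schema.read(old_row)
```

The old row is read correctly under the new schema without rewriting it: the renamed column resolves through its ID and the newly added column comes back empty.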


References

Snowflake Integration

https://docs.databricks.com/delta/snowflake-integration.html

https://www.snowflake.com/blog/expanding-the-data-cloud-with-apache-iceberg/

Fivetran Integration

Performance

https://databeans-blogs.medium.com/delta-vs-iceberg-vs-hudi-reassessing-performance-cb8157005eb0

Others

https://www.fivetran.com/blog/iceberg-in-the-modern-data-stack

https://www.dremio.com/subsurface/comparison-of-data-lake-table-formats-iceberg-hudi-and-delta-lake/

https://iceberg.apache.org/docs/latest/spark-writes/

https://www.iteblog.com/ppt/sparkaisummit-north-america-2020-iteblog/a-thorough-comparison-of-delta-lake-iceberg-and-hudi-iteblog.com.pdf
