Delta, Hudi, Iceberg — A Benchmark Compilation

Kyle Weller
5 min read · Aug 28, 2023


Performance benchmarks are rarely representative of real-life workloads, so you should always run your own analysis against your own data. Nonetheless, benchmarks can serve as an interesting data point as you start your research into choosing a data lakehouse platform built on Delta Lake, Apache Hudi, or Apache Iceberg. This article is a compilation of several noteworthy benchmarks published by different organizations.

Databeans and Onehouse

Databeans worked with Databricks to publish a benchmark used in their Data+AI Summit keynote in June 2022, but they misconfigured an obvious out-of-the-box setting. Onehouse updated the benchmark here:
https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-transparent-tpc-ds-lakehouse-performance-benchmarks

Brooklyn Data and Onehouse

Databricks asked Brooklyn Data to publish a benchmark of Delta vs Iceberg:
https://brooklyndata.co/blog/benchmarking-open-table-formats

Onehouse added Apache Hudi and published the code in the Brooklyn Github repo:
https://github.com/brooklyn-data/delta/pull/2

A clear pattern emerges from these benchmarks: Delta and Hudi are comparable, while Apache Iceberg consistently trails behind as the slowest of the three. Performance isn’t the only factor you should consider, but performance does translate into cost savings that add up throughout your pipelines.

A note on running TPC-DS benchmarks:

One key thing to remember when running TPC-DS benchmarks comparing Hudi, Delta, and Iceberg is that, by default, Delta and Iceberg are optimized for append-only workloads, while Hudi is optimized for mutable workloads. By default, Hudi uses an `upsert` write mode, which naturally carries a write overhead compared to plain inserts. Without this knowledge you may be comparing apples to oranges. Change this one out-of-the-box configuration to `bulk_insert` for a fair assessment: https://hudi.apache.org/docs/write_operations/
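For illustration, here is a minimal PySpark sketch of that one configuration change. It assumes a Spark session with the Hudi bundle on the classpath; the bucket paths are hypothetical placeholders, and the key fields shown are one plausible choice for a TPC-DS `store_sales` load:

```python
from pyspark.sql import SparkSession

# Hudi's Spark integration generally expects the Kryo serializer.
spark = (
    SparkSession.builder
    .appName("hudi-bulk-insert-demo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Hypothetical source path for the TPC-DS store_sales data.
store_sales_df = spark.read.parquet("s3://my-bucket/tpcds/store_sales/")

(
    store_sales_df.write.format("hudi")
    .option("hoodie.table.name", "store_sales")
    .option("hoodie.datasource.write.recordkey.field", "ss_item_sk,ss_ticket_number")
    .option("hoodie.datasource.write.precombine.field", "ss_sold_date_sk")
    # The one setting that matters for an apples-to-apples TPC-DS load:
    # replace Hudi's default `upsert` operation with `bulk_insert`
    # for append-only data.
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .mode("append")
    .save("s3://my-bucket/lakehouse/store_sales/")  # hypothetical target path
)
```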

Is TPC-DS sufficient?

This is where things get interesting… TPC-DS is a gold standard for benchmarks and has a long-standing history of comparing the performance of OLAP systems. While it is a great baseline, TPC-DS doesn’t stress test the core components of Delta, Hudi, and Iceberg, which are relatively new innovations from the last few years. Nor does it measure how the system performs over time as you continue operating your data lakehouse.

I have personally worked hands-on with hundreds of customers who have built data lakehouse systems. I have seen the behind-the-curtains comparison tests, and how the baseline TPC-DS numbers start to drift into an inconsequential reference.

Just a few months ago, a research lab at Microsoft published a new paper called LST-Bench. They make clear that the purpose of the paper was not to declare winners or losers, but to propose a new benchmark framework that the community can leverage for more holistic comparisons. Their in-depth research is well worth the read, and I believe their benchmark is headed in the right direction for the community.

The benchmark framework they propose takes a TPC-DS base workload, applies a series of data mutations and concurrent sessions, and exercises each project’s data maintenance tasks, such as small-file optimization, cleaning, and compaction. In addition to measuring pure performance by runtime and throughput, they also introduce metrics such as Longevity and Resilience, which describe how your data lakehouse performs over time as the tables grow. Read the paper for more details.
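To make the “performance over time” idea concrete, here is a toy PySpark sketch in the spirit of that longevity metric: after each batch of mutations, it re-runs a fixed probe query and records the runtime, so you can see whether the table degrades as it ages. This is not the LST-Bench tool itself, and the write options are passed in as assumptions (record key, precombine field, etc.):

```python
import time

def longevity_probe(spark, table_path, mutation_batches, probe_sql, hudi_opts):
    """After each mutation batch, time a fixed probe query against the table.

    hudi_opts is a dict of the usual Hudi write options (record key,
    precombine field, table name, ...), assumed to be set by the caller.
    """
    runtimes = []
    for batch_df in mutation_batches:
        # Apply one batch of row-level mutations (upserts) to the table.
        (batch_df.write.format("hudi")
            .options(**hudi_opts)
            .option("hoodie.datasource.write.operation", "upsert")
            .mode("append")
            .save(table_path))
        # Probe: time the same analytical query after each mutation batch.
        start = time.monotonic()
        spark.read.format("hudi").load(table_path).createOrReplaceTempView("t")
        spark.sql(probe_sql).collect()
        runtimes.append(time.monotonic() - start)
    # A flat curve suggests stable performance; an upward slope suggests
    # degradation that table services (cleaning, compaction) should address.
    return runtimes
```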

Performance benchmarks from Walmart

The last benchmark in this compilation is a comparison written by Walmart. The Walmart engineering team did comprehensive research comparing Delta Lake, Apache Hudi, and Apache Iceberg, and they documented their results here: https://medium.com/walmartglobaltech/lakehouse-at-fortune-1-scale-480bcb10391b

The three projects were evaluated by Walmart across a comprehensive weighted matrix which considered availability, compatibility, cost, performance, roadmap, support, and TCO.

The best part about this performance benchmark is that they didn’t run some synthetic workload… They used their real data, and they described the intimate details of their workload patterns so the community could understand their results.

Workload 1 (WL1) is a time-based table that suffers from significant late-arriving records and read/write amplification across many partitions. Workload 2 (WL2) maintains row-level upserts with low latency via change data capture from a multi-TB Cassandra table. For measuring query performance, they used advanced query patterns such as an aggregate partition count, a needle-in-a-haystack lookup predicated on row key, a 3-way table join on row keys, and more, as sketched below.
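For readers unfamiliar with those query shapes, here is a hedged Spark SQL sketch of what each one looks like. It assumes a SparkSession `spark` with the tables registered as views; the table and column names (`events`, `row_key`, `part_col`, `dim_a`, `dim_b`) are hypothetical placeholders, not Walmart’s actual schema:

```python
# Aggregate partition count: how many rows land in each partition.
spark.sql("""
    SELECT part_col, COUNT(*) AS row_cnt
    FROM events
    GROUP BY part_col
""").show()

# Needle in a haystack: fetch a single record predicated on its row key.
spark.sql("""
    SELECT * FROM events WHERE row_key = 'abc-123'
""").show()

# 3-way table join on row keys across related tables.
spark.sql("""
    SELECT e.row_key, a.attr_1, b.attr_2
    FROM events e
    JOIN dim_a a ON e.row_key = a.row_key
    JOIN dim_b b ON e.row_key = b.row_key
""").show()
```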

The results? Delta failed on OCC background compaction during ingestion… Iceberg failed the writes altogether… Apache Hudi was the fastest, and it was able to scale to their demanding workloads.

The most common rebuttal I hear to deflect these results is that the project versions tested are now old, so try it on the latest version… The latest versions of Iceberg and Delta have not closed the gaps that led to these documented systematic failures. You can find these differences discussed at length in this video. The fundamental gaps come from a combination of advanced tuning capabilities inside Hudi’s MoR write mode and, most importantly, the non-blocking, asynchronous operation of its table services.

In Summary

Delta, Hudi, and Iceberg are all amazing projects with strong momentum behind each. There is no single right answer for how to choose, so remember to run your own tests. Your individual workloads, query patterns, and feature requirements are all vital factors in determining which choice is right for you.

One pitfall I’ve seen for beginners is judging by a hello-world sample. Some of these projects are thin formats that seem easier to use on the surface. Hudi has a few more knobs and configurations that can at first feel like a burden for a hello-world demo. But when you actually build a pipeline to operate the workload in production, Hudi suddenly becomes a piece of cake, while it becomes challenging and tedious to manage your tables with some of the others. First-hand experiences and examples of this are shared in this video: https://www.linkedin.com/events/deepdive-hudi-iceberg-anddeltal7095484265877950465/comments/
