The Future of Data Lake Table Formats (Delta Lake, Iceberg, Hudi)

Himanshu Gaurav
May 21, 2023

As we look to the future, one important question is how much interoperability we can expect among the three dominant table formats (Delta Lake, Iceberg, Hudi) across compute engines and cloud platforms. This matters because it determines whether users can seamlessly access and use the same data from different platforms and compute engines. This article is about what the future holds for table formats.

Within the data engineering industry, it is imperative to address the gap between the arrival of a new disruptive technology and its integration into cloud platforms. The debate surrounding table formats (such as Delta Lake, Iceberg, and Hudi) is a prime example, as the choice of format can significantly affect performance, usability, and compatibility. The specifics of the table formats and their comparisons are not covered here, as the internet is already swamped with articles on the topic (or ask ChatGPT). Links to a few of those informative articles are at the end of this blog.

The Current State Challenges

The current era is marked by the rise of open source, particularly in the realm of data lake table formats (Delta Lake, Hudi, Iceberg). This trend suggests that more data will be stored in these formats in the future. The rationale follows from the open data architecture, which is driven by three principles: (1) openness, (2) composability, and (3) heterogeneity/interoperability (https://medium.com/@DataEnthusiast/open-data-architecture-at-scale-on-cloud-part-1-3381b411533f).

Interoperability across the three table formats is becoming a challenge, because many of the compute engines on the three major cloud platforms support only one of the formats. Many cloud-agnostic players also support only one of them. It's worth noting that several SQL compute engines can support any one of the three table formats. However, this may require a workaround using manifest files, which poses maintenance and support challenges.
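To see why a manifest-based workaround carries a maintenance cost, consider a minimal sketch in Python. It assumes a heavily simplified commit log, loosely modeled on the way a table format's transaction log records added and removed data files. The function name `build_manifest`, the log structure, and the file paths are all hypothetical and are not any library's API. An engine that reads only the manifest sees a frozen snapshot, so the manifest has to be regenerated after every commit.

```python
from typing import Dict, Iterable, List

# Hypothetical, simplified commit log: each commit is a list of actions.
# Real table formats record far more (schema, statistics, partition values).
Action = Dict[str, str]
Commit = List[Action]


def build_manifest(commits: Iterable[Commit]) -> List[str]:
    """Replay commits in order and return the sorted list of live data files."""
    live = set()
    for commit in commits:
        for action in commit:
            if action["op"] == "add":
                live.add(action["path"])
            elif action["op"] == "remove":
                live.discard(action["path"])
            else:
                raise ValueError(f"unknown action: {action['op']!r}")
    return sorted(live)


# First commit writes two data files (illustrative names only).
log: List[Commit] = [
    [
        {"op": "add", "path": "part-000.parquet"},
        {"op": "add", "path": "part-001.parquet"},
    ],
]
manifest_v0 = build_manifest(log)

# A later compaction rewrites both files into one.
log.append(
    [
        {"op": "remove", "path": "part-000.parquet"},
        {"op": "remove", "path": "part-001.parquet"},
        {"op": "add", "path": "part-002.parquet"},
    ]
)
manifest_v1 = build_manifest(log)

# Files that the old manifest still lists but the table no longer uses.
stale_files = sorted(set(manifest_v0) - set(manifest_v1))
```

After the second commit, every file in the first manifest is stale. An engine still reading that manifest would return outdated data or fail on missing files, which is the support burden described above.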

The Future State Solution

One potential future solution, the first option, is for the various compute engines on the three major cloud platforms to support all three major table formats natively.

Compute Engines Mostly Supporting Only One of the Table Formats

The second option could be a generic API layer sitting between the compute engines and the table formats, handling translation so that every format is supported across the compute spectrum.

Compute Engines Supporting All the Table Formats via an API Layer
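The generic API layer idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an existing library: the interface `TableFormatAdapter`, the three adapter classes, and the functions `open_table` and `scan` are hypothetical names, and the adapters return placeholder file names instead of reading real table metadata. The point is the shape of the design: engine code talks to one interface, and each format plugs in behind it.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class TableFormatAdapter(ABC):
    """Hypothetical interface that a compute engine would code against."""

    @abstractmethod
    def list_data_files(self, table_path: str) -> List[str]:
        """Return the data files that make up the table's current snapshot."""


class DeltaAdapter(TableFormatAdapter):
    def list_data_files(self, table_path: str) -> List[str]:
        # Placeholder: a real adapter would read the Delta transaction log.
        return [f"{table_path}/delta-part-0.parquet"]


class IcebergAdapter(TableFormatAdapter):
    def list_data_files(self, table_path: str) -> List[str]:
        # Placeholder: a real adapter would read Iceberg metadata and manifests.
        return [f"{table_path}/iceberg-part-0.parquet"]


class HudiAdapter(TableFormatAdapter):
    def list_data_files(self, table_path: str) -> List[str]:
        # Placeholder: a real adapter would read the Hudi timeline.
        return [f"{table_path}/hudi-part-0.parquet"]


# Registry mapping a format name to its adapter class.
_ADAPTERS: Dict[str, Type[TableFormatAdapter]] = {
    "delta": DeltaAdapter,
    "iceberg": IcebergAdapter,
    "hudi": HudiAdapter,
}


def open_table(table_format: str) -> TableFormatAdapter:
    """Look up the adapter for a format name (case-insensitive)."""
    try:
        return _ADAPTERS[table_format.lower()]()
    except KeyError:
        raise ValueError(f"unsupported table format: {table_format!r}") from None


def scan(table_format: str, table_path: str) -> List[str]:
    """Engine-side code: identical regardless of the underlying format."""
    return open_table(table_format).list_data_files(table_path)
```

With this shape, supporting a new format means registering one more adapter, while a call such as `scan("iceberg", "s3://bucket/sales")` stays unchanged on the engine side.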

The success of open-source table formats lies in their ability to facilitate interoperability. Many data practitioners prefer to combine multiple technologies when building ETL pipelines: they write data with one processing engine and read it with another to continue their analysis.

Below are some articles on data lake table formats that you may find helpful.

OneHouse: https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-apache-iceberg-lakehouse-feature-comparison

Dremio: https://www.dremio.com/blog/comparison-of-data-lake-table-formats-apache-iceberg-apache-hudi-and-delta-lake/

Databricks: https://www.youtube.com/watch?v=Wx8G08jaedo&ab_channel=Databricks

I hope you found it helpful! Thanks for reading!

For other data space topics, see the link below.

https://medium.com/@DataEnthusiast

Let’s connect on LinkedIn!

Authors

Himanshu Gaurav — www.linkedin.com/in/himanshugaurav21

Bala Vignesh S — www.linkedin.com/in/bala-vignesh-s-31101b29

Himanshu is a thought leader in data space who has led, designed, implemented, and maintained highly scalable, resilient, and secure cloud data solutions.