Data Engineering
Why Choosing Between Delta and Iceberg Shouldn’t Be a Problem Anymore
Introduction
Ever felt stuck choosing between Delta Lake and Apache Iceberg for your data storage needs?
It’s like being asked to choose between pani puri and vada pav — both are amazing, but picking one feels unfair!
But guess what? You may not have to choose anymore.
It’s time to move beyond the whole “which format is better?” debate and focus on what really matters — your data.
The Storage Wars: Delta vs. Iceberg
For a while now, data engineers have been stuck in a dilemma — should you go with Delta or Iceberg?
It’s a classic tale of two great options duking it out, kinda like the whole Blu-ray vs. HD-DVD saga back in the day.
Both systems have their own pros and cons, and frankly, it’s been a pain to choose.
What’s worse?
This confusion has left a lot of people clinging to the old Hive format, which is not exactly #DataGoals.
Ryan and Michael (the creators of Iceberg and Delta, respectively) admit that this wasn’t their intention at all. They’ve put in countless hours to perfect these formats, but the real aim was never to create a competition — it was to make life easier for us data nerds.
So now, instead of staying stuck in the past, they’re joining forces to build something better.
Why Should You Care?
So, why does this matter to you? Well, the whole point of storage systems is to take away your worries, not add to them.
Imagine a world where you don’t have to worry about which storage format you should use or whether it will work with your existing tools.
It’s like choosing between Netflix and Amazon Prime without ever worrying about missing out on your favorite shows. 🧘
Michael and Ryan are working towards a unified system where choosing one over the other doesn’t have to be a make-or-break decision.
It’s about creating a seamless experience where you can focus on using your data, not sweating over formats.
The Origin Stories: Delta & Iceberg
Before we get into what’s next, let’s rewind a bit. Ryan started working on Iceberg while he was at Netflix, along with his co-creator Dan Weeks.
They were managing all the open-source data infrastructure at Netflix, including Hadoop clusters and Spark deployments. They faced the same issues that led Michael to create Delta Lake.
Michael’s journey with Delta started when he met Dominic Brezinski at a Spark Summit (now known as the Data and AI Summit).
Dominic had a massive problem: he needed a way to ingest and analyze tons of security data in real time.
Think every DHCP request, every TCP connection — basically, a lot of data. Michael was leading the structured streaming team at the time. His team was great at reading data from Kafka and converting it to Parquet files, but they kept running into a common problem: a million Parquet files on S3, and all the headaches that come with that.
Inspired by this challenge, Michael and his team quickly pivoted to build what would become Delta Lake. And within a few months, they were testing it on production data, solving problems that had plagued them for ages.
The Common Ground
As both projects evolved, Michael and Ryan realized they were actually solving a lot of the same problems — just in slightly different ways.
Delta tracks changes through an ordered transaction log of commits, while Iceberg points each snapshot at a tree of metadata and data files. But at the core, both systems are focused on providing reliable, scalable, and efficient data storage.
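To make the log-replay idea concrete, here’s a deliberately tiny sketch in plain Python. It is not the real Delta implementation — actual commits are JSON files under `_delta_log/` with richer actions (metadata, protocol, stats) plus checkpoints — but it shows the core trick: the current table state is just the result of replaying add/remove actions in order, and replaying a prefix of the log gives you time travel.

```python
# Illustrative sketch only: a drastically simplified Delta-style log.
# Each commit is a list of add/remove actions over data files.
commits = [
    [{"action": "add", "path": "part-0000.parquet"}],
    [{"action": "add", "path": "part-0001.parquet"}],
    [{"action": "remove", "path": "part-0000.parquet"},
     {"action": "add", "path": "part-0002.parquet"}],  # e.g. a compaction
]

def snapshot(commits, version=None):
    """Replay the log up to `version` to get the live set of data files."""
    upto = len(commits) if version is None else version + 1
    live = set()
    for commit in commits[:upto]:
        for action in commit:
            if action["action"] == "add":
                live.add(action["path"])
            elif action["action"] == "remove":
                live.discard(action["path"])
    return sorted(live)

print(snapshot(commits))             # latest state: parts 0001 and 0002
print(snapshot(commits, version=1))  # "time travel": parts 0000 and 0001
```

The same basic information lives in Iceberg too — it just arrives via a metadata file pointing at snapshot manifests rather than by replaying an ordered log, which is exactly why the two designs keep converging.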
Over time, Delta even adopted field IDs from Iceberg to make schema evolution easier. Meanwhile, Iceberg is working towards change-based APIs, similar to Delta’s low-latency commits.
So, both formats are getting closer and closer in terms of capabilities and functionality.
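Why do field IDs matter for schema evolution? Here’s a hedged, stdlib-only sketch of the idea Delta borrowed from Iceberg: every column gets a permanent numeric ID, data is stored keyed by ID, and only the current schema maps names to IDs. Rename a column and old data files stay readable, because name resolution goes through the unchanged ID. (The dict-based layout here is invented for illustration; real tables store the IDs in Parquet file metadata.)

```python
# Sketch: why stable field IDs make column renames safe.
schema = {"user_id": 1, "email": 2}          # column name -> field ID
data_file = {1: "u-42", 2: "a@example.com"}  # field ID -> stored value

def read_column(data_file, schema, name):
    """Resolve a column name to its field ID, then look up the value."""
    return data_file.get(schema[name])

# Rename the column: only the name->ID mapping changes.
schema = {"user_id": 1, "contact_email": 2}

# The old data file is still readable under the new name.
print(read_column(data_file, schema, "contact_email"))  # a@example.com
```

By contrast, a name-based reader would look for a `contact_email` column in the old file, find nothing, and silently return nulls — which is exactly the failure mode field IDs were designed to kill.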
What’s Next? A Unified Vision
The dream is to get to a point where you can store your data in either format and use it interchangeably.
This means making sure that any data file written in Delta is perfectly valid in Iceberg and vice versa. Imagine being able to switch between the two without worrying about compatibility issues.
This would be a game-changer for anyone dealing with big data!
The first step towards this unified vision is to eliminate inconsistencies at the data layer. They’re also working on shared features like variant support. The idea is to push these features upstream into Parquet, so there won’t be any difference whether it’s a Delta variant or an Iceberg variant.
In short, they’re working together to make the user experience as smooth as a perfectly optimized SQL query.
What Should You Do Right Now?
While this unified future is super exciting, it’s not 100% here yet. So, if you’re wondering where to store your data today, the answer is simple: use what works best for your existing pipelines and tools.
Unity Catalog, for instance, is becoming a crucial player in this game. It allows you to access your data through a unified interface, regardless of whether you’re using Delta or Iceberg.
For now, you can keep writing your data in Delta, continue using your Databricks pipelines, and leverage Iceberg metadata as an interchange format. Unity Catalog will help smooth out the wrinkles, making it easier for you to focus on the real job — getting insights from your data.
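One concrete way to do this today is Delta UniForm, which tells a Delta table to also generate Iceberg metadata so Iceberg clients can read it. As a hedged sketch (table name and columns are made up; the property names match Delta Lake 3.x at the time of writing, so check the current docs before copying):

```sql
-- Create a Delta table that also exposes Iceberg metadata (Delta UniForm).
CREATE TABLE sales (id BIGINT, amount DOUBLE)
USING DELTA
TBLPROPERTIES (
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```

The data files are written once, in Parquet; only the metadata is doubled up, which is cheap — and it’s precisely the “interchange format” role for Iceberg metadata described above.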
Conclusion
Choosing between Delta and Iceberg should no longer be a headache. With both communities working together, we’re heading towards a future where your data is easily accessible, no matter what format it’s in. So let’s leave the Blu-ray vs. HD-DVD debates in the past and start focusing on what truly matters — harnessing the power of our data.
Stay tuned, because the best is yet to come.