Don’t Worry About Data Lakehouse Features, Trust in Google Search…

Kyle Weller
8 min read · Feb 2, 2024


This week, a new data lakehouse comparison landed with a heavy focus on dissecting vendor support for Apache Iceberg vs. Delta Lake vs. Apache Hudi. I was eager to read it and learn some fun new details. What I found instead surprised me: not only basic technical gaps, but also incorrect math and a newly proposed metric that is misleading and easy to debunk. I added some details to the LinkedIn discussion and had previously sent corrections to the author by email, but the author did not seem interested in correcting the inaccuracies in the blog, and this new metric requires a fuller explanation to debunk.

Aside from the well-documented gaps in the vendor support matrix, I want to focus this discussion more constructively on the NEW metric proposed, which forms the foundation of the blog's conclusion that Apache Iceberg is “winning” and its recommendation to ignore the features of Delta Lake and Apache Hudi and just choose Iceberg. This new metric, as I understand it, defines “Gorilla Bets” by counting Google search results for Apache Iceberg, Delta Lake, and Apache Hudi on the documentation pages of vendors across AWS, GCP, Azure, Databricks, and Snowflake: https://tableformats.sundeck.io/

Innovative, except it was a Google Search fail…

On the surface this looks like it might be an interesting way to understand whether there is an implied vendor preference for a certain data lakehouse format. What caused me to think twice, though, was the claim that there might be 78 thousand pages of documentation referencing a single project. I have not seen documentation volume like that outside of perhaps operating system manuals.

Double-clicking into this unfortunately exposed that the calculation in the blog is both wrong and misleading. In the spirit of truth seeking, please help me see if I have misunderstood the blog's approach. To reproduce results similar to the table, it looks like the author used the search URL constraint mentioned in the table and searched the single words `iceberg`, `hudi`, and `delta`, without quotes (I can't reproduce anything close with other combinations, with `apache` or otherwise). I assume the blog then uses the “About X results (0.X seconds)” figure at the top of the results page as the count of documentation pages containing the keyword, and divides the leading number by the second-largest number to generate the “preference ratio”. I recognize Google search results vary daily if not by the minute, so I don't expect a 100% reproduction, but here is a table for keyword searches with the URL constraint of `docs.aws.amazon.com` as of 2/1:
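To make the calculation concrete, here is a minimal Python sketch of the metric as I understand it. This is my reconstruction, not the blog's actual tooling, and the counts in the dictionary are illustrative stand-ins: only the 78,000 comes from the blog under discussion; the other two numbers are invented.

```python
# Hypothetical "About X results" estimates read off the SERP header for one
# vendor's docs domain. Only the 78,000 appears in the blog under discussion;
# the other two numbers are invented placeholders for illustration.
estimated_counts = {"iceberg": 78_000, "delta": 50_000, "hudi": 7_000}

def preference_ratio(counts: dict[str, int]) -> tuple[str, float]:
    """Return the leading format and its count divided by the runner-up's."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    (leader, top), (_, second) = ranked[0], ranked[1]
    return leader, top / second

leader, ratio = preference_ratio(estimated_counts)
print(f"{leader}: {ratio:.2f}x")  # -> iceberg: 1.56x
```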

Interesting, right? Let's inspect it a little further to understand why the difference is so large… First, for readers who might be new to Google search: what do quotation marks do to a query?

Hmm… They ensure the results you get are accurate? (and maybe socially responsible? 🙂)

Here are some examples of why searching the AWS docs for `iceberg` without quotes surfaces irrelevant documents.

The results are full of documentation for Amazon S3 Glacier because, in terms of “relevance,” icebergs are similar to glaciers in the natural world. Do you see any mention of Apache Iceberg in these pages? Or, more importantly, any documented vendor preference for Iceberg? Outside our tech bubble, an iceberg is a real object on Earth and a popular term in metaphors like “the tip of the iceberg”. Hudi has no such collisions. And Google occasionally confuses Delta with the airline (try searching for Delta UniForm just for kicks 🙂).
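For the code-inclined, the fix is one pair of characters. A `site:` operator is how I would scope a query to a docs domain (the blog's tool may constrain the URL differently), and wrapping the term in quotes tells Google to match it exactly rather than expand to “relevant” cousins like glaciers:

```python
from urllib.parse import quote_plus

DOCS_DOMAIN = "docs.aws.amazon.com"  # swap in any vendor's docs domain

def serp_url(term: str, exact: bool) -> str:
    """Build a Google search URL scoped to one docs domain."""
    q = f'site:{DOCS_DOMAIN} "{term}"' if exact else f"site:{DOCS_DOMAIN} {term}"
    return "https://www.google.com/search?q=" + quote_plus(q)

print(serp_url("iceberg", exact=False))  # loose match: S3 Glacier galore
print(serp_url("iceberg", exact=True))   # exact match: the word must appear
```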

As a matter of principle, it seems irresponsible to use such fickle data to choose the technical foundation of a data platform. More importantly, the Gorilla Bets table is flat-out presumption, and I think recognizing this significantly alters the subjective narrative it is trying to drive readers to believe.

What Does Google Mean by X Results Found?

I think the evidence presented above is more than sufficient to discredit the premise of the article, but there is still another fail here: counting what Google reports as “About X results” as the number of documents, one to one. When Google says “About X results (0.X seconds),” what does this actually mean?

Documented in Google's help forums is this interesting description (in short, the result count is a rough estimate, not an exact tally):

Here is a relevant discussion started by a Google employee, in classic Hacker News style: https://news.ycombinator.com/item?id=32784418

Uh oh… so you mean there might not be anywhere close to 78,000 pages of documentation for Apache Iceberg in the AWS docs? Let's add some common sense to this: 78 thousand? For fun, let's exhaustively paginate and see how many doc pages there actually are for these search terms on AWS (again, this changes every day if not every hour; this is as of writing on 2/1/24):

Even “Actual Results from pagination” is still suspect to me. The two sources I linked above also call out that Google caps a query at roughly 400 returned results if you paginate to the maximum. While these ~300 numbers seem to be under that limit, I would not put my credibility on the line to claim there are exactly 315 docs containing the “iceberg” keyword; there may be more.
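If you want to reproduce the pagination exercise, the method is simply walking Google's `start` offset until the results run dry. Below is a fragile, illustrative sketch: the `div.g` result selector is an assumption about Google's ever-changing markup, and real runs will hit consent pages and CAPTCHAs quickly, so treat it as documentation of the approach rather than a reliable tool.

```python
import time

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

HEADERS = {"User-Agent": "Mozilla/5.0"}  # Google blocks the default UA

def count_results(query: str, page_size: int = 10, max_pages: int = 50) -> int:
    """Paginate a Google SERP via the `start` offset and count result blocks."""
    total = 0
    for page in range(max_pages):
        resp = requests.get(
            "https://www.google.com/search",
            params={"q": query, "start": page * page_size},
            headers=HEADERS,
            timeout=10,
        )
        # "div.g" is Google's long-standing (but unstable) result container.
        hits = BeautifulSoup(resp.text, "html.parser").select("div.g")
        if not hits:
            break  # no more pages
        total += len(hits)
        time.sleep(2)  # be polite; Google rate-limits aggressively
    return total

print(count_results('site:docs.aws.amazon.com "iceberg"'))
```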

Fails aside, what about the metric?

Across these exercises, it is clear AWS DOES NOT have a preference for, nor a “Gorilla Bet” on, Apache Iceberg. By the metric's own definition, once quotation marks are in place, AWS appears to support all three table formats evenly. As for the author's comment in the article about an “off-the-record” conversation with someone from AWS on this topic, I strongly urge anyone from AWS to step up and provide an “ON-the-record” statement of where their “Gorilla Bets” are placed, because I hear very different things from inside AWS. This kind of disingenuous gossip is a disease in our industry, leveraged in pursuit of selfish gain, and in this case it puts down a fast-growing OSS community that has given so much to the industry at large. Let's use real data, not opinions.

If you still believe that counting documents referencing the projects is an interesting way to see whether vendors prefer one project over another, and if you can still stomach Google's wildly inaccurate estimate at the top of the SERP, here is what the Gorilla Bets table looks like when simply using quotation marks around the search term. The results show Databricks and Microsoft with more results for Delta, Snowflake with more for Iceberg, Google Cloud with a small 1.3x multiple toward Iceberg, and AWS almost exactly flat at 1.02x toward Delta.
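To see just how flat the corrected AWS row is, here is the same ratio arithmetic run over counts that I invented to mirror the shape of the quoted-search results; they are not the actual SERP numbers.

```python
# Illustrative (NOT actual) quoted-search counts, shaped like the corrected
# AWS row: the leader barely edges out the runner-up.
aws_quoted = {"delta": 326, "iceberg": 320, "hudi": 298}

ranked = sorted(aws_quoted.items(), key=lambda kv: kv[1], reverse=True)
leader, runner_up = ranked[0], ranked[1]
print(f"{leader[0]}: {leader[1] / runner_up[1]:.2f}x")  # -> delta: 1.02x
```

A 1.02x “preference” is noise, not a bet.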

The Wrap

As we get close to wrapping up, I know I am an easy target to point fingers at and say, “eww, look, he's a biased vendor.” So let me be the first one to point. Hi, my name is Kyle Weller and I lead Product at Onehouse.ai. Am I biased? ABSOLUTELY. I worked in the Delta Lake community for years, ever since it was created. I was a product lead for Azure Databricks, and I helped hundreds of organizations adopt Delta and modernize to a lakehouse architecture. I am personally responsible for Microsoft's deep bias toward, and adoption of, Delta Lake, as I was the internal champion building support for it across the Azure portfolio. Fast forward to today: our startup was founded by the creator of Apache Hudi, who originated this exciting new architecture to begin with. We are building a product that writes data pipelines in all three formats: Hudi, Delta, and Iceberg. So you could say I am a little obsessed with this domain, as I work across the communities. Having seen the evidence first-hand through so many companies' evaluations, I definitely have a bias and preference toward Hudi. Flawed measurement methodologies aside, if you are interested in my personal opinions and experience: yes, Databricks and MSFT have an obvious bias toward Delta. I believe Google has a slight preference for Iceberg, but Delta and Hudi are still supported just fine. AWS, again in my personal observation, is excellent in its neutrality and good support for all three table formats.

Having acknowledged my bias, my goal here is not to somehow craftily convince you of one table format or another (I have other blogs for that). I am simply disappointed by the technical inaccuracy and damaging misinformation shared in the blog in question. I don't care what format you prefer; I hope there is a foundation of truth seeking and a recognition that the information presented and the conclusions drawn stand in need of correction. I'm happy to also share detailed descriptions of the gaps in the vendor support matrix, along with the math mistakes in the other charts. Given the author's stellar career as co-founder and CTO of Dremio, which was (and is) dependent on Iceberg's success, I think it is important that he, too, transparently share his bias and what he stands to gain from the blog's narrative.
