Why Did Databricks Open-Source Unity Catalog?

StarRocks Engineering
6 min read · Aug 26, 2024

Did you know Databricks open-sourced Unity Catalog? You’d be forgiven if you didn’t. After all, the same week it was announced at the Databricks Data + AI Summit, the news cycle was dominated by the company’s acquisition of Tabular. Although Tabular stole the spotlight, the two decisions are related. Together, they hint at a more substantial shift in direction by Databricks, one that will shake up the open source space and the commercial landscape for data and AI.

In the following sections, we’ll explain why Databricks open-sourced Unity Catalog, look at what this means for the way you develop your data architecture, and share some insider information on the moves Databricks is making in light of the announcement.

Let’s Talk Tabular

Before we delve deeper into what’s going on with Unity Catalog, we need to discuss Tabular’s acquisition. What does this have to do with Unity Catalog? Three major things matter here:

  1. The bidding war for Tabular
  2. The impact on Apache Iceberg
  3. The future of open data lakehouses

The bidding war for Tabular

There’s a lot to be said about the entire acquisition process for Tabular, but the part to focus on is who the players were. It wasn’t just Databricks bidding on Tabular. In fact, with Tabular ultimately selling for well over $1B, it’s clear that there was intense, well-funded interest in the company. Most notably, this competition included Snowflake, another analytics leader. Combine this with the already strong technical investments in Iceberg by Databricks and Snowflake, plus their Unity Catalog and Polaris summit announcements, and we’re seeing some serious movement in the lakehouse and open source space. This will continue to have a ripple effect. As Databricks and Snowflake go, so goes the rest of the analytics market.

What’s the future of Apache Iceberg?

Speculation abounds as to the impact of the Tabular acquisition on the future of Iceberg. True, Tabular is commercial and Iceberg is open source, and Tabular isn’t even the top contributor to Iceberg. Still, denying any impact on Iceberg from Tabular’s acquisition seems misguided. Technical contributions are one thing (a big thing), but successful projects also require a strong and well-supported community to encourage continued investment and focus on developing the right features. It’s in that community support and involvement that Tabular was a clear leader.

Apache Iceberg contributor bubble chart — Source: https://medium.com/@kywe665/delta-hudi-iceberg-which-is-most-popular-29ca56767199

Now, under ownership by Databricks, the future is less certain. While there is currently no reason to doubt that Databricks will be as good a steward of Tabular’s pre-acquisition efforts as possible, they purchased Tabular as a business decision, not an altruistic endeavor. Once Tabular’s Iceberg activities are no longer necessary to that business, it’s hard to say what Databricks will do. One thing is certain, though: where Tabular and Apache Iceberg were tied inextricably to each other, Databricks and Iceberg are not.

Great news for open computing

This isn’t all doom, gloom, and uncertainty, though. Databricks owning Tabular does muddle Tabular’s focus on Iceberg, but it also means Databricks has an even greater interest in supporting Iceberg and the open lakehouse ecosystem the project is helping to empower. If anything, it now means the Tabular team will have even more resources to devote to strengthening Iceberg going forward. This will only further increase the viability of the open lakehouse concept.

Individually, these are all important points, but taken together they show growing competition between two behemoths of the analytics market, both looking for substantial growth opportunities and both seeing them in the lakehouse space. What adds more spice is how intertwined open source is throughout the story. Nowhere is this more obvious than in the recent announcements about open-sourcing Unity Catalog and Polaris.

Matei Zaharia, Databricks CTO, announcing the open source version of Unity Catalog at Databricks Data + AI Summit 2024

Competing Announcements: Unity Catalog vs. Polaris

Here’s an interesting fact: Snowflake Summit and Databricks Data + AI Summit often overlap, and this year Databricks made the proactive decision to move their conference to avoid having to share media and market attention. This makes sense. These conferences are designed as huge announcement weeks, focused on maximum exposure of the news that matters most to these companies.

Both companies made huge announcements signaling the future of the open data lake: Unity Catalog going open source, and Snowflake’s Polaris doing the same. Tie that back to the bidding war over Tabular, which wrapped up within the same two weeks as both conferences, and you start to see some alignment in the strategic visions of both businesses.

It’s not just that the lakehouse concept continues to gain adoption; a more open approach is where the interest appears to be. So much so that two of the biggest players in data analytics are putting their money where their mouths are and seriously investing in, or donating projects to, open source.

A New Era for Open Lakehouses

Where Databricks and Snowflake go next remains to be seen, but the immediate impact on the legitimacy and resources being injected into the open lakehouse concept can’t be ignored. This is a tremendous boon for the lakehouse community. Databricks and Snowflake have an outsized impact on the investments of nearly every other company in the analytics space. Where they go, others will follow, and open lakehouse users will get to reap the rewards. More tools, more choices, and more support will only make the open lakehouse concept more accessible and a hotbed of new projects. Expect to see more positive changes in this space in the coming weeks, months, and years.

Understanding Why Databricks Open-Sourced Unity Catalog

Bringing the points outlined above together makes it easy to infer why Databricks decided to open source and donate Unity Catalog:

  • Signaling the right time for lakehouse investments: Databricks and Snowflake are effectively signaling that the time is ripe for the lakehouse. The increased investment from these giants highlights the value of lakehouse architectures in empowering users with choices that are free from vendor lock-in. Open file formats and table formats have already become standard, and data catalogs were the last frontier where users could still be restricted. By open-sourcing Unity Catalog, Databricks is making a strong commitment to eliminating this challenge.
  • Maturity of the lakehouse space: The move to open source Unity Catalog also indicates that the lakehouse space has reached a level of maturity that justifies bigger investments. This maturity is not just in terms of technology, but also in the ecosystem of developers, tools, and users who can now contribute to and benefit from open-source innovations.
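
To make the "last frontier" point above concrete, here is a minimal sketch of the layering it describes: an open file format, an open table format, and a catalog that maps table names to both. Everything in it is hypothetical and for illustration only. The `OpenCatalog` and `TableEntry` names, the bucket path, and the engine names are assumptions, not the Unity Catalog API. The only detail borrowed from Unity Catalog is the three-part `catalog.schema.table` naming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    """What a catalog records about one table (hypothetical, simplified)."""
    location: str      # where the data files live in object storage
    table_format: str  # open table format, e.g. "iceberg" or "delta"
    file_format: str   # open file format, e.g. "parquet"


class OpenCatalog:
    """Toy in-memory catalog keyed by a three-part name: catalog.schema.table."""

    def __init__(self):
        self._tables = {}

    def register(self, full_name, entry):
        if len(full_name.split(".")) != 3:
            raise ValueError("expected catalog.schema.table")
        self._tables[full_name] = entry

    def resolve(self, full_name):
        # Every caller gets the same entry, whichever engine is asking.
        return self._tables[full_name]


catalog = OpenCatalog()
catalog.register(
    "main.sales.orders",
    TableEntry("s3://example-bucket/sales/orders", "iceberg", "parquet"),
)

# Two different (hypothetical) engines resolve the same table the same way.
for engine in ("engine_a", "engine_b"):
    entry = catalog.resolve("main.sales.orders")
    print(f"{engine} reads {entry.table_format}/{entry.file_format} at {entry.location}")
```

The point of the sketch is the lookup step: if the catalog is closed, only one vendor's engine can answer "where is this table and how do I read it?", even when the files and table format underneath are open.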

What Comes Next for Unity Catalog

The open-sourcing of Unity Catalog marks a significant milestone, but it also raises the question: what comes next? The answer lies in the expanding ecosystem of solutions that are already on board with Unity Catalog’s open-source vision. This early adoption by key industry players indicates a promising future for the open lakehouse architecture.

Unity Catalog architecture options

Several notable companies have expressed their support for Unity Catalog OSS, including AWS, Nvidia, Confluent, LanceDB, StarRocks and many more. These organizations recognize the value of an open catalog system and are poised to integrate and innovate on top of this foundation.

What To Do With This Information?

This is a clarion call to all engineers sitting on the sidelines, and those just dipping their toes in the lakehouse, to get serious about this powerful approach to their data architecture. There has never been a better time to go open.

How to get started? Begin your search with lakehouse query engines. From Apache Iceberg to Unity Catalog, the vast majority of your performance comes down to selecting the right engine. For that, you’ll want to check out StarRocks and join StarRocks’ Slack to get all the community insights you’ll need to navigate this new era for open lakehouses.
