https://github.com/unitycatalog/unitycatalog

Open Source Unity Catalog and why it matters

Advait Godbole
databricks-unity-catalog-sme
3 min read · Sep 7, 2024

Introduction

Databricks open sourced Unity Catalog (UC) in June 2024 at this year’s Data & AI Summit, under the auspices of the LF AI & Data Foundation. Since then the project has gathered 2.2k stars on GitHub, been forked 322 times, and attracted 57 contributors along with a long list of collaborators. You can read all about it here and learn more about the technical details in the docs.

A bit of history

Why is this a significant development for the data and AI ecosystem? To answer that question, let’s take a journey through the history of data governance.

The pre-cloud and Hadoop era

Data governance concepts are not new. In fact, the idea of Role-Based Access Control (RBAC) has been around since the advent of multi-user computing in the 1970s. In the 1980s and 1990s, as BI tools gained popularity, the need to control access to data and to discover data assets grew more prominent. Since BI tools were designed to pull data from Enterprise Data Warehouse systems, these capabilities were built into the data warehousing systems themselves, and full support for RBAC was then included in the SQL:1999 standard. As the cloud computing revolution took place and data and analytics tooling became democratised, the need for data governance and access control became even clearer. However, although projects such as Apache Ranger and Apache Atlas were seeded soon after Hadoop came to the forefront, data governance and cataloguing were never unified into a single OSS product that could govern all types of data and AI assets.

The advent of Unity Catalog

Since those early days, other open source and commercial tools have been released in this space. Many of the commercial vendors are close Databricks partners, even though Unity Catalog itself has seen strong adoption since its inception. This is because UC was designed from the outset to be an open governance framework that integrates seamlessly with external tools. Databricks works closely with Enterprise Data Catalogue (EDC) and governance tool vendors such as Azure Purview, Alation, Atlan, Collibra, Immuta and Privacera to support tight integration and provide a seamless experience.

The next step — open sourcing Unity Catalog

UC’s open source release marks a seminal event in the data governance sphere, with as much impact as Apache Spark had on the data ecosystem, if not more. Databricks was founded to commercialise Spark and make it enterprise ready: the open source project came first and the commercial product followed. With UC, Databricks took the opposite route. Unity Catalog was released in 2023 as the first unified, open and enterprise-ready data governance framework on the market. After seeing strong adoption by Databricks customers over the course of 2023 and 2024, we acted on our goal of giving back to the community and providing our customers with a truly open governance framework.

Implications of UC OSS

  1. A full-featured data catalog, on the market from the first OSS release, that is seamlessly compatible with the Delta and Iceberg ecosystems.
  2. Enhanced interoperability between data and AI tools.
  3. A solid foundation on which ecosystem players can build a full-featured, enterprise-ready, open source governance framework, expanding the data and AI ecosystem.
  4. A comprehensive roadmap to bring a host of core, governance, security and quality-of-life features to life by v0.5 to v0.6 (see roadmap).
  5. Lower total cost of ownership for organisations that store data in open source formats, own their data and employ the best processing/serving engines for the job.
  6. Innovation spurred by community contributions.

Advait Godbole

I work as a Solutions Architect at Databricks and specialise in Unity Catalog and data governance. I also study health & nutrition in my spare time.