First major advancement in 16 years for open data ecosystems

Douglas Moore
DBSQL SME Engineering
4 min read · Jun 21, 2024
https://github.com/unitycatalog/unitycatalog

If you like history or have been around long enough in this industry, read on to understand the significance of this moment where the first viable alternative to the Hive Metastore is launched.

June 19th, 2008

Hive was open sourced with this modest apache.org JIRA, “Hive as a contrib project”, described as: “Introduced Hive Data Warehouse built on top of Hadoop that enables structuring Hadoop files as tables and partitions and allows users to query this data through a SQL-like language using a command line interface.”

Hive brought “Data Warehousing” to big data, across multiple file formats. Hive Metastore (HMS) provided Database, Table, Partition, and View abstractions over distributed file systems (first HDFS, then object stores). Hive wasn’t an amazing data warehouse, but it was remarkable because it was the first warehouse-like experience on massive “Big Data” open data formats, and because Hive itself was open source. Its data catalog, HMS, became the de-facto standard for cataloging technical metadata in big data projects. HMS was supported not only by Hadoop vendors such as Hortonworks, Cloudera, MapR, IBM, and Teradata, but also became the standard catalog interface for Databricks and the AWS Glue Catalog. The federated query engines Presto / Trino supported HMS, as did virtually every vendor and open source project in the big data ecosystem.

Hive Metastore (HMS) however had several important limitations:

  • Weak security model, with a “back door” allowing direct file access bypassing table-level permissions. This led to data access inconsistencies between file-based access controls and table grants.
  • No fine-grained access controls (FGAC), no lineage. Since all of the available query engines (Hive, Tez, Impala, Presto, Spark) ultimately relied on filesystem paths, there was no practical way to enforce row filters, column masks, or role- and attribute-based access controls. Specific vendors (Hortonworks, Cloudera) did implement FGAC, but their approaches weren’t integrated into HMS and were not widely adopted by the broader ecosystem.
  • No unified governance for files across different file systems (HDFS, S3, ADLS, GCS, etc), each with their own authentication and authorization. HMS did not provide any kind of security token vending thus clients were tied to the filesystem implementations.
  • No governance at all for analytical assets such as ML/AI models

Despite this “back door”, we find many customers have built processes (e.g. storage cost attribution) that rely upon it. This tech debt will take some time to unwind.
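To make the back-door problem concrete, here is a toy sketch (with entirely hypothetical names, not HMS code) of the inconsistency: the metastore checks table grants, but anyone who knows the underlying file path can read the same data without ever consulting those grants.

```python
# Toy model of the HMS "back door": grants live in the metastore,
# but the files behind a table remain readable via the filesystem.
TABLE_GRANTS = {"sales": {"alice"}}  # who may SELECT each table
TABLE_LOCATIONS = {"sales": "/warehouse/sales/part-0000.parquet"}


def read_via_metastore(user: str, table: str) -> str:
    """Governed path: the table grant is checked before files are touched."""
    if user not in TABLE_GRANTS.get(table, set()):
        raise PermissionError(f"{user} has no SELECT grant on {table}")
    return f"rows from {TABLE_LOCATIONS[table]}"


def read_via_filesystem(user: str, path: str) -> str:
    """Back door: direct file access never consults the metastore grants."""
    return f"rows from {path}"


# Bob has no table grant, yet the file path alone is enough to read the data:
# read_via_metastore("bob", "sales")   -> raises PermissionError
# read_via_filesystem("bob", "/warehouse/sales/part-0000.parquet") -> succeeds
```

This is why file-level ACLs and table grants drift apart: two enforcement points, only one of which the catalog controls.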

June 13th, 2024

Just 6 days shy of 16 years after Hive was open sourced, Matei Zaharia, like a cult hero or rock star, pressed the red button in GitHub’s “Danger Zone” to open source Unity Catalog in front of a live audience of thousands of people attending the DAIS 2024 conference in San Francisco, CA, with many thousands more watching online.

https://github.com/unitycatalog/unitycatalog/blob/main/uc.png

As Kyle Weller, Head of Product @ Onehouse.ai, stated in his post (with less hyperbole than this post 😊):

“To me it seems like the first credible attempt in an open source catalog that could be useful across a broader data ecosystem.”

Unity Catalog OSS key benefits

In digging into the details of the announcement and github repo we find:

  1. Unified governance across Data and AI assets: Unity Catalog enables centralized access control, auditing, and lineage tracking for tabular data, unstructured data, functions and AI assets like machine learning models and generative AI tools. This simplifies management at scale.
  2. Platform independent: Unity Catalog (UC) is not tightly coupled with any specific compute platform. UC has thus far been shown working with DuckDB, Apache Spark, and Trino (formerly Presto). The OSS API is operational against Databricks-managed Unity Catalog deployments. Demonstrations of DuckDB, Spark on AWS EMR, and other Unity Catalog OSS integrations appeared within the first 48 hours.
  3. Support for any AI asset format: Unity Catalog provides a universal interface that supports AI assets in any format, allowing organizations to manage their entire AI portfolio in a single catalog.
  4. Simplified AI asset management: By consolidating AI asset metadata and governance in a single catalog, Unity Catalog reduces the complexity of managing AI assets across multiple platforms and tools. This saves time and resources.
  5. Data Sharing: Unity Catalog interoperates with Delta Sharing, an open protocol for securely sharing data at scale, up to TBs at a time.
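The platform independence above comes from a plain REST API. As a hedged sketch, the snippet below lists catalogs from a locally running Unity Catalog OSS server; the default port (8080) and endpoint path (`/api/2.1/unity-catalog/catalogs`) are taken from the quickstart at the time of writing and may change, so treat them as assumptions and check the repo README.

```python
import json
import urllib.request

# Assumed default base URL for the quickstart server (bin/start-uc-server).
BASE_URL = "http://localhost:8080/api/2.1/unity-catalog"


def catalogs_endpoint(base_url: str = BASE_URL) -> str:
    """Build the URL for the list-catalogs endpoint."""
    return f"{base_url}/catalogs"


def parse_catalog_names(payload: str) -> list[str]:
    """Extract catalog names from a list-catalogs JSON response."""
    return [c["name"] for c in json.loads(payload).get("catalogs", [])]


def list_catalogs(base_url: str = BASE_URL) -> list[str]:
    """Call a running UC server and return the names of its catalogs."""
    with urllib.request.urlopen(catalogs_endpoint(base_url)) as resp:
        return parse_catalog_names(resp.read().decode())
```

Because the interface is just HTTP + JSON, any engine (DuckDB, Spark, Trino) or script can resolve the same catalog metadata without a vendor-specific client library.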

Next Steps

Unity Catalog OSS is at its very beginning but is building community fast, with 1500+ stars and 22 pull requests in its first week. The reference server implementation is in place, a Rust-based reference implementation has already been donated and merged, and the Databricks UC OSS server APIs are in private preview. A governance structure is being put into place: Unity Catalog is currently a sandbox project with the LF AI & Data Foundation (part of the Linux Foundation). I would hope the reference implementation adds grants/ACLs, auditing, security, tags, and much more in the coming weeks. There is tons of initial excitement; time will tell whether it sustains into the months and years through robust community support and commercial adoption.

In summary

Unity Catalog OSS streamlines Data and AI asset management by providing a unified, secure, and interoperable catalog that supports multiple data formats, functions, and any AI asset format. This improves security and governance while simplifying operations for Data and AI teams.
