From Medallion to Mesh: How Unity Catalog Transforms Data Strategy

Jason L. Miles
Neudesic Innovation
May 28, 2024

Databricks’ Unity Catalog represents a paradigm shift in many ways, but one of its most underappreciated elements is its views. These views represent a massive change to the way we think about the medallion architecture and the lakehouse. They are a critical part of the effort to reduce the number of copies of data that exist around an organization’s data estate. They can also form the building blocks of data mesh and data fabric architectures, and can even be leveraged to create data products, all of which becomes especially powerful as an organization’s data strategy and analytics maturity increase. With two conceptually simple features, Databricks has turned the view from a useful part of the data engineering toolbox into an indispensable tool for every architect and even for citizen developers, especially in the age of generative artificial intelligence (GenAI).

Medallion Architecture: Understanding the three layers

Before diving into Unity Catalog, it’s important to define our terms, especially those that are part of the medallion architecture. This approach organizes data processing into three layers: bronze, silver, and gold.

  • The bronze layer is where raw data is ingested from various sources. This is the first step in the data processing pipeline, where data is collected and stored in its raw form.
  • In the silver layer, the data is cleansed, transformed, and enriched. This is where the data is processed and prepared for analysis, with any errors or inconsistencies being corrected.
  • The gold layer contains business-level aggregates and metrics that are ready for decision-making processes. This is the final stage of the data processing pipeline, where the data is presented in a form that is easily understandable and actionable for business users.

This tiered approach helps in managing data lineage clearly and efficiently, making it easier for organizations to trace how data transforms across its lifecycle. The medallion architecture is by far the most common architecture for lakehouses, and we will be referencing it throughout this piece, but Unity Catalog can work with any of the many different reference architectures, notably Data Vault 2.0 or Lambda.
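To make the three layers concrete, here is a minimal sketch in plain Python rather than Spark. The records, field names, and cleansing rules are all hypothetical, and dropping an unparseable row is only one of several ways a silver layer might handle bad data.

```python
# Bronze: hypothetical raw records, stored exactly as they arrived.
bronze = [
    {"order_id": "1", "region": " emea ", "amount": "120.50"},
    {"order_id": "2", "region": "APAC", "amount": "not-a-number"},
    {"order_id": "3", "region": "emea", "amount": "79.50"},
]


def to_silver(raw_rows):
    """Cleanse and standardize raw rows; skip rows whose amount cannot be parsed."""
    clean = []
    for row in raw_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # assumption: unparseable rows are dropped, not repaired
        clean.append(
            {
                "order_id": int(row["order_id"]),
                "region": row["region"].strip().upper(),
                "amount": amount,
            }
        )
    return clean


def to_gold(silver_rows):
    """Aggregate cleansed rows into a business-level metric: total amount per region."""
    totals = {}
    for row in silver_rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


silver = to_silver(bronze)
gold = to_gold(silver)
```

In a real lakehouse each of these steps would typically be a notebook or job writing to a table, but the shape of the flow is the same.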

A visual depiction of the process of loading data to a lakehouse. Data is loaded from many sources to the bronze layer through ingestion notebooks. Additional notebooks transform the data and move it to the silver layer, and finally, a third set of notebooks transforms and aggregates the data while moving it to the gold layer.

Building Blocks for a new Lakehouse

By combining views with user-defined functions (UDFs) and Unity Catalog’s permission model, we start to build a new, more fluid approach. Users (or, as a best practice, members of groups) see exactly what their permissions allow them to see, yet only one object needs to exist. Even more powerfully, the view executes as the view’s owner, meaning that the user does not need permissions on either the underlying data or the functions that make the view work.

Using this principle, there are three building blocks that can be used to create highly dynamic tables for end users:

1. Row-Based Access Control: Users can be given access to only the rows they need to do their jobs, limiting their scope. This allows users to focus on what they need to do and prevents accidental overexposure of data.

2. Column-Level Access Control: Specific columns can be shown or hidden based on a user’s permissions. This allows fine-grained access to data while respecting both privacy and security.

3. Dynamic Data Masking: When most people think of masking, they think of the kind applied to credit card numbers. That is possible with Unity Catalog views, but there is a whole universe of masking available to a developer in Databricks now. This dynamic masking can occur at a row or column level and can be used to hide specific restricted information while still granting it to those with a need to know. In this way, aggregations can be maintained without the need for complex rollup tables.
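The three building blocks above can be combined in a single dynamic view. The sketch below only builds the SQL text in Python; it is an illustration under stated assumptions, not a definitive implementation. The table, view, column, and group names are hypothetical. `is_account_group_member` is a Databricks SQL built-in function that checks the querying user’s account-level group membership.

```python
def build_secure_orders_view(view_name: str, source_table: str) -> str:
    """Return a CREATE VIEW statement that applies all three building blocks.

    Column names and group names are hypothetical placeholders.
    """
    return f"""
CREATE OR REPLACE VIEW {view_name} AS
SELECT
  order_id,
  region,
  -- Column-level access control: the value is NULL unless the user is in 'finance'
  CASE WHEN is_account_group_member('finance') THEN amount END AS amount,
  -- Dynamic data masking: everyone outside 'pci_readers' sees only the last four digits
  CASE
    WHEN is_account_group_member('pci_readers') THEN card_number
    ELSE concat('****-****-****-', right(card_number, 4))
  END AS card_number
FROM {source_table}
-- Row-based access control: regional analysts see only their own region
WHERE is_account_group_member('global_analysts')
   OR (region = 'EMEA' AND is_account_group_member('emea_analysts'))
""".strip()


secure_view_sql = build_secure_orders_view(
    "main.gold.orders_secure", "main.silver.orders"
)

# In a Databricks notebook, where a `spark` session already exists,
# the statement could then be executed with:
# spark.sql(secure_view_sql)
```

Consumers would then be granted SELECT on the view only, not on the silver table behind it.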

All these capabilities come together to provide a new experience, integrated with data governance, that minimizes duplicate effort and duplicate data across a data estate. With this, end users and citizen data scientists can safely be given access to data, enabling them to apply their domain expertise and find more possibilities. That brings us to an evolving paradigm in the data warehousing world.

A visual depiction of the lakehouse architecture, as modified with Unity Catalog views. In this case, the layer of physical transformations from silver to gold is removed, with the gold tables being represented entirely by views.

How Views Change the Lakehouse

In a traditional lakehouse, data is copied into each layer as it is transformed. Every copy is an opportunity for inconsistent, stale, or even bad data to become part of our final analysis set. The copying also means that the entire data warehouse must be reloaded from scratch when changes are made, which makes the warehouse less nimble and prevents us from representing it as consistent, repeatable code rather than as changeable, expensive data. The view changes this by reducing the number of copies of the data from three (or more) to just two: the untransformed data in the bronze layer and the clean, enriched data in the silver layer. Gold simply becomes a view (or many views) on top of the silver layer.
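The difference between a copied gold table and a gold view can be shown with a small plain-Python analogy. This is a conceptual sketch with hypothetical data, not Databricks code: the "copy" is computed once and stored, while the "view" is a stored query that is evaluated each time it is read.

```python
# Hypothetical silver-layer rows.
silver_orders = [
    {"region": "EMEA", "amount": 120.0},
    {"region": "EMEA", "amount": 80.0},
    {"region": "APAC", "amount": 50.0},
]


def total_by_region(rows):
    """Business-level aggregate: total amount per region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


# Traditional approach: gold is a physical copy, materialized at load time.
gold_copy = total_by_region(silver_orders)


# View approach: gold is only a definition, evaluated against silver on read.
def gold_view():
    return total_by_region(silver_orders)


# New data lands in silver after the gold copy was loaded.
silver_orders.append({"region": "APAC", "amount": 25.0})

stale_total = gold_copy["APAC"]    # still reflects the old load
fresh_total = gold_view()["APAC"]  # reflects the current silver data
```

The copy keeps its old total until someone reloads it, while the view always reflects the current silver data. The trade-off is that the aggregation runs at query time rather than at load time.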

This change drastically reduces the amount of work needed to create a secure lakehouse. Under previous models, it would often be necessary to create entirely different tables for each persona and move data into them, creating significant delays and data duplication. This also introduced an opportunity for data to become out of sync, or for the wrong data to be copied into the wrong place, increasing the risk of a lakehouse compared to a traditional data warehouse.

Enabling Data Products and Data Mesh

This makes Unity Catalog views an ideal way to represent data products and to build toward a data mesh architecture. By creating a pattern that can represent both complex, IT-driven data products and local, departmentally produced data products, the paradigm builds the foundation for a true data mesh architecture.

A depiction of a lakehouse as a data product. The previous models are now wrapped as a data product and are presented to data consumers.

Non-IT organizations can build and provide their own data products from data that is integral to a specific use case, and customize them for many different personas. These personas do not have to be the same personas used by other data products: each data product can be a totally independent member of the organization’s data estate.

Beyond that, views (and these view-based data products) can build upon other views. This truly exemplifies the data mesh architecture, and the dynamic way all these data models interact can give a persona truly bespoke access to a data model. That access goes beyond the specific rows and records the persona needs to customizing how data is displayed within those records, so that data availability is maximized while data privacy and data security are not compromised.
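As a rough sketch of views building on views, the Python below assembles two SQL statements: a base data product that masks a sensitive column, and a departmental product that selects from the base view rather than from the silver table. All object, column, and group names are hypothetical. The example assumes the owner of the departmental view has been granted SELECT on the base view.

```python
# Hypothetical three-level names (catalog.schema.object).
SILVER_TABLE = "main.silver.orders"
BASE_VIEW = "main.sales_product.orders"
DEPT_VIEW = "main.emea_marketing.campaign_orders"


def layered_view_statements():
    """Return the CREATE VIEW statements in dependency order (base first)."""
    base = f"""
CREATE OR REPLACE VIEW {BASE_VIEW} AS
SELECT
  order_id,
  region,
  amount,
  -- Masking is defined once, in the base data product
  CASE WHEN is_account_group_member('sales_pii_readers') THEN customer_email
       ELSE 'REDACTED' END AS customer_email
FROM {SILVER_TABLE}
""".strip()

    dept = f"""
CREATE OR REPLACE VIEW {DEPT_VIEW} AS
-- The departmental product reads the base view, so it inherits its masking
SELECT order_id, amount, customer_email
FROM {BASE_VIEW}
WHERE region = 'EMEA'
""".strip()

    return [base, dept]


statements = layered_view_statements()

# In a Databricks notebook with an existing `spark` session:
# for stmt in statements:
#     spark.sql(stmt)
```

Because the departmental view never touches the silver table directly, the masking rule lives in one place and carries through to every product built on top of it.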

A visual combining the various data products into a data mesh, with multiple data products being combined together and presented to data consumers.

Unity Catalog represents a significant shift in the way we think about data management and the medallion architecture. By reducing the number of copies of data and enabling the use of views, Unity Catalog improves data governance and reduces the risk of inconsistencies and stale data. Furthermore, Unity Catalog enables the creation of data products and the implementation of a data mesh architecture, allowing for more dynamic and flexible data management. Overall, Unity Catalog has the potential to revolutionize the way organizations manage their data, improving efficiency, security, and flexibility.

Have questions? Tag me in the comments or reach out to me and my team at Neudesic.
