How should you structure your Databricks Unity Catalog?

Leigh Robertson
3 min read · Aug 8, 2024


Photo by Viktor Talashuk on Unsplash

What is Unity Catalog?

Unity Catalog is a unified governance solution for data and AI assets in the Databricks Lakehouse Platform. It provides centralized data discovery, access control, and auditing capabilities across multiple workspaces and cloud environments, enabling organizations to simplify data management, enhance security, and ensure compliance with data regulations.

In other words, it makes your day-to-day a whole lot easier when you need to implement secure governance controls across all your assets, both inside Databricks and outside it.

Catalog Structure

For current users or those using Hive Metastore, the following article shows you why you should switch and what problems Unity Catalog solves. If you have used Databricks for a few years, you will recognize the legacy Hive Metastore pattern. If you are just starting your Databricks journey now, you won’t have a choice, as Unity Catalog is now the default. This guide will help you decide how to structure your catalogs, and it should also help those migrating to Unity Catalog design their new structure.

General Guidelines

With Unity Catalog, you now have what’s called the three-level namespace: catalog.schema.table_name. If you come from a database background like myself, it will look familiar, akin to older SQL databases with database.schema.table_name. The question now becomes: how should we structure the three levels? The answer matters, because the choice you make will either simplify governance or make it harder.
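To make the three-level namespace concrete, here is a minimal Python sketch that splits a fully qualified name into its catalog, schema, and table parts. The `TableName` type, the `parse_table_name` helper, and the `orders` table are hypothetical names used only for illustration; they are not part of any Databricks API.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    """The three levels of a Unity Catalog table name."""
    catalog: str
    schema: str
    table: str

    def qualified(self) -> str:
        # Join the three levels back into catalog.schema.table_name.
        return ".".join(self)


def parse_table_name(full_name: str) -> TableName:
    """Split 'catalog.schema.table' into its three parts (hypothetical helper)."""
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table_name, got {full_name!r}")
    return TableName(*parts)


# 'orders' is an assumed table name, used only as an example.
name = parse_table_name("prod.auto.orders")
print(name.catalog, name.schema, name.table)
```

This simple split assumes none of the three parts contains a dot or needs backtick quoting.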

The two most common approaches I have seen are catalog = environment or catalog = business_line/product. For this exercise, I will introduce a fictional company I have used in past articles called wesellstuffonline.com, an online retailer selling whatever makes my analogies easiest to understand. For this example, let’s say wesellstuffonline has three product lines: auto, clothes, and electronics. Let’s mock up how this would look with the different approaches.

TLDR: Pick a catalog structure that will make governance in your company easiest.

Catalog as Environment

Let’s assume that the company has three environments (dev, stage, and prod) and has created one schema per product line in each. The structure would then be three catalogs, dev, stage, and prod, each containing the schemas auto, clothes, and electronics, so a table is addressed as, for example, prod.auto.table_name. In Unity Catalog, you can control access at both the catalog and the schema level. This design pattern is usually the most common, as it gives a clear logical separation between environments.
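As a rough sketch of the catalog-as-environment layout, the Python below generates the `CREATE CATALOG` and `CREATE SCHEMA` statements for the wesellstuffonline example. The function name `catalog_as_environment_ddl` is made up for this illustration, and the statements are only built as strings here, not executed.

```python
ENVIRONMENTS = ["dev", "stage", "prod"]
PRODUCT_LINES = ["auto", "clothes", "electronics"]


def catalog_as_environment_ddl(environments, product_lines):
    """Build DDL for one catalog per environment, one schema per product line."""
    statements = []
    for env in environments:
        statements.append(f"CREATE CATALOG IF NOT EXISTS {env}")
        for line in product_lines:
            # Each schema lives inside its environment catalog, e.g. dev.auto.
            statements.append(f"CREATE SCHEMA IF NOT EXISTS {env}.{line}")
    return statements


for statement in catalog_as_environment_ddl(ENVIRONMENTS, PRODUCT_LINES):
    print(statement)
```

In a Databricks notebook, each statement could presumably be passed to `spark.sql(...)`. Depending on how your metastore is configured, `CREATE CATALOG` may need extra clauses such as a storage location, so treat this as a starting point rather than a finished script.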

Catalog as Business Line/Product

This is a less common approach, but one that I have seen customers use. Here each product line (auto, clothes, and electronics) gets its own catalog. If you have extremely high isolation requirements among business lines, this structure might make sense, because you can isolate access at the catalog level.
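To illustrate isolating access at the catalog level, here is a small Python sketch that generates one catalog per product line and grants read access to a single group on each. The group names (`auto_team` and so on) and the function `catalog_isolation_statements` are assumptions made up for this example; the set of privileges you actually grant will depend on your own governance requirements.

```python
# Hypothetical mapping of product-line catalog -> group allowed to read it.
TEAM_BY_CATALOG = {
    "auto": "auto_team",
    "clothes": "clothes_team",
    "electronics": "electronics_team",
}


def catalog_isolation_statements(team_by_catalog):
    """Build SQL that creates each catalog and grants read access to one group."""
    statements = []
    for catalog, group in team_by_catalog.items():
        statements.append(f"CREATE CATALOG IF NOT EXISTS {catalog}")
        # Privileges granted on a catalog apply to the schemas and tables in it,
        # so one grant per catalog is enough to keep business lines apart.
        statements.append(
            f"GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG {catalog} TO `{group}`"
        )
    return statements


for statement in catalog_isolation_statements(TEAM_BY_CATALOG):
    print(statement)
```

Because no group is granted anything on another product line's catalog, members of `auto_team` in this sketch would not be able to read data in `clothes` or `electronics`.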

Conclusion

There are probably other structures that make sense, but these two are the most common! At the end of the day, pick the one that makes governance simpler and your catalogs easier to manage.
