Data Governance in Databricks Using Unity Catalog

Kaviprakash Selvaraj
5 min read · Mar 8, 2024


Organizations are migrating from the legacy Hive metastore to Unity Catalog for better data governance in Databricks. Since our organization rolled out Unity Catalog as a managed service this month, I leveraged it to build a mini project.

Unity Catalog is a unified solution in Databricks that helps implement data governance in the lakehouse.

Data governance is the process of managing the availability, usability, integrity and security of the data in enterprise systems, based on internal standards and policies that also control data usage. The key benefit of data governance is that it ensures that data is consistent and trustworthy and doesn’t get misused.

Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces.

Reference Architecture (Metastore setup):
  • Metastore: The top-level container for metadata. Each metastore exposes a three-level namespace (catalog.schema.table) that organizes your data.
  • Catalog: The first layer of the object hierarchy, used to organize your data assets.
  • Schema: Also known as databases, schemas are the second layer of the object hierarchy and contain tables and views.
  • Table: The lowest level in the object hierarchy, tables can be external or managed.
  • To access a table, use the three-level namespace:

SELECT * FROM <catalog>.<schema>.<table>

SELECT * FROM dev_catalog.silver_schema.managedtable_1
Reference Architecture (External Location setup):

A storage credential is a long-term cloud credential that provides access to cloud storage.

External locations contain a reference to a storage credential and a cloud storage path.

The Access Connector for Azure Databricks is a first-party Azure resource that lets you connect managed identities to an Azure Databricks account.

Pre-requisites:

  • Databricks workspace on the Premium tier
  • Azure Data Lake Storage Gen2 container
  • Unity Catalog metastore at the organization level

Implementation:

Creating the Unity Catalog metastore and assigning it to the Databricks workspace are not covered here, as these are admin-level activities at the organization level. The account admin sets this up and adds a metastore admin, who can then leverage it for the respective projects.

1. Create an access connector for the Databricks workspace as per the Microsoft documentation guidelines. Access connectors let you connect managed identities to an Azure Databricks account.

2. Create a storage credential — a storage credential represents an authentication and authorization mechanism for accessing data stored on your cloud tenant.
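On Azure, the storage credential wraps the access connector's managed identity. The Catalog Explorer UI is the more common route, but as a hedged sketch, the equivalent SQL looks roughly like this — the subscription, resource group and connector names in the resource ID are placeholders, and the credential name matches the one used later in this article:

```sql
-- Sketch only: the ACCESS_CONNECTOR_ID below is a placeholder resource ID.
CREATE STORAGE CREDENTIAL IF NOT EXISTS `uc-databricks-storagecrdentials`
WITH (
  AZURE_MANAGED_IDENTITY (
    ACCESS_CONNECTOR_ID = '/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Databricks/accessConnectors/<connector-name>'
  )
)
COMMENT 'Managed identity credential for the UC external data lake';
```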

3. Create an Azure Data Lake Storage Gen2 storage account with the following containers, which will store the managed and external data.

dev-catalog — Managed location for the dev catalog. This will be overridden by the managed locations set at the schema level, but specifying it is MANDATORY as part of our org policy.

bronze — Managed location for bronze external tables.

silver, gold — Managed location for silver and gold managed tables.

4. Grant the access connector the Storage Blob Data Contributor role on the above storage account.

5. Create an external location as below to use as the managed location for the dev catalog. This is similar to mounting blob containers in the legacy approach.
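The statement for this step follows the same pattern as the bronze/silver/gold external locations created later; the external location name here is assumed, and the container and credential names follow the article's convention:

```sql
-- External location backing the dev catalog's managed storage.
CREATE EXTERNAL LOCATION IF NOT EXISTS `external-location-dev-catalog`
URL 'abfss://dev-catalog@ucdatabricksexternaldl.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL `uc-databricks-storagecrdentials`);
```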

6. Create a new catalog — The first layer of the object hierarchy, used to organize your data assets.
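This can be done from Catalog Explorer or with SQL. The catalog name below matches the one used in step 8, and the managed location points at the dev-catalog container from step 3:

```sql
-- New catalog whose managed tables default to the dev-catalog container
-- (overridden by schema-level managed locations, per the org policy above).
CREATE CATALOG IF NOT EXISTS `dev-sales-catalog`
MANAGED LOCATION 'abfss://dev-catalog@ucdatabricksexternaldl.dfs.core.windows.net/';
```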

7. Create the other required external locations using SQL as below.

Create external locations for the bronze, silver and gold containers:

CREATE EXTERNAL LOCATION IF NOT EXISTS `external-location-bronze`
URL 'abfss://bronze@ucdatabricksexternaldl.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL `uc-databricks-storagecrdentials`);

CREATE EXTERNAL LOCATION IF NOT EXISTS `external-location-silver`
URL 'abfss://silver@ucdatabricksexternaldl.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL `uc-databricks-storagecrdentials`);

CREATE EXTERNAL LOCATION IF NOT EXISTS `external-location-gold`
URL 'abfss://gold@ucdatabricksexternaldl.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL `uc-databricks-storagecrdentials`);

8. Now we need to create the required schemas and tables.

USE CATALOG `dev-sales-catalog`;

CREATE SCHEMA IF NOT EXISTS bronze
MANAGED LOCATION 'abfss://bronze@ucdatabricksexternaldl.dfs.core.windows.net/';

CREATE SCHEMA IF NOT EXISTS silver
MANAGED LOCATION 'abfss://silver@ucdatabricksexternaldl.dfs.core.windows.net/';

CREATE SCHEMA IF NOT EXISTS gold
MANAGED LOCATION 'abfss://gold@ucdatabricksexternaldl.dfs.core.windows.net/';

9. So far, we have created a Unity Catalog for DEV, a storage credential that holds the access connector details, and external locations and schemas for the bronze, silver and gold layers.

10. I created the external and managed tables following the medallion architecture as below; the detailed steps are not covered here.
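As a minimal sketch of what those tables might look like — the table names and columns here are hypothetical, not the article's actual tables:

```sql
-- External table in the bronze schema, pinned to an explicit path
-- in the bronze container:
CREATE TABLE IF NOT EXISTS `dev-sales-catalog`.bronze.sales_raw (
  order_id INT,
  amount DOUBLE,
  order_date DATE
)
USING DELTA
LOCATION 'abfss://bronze@ucdatabricksexternaldl.dfs.core.windows.net/sales_raw';

-- Managed table in the silver schema; Unity Catalog places the data
-- under the schema's managed location automatically:
CREATE TABLE IF NOT EXISTS `dev-sales-catalog`.silver.sales_cleaned AS
SELECT order_id, amount, order_date
FROM `dev-sales-catalog`.bronze.sales_raw
WHERE amount IS NOT NULL;
```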

In the next section, we will look at some of the built-in Unity Catalog features.

Data Lineage: Details about upstream and downstream data sources, useful for tracing data quality issues back to their origin.
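Besides the lineage graph in Catalog Explorer, lineage can be queried from Unity Catalog's system tables (if the system schemas are enabled for your metastore). The target table below is the hypothetical one from the earlier sketch:

```sql
-- Upstream sources feeding a given table, most recent events first.
SELECT source_table_full_name, target_table_full_name, entity_type, event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'dev-sales-catalog.silver.sales_cleaned'
ORDER BY event_time DESC;
```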

Access control: Grant access to only the required users and groups for files in data lake, Delta tables, Notebooks and workspaces in Databricks.

a. The following settings provide a user, group or service principal with read privileges to the Delta tables, views and functions across the catalog.

b. The following settings provide a user, group or service principal with read/write privileges to the Delta tables, views and functions across the catalog.
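These settings can also be applied with SQL GRANT statements; privileges granted at the catalog level are inherited by all schemas and tables beneath it. The group names below are hypothetical:

```sql
-- a. Read-only access across the catalog:
GRANT USE CATALOG, USE SCHEMA, SELECT, EXECUTE
ON CATALOG `dev-sales-catalog` TO `data_readers`;

-- b. Read/write access across the catalog (MODIFY adds insert/update/delete):
GRANT USE CATALOG, USE SCHEMA, SELECT, MODIFY, EXECUTE
ON CATALOG `dev-sales-catalog` TO `data_engineers`;
```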

Auditing: Audit information about when and how often data is accessed by users. In Azure Databricks, audit logs can be delivered via diagnostic settings to destinations such as a storage account, Event Hubs or a Log Analytics workspace. We will see more detailed information in another article.

Data Discoverability: A searchable data catalog for identifying the correct dataset for a requirement among many thousands.

Conclusion: In this article, we have seen how to set up Unity Catalog in Databricks as a data governance solution, and the benefits it brings.

If you liked this article, please follow me on LinkedIn at https://www.linkedin.com/in/kaviprakash-selvaraj/
