Data Governance on Databricks 101

Inna Grijnevitch · Published in D ONE · Nov 26, 2023

As organisations strive to make better use of their data and to adopt the latest technologies such as GenAI, the role that data quality plays in the reliability of these technologies keeps growing. Accordingly, prominent figures in the Machine Learning sphere are shifting their focus from the model to the data it is trained on; for an example, see this summary of Andrew Ng's talk on “Data-Centric AI”.

Data Governance is a framework developed to support data-driven decision making by managing the availability, usability, integrity and security of data across an organisation; in simple words, it makes sure that the right data, with the right level of quality, are used as the basis for organisational operations. For a more detailed primer on the topic, see this CIO article.

While the concept is platform-agnostic, dealing with the organisational aspect of working with data, modern data platforms support its set-up and implementation with various built-in features. This article explores how the Databricks platform enables this, diving into three Data Governance aspects: data discovery and cataloguing, data quality, and data access management.

tl;dr In our opinion, the Databricks platform does a very good job of supporting Data Governance, though there is always room for improvement.

Product Profitability Analysis: running example

To showcase a possible implementation of data governance aspects, we built a working example on the platform. You can follow a link at the end of the article to partially reproduce it.

In the example we use simulated data for a basic profitability analysis performed on store transaction data. The flow components are depicted in image 1.
All the datasets are registered in Unity Catalog, and the whole process (processing pipeline and dashboard update) is orchestrated using a single job (visit this link to get started with Databricks jobs).

Image 1: Working example data flow.

Data Governance on Databricks

This section is split into three parts: (1) data discovery and cataloguing, (2) data quality, and (3) data access management. For each part we briefly discuss the available features and demonstrate them using the running example we’ve built.

1. Data discovery and cataloguing

To facilitate data democratisation it is important to support data discovery with appropriate metadata.
For this, Databricks rolled out Unity Catalog (see this link to get started). It supports the following features for each registered catalog/database:

  1. Attribute level comments.
    These can be set both in the code using a custom syntax (example below), and manually in the catalog interface (see image 2).
-- add a general comment for the table
COMMENT ON TABLE product_profit_analysis IS "Summary statistics for product profitability analysis based on all store transactions.";
-- comment on specific attributes
ALTER TABLE product_profit_analysis CHANGE ProductId ProductId STRING COMMENT "The product ID is not a guaranteed unique identifier.";
ALTER TABLE product_profit_analysis CHANGE TotalPurchases TotalPurchases BIGINT COMMENT "Total number of products sold.";
ALTER TABLE product_profit_analysis CHANGE SellPrice SellPrice DOUBLE COMMENT "Average sell price of the product [CHF].";
Image 2: Attribute documentation as it appears in the Unity Catalog. Clicking on the blue (+) bubbles allows editing.

2. Attribute level lineage for all datasets updated in the last 60 days (image 3).

Image 3: Attribute level lineage as part of the Unity Catalog. This specific example shows how the raw table attribute SellPrice (left) is propagating throughout the analysis tables (middle and right).
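Lineage can also be inspected programmatically. Below is a minimal sketch that queries the column lineage system tables; it assumes these system tables are enabled for the metastore, and the fully qualified table name is hypothetical (borrowing the catalog name from our running example). Exact table and column names may differ depending on your workspace version.

-- sketch: trace which upstream columns feed the analysis table
-- (assumes the lineage system tables are enabled; names are hypothetical)
SELECT source_table_full_name,
       source_column_name,
       target_column_name,
       event_time
FROM system.access.column_lineage
WHERE target_table_full_name = 'store_data_catalog.analysis.product_profit_analysis'
ORDER BY event_time DESC;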

  3. Native external-data management.
    This new feature supports a single source of truth for cataloguing.
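Assuming the external data lives in, say, a PostgreSQL database, registering it in Unity Catalog via Lakehouse Federation could look roughly like the sketch below; the connection name, host and database are hypothetical, and in practice credentials should come from a secret scope rather than plain text.
-- sketch: expose an external PostgreSQL database as a foreign catalog
-- (hypothetical connection details; use secrets instead of plain-text credentials)
CREATE CONNECTION IF NOT EXISTS store_postgres
  TYPE postgresql
  OPTIONS (host 'pg.example.com', port '5432', user 'reader', password 'redacted');

CREATE FOREIGN CATALOG IF NOT EXISTS store_external
  USING CONNECTION store_postgres
  OPTIONS (database 'store_db');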

4. Discovery of attributes via SQL queries run on Unity Catalog system tables on the platform.
It can also be performed from remote systems by using, e.g., Databricks Connect.
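For illustration, a minimal sketch of such a discovery query is shown below; it searches the column metadata of our catalog (named store_data_catalog in the running example) for price-related attributes and their documentation.
-- sketch: find price-related attributes and their comments in the catalog
SELECT table_schema,
       table_name,
       column_name,
       data_type,
       comment
FROM store_data_catalog.information_schema.columns
WHERE lower(column_name) LIKE '%price%';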

Drawbacks: data exploration is performed via SQL queries, which is cumbersome. The context-aware Large Language Model (LLM) assistant on top of Unity Catalog should be able to support such tasks as well, but in our current experimentation it did not reliably generate working queries; manual correction is still necessary. That said, used with awareness of its limitations, it still speeds up exploration compared with writing everything by hand. See an example in image 4.

Image 4: To simplify data discovery, we asked the assistant to provide a query corresponding to our needs, described in natural language. It first provided a SQL-like query that doesn’t work on the platform (left). Clicking on “Diagnose error” generated another message to the assistant, which in response attempted to fix the syntax (not shown). This, unfortunately, was still not good enough, since the catalog needed to be manually switched or specified, and the specified schema was incorrect. However, after manual fixing we quickly got the expected result (right). This is expected to improve after the rollout of Lakehouse IQ in 2024.

2. Data quality

One of the most important aspects of becoming data-driven is making sure the right data are available to the users. Since data are typically not static and evolve over time, ensuring their quality is a continuous process, requiring constant monitoring and improvement.

The Databricks platform supports this in two ways:

  1. Using Delta Live Tables (DLT) pipelines for data processing allows the definition of data quality rules, known as expectations. A violated expectation can be configured to stop the pipeline update, preventing the propagation of bad data; it is also possible to route the offending records into a separate output, isolating the bad data and supporting troubleshooting (a minimal sketch of such rules follows the figure below).
  2. The platform has native dashboard capabilities which, combined with an alerting mechanism, support the construction of a monitoring dashboard (for the latter, see image 5). Moreover, Lakehouse Monitoring is currently in public preview (for details, see here) and is expected to simplify the development effort. As of today, the data quality rules themselves need to be either defined manually or created using supporting suites such as dbt, and their results stored in designated output tables to serve as the basis for the monitoring dashboard.
Image 5: A basic data quality dashboard showing the latest quality status of the raw transactions dataset. The table at the top left gives general details regarding the checks run. The top right visualisation displays the aggregated pass vs. fail proportion of the various data quality criteria types, while the bottom provides details regarding all the executed checks.
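To illustrate the first point, below is a minimal sketch of Delta Live Tables expectations written in SQL, assuming a raw transactions table from our running example; the table, constraint and column names are hypothetical.
-- sketch: data quality expectations in a Delta Live Tables pipeline
-- (hypothetical table and column names from the running example)
CREATE OR REFRESH LIVE TABLE clean_transactions (
  -- drop rows that are missing a product reference
  CONSTRAINT product_id_set EXPECT (ProductId IS NOT NULL) ON VIOLATION DROP ROW,
  -- stop the whole update if a sell price is not positive
  CONSTRAINT positive_sell_price EXPECT (SellPrice > 0) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM LIVE.raw_transactions;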

Drawbacks: while the overall set-up is customisable and most of the manual work is limited to defining the data quality rules, it would be useful to have native health checks for common operations (primary key uniqueness, null proportion, specific values, time since last update, etc.), such as in Palantir Foundry, together with easy access to the results thereof (which Foundry does not provide as of yet).

3. Data access management

Providing users with access to data is the goal of data democratisation. Nonetheless, making sure data confidentiality is not compromised is a top priority.

Databricks provides custom SQL statements to manage access rights for resources registered under Unity Catalog; an example is available in the code snippet below. This simplifies the task of managing fine-grained access, as permissions can be expressed, for example, as (version-controllable) lists of grants applied by designated code.

-- managing permissions via custom syntax
-- 1. display all privileges granted on the table
SHOW GRANTS ON TABLE product_profit_analysis;
-- 2. provide user1 with permissions to modify the table
GRANT MODIFY ON TABLE product_profit_analysis TO `user1@email.com`;
-- 3. revoke modification permissions from user2
REVOKE MODIFY ON TABLE product_profit_analysis FROM `user2@email.com`;

Monitoring accesses is also enabled on two levels:

  1. Using the custom SQL, it is possible to extract all the different access rights defined on resources, and granted to specific users. See example in the code snippet below.
  2. To make sure that resources are accessed in an expected manner, it is possible to monitor access logs (their availability needs to be configured); a query sketch follows the snippet below.
-- extracting a list of all privileges under the catalog
SELECT *
FROM store_data_catalog.information_schema.table_privileges
WHERE 1=1
-- limit privileges to tables of interest
AND grantor <> 'System user'
AND table_schema <> 'information_schema';
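For the second point, once audit logs are made available as system tables, checking who accessed what could look roughly like the sketch below; the system.access.audit table and its columns are assumptions here and may differ depending on how logging is configured in your account.
-- sketch: recent Unity Catalog actions from the audit log
-- (assumes audit logs are delivered to the system.access.audit system table)
SELECT event_time,
       user_identity.email AS user_email,
       action_name
FROM system.access.audit
WHERE service_name = 'unityCatalog'
ORDER BY event_time DESC
LIMIT 100;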

Summary

In this article we discussed and demonstrated how the various aspects of Data Governance can be addressed on the Databricks platform.
In our opinion, it does a very good job of supporting these efforts, as:

  1. Unity Catalog facilitates data discovery and documentation by supporting management of external data sources, offering various options for attribute documentation, and enabling discovery via custom SQL syntax.
  2. Databricks enables the integration of data quality rules into Delta Live Tables pipelines, ensuring the quality of the data is monitored and maintained. These rules prevent the propagation of bad data while enabling troubleshooting. Additionally, the platform’s native dashboarding and alerting capabilities support the construction of a monitoring tool for any custom rules defined on top of tables.
  3. Access rights are easily managed by the platform via custom SQL, with the added option of configurable availability of access logs, thus enabling data security.

Have thoughts on Databricks capabilities or interested in hearing more about the platform?

Let us know!

Interested in reproducing the example in your Databricks account?

Check out the materials available here — see the README file for instructions.

Credits

This article was written with the support of Spyros Cavadias, a Databricks Champion.
A big thank you also goes to Robert Yousif and Joël Ineichen for their feedback.
