Understanding Data Access Patterns with Unity Catalog Lineage

Databricks SQL SME
DBSQL SME Engineering
9 min read · Jul 11, 2024

Author: Peter Davis, Sr. Solutions Architect @ Databricks

Introduction

Unity Catalog’s data lineage automatically accelerates your team’s understanding of your data. Data dependencies and consumption patterns are captured from cluster usage logs as your users work on the Data Intelligence Platform, giving you discoverability and context with no extra instrumentation.

This article shows how to leverage lineage data to audit data usage with Unity Catalog’s System Tables. We will begin with an overview of lineage tables and then show how to analyze data usage to understand who is viewing what data, where they are viewing it from, and when they are viewing it.

Overview

With Unity Catalog enabled, we get lineage as our users work. Upstream and downstream relationships are tracked as users prepare and curate data products. Metadata is automatically generated through lineage and audit trails, giving immediate context and meaning to your data.

This context accelerates the work for key personas:

  1. Data analysts and data scientists understand data provenance where they need it most, right on the data. This context reduces confusion about which table or metric is the most appropriate for a model or report, accelerating delivery time and increasing data confidence.
  2. Data governance roles (owners, stewards, security, privacy) can leverage table access and usage information. This access information provides valuable context to understand the relative usage frequency of data and allows us to trace and audit data access patterns.

Lineage is available in the Catalog UI, where the interface expresses upstream and downstream relationships as visual graphs. The image below shows the lineage for three views generated as downstream data products from system tables. Both table and column lineage are available:

Unity Catalog Lineage for tables and columns

We can also query lineage directly in system tables via notebooks or the SQL Editor. We can query table lineage to see upstream/downstream tables and analyze data usage: who accesses data, when, and how. The example below shows all accesses to the billing usage table over the last seven days:

SELECT
*
FROM
system.access.table_lineage
WHERE
source_table_full_name = 'system.billing.usage'
AND datediff(now(), event_date) < 7

If you want a quick view of a table’s activity over the last 30 days, you can access table insights directly in the UI. The UC Insights tab provides information on recent activity, users, queries, and more.

Code and Dashboards

You can import and use the dashboards as well as the tutorial notebook from the DBSQL SME Git Repo here:

  1. Tutorial Notebook
  2. Lineage Dashboards

Lineage tables overview

Lineage system tables in Databricks allow you to query lineage data at the table and column level via the tables system.access.table_lineage and system.access.column_lineage.

Both lineage system tables follow the same schema, with two additions for column-level lineage, shown below:

Table Lineage Schema

The documentation for lineage system tables includes detailed descriptions of each field and provides several examples of querying the data.
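As a quick way to see each field and its data type for yourself, you can describe the tables directly in the SQL Editor:

```sql
-- List the columns and data types of the two lineage system tables.
DESCRIBE TABLE system.access.table_lineage;
DESCRIBE TABLE system.access.column_lineage;
```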

The source and target fields are essential when reading lineage tables (for more details, including limitations on entity_type, please refer to the documentation). In summary, the source_type and target_type fields indicate whether an event was a data read, a write, or both:

  • Read-only: source_type is not null and target_type is null.
  • Write-only: target_type is not null and source_type is null.
  • Read and write: both source_type and target_type are not null.
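As a sketch, these rules can be expressed as a single CASE expression when exploring the lineage table (the access_mode alias is my own naming, not a field in the table):

```sql
-- Sketch: classify each lineage event as a read, write, or read/write
-- using the source_type / target_type rules above.
SELECT
event_time,
entity_type,
CASE
WHEN source_type IS NOT NULL AND target_type IS NULL THEN 'read'
WHEN source_type IS NULL AND target_type IS NOT NULL THEN 'write'
WHEN source_type IS NOT NULL AND target_type IS NOT NULL THEN 'read_write'
END AS access_mode
FROM
system.access.table_lineage
WHERE
datediff(now(), event_date) < 7
```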

For the rest of this article, we will explore several use cases highlighting data access auditing and data object popularity, demonstrate techniques for analyzing your lineage data, and generate a dashboard you can customize for your insights.

Data Analyses

We have four sets of code linked to this article: a notebook containing the lineage table data exploration below, and three dashboards that showcase some of the notebook’s outputs. All are immediately usable in any workspace where Unity Catalog and system tables are enabled.

Data exploration — Analysis setup

At the start of the notebook, we declare the variables used by the subsequent queries. Please change the values to match your environment.

DECLARE OR REPLACE VARIABLE catalog_val STRING;
DECLARE OR REPLACE VARIABLE target_catalog_val STRING;
DECLARE OR REPLACE VARIABLE schema_val STRING;
DECLARE OR REPLACE VARIABLE table_val STRING;
DECLARE OR REPLACE VARIABLE column_val STRING;
DECLARE OR REPLACE VARIABLE email_val STRING;
DECLARE OR REPLACE VARIABLE table_full_name_val STRING;


SET VARIABLE catalog_val = 'system';
SET VARIABLE target_catalog_val = 'pdavis';
SET VARIABLE schema_val = 'billing';
SET VARIABLE table_val = 'usage';
SET VARIABLE column_val = 'usage_quantity';
SET VARIABLE table_full_name_val = concat(catalog_val, '.', schema_val, '.', table_val);
SET VARIABLE email_val = 'demo@databricks.com';

Table accesses

Table accesses across users:

Which users accessed a table, and which Databricks interface did they use, over the last 7 days:

SELECT
mask(created_by) as created_by, -- mask() hides the email addresses for this article
entity_type,
source_type,
COUNT(distinct event_time) as access_count,
MIN(event_date) as first_access_date,
MAX(event_date) as last_access_date
FROM
system.access.table_lineage
WHERE
source_table_catalog = catalog_val
AND source_table_schema = schema_val
AND source_table_name = table_val
AND datediff(now(), event_date) < 7
AND entity_type IS NOT NULL
AND source_type IS NOT NULL
GROUP BY
ALL -- GROUP BY ALL groups by every non-aggregated column
ORDER BY
ALL

Here, we get a result of email addresses (created_by), interfaces (entity_type), first and last access dates, and a count of table accesses.

I limit the overall time span to a week to minimize resource usage during exploration, and I use a masking function to hide the email addresses in the field created_by for this article.

Table Lineage output

Accesses of tables in a specific catalog and schema:

SELECT
source_table_name,
entity_type,
created_by,
source_type,
COUNT(distinct event_time) as access_count,
MIN(event_date) as first_access_date,
MAX(event_date) as last_access_date
FROM
system.access.table_lineage
WHERE
source_table_catalog = catalog_val
AND source_table_schema = schema_val
AND datediff(now(), event_date) < 30
GROUP BY
ALL
ORDER BY
ALL

Here, we see all tables accessed within a particular catalog and schema over the last 30 days, who accessed them, and the first and last access dates.

UC Access History

Table accesses by specific users

Which tables did a specific user access in the system catalog over the last 90 days:

SELECT
source_table_catalog,
source_table_schema,
source_table_name,
entity_type,
source_type,
mask(created_by) as created_by,
COUNT(distinct event_time) as access_count,
MIN(event_date) as first_access_date,
MAX(event_date) as last_access_date
FROM
system.access.table_lineage
WHERE
created_by = email_val
AND datediff(now(), event_date) < 90
AND entity_type IS NOT NULL
AND source_table_catalog = 'system'
GROUP BY
ALL
ORDER BY
ALL
UC Access History by User

Object Popularity

The most popular system table objects

As measured by a simple count of accesses over the last 7 days.

SELECT
source_table_full_name,
count(*) as lineage_total
FROM
system.access.table_lineage
WHERE
datediff(now(), event_date) < 7
AND source_table_catalog = 'system'
GROUP BY
ALL
ORDER by
lineage_total DESC
LIMIT
5
Output — table popularity

Object Lineage

For a single object, what are the immediate upstream and downstream objects?

WITH downstream AS (
SELECT DISTINCT
target_table_catalog AS table_catalog,
target_table_schema AS table_schema,
target_table_name AS table_name,
'downstream' AS direction,
CASE WHEN tbl.table_catalog IS NULL THEN 'no' ELSE 'yes' END AS current
FROM
system.access.table_lineage tl
LEFT JOIN system.information_schema.tables tbl
ON tl.target_table_full_name = concat(tbl.table_catalog, '.', tbl.table_schema, '.', tbl.table_name)
WHERE
source_table_full_name = table_full_name_val
AND target_table_full_name IS NOT NULL
),
upstream AS (
SELECT DISTINCT
source_table_catalog AS table_catalog,
source_table_schema AS table_schema,
source_table_name AS table_name,
'upstream' AS direction,
CASE WHEN tbl.table_catalog IS NULL THEN 'no' ELSE 'yes' END AS current
FROM
system.access.table_lineage tl
LEFT JOIN system.information_schema.tables tbl
ON tl.source_table_full_name = concat(tbl.table_catalog, '.', tbl.table_schema, '.', tbl.table_name)
WHERE
target_table_full_name = table_full_name_val
AND source_table_full_name IS NOT NULL
)
SELECT * FROM upstream
UNION ALL
SELECT * FROM downstream
-- Order once on the combined result; an ORDER BY inside a CTE
-- has no guaranteed effect after UNION ALL.
ORDER BY direction, current DESC, table_catalog, table_schema, table_name
Output of Object lineage

Column Level Analyses

Column references:

The most referenced columns and their tables in a catalog over the last 90 days

SELECT
source_column_name,
source_table_full_name,
COUNT(*) AS frequency
FROM
system.access.column_lineage
WHERE
1 = 1
AND source_type <> 'PATH'
AND datediff(now(), event_date) < 90
AND source_table_catalog = catalog_val
GROUP BY
source_column_name,
source_table_full_name
ORDER BY
frequency DESC
LIMIT
10
Column References

What objects depend on a specific column?

The downstream columns and tables that read from a single source column over the last 90 days.

SELECT
target_column_name,
target_table_full_name,
COUNT(*) AS frequency
FROM
system.access.column_lineage
WHERE
1 = 1
AND source_column_name = column_val
AND source_table_full_name = table_full_name_val
AND datediff(now(), event_date) < 90
AND target_table_full_name IS NOT NULL
AND target_table_catalog = target_catalog_val
GROUP BY
target_column_name,
target_table_full_name
ORDER BY
frequency DESC
LIMIT
10
Column Object Dependency

Column access — frequency + users:

Accesses of a single column/table over the last 90 days; in this case, accesses of the usage_quantity column within the billing usage table.


SELECT
event_date,
COUNT(*) AS frequency
FROM
system.access.column_lineage
WHERE
1 = 1
AND source_column_name = column_val
AND source_table_full_name = table_full_name_val
AND datediff(now(), event_date) < 90
GROUP BY
ALL
Column Access Trends

Which users accessed a column, and which Databricks interface did they use?

SELECT
mask(created_by) as created_by,
entity_type,
source_type,
COUNT(distinct event_time) as access_count,
MIN(event_date) as first_access_date,
MAX(event_date) as last_access_date
FROM
system.access.column_lineage
WHERE
source_table_catalog = catalog_val
AND source_table_schema = schema_val
AND source_table_name = table_val
AND source_column_name = column_val
AND datediff(now(), event_date) < 90
AND entity_type IS NOT NULL
AND source_type IS NOT NULL
GROUP BY
ALL
ORDER BY
last_access_date desc
LIMIT 10
User + Interface Access Patterns

Putting it all together — Dashboards

The following dashboards will help you start analyzing Unity Catalog Lineage tables. Users are encouraged to try the dashboards and then experiment with what is most useful in their environment.

We have three dashboards featuring:

  • Table lineage
  • Column lineage
  • Tags and column lineage

Table access analysis

Putting it all together, I created a dashboard to show:

  • Table accesses by date
  • Columns in use on the table
  • Who accessed the table from what interfaces
  • Related upstream/downstream tables (including deprecated dependencies)

By importing the dashboard and entering any 3-level table name, you can review the same information on tables (provided you have permission to see the table metadata).

UC Lineage Dashboard

Column access analysis

I added a dashboard to review column lineage information. This dashboard shows:

  • A count of tables with the same column name
  • How many objects depend on this column
  • Total accesses of the column during the date range
  • A chart of accesses over time
  • A table showing the frequency of downstream usage of the column
  • A table showing who has accessed the column from what interface, how many accesses, and first and last access dates within the time frame
  • Tables that have the same column name and frequency counts of usage
Column Lineage Access Patterns

Tag and lineage dashboard

Finally, I have a dashboard to review tags in the system leveraging some of the lessons learned in my article on Tag and Column access patterns:

  • A heatmap of tag names and values, which helps you understand how people use tags and reduce tag noise in the enterprise.
  • A list of column names tagged with the name or value (masked in this example)
  • Total accesses of the tagged columns during the date range
  • Lists of Catalogs, Schemas, and Tables tagged (masked for the example)
  • A list of accessed tagged columns, showing who accessed them and how (again, masked for the example).
Tag Access Patterns

Current Limitations

Please see the Data Lineage doc page for the complete list of limits and constraints.

Lineage System Tables are currently in Public Preview. As the tables move to GA, columns, schemas, and data retained may change.

Not all entity types are currently trackable; entity type will be NULL for untracked types.

Aggregate counts in these queries may occasionally be slightly inflated by duplicate records emitted by multiple logging tools.
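If exact counts matter, one way to guard against such duplicates is to deduplicate events before counting. This is a sketch under the assumption that duplicate records share the same user, table, and event_time:

```sql
-- Sketch: deduplicate lineage events before counting, assuming duplicate
-- records share the same created_by, table, and event_time values.
SELECT
source_table_full_name,
COUNT(*) AS access_count
FROM (
SELECT DISTINCT
created_by,
source_table_full_name,
event_time
FROM
system.access.table_lineage
WHERE
datediff(now(), event_date) < 7
AND source_table_full_name IS NOT NULL
) AS deduped
GROUP BY
source_table_full_name
```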

Wrap up

In this article, we explored using Databricks Lineage System Tables to understand who is accessing data objects, how popular data objects are, what objects user groups are accessing, and which tables are upstream and downstream.

This code is available as a notebook and three dashboards in the DBSQL SME Git Repo.

Please download and experiment with the code. Let us know whether this was useful and what other DBSQL and Unity Catalog topics you would like to see covered.
