Databricks System Table Deep Dive — Data Tagging Design Patterns

Databricks SQL SME
DBSQL SME Engineering
13 min read · Feb 14, 2024

Author: Peter Davis, Senior Solutions Architect at Databricks

Introduction:

A cornerstone of building a high-performance data and AI organization is high-quality, accurate data that can be democratized. Data can be made discoverable through generative AI algorithms, but data consumers need context for the data in order to use it efficiently.

Data producers (engineers, analysts and data scientists) add value to our data with curated comments and tags, whether the information originates with humans or AI. This curated information in turn benefits data consumers through:

  • Improved natural language search results
  • Context for data consumers: the meaning of the data lives with the data, eliminating the need to chase down data producers or engage in repeated debate
  • Duplicate avoidance for new data products: we can discover existing similar data products much more easily than in the past

In this article we will tackle the problem of ensuring high-quality metadata for your data intelligence platform by showing how to enter, analyze and audit comments and tags. The platform has tags related to data assets as well as tags related to other objects such as compute and jobs. We will focus on data asset tags in this article.

Enabling data producers to easily tag and comment is absolutely essential for scaling democratization efforts in the enterprise. Databricks makes scaling context input trivial by securely suggesting AI-generated comments for human approval, as well as providing a simple UI for adding tags.

As we enable more producers to document our data, we also increase the need for monitoring and data management in order to mitigate:

  • Close duplicates and misspelled entries that dilute data discoverability and understanding.
  • Inappropriate tag and comment entries that put the business at risk.

In the remainder of this article we will walk through the lifecycle of a tag and a comment, from original creation through modifications, including who was involved at each step. We will start with an overview of metadata through information_schema tables.

Information Schema — metadata overview

Within your Unity Catalog environment there is a “system” catalog. The system catalog contains a schema (aka database) called “information_schema.”

Databricks’ information_schema contains a series of tables that describe objects in the catalog that you have permission to see, including Tags and Comments.

Navigating information_schema — system catalog vs non-system catalog:

Within every non-system catalog there is an information_schema database that is a subset of system.information_schema. This is simply a set of views derived from system.information_schema tables that are filtered for the catalog.

Below are two SHOW CREATE TABLE outputs, one for system and one for our demo catalog:

In the system wide catalog for the table column_tags:

CREATE TABLE system.information_schema.column_tags (
catalog_name STRING NOT NULL,
schema_name STRING NOT NULL,
table_name STRING NOT NULL,
column_name STRING NOT NULL,
tag_name STRING NOT NULL,
tag_value STRING NOT NULL)

In the catalog dbsql_blog_tags_comments:

CREATE VIEW information_schema.column_tags (
catalog_name,
schema_name,
table_name,
column_name,
tag_name,
tag_value)
AS SELECT * FROM system.information_schema.column_tags WHERE catalog_name = 'dbsql_blog_tags_comments'

information_schema has over 40 tables; their relationships are outlined in this entity relationship diagram. To simplify the problem space, I grouped the tables into 7 subjective categories, as follows:

system table information_schema types

In this article we will focus on two categories, Tags and UC Objects (for the purposes of this article we will focus on Objects with Comments).

Tags are associated with objects and are tracked in their own tables, one for each of the objects.
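As a quick sanity check, we can list these per-object tag tables directly (assuming you have access to the system catalog):

```sql
-- List the per-object tag tables in information_schema
SHOW TABLES IN system.information_schema LIKE '*_tags';
```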

Table tags data model

Tags and Comments

We will briefly describe Databricks UI and SQL code entry of tags and comments.

Databricks UI

Users who have the APPLY TAG permission can directly add and edit tags in the Catalog Explorer shown below:

add tags in the Catalog UI

Similarly, users can individually add and edit comments in the Catalog Explorer, either manually or by editing and accepting AI generated comments.

Databricks SQL code

Comments and tags can also be added or updated via code, e.g.
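A minimal sketch follows; the table name reuses the demo objects from later in this article, while the column name and tag keys are illustrative:

```sql
-- Add or update a table comment
COMMENT ON TABLE dbsql_blog_tags_comments.schema_test.table_test IS 'Test table for the tagging demo';

-- Attach tags to the table (tag keys/values are illustrative)
ALTER TABLE dbsql_blog_tags_comments.schema_test.table_test SET TAGS ('data_owner' = 'finance', 'pii' = 'false');

-- Tag an individual column (the column name id is hypothetical)
ALTER TABLE dbsql_blog_tags_comments.schema_test.table_test ALTER COLUMN id SET TAGS ('classification' = 'internal');
```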

Setting tags and comments is easy, but the real power is in being able to use tagging for analysis.

Data Analysis

Tag Analysis:

Tags are key-value pairs that provide context for the objects they are attached to. In Databricks there are two types of tags: operational monitoring tags for billing, and tags on securable objects in Unity Catalog.

This article will focus on tags on objects, specifically tags on catalogs, databases (schemas), tables and columns collected in the following tables in system.information_schema:

Object tags schemas

These tables track tags that have been associated with each of the corresponding object types.

The tables have an additive column structure, which makes it easy to build a comprehensive view that shows tags across the entire metastore by unioning the corresponding tables and adding a bit of contextual information about object type:

CREATE OR REPLACE VIEW dbsql_blog_tags_comments.system_analysis.uc_tags AS
SELECT
catalog_name,
null as schema_name,
null as table_name,
null as column_name,
tag_name,
tag_value,
'catalog' as object_type
FROM
system.information_schema.catalog_tags
UNION ALL
SELECT
catalog_name,
schema_name,
null as table_name,
null as column_name,
tag_name,
tag_value,
'schema' as object_type
FROM
system.information_schema.schema_tags
UNION ALL
SELECT
catalog_name,
schema_name,
table_name,
null as column_name,
tag_name,
tag_value,
'table' as object_type
FROM
system.information_schema.table_tags
UNION ALL
SELECT
catalog_name,
schema_name,
table_name,
column_name,
tag_name,
tag_value,
'column' as object_type
FROM
system.information_schema.column_tags

We can then easily analyze the tags that users have associated with our data objects.

select tag_name, tag_value, count(*) as total_tags from dbsql_blog_tags_comments.system_analysis.uc_tags
group by tag_name, tag_value
order by total_tags desc
limit 10

We can isolate a set of catalogs (or other objects) by simply filtering this query. Alternatively, since each catalog has its own subset of information_schema, we could run similar queries directly against the catalog.

select tag_name, tag_value, count(*) as total_tags from dbsql_blog_tags_comments.system_analysis.uc_tags
where catalog_name = 'dbsql_blog_tags_comments'
group by tag_name, tag_value
order by total_tags desc
limit 10

Because we added an object_type column, we can easily analyze tags by the object type.

 
select object_type, tag_name, count(*) as total_tags from dbsql_blog_tags_comments.system_analysis.uc_tags
group by object_type, tag_name
order by object_type, total_tags desc
limit 10
Tag breakdown by object type

Comment Analysis:

Like Tags, comments provide incredibly important context about your data to your users and to the data intelligence platform as Unity Catalog search takes into account comments on tables and columns.

Users can add comments in two ways:

Comments can be individually reviewed in the Catalog Explorer UI at the catalog, schema, table and column level.

Just like for Tags, to compare Comments across data objects we can leverage information_schema.

We start with a simple query on catalogs:

select catalog_name, comment from system.information_schema.catalogs;

We can also look at schema, tables and columns:

Here are schemas with internal databricks catalogs filtered out to reduce noise:

select catalog_name, schema_name, comment from system.information_schema.schemata where catalog_name <> '__databricks_internal' order by catalog_name, schema_name;

Tables:

select table_name, comment from system.information_schema.tables;

And columns, restricted to our test catalog, schema and table:

-- view column comments (and other values)
select
table_name,
column_name,
comment
from
dbsql_blog_tags_comments.information_schema.columns
where
table_catalog = 'dbsql_blog_tags_comments'
AND table_schema = 'schema_test'
AND table_name = 'view_test'
order by
table_name,
column_name;

Like tags, the comment columns on securable objects have an additive structure:

comments schemas

Putting them all together we can create a view like this:

CREATE OR REPLACE VIEW dbsql_blog_tags_comments.system_analysis.uc_comments AS
SELECT
catalog_name,
null as schema_name,
null as table_name,
null as column_name,
comment,
IF(LENGTH(comment) = 0 OR comment IS NULL, 0, 1) as has_comment,
'catalog' as object_type
FROM
system.information_schema.catalogs
UNION ALL
SELECT
catalog_name,
schema_name,
null as table_name,
null as column_name,
comment,
IF(LENGTH(comment) = 0 OR comment IS NULL, 0, 1) as has_comment,
'schema' as object_type
FROM
system.information_schema.schemata
where catalog_name <> '__databricks_internal'
AND schema_name <> 'information_schema'
UNION ALL
SELECT
table_catalog as catalog_name,
table_schema as schema_name,
table_name,
null as column_name,
comment,
IF(LENGTH(comment) = 0 OR comment IS NULL, 0, 1) as has_comment,
'table' as object_type
FROM
system.information_schema.tables
WHERE
table_catalog <> '__databricks_internal'
AND table_schema <> 'information_schema'
UNION ALL
SELECT
table_catalog as catalog_name,
table_schema as schema_name,
table_name,
column_name,
comment,
IF(LENGTH(comment) = 0 OR comment IS NULL, 0, 1) as has_comment,
'column' as object_type
FROM
system.information_schema.columns
WHERE
table_catalog <> '__databricks_internal'
AND table_schema <> 'information_schema'

We could do further analysis on this view, but in large enterprise systems this represents a lot of data, and every query run will consume resources. With Serverless enabled we can shift to a materialized view to make our analysis much more efficient, at the cost of having to refresh the data. We can refresh either manually or on a schedule.

DROP MATERIALIZED VIEW IF EXISTS dbsql_blog_tags_comments.system_analysis.mv_uc_comments;


CREATE MATERIALIZED VIEW dbsql_blog_tags_comments.system_analysis.mv_uc_comments AS
SELECT
catalog_name,
null as schema_name,
null as table_name,
null as column_name,
comment,
IF(LENGTH(comment) = 0 OR comment IS NULL, 0, 1) as has_comment,
'catalog' as object_type
FROM
system.information_schema.catalogs
UNION ALL
SELECT
catalog_name,
schema_name,
null as table_name,
null as column_name,
comment,
IF(LENGTH(comment) = 0 OR comment IS NULL, 0, 1) as has_comment,
'schema' as object_type
FROM
system.information_schema.schemata
where catalog_name <> '__databricks_internal'
AND schema_name <> 'information_schema'
UNION ALL
SELECT
table_catalog as catalog_name,
table_schema as schema_name,
table_name,
null as column_name,
comment,
IF(LENGTH(comment) = 0 OR comment IS NULL, 0, 1) as has_comment,
'table' as object_type
FROM
system.information_schema.tables
WHERE
table_catalog <> '__databricks_internal'
AND table_schema <> 'information_schema'
UNION ALL
SELECT
table_catalog as catalog_name,
table_schema as schema_name,
table_name,
column_name,
comment,
IF(LENGTH(comment) = 0 OR comment IS NULL, 0, 1) as has_comment,
'column' as object_type
FROM
system.information_schema.columns
WHERE
table_catalog <> '__databricks_internal'
AND table_schema <> 'information_schema';
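Once the materialized view exists, we can refresh it manually; attaching a schedule is also possible (the commented ALTER statement is a sketch, so verify the scheduling syntax against your runtime's documentation):

```sql
-- Manually refresh the materialized view against the latest system table data
REFRESH MATERIALIZED VIEW dbsql_blog_tags_comments.system_analysis.mv_uc_comments;

-- Or attach a refresh schedule (syntax sketch; verify before use)
-- ALTER MATERIALIZED VIEW dbsql_blog_tags_comments.system_analysis.mv_uc_comments
--   ADD SCHEDULE CRON '0 0 6 * * ?' AT TIME ZONE 'UTC';
```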

Now we can easily analyze our comments. For the following queries, if you are not using a materialized view, simply remove the "mv_" prefix from the view name; the queries will then take additional time and resources in large, active environments.

How many comments exist that are non-null and have length > 0?

select has_comment, count(*) from dbsql_blog_tags_comments.system_analysis.mv_uc_comments group by has_comment;

What is our comment coverage by object type?

select object_type, count(*) as total from dbsql_blog_tags_comments.system_analysis.mv_uc_comments
where has_comment = 1
group by object_type
order by total desc;

Let’s look at any comment with more than 10 entries across objects, ordered by total number of comments (useful for scanning top comments and spotting auto-generated ones):

select comment, count(*) as total from dbsql_blog_tags_comments.system_analysis.mv_uc_comments
where has_comment = 1
group by comment
having total > 10
order by total desc

In-depth analysis of comments is beyond the scope of this article but we can take this much further with Natural Language Processing. As a simple example we can scan comments for vulgarity with the open source profanity_check library.

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from profanity_check import predict_prob

## Wrap the library's prediction function for use as a UDF
def predict_prob_udf(text):
    return float(predict_prob([text])[0])

## Register UDF
predict_prob_spark_udf = udf(predict_prob_udf, DoubleType())

## Utilize in Spark
df = spark.sql("select * from dbsql_blog_tags_comments.system_analysis.uc_comment_non_blank")

df_comment_sentiment = df.withColumn("comment_pred", predict_prob_spark_udf("comment"))

## Show comments flagged as likely profane
display(df_comment_sentiment.filter(df_comment_sentiment.comment_pred > 0.5))

There are endless possibilities, but this is an example that can be applied and even combined with Databricks’ new ai_query function to build a truly robust data tagging and commenting system for enterprise-scale data intelligence.

Data Audit/Forensics

When we need to understand the lifecycle of a comment or tag, and which person or service was involved at each step, we can leverage the system.access.audit table.

First it’s important to understand the difference between securable objects and sub-entities or sub-securables in Unity Catalog.

Intro to Unity Catalog privileges:

Unity Catalog’s privilege types govern two levels of objects, securable and sub-entities.

  • Securable objects in Unity Catalog are objects in the metastore that may have privileges granted to a principal.
  • Sub-entities are objects that exist within a securable object.

Catalog, database, and table are hierarchical, securable objects in the metastore; columns, tags and comments exist as sub-entities. These are highlighted in the screenshot of the table dbsql_blog_tags_comments.tpch.customer_orders from the Catalog, below. Each type requires slightly different analysis in system tables.

author screenshot: data asset tags breakdown

Tag Audits

Tags are associated with objects at creation time, whether in the UI or in code. If a tag does not yet exist, it is created and then associated with the object.

To understand which users have been associating tags with objects, we will look at the tag update actions in the system.access.audit table. The audit tables can be quite large, so it is always advisable to restrict the data you are querying; in this case we filter by service name and date.

select * from system.access.audit where event_date 
BETWEEN '2023-12-01' AND '2023-12-08'
AND service_name = 'unityCatalog'
and (action_name = 'UpdateTagSecurableAssignments'
OR action_name = 'UpdateTagSubentityAssignments')

In this query, service_name relates to which service initiated the request and action_name is the category of the event that was captured. In this case we are selecting tags on securable objects like catalog, schema, table as well as subentities like columns.

This next query gives us all updates to securable objects in UC for the date range. We can restrict the results to a specific object by adding a filter in the request_params column on the key “securable_full_name”.

select * from system.access.audit
where event_date BETWEEN '2023-12-01' AND '2023-12-08'
AND service_name = 'unityCatalog'
and (action_name = 'UpdateTagSecurableAssignments'
OR action_name = 'UpdateTagSubentityAssignments')
AND request_params.securable_full_name = 'dbsql_blog_tags_comments.schema_test.table_test'

In the results we will see everyone who updated tags on the table dbsql_blog_tags_comments.schema_test.table_test.
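To surface just the who and when, we can project the identity and timestamp fields from the same result set (these columns all appear elsewhere in this article):

```sql
select
  user_identity.email,
  event_time,
  action_name,
  request_params.securable_full_name
from system.access.audit
where event_date BETWEEN '2023-12-01' AND '2023-12-08'
  AND service_name = 'unityCatalog'
  AND action_name IN ('UpdateTagSecurableAssignments', 'UpdateTagSubentityAssignments')
  AND request_params.securable_full_name = 'dbsql_blog_tags_comments.schema_test.table_test'
order by event_time desc;
```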

Comment Audits

Tracking comment activity:

As we did with tags, we can trace user activity on comments by leveraging system tables. Comments are saved as part of an object’s metadata in two ways, depending on scope:

  1. as part of a securable object’s metadata (catalog, schema, table, volume, model)
  2. as part of column metadata in a table (aka subsecurable)

Auditing comment activity:

Comments are changed through code or the Databricks UI. These are run by separate services and tracked as separate actions in system tables.

Auditing Databricks UI comment activity — securable objects

The following query looks at Unity Catalog events where the action is to update or create a table from the UI.

The result set will show us all audit data for comments within the specified time frame that were created by the user.

  • The IS NOT NULL filter ensures we see comment commits whether they are updating or blanking out the text. Date and identity filters keep the query performant and we can order by date and time to surface the latest commits of comments first.
  • In this query, service_name relates to which service initiated the request and action_name is the category of the event that was captured.
  • If we find there is noise due to automated comments (e.g. file uploads) we can always exclude them as part of the filters.
select user_identity, service_name, action_name, request_params.comment, event_date, event_time, request_params, audit_level from system.access.audit
where 1=1
AND event_date BETWEEN '2024-01-08' AND '2024-01-16'
AND service_name = 'unityCatalog'
and (action_name = 'updateTables'
OR action_name = 'createTable')
AND request_params.comment IS NOT NULL
and user_identity.email = 'peter.davis@databricks.com'
order by user_identity, comment, event_date desc, event_time desc

We could easily modify this example to look at other objects like volume, model, schema or catalog.
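For instance, here is a sketch for schema-level comment commits; the createSchema and updateSchema action names are assumptions, so verify them against your own audit log before relying on this:

```sql
select user_identity, service_name, action_name, request_params.comment, event_date, event_time
from system.access.audit
where event_date BETWEEN '2024-01-08' AND '2024-01-16'
  AND service_name = 'unityCatalog'
  AND action_name IN ('createSchema', 'updateSchema') -- assumed action names
  AND request_params.comment IS NOT NULL
order by event_date desc, event_time desc;
```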

Auditing UI based comment activity — subsecurable:

Column comment commits are done as part of the lineageTracking service during the ‘generateTableDoc’ action.

The following query will pull all UI based commits of the embedded comment in the filter. This is an expensive query so it is best to ensure that our filters are as granular as possible.

select user_identity, service_name, action_name, request_params.table_schema
from system.access.audit
where 1=1
AND event_date BETWEEN '2024-01-22' AND '2024-01-23'
and (action_name = 'generateTableDoc')
AND request_params.table_schema:columns LIKE '%This is the calculated distance for the trip, in miles%'
and user_identity.email = 'peter.davis@databricks.com'
order by user_identity, action_name, event_date desc, event_time desc

Auditing code based comment activity

Comments committed in code can be looked up through submitted commands. Here we winnow by date, service name and action name. With this we can review any comment activity for the date range, whether it is on an object or a column.

select request_params.commandText, *
from system.access.audit
where 1=1
AND event_date BETWEEN '2024-01-20' AND '2024-01-23'
AND service_name = 'databrickssql'
and (action_name = 'commandSubmit')
AND request_params.commandText like '%COMMENT%'
order by request_params.commandText

Current Limitations

Tags

Tags are available on catalogs, schemas, tables, views, registered models and columns.

Tags on views can only be modified by dropping and re-creating the view, or by running an ALTER VIEW statement with the SET TAGS (or UNSET TAGS) clause.
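A sketch of those statements; the view name reuses the demo view from earlier in the article, and the tag key is illustrative:

```sql
-- Set a tag on an existing view
ALTER VIEW dbsql_blog_tags_comments.schema_test.view_test SET TAGS ('certified' = 'true');

-- Remove the tag again
ALTER VIEW dbsql_blog_tags_comments.schema_test.view_test UNSET TAGS ('certified');
```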

Please see the Apply Tags doc page for the full list of limits and constraints.

Comments

Comments are available on catalogs, schemas, tables, shares, recipients, providers, volumes and columns.

Comments on catalogs, shares, recipients, and providers are supported in Unity Catalog only.

Please see the Comment On doc page for the current list of limits and constraints.

Wrapup

In this article we examined the lifecycle of tags and comments, beginning with how they are entered. We looked at information_schema to get a sense of how they are recorded in Databricks. We built views to make it easier to analyze comments and tags that have been entered on the platform. Finally, we looked at how to audit activity on tags and comments by analyzing Databricks System Tables Audit features.

Any of this code can be experimented with and modified for your needs. Please let us know whether this was useful to you and what other DBSQL, Unity Catalog, and system table topics you would like to see covered. System table metadata is large and growing, so if you have a use case you want to see in a tutorial, message us or comment!
