Databricks System Tables Overview — It’s About Answering Questions

Rob Wesley

Part 1

Note: The views expressed in this blog are my own and do not represent the official stance of Databricks.

What are Databricks System Tables, and how do you use them? I want to answer these questions in a series of posts. My intention is to cover the tables in depth, so each post will focus on one or two tables, providing details and example queries for them.

These are intended to be introductory-level posts, and I will provide SQL queries for each table. These queries are written to be understandable rather than performant; if you have a lot of experience with SQL, you'll have your own ways of writing them and will be able to write much more complex queries against these tables.

Introduction

The primary purpose of data is to answer questions. What are this quarter's financial results? What are customers buying? Is this drug effective? It can also be a way to ask better questions (Sometimes Data Science Is Not About Giving Answers It's About Asking Better Questions).

The purpose of Databricks System Tables is to provide the data to answer questions about Databricks, specifically questions about the operation of your Databricks Platform. How many DBUs are we consuming per month? Who is using the platform? How much does a job cost to run, and how long does it take?

A Dimensional Model

System Tables are organized in a dimensional data model (if you're new to dimensional modeling, please see this post for an introduction to the concepts). Some System Tables are fact tables (e.g., usage), some provide dimensions for those facts (e.g., list_prices, clusters), and some are raw data tables (e.g., audit). Note that you'll sometimes find fact data in dimensions and de-normalized dimension data in facts, so treat this only as a rule of thumb. Knowing a table's general type will help you determine which other tables you may need to join: "If I need to know how many DBUs a cluster used, by cluster name, I'll need the current row from clusters joined to usage," versus "If I need to know what changes were made to a cluster, I can get that from clusters alone." These questions raise an important point, which I repeat below: some dimension tables are Type 2 slowly changing dimensions (i.e., they keep a history of changes). You'll need to know this when you only want the current record, to avoid accidentally generating multiple rows in your result.
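To make the first question concrete, here's a sketch of the fact-to-dimension join (column names such as change_time and usage_metadata.cluster_id follow the documented schemas; verify them against your own tables):

-- DBUs per cluster name: join usage (fact) to the current clusters row (dimension)
with current_clusters as (
  select cluster_id, cluster_name
  from system.compute.clusters
  qualify row_number() over (partition by cluster_id order by change_time desc) = 1
)
select c.cluster_name, sum(u.usage_quantity) as dbus
from system.billing.usage u
join current_clusters c
  on u.usage_metadata.cluster_id = c.cluster_id
where u.usage_unit = 'DBU'
group by c.cluster_name
order by dbus desc

The QUALIFY clause keeps only the most recent change row per cluster, which is the Type 2 "current record" point made above.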

The Tables

Currently, System Tables are in Public Preview, with additional tables in Private Preview; I'll only be discussing Public Preview tables. System Tables are available in all three clouds (Azure, AWS, GCP). For documentation links, I'll provide the AWS link; for the other clouds, click the button at the top right of the page to switch clouds.

System Tables are all contained in their own Unity Catalog catalog named system, with separate schemas for different categories of tables (e.g., access, billing, compute). Because they are contained in a UC catalog, a Unity Catalog-enabled workspace is required.

Below is a list of the System Tables I intend to cover in this series and the kinds of questions they can answer. If I have time, I will cover additional System Tables beyond these. The concept of regional vs. global data is covered in the Be Aware section below.

system.access.audit — A regional table representation of the audit log. It's a catch-all table, decoded by service_name and action_name, that provides detailed answers about events occurring in the platform (users logging in, users accessing data, warehouses being started, stopped, etc.). There's an awesome blog providing a wealth of queries that use audit for security monitoring and alerting.
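As a quick way to see what audit contains, this sketch counts recent events by service and action (service_name, action_name, and event_date are all in the documented schema):

-- Most frequent audit events over the last 7 days
select service_name, action_name, count(*) as events
from system.access.audit
where event_date >= current_date() - interval 7 days
group by service_name, action_name
order by events desc

Running this first is a good way to discover which action_name values exist before writing more targeted queries.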

system.billing.list_prices — A global Type 2 slowly changing dimension providing the list price for all SKUs in your cloud. Use it when you need current or historical pricing for a particular SKU (e.g., what's the price difference with and without Photon?). As noted above, it typically gets joined with usage to calculate the billable dollars spent. Note that this table has list prices; if your company has negotiated any discounts, they will not be reflected in this table.
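A minimal sketch of pulling the current list price per SKU; in the documented schema, pricing.default holds the list price and a null price_end_time marks the current row of this Type 2 dimension:

-- Current list price for each SKU
select sku_name, currency_code, usage_unit, pricing.default as list_price
from system.billing.list_prices
where price_end_time is null
order by sku_name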

system.billing.usage — Global data on all billable usage in your cloud account, aggregated by the hour. If you want to know how many DBUs are being spent by day, month, cluster, warehouse, or SKU, this is the table to answer those questions. Join it with the list_prices table to calculate dollars from DBUs. This is currently the primary fact table in System Tables.
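For example, here's a sketch of monthly list-price spend, joining usage to list_prices on SKU and the price validity window (a sketch only; it assumes DBU-denominated SKUs and list prices, not any negotiated discount):

-- List-price dollars per month, from DBUs x list price
select date_trunc('month', u.usage_date) as usage_month,
       sum(u.usage_quantity * p.pricing.default) as list_cost
from system.billing.usage u
join system.billing.list_prices p
  on u.sku_name = p.sku_name
 and u.cloud = p.cloud
 and u.usage_start_time >= p.price_start_time
 and (p.price_end_time is null or u.usage_start_time < p.price_end_time)
where u.usage_unit = 'DBU'
group by date_trunc('month', u.usage_date)
order by usage_month

The range condition on price_start_time/price_end_time is what matches each usage record to the price that was in effect at the time, which is the standard way to join a fact to a Type 2 dimension.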

system.compute.clusters — A regional Type 2 slowly changing dimension providing a history of changes made to all clusters (all-purpose and jobs) in the region. It does not have information on DBSQL warehouses (fear not, other tables provide information on your warehouses). Any question about when or how a cluster was changed, or about a cluster's current setup, uses this table. For instance, if you want to know how many clusters use a particular init script or a particular instance pool, this is the table to provide those answers.
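As a sketch of the instance-pool question (worker_instance_pool_id and change_time follow the documented schema; note that the current rows are computed first, then filtered, so you see each cluster's present configuration rather than its last pool-using change):

-- Clusters whose current configuration draws workers from an instance pool
with current_clusters as (
  select *
  from system.compute.clusters
  qualify row_number() over (partition by cluster_id order by change_time desc) = 1
)
select cluster_id, cluster_name, worker_instance_pool_id
from current_clusters
where worker_instance_pool_id is not null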

system.compute.node_types — A regional dimension table providing details on available node types and hardware information such as the number of cores and the amount of memory. Join this to the clusters table to flesh out details on your cluster hardware.
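A sketch of that join, adding core and memory figures to each cluster's current worker node type (column names per the documented schemas; verify against your tables):

-- Hardware details for each cluster's current worker node type
with current_clusters as (
  select *
  from system.compute.clusters
  qualify row_number() over (partition by cluster_id order by change_time desc) = 1
)
select c.cluster_name, c.worker_node_type, n.core_count, n.memory_mb
from current_clusters c
join system.compute.node_types n
  on c.worker_node_type = n.node_type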

system.compute.warehouse_events — A regional table providing a history of events (starting, stopping, scaling up, etc.) for DBSQL warehouses. This table will answer questions about the operation of your warehouses and can be joined to the usage table. Note that this table has warehouse IDs but not warehouse names. Currently, to find warehouse names, you can query the audit table. Here is an example query to get warehouse names:

with data as (
  -- get all of the successful creates and edits of warehouses and endpoints
  select
    event_time,
    request_params.name as warehouse_name,
    from_json(response['result'], 'Map<STRING, STRING>')['id'] as warehouse_id
  from system.access.audit
  where service_name = 'databrickssql'
    and action_name in ('createWarehouse', 'createEndpoint')
    and response.status_code = '200'
  union
  select
    event_time,
    request_params.name as warehouse_name,
    request_params.id as warehouse_id
  from system.access.audit
  where service_name = 'databrickssql'
    and action_name in ('editWarehouse', 'editEndpoint')
    and response.status_code = '200'
),
current_data as (
  -- get the most recent create or edit of each warehouse or endpoint
  select *,
    row_number() over (partition by warehouse_id order by event_time desc) as row_num
  from data
  qualify row_num = 1
)
select warehouse_id, warehouse_name
from current_data

Be Aware

There are several things to be aware of when working with System Tables:

Latency — System Tables are not populated in real time. There will typically be a small delay between an event occurring and a record appearing about that event in the system table.

Retention — Data is retained in System Tables for a limited time; the retention period for each table is listed in the documentation (link). If you need a longer retention period, persist the data in your own table.
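One simple way to persist the data is a sketch like the following, scheduled to run regularly (the system_reporting catalog name mirrors the naming suggested later in this post; an incremental load keyed on usage_date is shown for simplicity, though keying on a record identifier is more robust against late-arriving rows):

-- One-time: create an empty archive with the same schema as the system table
create table if not exists system_reporting.billing.usage_archive
as select * from system.billing.usage where 1 = 0;

-- Scheduled: append rows newer than what the archive already holds
insert into system_reporting.billing.usage_archive
select u.*
from system.billing.usage u
where u.usage_date > coalesce(
  (select max(usage_date) from system_reporting.billing.usage_archive),
  date '1970-01-01')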

Global or Regional — Some tables are global, containing data about all your workspaces in all regions (e.g., usage), whereas others are regional; they only contain data about workspaces in that region (e.g., clusters). If you need a global view of these tables, you must use Delta Sharing or another mechanism to consolidate this data from all regions. Whether they are global or regional is noted in the documentation (link).

Slowly Changing Dimensions — As noted above, some tables (e.g., clusters, list_prices) are Type 2 slowly changing dimensions. Knowing this will help you avoid getting seemingly duplicate rows in your results.

Permissions — As covered in the enablement section below, you must be an account administrator to enable System Table schemas. Once enabled, no users other than administrators will have access to these tables. I'd recommend limiting access to System Tables to administrators, at least initially. Standard USE and SELECT grants provide access to the schemas and tables for other users.

Missing Pieces — As noted earlier, System Tables are in preview, so not all of the pieces are there yet. For instance, you’ll see in many of the tables that the workspace ID is there but not the workspace name, and there’s a clusters table but no warehouses table yet. See the example above for a query to get warehouse names from audit. See this awesome script to create a workspaces table and a users table.

An aside on this topic: If you want to create additional reference tables (e.g., a warehouse or workspace table) or your own longer-retention history tables, you'll find that you can't create schemas or tables in the system catalog. I recommend putting these additional tables in a single catalog (I typically use system_reporting so it shows up in the Catalog Explorer near the system catalog), in schemas named the same as the relevant system catalog schemas.
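That layout can be created with a few statements (a sketch; adjust the schema list to match the system schemas you actually use, and grant permissions as appropriate):

-- One-time setup: a reporting catalog with schemas mirroring the system catalog
create catalog if not exists system_reporting;
create schema if not exists system_reporting.billing;
create schema if not exists system_reporting.compute;
create schema if not exists system_reporting.access;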

How to Enable System Tables

As noted earlier, a Unity Catalog-enabled workspace is required for working with System Tables, and you have to be an account administrator to enable them. Each schema is enabled separately. To enable them, you can either use a curl command (link) or, the easier way, install the System Tables demo (link). The demo has a notebook (_enable_system_tables) to programmatically enable all currently available System Table schemas. Interestingly, you can also disable a System Table schema with a curl command, although I'm not sure why you would want to do this. Make a habit of rerunning the curl commands or the notebook, because as tables come out of Private Preview, new schemas may become available, and they will not be enabled automatically.

Listing Available System Table Schemas and Their Current Status

curl -v -X GET -H "Authorization: Bearer <PAT Token>" "https://<workspace>.cloud.databricks.com/api/2.0/unity-catalog/metastores/<metastore-id>/systemschemas"

See this link for how to generate a PAT token.

To get the metastore ID, you can use the metastore details link (gear icon) in the Catalog Explorer.

Enabling a System Table Schema

curl -v -X PUT -H "Authorization: Bearer <PAT Token>" \
"https://<workspace>.cloud.databricks.com/api/2.0/unity-catalog/metastores/<metastore-id>/systemschemas/<SCHEMA_NAME>"

Installing the System Tables Demo

%pip install dbdemos
import dbdemos
dbdemos.install('uc-04-system-tables')

Up Next

In the following post, I'll cover the list_prices and usage tables in depth. These two tables will give you more visibility into your usage and costs on the platform.

Part 2
