Photo by Jon Tyson on Unsplash

Databricks System Tables Overview — the List Prices and Usage Tables

Rob Wesley


Part 2

Note: The views expressed in this blog are my own and do not represent the official stance of Databricks.

What are Databricks System Tables, and how do you use them? I want to provide some answers to these questions in a series of posts. My intention is to cover them in depth, so I’ll be covering one to two tables in each post, providing details on the tables and example queries for these tables. This is part 2 of the series. Part 1 provided an overview of System Tables, things to be aware of when working with them, and how to enable them in your workspace.

These are intended to be introductory-level posts, and I will provide SQL queries for these tables. However, these queries are written to be understandable rather than focused on performance. If you have a lot of experience with SQL, you’ll have different ways of writing the queries and the ability to write much more complex queries against these tables.

Overview

From part 1, here is a recap of the two tables we’re going to look at in this post:

system.billing.list_prices — A global Type 2 slowly changing dimension providing the list price for every SKU on your cloud. Use it when you need current or historical pricing for a particular SKU (e.g., what's the price difference with and without Photon?). This table typically gets joined with usage to calculate the billable dollars spent. Note that this table holds list prices; if your company has negotiated discounts, they will not be reflected here.

system.billing.usage — This provides global data on all billable usage in your account, aggregated by the hour. If you want to know how many DBUs are being spent by day, month, cluster, warehouse, or SKU, this is the table to answer those questions. Join it with the pricing table to calculate dollars from DBUs. This is currently the primary fact table in System Tables.

List Prices

This link provides Databricks documentation for system.billing.list_prices. The first step is to run the following command in an SQL editor. This will give us a list of the table’s columns and datatypes.

describe table system.billing.list_prices

The table provides the history of pricing for a usage_unit of a sku_name in a specific currency_code. As mentioned above, list_prices is a Type 2 slowly changing dimension table. The columns price_start_time and price_end_time are used to find when a SKU's price changes; for the current price, price_end_time is null. The only value currently in currency_code is USD (US dollars). The primary usage_unit is DBU (Databricks Unit), which is a measure of processing capability in the Databricks platform. The only other usage_unit currently is GPU_HOUR, which is the unit used for Databricks Mosaic model training. To find the price value, you will need to extract the default field from the pricing structure; see the first example query below for how to do that. And again, these are list prices.

List Price Questions

What are some questions that list_prices can answer just by itself?

What are all of the available SKUs, the current list price for each, and when did the price take effect?

select
sku_name,
pricing.default,
price_start_time,
usage_unit
from
system.billing.list_prices
where
price_end_time is null -- current price for a sku
order by
sku_name;

For this query, we only want current list prices so we filter the data on price_end_time being null. Using pricing.default extracts the default field from pricing.

What are all of the Photon SKUs and their list prices over time?

select
sku_name,
pricing.default,
price_start_time,
price_end_time,
usage_unit
from
system.billing.list_prices
where
upper(sku_name) like '%PHOTON%'
order by
sku_name,
price_start_time;

Since we want history, we don't filter on price_start_time or price_end_time, but since we only want Photon SKUs, we filter on sku_name. If you don't know how a value is cased, it's always a good idea to apply the upper( ) or lower( ) function to the column.

What SKUs have ever had a price change?

with skus as (
select
distinct sku_name -- distinct because it might have had multiple price changes
from
system.billing.list_prices
where
price_end_time is not null -- sku has had at least one price change
)
select
lp.sku_name,
pricing.default,
price_start_time,
price_end_time,
usage_unit
from
system.billing.list_prices lp
inner join skus s on s.sku_name = lp.sku_name
order by
sku_name,
price_start_time;

In this query, we use a common table expression (CTE) to first get the SKUs that have had a price change (a non-null price_end_time indicates a non-current price record and, therefore, a price that changed). Note that we also use distinct to get each sku_name only once; otherwise, a SKU with multiple price changes would appear multiple times. We then join our CTE to list_prices by sku_name to get the relevant rows, ordering them by sku_name and price_start_time to see the history of price changes for each of these SKUs.
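As an aside, here is a more compact way to answer the same question without a CTE, using a window function in a QUALIFY clause. This is just an alternative sketch; it assumes each sku_name currently has a single currency_code and usage_unit (true today, as noted above), so more than one row per SKU implies at least one price change.

select
sku_name,
pricing.default,
price_start_time,
price_end_time,
usage_unit
from
system.billing.list_prices
qualify
count(*) over (partition by sku_name) > 1 -- more than one row means the price changed at least once
order by
sku_name,
price_start_time;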

Usage

This link provides Databricks documentation for system.billing.usage. As before, let’s run the following command to see the table columns and datatypes.

describe table system.billing.usage

As you can see this is a much more involved table, especially when you notice the usage_metadata and product_features columns are structures with quite a few fields (if you’re not familiar with structures, you can get more information on the struct datatype at this link). We’ll spend some time going through the usage columns as there is quite a bit of nuance to this table.

The account_id column will be the same value for all rows. As noted earlier, usage is a global table: it contains usage data for all workspaces in the account, and the workspace_id column indicates which workspace the usage occurred in. That workspace may be a different workspace in the same region (e.g., dev, staging, and production workspaces) or a workspace in another region than the one where the query was run (e.g., another team or business unit's workspaces). See the Part 1 post for a link to a script that builds a workspaces table so you can look up the workspace_name from a workspace_id. The record_id column is a unique value and can be used as a primary key. The usage_date column can be used to easily aggregate data by date.
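If you want to convince yourself that record_id really is unique, a quick sanity-check sketch like this should return the same value in both columns:

select
count(*) as total_rows,
count(distinct record_id) as distinct_record_ids -- should equal total_rows if record_id is unique
from
system.billing.usage;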

Joining to list_prices and other dimension tables

The sku_name column provides the SKU for the usage and can be used to join to list_prices. You'll need to use usage_date (or usage_start_time and usage_end_time) in combination with price_start_time and price_end_time to pick the price that was in effect at the time of the usage (more about joining to list_prices in the example queries). The usage_unit column is also used to join to list_prices, although it's redundant since currently every sku_name has only one usage_unit. Note that usage is a partially denormalized fact table. If it were fully normalized, you wouldn't have the sku_name, usage_unit, and cloud columns in usage; those columns are already in list_prices and would be referenced by joining to list_prices with a surrogate key, i.e., a list_price ID (though the current implementation of list_prices doesn't have such a key). Having sku_name, usage_unit, and cloud in usage is a convenience that avoids a join when you're only interested in usage units and not the price. Usage is only partially denormalized because other columns (e.g., workspace_id and the usage_metadata fields) are ID columns used to join to dimension tables.

Columns: usage_start_time and usage_end_time

Each row in usage represents the number of usage_units (usually DBUs) used during an hourly timespan for a "compute element". The usage_start_time and usage_end_time columns mark that hour's timespan; you'll notice these two columns are always an hour apart.

The use of the phrase "compute element" is not accidental. A compute element may represent a complete cluster or warehouse, just the driver or just the workers of a cluster, or serverless usage during that hour's timespan. The following discussion relates to non-serverless workloads; for details on serverless workload usage, please see this link. If the node type for a cluster's driver and workers is the same, there will be one row for the aggregated usage of the whole cluster. But if the node type for a cluster's driver is different from that of its workers, there will be two rows for the aggregated usage: one for the driver's usage and one for the workers' usage. A cluster with different driver and worker node types running continuously for a day would have 48 rows of data: 24 rows representing the usage of the workers' node type and 24 rows representing the usage of the driver's node type. A cluster with the same node type for driver and workers would have 24 rows.
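To see this pattern in your own account, a sketch like the following counts the hourly rows per cluster, node type, and day for non-serverless clusters. A cluster that ran all day with a single node type shows 24 rows for that node type, while a cluster with distinct driver and worker node types shows 24 rows for each of its two node types.

select
usage_date,
usage_metadata.cluster_id as cluster_id,
usage_metadata.node_type as node_type,
count(*) as hourly_rows, -- 24 per node type for a cluster running a full day
sum(usage_quantity) as total_dbus
from
system.billing.usage
where
usage_metadata.cluster_id is not null
group by
all
order by
usage_date,
cluster_id,
node_type;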

The node_type will be contained in the usage_metadata column which we discuss next.

Column: usage_metadata

The usage_metadata column is a structure with the following fields:
cluster_id, job_id, warehouse_id, instance_pool_id, node_type, job_run_id, notebook_id, dlt_pipeline_id, endpoint_name, endpoint_id, dlt_update_id, dlt_maintenance_id.

Different types of clusters and warehouses, both serverless and non-serverless, will have different usage_metadata fields populated.

instance_pool_id

A cluster created from instance pool nodes will have the instance_pool_id field populated.

node_type

As described above, a cluster created with a single node type for both driver and worker will be represented by one row for each hour in the day it’s used. A cluster created with different driver and worker node types will have two rows for each hour of the day that it’s used. One row will have the node_type of the driver and the second row will have the node_type of the workers.

cluster_id

This provides the cluster ID for the cluster. Determining if the cluster is interactive (all-purpose) or a jobs cluster is based on the job_id field.

job_id

If the cluster is interactive, the job_id field will be null (see the note below).
If the cluster is a jobs cluster, then the job_id field will be populated.

Note: An interactive cluster used in a job will not have a job_id populated. To prevent this issue and provide cost savings, it’s recommended to run jobs on jobs compute.
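Putting cluster_id and job_id together, here is a minimal sketch that splits classic (non-serverless) cluster DBUs into jobs versus interactive usage. Per the note above, jobs run on interactive clusters will be counted as interactive here.

select
case
when usage_metadata.job_id is not null then 'JOBS'
else 'INTERACTIVE'
end as cluster_type,
sum(usage_quantity) as total_dbus
from
system.billing.usage
where
usage_metadata.cluster_id is not null -- classic clusters only
group by
all
order by
total_dbus desc;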

job_run_id

If the cluster is a serverless job cluster, then the job_run_id will be populated.
For non-serverless job runs prior to 2024-03-28, job_run_id will be null; non-serverless job clusters run after 2024-03-28 will have the job_run_id populated.

warehouse_id

If a DBSQL Warehouse (serverless or non-serverless) is used, then the warehouse_id will be populated. The cluster_id and job_id will always be null when the warehouse_id is populated.
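For example, a simple sketch of daily DBUs per warehouse only needs warehouse_id:

select
usage_date,
usage_metadata.warehouse_id as warehouse_id,
sum(usage_quantity) as total_dbus
from
system.billing.usage
where
usage_metadata.warehouse_id is not null -- DBSQL warehouse usage only
group by
all
order by
usage_date,
total_dbus desc;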

notebook_id

A non-null notebook_id indicates serverless notebook usage. The other ID fields will be null when this field is populated.

dlt_pipeline_id

A non-null dlt_pipeline_id shows the usage for a DLT (Delta Live Tables) pipeline. A null cluster_id indicates serverless DLT, and a non-null cluster_id indicates non-serverless DLT. job_id and job_run_id will always be null.
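Following that rule, a sketch like this splits DLT pipeline DBUs into serverless and non-serverless usage:

select
usage_metadata.dlt_pipeline_id as dlt_pipeline_id,
case
when usage_metadata.cluster_id is null then 'SERVERLESS DLT'
else 'NON-SERVERLESS DLT'
end as dlt_compute_type,
sum(usage_quantity) as total_dbus
from
system.billing.usage
where
usage_metadata.dlt_pipeline_id is not null
group by
all
order by
total_dbus desc;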

endpoint_name

endpoint_name is non-null for inference using a Vector Search endpoint.

endpoint_id

endpoint_id is non-null for inference using a Vector Search endpoint.

dlt_update_id

dlt_update_id will be null if dlt_pipeline_id is null. For DLT pipelines, this will be populated with the ID of the DLT update.

dlt_maintenance_id

dlt_maintenance_id will be null if dlt_pipeline_id is null. For DLT pipelines, this will be populated with the ID of the DLT maintenance run.

Column: identity_metadata.run_as

This column, populated since late March 2024, identifies the identity (user or service principal) responsible for the usage. It's most useful for job runs and serverless usage.
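As a sketch (assuming the identity_metadata.run_as field per the current schema), this lists the identities with the most DBUs since the column started being populated:

select
identity_metadata.run_as as run_as_identity,
sum(usage_quantity) as total_dbus
from
system.billing.usage
where
usage_date >= '2024-04-01' -- run_as is only populated for recent usage
and identity_metadata.run_as is not null
group by
all
order by
total_dbus desc
limit 20;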

Column: billing_origin_product and product_features

These two columns help to differentiate the SKU used. Some SKUs are used for multiple products and these columns will identify the specific product and product features being used.
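A quick sketch to see which products generate usage under each SKU:

select
sku_name,
billing_origin_product,
sum(usage_quantity) as total_dbus
from
system.billing.usage
group by
all
order by
sku_name,
total_dbus desc;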

Usage Questions

You will find that much of your use of System Tables flows through usage. Below are some example questions and queries to answer them. In future posts, I will discuss other tables (audit, clusters, node_types) and show how those tables can also be used with usage.

What are the total DBUs used by date and by SKUs for DBSQL for the account (all workspaces)?

select
u.usage_date,
u.sku_name,
sum(u.usage_quantity) as total_dbus
from
system.billing.usage u
where
upper(u.sku_name) like '%SQL%'
group by
all
order by
total_dbus desc

When you run this query you will notice that DBSQL Serverless SKUs are separated by cloud region, which is part of the SKU name.

What is the DBU spend for each workspace by month for DBSQL Serverless?

select
u.workspace_id,
date_trunc('month', u.usage_date) as usage_month,
sum(u.usage_quantity) as total_dbus
from
system.billing.usage u
where
upper(u.sku_name) like '%SERVERLESS_SQL%'
group by
all
order by
1,
2

Note that this query uses the workspace_id. As described earlier, if you have built a workspaces table (link) then you could join that table by workspace_id to get the workspace_name.
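As a sketch, if you had built such a table, the join would look something like the following. The table name main.default.workspaces and its workspace_id/workspace_name columns are hypothetical here; use whatever your script actually created.

select
w.workspace_name,
date_trunc('month', u.usage_date) as usage_month,
sum(u.usage_quantity) as total_dbus
from
system.billing.usage u
left join main.default.workspaces w on u.workspace_id = w.workspace_id -- hypothetical lookup table
where
upper(u.sku_name) like '%SERVERLESS_SQL%'
group by
all
order by
1,
2;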

What are the total DBUs and total dollars (list price) used by date and by SKU for DBSQL for the account (all workspaces)?

select
u.usage_date,
u.sku_name,
sum(u.usage_quantity) as total_dbus,
sum(lp.pricing.default * u.usage_quantity) as list_cost
from
system.billing.usage u
inner join system.billing.list_prices lp on u.cloud = lp.cloud
and u.sku_name = lp.sku_name
and u.usage_start_time >= lp.price_start_time
and (
u.usage_end_time <= lp.price_end_time
or lp.price_end_time is null
)
where
upper(u.sku_name) like '%SQL%'
group by
all
order by
total_dbus desc

This query joins list_prices to usage to calculate total dollars. Note that it compares usage_start_time and usage_end_time against price_start_time and price_end_time to pick the price that was in effect for each usage record (remember that price_end_time is null for the current price).

How can we group DBUs and total dollars (list price) spent into five categories (JOBS, ALL PURPOSE, SQL, INFERENCE, OTHER) by month?

SELECT
date_trunc ('month', usage_date) as usage_month,
CASE
WHEN upper(u.sku_name) LIKE '%JOBS%' THEN 'JOBS'
WHEN usage_metadata.job_id IS NOT NULL THEN 'JOBS'
WHEN upper(u.sku_name) LIKE '%ALL_PURPOSE%'
OR usage_metadata.cluster_id IS NOT NULL THEN 'ALL PURPOSE'
WHEN upper(u.sku_name) LIKE '%SQL%'
OR usage_metadata.warehouse_id IS NOT NULL THEN 'SQL'
WHEN upper(u.sku_name) LIKE '%INFERENCE%' THEN 'INFERENCE'
ELSE 'OTHER'
END AS work_type,
SUM(usage_quantity) AS total_dbus,
SUM(lp.pricing.default * u.usage_quantity) as list_cost
FROM
system.billing.usage u
inner join system.billing.list_prices lp on u.cloud = lp.cloud
and u.sku_name = lp.sku_name
and u.usage_start_time >= lp.price_start_time
and (
u.usage_end_time <= lp.price_end_time
or lp.price_end_time is null
)
GROUP BY
1,
2
ORDER BY
1,
2

This query uses usage_metadata.job_id, usage_metadata.cluster_id, usage_metadata.warehouse_id and sku_name to classify usage into five work types. This query could easily be put into a Lakeview dashboard to provide a visualization of your account usage by work type. A nice feature of Lakeview dashboards is that you can aggregate time-based data differently than the query aggregates it (e.g., by week, by month). In a single dashboard, you could have multiple charts that show roll-ups of the data by the different time periods all using the data from this single query. See this post for more information on Lakeview dashboards. We will revisit this query when we look at system.compute.clusters to further differentiate usage by specific clusters and warehouses.

Up Next

In the following post, I'll cover the node_types and clusters tables in depth. We'll also see how to use those tables to aggregate usage by clusters and warehouses. Part 3

Other Resources
