Building Lakeview Dashboards on Databricks — Best Practices

Databricks SQL SME
DBSQL SME Engineering
23 min read · Apr 25, 2024

Author: Cody Austin Davis

Introduction

Lakeview is a new embedded Dashboarding tool within Databricks SQL that provides a rich set of capabilities for visualization, reporting, and sharing. Let us review some common best practices, tips, and tricks for building performant, robust, and beautiful Lakeview Dashboards!

Development Flow

Lakeview dashboards offer a slightly different development workflow than legacy DBSQL dashboards, and will ultimately have much more functionality. Historically, the concepts of dashboards and the queries that power them were decoupled and governed separately, making for a disjointed user experience for development, usage, and management of data analytics products. Because of this, it was harder to encapsulate and distribute end-to-end analytical dashboarding products easily. Today, the Lakeview Dashboard is a first-class citizen in the Databricks SQL product, and all the queries and visuals that power a given dashboard are governed together seamlessly. Now, most importantly, you can share Lakeview Dashboard templates (including queries, visuals, and dashboard layout) with a single click of a button! Lakeview Dashboards just went GA this month, so this is just the beginning of doing BI on Databricks SQL.

To get started developing a dashboard, check out an intro tutorial here. In this article, we are going to go a level deeper with some performance and usability tips and tricks, so the rest of this content will assume you have at least played around a bit with Lakeview Dashboards.

Overall, Lakeview Dashboards have 2 parts to them: the Data tab and the Canvas tab. As a user, you will write queries (or simply reference full tables) and create your data model that you can then access in the canvas tab to build visuals in the drag and drop interface. To help you build amazingly performant and functional dashboards, we are going to cover some best practices to get the most out of your data in both developing the queries as well as designing the visuals themselves.

Best Practices

To demonstrate these best practices, we are going to take a use case that we can all benefit from — analyzing billing data in our system tables. There are a ton of use cases that stem from system tables, but let’s take a focused look at trending the costs of our environment over time. Additionally, we will add some filtering by product category and date range, as well as a dynamic filter that lets us look at cost for any combination of tags — an incredibly valuable real-world use case. Let’s jump in!

We are going to create the following dashboard with just the system.billing.usage and system.billing.list_prices system tables:

End Dashboard Result

It is short and simple, but actually packs a lot of logic! In this example, we will keep the metric scope small to focus on the performance and implementation best practices. We have the following metrics incorporated into the dashboard:

  1. Total Usage ($DBUs) — This is the simplest metric — it is simply the sum of all usage in the selected period for our selected compute.
  2. T7, T30, T90, T365D Total Usage in $DBUs — This is just the trailing total usage of our currently selected compute (our filters are just usage_date, tag, and product category in this example, but you can add filters/parameters for workspace, compute type, job_id, sku, owner, etc.). We also calculate the daily average during each of these windows, which helps business users get a sense of how much usage is increasing in the shorter windows relative to the longer ones.
  3. T7, 30, 90, 365 Day Daily Moving Averages in $DBUs — These are the rolling daily moving averages of usage for the selected compute types. This metric is less commonly used, but it is especially helpful for trending and visualizing extremely spiky workloads to see how they compare to the overall trends. It is also a good way to demonstrate more advanced window functions in Lakeview Dashboards (see the sketch below).
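As a quick preview, each moving average is just a windowed AVG over a daily roll-up. Here is a minimal sketch using DBU counts only (the full queries at the bottom of the article also join list prices to convert DBUs to dollars):

SELECT
  usage_date,
  DBUs,
  AVG(DBUs) OVER (
    ORDER BY usage_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS avg_7d
FROM ( -- daily roll-up
  SELECT usage_date, SUM(usage_quantity) AS DBUs
  FROM system.billing.usage
  GROUP BY usage_date
) daily
ORDER BY usage_date;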

Now let’s dive into the implementation!

  • Note — the queries are long, so the 2 full SQL queries are provided at the bottom of this article. You can also follow along and import this dashboard here (all resources provided in the footnotes).

Throughout this article we are going to cover the following:

Performance & Usability Best Practices Topics:

  1. Leverage the AI Assistant directly in your Data Tab
  2. When to use parameters vs filters in your visuals
  3. Creative use of parameters for dynamic filtering
  4. Binding Multiple Queries to Parameters & Filters
  5. When to use Materialized Views for Complex Dataset Queries
  6. Cool formatting features in the Table visual

1. Building our data model with AI Assistant

Our dashboard is powered by 2 queries — one that provides the usage over time and calculates the moving averages, and one that provides the single-row trailing metrics. Our data model currently looks like this:

Data Tab for our Observability Dashboard

As you can see, we are doing a lot of logic here. If we want to express something in our dashboard but are not exactly sure how to frame the problem, we can use the Databricks AI Assistant right in our development workspace!

AI Assistant Button

This experience is amazingly helpful for literally 10x’ing the development speed of our dashboards. Let’s say we want to add a column and express something in English that we are not sure how to include in our current query. We just ask the assistant directly in our SQL workspace like so:

Building Lakeview Dashboards with Embedded AI

The AI assistant can not only convert English to SQL, it can also use the context and structure of your existing query to help you much more easily refine and extend your logic. The more nuanced and complex your dashboards get, the more productive this AI assistant makes you, so even if you are a SQL master, you can benefit from including it in your dashboard development workflow.

2. When to Use Parameters Vs Filters

There are 2 ways to slice and filter data in Lakeview Dashboards: parameters and filters. They serve similar purposes but provide different results and performance. This is the order in which parameters and filters are applied in the life of a Lakeview query:

Order of a Query in Lakeview

The parameters in your dashboard are defined, then the query is compiled and sent to DBSQL to execute. Once the results are returned, the filters are applied and fed into the visualizations. Knowing this, we can design our dashboard to optimize for usability and performance by minimizing data movement from DBSQL to our Lakeview client (web browser). To create and utilize parameters, all we have to do is reference them in the query with the colon syntax (:parameter). Let’s dive into our example:

Parameters in the Data Tab

In the Data tab, you can create parameters and define default values for them. Filters, on the other hand, only exist on the Canvas (client side) in the visuals themselves and are not part of the underlying data model. In our example, we define 4 parameters: usage_start, usage_end, product_category, and tag_keys. The first 2 (usage_start and usage_end) allow us to filter the underlying data by usage date in our billing logs. When we do this via parameters, our dashboard will ONLY pull in the data within the range defined by our selected parameters. If we instead chose to slice our usage view via filters, the Lakeview Dashboard would query the full table (for smaller <64k-row datasets), pull it into the client (your computer’s memory), and filter it on the fly. This sends more data over the network and applies more memory pressure to your local machine. That being said, if the data is relatively small (< a few GBs of results), then using filters is simpler and perfectly fine for many use cases.
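As a minimal sketch, a parameterized dataset query that only pulls the selected date range can look like this (a daily DBU roll-up only; our real dataset queries add pricing and tag logic on top):

SELECT
  usage_date,
  SUM(usage_quantity) AS DBUs
FROM system.billing.usage
WHERE usage_date BETWEEN :usage_start AND :usage_end
GROUP BY usage_date;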

Filters also have the ability to know when to push down to DBSQL and build the visual-level dataset server-side, ensuring your browser does not pull in too much data. Generally, when the result is > 64k rows, Lakeview knows to push the filters and visual measure creation down to DBSQL, so the life of the query looks something like this:

WITH q AS (
  ...datasetQuery...
  WHERE <:params>
)
SELECT measures, dimensions
FROM q
WHERE <filters>
GROUP BY dimensions

This helps optimize data retrieval between DBSQL and your browser, but the order in which parameters and filters are executed stays the same, so make sure to consider that when building your dashboard.

Overall, remember the following:

Parameters — select the data before results are returned. Use them when the overall dataset is large and selectivity is high (when your dashboard only needs a small subset of the data). Also remember that you can bind parameters to multiple queries, so if you have 2+ queries that should share the same dimension filter, parameters are the way to do that. Parameters are always executed FIRST and are part of the original dataset query.

Filters — Filters live in the Canvas tab only and filter the data after the query results have been returned to Lakeview for visuals. Filters can be pushed down to DBSQL automatically when the dataset result is large (>64k rows), but they are always applied AFTER parameters. This is much simpler and quicker for most dashboard use cases. If you only need to filter one query, or if the overall selectivity is not very high, filters are a great option. They also offer more functionality in the visual layer.

3. Using Parameters for Advanced Dynamic Filtering

Parameters are not just for slicing and dicing. Remember, they are embedded directly into the source query definition itself, which unlocks quite a bit of creative liberty for BI developers. You can take the value of a parameter and use it to dynamically filter, allocate, categorize, and do really anything on the fly to your visual results. That is a bit abstract, so let’s build a really powerful and common example: tagging!

To demonstrate the immense flexibility we get from parameters, let’s build the following use case:

As a user, I want to be able to dynamically trend usage by product category and also be able to trend usage for any combination of tags (of any compute type). For example, I want to be able to trend and summarize my usage for all compute with the following tags: “environment”: (prod/dev/test) and “system” (bi_app_1, de_pipeline, finance_etl, etc.).

We can simply add 2 parameters: product_category and tag_keys:

New Parameters

To create and utilize these, all we have to do is use them in the query with the colon syntax (:parameter) like so:

SELECT *
FROM system.billing.usage u
INNER JOIN px_all px ON px.sku_name = u.sku_name
WHERE
-- Date Filter
usage_date BETWEEN :usage_start AND :usage_end
AND
-- Product Filter
(CASE WHEN :product_category = 'all' THEN 1=1 ELSE
ARRAY_CONTAINS(split(lower(:product_category), ';'), lower(u.billing_origin_product))
END)

This query selects from the billing table and uses the parameters to filter it by usage date as well as by the product categories that match a string of semicolon-separated values in our product_category parameter. You can use any SQL function to parse and leverage the parameter. In the above example, we parse the ;-separated list and filter on whether or not our selected one-or-many categories match the billing product type. This is already powerful, but it is just the beginning of what we can do! Let’s take it a step further to be able to filter our usage dashboard by ANY combination of tags for our clusters/jobs/warehouses:


---- NOT FULL QUERY
....
-- Now tag combos can be matched in 2 separate ways: key only, or the key=value pair if optionally provided
parsed_tagging_table AS (
WITH split_data AS (
SELECT
split(regexp_replace(:tag_keys, '\\s+', ''), ';') AS kv_pairs
),
exploded_data AS (
SELECT
monotonically_increasing_id() AS order_id,
explode(kv_pairs) AS kv_pair
FROM
split_data
),
clean_keys AS (

SELECT
split(kv_pair, '=')[0] AS key,
COALESCE(split(kv_pair, '=')[1], '') AS value,
kv_pair AS combination,
CASE WHEN contains(kv_pair, '=') THEN 1 ELSE 0 END AS ContainsValuePair
FROM
exploded_data
),

aggregated_data AS (
SELECT
collect_list(key) as keys,
collect_list(value) as values,
collect_list(CASE WHEN ContainsValuePair = 1 THEN combination END) as KVCombos
FROM
clean_keys
),

clean_aggs AS (
SELECT
map_from_arrays(keys, values) AS result_map,
KVCombos
FROM
aggregated_data LIMIT 1
)

SELECT
-- map with only keys
MAP_FILTER(result_map, (key, value) -> value = '') AS filtered_map,
-- entries that need to match both key and value pairs
KVCombos
FROM clean_aggs
),
---..... Rest of query

This is a complex CTE, but what it does is parse a semicolon-separated list of key=value pairs, with the values being optional. This allows users to look up ANY combination of tag key/value combos and trend usage for that subset of their jobs. This is an amazingly powerful capability for enterprises that want to see their costs across their own custom tagging policies.
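For intuition, here is a small, self-contained sketch of the parsing behavior with an assumed parameter value of 'environment=prod;project', hard-coded so you can run it on its own:

SELECT
  -- key-only entries: matched on the tag key alone (these become filtered_map in the full query)
  filter(split(regexp_replace('environment=prod;project', '\\s+', ''), ';'), kv -> NOT contains(kv, '=')) AS key_only_entries, -- ['project']
  -- key=value entries: must match the whole pair (these become KVCombos in the full query)
  filter(split(regexp_replace('environment=prod;project', '\\s+', ''), ';'), kv -> contains(kv, '=')) AS kv_combos; -- ['environment=prod']

In the full query, filtered_usage compares these two sets against each row’s custom_tags to decide whether that row matches the selected tag policy.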

This gives us the following dashboard:

Looking up all usage with the “demo” tag key and “project” tag key where project = “dbdemos”

This dashboard shows how flexible this parameter is. It allows us to clearly and easily trend usage for a specific tagged set of workloads and data products within an environment. Combined with our product_category filter, we are already doing some serious work with just 2 parameters! In the example above, we get a very clear picture of usage trends for a specific set of tag keys and, optionally, value pairs. For example, we can track how much a particular business use case, downstream customer, team, or any other custom attribute we tag on actually costs.

4. Binding Multiple Queries to Parameters & Filters

In most dashboards, users will typically have multiple queries that make up that dashboard. For our example, we have 2. What if we want to create a single set of parameters and apply them to multiple queries at once? Well you can do just that.

In our example, we have 4 parameters and 2 queries, and both queries actually use the same attributes. All we have to do is define the parameters for each query (each query independently owns its parameters) and bind each query parameter to the parameter card in the canvas like so:

Click on the parameter card in the canvas
Click the “+” sign to bind multiple query parameters to a single control in the canvas

That is all! After that, all visuals that use either the “Trailing Billing Summary” or “Billing Momentum Indicators” queries will now filter by the same control in the canvas. You can do this exact same process with filters as well, so you can easily share and unify the filtering experience across your entire dashboard.

Now we can be more strategic about how we separate analytical results and queries, saving space and removing the need to stuff duplicated data and metrics into a single query. This technique is crucial to minimizing the size of your result data model.

5. When to use Materialized Views for Lakeview Dashboards

Materialized Views are crucial for any BI application. In general, if you have larger/more complex logic in your data model layer (Data tab), then you should consider making it a materialized view. When you do this, you get the following additional benefits:

  1. Allows the results to be computed once and stay up to date as needed
  2. Completely removes duplicate / repetitive calculations
  3. Makes the BI development process simpler for some use cases
  4. Calculations are automatically incremental (when possible)
  5. Allows for more governable refresh of data in a “push” method instead of always querying the latest data — reducing serving load drastically

Let’s take our example query (full query reference at the bottom of the article), parts of which we have shown above. We are doing parameter parsing and calculating moving averages and trailing metrics using complex and expensive window and sort functions. This is exactly the kind of logic we would want to turn into a materialized view. In our example, we would create our materialized views like so:

Moving Averages MV:


CREATE MATERIALIZED VIEW main.default.usage_dashboard_smoothed_moving_averages_mv
SCHEDULE CRON '0 * * * *'
COMMENT 'Moving Averages for usage updated every hour on the hour'
AS (
WITH px_all AS (
SELECT DISTINCT
sku_name,
pricing.default AS unit_price,
CASE
WHEN sku_name LIKE ('%ALL_PURPOSE%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%DLT%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%JOBS%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%SQL%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
ELSE unit_price::float
END AS sku_price
FROM system.billing.list_prices
QUALIFY ROW_NUMBER() OVER (PARTITION BY sku_name ORDER BY price_start_time DESC) = 1
),

filtered_usage AS (
SELECT u.*,
usage_quantity*px.sku_price AS DollarDBUs
FROM system.billing.usage u
INNER JOIN px_all px ON px.sku_name = u.sku_name

),

clean_usage AS (
SELECT usage_date, SUM(usage_quantity) AS DBUs,
SUM(DollarDBUs) AS DollarDBUs,
ROW_NUMBER() OVER (ORDER BY usage_date DESC) AS rn
FROM filtered_usage u
GROUP BY usage_date
)

-- Final Query
SELECT
usage_date,
DollarDBUs,

AVG(DollarDBUs) OVER (
ORDER BY usage_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS avg_7d,
AVG(DollarDBUs) OVER (
ORDER BY usage_date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) AS avg_30d,
AVG(DollarDBUs) OVER (
ORDER BY usage_date
ROWS BETWEEN 89 PRECEDING AND CURRENT ROW
) AS avg_90d,
AVG(DollarDBUs) OVER (
ORDER BY usage_date
ROWS BETWEEN 364 PRECEDING AND CURRENT ROW
) AS avg_365d
FROM clean_usage -- Roll up to daily level (defined above); keep it nice and simple for OOTB use cases
-- Optional, depending on your needs: WHERE clause to filter rows
ORDER BY
usage_date
);

Trailing 7,30,90,365 day aggregate metrics MV:

CREATE MATERIALIZED VIEW main.default.usage_dashboard_mv
SCHEDULE CRON '0 * * * *'
COMMENT 'Trailing Metrics for usage updated every hour on the hour'
AS
(
WITH px_all AS (
SELECT DISTINCT
sku_name,
pricing.default AS unit_price,
CASE
WHEN sku_name LIKE ('%ALL_PURPOSE%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%DLT%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%JOBS%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%SQL%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
ELSE unit_price::float
END AS sku_price
FROM system.billing.list_prices
QUALIFY ROW_NUMBER() OVER (PARTITION BY sku_name ORDER BY price_start_time DESC) = 1
),

clean_usage AS (
SELECT usage_date, SUM(usage_quantity) AS DBUs,
SUM(usage_quantity*px.sku_price) AS DollarDBUs,
ROW_NUMBER() OVER (ORDER BY usage_date DESC) AS rn
FROM system.billing.usage u
INNER JOIN px_all px ON px.sku_name = u.sku_name
GROUP BY usage_date
),

T7D_query AS (
(SELECT SUM(DollarDBUs) AS T7D FROM clean_usage WHERE rn <= 7)
),
T30D_query AS (
(SELECT SUM(DollarDBUs) AS T30D FROM clean_usage WHERE rn <= 30)
),
T90D_query AS (
(SELECT SUM(DollarDBUs) AS T90D FROM clean_usage WHERE rn <= 90)
),
T365D_query AS (
(SELECT SUM(DollarDBUs) AS T365D FROM clean_usage WHERE rn <= 365)
)


-- Final Query
SELECT DISTINCT
-- Calculate Trailing usage right here in subqueries
(SELECT MAX(T7D) FROM T7D_query) AS T7D_Dollars,
(SELECT MAX(T30D) FROM T30D_query) AS T30D_Dollars,
(SELECT MAX(T90D) FROM T90D_query) AS T90D_Dollars,
(SELECT MAX(T365D) FROM T365D_query) AS T365D_Dollars,
try_divide(T7D_Dollars, 7) AS DailyAvgT7D,
try_divide(T30D_Dollars, 30) AS DailyAvgT30D,
try_divide(T90D_Dollars, 90) AS DailyAvgT90D,
try_divide(T365D_Dollars, 365) AS DailyAvgT365D
LIMIT 1
)
;

Now, if we point our dashboard dataset queries at the MVs above instead of the original source queries, the dashboard will be much faster and more efficient!
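For example, the moving-averages dataset in the Data tab can shrink to a simple parameterized read of the MV (a sketch, reusing the MV name and columns defined above):

SELECT usage_date, DollarDBUs, avg_7d, avg_30d, avg_90d, avg_365d
FROM main.default.usage_dashboard_smoothed_moving_averages_mv
WHERE usage_date BETWEEN :usage_start AND :usage_end
ORDER BY usage_date;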

Using a materialized view can drastically speed up dashboards, which is great for a large majority of use cases. Most BI use cases are static dashboards that just need to be refreshed on a cadence, and materialized views are AMAZING in these cases for saving on both cost and performance. Instead of re-querying and re-running all of the above complex logic on every refresh/parameter change, we compute the results once, and ALL downstream users that need this data do not waste time and money re-computing the same results.

Using Materialized Views for your serving layer is a more advanced pattern, so when designing your BI serving layer, it is important to plan for how often your data needs to be refreshed as well as how your users need to access the underlying data. There are tradeoffs, so let’s highlight some of those considerations with our dashboard example.

When to not use Materialized Views:

  1. Dynamic Filtering / Parameters — When you need users to have more interactivity and the ability to slice & dice by many dimensions in the source data, materialized views can limit that wherever you aggregate away dimensions. For example, in the above materialized views we can no longer dynamically filter by tag or product category, but we can still filter on usage date because it remains a dimension in our final MV.
  2. Small Data / Lots and Lots of Views — If the overhead of defining and managing the MVs becomes greater than the benefit in query performance (e.g. ad-hoc reporting, ephemeral dashboards, etc.), don’t worry about increasing the complexity of your BI stack and just use queries.
  3. Users require low-latency data freshness — This is not exactly a hard and fast rule, but more of something to think about. Users can get confused if they are expecting real-time data but the underlying MV is refreshed on a slower cadence (daily/hourly). If you truly have a low-latency use case (< 1 minute), then it may be more beneficial to simply query the table, as long as your filters are selective enough. However, for most BI use cases, this is more of a planning and communication problem to handle when you are designing your BI strategy. Keep in mind that MVs can refresh at just about any rate, so just make sure to set and communicate the freshness of the data explicitly to end users. One easy way to communicate data freshness is to add a refresh_timestamp column to your MV and bake it into the dashboard like so:


CREATE MATERIALIZED VIEW main.default.usage_dashboard_mv
SCHEDULE CRON '0 * * * *'
COMMENT 'Trailing Metrics for usage updated every hour on the hour'
AS
(
WITH px_all AS (
SELECT DISTINCT
sku_name,
pricing.default AS unit_price,
CASE
WHEN sku_name LIKE ('%ALL_PURPOSE%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%DLT%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%JOBS%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%SQL%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
ELSE unit_price::float
END AS sku_price
FROM system.billing.list_prices
QUALIFY ROW_NUMBER() OVER (PARTITION BY sku_name ORDER BY price_start_time DESC) = 1
),

clean_usage AS (
SELECT usage_date, SUM(usage_quantity) AS DBUs,
SUM(usage_quantity*px.sku_price) AS DollarDBUs,
ROW_NUMBER() OVER (ORDER BY usage_date DESC) AS rn
FROM system.billing.usage u
INNER JOIN px_all px ON px.sku_name = u.sku_name
GROUP BY usage_date
),

T7D_query AS (
(SELECT SUM(DollarDBUs) AS T7D FROM clean_usage WHERE rn <= 7)
),
T30D_query AS (
(SELECT SUM(DollarDBUs) AS T30D FROM clean_usage WHERE rn <= 30)
),
T90D_query AS (
(SELECT SUM(DollarDBUs) AS T90D FROM clean_usage WHERE rn <= 90)
),
T365D_query AS (
(SELECT SUM(DollarDBUs) AS T365D FROM clean_usage WHERE rn <= 365)
)


-- Final Query
SELECT DISTINCT
-- Calculate Trailing usage right here in subqueries
(SELECT MAX(T7D) FROM T7D_query) AS T7D_Dollars,
(SELECT MAX(T30D) FROM T30D_query) AS T30D_Dollars,
(SELECT MAX(T90D) FROM T90D_query) AS T90D_Dollars,
(SELECT MAX(T365D) FROM T365D_query) AS T365D_Dollars,
try_divide(T7D_Dollars, 7) AS DailyAvgT7D,
try_divide(T30D_Dollars, 30) AS DailyAvgT30D,
try_divide(T90D_Dollars, 90) AS DailyAvgT90D,
try_divide(T365D_Dollars, 365) AS DailyAvgT365D,
current_timestamp() AS refresh_timestamp -- How to communicate data freshness:
-- put this column in a card in your Lakeview Dashboard!
-- No manual communication needed!
LIMIT 1
)
;
Final Output for our Lakeview Dashboard

Now all we need to do to set expectations with our end users is add the refresh column to our dashboard!

6. Neat Formatting Tricks in the Table Visual

This is more of a neat bonus that I personally use often for easy-to-read table reports and outputs. If you have a use case for the “Table” visual, you can implement all sorts of dynamic and conditional formatting. Here are some of the most commonly used options:

Example of conditional formatting

There are quite a few cool things in here, so let’s unpack them!

  1. Conditional Coloring: If you pick a table visual and select a column, you can format the coloring based on a value like so:
setting up conditional formatting

2. Put anything into a table column — You can technically put almost anything into a table column, including links, JSON, and even images. This, combined with the ability to write custom Python + SQL UDFs, gives you near-infinite ability to customize your dashboards, including making more visuals inside your table. Get creative! All you have to do is change the display setting in the column selection, as in the sketch below:
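As a minimal sketch (the URL pattern here is purely illustrative), the dataset query just returns a string column, and the column’s display setting in the table visual controls whether it renders as plain text, a link, an image, or JSON:

SELECT
  workspace_id,
  SUM(usage_quantity) AS DBUs,
  -- illustrative URL pattern; change this column's display setting in the visual to render it as a link
  CONCAT('https://example.com/workspaces/', CAST(workspace_id AS STRING)) AS workspace_link
FROM system.billing.usage
GROUP BY workspace_id;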

3. Use columns for searching — You can enable the ability to search the table by column values by simply clicking the “use column for search” button in the column-level config:

4. Add helper comments to columns — If you input a description for the query column, it will show up as a tooltip in the visual for that column. This is really helpful for guiding and educating users on how to use and interpret your dashboard.

Tooltip from description column
How to add description column

Conclusion

These are some of our favorite initial tips, tricks, and best practices, but there is MUCH more to come from Lakeview Dashboards, so stay subscribed and on the lookout here for more tips as they come out! If you have a cool dashboard you want to blog about, reach out to us if you want to feature it on this blog!

Footnotes:

Full Dashboard Lakeview Template here.

Full Dashboard SQL Queries (with dynamic parameters):

  1. Moving Averages:
/*
This query actually filters on the same tagging framework as the other query as well as date range and product category
and trends trailing usage metrics as well as moving averages for them
*/


WITH px_all AS (
SELECT DISTINCT
sku_name,
pricing.default AS unit_price,
CASE
WHEN sku_name LIKE ('%ALL_PURPOSE%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%DLT%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%JOBS%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%SQL%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
ELSE unit_price::float
END AS sku_price
FROM system.billing.list_prices
QUALIFY ROW_NUMBER() OVER (PARTITION BY sku_name ORDER BY price_start_time DESC) = 1
),


-- Now tag combos can be matched in 2 separate ways: key only, or the key=value pair if optionally provided
parsed_tagging_table AS (
WITH split_data AS (
SELECT
split(regexp_replace(:tagging_raw_architecture, '\\s+', ''), ';') AS kv_pairs
),
exploded_data AS (
SELECT
monotonically_increasing_id() AS order_id,
explode(kv_pairs) AS kv_pair
FROM
split_data
),
clean_keys AS (

SELECT
split(kv_pair, '=')[0] AS key,
COALESCE(split(kv_pair, '=')[1], '') AS value,
kv_pair AS combination,
CASE WHEN contains(kv_pair, '=') THEN 1 ELSE 0 END AS ContainsValuePair
FROM
exploded_data
),

aggregated_data AS (
SELECT
collect_list(key) as keys,
collect_list(value) as values,
collect_list(CASE WHEN ContainsValuePair = 1 THEN combination END) as KVCombos
FROM
clean_keys
),

clean_aggs AS (
SELECT
map_from_arrays(keys, values) AS result_map,
KVCombos
FROM
aggregated_data LIMIT 1
)

SELECT
-- map with only keys
MAP_FILTER(result_map, (key, value) -> value = '') AS filtered_map,
-- entries that need to match both key and value pairs
KVCombos
FROM clean_aggs
),


filtered_usage AS (
SELECT u.*,
usage_quantity*px.sku_price AS DollarDBUs,

-- Custom Tagging Filtering System
(SELECT MAX(size(map_keys(filtered_map)) + size(KVCombos)) FROM parsed_tagging_table) AS TotalPolicyTags,

-- When only keys are provided, match on the key alone, but when result_map has a value, check the whole k=v pair against the set
(
ARRAY_DISTINCT(
CONCAT(
array_intersect(map_keys(u.custom_tags), (SELECT MAX(map_keys(filtered_map)) FROM parsed_tagging_table))
,array_intersect(
TRANSFORM(
MAP_KEYS(custom_tags),
key -> CONCAT(key, '=', custom_tags[key])
), (SELECT MAX(KVCombos) FROM parsed_tagging_table)
)
)
)
) AS CompliantTagKeys,

array_except((SELECT MAX(map_keys(filtered_map)) FROM parsed_tagging_table), map_keys(u.custom_tags)) AS MissingTagKeys,
size(CompliantTagKeys) AS NumberOfCompliantKeys,
array_join(CompliantTagKeys, '_') AS TagPolicyKeys,

array_join(
CONCAT(transform(CompliantTagKeys, key -> CONCAT(key, '=', u.custom_tags[key])), -- Get Compliant keys without value pair
FILTER(CompliantTagKeys, x -> POSITION('=' IN x) > 0) -- Pull out the compliant values with the key pair
)
, ';') AS CompliantValues,

CASE
WHEN length(:tagging_raw_architecture) = 0 OR :tagging_raw_architecture IS NULL OR :tagging_raw_architecture = 'all' THEN 'Compliant'
WHEN NumberOfCompliantKeys >= TotalPolicyTags
THEN 'Compliant'
ELSE 'Not Compliant To Tag Policy' END AS IsTaggingCompliant
FROM system.billing.usage u
INNER JOIN px_all px ON px.sku_name = u.sku_name
WHERE
-- Date Filter
usage_date BETWEEN :usage_start AND :usage_end
AND
-- Product Filter
(CASE WHEN :product_category = 'all' THEN 1=1 ELSE
ARRAY_CONTAINS(split(lower(:product_category), ';'), lower(u.billing_origin_product))
END)

),

clean_usage AS (
SELECT usage_date, SUM(usage_quantity) AS DBUs,
SUM(DollarDBUs) AS DollarDBUs,
ROW_NUMBER() OVER (ORDER BY usage_date DESC) AS rn
FROM filtered_usage u
WHERE IsTaggingCompliant = 'Compliant'
GROUP BY usage_date
)

-- Final Query
SELECT
usage_date,
DollarDBUs,

AVG(DollarDBUs) OVER (
ORDER BY usage_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS avg_7d,
AVG(DollarDBUs) OVER (
ORDER BY usage_date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) AS avg_30d,
AVG(DollarDBUs) OVER (
ORDER BY usage_date
ROWS BETWEEN 89 PRECEDING AND CURRENT ROW
) AS avg_90d,
AVG(DollarDBUs) OVER (
ORDER BY usage_date
ROWS BETWEEN 364 PRECEDING AND CURRENT ROW
) AS avg_365d
FROM clean_usage -- Roll up to daily level (defined above); keep it nice and simple for OOTB use cases
-- Optional, depending on your needs: WHERE clause to filter rows
ORDER BY
usage_date;
  2. Summary Trailing Metrics:
/*
This query actually filters on the same tagging framework as the other query as well as date range and product category
and trends trailing usage metrics as well as moving averages for them
*/


WITH px_all AS (
SELECT DISTINCT
sku_name,
pricing.default AS unit_price,
CASE
WHEN sku_name LIKE ('%ALL_PURPOSE%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%DLT%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%JOBS%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
WHEN sku_name LIKE ('%SQL%') THEN (unit_price::float * (1-0.0)::float)::float -- Input SKU level discounts here
ELSE unit_price::float
END AS sku_price
FROM system.billing.list_prices
QUALIFY ROW_NUMBER() OVER (PARTITION BY sku_name ORDER BY price_start_time DESC) = 1
),


-- Now tag combos can be matched in 2 separate ways: key only, or the key=value pair if optionally provided
parsed_tagging_table AS (
WITH split_data AS (
SELECT
split(regexp_replace(:tagging_raw_architecture, '\\s+', ''), ';') AS kv_pairs
),
exploded_data AS (
SELECT
monotonically_increasing_id() AS order_id,
explode(kv_pairs) AS kv_pair
FROM
split_data
),
clean_keys AS (

SELECT
split(kv_pair, '=')[0] AS key,
COALESCE(split(kv_pair, '=')[1], '') AS value,
kv_pair AS combination,
CASE WHEN contains(kv_pair, '=') THEN 1 ELSE 0 END AS ContainsValuePair
FROM
exploded_data
),

aggregated_data AS (
SELECT
collect_list(key) as keys,
collect_list(value) as values,
collect_list(CASE WHEN ContainsValuePair = 1 THEN combination END) as KVCombos
FROM
clean_keys
),

clean_aggs AS (
SELECT
map_from_arrays(keys, values) AS result_map,
KVCombos
FROM
aggregated_data LIMIT 1
)

SELECT
-- map with only keys
MAP_FILTER(result_map, (key, value) -> value = '') AS filtered_map,
-- entries that need to match both key and value pairs
KVCombos
FROM clean_aggs
),


filtered_usage AS (
SELECT u.*,
usage_quantity*px.sku_price AS DollarDBUs,

-- Custom Tagging Filtering System
(SELECT MAX(size(map_keys(filtered_map)) + size(KVCombos)) FROM parsed_tagging_table) AS TotalPolicyTags,

-- When only keys are provided, match on the key alone, but when result_map has a value, check the whole k=v pair against the set
(
ARRAY_DISTINCT(
CONCAT(
array_intersect(map_keys(u.custom_tags), (SELECT MAX(map_keys(filtered_map)) FROM parsed_tagging_table))
,array_intersect(
TRANSFORM(
MAP_KEYS(custom_tags),
key -> CONCAT(key, '=', custom_tags[key])
), (SELECT MAX(KVCombos) FROM parsed_tagging_table)
)
)
)
) AS CompliantTagKeys,

array_except((SELECT MAX(map_keys(filtered_map)) FROM parsed_tagging_table), map_keys(u.custom_tags)) AS MissingTagKeys,
size(CompliantTagKeys) AS NumberOfCompliantKeys,
array_join(CompliantTagKeys, '_') AS TagPolicyKeys,

array_join(
CONCAT(transform(CompliantTagKeys, key -> CONCAT(key, '=', u.custom_tags[key])), -- Get Compliant keys without value pair
FILTER(CompliantTagKeys, x -> POSITION('=' IN x) > 0) -- Pull out the compliant values with the key pair
)
, ';') AS CompliantValues,

CASE
WHEN length(:tagging_raw_architecture) = 0 OR :tagging_raw_architecture IS NULL OR :tagging_raw_architecture = 'all' THEN 'Compliant'
WHEN NumberOfCompliantKeys >= TotalPolicyTags
THEN 'Compliant'
ELSE 'Not Compliant To Tag Policy' END AS IsTaggingCompliant
FROM system.billing.usage u
INNER JOIN px_all px ON px.sku_name = u.sku_name
WHERE
-- Date Filter
usage_date BETWEEN :usage_start AND :usage_end
AND
-- Product Filter
(CASE WHEN :product_category = 'all' THEN 1=1 ELSE
ARRAY_CONTAINS(split(lower(:product_category), ';'), lower(u.billing_origin_product))
END)

),

clean_usage AS (
SELECT usage_date, SUM(usage_quantity) AS DBUs,
SUM(DollarDBUs) AS DollarDBUs,
ROW_NUMBER() OVER (ORDER BY usage_date DESC) AS rn
FROM filtered_usage u
WHERE IsTaggingCompliant = 'Compliant'
GROUP BY usage_date
),

T7D_query AS (
(SELECT SUM(DollarDBUs) AS T7D FROM clean_usage WHERE rn <= 7)
),
T30D_query AS (
(SELECT SUM(DollarDBUs) AS T30D FROM clean_usage WHERE rn <= 30)
),
T90D_query AS (
(SELECT SUM(DollarDBUs) AS T90D FROM clean_usage WHERE rn <= 90)
),
T365D_query AS (
(SELECT SUM(DollarDBUs) AS T365D FROM clean_usage WHERE rn <= 365)
)


-- Final Query
SELECT DISTINCT
-- Calculate Trailing usage right here in subqueries
(SELECT MAX(T7D) FROM T7D_query) AS T7D_Dollars,
(SELECT MAX(T30D) FROM T30D_query) AS T30D_Dollars,
(SELECT MAX(T90D) FROM T90D_query) AS T90D_Dollars,
(SELECT MAX(T365D) FROM T365D_query) AS T365D_Dollars,
try_divide(T7D_Dollars, 7) AS DailyAvgT7D,
try_divide(T30D_Dollars, 30) AS DailyAvgT30D,
try_divide(T90D_Dollars, 90) AS DailyAvgT90D,
try_divide(T365D_Dollars, 365) AS DailyAvgT365D
LIMIT 1

