Lakeview Dashboards for Observability Series — Article #1 — DBSQL Warehouse Advisor

Databricks SQL SME
DBSQL SME Engineering
12 min read · May 18, 2024
DBSQL Warehouse Advisor Dashboard Template — Scaling Summary

Author: Cody Austin Davis

Lakeview Dashboard Template Series — Article 1 — DBSQL Warehouse Advisor

Intro

Lakeview Dashboards are a new lightweight dashboarding tool that comes out of the box with Databricks SQL. Having just gone GA, they already offer a lot of capability for quickly developing and sharing dashboards, with no additional cost or management overhead. For example, one of the most critical needs for nearly every customer is observability of their data platform. Because most customers have similar monitoring and observability needs for certain use cases, I decided to publish some of these Lakeview templates for anyone to import and use in their own environment, to save you the work and to show what Lakeview Dashboards can do! In this article, I will share and walk through the DBSQL Warehouse Advisor template. We will briefly review the 4 sections of the dashboard, show you how to use them, and discuss some of the core questions you can answer with each section.

First, at a high level, let's cover what this dashboard does and focuses on. As the first of many templates to be shared, I want to give context for when and how you would use it.

The DBSQL Warehouse Advisor Lakeview Dashboard aims to tell users everything they need to know about their data warehouse on DBSQL. It automatically parses and analyzes the warehouse_events, billing, and query_history (system table private preview, API is available now in GA) system tables to help you do the following:

  1. Section 1 — Billing Monitor — This section monitors the system.billing.usage system table to summarize costs over time. It also provides an embedded momentum indicator to show when the warehouse scales up too much, too fast. Since BI workloads in particular can be spiky, a smoothed momentum indicator gives you a better metric for cost alerting with less false-positive noise. You can then use this query out of the box to set up alerts!
  2. Section 2 — Scaling Monitor — This section monitors the system.compute.warehouse_events system table to help you summarize the scaling cost and efficiency of your warehouse over a time frame. It can be hard to forecast and estimate usage for spiky / high-concurrency workloads. This visual makes it simple to summarize exactly how much time the warehouse spent at different scaling levels, which makes planning much easier.
  3. Section 3 — Query Performance Analysis — This is a huge section! Using the new (private preview) query_history system table (you can also build it yourself with the Query History API), you can summarize the performance of your warehouse as a whole, and even dig into the details to find troublesome queries and bottlenecks, track SLAs, and see where warehouse time is spent. This has everything from high-level monitoring down to individual query performance tuning, all in one dashboard.
  4. Section 4 — Query Cost / Time Attribution Summary — If you use a warehouse to serve end customers, whether internal or external, and you need to figure out how to charge them on a consumption-based pricing model, you can use this section to allocate warehouse time to your end customers at the query, user, and even query-tag level! This is crucial for data products, as well as for a centralized data team serving many downstream teams.

Now that we have covered the high-level summary, let's go through the sections. If you want to follow along, all you have to do is download the template, import the dashboard into your environment, and plug in your warehouse ID!* When you import the template, you can see the full dashboard visuals as well as all the backend queries and parameters driving the dashboard. You can then alter and extend the template however you need for your specific use cases.

  • *Note: If you are not in the system tables private preview for query_history, you can also create the table with the query history API until Public Preview.

Section 1 — Billing Monitor

Warehouse Advisor — Billing Monitor

This is the billing monitor for the DBSQL Warehouse Advisor. This section is simple — put a warehouse ID into the parameter, select a date range, and trend DBUs and dollars over time, along with a big-number summary of the total cost within that time frame. Additionally, there is a Cluster Unit Price parameter so you can plug in your own rate if it differs from the list rate of the DBSQL warehouse.
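To make the mechanics concrete, here is a minimal sketch of the kind of query that could drive this trend, assuming the documented system.billing.usage columns and illustrative named parameters (:warehouse_id, :cluster_unit_price, :start_date, :end_date). The template's actual SQL may differ.

-- A minimal sketch of a DBU and dollar trend for one warehouse, not the template's exact SQL.
SELECT
  usage_date,
  SUM(usage_quantity)                                  AS dbus,
  ROUND(SUM(usage_quantity) * :cluster_unit_price, 2)  AS estimated_dollars  -- $/DBU rate you plug in
FROM system.billing.usage
WHERE usage_metadata.warehouse_id = :warehouse_id
  AND usage_date BETWEEN :start_date AND :end_date
GROUP BY usage_date
ORDER BY usage_date;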

Next, you can also track a momentum measure of usage that is smoothed over time, so that you can avoid over-reacting to spiky usage. Especially for high-concurrency BI workloads, admins often want to monitor and alert when usage trends up by a certain percentage in a given time period. You can do that with this momentum indicator visual:

DBSQL Cost Momentum Indicator

This, combined with the "High Increase % Barrier" parameter above, allows you to define a threshold that deems your usage pattern a "high usage increase trend" and highlights it in red. You can then use the exact underlying SQL query to set up an alert for cost trends by simply adding a LIMIT 1 to return the most recent value. This indicator chart says: "Show me when the 24-hour rolling moving average increases more than 20% from the previous period." This is great for sending out alerts for cost increases that are more durable and serious than the occasional spike.
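As a rough illustration of that pattern, here is a hedged sketch of a 24-hour rolling momentum check against system.billing.usage with a LIMIT 1 for alerting; the template's underlying query may differ, and the parameter name is illustrative.

WITH hourly AS (
  SELECT
    date_trunc('HOUR', usage_start_time) AS usage_hour,
    SUM(usage_quantity)                  AS dbus
  FROM system.billing.usage
  WHERE usage_metadata.warehouse_id = :warehouse_id
    AND usage_start_time >= current_timestamp() - INTERVAL 7 DAYS
  GROUP BY 1
),
rolling AS (
  SELECT
    usage_hour,
    AVG(dbus) OVER (ORDER BY usage_hour ROWS BETWEEN 23 PRECEDING AND CURRENT ROW) AS ma_24h
  FROM hourly
)
SELECT
  usage_hour,
  ROUND(ma_24h, 2) AS ma_24h_dbus,
  ROUND(100 * (ma_24h / LAG(ma_24h, 24) OVER (ORDER BY usage_hour) - 1), 2) AS pct_change_vs_prior_period
FROM rolling
ORDER BY usage_hour DESC
LIMIT 1;  -- most recent value only, ready to wire into a DBSQL alert (e.g. fire when > 20)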

Section 2 — Warehouse Concurrency Scaling Summary

This next section makes it easier to interpret your warehouse's concurrent usage, as the chart in the "Monitoring" tab in DBSQL can get pretty spiky and be hard to plan with. Especially when using Serverless SQL, it can be hard to take this chart…

Example DBSQL Warehouse Monitoring Tab

… and turn it into a budget and expectations for how much this workload / warehouse is going to cost over the long term. Moreover, it can be hard to perform scenario analysis to see what happens when you change cluster sizes, concurrency settings, query performance, etc. With this tab, you can create clearer forecasts and expectations for usage.

DBSQL Warehouse Advisor — Scaling Trends

This dashboard automatically takes that tough-to-interpret monitoring tab and pivots it into a visual that shows more clearly how often the warehouse needs to scale to N concurrent clusters. Zooming in a bit more, we can see exactly how many minutes, and what proportion of total time, the warehouse spent scaled up to N clusters during a set time period.

Scaling Patterns

We can clearly see that the warehouse needs 1 cluster about 80% of the time, 2 clusters about 13% of the time, 3 clusters ~7% of the time, and 4 clusters <1% of the time. This makes it easier not only to forecast what this spiky workload will cost over time, but also to plan for and interpret changes in usage patterns. This information is crucial for planning any kind of production workload on DBSQL.
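For reference, here is a hedged sketch of how such a breakdown could be derived from system.compute.warehouse_events (warehouse_id, cluster_count, event_time); the template's actual query may differ.

WITH events AS (
  SELECT
    event_time,
    cluster_count,
    LEAD(event_time) OVER (ORDER BY event_time) AS next_event_time
  FROM system.compute.warehouse_events
  WHERE warehouse_id = :warehouse_id
    AND event_time BETWEEN :start_date AND :end_date
)
SELECT
  cluster_count,
  ROUND(SUM(timestampdiff(SECOND, event_time, next_event_time)) / 60.0, 1) AS minutes_at_this_level,
  ROUND(100 * SUM(timestampdiff(SECOND, event_time, next_event_time))
          / SUM(SUM(timestampdiff(SECOND, event_time, next_event_time))) OVER (), 1) AS pct_of_total_time
FROM events
WHERE next_event_time IS NOT NULL
GROUP BY cluster_count
ORDER BY cluster_count;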

If you optimize a workload, run a POC, or add a use case, you can use this dashboard to see exactly how it affects your usage patterns so that you can be in control and plan ahead. This is a small section but immensely useful.

Section 3 — Query Performance Analysis

This is the largest section and has everything from high-level performance metrics all the way down to query-level details on runtime, disk spill, data reads, etc. All the details of this section could (and will) be an entirely separate blog, so in this article we will focus on the high-level metrics it contains, as well as a few common reasons you might use this section.

In this section, we are primarily concerned with query runtime and queue time. We will first look at these metrics at a very high level as a single big number summary, and then dig into the trends along with SLA charts like so:

Query Performance Analysis Parameter Selection Box
Avg Query Runtime Metrics
Avg Query Queue Time Metrics

We can see how query runtime and queue time trend, so it's easy to spot when the warehouse becomes more or less strained and quickly find issues. However, it's not enough to simply take the average runtime for real workloads; you also need to track query time against your SLA, which requires calculating standard percentile metrics such as P90, P95, and P99. This dashboard calculates and shows these metrics for you out of the box. With the above visuals, you can track the average query runtime over time to find the exact moments when query performance issues arise (see the big spike in the middle? That's probably an anomaly we should look at). You can do the same thing with query queue time to track when the warehouse is saturated and may need more concurrency. In practice, you need to use both of these in tandem to make sure you size your clusters to the SLA requirements of your use case.
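As a rough sketch of the underlying calculation, the percentile metrics can be computed like this, assuming the query_history system table (system.query.history) with total_duration_ms, waiting_at_capacity_duration_ms, and compute.warehouse_id; if you build the table yourself from the Query History API, adjust the names accordingly.

SELECT
  date_trunc('HOUR', start_time)                                     AS query_hour,
  ROUND(AVG(total_duration_ms) / 1000, 2)                            AS avg_runtime_s,
  ROUND(percentile(total_duration_ms, 0.95) / 1000, 2)               AS p95_runtime_s,
  ROUND(percentile(total_duration_ms, 0.99) / 1000, 2)               AS p99_runtime_s,
  ROUND(AVG(waiting_at_capacity_duration_ms) / 1000, 2)              AS avg_queue_s,
  ROUND(percentile(waiting_at_capacity_duration_ms, 0.99) / 1000, 2) AS p99_queue_s
FROM system.query.history
WHERE compute.warehouse_id = :warehouse_id
  AND start_time BETWEEN :start_date AND :end_date
GROUP BY 1
ORDER BY 1;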

For example, let's say you have a downstream data application powered by Databricks SQL. Users expect the app to perform most queries in 1–2 seconds or less, and they will not tolerate anything longer than 10 seconds. You can first check your SLA compliance by looking at the P99 and P95 percentile metrics in the query runtime chart:

Query SLA Tracking

This shows us that we are looking great on SLA adherence. Most of our queries (P95) run in under 1 second, and our very worst queries run in about 5 seconds, which is still under what our end users will accept. When these numbers rise above acceptable levels, we can begin to look at optimizing our queries or checking the saturation of the warehouse with the queue time visual:

Query Queue Time Tracking

Now we are within the required SLA, but what if our SLA were 3 seconds for P99? We would be in trouble. The first thing we should do is look at our query queue time in the dashboard above. Queue time is the number of seconds each query spends queued, waiting for an available slot to run. Queueing is fine if we are hitting our SLAs, but if we aren't, we are wasting precious time and need to add more concurrency! This visual is interpreted differently than the average query runtime, because we generally always want queueing to be low (~0 seconds). If it gets much higher than that (>1 second), our queries are beginning to pile up and sit in the queue, which impacts our SLA. In our example with the updated 3-second SLA, our P99 is currently at 5 seconds, which is not acceptable. In the above visual, we see that many of these queries can spend almost 2 seconds in the queue! This shows that increasing concurrent capacity (the min/max clusters setting in DBSQL) can help us get closer to our SLA.

Remember — always start with your desired SLA and work your way backwards.

In addition to the high level performance metrics, this dashboard has many more helpful metrics!

Query Performance Details

We can trend query performance by status, source, and statement type over time, and even monitor average query runtime by query tag. This is embedded functionality in the dashboard that looks for templated comments with the following signature:

/*QUERY_TAG: <insert_comment>*/

This can help focus the entire performance analysis on your business-specific use case, which is especially important in ETL workloads.
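Here is a hedged sketch of how tagging and rollup could look, with an illustrative tag value and an illustrative regex; the template's actual parsing and column names may differ.

-- Tagging a query (any query works; the tag value is illustrative):
/*QUERY_TAG: nightly_sales_etl*/
SELECT COUNT(*) FROM samples.nyctaxi.trips;

-- Rolling runtime up by tag from query history:
SELECT
  TRIM(regexp_extract(statement_text, 'QUERY_TAG: ([^*]+)', 1)) AS query_tag,
  COUNT(*)                                                      AS query_count,
  ROUND(AVG(total_duration_ms) / 1000, 2)                       AS avg_runtime_s
FROM system.query.history
WHERE compute.warehouse_id = :warehouse_id
  AND statement_text LIKE '%QUERY_TAG:%'
GROUP BY 1;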

You can dive even deeper. With the "Query Performance Detail" sub-section, you can identify common issues and bottlenecks in queries. In the dashboard, we automatically calculate flags to help you identify queries that are long running (you can define what "long running" means for your context), queries with spill, and queries that read unusually high data volumes.
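A hedged sketch of how such flags could be computed is below; the thresholds are illustrative, and the column names assume the system.query.history schema.

SELECT
  statement_id,
  executed_by,
  ROUND(total_duration_ms / 1000, 2)  AS runtime_s,
  spilled_local_bytes,
  read_bytes,
  total_duration_ms > 60 * 1000       AS is_long_running,  -- > 60 s; pick your own bar
  spilled_local_bytes > 0             AS has_spill,
  read_bytes > 100e9                  AS is_high_read      -- > 100 GB read (illustrative)
FROM system.query.history
WHERE compute.warehouse_id = :warehouse_id
  AND start_time BETWEEN :start_date AND :end_date
ORDER BY total_duration_ms DESC;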

Query Performance Detail Section

Further down, you can calculate attributed cost per query, group queries by query tag (defined by default above, but you can change this template to fit anything you need!), and even see a full output of the query-level detail to investigate the root causes of performance issues. In the dashboard, you can also see attributed warehouse costs by user, compute usage % by user, and query count by user.

This is just the out-of-the-box starting point, so feel free to import, explore, and extend this dashboard to do what you need for your use cases!

Section 4 — Query Cost Attribution Summary

This section addresses a common question we get from customers with external data products who want query-level cost visibility so they can develop sensible pricing models for those products.

Query Time/Cost Allocation Pareto Analysis

In this chart, the query attribution has already been done for you. You can define your warehouse_id, pick a date range, and define the "Pareto point" you want to see. This lets you see the queries that make up the top X% of your warehouse's total compute time. For example, the above visual shows a SQL query that took up about 20% of the warehouse compute time during this time window. The Pareto analysis lets you focus on the queries that drive the largest portion of your warehouse utilization, filtering out the "long tail" of queries that are smaller and less differentiated.

Not only is there a summary, but the dashboard also has a query-level detail report for doing granular pricing of your data products:

Query Attribution Details

In this report, you can easily see where time is going, and even break it down by end user and tag! The User Run breakdown provides an output of each user that runs a given query, along with the number of executions and the error rate for each. This is all the detail you need for highly transparent cost attribution for your workloads.
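A hedged sketch of a per-user breakdown like this is below; the column names and execution_status values assume the system.query.history schema.

SELECT
  executed_by,
  COUNT(*)                                                                      AS executions,
  ROUND(100 * AVG(CASE WHEN execution_status = 'FAILED' THEN 1 ELSE 0 END), 2)  AS error_rate_pct
FROM system.query.history
WHERE compute.warehouse_id = :warehouse_id
  AND start_time BETWEEN :start_date AND :end_date
GROUP BY executed_by
ORDER BY executions DESC;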

A few important definitions to note:

  • Warehouse Execution Time = total query runtime of all queries during a time window
  • Query Allocated Cost = Query Execution Time / Warehouse Execution Time
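Here is a hedged sketch implementing those definitions: each query's share of total warehouse execution time, plus a cumulative share you can filter on for the Pareto cut-off. Column names assume the system.query.history schema; the template's query may differ.

WITH query_time AS (
  SELECT
    statement_id,
    executed_by,
    total_duration_ms / 1000.0 AS query_execution_s
  FROM system.query.history
  WHERE compute.warehouse_id = :warehouse_id
    AND start_time BETWEEN :start_date AND :end_date
)
SELECT
  statement_id,
  executed_by,
  ROUND(query_execution_s, 1)                                         AS query_seconds,
  ROUND(100 * query_execution_s / SUM(query_execution_s) OVER (), 2)  AS pct_of_warehouse_time,
  ROUND(100 * SUM(query_execution_s) OVER (ORDER BY query_execution_s DESC)
          / SUM(query_execution_s) OVER (), 2)                        AS cumulative_pct  -- filter to <= your Pareto point
FROM query_time
ORDER BY query_seconds DESC;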

DBSQL Warehouse Advisor — DBT Version

In addition to the out-of-the-box DBSQL Warehouse Advisor template, we also have a DBT-flavored version that understands the DBT metadata in the queries that DBT pushes down to DBSQL. This gives users more insight into their DBT models, and even individual node IDs, out of the gate.

Query Profile Analysis — DBT Version

Now, in Sections 3 and 4 of the Query Analysis, you can analyze the performance and query allocation metrics by familiar metadata such as DBT target, profile name, or even node ID. This lets you see exactly where your resources are going for your DBT models.
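As a rough illustration, here is a hedged sketch of slicing query history by DBT metadata, assuming DBT's default query-comment JSON (fields such as node_id and target_name) is present in statement_text; the template's actual parsing may differ.

SELECT
  regexp_extract(statement_text, '"node_id": "([^"]+)"', 1)     AS dbt_node_id,
  regexp_extract(statement_text, '"target_name": "([^"]+)"', 1) AS dbt_target,
  COUNT(*)                                                      AS query_count,
  ROUND(SUM(total_duration_ms) / 1000, 1)                       AS total_runtime_s
FROM system.query.history
WHERE compute.warehouse_id = :warehouse_id
  AND statement_text LIKE '%"app": "dbt"%'
GROUP BY 1, 2
ORDER BY total_runtime_s DESC;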

Queries allocated by time and sliced by metadata in DBT
Where is the Warehouse Compute time going in my DBT models?
Query-level detail of query attribution for DBT models

With this dashboard, you can automatically get pretty far in understanding and monitoring your DBT pipelines on DBSQL with Lakeview! From internal analytics to driving full data products, you have complete control and visibility over your data platform.

What's Next

This is the first of 4 dashboard templates in this series, with the goal of helping Databricks users leverage system tables and Lakeview Dashboards to their fullest for common and powerful use cases. Not only do we want to give users some great out of the box functionality, but these templates can also serve as a reference for interesting and powerful design patterns for dashboards that you may want to build in the future!

If you’re interested in a deep-dive follow up on the SQL logic that powers the dashboards and the visual/parameter selection, comment and let us know!

Lakeview Templates:

  1. DBSQL Warehouse Advisor
  2. DBT Version — DBSQL Warehouse Advisor

Instructions to use:

  1. Download the JSON file in the above links
  2. Go to the “Dashboards” tab in your Databricks Workspace
  3. Click the arrow in the blue "Create dashboard" button in the top right corner
  4. Select “Import dashboard from file” and upload the downloaded JSON template
  5. Load the dashboard and plug in your own warehouse ID!

