The New Databricks SQL Warehouse Advisor Dashboard

Databricks SQL SME
DBSQL SME Engineering
DBSQL Warehouse Advisor — Scaling Section

Author: Cody Austin Davis

Developers: Cody Austin Davis, Rob Wesley — Resident Solutions Architect @ Databricks

Introduction

The system.query.history table is now available in the system catalog in Databricks. This unlocks some amazing use cases, and we have already begun releasing dashboarding tools on top of it in this previous blog. In this post, we are going to highlight the new capabilities in the latest release of this dashboard, which any Databricks user with Unity Catalog enabled can download and use today! Before we get into the details, let's summarize the updates in this new release:

The dashboard is split into 4 sections:

  1. General Warehouse Intelligence — Monitor warehouse costs for all warehouses, analyze costs by creator, size, etc., and track warehouse configuration changes.
  2. Warehouse Scaling Monitor — Track how well a given warehouse scales for high concurrency. This is specifically helpful for doing production load tests and capacity planning for budgeting.
  3. Query Performance Deep Dive — Track warehouse SLA performance for query runtimes and queue times, monitor for potential bottlenecks, and pinpoint expensive/troublesome queries that are holding things up.
  4. Query Cost Attribution — Easily allocate the cost of a warehouse during a specific time period to the given queries (and query tags) that ran on the warehouse. This is an excellent, simple report for customer chargebacks or even internal use case budgeting.

New DBSQL Warehouse Advisor Release Updates:

  1. Materialized View Layer for Performance & Reusability — In this new release, the dashboard comes with a pre-baked Materialized View layer for the ‘gold tables’. This not only makes your dashboards much more performant (which becomes critical for larger dashboards), but it also makes it possible to reuse and scale the dashboard logic in other places such as AI/BI Genie Spaces (we will go through this in an upcoming article!). When you model and materialize your gold layers, you can register a Genie Space on top of them and use it in conjunction with your out-of-the-box dashboards. This is what AI/BI is all about: create expert-led UXs, then let end users extend them, ask novel questions, and share their insights on the same intelligence in one place. Materialized Views are a critical foundational feature for unlocking this pattern.
  2. New General DBSQL Warehouses Table (MV) — We added the ability to get more valuable metadata from the warehouses themselves. Now in the dashboard, you can track the size and other configurations of a warehouse, as well as view the full change history of any warehouse in your environment. This makes the intel already available richer and clearer.
  3. Optimizations in Query Cost Attribution — In the query cost allocation section (4), you have the ability to distribute the warehouse cost by a weighted runtime per query to do easy and granular chargebacks or business use case cost tracking. Some customers use DBSQL for very large reports or for data commercialization use cases that require this query-level cost attribution. In this version, we optimized this logic a bit by removing metadata/listing queries from the cost allocation since they do not use the warehouse resources directly.
  4. Simpler & More Granular Warehouse Tracking — Now you can query the dashboard by warehouse_id (as in the previous version) as well as by warehouse_name (the current warehouse name, since a warehouse can be renamed). You can also track usage of deleted warehouses, as well as many other valuable data points such as cluster owner, tags, etc.
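To make the weighted chargeback in update (3) concrete, here is a small Python sketch of the idea (the function and field names are ours, not part of the dashboard): warehouse cost is distributed in proportion to each query's resource time, and metadata/listing queries, which have zero task time, are excluded from the weighting.

```python
# Illustrative sketch of weighted cost allocation; helper and field names are
# hypothetical. Queries with zero task time (metadata/listing operations) are
# excluded because they do not use warehouse resources directly.
def allocate_warehouse_cost(warehouse_cost: float, queries: list) -> dict:
    # Keep only queries that actually used worker tasks
    weights = {q["id"]: q["resource_time_s"]
               for q in queries if q["total_task_s"] > 0}
    total = sum(weights.values())
    return {qid: warehouse_cost * w / total for qid, w in weights.items()}

costs = allocate_warehouse_cost(100.0, [
    {"id": "q1", "total_task_s": 50, "resource_time_s": 60},
    {"id": "q2", "total_task_s": 10, "resource_time_s": 40},
    {"id": "meta", "total_task_s": 0, "resource_time_s": 5},  # e.g. SHOW TABLES
])
# q1 is charged $60, q2 $40, and the metadata query nothing
```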

We hope you find this dashboard useful, and are open to any feedback on features & usability! Now let’s walk through the sections.

Prerequisites

  1. A Databricks environment with access to system tables
  2. The system.query.history system table enabled
  3. Download the following folder or import the DBSQL SME repo
  4. Follow the steps in the README in the above folder (run SQL data model, import dashboard, edit your default parameters, and publish!)

Materialized View Layer

For added performance and reusability in Genie Spaces for AI/BI (article coming next), we first create a materialized view layer that the dashboard will use, minimizing the logic/ETL required in the dashboard itself. For example, we can create our cleaned-up query history table like this:

CREATE SCHEMA IF NOT EXISTS main.dbsql_warehouse_advisor;
USE CATALOG main;
USE SCHEMA dbsql_warehouse_advisor;

CREATE MATERIALIZED VIEW IF NOT EXISTS main.dbsql_warehouse_advisor.warehouse_query_history
COMMENT 'SQL Warehouse Query History with cleaned up execution metrics and query tags'
SCHEDULE CRON '0 0 0 * * ? *'
TBLPROPERTIES ('pipelines.autoOptimize.zOrderCols' = 'warehouse_id,start_time')
AS (
  SELECT
    account_id,
    workspace_id,
    statement_id,
    executed_by,
    statement_text,
    compute.warehouse_id AS warehouse_id,
    execution_status,
    statement_type,
    COALESCE(client_application, 'Unknown') AS client_application,
    COALESCE(try_divide(total_duration_ms, 1000), 0) AS QueryRuntimeSeconds,
    COALESCE(try_divide(total_task_duration_ms, 1000), 0) AS CPUTotalExecutionTime,
    COALESCE(try_divide(execution_duration_ms, 1000), 0) AS ExecutionQueryTime, -- Included in cost per query
    COALESCE(try_divide(compilation_duration_ms, 1000), 0) AS CompilationQueryTime, -- Included in cost per query
    COALESCE(try_divide(waiting_at_capacity_duration_ms, 1000), 0) AS QueueQueryTime,
    COALESCE(try_divide(waiting_for_compute_duration_ms, 1000), 0) AS StartUpQueryTime,
    COALESCE(try_divide(result_fetch_duration_ms, 1000), 0) AS ResultFetchTime,
    -- Metric for query cost allocation -- excludes metadata operations
    CASE
      WHEN COALESCE(try_divide(total_task_duration_ms, 1000), 0) = 0 THEN 0
      -- Query total time is compile time + execution time
      ELSE COALESCE(try_divide(total_duration_ms, 1000), 0) + COALESCE(try_divide(compilation_duration_ms, 1000), 0)
    END AS TotalResourceTimeUsedForAllocation,
    start_time,
    end_time,
    update_time,
    COALESCE(read_bytes, 0) AS read_bytes,
    COALESCE(read_io_cache_percent, 0) AS read_io_cache_percent,
    from_result_cache,
    COALESCE(spilled_local_bytes, 0) AS spilled_local_bytes,
    -- Execution time varies across query types, so use total time to standardize
    try_divide(total_task_duration_ms, total_duration_ms) AS TotalCPUTime_To_Execution_Time_Ratio,
    COALESCE(try_divide(waiting_at_capacity_duration_ms, total_duration_ms), 0) AS ProportionQueueTime,
    AVG(try_divide(total_duration_ms, 1000)) OVER () AS WarehouseAvgQueryRuntime,
    AVG(try_divide(waiting_at_capacity_duration_ms, 1000)) OVER () AS WarehouseAvgQueueTime,
    AVG(COALESCE(try_divide(waiting_at_capacity_duration_ms, total_duration_ms), 0)) OVER () AS WarehouseAvgProportionTimeQueueing,
    -- Can be used for chargeback (as long as the denominator is only USED task time, not including idle time)
    CASE WHEN read_bytes > 0 THEN try_divide(read_bytes, (1024 * 1024 * 1024)) ELSE 0 END AS ReadDataAmountInGB,
    CASE WHEN read_io_cache_percent > 0 THEN 'Used Cache' ELSE 'No Cache' END AS UsedCacheFlag,
    CASE WHEN spilled_local_bytes > 0 THEN 'Spilled Data To Disk' ELSE 'No Spill' END AS HasSpillFlag,
    CASE WHEN read_bytes > 0 THEN 'Did Read Data' ELSE 'No Data Read' END AS ReadDataFlag,
    CASE WHEN CPUTotalExecutionTime > 0 THEN 'UsedWorkerTasks' ELSE 'NoWorkers' END AS UsedWorkerTasksFlag,
    --CASE WHEN QueryRuntimeSeconds >= :long_running_seconds_threshold::float THEN 'Long Running' ELSE 'Short Running' END AS LongRunningQueryFlag,
    CASE WHEN spilled_local_bytes > 0 THEN 1 ELSE 0 END AS Calc_HasSpillFlag,
    CASE WHEN read_bytes > 0 THEN 0.25 ELSE 0 END AS Calc_ReadDataFlag,
    CASE WHEN CPUTotalExecutionTime > 0 THEN 0.25 ELSE 0 END AS Calc_UsedWorkerTasksFlag,
    --CASE WHEN QueryRuntimeSeconds >= :long_running_seconds_threshold::float THEN 1 ELSE 0 END AS Calc_LongRunningQueryFlag,
    -- Query tagging: pull the tag out of a templated /*QUERY_TAG: ...*/ comment
    regexp_replace(
      regexp_extract(statement_text, r'/\*.*?QUERY_TAG:(.*?)(?=\*/)', 1),
      'QUERY_TAG:',
      ''
    ) AS raw_tagged
  FROM system.query.history
  WHERE compute.warehouse_id IS NOT NULL -- Only SQL warehouse compute
    AND statement_type IS NOT NULL
);

In this view, we do quite a bit of cleaning and feature addition (such as parsing query tags from comments). If this logic only lived in a query inside the dashboard itself, the dashboard would have to re-run it every time it is refreshed. That is wasteful, and exactly what materialized views are for! For example, system.query.history now contains more than just DBSQL warehouse queries. For our dashboard, we can improve performance by creating this MV (materialized view) to filter out the irrelevant data once a day instead of on every single query. This is a best practice for building AI/BI Dashboards (or any dashboard, really).
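The tag-parsing logic translates directly outside SQL as well. As a quick sanity check, here is the same regular expression exercised in Python (the helper function is ours; the pattern comes from the materialized view above):

```python
import re

# Same pattern as the regexp_extract in the materialized view: capture what
# follows QUERY_TAG: inside a /* ... */ comment.
TAG_PATTERN = re.compile(r'/\*.*?QUERY_TAG:(.*?)(?=\*/)', re.DOTALL)

def extract_query_tag(statement_text: str) -> str:
    """Return the tag from a /*QUERY_TAG: ...*/ comment, or '' if absent."""
    match = TAG_PATTERN.search(statement_text)
    return match.group(1).strip() if match else ""

print(extract_query_tag("/* QUERY_TAG: finance_daily */ SELECT * FROM sales"))
# finance_daily
```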

In the provided data model SQL script, there are more Materialized Views created. These are the following Materialized Views the dashboard will now use:

1. main.dbsql_warehouse_advisor.warehouse_query_history - Cleaned up query history
2. main.dbsql_warehouse_advisor.warehouse_usage - Warehouse billing + usage
3. main.dbsql_warehouse_advisor.warehouse_scaling_events - Warehouse concurrency scaling history
4. main.dbsql_warehouse_advisor.warehouse_scd - Warehouse edits/changes over time
5. main.dbsql_warehouse_advisor.warehouse_current - Current snapshot of warehouse state (including a deleted flag)

Section 1 — General Warehouse Intelligence

In our old dashboard version, we could only analyze costs and usage for one warehouse at a time, searching by warehouse_id. Now, in the new version, we can analyze all our warehouses in a single window, track costs, and track the edit history of our warehouses. Check out the new view below:

Section 1 — Warehouse Overview

We now have much more flexibility in our analysis: we can filter by warehouse name, warehouse creator, size, and deletion status, and even use a top N filter to look at usage for your top N warehouses over the selected period.

Drilling in further, we can also analyze total warehouse usage by warehouse owner and warehouse name in the visuals below:

Warehouse Usage By Owner, Name

Lastly, you can monitor warehouse changes over time directly in the dashboard like so:

Section 1 — Warehouse Change History

Overall, users have far more granularity and intelligence in this new version compared to the original, which only had warehouse_id and could only analyze one warehouse at a time.

Section 2 — Warehouse Scaling History

In the previous version of the warehouse scaling analysis section, we could analyze the scaling pattern of a warehouse, but we did not have the ability to see its configuration and name. Now we do! With this update, we can build more complete cost analyses, load test models, and efficiency analyses.

Scaling history with Warehouse info
Scaling History with current min/max boundaries

In this updated section, we now include the auto_stop time, min_clusters (the minimum number of clusters the warehouse can scale down to), max_clusters (the maximum number of clusters the warehouse can scale up to for concurrency), as well as the size. These new metrics, combined with the existing scaling tab, make it quite simple to build cost forecasts from production load tests/runs.
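As a rough illustration of turning a load test into a forecast (the function, inputs, and DBU rate below are placeholder assumptions, not dashboard outputs), you can multiply the time spent at each cluster count by a per-cluster DBU rate for the warehouse size:

```python
# Hypothetical cost-forecast sketch: scaling_history is a list of
# (hours_at_count, cluster_count) pairs from a load test; dbu_per_cluster_hour
# depends on warehouse size and is an assumed placeholder value here.
def forecast_dbus(scaling_history, dbu_per_cluster_hour):
    return sum(hours * clusters * dbu_per_cluster_hour
               for hours, clusters in scaling_history)

# 2h at 1 cluster + 1h scaled to 3 clusters, at an assumed 24 DBU/cluster-hour
estimate = forecast_dbus([(2, 1), (1, 3)], dbu_per_cluster_hour=24)
# 2*1*24 + 1*3*24 = 120 DBUs
```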

Section 3 — Query Performance Analysis

The query performance analysis section allows users to easily monitor warehouse performance, such as SLAs via P90/P95/P99 query runtimes and queue times, and includes a deep-dive section to find poorly performing or expensive queries.

In the initial section, we can first monitor performance for a given warehouse_id or name:

Section 3 — Warehouse Query History

We can see the current warehouse configuration (size, etc.) and monitor not just the average runtime of a query over time, but also the SLA percentiles (P95, P99, etc.). We can do the same with queue time. For example, we can quickly spot a huge spike in average runtime and double-click into it to investigate. More interestingly, we can begin to use these same dashboard queries as the basis for alerts to act proactively.
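The percentile SLAs themselves are simple to reproduce. Here is a minimal Python sketch of the P90/P95/P99 computation the charts are based on (the helper is ours, not from the dashboard):

```python
import statistics

# Compute SLA percentiles over a list of query runtimes in seconds.
def runtime_slas(runtimes_s):
    # quantiles with n=100 returns 99 cut points; index k-1 is the k-th percentile
    cuts = statistics.quantiles(runtimes_s, n=100, method="inclusive")
    return {"p90": cuts[89], "p95": cuts[94], "p99": cuts[98]}

slas = runtime_slas([float(s) for s in range(1, 101)])
# With runtimes of 1..100 seconds, p90 is about 90.1
```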

Section 3 — Query Performance Detail

In the rest of section 3, we can dive deep into metrics such as runtime by query_tag (templated comment in the materialized view), statement_type, status, and even source app. We can also filter by all these dimensions.

We can also identify expensive queries that may be bottlenecking warehouse performance. In the dashboard, we create flags that make it easy to spot problem queries by showing whether a query spilled to disk, read a high volume of data, or simply ran too long. Additionally, we can track the caching patterns of the warehouse over time.

Section 3 — Query Bottleneck analysis
Section 3 — Query Allocation Detail

Lastly, in our query performance deep dive, we can allocate the cost of a query by distributing the usage of the warehouse in a given time period by the total share of resources the query took from the warehouse. In this deep dive, you can analyze costs by query tag and dig into specific queries to identify bottlenecks and go into the deepest level of detail (is it scanning too much data? Queueing and never running? Spilling data? etc.).

Section 4 — Query Cost Attribution

Lastly, this section is much more simplified and aims to do one thing: tell you where your warehouse spend should be allocated by query/use case (tag).

Here, the user can choose a warehouse and track the costs of unique queries, monitoring not just cost per query, but costs for queries and query tags that run repeatedly, down to the user and runtime. On the backend, the materialized views create a ‘query signature’ (a hash of the query text with the comments taken out) and then distribute the total cost of a warehouse during a period of time by the total CPU hours each query utilized.
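A minimal sketch of that query-signature idea in Python (our own simplification, not the exact dashboard logic): strip the comments, normalize whitespace, and hash what remains, so repeated runs of the same statement with different tags collapse into one signature.

```python
import hashlib
import re

def query_signature(statement_text: str) -> str:
    # Remove /* ... */ comments (including the QUERY_TAG comment)
    stripped = re.sub(r'/\*.*?\*/', '', statement_text, flags=re.DOTALL)
    # Normalize whitespace and case so trivial differences don't split signatures
    normalized = ' '.join(stripped.split()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

a = query_signature("/*QUERY_TAG: finance*/ SELECT 1")
b = query_signature("/*QUERY_TAG: marketing*/   select 1")
# a == b: same statement, different tags and whitespace
```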

Section 4 — Parent analysis

The first part of the section does a simple Pareto analysis. You can specify what percentage of top usage you want to look at, and it will group the ‘long tail’ of remaining queries into a single bucket. This enables a more concentrated top-N analysis. For example, above we see that the most expensive query took up 8.46% of the used warehouse time.
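The bucketing behaves roughly like this sketch (the function and threshold are illustrative, not the dashboard's exact implementation): accumulate queries in descending cost order until the chosen top share is reached, then lump the remainder into a long-tail bucket.

```python
# Illustrative Pareto bucketing: query_costs maps query signature -> allocated
# cost; top_share is the fraction of total spend to break out individually.
def pareto_bucket(query_costs, top_share=0.8):
    total = sum(query_costs.values())
    top, running = {}, 0.0
    for sig, cost in sorted(query_costs.items(), key=lambda kv: -kv[1]):
        if running / total >= top_share:
            break  # everything else goes into the long tail
        top[sig] = cost
        running += cost
    return top, total - running

top, tail = pareto_bucket({"a": 50, "b": 30, "c": 15, "d": 5})
# top = {"a": 50, "b": 30}; tail = 20
```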

We can then use the next section to get a crystal clear view of where our warehouse spend is going:

Section 4 — Detailed query cost attribution

Here we can allocate warehouse spend by query tag (built into the data model by parsing templated comments /*QUERY_TAG: <tag_info>*/) and generate a simple report like the table in the screenshot above. We can see the query text (a representative query for each unique query signature), the directly allocated dollars, and the % of total costs, as well as the cost per run and a user-level breakdown. You can quickly see where the costs are going and easily identify usage down to the execution level. This kind of report is easily automated for chargebacks.

Conclusion

Now that the query.history table is available more broadly, give this dashboard a try and feel free to adapt it for your own use cases! We hope we have shown some good best practices for building dashboards, as well as given you a concrete tool to help you get up and running as easily as possible.

Coming up next, we will take this data model that the dashboard is now built on with materialized views, and build a comprehensive Genie Data room to help ops personas self-serve their own questions and business needs with these rich datasets.

NOTE: This dashboard is not an official Databricks product offering. It is an open-source dashboard template built by members of the DBSQL SME group. For changes, bugs, or feature requests, comment on this article or message us directly.
