Tune Query Performance in Databricks SQL with the Query Profile

Databricks SQL SME
DBSQL SME Engineering
15 min read · Nov 8, 2023
Author Screenshot: DBSQL Query Profile Example

Author: Cody Austin Davis

Intro: Gone are the days when Spark and Databricks were only a big data processing engine. 2–3 years ago, you had to be an expert in Spark and big data processing to run even high-level SQL queries on Spark, including on Databricks. Today, that is no longer the case. With the release of Databricks' new execution engine, Photon, and the world's first serverless lakehouse, DBSQL, users can optimize and tune queries in seconds without ever needing to know what a Spark UI is. In this article, we will walk through the Databricks SQL Query Profile. We will describe what it is and the specific benefits it offers for ad-hoc users and production systems, and we will show some concrete examples of how to use it. More specifically, we will cover the following:

  1. Basic Query Profile Use Case — File Pruning a Selective Query
  2. Looking at More Complex Queries
  3. Optimizing Tables with Delta Liquid Clustering for Simpler Performance OOTB
  4. Identifying Bugs in SQL code using the Profile

What is the Query Profile in DBSQL?

The Databricks SQL product is one of roughly three product “arenas” in Databricks. Since Databricks is organized by persona, all Databricks SQL-related features are located under the “SQL” tab for the analyst persona. Within that tab, you will find your queries, dashboards, DBSQL alerts, and your Query History. The “Query History” tab is where you will find the history of all queries run in the workspace that you, as a user, have access to.

Finding the Query Profiles under Query History in a Databricks Workspace

Additionally, we can also go to the “Monitoring” tab in a specific DBSQL Warehouse as shown here:

Clicking on the Monitoring Tab of a DBSQL Warehouse

Scrolling to the bottom of the page takes us to the Query History for a particular DBSQL Warehouse. This is particularly nice in production systems where you want to isolate warehouses by system, environment, or set of use cases. This setup makes it easy to see what is running and where.

Query History for a specific DBSQL Warehouse
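If you prefer SQL over the UI, query history is also exposed as a system table in many workspaces. Here is a rough sketch, assuming the system.query.history system table is enabled in your account (the exact column names may differ by release), of listing the slowest statements from the last day:

-- Hypothetical sketch: surface the slowest queries from the last day.
-- Assumes system.query.history is enabled; column names such as
-- total_duration_ms, statement_text, executed_by, and start_time may vary.
SELECT
  statement_text,
  executed_by,
  total_duration_ms
FROM system.query.history
WHERE start_time >= current_timestamp() - INTERVAL 1 DAY
ORDER BY total_duration_ms DESC
LIMIT 20;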

No matter which way you get there, you can now simply click on a query and check out its profile! To show some easy tips and tricks for recognizing bottlenecks and tuning queries, we will run a few queries on the common TPC-DI dataset. We will walk through tuning a few specific examples without ever needing to look at a Spark UI. We will start with a completely un-optimized data model loaded into Databricks SQL (no tuning at all), and then use the Query Profile to quickly identify bottlenecks and pinpoint what is happening to guide our performance tuning decisions. The Query Profile has just about every metric a user could pull from the Spark UI; the difference is that the profile lets a user start simple and only access the details if they need or want them. This makes the profile powerful for new users who are just getting started, while still surfacing all the information DBA professionals would want to tune production queries.

Let's start with a simple query:

--===== QUERY 1: Point selection -- look for specific trade id

SELECT
h.currenttradeid,
h.currentprice,
h.currentholding,
h.currentholding*h.currentprice AS CurrentMarketValue,
c.lastname,
c.firstname,
c.status
FROM main.tpc_edw_demo.factholdings h
INNER JOIN main.tpc_edw_demo.dimcustomer c ON c.sk_customerid = h.sk_customerid
INNER JOIN main.tpc_edw_demo.dimcompany comp ON comp.sk_companyid = h.sk_companyid
WHERE h.currenttradeid = 527764963
AND c.status = 'Active'

In this example, we select from a decently large fact table, factholdings, do a few joins, and select a specific currenttradeid where the status is active. This is a classic highly selective query of the kind often seen in data applications used for operational use cases. Let's take a look at the Query Profile this generates for us. It comes in two sections: graphical and metrics. Let's pick apart one section at a time; both give us great information that we should always start with when tuning a query.

Graphical View:

Graphical View of the Query Profile

The graphical view shows us the visual order of operations that the query goes through to build the result you ask for. We generally read these from the bottom up, but the Query Profile makes identifying bottlenecks much easier by visually highlighting time spent and/or rows in blue. If you are zoomed in close, this looks like the blue bars above. If you are zoomed out, it becomes a heat map of the query that immediately shows where the query is spending all its time. So let's cover a few key things that we should be able to spot quickly in the UI above:

  1. Time spent per node: This is almost always the starting point when tuning a query in the profile. From there, we can quickly identify the bottleneck and begin to think through causes. In our example above, the longest-running node in the query is the scan node for the table “factholdings”. The scan node for dimcompany was pretty long as well.
  2. Rows output at each node: Now look at the output in rows for each node. The dimcustomer table outputs about 20 million records, yet takes less time to process than scanning the factholdings table. What is most interesting is that our bottleneck, factholdings, only outputs 1 row. This is our first major clue that we identified in seconds with the Query Profile.

Now we have some basic information, and at this point we already know where to look. Next, we can dive into the 2nd part of the Query Profile: the Metrics.

Metrics View of Query Profile

Shown above is the metrics section of our initial query, and we have some great starting info here. Some key, easy metrics we can look at are Total Wall Clock Duration, Rows/Bytes Read (how much data are we dealing with? a key question), Bytes Written, Files/Partitions Read, and more. These are the starting points and can generally help you tune most queries. Back to our example above, we noticed a key thing just from the graph: our bottleneck is the same node that returns the fewest rows. Note that this query takes about 2.8 seconds to run. This is a classic needle-in-a-haystack problem. Generally, it means that the query engine is reading far more files than it actually needs for the result. Solving this in DBSQL and Delta is simple, but let's first confirm that this is the problem by going one level deeper. Let's click on the bottleneck “Scan factholdings” node we identified above.

Author Screenshot: Clicking into a node in the Query Profile

Now that we have clicked on our problem node, the Query Profile pulls up a bunch of helpful metrics. The only metrics we really care about here are Files Pruned, Files Read, Size of Files Read/Pruned, and Number of Output Rows. This tells us just about the whole story of what is going on. Note that in this scan, we are reading 448 files and pruning 0 (0% file skipping). We are also reading 50 GB worth of data simply to get a single row. This is our problem. The answer is that we need to tell our table how our data is organized so the engine can skip the files it does not need (in our case, we need 1 row out of 50 GB of data). We definitely only need one file.
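If you want to confirm the table layout outside the profile, a quick sketch using standard Delta commands (nothing specific to this demo) is to inspect the file count and size of the table directly:

-- Inspect the physical layout of the Delta table: numFiles, sizeInBytes,
-- and any partition or clustering columns currently defined.
DESCRIBE DETAIL main.tpc_edw_demo.factholdings;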

The solution is a ZORDER index, or CLUSTER BY (Databricks' new Liquid Clustering feature), on the columns we filter on most in the query. In our case, the query was filtering on currenttradeid, and that filter was very selective given that we only needed 1 record. So we can run:

OPTIMIZE main.tpc_edw_demo.factholdings ZORDER BY (currenttradeid);

or for Liquid Clustering Delta Tables:

ALTER TABLE main.tpc_edw_demo.factholdings CLUSTER BY (currenttradeid);
-- Run OPTIMIZE to cluster existing data, or new data that was not clustered on write
/*
Clustering on write is a best effort application, and is not applied in the following situations:

If a write operation exceeds 512GB of data.

If the SELECT subquery contains a transformation, filter, or join.

If projected columns are not the same as the source table.

*/
OPTIMIZE main.tpc_edw_demo.factholdings;

IMPORTANT NOTE on Liquid Clustering: defining the cluster keys is now declarative in the table DDL, which is different from ZORDER. Liquid Clustering will “cluster on write” automatically for the operations above, but for complex operations or larger inserts, you will need to run OPTIMIZE <table_name> to cluster the table by the declaratively defined keys.
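As a quick, optional check (standard Delta commands, not part of the profile itself), you can confirm the clustering keys and that the OPTIMIZE actually ran:

-- clusteringColumns in DESCRIBE DETAIL shows the declared liquid clustering keys
DESCRIBE DETAIL main.tpc_edw_demo.factholdings;

-- DESCRIBE HISTORY shows the OPTIMIZE operation and its metrics
-- (for example, files added/removed in operationMetrics)
DESCRIBE HISTORY main.tpc_edw_demo.factholdings LIMIT 5;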

Now let's run the query again and see what our Query Profile shows:

Optimized Query Profile Screenshot

We went from almost 3 seconds to a little over 1 second for this highly selective query, and we were able to identify the bottleneck quickly via the Query Profile! Not bad for 1–2 minutes of work. As queries in the system get more complex, the Query Profile becomes all the more important for tuning them quickly and easily.

There are a ton of great features in the Query Profile, so let's look at a more interesting example. In this one, we will use Liquid Clustering for simpler table optimization and show how to use the profile to easily identify bugs that cause performance issues.

Let's say we have 3 tables: factholdings, dimcustomer, and dimcompany. dimcustomer is an SCD Type 2 table, meaning it has updates that are “soft deleted” and expired over time in the table. These 3 tables represent the holdings that a particular customer has in a company over time. Let's now say we are an analytics team for a fund manager that needs a report of the top 100 active customers based on holdings. Keeping the nuances of finance out of the problem, let's devise a basic SQL statement to build this report:


--- Top 100 Largest Active Customer Base
WITH year_selected_holding AS (

SELECT
h.tradeid,
h.currentprice,
h.currentholding,
h.currentholding*h.currentprice AS CurrentMarketValue,
c.lastname,
c.firstname,
c.status,
comp.name AS company_name,
to_date(sk_dateid::string, "yyyyMMdd") AS Date
FROM main.tpc_edw_demo.factholdings h
INNER JOIN main.tpc_edw_demo.dimcustomer c ON c.sk_customerid = h.sk_customerid
AND c.status = 'Active'
AND c.enddate::date >= now()::date --SCD 2 table, ensure customer record is not expired
INNER JOIN main.tpc_edw_demo.dimcompany comp ON comp.sk_companyid = h.sk_companyid
WHERE h.sk_dateid BETWEEN 20150101 AND 20151201
)
,
holding_customer_agg AS (

SELECT
CONCAT(lastname, ', ', firstname) AS CustomerName,
SUM(CurrentMarketValue) AS TotalHoldingsValue
FROM year_selected_holding
GROUP BY CONCAT(lastname, ', ', firstname)
),
customer_rank AS (

SELECT
*,
DENSE_RANK() OVER (ORDER BY TotalHoldingsValue DESC) AS CustomerRank
FROM holding_customer_agg
)
SELECT * FROM customer_rank ORDER BY CustomerRank LIMIT 100
;

The more complex the queries get, the more the Query Profile comes in handy, as troubleshooting nested, multi-step SQL pipelines in the Spark UI is tedious and difficult to dissect. So let's optimize this one with the Query Profile!

When running this query over a completely un-optimized set of tables (default CTAS from files, no indexing/clustering/etc.), we get the following output on a Medium DBSQL Cluster:

Raw unoptimized Top 100 Query

Ok, we get about 14 seconds. In the real world, we would not be optimizing just for fun; we would generally optimize for cost given an SLA performance requirement. However, since we have no SLA here, let's just assume 14 seconds is not fast enough. To start, we need to ask a few questions:

  1. How much data are we dealing with?
  2. Where is the bottleneck?
  3. What columns are used most in filters and joins?
  4. Do those columns filter a good proportion of the data, or do they always pull in most rows?

There are plenty more optimization questions we can ask ourselves, but most of them are pointless until we have exhausted the 4 questions above. So let's answer these with the Query Profile and come up with a way to make this query better!

  1. How much data are we dealing with?

Looks like about 23 GB and 250 million rows. Not small, but this cluster can handle the size. Pro tip: if the bottom section called “Spilling” shows anything more than 0 bytes, that usually means you would benefit from increasing your cluster size. It could also mean you are writing the query inefficiently, so check your Query Profile graph for the other metrics and whether they make sense given your top-level data size (23 GB / 250M rows in our case). Our Medium cluster is handling this perfectly fine. We could even try shrinking it if cost were the top priority.

2. Where is the bottleneck? Let's find the biggest bottleneck, fix it, and see what that does to the profile.

Graph View of the Top 100 High Value Customers Query Profile

From the graph view in the profile, this question is easy! We are spending a lot of time scanning that factholdings table. We handled this table before, when we added a ZORDER index / cluster key on currenttradeid. So why isn't it fast? Well, let's click into the node in our profile and look at our file-skipping metrics:

We optimized on currenttradeid, but in this query, we use a different column for joins and filters:

WHERE h.sk_dateid BETWEEN 20150101 AND 20151201

So let's change our strategy for the factholdings table with an updated ZORDER / CLUSTER BY command:

OPTIMIZE main.tpc_edw_demo.factholdings ZORDER BY (sk_dateid, currenttradeid);
--OR
ALTER TABLE main.tpc_edw_demo.factholdings CLUSTER BY (sk_dateid, currenttradeid);

-- Run after certain table operations incrementally
OPTIMIZE main.tpc_edw_demo.factholdings;

Now let's pause here in our checklist and re-run the query after this optimization:

~6 seconds, we have some improvement!

Top 100 query with optimized factholdings by 2 columns

Previously, during the scanning/reading of the factholdings table, the total CPU time was over 5 minutes (not wall-clock runtime, but total CPU time). Looking at the same graph post-optimization, we see a drastic reduction in time spent reading the table, and we see files being skipped! Given that this is the largest core table, this was our first big optimization step.

Top 100 Query — factholdings Scan Node optimized

Now that we have resolved that bottleneck, let's take a look at the next stage.

3. What columns are used most in filters and joins?

We already found one important column in the WHERE clause (sk_dateid), but there are joins involved, and joins are expensive. We also need to join to our dimcustomer and dimcompany tables using the customer and company surrogate keys (sk_customerid and sk_companyid). Since dimcustomer is our bigger table and has more join conditions, let's focus on that. Being an SCD2-style table, its optimization is slightly different. As we see in our join above, we join on the customer key with the additional conditions of status = 'Active' and enddate >= now()::date. This is a great use case for more file pruning! We can see this filtering happen in the scan node of the customer table in the Query Profile screenshot below. The problem again is that the engine has to scan the whole table, and these filters have to filter the data after it has already been read. Before we go adding more ZORDER columns, let's think about the next question first.

Top 100 Query — dimcustomer scan node

4. Do those columns filter a good proportion of the data, or do they always pull in most rows?

In the Query Profile, if we click on the scan node for dimcustomer, we see that the table outputs ~22 million rows and prunes 0 files. dimcompany has 5 million rows and 0 pruned files. Not pruning files isn't always bad, especially in this case since scanning both tables takes well under 1 second of CPU time, but it can quickly add time and complexity as queries become more complex and are run repeatedly over time. In the join to dimcustomer, our query really only needs the 'Active' customers with records that are not expired. This is a very common performance tuning requirement for SCD2 warehouse tables. Before we solve for this, let's ask how selective this would be. Our total customer table is about 22 million rows, and we can see above in our profile that the output of the scan AFTER the filter is only 13.1 million rows. That is about a 40% reduction if we are able to skip reading those records entirely. In addition, most SCD2 tables that are very update-heavy would have a much higher ratio, and optimizing them could offer over 90% file pruning.
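If you want to sanity-check that selectivity in SQL rather than reading it off the profile, a small ad-hoc sketch (our own hypothetical check, not part of the report query) could be:

-- Roughly how selective are the SCD2 filters on dimcustomer?
SELECT
  COUNT(*) AS total_rows,
  COUNT_IF(status = 'Active' AND enddate >= current_date()) AS active_unexpired_rows
FROM main.tpc_edw_demo.dimcustomer;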

So what can we do about this? Well, this table is interesting because we join on the customer key, which is a very high-cardinality column, but we also filter on status and enddate, which are very low-cardinality columns. Users could partition the table by status and ZORDER by customerid / enddate, or some variation of that strategy, but there are some problems with that approach.

  1. Enddate is not very high cardinality, so ZORDERing on it might be overkill (could work, but not really necessary).
  2. If this table isn’t huge (>1 TB at least), so partitioning it seems unnecessary.
  3. If we ZORDER or partition this way, our table could be very skewed, and that could have some moderate performance impacts down the road.

This is the perfect use case for Delta Liquid Clustering, so let's cluster our Delta table. The CLUSTER BY command below is how we cluster Delta tables with Liquid Clustering. It is declarative in the table DDL rather than a separate operation like ZORDER, so you can now track the clustered columns in the table history and evolve them over time. Instead of having to worry about when to PARTITION vs. ZORDER, we do both in one command: CLUSTER BY.

CREATE OR REPLACE TABLE main.tpc_edw_demo.customer_liquid
(sk_customerid BIGINT, customerid BIGINT, taxid STRING, status STRING, lastname STRING, firstname STRING, middleinitial STRING, gender STRING, tier TINYINT, dob DATE, addressline1 STRING, addressline2 STRING, postalcode STRING, city STRING, stateprov STRING, country STRING, phone1 STRING, phone2 STRING, phone3 STRING, email1 STRING, email2 STRING, nationaltaxratedesc STRING, nationaltaxrate FLOAT, localtaxratedesc STRING, localtaxrate FLOAT, agencyid STRING, creditrating INT, networth INT, marketingnameplate STRING, iscurrent BOOLEAN, batchid INT, effectivedate DATE, enddate DATE)
TBLPROPERTIES ('delta.targetFileSize'='64mb')
CLUSTER BY (status, enddate, customerid);

We can now insert data. With Liquid Clustering, most INSERTs will automatically cluster the data on write, which removes the need for asynchronous OPTIMIZE commands. There are some exceptions for larger writes (>512 GB per insert) and streams, for good reasons documented here. A minimal sketch of the backfill is shown below.
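This sketch assumes customer_liquid should simply mirror dimcustomer's columns in the same order:

-- Backfill the liquid-clustered table from the existing dimension table.
INSERT INTO main.tpc_edw_demo.customer_liquid
SELECT * FROM main.tpc_edw_demo.dimcustomer;

-- For large backfills or streaming writes that are not clustered on write,
-- run OPTIMIZE to cluster existing data by the declared keys.
OPTIMIZE main.tpc_edw_demo.customer_liquid;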

Now when we run our top 100 query, we see files being pruned for the SCD2 table!

Liquid Clustering Top 100 SCD2 Table

One last key use of the Query Profile is debugging queries, which can be extremely hard for larger, more complex queries. So let's change our statement above by one letter so that the customer table join key uses the wrong alias, like so:

--- Top 100 Largest Active Customer Base
WITH year_selected_holding AS (

SELECT
h.tradeid,
h.currentprice,
h.currentholding,
h.currentholding*h.currentprice AS CurrentMarketValue,
c.lastname,
c.firstname,
c.status,
comp.name AS company_name,
to_date(sk_dateid::string, "yyyyMMdd") AS Date
FROM main.tpc_edw_demo.factholdings h
INNER JOIN main.tpc_edw_demo.dimcustomer c ON h.sk_customerid = h.sk_customerid
AND c.status = 'Active'
AND c.enddate::date >= now()::date --SCD 2 table, ensure customer record is not expired
INNER JOIN main.tpc_edw_demo.dimcompany comp ON comp.sk_companyid = h.sk_companyid
WHERE h.sk_dateid BETWEEN 20150101 AND 20151201
)
-- ...the remaining CTEs and final SELECT are unchanged from the query above

Did you catch it? It is a very easy mistake to make. Without the Query Profile, this could be tedious and annoying to debug. But let's see how to find it quickly in our profile:

Exploding Query Plans — Debugging

Here we see that the query goes from 5–7 seconds to over 31 seconds because of one letter. We can quickly see in the profile where the arrows / rows explode (2+ billion rows), that it happens on the join to the customer table, and that the rows output is MUCH larger than any of our tables. If you see more output rows being generated than exist in the underlying tables, that is a red flag! This allows us to find that needle in the haystack quickly.
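For reference, the one-letter bug compares the fact table's key to itself, so the join predicate is always true and the join fans out to every filtered customer row for every filtered fact row:

-- Buggy join: always-true predicate, effectively a cross join
INNER JOIN main.tpc_edw_demo.dimcustomer c ON h.sk_customerid = h.sk_customerid

-- Fixed join: dimension key matched to the fact key
INNER JOIN main.tpc_edw_demo.dimcustomer c ON c.sk_customerid = h.sk_customerid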

Finally, for more advanced users, below are some key features in Delta that we can use to optimize query performance:

  1. delta.dataSkippingStatsColumns — Previously, Delta only let users specify how many leading columns to track statistics on with the TBLPROPERTIES value delta.dataSkippingNumIndexCols. This newer property lets users choose exactly which columns to track statistics on, making stats collection much simpler and more efficient.
  2. delta.tuneFileSizesForRewrites — a Delta table property you can enable if you have very update-heavy tables, much like the SCD2 customer table example above. This is incredibly helpful for automatically tuning file sizes when there are lots of merges on your tables (see the sketch after this list).
  3. In the Query Profile, you can also enable “Verbose” mode. This gives you access to ALL query metrics if you want to dive deep into the weeds on CPU execution time, cloud request times, data filtering, scanning, and much more! Just click the […] button at the top right-hand side of the Query Profile.
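As a rough sketch of how the first two properties might be applied (the column list here is just an illustrative choice for this demo schema, not a recommendation):

ALTER TABLE main.tpc_edw_demo.factholdings SET TBLPROPERTIES (
  -- collect file-skipping statistics only on the columns we actually filter and join on
  'delta.dataSkippingStatsColumns' = 'sk_dateid,currenttradeid,sk_customerid,sk_companyid',
  -- let Delta tune file sizes automatically for update/merge-heavy workloads
  'delta.tuneFileSizesForRewrites' = 'true'
);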

Hopefully you found this article helpful, please reach out or comment if there are any other performance tuning and Query Profiling tips and tricks you want to learn more about!
