Having your cake and eating it, too: Best Performance, User Experience, and TCO on Databricks Serverless SQL

Databricks SQL SME
DBSQL SME Engineering
Mar 25, 2024

Author: Shannon Barrow, Lead Solutions Architect @ Databricks

With three different Databricks SQL SKUs, a common question from Databricks customers is which SKU is right for them. On the surface, the cheapest SKU (Classic) appears to be the best TCO option. However, Serverless performance and features have grown tremendously over the last year, and mistaking the SKU price for a reflection of value and TCO could prove costly. Serverless delivers lower costs for most SQL workloads by offering:

  1. Best-in-class performance against any cloud data warehouse.
  2. Fewer provisioned clusters, thanks to that better performance.
  3. Reduced over-provisioning, thanks to less idle time.

Before reviewing these three drivers individually, one obvious advantage not included above is that…

Serverless includes compute costs!

Serverless is the only SQL SKU that includes cloud instance costs, denoted by a tiny footnote on the Databricks pricing page. Taking the list price at face value can mislead some customers without additional scrutiny. The reasons why Classic/Pro SKUs do not include the total warehouse costs are understandable; after all, Databricks does not control highly variable cloud compute instance pricing. The chart below levels the playing field for the three SQL SKUs to give customers a better pricing context¹.

Serverless is now cheaper than Pro and only 21% more expensive than Classic!

If Serverless is more affordable than Pro, for whom is the Pro SKU designed? The answer is simple: those who cannot use Serverless. Unfortunately, not all customers have access to Serverless, and they generally fall into one of two groups:

  1. Serverless is unavailable in the customer’s cloud or region. Even though over 97% of all Databricks customers are in a region where Serverless is enabled, 3% of customers still need SQL options.
  2. Some customers must manage their own cloud infrastructure (rare). Some security or regulatory requirements can preclude a serverless solution.

Pro remains an excellent offering with improved features and performance over the Classic SKU. However, Serverless will always have better TCO and performance than Pro. Therefore, the primary emphasis of this blog is on the performance and TCO of Classic vs. Serverless SKUs, with Pro metrics included for reference. One way to gain a favorable TCO position over a 21% cheaper SKU is with improved performance, and Serverless performance is in a class of its own!
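To make the pricing comparison concrete, here is a minimal sketch of how "leveling the playing field" works. All numbers below are hypothetical placeholders, not actual Databricks or cloud list prices, which vary by cloud, region, and warehouse size:

```python
# Hypothetical rates for illustration only; real prices differ.
DBU_RATES = {"Classic": 0.22, "Pro": 0.55, "Serverless": 0.70}  # $/DBU (hypothetical)
INCLUDES_COMPUTE = {"Classic": False, "Pro": False, "Serverless": True}
CLOUD_COMPUTE_PER_HOUR = 4.00  # hypothetical VM cost for the warehouse's instances
DBUS_PER_HOUR = 12             # hypothetical DBU consumption for a Small warehouse

def all_in_hourly_cost(sku):
    """Levels the playing field: add cloud instance cost where it isn't bundled."""
    cost = DBU_RATES[sku] * DBUS_PER_HOUR
    if not INCLUDES_COMPUTE[sku]:
        cost += CLOUD_COMPUTE_PER_HOUR
    return cost
```

With these illustrative inputs, the all-in ordering matches the chart: Serverless lands below Pro and only modestly above Classic.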

Serverless SQL Offers Best-in-Class Performance

Near-instant startup and rapid autoscaling are widely known features of Serverless SQL. However, this blog’s analysis begins with an often-overlooked Serverless advantage: premium performance. Serverless innovations over the last year, including faster I/O, increased Photon coverage, better caching, newer generation machines, and more, have turned Serverless SQL into a best-in-class Data Warehouse.

Determining the performance gap between the SKUs requires some industry-recognized standardized testing. The highly popular TPC-DS benchmark needs no introduction, and the benchmark results reviewed in this blog serve to derive a measurable performance heuristic for the SQL SKUs. Using a pre-loaded 1TB dataset (1000 scale factor), test runs launched all 99 TPC-DS queries in parallel against started/ready single-cluster warehouses for each SKU. These benchmark conditions isolate and control for variables like autoscaling and startup latency (two significant Serverless advantages) to focus on the active performance variations between SKUs. For more details about benchmark assumptions and conditions, please see footnotes².
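As a rough sketch of the benchmark mechanics, the harness below launches a set of queries in parallel and records the metrics discussed in this blog. The `execute_fn` callable is a stand-in: against a real warehouse, it would submit each query over JDBC/ODBC or the databricks-sql-connector and return its execution time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_benchmark(queries, execute_fn, max_workers=8):
    """Launch all queries in parallel and collect the three metrics used here:
    end-to-end run time, per-query execution time, and total task time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # execute_fn(q) runs one query and returns its execution time in seconds
        exec_times = list(pool.map(execute_fn, queries))
    wall = time.perf_counter() - start
    return {
        "end_to_end_s": wall,            # sensitive to stragglers (the long tail)
        "exec_times_s": exec_times,      # per-query execution times
        "total_task_s": sum(exec_times), # aggregate work across all queries
    }
```

Because the queries run concurrently, total task time exceeds end-to-end run time, which is exactly why the two metrics tell different stories.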

How does Serverless perform under this “worst case” scenario when removing autoscaling and startup time/costs from consideration? It laps the field.

Charting total end-to-end run times by SKU. Results only show the best TCO configuration, averaged across both AWS and Azure³

Serverless is 64% faster than Classic and 21% faster than Pro in the end-to-end run time⁴.

Despite a resounding 64% performance advantage, there is some “noise” in the results caused by “stragglers,” the long tail of running queries, that skew end-to-end run times⁵. Combining run time with two additional metrics, query execution time⁶ and total task time⁷, can provide a more holistic performance heuristic. The chart below includes all three measures to chart the relative performance improvement of Serverless over the Pro and Classic SKUs.

Charting three important metrics using geometric mean with Serverless as baseline.

Setting Serverless to a baseline of 0, Pro and Classic perform considerably worse — especially in query execution time. Total task time, arguably the most valuable metric here, roughly matches the combined average of the three metrics. This chart reflects Serverless’s growth in the last two years, lapping the Classic SKU in performance. It bears repeating:

Serverless is now double the performance of the Classic SKU.

Serverless performance is still improving with every version release! While benchmarking, a new DBSQL version was released (2024.10) that improved Serverless and Pro performance by roughly 10%. Therefore, all tests were rerun on the latest version to ensure accuracy across all results. How many cloud data warehouses (CDWs) can claim leaps in performance like this several times a year?
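For readers who want to reproduce the relative-performance math, the chart above aggregates per-query ratios with a geometric mean. A minimal sketch (function names are illustrative):

```python
import math

def geomean(ratios):
    """Geometric mean: the right average for per-query speedup ratios,
    since it is far less skewed by outlier queries than the arithmetic mean."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def pct_worse_than_baseline(sku_times, baseline_times):
    """Relative slowdown vs. a baseline SKU; 0.0 means parity with the
    baseline (Serverless in the chart above)."""
    ratios = [s / b for s, b in zip(sku_times, baseline_times)]
    return (geomean(ratios) - 1.0) * 100.0
```

A SKU that takes twice as long on every query comes out as 100% worse, i.e., "double the performance" gap described above.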

Despite being listed on the pricing page as over 3x cheaper, Classic has become 26% more costly than Serverless! Serverless now carries every advantage: TCO, performance, and user experience. The premium product has become the most affordable one.

Charting the average cost of executing the benchmark end-to-end

This performance story becomes even more powerful when paired with high-concurrency workloads — those in which a warehouse must support dozens or hundreds of concurrent queries across load-balanced clusters. The message is simple:

Improved performance → Fewer required resources → Lower costs

The following sections include example scenarios to simplify and illustrate query patterns on total costs.

Execute faster queries with fewer clusters on Serverless

The example workload below illustrates consistent, medium-concurrency, high-uptime query patterns typically found in warehouses with many scheduled tasks, like SQL alerts or dbt tasks, that keep the warehouse busy throughout the day.

Graph of performance-adjusted concurrency strain on the warehouse by SQL SKU and time. LOWER is better!

In the graph above, Classic (green) and Pro (yellow) are forced to support more concurrent queries than Serverless (blue) because of the performance gap, allowing queries to accumulate and slowing performance even further with additional resource contention. To handle the additional concurrency, Classic warehouses will need to:

  1. Allocate more clusters, which leads to higher costs
  2. Queue queries that exceed the cluster capacity, which further slows performance and only exacerbates the problem.

Dividing concurrency by a fixed baseline for maximum per-cluster concurrency⁹ yields the number of clusters this workload requires, charted below.

Required warehouse clusters to serve performance-adjusted concurrency by SQL SKU and time. LOWER is cheaper!

Two things stand out when viewing the graph above:

  1. Classic and Pro warehouses will run all day with a fixed number of clusters, while Serverless turns off several times¹⁰.
  2. Serverless warehouses scale repeatedly but never exceed the Classic/Pro cluster counts. Meanwhile, Classic and Pro warehouses are better left with fixed cluster counts since the rapid spikes begin and end faster than their capabilities to start new clusters.
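The cluster-count math above can be sketched in a few lines; the per-cluster concurrency baseline and the sample concurrency values below are hypothetical:

```python
import math

PER_CLUSTER_CONCURRENCY = 10  # hypothetical per-cluster baseline (see footnote 9)

def required_clusters(concurrent_queries, per_cluster=PER_CLUSTER_CONCURRENCY):
    """Clusters needed to avoid queueing: ceil(concurrency / per-cluster capacity).
    A minimum of one cluster models a warehouse that stays on between queries."""
    return max(1, math.ceil(concurrent_queries / per_cluster))

# Hypothetical concurrency samples across a day -> cluster demand over time
samples = [3, 18, 42, 95, 12, 0]
demand = [required_clusters(c) for c in samples]  # [1, 2, 5, 10, 2, 1]
```

Because the slower SKUs accumulate more concurrent queries for the same workload, their inputs to this calculation are larger, and so is their cluster bill.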

Pricing out the above warehouse consumption results in a significant Serverless TCO advantage:

Pro/Classic demands significantly more compute resources than Serverless — leading to higher costs.

Classic requires 2.1x more cluster time, leading to 60% more costs than Serverless.

Some cost-conscious customers may want to weigh this TCO advantage against cloud-provider discounts or reserved instances, noting that no discounts have been included in the analysis so far. That instinct is understandable: both cloud providers and Databricks offer discounts, and this kind of TCO analysis is encouraged for many other SKUs. However, discounts vary so widely across cloud providers, Databricks, and individual SKUs that quantifying them is nearly impossible and would only complicate the analysis needlessly. More importantly, the dominance of Serverless over the other SQL SKUs is rarely something compute discounts can overcome, and certainly not by enough to justify sacrificing better performance and user experience. The 60% Serverless TCO advantage in the example above is simply too vast, further strengthening the case for excluding cloud compute discounts from scope.

This scenario applies to only a minority of customers. Rarely do warehouses run all day with such a consistent query pattern. In practice, most customer warehouses run for only part of the day or see high variation in query velocity (“spiky” workloads), leading to idle compute resources.

Limit Over-Provisioning with Serverless On Spiky and Low Uptime Workloads

Most customers experience inconsistent — “spiky” — workloads in production. The spiky warehouse query pattern illustrated below is typical of infrequent or sporadic batch, Business Intelligence, and ad hoc queries.

Graph of performance-adjusted concurrency strain on the warehouse by SQL SKU and time. LOWER is better!

Like the previous scenario, the warehouse above consistently serves queries throughout the day — spiking to dozens or even hundreds every hour. Superior performance results in a concurrency gap, visualized in the chart above by Classic (green) spiking higher than Serverless (blue), with none more evident than the highlighted 19:43 timestamp when Classic needs to support 250% of the queries Serverless does. During this spike, Classic will queue hundreds of queries in the load-balancer until warehouse cluster resources free up, worsening the performance and concurrency problem.

This inconsistent query pattern perfectly pairs with a Serverless warehouse’s ability to scale up or down in seconds. However, Classic/Pro are, again, better off with a fixed cluster count since spikes begin and end so rapidly. An often overlooked complexity for customers is reasoning about the “right” fixed cluster count for a load-balanced warehouse. Provisioning the maximum number of clusters would be valuable for the massive afternoon spike; however, that wildly over-provisions for more than 99% of the day and would be prohibitively expensive.

An alternative is perhaps 10 clusters. Despite three dozen occurrences of query concurrency spiking past 100 (which leads to queued queries), it seems a “sensible” option to support the workload while keeping costs under control. In reality, any fixed cluster count for a spiky workload like this will inevitably be wrong for most of the day. Serverless delivers a better user experience by pairing near-instant compute with dynamic autoscaling. The graph below illustrates this dynamic behavior by overlaying total warehouse clusters (right axis) onto warehouse concurrency (left axis):

Graph of Serverless concurrency (left axis) mapped to the total load-balanced clusters (right axis).

Serverless recently introduced Intelligent Workload Management (IWM), a set of features that helps serverless SQL warehouses process large numbers of queries quickly and cost-effectively. IWM uses AI-powered prediction to analyze incoming queries and determine the fastest, most efficient path of execution. Two of its components, Intelligent Autoscaling and Advanced Query Routing, leverage AI models so that Serverless SQL responds dynamically to workload demands rather than relying on static thresholds. As concurrency and warehouse utilization fluctuate, warm clusters spin up or down in seconds to accommodate. Serverless keeps queries fast and idle clusters to a minimum.

Classic can only muster a meager seven (7!) static clusters before becoming more expensive than Serverless for this workload. Unfortunately, a Classic warehouse likely requires double that cluster count to serve this query pattern adequately since concurrency would exceed the capacity of 7 clusters almost 100 times per day. However, even doubling the Classic cluster count still isn’t sufficient to handle peak daily concurrency — leaving customers with a worse experience and double the price tag!

Shifting to “low-uptime” scenarios, most warehouses do not serve queries all day, which causes idle time. Every scenario with periods of warehouse idle time between queries also favors Serverless by leveraging shorter “auto-stop” timeouts — as low as 1 minute if set via the SQL Warehouse API. The default auto-termination window is set much longer (45 minutes) for non-serverless SKUs since starting a Classic/Pro warehouse cluster can take up to 5 minutes. Serverless further alleviates concerns about terminating a warehouse by leveraging a persistent results cache shared across all warehouses in a Databricks workspace.
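As a sketch of setting that shorter auto-stop, the helper below builds the request for the SQL Warehouses REST API edit endpoint. The endpoint path and field name follow the public API reference, but verify them against your workspace; the code only constructs the request rather than sending it:

```python
def build_auto_stop_edit(workspace_url, warehouse_id, auto_stop_mins=1):
    """Build the URL and JSON payload for a 1-minute auto-stop edit.
    Endpoint shape per the public SQL Warehouses API docs; verify before use."""
    url = f"{workspace_url}/api/2.0/sql/warehouses/{warehouse_id}/edit"
    return url, {"auto_stop_mins": auto_stop_mins}

# To apply it (token is a hypothetical PAT variable):
#   requests.post(url, headers={"Authorization": f"Bearer {token}"}, json=payload)
```

Note that the UI enforces a higher minimum, which is why the 1-minute setting is called out as API-only above.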

The high concurrency, low uptime, and inconsistent (spiky) query pattern in the workload below is the most compelling for Serverless.

Graph of serverless query concurrency pattern of low-uptime, spiky workload

Typical of a few daily batch jobs (e.g., refreshed reports or data engineering jobs), concurrency spikes to dozens of queries in half of the warehouse’s start-ups, despite the sporadic consumption and long gaps between usage.

Idle warehouse clusters have the most significant impact on costs for this workload.

Serverless handles the long query gaps well by quickly shutting down rather than idling for 45 minutes (default) like Classic/Pro SKUs. There are nine gaps in which a serverless warehouse will terminate faster than Classic — saving up to 6 hours of idle compute time! Also, the short spikes preclude autoscaling as a viable option for Classic and Pro SKUs, forcing users to select a static number of clusters. Half of the sporadic usage also results in 50 or more concurrent queries, suggesting that the Classic/Pro warehouse should provision 5 or more clusters — an extremely costly choice on a warehouse with 6+ hours of idle time.
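The idle-time savings above can be approximated with a quick calculation. The gap lengths below are hypothetical stand-ins for the nine gaps described:

```python
# Hypothetical gap lengths (minutes) between query bursts.
gaps_min = [50, 70, 45, 120, 60, 90, 48, 55, 75]

CLASSIC_AUTO_STOP = 45    # default auto-termination window (minutes)
SERVERLESS_AUTO_STOP = 1  # achievable via the SQL Warehouse API

def idle_minutes(gaps, auto_stop):
    """In each gap, a warehouse idles for min(gap, auto_stop) before stopping."""
    return sum(min(g, auto_stop) for g in gaps)

saved = idle_minutes(gaps_min, CLASSIC_AUTO_STOP) - idle_minutes(gaps_min, SERVERLESS_AUTO_STOP)
# With these hypothetical gaps: roughly 6.5 hours of idle compute avoided
```

The savings scale with the number of gaps, which is why sporadic workloads favor Serverless so heavily.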

Running cluster count throughout the day. The peak at 10:16 AM raises the cluster count to 15 serverless clusters!

Idle warehouse time cripples TCO for Classic/Pro SKUs here. Provisioning even one or two clusters proves more costly than Serverless, let alone the suggested five. The two Classic clusters charted above are 2.4x more expensive than Serverless and wholly inadequate to support the spikes. Even more extreme, Serverless is 20% cheaper than provisioning just ONE Classic cluster in this case!

Serverless will scale to as high as 15 clusters for this workload yet remain a cheaper option than configuring just ONE static Classic or Pro warehouse cluster — crushing the performance of both.

The Outlier: 1 Scenario Where Classic Can Offer Better TCO

Serverless excels in all scenarios. However, it becomes even more valuable as:

  • Concurrency increases
  • Uptime decreases (the length of time the warehouse receives queries per session)
  • Spikiness increases (variance in the velocity of queries the warehouse needs to execute concurrently)

To assure readers that this blog isn’t pushing an “always serverless” agenda, here is an opposing view: the “worst-case scenario” for Serverless of low concurrency, high uptime, and very consistent workloads.

Suppose a customer has no performance priorities for frequent, short data engineering jobs/alerts that execute 24/7 but stay within the capacity of a single cluster. A warehouse with minimal performance demands that never terminates/scales results in a TCO advantage for Classic that matches the SKU prices — a 21% price gap. Therefore, choosing a Classic warehouse is entirely acceptable for this scenario.
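A quick break-even sketch shows why this scenario favors Classic. It assumes normalized all-in rates reflecting the ~21% price gap above and ignores Classic's startup latency and idle tail, which only matter once the warehouse stops:

```python
# Hypothetical normalized all-in rates matching the ~21% price gap.
CLASSIC_RATE, SERVERLESS_RATE = 1.00, 1.21  # $/cluster-hour

def break_even_uptime_hours(hours_in_day=24.0):
    """Daily busy hours above which an always-on Classic cluster costs less
    than pay-for-use Serverless.
    Classic (always on): CLASSIC_RATE * 24; Serverless: SERVERLESS_RATE * busy."""
    return hours_in_day * CLASSIC_RATE / SERVERLESS_RATE

# ~19.8 busy hours (~83% uptime): below that, Serverless wins despite its price
```

At 100% uptime on a single cluster, the cost gap is exactly the 21% price gap; every idle hour Serverless can shed narrows it.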

After analyzing all common scenarios, this narrow query pattern is the only one the Classic SKU can claim to have the TCO advantage over Serverless. That said, one can easily justify the small premium for Serverless’s better user experience, features, and simplicity. The TCO gap between these two SKUs on a small, single-cluster warehouse is minimal, and users may find this small gap more than acceptable for simplicity and a better experience.

Conclusion

Customer use cases vary, yet common data warehousing usage patterns emerge. Identification of these common patterns/scenarios serves to simplify the value message. However, despite nuanced analysis of many query patterns, Serverless inevitably emerges as the best TCO option — outside of one narrow pattern. Iterative improvements to performance, user experience, and availability have propelled Serverless to become truly special. Not only should it be considered the best-in-class CDW among its competitive peers (a story for the next blog…), but it has become increasingly difficult to justify other Databricks SKUs over Serverless SQL — unless Serverless is unavailable.

Serverless delivers the best Total Cost of Ownership in nearly every plausible scenario, potentially several times cheaper than any other SKU option — despite the higher SKU price. One example reviewed above (spiky, low-uptime, high-concurrency) is potentially 6x more expensive on Classic when configured with the suggested cluster counts. Delivering the best possible product at a lower cost than inferior alternatives is quite rare.

Begin your Serverless experience today!

If you are new to Databricks SQL, serverless trials are available. Please contact your account sales team to help enable Serverless.

Existing customers can convert a warehouse to Serverless in seconds, requiring no commitments and no changes to downstream JDBC/ODBC connections. If unsatisfied, it can be reverted quickly, allowing customers to experience Serverless risk-free.

Footnotes:

