AI vs Human: Query performance

Rajesh Balamohan
Waii
Published in
5 min readAug 31, 2023

In this blog we deep dive the question of whether queries generated by Waii (a text-to-SQL platform) can outperform manually crafted queries. Buckle up.

15-second primer: Database engines and query optimizers

Database optimization has been a fruitful topic of research ever since the inception of SQL in the 70s. Modern database engines should be able to find optimal execution plans for queries regardless of how they are written out in SQL. This is especially important in the analytics space where suboptimal plans can cause queries to run for hours and waste significant resources.

Here are some examples of the techniques used to find those optimal execution plans:

  • Automatic join reordering
  • Aggressive predicate pushdown
  • Automatic use of materialized views
  • Removal of unnecessary joins
  • Column pruning

This leads to an interesting question: Do AI-generated queries perform better, worse or the same as queries crafted by humans in the presence of an optimizer?

Query under spotlight:

For the purpose of this blog, we will narrow our focus to query-9 of the TPC-DS benchmark. TPC-DS benchmark is the most widely used industry benchmark for evaluating decision support systems and data warehouses with respect to query speed. And as you will see query-9 tells us a lot about what the current state of the art is.

Query 9: Business question & SQL

Query 9 corresponds to these instructions: “Categorize store sales transactions into 5 buckets according to the number of items sold. Each bucket contains the average discount amount, sales price, list price, tax, net paid, paid price including tax, or net profit

Here is the resulting SQL query:

select case when (select count(*) from store_sales
where ss_quantity between 1 and 20) > 74129
then (select avg(ss_ext_discount_amt) from store_sales
where ss_quantity between 1 and 20)
else (select avg(ss_net_paid) from store_sales
where ss_quantity between 1 and 20) end bucket1 ,
case when (select count(*) from store_sales
where ss_quantity between 21 and 40) > 122840
then (select avg(ss_ext_discount_amt) from store_sales
where ss_quantity between 21 and 40)
else (select avg(ss_net_paid) from store_sales
where ss_quantity between 21 and 40) end bucket2,
case when (select count(*) from store_sales
where ss_quantity between 41 and 60) > 56580
then (select avg(ss_ext_discount_amt) from store_sales
where ss_quantity between 41 and 60)
else (select avg(ss_net_paid) from store_sales
where ss_quantity between 41 and 60) end bucket3,
case when (select count(*) from store_sales
where ss_quantity between 61 and 80) > 10097
then (select avg(ss_ext_discount_amt) from store_sales
where ss_quantity between 61 and 80)
else (select avg(ss_net_paid) from store_sales
where ss_quantity between 61 and 80) end bucket4,
case when (select count(*) from store_sales
where ss_quantity between 81 and 100) > 165306
then (select avg(ss_ext_discount_amt) from store_sales
where ss_quantity between 81 and 100)
else (select avg(ss_net_paid) from store_sales
where ss_quantity between 81 and 100) end bucket5
from reason
where r_reason_sk = 1

And here is the query generated from the instructions above by the Waii for Query 9. It is interesting to note that the generated query is a lot simpler to read and to maintain.

SELECT
MAX(CASE WHEN bucket = 'Bucket1' THEN value END) AS bucket1,
MAX(CASE WHEN bucket = 'Bucket2' THEN value END) AS bucket2,
MAX(CASE WHEN bucket = 'Bucket3' THEN value END) AS bucket3,
MAX(CASE WHEN bucket = 'Bucket4' THEN value END) AS bucket4,
MAX(CASE WHEN bucket = 'Bucket5' THEN value END) AS bucket5
FROM (
WITH bucket_counts AS (
SELECT
CASE
WHEN ss_quantity BETWEEN 1 AND 20
THEN 'Bucket1'
WHEN ss_quantity BETWEEN 21 AND 40
THEN 'Bucket2'
WHEN ss_quantity BETWEEN 41 AND 60
THEN 'Bucket3'
WHEN ss_quantity BETWEEN 61 AND 80
THEN 'Bucket4'
WHEN ss_quantity BETWEEN 81 AND 100
THEN 'Bucket5'
END AS bucket,
COUNT(*) AS sales_count,
AVG(ss_ext_discount_amt) AS avg_discount_amt,
AVG(ss_net_paid) AS avg_net_paid
FROM snowflake_sample_data.tpcds_sf10tcl.store_sales
GROUP BY 1
)
SELECT
bucket,
CASE
WHEN (
bucket = 'Bucket1' AND sales_count > 74129
)
OR (
bucket = 'Bucket2' AND sales_count > 122840
)
OR (
bucket = 'Bucket3' AND sales_count > 56580
)
OR (
bucket = 'Bucket4' AND sales_count > 10097
)
OR (
bucket = 'Bucket5' AND sales_count > 165306
)
THEN avg_discount_amt
ELSE avg_net_paid
END AS value
FROM bucket_counts
) AS subquery

The hand-written query and the generated query are semantically equivalent. Even though they look very different, they produce the same result and could be used interchangeably in any report or dashboard.

We ran both these queries in 10 TB TPCDS dataset in a popular data warehousing platform. We disabled result-set caching for fair comparison. Each query was run 3 times, and the average is shown here:

Waii generated query is roughly 2.5x better in response time compared to handwritten query. What is going on?

Deciphering the results: Deep Dive on why AI query was 2.5x better

Let’s look at the query plan below to understand the execution details: In the handcrafted query, "store_sales" table had to be scanned multiple times over 15 stages. This resulted in substantial I/O and processing costs, ultimately contributing to runtime delays. This approach demanded processing of over 1+ TB of data across multiple stages.

In contrast, the AI-generated query pushed "count" and "sum" aggregations to an earlier stage in the query. Consequently, the "store_sales" table was just scanned once. The system processed the data once and propagated the aggregations upstream in the processing layer. The resulting processing load was a mere 158+ GB, a significant reduction compared to the handwritten query's 1+ TB.

Even though we said earlier that database optimizers should always produce the best plans in reality that’s not the case. Due to the large search space and varying optimization methods across systems, achieving the best plan quickly becomes a challenge for database optimizers. Combining uncorrelated subqueries is a technique that the database we ran on doesn’t use (yet) and as a result it matters how the SQL is actually written, something that practitioners are very aware of. Note that different SQL engines have their own unique query rewriting techniques as integral part of their optimization process.

Conclusion:

Waii’s text-to-SQL capabilities use an array of AI techniques to understand natural language queries and generate executable SQL queries for databases. It can significantly boost the productivity of business analysts and data engineers. We have also observed that AI generated TPCDS queries often provide a significant improvement over hand-written ones and in this blog we showed an example of why that is the case. There is a lot more work that needs to be done to understand how widely applicable this is and we’re looking forward to spending more time on it.

--

--