GPT-4 In Winter SQL-land: Tackling Snowflake Schemas!
‘This query is so much better than anything I could have written’ — Something I won’t admit to myself.
In the last blog we concluded that the Spider benchmark for natural language to SQL conversion is basically solved, but that there are still open questions. The most pressing ones for us were: (a) What about more realistic wider databases/tables, (b) what about more complex analytical queries, and (c) what about real-world datasets (vs. synthetic/public benchmark data)?
In this blog we look at (a) and (b). And to do so, we decided to call on an old friend: TPC-DS.
TPC-DS is a performance benchmark, mate.
Well, yes. TPC-DS was created to measure the performance of database systems on analytical workloads. It does so by defining a typical star/snowflake schema for a fictitious retailer.
There are fact tables such as store_sales or web_sales, and dimension tables such as customer or household_demographics. It defines a data generator to fill these tables with however many rows you like. Finally, it defines a set of 99 complex analytical queries to measure your warehouse’s performance.
So, how is this useful for benchmarking natural-language-to-SQL capabilities? In an addendum, the benchmark defines the business questions that these 99 queries embody. This was probably provided to justify why such complex queries are useful benchmarks for a warehousing system.
We decided to flip this around: use the business questions and see if we can generate the corresponding SQL. As a bonus, we can also compare the generated queries against the benchmark queries on performance.
An example helps illustrate. This is the business question that underlies query #11:
Find customers whose increase in spending was larger on the web than
in stores for the year 2000, when compared to previous year.
You can tell it was a simpler time when the benchmark was created. People were wondering whether the web was for real and whether it would add anything meaningful to brick-and-mortar stores. No spoilers, please.
The question templates are still relevant, though. They resemble the questions a data scientist has to answer today: “Which customers are adopting our new sales channels the quickest?” never went out of style.
We’re using the same optimized combination of techniques defined in the last blog (schema, compilation, chain-of-thought, multi-step, etc.), the same backend that gave us the Spider results, and GPT-4 as the LLM for this evaluation. We’ll refer to this system as QueryMind (QM) going forward to simplify talking about it.
Here’s what QM will produce:
WITH web_sales_filtered AS (
    -- Per-customer web spend for 1999 and 2000
    SELECT ws.WS_BILL_CUSTOMER_SK,
           dd.D_YEAR,
           SUM(ws.WS_NET_PAID) AS total_spending
    FROM snowflake_sample_data.tpcds_sf10tcl.web_sales ws
    JOIN snowflake_sample_data.tpcds_sf10tcl.date_dim dd
      ON ws.WS_SOLD_DATE_SK = dd.D_DATE_SK
    WHERE dd.D_YEAR IN (1999, 2000)
    GROUP BY ws.WS_BILL_CUSTOMER_SK, dd.D_YEAR),
store_sales_filtered AS (
    -- Per-customer store spend for 1999 and 2000
    SELECT ss.SS_CUSTOMER_SK,
           dd.D_YEAR,
           SUM(ss.SS_NET_PAID) AS total_spending
    FROM snowflake_sample_data.tpcds_sf10tcl.store_sales ss
    JOIN snowflake_sample_data.tpcds_sf10tcl.date_dim dd
      ON ss.SS_SOLD_DATE_SK = dd.D_DATE_SK
    WHERE dd.D_YEAR IN (1999, 2000)
    GROUP BY ss.SS_CUSTOMER_SK, dd.D_YEAR),
web_increase AS (
    -- Year-over-year change in web spend per customer
    SELECT ws2000.WS_BILL_CUSTOMER_SK,
           (ws2000.total_spending - ws1999.total_spending) AS increase
    FROM web_sales_filtered ws1999
    JOIN web_sales_filtered ws2000
      ON ws1999.WS_BILL_CUSTOMER_SK = ws2000.WS_BILL_CUSTOMER_SK
    WHERE ws1999.D_YEAR = 1999
      AND ws2000.D_YEAR = 2000),
store_increase AS (
    -- Year-over-year change in store spend per customer
    SELECT ss2000.SS_CUSTOMER_SK,
           (ss2000.total_spending - ss1999.total_spending) AS increase
    FROM store_sales_filtered ss1999
    JOIN store_sales_filtered ss2000
      ON ss1999.SS_CUSTOMER_SK = ss2000.SS_CUSTOMER_SK
    WHERE ss1999.D_YEAR = 1999
      AND ss2000.D_YEAR = 2000)
SELECT c.C_CUSTOMER_SK,
       wi.increase AS WEB_INCREASE,
       si.increase AS STORE_INCREASE
FROM snowflake_sample_data.tpcds_sf10tcl.customer c
JOIN web_increase wi ON c.C_CUSTOMER_SK = wi.WS_BILL_CUSTOMER_SK
JOIN store_increase si ON c.C_CUSTOMER_SK = si.SS_CUSTOMER_SK
WHERE wi.increase > si.increase;
It’s worthwhile pointing out that this is a remarkable result. The query is relatively complex and uses advanced concepts to get to the answer: there are 6 joins, 2 GROUP BYs and 4 common table expressions. QM also made confident decisions, e.g. on how to compute spending increases and how to represent a customer. The query definitely satisfies the business question, but does it produce the same output as the benchmark?
It doesn’t, for a reason that data scientists are all too familiar with: the benchmark assumes different definitions of certain concepts. Instead of adding up the NET_PAID amounts, the benchmark defines “spend” as “price minus discount”. There are almost always multiple ways to interpret such terms, and keeping them consistent within an organization is hard. How this can be handled with LLMs is the topic of another blog. For now we simply refine the result by giving QM these additional prompts interactively (look at tweaks 2, 3 and 4):
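To make the definitional gap concrete, here is a minimal sketch of why the two readings of “spend” diverge. The column names mirror TPC-DS’s store_sales table, but the rows are made up for illustration, and we use an in-memory SQLite database rather than Snowflake:

```python
import sqlite3

# Toy store_sales rows: (customer, net_paid, ext_list_price, ext_discount_amt).
# Column names borrowed from TPC-DS; the values are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE store_sales (
        ss_customer_sk      INTEGER,
        ss_net_paid         REAL,
        ss_ext_list_price   REAL,
        ss_ext_discount_amt REAL
    )
""")
conn.executemany(
    "INSERT INTO store_sales VALUES (?, ?, ?, ?)",
    [(1, 95.0, 100.0, 10.0),
     (1, 48.0, 60.0, 5.0)],
)

# Interpretation 1 (what QM picked): spend = NET_PAID.
net_paid = conn.execute(
    "SELECT SUM(ss_net_paid) FROM store_sales WHERE ss_customer_sk = 1"
).fetchone()[0]

# Interpretation 2 (the benchmark's): spend = list price minus discount.
list_minus_discount = conn.execute(
    "SELECT SUM(ss_ext_list_price - ss_ext_discount_amt) "
    "FROM store_sales WHERE ss_customer_sk = 1"
).fetchone()[0]

print(net_paid, list_minus_discount)  # 143.0 vs 145.0: same rows, different "spend"
```

Both totals are perfectly defensible answers to the business question; they just aren’t the same number, which is why the outputs only match after the interactive tweaks.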
The result is a query that yields the same results as the benchmark defines! Here are the updates QM made to the query in response to the additional information, shown as inline differences to give a sense of how interactive feedback evolves the SQL:
GPT-4 gets promoted to Principal SQL engineer
We would also like to point out that the query produced here is much better than the one you will find, for instance, in the Snowflake workbook that comes with every Snowflake account (i.e. the Snowflake TPC-DS query itself). It is easier to read and digest because of the way it is structured: you can follow top-down, step by step, what is being achieved. Something like: first compute per-customer web spend by year, then store spend by year, then the year-over-year increase for each channel, and finally join the two and compare.
Here are the two queries side-by-side. On the left is the QM-generated query, on the right is the TPC-DS query. See for yourself which one you would prefer to read or maintain (open the image in a new tab if you want to see the details).
(Snowflake — you know where to find us, if you want help cleaning this up.)
The fine print
Because of the different ways to interpret the questions, and the myriad ways of writing the queries themselves, we had to manually inspect the results of each query. We have put the queries and our analysis here for folks who want to try it themselves.
We also picked only 10 queries at random from the 99 available. It’s a small sample, but it gives an indication of what to expect.
Aaaand that’s not entirely true. We picked 9 queries at random. Query 27 was selected as the first query because of a long history we have with it (Shout-out: Gopal, Nita, Ashutosh, Jesus, Thejas & Arun).
We captured the times it took for QM to generate the queries using GPT-4 and also the time it took to run the queries on a medium warehouse (with the 10TB TPC-DS dataset).
Without further ado, here are the results:
Here’s what the columns mean in detail:
- Succeeded w/o Human guidance: QM generated a query that answers the business question, without any additional human input beyond the question itself.
- Human guidance required: No means the query produced the right output without any additional human input. No* means the same, except that additional input was required to match the output of the TPC-DS query exactly; that’s due to the different possible interpretations of the question, like the examples given above for query 11. Yes means the query didn’t answer the business question and required additional input. Direct SQL edit means we had to resort to editing the SQL by hand to “solve” the question.
- Query Generation Time: the time QM took to generate the initial query, in seconds.
- Execution Time Speedup Factor: the runtime of the TPC-DS query divided by the runtime of the query QM produced.
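As a sanity check on how to read that last column, here is the arithmetic with hypothetical runtimes (the real numbers are in the results table):

```python
# Hypothetical runtimes in seconds, purely for illustration.
tpcds_runtime_s = 120.0  # expert-tuned TPC-DS query
qm_runtime_s = 40.0      # QM-generated query

# Speedup factor as defined above: TPC-DS runtime divided by QM runtime.
speedup = tpcds_runtime_s / qm_runtime_s
print(speedup)  # 3.0 -> the QM query ran 3x faster; below 1.0 would mean slower
```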
What does it all mean then?
The results are very different from what we’ve seen with the Spider benchmark. The queries are more complex and so is the data model. As a result, we were not able to achieve the same accuracy: We had to provide a lot more human input and in one case had to even resort to editing the SQL query by hand. Overall it seems this benchmark can still teach us something.
Advanced techniques like multi-step prompting and query compilation were used much more frequently here than with the simpler queries. Without them, few (if any) queries would have succeeded.
Selecting the right information for the LLM is even more critical here than it was in the Spider benchmark. The complexity of the model can easily overwhelm the LLM.
Query complexity (i.e. the number of operations, especially joins, set operations, window functions and aggregates) is still the best predictor of success. Query 64, for instance, is by far the most complex query: it requires 13 tables and a wide range of set operations, joins, etc. QM got close, but we couldn’t get the query quite right.
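For readers who want a feel for what “complexity” means here, a naive keyword-counting heuristic along these lines already separates the easy queries from the hard ones. This is an illustrative sketch, not the metric we used, and it will miscount keywords inside strings or comments:

```python
import re

def complexity_score(sql: str) -> dict:
    """Rough complexity indicators for a SQL query (illustrative only)."""
    s = sql.upper()
    return {
        "joins": len(re.findall(r"\bJOIN\b", s)),
        "group_bys": len(re.findall(r"\bGROUP\s+BY\b", s)),
        "set_ops": len(re.findall(r"\b(UNION|INTERSECT|EXCEPT)\b", s)),
        "windows": len(re.findall(r"\bOVER\s*\(", s)),
    }

# A stripped-down query in the spirit of the one shown above.
query = """
WITH yearly AS (
    SELECT ws_bill_customer_sk, d_year, SUM(ws_net_paid) AS spend
    FROM web_sales JOIN date_dim ON ws_sold_date_sk = d_date_sk
    GROUP BY ws_bill_customer_sk, d_year)
SELECT * FROM yearly y1 JOIN yearly y2
    ON y1.ws_bill_customer_sk = y2.ws_bill_customer_sk
"""
print(complexity_score(query))  # {'joins': 2, 'group_bys': 1, 'set_ops': 0, 'windows': 0}
```

A query like 64 would score an order of magnitude higher on nearly every one of these counters, which tracks with where QM started to struggle.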
A very interesting result is that 9/10 queries actually performed on par with or faster than the expert-tuned TPC-DS queries. That’s important: we want non-experts to use these techniques to write queries without breaking the warehouse through inefficient SQL. It’s also an interesting angle for query optimization that we will explore further in the future.
In conclusion: our techniques work really well in a highly complex environment. The potential time savings and efficiency gains that these methods can bring to data practitioners are enormous. Yet the value can only be unlocked by combining the described (advanced) techniques with an appropriate user experience.
The next step for us is to move away from these synthetic workloads and look at how well this performs in the real world. We are in the process of testing on warehouses that have been used in anger for years, to see whether these findings hold in those environments.