GPT-4’s SQL Mastery

Gunther Hagleitner
Waii
Apr 10, 2023

“Maybe ‘text to SQL’ is a solved problem now, you know, like chess and go.”

Early tests certainly seemed to indicate that this was true. GPT produces high-quality SQL from plain, often ambiguous language right out of the box. You can try it for yourself or look at social media for many astonishing examples.

However, we wanted to find out to what extent this holds beyond these anecdotes, so we set out to run a number of industry benchmarks and dig deeper.

As you’ll see, the result is staggering: in a very real way, SQL generation is a solved problem using GPT-4 with the right fine-tuning and supporting techniques. Yet we also believe that a significant amount of work remains before this potential can be unlocked for real-world, enterprise use cases.

SQL, seriously?

Yes, SQL is ancient. Unintuitive. Complex. Inefficient. But SQL is also still the lingua franca of the data processing world. Decades of improvements in how we handle larger and larger data sets and how we derive knowledge from data haven’t changed that.

SQL is still the interface between humans, tools, processing engines and data. As a result, the data scientists we have spoken to indicate that they still spend on average 40–80% of their time hand-crafting SQL queries.

Being able to translate regular language to SQL can thus make a lot of people’s lives a lot easier. But perhaps even more exciting: it can also put the power of specialized data tools into the hands of the uninitiated. TikTok for the data world: everyone can be a creator.

SQL is interesting from another perspective as well: It’s a more constrained problem than general programming languages like Python.

The SQL language is Turing-complete, to be sure, but the way it is used in modern data applications is more limited. The problem domain is outlined by what table schemas can describe, and the language itself gets by with a relatively static set of operations and functions. It is in many ways the ideal candidate for automatic generation.

How do you know if it’s any good?

We decided to start with the Spider dataset. The benchmark is tailor-made for testing SQL generation. It features ~5000 queries over 200 databases, each with multiple tables. The queries range from simple to complex, and the datasets were chosen from ~100 different domains. It has been designed to cover as much ground and be as representative as possible. Very impressive work.

That is not to say that Spider is without problems for this application. First and foremost, GPT seems to have been trained on Spider, which makes this decidedly more of an open-book exam. Given the massive amount of data current LLMs are trained on, this will soon be true for any public dataset.

Second, the benchmark is synthetic. Most of the databases and tables have not been used in anger, nor have they evolved over time to solve real-world problems. As a result, the databases are simpler and cleaner than what you would expect from a more realistic dataset. In theory, practice and theory are the same; in practice they are not.

Next, looking at the queries from an analytics perspective, the Spider queries classified as “extra hard/complex” are on the simpler end of the spectrum. Real workloads often require integrating multiple datasets and several layers of metric computation before presenting the results. We didn’t find much of this in the benchmark.

A final challenge we ran into is the target database. We’re interested in evaluating these methods on multiple database systems, starting with what is currently the most popular one: Snowflake (see DB-Engines rank). The Spider benchmark runs out of the box on SQLite, which means the expected queries and result sets suffer from the classic database incompatibilities and gotchas (row-ordering differences, data type semantic differences, non-standard language extensions used, etc). You can find a Snowflake compatible version of the benchmark here.
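To give a flavor of these incompatibilities, here is a minimal sketch, assuming the open-source sqlglot library, of mechanically transpiling a SQLite-flavored query into Snowflake syntax. The query is made up for illustration and is not taken from the benchmark.

```python
# Minimal sketch (assuming the open-source sqlglot library) of translating a
# SQLite-flavored query into Snowflake syntax. The query is invented for
# illustration; dialect-specific functions are rewritten where an equivalent
# is known, which is the kind of mechanical fix-up the Spider queries needed.
import sqlglot

sqlite_query = "SELECT strftime('%Y', hire_date) AS yr, count(*) FROM employees GROUP BY yr"

print(sqlglot.transpile(sqlite_query, read="sqlite", write="snowflake")[0])
```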

Spider is not the end-all-be-all, but it is a good baseline to start the evaluation.

If at first you don’t succeed…

So, what’s next? Off we go, let’s run the benchmark. Well, not so fast. If you just ask GPT for SQL given the benchmark questions, you end up with fairly low accuracy. This is where fine-tuning and prompt engineering come in. You have to decide which additional information to give the LLM, and you have to consider the format and sequence in which to present that information. Here are the different dimensions we evaluated to achieve the results listed below:

  1. Schema: The most basic thing to do is to tell your LLM about the table schemata. It improves accuracy enormously and is fairly easy to do. But this in itself poses a number of problems: Which tables do you include? How do you textually represent each schema? Which data types do you include? Do you need data types at all? All of these impact the outcome to some degree. (The first sketch after this list shows how such a prompt might be assembled.)
  2. Constraints: Things such as referential integrity, column constraints, and functional dependencies can be included. These need to be carefully evaluated, because adding the wrong ones can confuse the model, while not giving the right ones will lead to incorrect results.
  3. Query Examples: Having access to the query history run against a database can reveal a lot about the semantics of the data model. The biggest problem here is that the body of queries is likely large and needs to be heavily curated to see what is helpful to the translation at hand. Too many query examples or the wrong ones can cause the model to make up table and column references. Using fragments instead of full queries can also help focus the LLM on the right aspects of the selected examples.
  4. Samples and shape: Listing examples of the data can increase the quality of the generated queries. So can describing what the overall data looks like (e.g.: giving a histogram of values or specifying the actual column domain). Realistically, no one wants their actual data to be sent over the wire for tuning, and we have limited our tests to techniques that do not leak any data.
  5. Semantics: You’re not always going to be so lucky that all the semantic information about a database is encoded in table and column names. Sometimes comments or separate documentation will clarify the meaning of the objects in the database. Additional systems might be used to look up and understand concepts as well (e.g.: Alation). ETL tools and scripts (e.g.: DBT) contain information about additional relationships. Integrating this information can dramatically improve accuracy.
  6. Chain of thought: Providing a framework for approaching the problem and asking the LLM to explain what steps were followed can lift the accuracy. Different LLMs can react very differently to these approaches.
  7. Multi-step: In some instances it pays off to break the problem down into multiple steps, or to iterate on the result if it’s not quite correct yet. This can lead to improved accuracy, but it also has the potential to bring about more hallucinations, especially if the sequence gets too long.
  8. Compilation: LLMs don’t reason about queries, compile them, or do semantic analysis. That doesn’t mean we can’t do that separately in order to improve accuracy. You can either feed hints from the analysis back to the LLM, or use heuristics to patch up a proposed query. This is also a good place to make sure the query complies with the intended target DBMS. (The second sketch after this list illustrates this idea combined with multi-step retries.)
  9. Human guidance: As a last line of defense, a user can provide hints to the LLM to have it re-generate a query. There are two different types of tweaking a result this way: a) Just looking at the output of the query, you can describe what might have gone wrong or is missing. b) You can read the query and give feedback on the SQL code directly. Since (b) assumes at least reading knowledge of SQL, we only allowed hints of type (a) in the experiments below.
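To make some of these dimensions concrete, here is a minimal, hypothetical sketch of how schema information, constraints, curated query examples, and a chain-of-thought instruction (dimensions 1, 2, 3 and 6) might be assembled into a single prompt. The table names, helper functions and prompt wording are illustrative assumptions, not the exact prompts used in our tests.

```python
# Hypothetical prompt assembly combining schema, constraints, curated query
# examples, and a chain-of-thought instruction (dimensions 1, 2, 3 and 6).
# Table names and wording are illustrative, not the prompts used in our tests.

def render_schema(tables: dict[str, list[tuple[str, str]]]) -> str:
    """Render table schemata as CREATE TABLE statements, one common text format."""
    ddl = []
    for table, columns in tables.items():
        cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
        ddl.append(f"CREATE TABLE {table} (\n  {cols}\n);")
    return "\n".join(ddl)

def build_prompt(question: str, tables, constraints: list[str], examples: list[str]) -> str:
    return "\n\n".join([
        "You translate questions into Snowflake SQL.",
        "Schema:\n" + render_schema(tables),
        "Constraints:\n" + "\n".join(f"- {c}" for c in constraints),
        "Example queries against this schema:\n" + "\n".join(examples),
        "Think step by step: identify the tables, join keys, filters and "
        "aggregations needed, then output only the final SQL.",
        f"Question: {question}",
    ])

prompt = build_prompt(
    question="Which department has the most employees?",
    tables={
        "employees": [("id", "INT"), ("name", "TEXT"), ("dept_id", "INT")],
        "departments": [("id", "INT"), ("name", "TEXT")],
    },
    constraints=["employees.dept_id references departments.id"],
    examples=["SELECT count(*) FROM employees WHERE dept_id = 3;"],
)
print(prompt)
```

Which of these pieces to include, and in what order and format, is exactly the tuning space described in the list above.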

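And here is a sketch of the compilation and multi-step ideas (dimensions 7 and 8): parse the proposed query with an off-the-shelf SQL parser and, if it does not parse or references unknown tables, feed a hint back for another attempt. The generate_sql function stands in for the actual LLM call; the hint wording and loop structure are assumptions for illustration, not our exact pipeline.

```python
# Sketch of a compile-and-retry loop (dimensions 7 and 8): validate the
# proposed SQL before accepting it and feed any error back as a hint.
# `generate_sql` is a stand-in for the actual LLM call; this is illustrative,
# not the exact pipeline used for the results below.
import sqlglot
from sqlglot import exp
from sqlglot.errors import ParseError

KNOWN_TABLES = {"employees", "departments"}

def validate(sql: str) -> str | None:
    """Return an error hint if the query does not parse or uses unknown tables."""
    try:
        tree = sqlglot.parse_one(sql, read="snowflake")
    except ParseError as e:
        return f"The query did not parse: {e}"
    unknown = {t.name for t in tree.find_all(exp.Table)} - KNOWN_TABLES
    if unknown:
        return f"Unknown tables referenced: {sorted(unknown)}"
    return None

def translate(question: str, generate_sql, max_attempts: int = 3) -> str:
    hint = ""
    sql = ""
    for _ in range(max_attempts):
        sql = generate_sql(question, hint)  # LLM call (stand-in)
        error = validate(sql)
        if error is None:
            return sql
        hint = f"Your previous query had a problem: {error}. Please fix it."
    return sql  # give up and return the last candidate
```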
We will use the following classification of the techniques to discuss the results:

The big reveal

It should be noted here that we chose a slightly different way of evaluating the results than the one Spider defines. What matters most to us is that the SQL produces a semantically correct answer to the question that was posed. That is how we envision these systems being used.

We thus focused on the result set of the query and made sure it contained the correct information (allowing for different possible row orderings, extraneous columns, different answers for ambiguous questions, etc.).

We did specifically not confirm that a certain number of “group-bys” or “joins” were used in the queries. There are many ways in SQL to get to the same answer and we did not want to force any one mechanism.
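As a rough illustration of this style of evaluation, the sketch below marks a generated query as correct if its result rows match the expected rows over some subset and ordering of columns, regardless of row order. It is a simplification of the checks we actually ran (it ignores ambiguous questions, for instance) and is shown only to make the idea concrete.

```python
# Rough sketch of result-set comparison: a generated query counts as correct
# if some selection and ordering of its result columns matches the expected
# rows as an unordered multiset. A simplification, shown only for illustration.
from itertools import permutations

def matches(expected: list[tuple], actual: list[tuple]) -> bool:
    if not expected:
        return not actual
    n_expected = len(expected[0])
    n_actual = len(actual[0]) if actual else 0
    if len(expected) != len(actual) or n_actual < n_expected:
        return False
    want = sorted(expected)
    # Try every way of picking and ordering n_expected columns from the actual rows.
    for cols in permutations(range(n_actual), n_expected):
        got = sorted(tuple(row[c] for c in cols) for row in actual)
        if got == want:
            return True
    return False

# An extra column and a different row order still count as a match.
expected = [("Engineering", 42), ("Sales", 17)]
actual = [(17, "Sales", 2021), (42, "Engineering", 2021)]
assert matches(expected, actual)
```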

Without further ado, the 6 lines you actually wanted to see:

[Table: Spider accuracy results for GPT-based and prior approaches]
* RESDSQL paper
** ChatGPT Text-2-SQL paper

First off, the quality of the translation is absolutely amazing: using GPT with just a basic prompt matches or outperforms the best methods that came before. The rate of improvement is also staggering. GPT-3 vs. GPT-4 is a significant increase in accuracy. Going from 80% to 90% is not the same as going from, say, 50% to 60%. Improvements get increasingly difficult the higher the baseline is, and GPT-4 performs this well right out of the box.

Using the right techniques to prompt and tune makes a significant difference. It’s really what pushes the results over the top. For the time being, using these techniques is essential to get the best accuracy. Future versions of GPT and future LLMs might give similar results with less effort.

And finally, if you allow human input to guide the model, you can solve basically all the problems in the benchmark. That is what we were referring to at the beginning of this post: generating SQL from English is basically a solved problem.

There is one pattern worth pointing out in the queries that failed or required human intervention: the complexity of the query seems to be the best indicator of success. The more tables, columns and operations required for a question, the less likely the translation is to succeed.

…and beyond!

Our ultimate goal is to understand how well we can tune the accuracy of query generation on a typical enterprise warehouse, one that the LLM has not been specifically trained on. This means more work needs to be done on the following limitations:

  1. Real-world private databases: Maybe the most important additional work is to validate against real datasets. While the results we have seen are impressive, nothing compares to having actual data scientists ask questions from their ever-growing and evolving production warehouses. We are currently gathering that information.
  2. Wide data models: One thing that becomes immediately clear when working with real-world warehouses is that everything is wider than specified in synthetic workloads. There are more databases, more tables in those databases, more columns in those tables, and more records to top it off. That adds a lot of additional dimensions to the problem space. Which tables should the translation focus on? Which columns? What are the hierarchies and lineage? Which representations should be used? (A simple table-selection sketch follows this list.)
  3. Complex queries: The other thing that becomes clear is that the complexity of queries is often higher than in the benchmarks. Complexity can loosely be thought of as the number of joins, subqueries, set operations, and CTEs. Any of these recursive constructs make generating the queries more complex and require additional techniques, such as the “multi-step” approach mentioned above. Improving accuracy is especially important in this area.
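There are many ways to narrow a wide data model down to the handful of tables relevant to a question. The sketch below shows one deliberately simple, hypothetical heuristic (word overlap between the question and table/column names) purely to illustrate the problem; it is not the approach we use, and real systems draw on richer signals such as embeddings, query history and lineage.

```python
# A deliberately simple, hypothetical heuristic for narrowing a wide data
# model: score each table by word overlap between the question and the
# table/column names, and keep only the top-scoring schemas for the prompt.
# Illustrative only; not the approach we use.
import re

def score_table(question: str, table: str, columns: list[str]) -> int:
    text = (table + " " + " ".join(columns)).lower()
    words = set(re.findall(r"[a-z]{3,}", question.lower()))
    return sum(1 for w in words if w in text)

def select_tables(question: str, schema: dict[str, list[str]], k: int = 5) -> list[str]:
    ranked = sorted(schema, key=lambda t: score_table(question, t, schema[t]), reverse=True)
    return ranked[:k]

schema = {
    "employees": ["id", "name", "department_id", "hire_date"],
    "departments": ["id", "name", "location"],
    "invoices": ["id", "customer_id", "amount", "issued_at"],
}
print(select_tables("How many employees does each department have?", schema, k=2))
# -> ['employees', 'departments']
```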

That’s all folks…

And that’s a wrap for now. We hope we were able to shed some light on the capabilities of these new models and convince you that we have entered a new era. We’re currently working on lifting the limitations mentioned above and hope you will stay tuned for upcoming posts on what we find!

If you’re interested in seeing this in action for yourself, please join our Slack channel: Wai’i community, or request access at waii.ai

This post was written by Wangda Tan and Gunther Hagleitner.
