Text-to-SQL Excellence: An Evaluation of Sonnet 3.5 and GPT-4o

Rachana Aluri
Published in Waii · Aug 1, 2024

The ability to query databases via natural language is revolutionizing the way users interact with data. Choosing the right LLM for Text-to-SQL is essential for generating accurate queries.

The best models we’ve seen on production workloads so far are Claude 3.5 Sonnet and GPT-4o. In this blog, we take an in-depth look at these two leading LLMs across the tasks required for Text-to-SQL. We aim to go beyond basic question/SQL pair analysis by evaluating the accuracy, performance, and cost of real-world SQL-generation tasks.

Key factors in evaluating LLMs for Text-to-SQL

Building a Text-to-SQL platform is complex and involves more than generating code. We evaluate LLMs for Text-to-SQL on critical subtasks, broadly categorized into “Knowledge Graph Curation” and “Query Generation.”

Knowledge Graph Curation

We drive query generation with a Knowledge Graph (KG) that integrates all database and business information, including table/column descriptions, common join graphs, statistics, data formats, classifications, business rules, definitions, and so on.
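To make this concrete, here is a minimal sketch (in Python) of what a KG entry might hold; the field names and structure are illustrative assumptions, not Waii’s actual data model.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ColumnInfo:
    name: str
    data_type: str
    description: str = ""                      # human- or LLM-curated description
    sample_values: list[str] = field(default_factory=list)
    classification: Optional[str] = None       # e.g., "PII", "metric", "dimension"

@dataclass
class TableInfo:
    name: str
    description: str = ""
    columns: list[ColumnInfo] = field(default_factory=list)
    common_joins: list[str] = field(default_factory=list)   # e.g., "orders.customer_id = customers.id"
    row_count: Optional[int] = None                          # basic statistics
    business_rules: list[str] = field(default_factory=list)  # definitions, computations, etc.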

Common KG operations include table selection, column filtering, join graph prediction, and filter value sanitization. For this benchmark, we focus on:

Table Description: The system should provide accurate and descriptive table summaries for both humans and LLMs. These summaries must be precise and include relevant information. Accurate table descriptions serve a dual purpose: they help data practitioners understand and use the database, and they provide essential inputs for automating tasks like searching for and selecting the right tables and columns.

Column Description: Accurate column descriptions are critical for generating precise queries. Detailed descriptions help prevent LLMs from misinterpreting columns. Providing sample values for columns is particularly useful, especially with nested structures. Striking the right balance in the level of detail is challenging — too much information leads to wasted tokens and confusion, while too little omits critical context. As with table descriptions, these descriptions are used by both humans and machines.

Table Selection: Choosing the relevant tables for a query is essential for generating accurate SQL. LLMs assist in this process by understanding the context of the natural language query and mapping it to semantically meaningful tables in the join graph. They can also help break ties between similar tables. Table selection is challenging in itself, and the algorithm must also scale to schemas with thousands of tables.

Column Filtering: In production systems, wide tables with hundreds or thousands of columns are common. It’s crucial for the system to select the appropriate columns needed for the specific query, ensuring accuracy and relevance.

Join Prediction: Correct join conditions using relevant fields are key for generating accurate SQL. The model must understand the relationships among tables and determine the appropriate join conditions to combine data meaningfully.
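As a rough illustration of how table selection and column filtering might consult the KG before prompting the LLM (this is not Waii’s actual algorithm), one could shortlist candidates by embedding similarity, reusing the TableInfo/ColumnInfo sketch above; embed stands in for any sentence-embedding function.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def shortlist_tables(question: str, tables: list[TableInfo], embed, k: int = 20) -> list[TableInfo]:
    # Rank tables by similarity between the question and "name: description".
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(f"{t.name}: {t.description}")), t) for t in tables]
    return [t for _, t in sorted(scored, key=lambda s: s[0], reverse=True)[:k]]

def shortlist_columns(question: str, table: TableInfo, embed, k: int = 40) -> list[ColumnInfo]:
    # Same idea for wide tables: keep only the most relevant columns.
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(f"{c.name}: {c.description}")), c) for c in table.columns]
    return [c for _, c in sorted(scored, key=lambda s: s[0], reverse=True)[:k]]

The shortlisted objects can then be handed to the LLM for the final selection, tie-breaking, and join prediction.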

Query Generation

Query generation (QG) involves multiple steps, including SQL generation, syntax and compilation checks, and interactive query updates.

SQL Generation: Generating an SQL query requires the model to understand the natural language input, analyze the question’s context, and incorporate user-provided business information, then translate all of this into valid SQL. Depending on the complexity of the query, the model may need to reference multiple tables and joins, apply column filters, and include question-specific filters and relevant business-context computations.

An SQL generation task also needs to generate and follow a “chain of thought” (CoT) to break down complex tasks into smaller, manageable steps, ensuring the generation of accurate queries.
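A hedged sketch of what such a CoT-style generation prompt could look like; the exact prompt structure used in production is not public, so this template and its placeholders are purely illustrative.

# Illustrative prompt template; all placeholders are assumptions.
COT_SQL_PROMPT = """You are an expert {dialect} analyst.

Schema (selected tables and columns):
{schema_snippet}

Business rules and definitions:
{business_rules}

Question: {question}

Think step by step:
1. Identify the tables and join conditions needed.
2. Identify filters, aggregations, and business-rule computations.
3. Write the final {dialect} query.

Return your reasoning first, then the final SQL query."""

def build_cot_prompt(dialect: str, schema_snippet: str, business_rules: str, question: str) -> str:
    return COT_SQL_PROMPT.format(dialect=dialect, schema_snippet=schema_snippet,
                                 business_rules=business_rules, question=question)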

Syntax and compilation checks: Ensuring the generated query is syntactically correct according to the requested SQL dialect and is compilable is essential. Implementing relevant retry processes helps address any issues, ensuring successful query generation.
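A minimal sketch of such a check-and-retry loop, using sqlglot for dialect-aware syntax checking; the retry policy, the error-feedback format, and the generate_sql callable are assumptions, and a real compilation check against the target database (e.g., via EXPLAIN) is omitted.

import sqlglot
from sqlglot.errors import ParseError

def generate_with_retries(question: str, dialect: str, generate_sql, max_retries: int = 3) -> str:
    error_feedback = None
    for _ in range(max_retries):
        # generate_sql is an assumed LLM call that can take prior error feedback.
        sql = generate_sql(question, dialect=dialect, error=error_feedback)
        try:
            sqlglot.parse_one(sql, read=dialect)   # syntax check for the requested dialect
            return sql
        except ParseError as e:
            error_feedback = str(e)                # feed the parser error back on the next attempt
    raise RuntimeError(f"Could not produce parsable {dialect} SQL after {max_retries} attempts")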

Interactive query generation: Users often prefer to incrementally build complex datasets or answer intricate questions. The model must understand these follow-up requests and adjust the SQL accordingly without losing context.
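One simple way to support such follow-ups (again, an illustrative sketch rather than Waii’s implementation) is to include the prior question and SQL in the prompt so the model edits the existing query instead of regenerating it from scratch:

def build_followup_prompt(base_question: str, base_sql: str, followup: str, schema_snippet: str) -> str:
    # Carrying the base ask and its SQL forward helps preserve filters,
    # joins, and selected columns from earlier turns.
    return (
        f"Schema:\n{schema_snippet}\n\n"
        f"Previous question: {base_question}\n"
        f"Previous SQL:\n{base_sql}\n\n"
        f"Follow-up request: {followup}\n"
        "Update the previous SQL to satisfy the follow-up while preserving all "
        "constraints from the previous question (filters, joins, selected columns)."
    )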

In addition to the core tasks mentioned above, there is much more to query generation. Please refer to the “Future Work” section for a discussion on additional aspects we plan to measure in the upcoming months.

Evaluation criteria

For the critical tasks mentioned above, we evaluate along three dimensions: accuracy, performance, and cost.

Performance: This represents the end-to-end time taken to generate the respective artifact, from the initial prompt to the complete output. It applies equally to all tasks.

Cost: We use input and output tokens as a proxy for cost. Many public LLMs charge based on token usage, and for self-hosted LLMs this number is still related to cost (proportional to GPU hours). Again, this applies equally to all tasks.

There is a significant difference in cost (i.e., the number of tokens) between these models, and this variance can be attributed to several factors in the query generation process. First, different models may require varying levels of CoT reasoning to achieve accurate outputs, affecting token usage. During query generation, models choose objects to evaluate, and some models select more objects than others, leading to more LLM round trips and higher token counts. Another factor is retries: if a model can’t generate a correct query on the first attempt, it requires additional attempts, further increasing token usage.

Accuracy: Accuracy is assessed differently for different tasks. When judging the accuracy of a generated query (one-shot, interactive, or fixing compilation problems), we compare it with a manually validated ground-truth query, focusing primarily on user intent. Irrelevant details such as row ordering (unless specifically requested, of course) are ignored, as they are not representative of accuracy. More details can be found in the CIDR paper and archerfish-benchmark.

For table selection and column filtering, we also compare against a known-good set of tables and columns, respectively. We classify objects as required, optional, or incorrect based on user intent and evaluate against that classification. Details can be found in the paper mentioned above.
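A simplified sketch of how the required/optional/incorrect classification can be turned into an accuracy number (the exact scoring in the paper may differ): a selection counts as correct only if it contains every required object and nothing incorrect; optional objects neither help nor hurt.

def selection_correct(selected: set[str], required: set[str], optional: set[str]) -> bool:
    # Anything outside required/optional is considered incorrect.
    incorrect = selected - required - optional
    return required.issubset(selected) and not incorrect

def selection_accuracy(cases: list[tuple[set[str], set[str], set[str]]]) -> float:
    # cases: (selected, required, optional) per benchmark question.
    hits = sum(selection_correct(sel, req, opt) for sel, req, opt in cases)
    return hits / len(cases)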

For table/column descriptions, we used a panel of five data practitioners who manually evaluated the generated descriptions. The panel had access to and was familiar with the databases. They evaluated the descriptions based on the following criteria:

  • Clarity: Is the description clear and easy to understand?
  • Accuracy: Is it accurate and factually correct?
  • Completeness: Does the description include all relevant details?
  • Relevance: Is the description relevant to the table and the actual columns?

The panelists did not know which LLM generated each description, so the evaluation was effectively blind. The final result is the average of these scores.
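For example, the final score for a description could be computed as a blind average across raters and criteria, roughly as sketched below; the 1–5 rating scale is an assumption, since the post only states that the scores were averaged.

CRITERIA = ["clarity", "accuracy", "completeness", "relevance"]

def description_score(ratings: list[dict[str, int]]) -> float:
    # ratings: one dict per rater, e.g. {"clarity": 4, "accuracy": 5, ...}
    per_rater = [sum(r[c] for c in CRITERIA) / len(CRITERIA) for r in ratings]
    return sum(per_rater) / len(per_rater)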

Datasets considered

For this evaluation, we used a combination of publicly available schemas and private schemas sourced from real-world production databases. (Private schemas ensure that the LLM was not trained on them previously.)

For table/column descriptions, we evaluated three private schemas (622 descriptions) across different domains in both the operational and analytical spaces. This helps in evaluating the models’ capability to understand and interpret different schemas effectively.

For the table and column selection benchmarks, we chose a mixture of schemas ranging from fewer than 10 tables to as many as 1251 tables and evaluated 191 questions.

For query generation benchmarks, we considered multiple schemas and categorized them into simple, medium and complex questions.

Simple Questions: We selected 150 simple questions spanning different schemas. Simple questions require a small number of tables per schema and have straightforward filters and joins. While we initially used a much larger set of queries (~1,000), we found that the smaller sample provides comparable results, as the outcomes remained consistent when tested on larger sample sizes. This focused set of simple questions effectively evaluates the models’ ability to understand a variety of schemas across different domains, the kind of questions commonly encountered in real-world applications.

Medium Questions: For medium questions, we used three private schemas with different data types. One of the schemas had deeply nested JSON fields, one had over 40 tables, and another schema had 708 fields. This allowed us to assess the model’s capability to handle a variety of data types and joins. We evaluated 135 medium questions on these schemas, focusing on the model’s ability to manage more complex queries that require a deeper understanding of the data.

Complex Questions: We used star schemas for evaluating complex questions, which involve joining at least 5–10 tables. We tried 30 questions in this category. These complex queries are less frequent but critical, as they test the model’s ability to generate accurate SQL for highly sophisticated and detailed scenarios. Although we selected a smaller dataset, we were able to see consistent and clear patterns in the results.

For the interactive questions benchmark, we evaluated 75 questions using a private schema. Each question began with a base ask, for which SQL was generated, followed by a follow-up question. The result of the follow-up question was then compared against the ground truth query.

Benchmark details and results:

Here are the benchmark results across the critical sub-tasks mentioned in the earlier sections. In the following sections, we cover the benchmark details and findings.

Summary Graph

Table selection benchmark:

In this benchmark, we focused on understanding the LLM’s ability to pick the right tables needed to answer the question. We considered various schemas, with the number of tables ranging from fewer than 10 to as many as 1251. Overall, the two models were on par, with GPT-4o at 89.68% accuracy compared to Sonnet’s 89.56%. However, a closer analysis revealed notable differences across individual schemas.

  • Schema 1+2: < 25 tables
  • Schema 3: 200 tables
  • Schema 4: 1251 tables

With a smaller number of tables, the two models are on par. Sonnet handles the medium-sized schema (~200 tables) 3.34% better than GPT-4o. With many tables in the schema (1251), GPT-4o performs 5.01% better than Sonnet.

Column filtering benchmark:

Tables in production environments vary from a few columns to a few hundred columns per table. This benchmark targets whether the LLM is able to choose the most relevant fields needed for generating the SQL.

For this, we queried a private schema with 750+ columns. Failures in Sonnet were mostly due to confusion between similar columns in the same table (e.g., prof_svcs_fee was chosen instead of the professional_services column).

As you can see from the summary graph, GPT-4o performed 11.4% better than Sonnet for wide tables.

Table/Column description benchmark:

We analyzed the table descriptions generated by the LLMs across various private schemas. See the “Evaluation criteria” section for details on the evaluation process. In summary, the descriptions were evaluated for clarity, accuracy, completeness, and relevance, and the final score for each table and column description was derived from these criteria.

From the summary graph, it is clear that GPT-4o performed slightly better on table descriptions, beating Sonnet by 3.28%, but Sonnet performed significantly better at generating column descriptions, leading GPT-4o by 6.38%. Sonnet-generated descriptions were more detailed and easier to understand.

Query complexity:

In this section, we focus mainly on query generation tasks. For this, we evaluated simple, medium, and complex questions.

Simple Questions:

For simple questions, both Sonnet and GPT-4o performed remarkably well. Analyzing the results revealed that when Sonnet generated wrong queries, it was mostly due to ambiguity in the language or incorrect join selection. As an example, consider the following query generated by Sonnet:

Ask: List the most common hometown of teachers. Show me all common hometowns.

WITH teacher_counts AS (
  SELECT
    t.hometown,
    COUNT(DISTINCT t.teacher_id) AS teacher_count
  FROM spider_dev.course_teach.teacher AS t
  INNER JOIN spider_dev.course_teach.course_arrange AS ca <-- Not needed
    ON t.teacher_id = ca.teacher_id
  GROUP BY
    t.hometown
)
SELECT
  hometown,
  teacher_count
FROM teacher_counts
ORDER BY
  teacher_count DESC

Note that the join with “course_arrange” was not needed for this question. It could have retrieved the answer from the “teacher” table itself as follows.

SELECT Hometown
FROM Spider_Dev.Course_Teach.Teacher
GROUP BY Hometown
ORDER BY COUNT(*) DESC

From a performance standpoint, GPT-4o performed 25.6% better than Sonnet on query generation time, which is significant for simple questions, even though the two consume nearly the same number of tokens for query generation.

Medium Questions

As can be inferred from the summary graph, both GPT-4o and Sonnet performed well on medium questions across multiple schemas, with overall accuracies of 89.69% and 90.39%, respectively. It is especially worth noting that JSON fields were handled well by both models.

Interestingly, in one of the schemas, Sonnet performed 6.56% better than GPT-4o. Further analysis revealed that GPT-4o got confused by fields like “total_spent” and “total_price”. There is a subtle difference between these two fields in the schema, and the LLM picked one instead of the other, leading to query failures.

Sonnet consumes 15–17% more tokens than GPT-4o. This is attributed to the way tables are chosen for generating the query, which is discussed further in the complex questions section.

With respect to query generation time, Sonnet takes much longer than GPT-4o. Though some of this is attributable to Sonnet’s higher token consumption, in one schema it was orders of magnitude slower than GPT-4o. Overall, across schemas, Sonnet takes on average 43.15% more time than GPT-4o.

Complex Questions

As mentioned earlier, generating complex queries requires the system to produce 5–10 table joins. In this case, both Sonnet and GPT-4o performed equally well. Analyzing the results showed that Sonnet occasionally misinterpreted fields, causing failures. For example, in q10, this misinterpretation led to query failure.

Ask: Count the customers with the same gender, marital status, education status, purchase estimate, credit rating, dependent count, employed dependent count and college dependent count who live in certain counties ('Rush County','Toole County','Jefferson County', 'Dona Ana County','La Porte County') and who have purchased from store sales channel and also from another sales channel (like web sales or catalog sales) during a three month time period for the year 2002. sort by gender, marital status, education status, purchase estimate, credit rating, dependent count, employed dependent count. show only 100 rows.


WITH other_channels AS (
  SELECT web_sales.ws_bill_customer_sk AS customer_sk
  FROM tweakit_perf_db.sample_tpcds.web_sales
  INNER JOIN tweakit_perf_db.sample_tpcds.date_dim
    ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
  WHERE
    date_dim.d_year = 2002 AND date_dim.d_moy BETWEEN 1 AND 3
  UNION ALL
  SELECT catalog_sales.cs_bill_customer_sk AS customer_sk
  FROM tweakit_perf_db.sample_tpcds.catalog_sales
  INNER JOIN tweakit_perf_db.sample_tpcds.date_dim
    ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
  WHERE
    date_dim.d_year = 2002 AND date_dim.d_moy BETWEEN 1 AND 3
),
store_sales_customers AS (
  SELECT DISTINCT store_sales.ss_customer_sk
  FROM tweakit_perf_db.sample_tpcds.store_sales
  INNER JOIN tweakit_perf_db.sample_tpcds.date_dim
    ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
  INNER JOIN other_channels
    ON store_sales.ss_customer_sk = other_channels.customer_sk
  WHERE
    date_dim.d_year = 2002 AND date_dim.d_moy BETWEEN 1 AND 3
),
customer_info AS (
  SELECT
    c.c_customer_sk,
    cd.cd_gender,
    cd.cd_marital_status,
    cd.cd_education_status,
    cd.cd_purchase_estimate,
    cd.cd_credit_rating,
    hd.hd_dep_count,
    hd.hd_vehicle_count
  FROM tweakit_perf_db.sample_tpcds.customer AS c
  INNER JOIN tweakit_perf_db.sample_tpcds.customer_address AS ca
    ON c.c_current_addr_sk = ca.ca_address_sk
  INNER JOIN tweakit_perf_db.sample_tpcds.customer_demographics AS cd
    ON c.c_current_cdemo_sk = cd.cd_demo_sk
  INNER JOIN tweakit_perf_db.sample_tpcds.household_demographics AS hd
    ON c.c_current_hdemo_sk = hd.hd_demo_sk
  WHERE
    ca.ca_county IN ('Rush County', 'Toole County', 'Jefferson County', 'Dona Ana County', 'La Porte County')
)
SELECT
  customer_info.cd_gender,
  customer_info.cd_marital_status,
  customer_info.cd_education_status,
  customer_info.cd_purchase_estimate,
  customer_info.cd_credit_rating,
  customer_info.hd_dep_count,
  customer_info.hd_vehicle_count,
  COUNT(*) AS customer_count
FROM customer_info
INNER JOIN store_sales_customers
  ON customer_info.c_customer_sk = ss_customer_sk
GROUP BY
  customer_info.cd_gender,
  customer_info.cd_marital_status,
  customer_info.cd_education_status,
  customer_info.cd_purchase_estimate,
  customer_info.cd_credit_rating,
  customer_info.hd_dep_count,
  customer_info.hd_vehicle_count <-- groups by vehicle count instead of employed dependent count.
ORDER BY
  cd_gender,
  cd_marital_status,
  cd_education_status,
  cd_purchase_estimate,
  cd_credit_rating,
  hd_dep_count,
  hd_vehicle_count
LIMIT 100

With respect to query generation time for complex queries, GPT-4o generates the query 42.4% faster than Sonnet 3.5. Sonnet 3.5 tends to select more tables than necessary; for example, in Q69 we observed that Sonnet chose 18 tables whereas GPT-4o chose only 8. Choosing more tables is not inherently bad, but it increases the time taken for query generation.

From a token consumption perspective, GPT-4o consumes 34.3% fewer tokens than Sonnet, likely due to Sonnet requiring additional retries to generate a correct query.

Interactive query generation

As you can see from the summary graph, GPT-4o performed much better in this benchmark. Analyzing the query failures in Sonnet revealed that it would often fail to carry over required constructs from the base question. Here is an example:

Base ask: Get the list of TV series created by "Mark Goodson". Display asset_title.

Follow up question: Include the names of awards won

Generated Query:
WITH award_winners AS (
  SELECT DISTINCT
    a.awardname,
    p.name
  FROM movie_db.movies_and_tv.awards AS a
  INNER JOIN movie_db.movies_and_tv.people_awards_bridge AS pab
    ON a.award_id = pab.award_id
  INNER JOIN movie_db.movies_and_tv.people AS p
    ON pab.people_id = p.people_id
  WHERE
    a.iswinner ILIKE 'True'
)
SELECT
  awardname AS award_name,
  name AS winner_name
FROM award_winners
ORDER BY
  award_name
In this query, it missed the filters for “Mark Goodson” from the original question.

Cost factor

For the cost comparison, we examined the token consumption of GPT-4o and Sonnet 3.5 for generating complex queries and calculated the associated cost per query based on the vendors’ pricing models.

Please note that this comparison is based on the public pricing details from the vendors’ sites as of July 18, 2024. Average token consumption breaks down into 20% output tokens and 80% input tokens. The cost of generating complex queries with Sonnet was 17.36% higher than with GPT-4o.
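For reference, the per-query cost follows directly from the token counts and the per-million-token prices; the sketch below assumes the 80/20 input/output split mentioned above, and the price arguments are placeholders to be filled in from the vendors’ pricing pages rather than the actual July 2024 numbers.

def cost_per_query(total_tokens: int, price_input_per_m: float,
                   price_output_per_m: float, input_share: float = 0.8) -> float:
    # Prices are USD per 1M tokens; input_share reflects the ~80% input / ~20% output split.
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1 - input_share)
    return (input_tokens * price_input_per_m + output_tokens * price_output_per_m) / 1_000_000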

Future work

We picked only a select few criteria for this blog post. There are many more tasks common to SQL generation and data-heavy LLM agents that we plan to look at in the future. Breaking it down by KG and QG again:

KG curation:

  • Capturing business-specific information from users (e.g., computing revenue from a set of tables/columns) and including it in the knowledge graph can greatly impact accuracy.
  • Handling schema changes (tables, columns) in the underlying databases.

QG:

  • Evaluating the way the LLM plans to generate the query, similar to an explain-plan for query generation.
  • Identifying relevant business information (from potentially hundreds of definitions) to ensure it is applied correctly to the question.
  • Ensuring the LLM’s ability to support multiple database dialects and UDFs specific to them.
  • Fixing existing errors in the queries without losing context.

Conclusion

Our evaluation of GPT-4o and Claude 3.5 Sonnet reveals that both models demonstrate strong capabilities, with each excelling in different areas:

  1. Schema Understanding: GPT-4o performs slightly better in table selection and table descriptions, especially with large schemas (1251 tables). Sonnet, however, excels in generating detailed and easily understandable column descriptions.
  2. Query Complexity: Both models perform well across simple to complex queries, with GPT-4o showing a slight advantage in accuracy for more complex tasks.
  3. Performance and Efficiency: GPT-4o consistently outperforms Sonnet in query generation speed (42.4% faster for complex queries) and token efficiency (34.3% fewer tokens used).
  4. Interactive Queries: GPT-4o performs better in handling follow-up questions and maintaining context.

While Sonnet 3.5 proves to be a strong contender, GPT-4o demonstrates a slight overall edge in performance, accuracy, and cost effectiveness. For many use cases, however, the results are close enough to relegate the choice to other concerns (availability, security & privacy, etc.).

Waii

Here at Waii, we continue to provide the best text-to-SQL APIs to the data community. We are excited about the potential of this technology and will continue to work towards refining its capabilities and promoting best practices in fine-tuning and query generation. Join our Slack community if you’re interested in collaborating with us.

This post was written by Rachana Aluri and Rajesh Balamohan
