Translating Natural Language into SQL: GPT-4 vs Claude 3 Showdown

Rajesh Balamohan
Waii
Mar 15, 2024

Anthropic recently released the Claude 3 model ("Introducing the next generation of Claude", Anthropic), boasting impressive benchmarks compared to other LLMs. At Waii.ai we are particularly interested in one aspect of the model: natural-language-to-SQL translation.

Despite multiple attempts by models small and large to unseat it, GPT-4 (the 1106 model) has been the best-ranking model in this space in our testing since its release. We were thus curious how Claude 3 compares to the incumbent.

For this evaluation, we selected a number of different datasets, including Spider and TPC-DS, as we have done before (see: GPT-4 SQL mastery, GPT-4 tpcds). Spider is renowned for its comprehensive SQL query generation challenges across various domains. TPC-DS is the gold standard for evaluating decision support systems. Our goal is to see how well each model understands and translates natural language into SQL queries, using accuracy and query generation time as our key metrics.

Evaluation Metrics

  • Accuracy: We gauge how accurately the LLMs create SQL queries that align with the expected structure and logic. By comparing each generated query with the ground truth query, we measure the models' precision. A higher accuracy rate reflects a model's ability to comprehend and translate complex instructions from natural language to SQL. Additional information about the metric can be found in the CIDR paper and the archerfish-benchmark.
  • Query Generation Time: This represents the time taken by the LLM to generate the SQL.
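
As a concrete illustration of the accuracy metric, here is a minimal execution-based check in Python. This is only a sketch against a toy sqlite schema (the table and data are invented), not Waii's actual harness: two queries count as matching when they return the same rows against the same database.

```python
import sqlite3

def queries_equivalent(conn, generated_sql, golden_sql, ordered=False):
    """Execution-based accuracy: a generated query 'matches' the golden
    query when both return the same rows on the same database."""
    gen = conn.execute(generated_sql).fetchall()
    gold = conn.execute(golden_sql).fetchall()
    if ordered:
        return gen == gold              # row order matters (explicit ORDER BY)
    return sorted(gen) == sorted(gold)  # otherwise compare as multisets of rows

# Toy schema standing in for a Spider-style database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?, ?)",
                 [(1, "A", 30), (2, "B", 40), (3, "C", 30)])

print(queries_equivalent(
    conn,
    "SELECT name FROM singer WHERE age = 30",
    "SELECT name FROM singer WHERE age = 30 ORDER BY name"))  # True
```

A real harness must also handle semantically equivalent but structurally different answers (column aliases, extra columns, ties in ordering), which is where most of the subtlety lies.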

Results Overview:

Simple Queries:

We chose a subset of Spider as our testbed for simpler queries. Spider is a well-known benchmark; it covers a variety of different databases and is readily available. Its databases typically have only a few tables, and the golden queries are relatively simple (few joins, few aggregates, etc.). We sampled 384 queries from a variety of schemas, including battle_death, concert_singer, employee_hire_evaluation, flight, museum_visit, orchestra, poker_player, singer, tvshow, and voter. This ensures a comprehensive assessment across different domains.

The graph clearly shows that Claude 3's performance is remarkably similar to that of GPT-4's 1106 model. This is the first time we've seen a contender come this close to GPT-4's performance, and that's a significant milestone. After examining the inaccuracies in the queries generated by Claude 3, we categorized the 26 errors (out of 384 queries) into the following buckets:

  • Incorrect interpretation of fields
  • Confusion over AND/OR conditions
  • Errors in understanding join conditions
  • Omissions of necessary filters

Let's dive into a specific instance where Claude 3 misinterpreted the fields, leading to incorrect join conditions.

List unique name and date, the battle that has lost the ship named 'Lettice' and the ship named 'HMS Atalanta'

Generated query:
================

WITH SHIP_IDS AS (
  SELECT DISTINCT ID
  FROM BATTLE_DEATH.SHIP
  WHERE NAME IN ('Lettice', 'HMS Atalanta')
),
BATTLE_IDS AS (
  SELECT DISTINCT DEATH.CAUSED_BY_SHIP_ID AS SHIP_ID
  FROM BATTLE_DEATH.DEATH
  WHERE DEATH.CAUSED_BY_SHIP_ID IN (SELECT SHIP_IDS.ID FROM SHIP_IDS)
)
SELECT DISTINCT BATTLE.NAME,
       BATTLE.DATE
FROM BATTLE_DEATH.BATTLE
INNER JOIN BATTLE_IDS ON BATTLE.ID = BATTLE_IDS.SHIP_ID

In this query, Claude 3 incorrectly joined battle.id with battle_ids.ship_id, which are not related fields. (Waii automatically provides the LLM with information on how to perform joins.) The correct join condition should link the ship table to the battle table through ship.lost_in_battle and battle.id, as the ground truth query does.

Ground truth query:
===================

SELECT DISTINCT T1.NAME,
       T1.DATE
FROM SPIDER_DEV.BATTLE_DEATH.BATTLE AS T1
JOIN SPIDER_DEV.BATTLE_DEATH.SHIP AS T2 ON T1.ID = T2.LOST_IN_BATTLE
WHERE T2.NAME = 'Lettice'
INTERSECT
SELECT T1.NAME,
       T1.DATE
FROM SPIDER_DEV.BATTLE_DEATH.BATTLE AS T1
JOIN SPIDER_DEV.BATTLE_DEATH.SHIP AS T2 ON T1.ID = T2.LOST_IN_BATTLE
WHERE T2.NAME = 'HMS Atalanta'

Query Generation time:

Claude 3 had ~28% higher query generation time compared to GPT-4.

Medium Queries:

To raise the complexity a bit, we picked a 'raw' dataset: tv_data. This dataset captures information about movies, television series, and the people working on them. The overall number of tables and columns in this set is fairly small, but its main characteristic is that much of the data resides in heavily nested JSON fields. We looked at a sample of 29 queries to see how well Claude 3 would manage them. In this test, Claude 3 successfully handled 68.97% of the queries, while GPT-4 had a success rate of 93.1%.

We categorized the errors from Claude 3 into the following buckets:

  • Errors in join conditions
  • Errors in WHERE clauses/conditions
  • Misunderstanding of the question
  • Misinterpretation of complex (nested) fields

Let us take one of the example queries where Claude 3 failed.

List all movies directed by "David Slade" along with their runtime.

Generated query:
================

SELECT "Asset Title",
       "Runtime"
FROM CINE_TELE_DATA.MOVIES
WHERE 'David Slade' IN (
  SELECT VALUE
  FROM CINE_TELE_DATA.MOVIES
  CROSS JOIN LATERAL FLATTEN(INPUT => MOVIES."Directors")
)

The generated query searches for 'David Slade' with an uncorrelated subquery. As a result, every row of the movies table is incorrectly returned whenever any single row matches. The ground truth query is given below.

Ground truth query:
==================

SELECT "Asset Title",
       "Runtime"
FROM CINE_DB.CINE_TELE_DATA.MOVIES AS M
CROSS JOIN LATERAL FLATTEN(INPUT => "Directors") AS F
WHERE F.VALUE ILIKE '%David Slade%'
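
This failure mode is easy to reproduce outside Snowflake. The sketch below uses Python's sqlite3 with its json_each table function as a stand-in for Snowflake's LATERAL FLATTEN (the movie titles and rows are invented): an uncorrelated membership subquery scans the whole table, so every outer row passes the filter once any row matches, while the correlated form keeps only the matching rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, runtime INT, directors TEXT)")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", [
    ("Hard Candy", 104, '["David Slade"]'),   # invented rows
    ("Other Film",  90, '["Someone Else"]'),
])

# Uncorrelated: the subquery flattens directors across ALL rows, so the
# IN test is a constant TRUE and every outer row is returned.
uncorrelated = conn.execute("""
    SELECT title FROM movies
    WHERE 'David Slade' IN (SELECT value FROM movies, json_each(directors))
""").fetchall()
print(uncorrelated)  # both titles come back

# Correlated: flatten each row's OWN directors array, as the ground
# truth query does with LATERAL FLATTEN.
correlated = conn.execute("""
    SELECT m.title
    FROM movies AS m, json_each(m.directors) AS d
    WHERE d.value = 'David Slade'
""").fetchall()
print(correlated)  # only the matching title
```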

Query Generation Time:

From a query generation perspective, Claude 3 took ~64% more time than GPT-4.

Complex Queries:

For the most complex queries we used the TPC-DS dataset. The database is laid out as a well-modeled star schema, and typical queries require 5-10 joins, multiple subqueries, aggregates, and window functions.

In our examination of the TPC-DS dataset, we closely compared Claude 3's query generation accuracy against GPT-4's. We evaluated 32 queries.

We analyzed the results; here are the buckets into which the issues in Claude 3's generated queries fell:

  • Misinterpretations of the question leading to incorrect SQL
  • Omissions of dimension tables, resulting in inaccurate queries
  • Missing filters
  • Errors in group by statements and joins

Let us discuss an example that illustrates one such query.

Generate a report listing the top 100 unique items, including their IDs, descriptions, and current prices, which were in the inventory during a specific 60-day period starting from '2000-05-25'. These items should have been sold in store sales channel and should have been manufactured by specific manufacturers (129, 270, 821, 423) with item's price in the range of 62 and 92. Quantity of these items in the inventory should be between 100 and 500. Sort distinct results by item's id. Display item's id, item description and its current price. Show only 100 rows.

Generated query:
================

WITH ITEM_FILTERED AS (
  SELECT I_ITEM_ID,
         I_ITEM_DESC,
         I_CURRENT_PRICE,
         I_ITEM_SK
  FROM SAMPLE_TPCDS.ITEM
  WHERE I_MANUFACT_ID IN (129, 270, 821, 423)
    AND I_CURRENT_PRICE BETWEEN 62 AND 92
),
INVENTORY_FILTERED AS (
  SELECT INVENTORY.INV_ITEM_SK,
         INVENTORY.INV_QUANTITY_ON_HAND
  FROM SAMPLE_TPCDS.INVENTORY
  WHERE INVENTORY.INV_DATE_SK BETWEEN 2451699 AND 2451759
    AND INVENTORY.INV_QUANTITY_ON_HAND BETWEEN 100 AND 500
),
STORE_SALES_FILTERED AS (
  SELECT STORE_SALES.SS_ITEM_SK
  FROM SAMPLE_TPCDS.STORE_SALES
)
SELECT I.I_ITEM_ID,
       I.I_ITEM_DESC,
       I.I_CURRENT_PRICE
FROM ITEM_FILTERED AS I
INNER JOIN INVENTORY_FILTERED AS INV ON I.I_ITEM_SK = INV.INV_ITEM_SK
INNER JOIN STORE_SALES_FILTERED AS SS ON I.I_ITEM_SK = SS.SS_ITEM_SK
ORDER BY I.I_ITEM_ID
LIMIT 100;

The query generated by Claude 3 failed to incorporate the date_dim table and its related filters. It uses surrogate keys that mistakenly represent the year 2000 rather than the intended 60-day period starting from the specified date. This is not too surprising given the training set: many benchmarks based on TPC-DS hardcode fixed surrogate keys, and that pattern surfaces here. Waii explicitly instructs the model to do the date computation a certain way, but Claude ignored that instruction. It seems that GPT-4 still has a slight edge in instruction following.

There is also a grouping issue, but that can be fixed by adding the DISTINCT keyword to the generated query. Missing the date_dim join, however, causes the query output to differ from the ground truth query given below.

Ground truth query:
===================

SELECT I_ITEM_ID,
I_ITEM_DESC,
I_CURRENT_PRICE
FROM TWEAKIT_PERF_DB.SAMPLE_TPCDS.ITEM,
TWEAKIT_PERF_DB.SAMPLE_TPCDS.INVENTORY,
TWEAKIT_PERF_DB.SAMPLE_TPCDS.DATE_DIM,
TWEAKIT_PERF_DB.SAMPLE_TPCDS.STORE_SALES
WHERE I_CURRENT_PRICE BETWEEN 62 AND 62 + 30
AND INV_ITEM_SK = I_ITEM_SK
AND D_DATE_SK = INV_DATE_SK
AND D_DATE BETWEEN '2000-05-25' AND '2000-07-24'
AND I_MANUFACT_ID IN (129,270,821,423)
AND INV_QUANTITY_ON_HAND BETWEEN 100 AND 500
AND SS_ITEM_SK = I_ITEM_SK
GROUP BY I_ITEM_ID,
I_ITEM_DESC,
I_CURRENT_PRICE
ORDER BY I_ITEM_ID
LIMIT 100
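
A note on the hardcoded surrogate keys: TPC-DS conventionally sets date_dim.d_date_sk to the Julian Day Number of the calendar date (an assumption worth checking against your own date_dim). Under that mapping, the sketch below shows that a 60-day window starting '2000-05-25' corresponds to keys 2451690 through 2451750, which does not line up with the 2451699-2451759 range in the generated query; joining through date_dim, as the ground truth does, avoids this arithmetic entirely.

```python
from datetime import date, timedelta

# Offset from Python's proleptic Gregorian ordinal to the Julian Day
# Number: date(2000, 1, 1) has JDN 2451545 and ordinal 730120.
JDN_OFFSET = 1721425

def date_sk(d: date) -> int:
    """Map a calendar date to a TPC-DS d_date_sk, assuming the
    surrogate key equals the Julian Day Number."""
    return d.toordinal() + JDN_OFFSET

start = date(2000, 5, 25)
end = start + timedelta(days=60)
print(date_sk(start), date_sk(end))  # 2451690 2451750
```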

Query Generation Time:

With respect to average query generation time for these more complex queries, Claude 3 took ~25.2% longer than GPT-4.

Conclusion:

In summary, Claude 3 shows strong natural-language-to-SQL performance compared to most other LLMs we have looked at in the past. Congratulations to the Anthropic team; this is a remarkable achievement.

The reigning champion, GPT-4, still has an advantage in both more complex queries and overall generation speed. However, the margin is closing, and for simpler SQL workloads Claude 3 is a viable alternative.

Having models like Claude 3 and GPT-4 compete is essential for pushing forward the boundaries of what's possible in natural language to SQL processing with LLMs. It not only drives improvements in model accuracy and efficiency but also gives users a variety of tools that better fit their specific needs.
