Challenges in Enterprise-Level LLM-Based Natural Language to SQL (NL2SQL)

Giovanni Gomes Guerreiro
6 min read · May 10, 2024


Networks on chips (AI Generated Image on Stable Diffusion)

Natural Language to SQL (NL2SQL) is one of the most promising applications of Large Language Models. In today’s data-centric world, abundant data is structured in table form. Tasks like database queries, spreadsheet computations, or creating reports from web tables are routine for pretty much everyone.

At this point, there’s no doubt that LLMs can write SQL to a high level of accuracy. They have shown state-of-the-art performance on several Text-to-SQL benchmarks such as Spider¹ and Bird².

However, their performance on enterprise SQL databases remains questionable.

In this first article, I will delve into the results of utilizing Vanilla GPT-4 in a zero-shot setting to generate SQL queries from business questions using a typical orders table.

Experiment Set-up

The queries were executed on an Orders Table consisting of 16 columns. It features both categorical and numerical data, mirroring a standard orders table typically used in e-commerce companies.

The prompt originated from the OpenAI text-to-SQL demonstration in a zero-shot setting, using the GPT-4 Turbo model.

### Generate Google Big Query SQL query only and with no explanation
### Here is the SQL table, with their properties:
#
# orders_table('order_id', 'purchase_date', 'buyer_email', 'market',
# 'child_asin', 'e_conomic_number', 'product_marketing_category',
# 'product_name', 'product_pack', 'product_and_pack', 'product_category',
# 'product_type', 'product_size', 'product_colour', 'gross_sales',
# 'units_sold')
#
### What was the total revenue generated yesterday?

The questions mainly covered commonly used financial metrics (sales, units sold, number of orders, and basket analysis) and growth metrics (Year-over-Year and Week-over-Week). Below are some of the posed questions. They varied in complexity: some only required familiarity with a column name, while others required understanding a growth metric and accurately mapping it to a product category.

  • What was the total revenue generated yesterday?
  • How many units of Headwear were sold in the previous 4 days?
  • What is the YoY sales growth of Male Underwear for each month last year?

A Streamlit app prototype was set up, and all question-and-answer pairs were logged using LangSmith.

Each result was then classified as follows:

  • Schema Linking: Incorrect mapping of the user’s question to the database column. For instance, a request for “product_name” yields “product_category”.
  • Question Misinterpretation: Misunderstanding of the user’s intent. For example, asking for “last month” but getting data for the last 30 days.
  • Context Misinterpretation: Incorrect representation of context. For instance, asking for “n orders” but receiving the number of lines.
  • Syntax Error: Failure of the system to generate a functioning query.
  • Faulty Logic: Sub-optimal query logic potentially leading to errors. For instance, using an incorrect method to calculate Year over Year (YoY).
  • Correct: The query was executed correctly and functioned as expected.

LLMs struggle in all areas of NL2SQL

Performance of GPT-4 on the custom eval. (Image by Author)

The results reveal that GPT-4 struggled significantly, answering only 15.6% of the questions correctly.

1. Schema Linking

As anticipated, given the simplicity of the prompt, which included only column names and no column mapping, schema_linking had the highest error rate at 22%. Multiple questions referenced a specific product, which confused the LLM when mapping it to the correct product_x column.

Core Conclusion 1: We need to inform the LLM about the column it should map to. However, due to the limitation of the context window, we can’t pass a mapping of all possible key-value pairs. Multiple avenues have been explored, most notably Retrieval Augmented Generation³.
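
As a purely hypothetical sketch, a retrieval step could enrich the prompt with short column descriptions and the values relevant to the question, rather than the bare column list. The descriptions and retrieved value below are invented for illustration:

### Here is the SQL table, with their properties:
#
# orders_table(
#   'product_category',  -- high-level category, e.g. 'Headwear', 'Running Wear'
#   'product_type',      -- finer-grained product type within a category
#   'units_sold',        -- units sold on this order-item line
#   'gross_sales'        -- revenue for this order-item line
# )
#
### Values retrieved for this question:
# product_category = 'Headwear'

Restricting the schema to the retrieved columns keeps the prompt within the context window while giving the model the mapping it needs.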

2. Question misinterpretation

question_misinterpretation was another category in which the model struggled significantly. For some obscure reason, the model struggles to interpret “last x” date expressions. For example, for “last month”, instead of selecting the previous completed calendar month, it selects the previous 30 days, which leads to a completely different response. This happened consistently across multiple questions.

Question: What was the most popular product category sold last month?

SELECT product_category, SUM(units_sold) AS total_units_sold
FROM `orders_table`
WHERE purchase_date
  -- the trailing month ending today, not the previous calendar month
  BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH) AND CURRENT_DATE()
GROUP BY product_category
ORDER BY total_units_sold DESC
LIMIT 1;
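
For comparison, a version that matches the calendar-month reading of “last month”, using the same DATE_TRUNC/LAST_DAY pattern the model itself produces in the basket-size example later in this article:

SELECT product_category, SUM(units_sold) AS total_units_sold
FROM `orders_table`
WHERE purchase_date
  BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH)  -- first day of last month
  AND LAST_DAY(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH))               -- last day of last month
GROUP BY product_category
ORDER BY total_units_sold DESC
LIMIT 1;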

The interesting part is that if you ask ChatGPT through the browser, you will get a correct representation of “last month”.

Prompting Example (Image by Author)

Core Conclusion 2: LLMs are inconsistent in interpreting date ranges.

3. Faulty Logic

LLMs know how to write SQL, but what about robust SQL code? The results show that the model struggled significantly to do so, and this is an area where more information about the columns wouldn’t help.

For example, when asked about any period-over-period logic (e.g. WoW, YoY), the model often defaults to LAG window functions. While this may seem logical at first glance, LAG can yield incorrect results on incomplete data, because it operates on the rows that exist rather than on the calendar.

In the example below, the calculation produces silently wrong numbers. Although we’re interested in the immediately preceding week (WoW), if that week has no row, LAG jumps to the most recent week that does. This approach can be disastrous for such a metric.

Question: What was the WoW sales performance of Running Wear in the UK and Germany during the last three months?

-- [truncated for simplicity]
week_over_week AS (
  SELECT
    market,
    week_start_date,
    total_gross_sales,
    total_units_sold,
    LAG(total_gross_sales, 1)
      OVER (PARTITION BY market ORDER BY week_start_date) AS prev_week_gross_sales,
    LAG(total_units_sold, 1)
      OVER (PARTITION BY market ORDER BY week_start_date) AS prev_week_units_sold
  FROM weekly_sales
)
-- [truncated for simplicity]
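
A more robust pattern, sketched below under the assumption that weekly_sales holds one row per market and week_start_date, is a self-join keyed to exactly seven days earlier. If the immediately preceding week is missing, the previous-week columns come back NULL instead of silently borrowing an older week:

-- [truncated for simplicity]
week_over_week AS (
  SELECT
    cur.market,
    cur.week_start_date,
    cur.total_gross_sales,
    cur.total_units_sold,
    prev.total_gross_sales AS prev_week_gross_sales,  -- NULL when last week has no data
    prev.total_units_sold AS prev_week_units_sold
  FROM weekly_sales AS cur
  LEFT JOIN weekly_sales AS prev
    ON prev.market = cur.market
    AND prev.week_start_date = DATE_SUB(cur.week_start_date, INTERVAL 7 DAY)
)
-- [truncated for simplicity]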

Core Conclusion 3: Even though LLMs know how to code in SQL, they struggle to output robust code. This is especially critical for sales metrics such as Year-over-Year (YoY) and other period-over-period calculations.

4. Syntax Error

The model repeatedly output queries with syntax errors, despite the questions being relatively straightforward and requiring only one table. On several occasions, the model made up functions even though the prompt specified Google BigQuery syntax.

Question: Can you report the total number of orders placed during the last weekend?

SELECT COUNT(*) AS total_orders_last_weekend
FROM `orders_table`
WHERE purchase_date
  BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL WEEKDAY(CURRENT_DATE()) + 2 DAY)
  AND DATE_SUB(CURRENT_DATE(), INTERVAL WEEKDAY(CURRENT_DATE()) - 1 DAY);

A query error occurred: Function not found: WEEKDAY at [3:63] at location query.
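
BigQuery has no WEEKDAY() function; day-of-week logic goes through EXTRACT(DAYOFWEEK FROM ...) or a DATE_TRUNC week boundary. A working sketch using the latter, assuming “last weekend” means the most recent completed Saturday and Sunday, and counting distinct orders rather than order-item rows:

SELECT COUNT(DISTINCT order_id) AS total_orders_last_weekend
FROM `orders_table`
WHERE purchase_date BETWEEN
  DATE_SUB(DATE_TRUNC(CURRENT_DATE(), WEEK(MONDAY)), INTERVAL 2 DAY)  -- last Saturday
  AND DATE_SUB(DATE_TRUNC(CURRENT_DATE(), WEEK(MONDAY)), INTERVAL 1 DAY);  -- last Sunday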

Core Conclusion 4: Despite GPT-4’s strong performance on complicated questions, as evidenced by multiple benchmarks, it encounters issues when answering simple ones.

5. Context Misinterpretation

Lastly, the model failed to recognize the context of certain metrics. For example, basket size should be calculated as SUM(units_sold) / COUNT(DISTINCT order_id), but the model incorrectly used the AVG function. This error stems from a lack of understanding of the table structure, which is organized at the order-item line level rather than the order level.

Question: What was the average basket size last month?

SELECT AVG(units_sold) AS average_basket_size
FROM `orders_table`
WHERE purchase_date
BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH)
AND LAST_DAY(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH))
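
Because the table holds one row per order-item line, AVG(units_sold) returns the average line quantity, not the basket size. A sketch of the intended metric, applying the formula above over the same date range:

SELECT SUM(units_sold) / COUNT(DISTINCT order_id) AS average_basket_size
FROM `orders_table`
WHERE purchase_date
  BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH)
  AND LAST_DAY(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH));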

Core Conclusion 5: Although LLMs have demonstrated a remarkable ability to compute complex metrics, they often struggle to interpret basic ones.

The naive approach is far from sufficient.

Serving analytics is a complex and sensitive matter. A minor error in the SQL code can lead to significantly misleading insights. It’s evident that the naive approach (schema + question) is not enough, particularly when high confidence (well above 15.6%) is required to deploy software that impacts decision-making. In the upcoming articles, I will discuss the strategies I’ve employed to improve the accuracy and steerability of LLM-based NL2SQL frameworks, including:

  • Semantic Routing;
  • Addressing Data and Query Ambiguity;
  • The necessity of a Metrics Layer;
  • AI Steerability.

References

[1] — Pourreza, M., & Rafiei, D. (2023). DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. http://arxiv.org/abs/2304.11015

[2] — Xia, H., Jiang, F., Deng, N., Wang, C., Zhao, G., Mihalcea, R., & Zhang, Y. (2024). SQL-CRAFT: Text-to-SQL through Interactive Refinement and Enhanced Reasoning. http://arxiv.org/abs/2402.14851

[3] — Guo, C., Tian, Z., Tang, J., Li, S., Wen, Z., Wang, K., & Wang, T. (2023). Retrieval-augmented GPT-3.5-based Text-to-SQL Framework with Sample-aware Prompting and Dynamic Revision Chain. http://arxiv.org/abs/2307.05074
