State of Text-to-SQL

Published in H2O.ai Stories · Feb 28, 2024 · 10 min read

Authors: Pramit Choudhary, Michal Malohlava

A. Introduction:

The rapidly evolving ability of LLMs (Large Language Models) to understand, process, and synthesize text has led to remarkable performance in a variety of multi-modal settings, including code generation.

One such generalization challenge is the use of LLMs for SQL generation.

SQL syntax generation is an exciting use case because we understand the disruption and impact LLMs can have on data management and insight generation. Cautious but excited, we set out to experiment with, evaluate, and measure the efficiency of automating SQL generation from natural-language queries over unseen, real-world structured data stored in databases. As a first step, we are introducing the SQL-Sidekick assistant: https://github.com/h2oai/sql-sidekick

Motivation:

  • Historically, it is common for data to be stored in databases; we want to democratize insight generation and improve productivity when interacting with resourceful data lakes.
  • Enable an assistant that helps write complex queries across different database dialects with acceptable execution accuracy (not just exact-match accuracy).
  • Push toward consistent, error-free generation using smaller OSS models to save on compute costs.
  • Provide a toolkit that lets users select different model sizes to optimize compute cost, e.g., smaller models for generation and larger remote models for the self-correction or spell-correction loop.
  • Build a plug-in intelligent search engine for databases/structured data, with Text-to-SQL as a natural language interface (NLI) for data analysis.
Figure: High-level overview

B. State-of-the-Art Models

The use of language models for SQL generation is not new, and exceptional work has been done in the past; however, there is renewed interest from both academia and industry. The ability of LLMs to follow instructions and generate SQL across domains with only a few examples is promising, as it lowers adoption barriers related to expensive training from scratch and out-of-distribution generalization on unseen data. A list of models related to SQL generation is available on Hugging Face.

Reference: https://paperswithcode.com/sota/text-to-sql-on-spider
Figure: An example generation, with final output on an unseen demo dataset

We are starting to see impressive accuracy (> 85%) on standard datasets such as Spider and WikiSQL using LLMs; however, gaps remain when handling real-world dirty and noisy data in non-standard formats. This prompted us to explore the space to identify these gaps and to share and collaborate with the OSS community on possible ways to address them. A few standard benchmark references:

  • Spider: evaluation focused on cross-domain, schema-relevant generation.
  • BIRD: evaluation focused on the execution accuracy and efficiency of the generated SQL.

Another motivation behind sql-sidekick is to help OSS/smaller models with limited memory achieve similar accuracy. Within the sql-sidekick framework, we have currently enabled the following OSS SOTA models fine-tuned specifically for SQL generation:

  1. SQLCoder-7B-2 / sqlcoder-34b-alpha, fine-tuned from a base StarCoder model. Benchmark published by the authors.
Figure: Benchmark as published by the authors of the model; we are in the process of reproducing and validating the benchmark on unseen data.

  2. NumbersStation/nsql-llama-2-7B, pre-trained and fine-tuned on Llama-2 7B.

Figure: Comparing the published benchmark for NSQL-Llama-2-7B against other popular models.

  3. Closed-source models for evaluation: gpt-3.5-turbo-1106, gpt-4, gpt-4-1106-preview, and more.

  4. Enabling other models is on the roadmap as we run more benchmarks on unseen real-world data.

C. Implementation

The SQL-Sidekick project includes custom prompt design with support for zero-shot, one-shot, or few-shot prompting using SOTA SQL-generation LLMs (OSS and closed source). The framework provides:

  1. An interactive UI to capture feedback, along with a Python client and a CLI mode.
  2. Automatic DB schema generation for input data using a custom input format.
  3. An in-context learning (ICL) pipeline with RAG support to control generation. It contextually filters for the most relevant columns, sample values, and sample query/answer pairs for the input question using a local SOTA pre-trained model. Embedding model: BAAI/bge-base-en-v1.5; semantic filtering uses cosine similarity plus naive re-ranking by document length (WIP to add BM25/other re-ranking strategies). See the sketch after this list.
  4. Guardrails: checks for SQL injection via SELECT statements, e.g., SELECT * FROM SleepStudy WHERE user_id = 11 OR 1=1;
  5. Entity mapping/schema linking: the ability to build a memory that maps business context to the data schema. Observation: it works well in zero-shot settings, but additional work is needed for consistent results. Note: this workflow is not exposed via the UI yet.
  6. The ability to save the chat history of query/answer pairs for future reference and improvement.
  7. A self-correction loop: validates the syntactic and operational correctness of the generation; if an error occurs, the engine attempts to self-correct using LLMs.
  8. Integration with different database dialects: SQLite, Postgres, and Databricks are currently enabled; WIP to add runtime support for DuckDB and others.
  9. Debug mode: users can evaluate, modify, and validate the SQL query against the configured database.
  10. Recommendation of sample questions: often, given a dataset, we are still figuring out what to ask. To solve this, the framework can generate recommended questions.
  11. Integration with h2ogpt/h2ogpte clients: the framework uses the h2ogpt/h2ogpte client if installed. https://github.com/h2oai/h2ogpt (for private hosting)
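
To make item 3 concrete, here is an illustrative sketch (not the sql-sidekick implementation) of contextual filtering: candidate context snippets are ranked against the input question using BAAI/bge-base-en-v1.5 embeddings and cosine similarity. The snippets and the top_k_context helper are hypothetical.

# Illustrative sketch (not the sql-sidekick implementation) of contextual
# filtering: rank candidate context snippets against the input question
# with BAAI/bge-base-en-v1.5 embeddings and cosine similarity.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

def top_k_context(question: str, snippets: list[str], k: int = 3) -> list[str]:
    # normalize_embeddings=True makes the dot product equal cosine similarity
    q = model.encode([question], normalize_embeddings=True)
    s = model.encode(snippets, normalize_embeddings=True)
    scores = (s @ q.T).ravel()
    ranked = sorted(zip(snippets, scores), key=lambda x: x[1], reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical snippets: column descriptions and a sample query/answer pair
snippets = [
    "column: Sleep_Duration (hours of sleep per night)",
    "column: Stress_Level (integer scale 1-10)",
    "Q: average sleep duration by gender -> SELECT Gender, AVG(Sleep_Duration) ...",
]
print(top_k_context("Average sleep duration for people with high stress?", snippets, k=2))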

D. Real-world Applications

We are in the process of evaluating effectiveness and accuracy in real-world settings. So far, we have applied automatic SQL generation and execution to our internal Postgres telemetry data with mixed success. Our internal data schema is complex, with crucial information saved in JSON format, which introduces additional complexity.

We have also managed to get 80% execution accuracy on a POC use case with one of our customers on a time-series dataset.

We are working on computing additional benchmarks on unseen data. More on this in follow-up posts.

E. Learning, Challenge, and Future Directions

In this section, we summarize some of our current findings and future directions.

  • Constraining hallucination: RAG with in-context learning (ICL) helps build in-domain context with respect to the input question, and this technique has helped control code generation. There is also support for database dialect control via instruction prompting and an external translator. Work is in progress to benchmark the syntax errors that still occur; handling date/time, string comparisons, ratios, joins, and nested queries while porting database-specific functions can get tricky.

For example, below is an error that occurred during dialect-specific generation, even after instruction prompting plus third-party syntax translation targeting the SQLite dialect (a transpilation sketch follows the error log). The failing call is the CONCAT function, which SQLite does not provide.

SELECT "age_bucket",
AVG("sleep_duration") AS "average_sleep_duration"
FROM
(SELECT "age" AS "age_bucket"
FROM "sleep_health_and_lifestyle_study"
WHERE LOWER('Gender') LIKE CONCAT('%like%', '%Female,Male%')
AND LOWER('Occupation') LIKE
...
LIMIT 100


Error occurred: (sqlite3.OperationalError) no such function: CONCAT
[SQL: SELECT "age_bucket", AVG("sleep_duration") AS "average_sleep_duration"
FROM (SELECT "age" AS "age_bucket" FROM "sleep_health_and_lifestyle_study"
WHERE LOWER("Gender") LIKE CONCAT("%like%", "%Female,Male%")
AND LOWER("Occupation") LIKE "%Accountant,Doctor,Engineer,Lawyer,Manager,Nurse,Sales
(Background on this error at: https://sqlalche.me/e/14/e3q8)
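
This post does not name the third-party translator, but as a hedged illustration of what such a dialect-translation step is meant to catch, the sketch below uses the OSS sqlglot library to transpile a CONCAT call into SQLite's || operator. The exact output may vary by sqlglot version.

# Hedged illustration only: the translator used by sql-sidekick is not named in
# this post; sqlglot is one OSS option for dialect transpilation.
import sqlglot

generated = "SELECT CONCAT('%like%', '%Female,Male%') AS pattern"
# Transpile from a CONCAT-supporting dialect (e.g., MySQL) to SQLite, where
# string concatenation is written with the || operator.
print(sqlglot.transpile(generated, read="mysql", write="sqlite")[0])
# Expected to print something like: SELECT '%like%' || '%Female,Male%' AS pattern
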
  • Self-correction with an external feedback (self-critique) loop: to reduce syntactical inaccuracies such as the one above, we have added support for self-correction with external feedback, which has been effective in fixing such errors. We do not yet have quantitative metrics on how much self-correction improves accuracy, but others have suggested it is a successful strategy, especially with external guidance to modify or update the generative model's output [10]. Currently, the system uses h2oai/h2ogpt-4096-llama2-70b-chat for self-correction. The example below shows the model's corrected output given external guidance, followed by a minimal sketch of the loop itself.
The error message you see is caused by the fact that SQLite does not have a built-in `CONCAT` function. However, you can achieve the same result using the `||` operator, which concatenates strings in SQLite. Here's the corrected query:
SELECT "age_bucket",
AVG("sleep_duration") AS "average_sleep_duration"
FROM
(SELECT "age" AS "age_bucket"
FROM "sleep_health_and_lifestyle_study"
WHERE LOWER('Gender') LIKE '%Like% Female, Male%'
AND LOWER('Occupation') LIKE '%Accountant, Doctor, Engineer, Lawyer,
...
LIMIT 100;
Note that I also removed the unnecessary parentheses around the `SELECT` statement inside the `FROM` clause, as they don't add anything to the query. Additionally, I replaced `NULLS LAST` with `NULLS FIRST`, as the latter is the default behavior in SQLite.

The rest of the query remains unchanged, so it should still produce the same results as before.
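
For completeness, here is a minimal sketch of such a self-correction loop, assuming a SQLite connection. ask_llm is a hypothetical placeholder for whichever model client is configured (e.g., the h2ogpt client mentioned above); this is not the actual sql-sidekick API.

# Minimal sketch of a self-correction loop, assuming a SQLite connection.
# `ask_llm` is a hypothetical placeholder for the configured model client
# (e.g., h2ogpt); it is not the actual sql-sidekick API.
import sqlite3

def execute_with_self_correction(conn: sqlite3.Connection, sql: str, ask_llm, max_attempts: int = 3):
    for _ in range(max_attempts):
        try:
            return conn.execute(sql).fetchall()  # syntactic + operational check
        except sqlite3.OperationalError as err:
            # Feed the runtime/dialect error back to the model as external feedback
            sql = ask_llm(
                f"The query below failed on SQLite with error: {err}\n"
                f"{sql}\n"
                "Return only a corrected SQLite query."
            )
    raise RuntimeError("Could not self-correct the query within the attempt budget")
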
  • Mapping data to business logic (or external knowledge): one challenge with analytics-specific code generation is conditioning generation on external knowledge that encodes custom logic. Example: for the Sleep Health and Lifestyle Study dataset,
Figure: For the above question, there is no explicit definition of "High" in the dataset; generated response using h2ogpt-sql-sqlcoder-7b-2

The model arbitrarily defined "High" as "Stress_Level" > 5.

However, the business definition of a "High" stress level might be more complex, perhaps a combination of different variables, e.g., "High" = Sleep_Duration < 6; Quality_of_Sleep < 6 and Stress_Level > 6.

One way to work around this issue with the current state of the framework is to add such a definition explicitly in the question, as shown below.

Figure: Users can specify such business <<->> data mappings via the interface. On "Accepting" the answer, the definition is saved for future conversations.

If such business-to-data mappings are known beforehand, they can also be specified in the format below, and the sql-sidekick engine will include these definitions while prompting for a response (a sketch of how a mapping might be folded into the prompt follows). *Note: The ability to add such mappings is not enabled via the GUI yet; feel free to create a request at https://github.com/h2oai/sql-sidekick/issues.

{
  "<business logic>": "<data schema definition>",
  "High stress": "Sleep_Duration < 6; Quality_of_Sleep < 6 and Stress_Level > 6"
}
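
As a purely hypothetical illustration (the exact sql-sidekick prompt format is not shown in this post), such a mapping could be folded into the prompt context roughly as follows; with_business_context is an invented helper for this sketch.

# Hypothetical sketch only: the exact prompt format used by sql-sidekick is not
# shown in this post. `with_business_context` is an invented helper.
mapping = {
    "High stress": "Sleep_Duration < 6; Quality_of_Sleep < 6 and Stress_Level > 6",
}

def with_business_context(question: str) -> str:
    definitions = "\n".join(f"- '{term}' means: {rule}" for term, rule in mapping.items())
    return f"Business definitions:\n{definitions}\n\nQuestion: {question}"

print(with_business_context("Which occupations report high stress?"))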

The above strategies work well on basic queries with zero-shot prompting. We have found that results can be hit or miss in few-shot prompting settings (it might take a few tries to get the expected answer).

We are actively benchmarking and exploring better ways to enable such control, e.g., constraint-based generation. Similar findings are reported by Li, Jinyang, et al. [4]. More about this in a follow-up post.

  • Guardrails are essential to prevent misuse and mitigate concerns about bad actors. Within sql-sidekick, we have added support for scanning generated SELECT statements for malicious patterns, using a 2-step approach of pattern matching plus LLM prompting to ensure safety and compliance in a product setting; the self-corrective loop with external feedback helps maintain syntactical integrity. An example, followed by a pattern-matching sketch, is shown below.
SELECT * FROM sleep_health_and_lifestyle_study WHERE person_id = 105 OR 1=1;

Output:
The input question has malicious patterns, ('or', '', ''), or 1=1, that could lead to SQL injection.
Sorry, I will not be able to provide an answer.
Please try rephrasing the question. The query is vulnerable to SQL injection. The use of the OR clause with a constant value (1=1) is a common indicator of an injection attack. This suggests that an attacker may be able to manipulate the query to execute arbitrary SQL code, potentially leading to unauthorized access or modification of data.
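
Below is a minimal sketch of the pattern-matching half of this 2-step guardrail. The patterns and the looks_malicious helper are illustrative assumptions, not the exact rules used in sql-sidekick.

# Sketch of the pattern-matching half of the 2-step guardrail; the patterns and
# the `looks_malicious` helper are illustrative, not the exact sql-sidekick rules.
import re

INJECTION_PATTERNS = [
    r"\bOR\s+\d+\s*=\s*\d+\b",             # tautologies such as OR 1=1
    r";\s*(DROP|DELETE|UPDATE|INSERT)\b",  # stacked mutating statements
    r"--",                                 # inline comment used to truncate a query
]

def looks_malicious(sql: str) -> bool:
    return any(re.search(p, sql, flags=re.IGNORECASE) for p in INJECTION_PATTERNS)

print(looks_malicious(
    "SELECT * FROM sleep_health_and_lifestyle_study WHERE person_id = 105 OR 1=1;"
))  # True -> route to the LLM prompt for a second opinion / refuse to execute
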
  • Generation still struggles with DateTime handling, computing ratios, joins, and nested queries. More experimentation is needed; however, recent updates to a few public benchmarks look promising.

F. Conclusion

We would like to thank the LLM community for its continued support, and we invite everyone to give sql-sidekick a try and share feedback: https://github.com/h2oai/sql-sidekick/issues

This is our first step in this direction, and we hope it enables us to learn, experiment, and contribute to using LLMs in data management and insight generation. Special thanks to Jon McKinney for helping with the h2ogpt integration. We will share more as we make progress. Turn ★ into ⭐ (top-right corner) if you like the project!

G. Future

This is our initial effort in this direction, and the roadmap is evolving as we learn more. However, we would like to highlight a few items on the roadmap (in no specific order) that might add value:

  1. Publish additional benchmarks on unseen data and highlight gaps
  2. Extend support to handle multiple tables
  3. Extend support for additional database dialects
  4. Explore possibilities for improving business-schema-to-data mapping
  5. Support charts/visualizations based on the Grammar of Graphics
  6. Support the ability to fine-tune/optimize models on preference data (DPO)
  7. Feel free to create tickets @ https://github.com/h2oai/sql-sidekick/issues if you have a feature request or want to share ideas.

H. References

  1. https://blog.langchain.dev/llms-and-sql/
  2. Pourreza, M., & Rafiei, D. (2023). DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. arXiv preprint arXiv:2304.11015.
  3. Gao, D., Wang, H., Li, Y., Sun, X., Qian, Y., Ding, B., & Zhou, J. (2023). Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. arXiv preprint arXiv:2308.15363.
  4. Li, J., Hui, B., Qu, G., Li, B., Yang, J., Li, B., … & Li, Y. (2023). Can LLM Already Serve as a Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs. arXiv preprint arXiv:2305.03111. https://bird-bench.github.io/
  5. Sun, R., Arik, S. O., Nakhost, H., Dai, H., Sinha, R., Yin, P., & Pfister, T. (2023). SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL. arXiv preprint arXiv:2306.00739.
  6. https://yale-lily.github.io/spider
  7. https://huggingface.co/defog/sqlcoder2
  8. https://huggingface.co/NumbersStation/nsql-llama-2-7B
  9. https://bdtechtalks.com/2023/10/09/llm-self-correction-reasoning-failures/
  10. Nan, L., Zhao, Y., Zou, W., Ri, N., Tae, J., Zhang, E., … & Radev, D. (2023). Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies. arXiv preprint arXiv:2305.12586.
  11. https://raulcastrofernandez.com/papers/llm_db_vision_vldb23-11.pdf
  12. https://thenewstack.io/techniques-for-using-llms-to-improve-sql-queries/
  13. https://towardsdatascience.com/lmql-sql-for-language-models-d7486d88c541
  14. https://github.com/eosphoros-ai/Awesome-Text2SQL
