Enhancing Database Querying Using Large Language Models

Rosemary J Thomas, PhD
Version 1
Sep 7, 2023

Pre-trained large language models (LLMs) are being used to generate code based on natural language instructions. However, these models often generate code that doesn’t follow the syntax and rules of the programming language, limiting their practicality. Database querying is a time-consuming and error-prone task, often requiring experts due to its complexity. Attempts to develop systems for non-experts to query databases directly have shown potential but lack widespread implementation.

Advancements in LLMs like Codex, ChatGPT, and GPT have impacted AI, including tasks like converting text to SQL queries (Text-to-SQL). While LLMs show promise in generating SQL queries, poorly designed prompts hamper performance and lead to irrelevant results. Current methods use limited context from the data to guide LLMs, but more sophisticated use of context is needed. Let us now explore recent studies that investigate ways to mitigate these challenges.

Reasoning LLM-based frameworks

a) Table-based Reasoning framework

Dater (Ye et al.), combined with the GPT-3 Codex model, employs a self-consistent decoding strategy. It is a framework for reasoning over tables involving four steps: in-context learning, evidence decomposition, question decomposition, and joint reasoning. The test datasets and evaluation metrics used are summarised in the table below:

Table 1. Overview of the datasets used, tasks, metric, performance, improvement and achievements for the different models

The Dater framework effectively improved reasoning abilities for table-based tasks, demonstrating its potential for enhancing language models’ performance on complex tabular data. It would be interesting to investigate how the decomposers will perform on a larger dataset.
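To make the four-step pipeline above more concrete, here is a minimal Python sketch of a Dater-style decompose-then-reason loop. The call_llm helper and the prompt templates are illustrative placeholders rather than the paper's actual prompts, and the self-consistent decoding step is omitted for brevity.

```python
# A minimal, illustrative sketch of a Dater-style decompose-then-reason loop.
# `call_llm` is a hypothetical helper standing in for any completion API
# (e.g. a Codex-style model); the prompt templates are simplified placeholders.

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM completion endpoint."""
    raise NotImplementedError

def dater_answer(table: str, question: str) -> str:
    # Step 1: evidence decomposition - ask the model which rows and columns
    # matter, producing a smaller sub-table as evidence.
    sub_table = call_llm(
        f"Table:\n{table}\n\nQuestion: {question}\n"
        "Select only the rows and columns needed to answer the question."
    )

    # Step 2: question decomposition - break a multi-hop question into
    # simpler sub-questions the model can answer one at a time.
    sub_questions = call_llm(
        f"Question: {question}\nDecompose this into simpler sub-questions, one per line."
    ).splitlines()

    # Step 3: joint reasoning - answer using the sub-table and sub-questions
    # as in-context evidence.
    return call_llm(
        f"Sub-table:\n{sub_table}\n\nSub-questions:\n" + "\n".join(sub_questions)
        + f"\n\nOriginal question: {question}\nAnswer:"
    )
```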

b) Case-Based Reasoning-based framework

CBR-ApSQL (Guo et al.) combined with GPT-3.5 leverages the capabilities of LLMs to manage pertinent and non-pertinent knowledge for cross-domain Text-to-SQL tasks using dynamic prompting. The approach consists of three main steps: retrieve, reuse and revise and expands beyond the Text-to-SQL prompt limitations to enhance results.
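As a rough illustration of the retrieve, reuse and revise loop, the sketch below builds a dynamic prompt from retrieved cases. The Case store, the toy similarity function, and the prompt layout are all assumptions made for illustration, not the paper's implementation.

```python
# A rough sketch of a retrieve-reuse-revise loop for case-based Text-to-SQL
# prompting. The case store, similarity measure, and prompt layout are
# illustrative assumptions, not the CBR-ApSQL implementation.

from dataclasses import dataclass

@dataclass
class Case:
    question: str
    sql: str

def similarity(a: str, b: str) -> float:
    """Toy lexical similarity; a real system would use semantic embeddings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def build_prompt(case_base: list[Case], question: str, schema: str, k: int = 3) -> str:
    # Retrieve: pick the k past cases most similar to the new question.
    cases = sorted(case_base, key=lambda c: similarity(c.question, question), reverse=True)[:k]
    # Reuse: place the retrieved cases into the prompt as demonstrations.
    demos = "\n\n".join(f"Q: {c.question}\nSQL: {c.sql}" for c in cases)
    return f"{demos}\n\nSchema:\n{schema}\nQ: {question}\nSQL:"

# Revise: the generated SQL would then be checked (for example by executing it
# against the database) and corrected before being added back to the case base.
```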

The experiments use three public benchmark datasets:

Spider (Yu et al.): Text-to-SQL benchmark spanning 138 different domains with 200 databases.

Spider-Syn: variant that challenges models with synonym substitutions, removing explicit alignment between questions and database schema.

Spider-DK: variant with artificially added domain knowledge.

The evaluation process focuses on three key metrics: valid or error-free SQL execution (VA), execution result accuracy (EX), and test-suite accuracy/effectiveness for code coverage (TS), prioritizing semantic accuracy over exact match accuracy (EM) due to variation in expressing SQL queries.
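For intuition, the snippet below shows how the execution-based metrics can be checked against a SQLite copy of a benchmark database. Test-suite accuracy (TS) essentially repeats the EX check over many database variants and is omitted here; the helper names and simplified result comparison are my own.

```python
# Illustrative checks for the execution-based metrics, assuming a SQLite copy
# of the benchmark database. Simplified for clarity.

import sqlite3
from collections import Counter

def execute(db_path: str, sql: str):
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()

def valid_execution(db_path: str, predicted_sql: str) -> bool:
    """VA: does the predicted SQL run without raising an error?"""
    try:
        execute(db_path, predicted_sql)
        return True
    except sqlite3.Error:
        return False

def execution_accuracy(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    """EX: does the predicted SQL return the same rows as the gold SQL?"""
    try:
        return Counter(execute(db_path, predicted_sql)) == Counter(execute(db_path, gold_sql))
    except sqlite3.Error:
        return False
```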

CBR-ApSQL does well on most tasks compared with the T5-3B models; only one model, RESDSQL-3B (tested on the Spider dataset), is marginally better, by a 0.1% difference. On the valid or error-free SQL execution (VA) metric, it scores 99.1% on the Spider dataset (see figure below). CBR-ApSQL also improved how well it captures both the meaning and the structure of language, making it a strong contender for Text-to-SQL tasks.

Figure 1. Comparison of the performance of CBR-ApSQL vs other models

CBR-ApSQL’s adaptive retrieval mechanism (Hyperbolic SDRE, or Semantic Domain Relevance Evaluator) performs best across datasets and metrics (see figure below). The adaptive retrieval enhances semantic relevance, leading to better SQL generation and performance. CBR-ApSQL’s de-semantic approach mitigates the impact of complexity in the Spider-DK dataset by reducing the effect of irrelevant knowledge.

Figure 2. Comparison of SQL generation performance using various strategies to create prompts for LLM

The model has been tested rigorously, but only on a relatively small set of 200 databases. The researchers have demonstrated accuracy improvements using CBR-ApSQL and its adaptive retrieval mechanism, but additional validation and assessment are required to confirm that the results hold on a larger dataset.

LLM-based Text-to-SQL model

SQL-PaLM (Sun et al.) is a framework for enhancing the performance of LLMs, such as PaLM-2, on the task of Text-to-SQL conversion. The framework leverages both few-shot adaptation and fine-tuning to improve the model’s performance on this specific task. This approach involves prompting the model to generate SQL queries from demonstrations and context, while also refining its performance through domain-specific fine-tuning.
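The sketch below contrasts the two adaptation routes in simplified form. Neither the prompt layout nor the fine-tuning record format is taken from the paper; both are illustrative assumptions, and the demonstration example uses a hypothetical singer table.

```python
# A simplified sketch of the two adaptation routes: few-shot prompting with
# demonstrations versus building fine-tuning records. Formats are illustrative.

FEW_SHOT_PROMPT = """\
-- Demonstration
Schema: singer(singer_id, name, country, age)
Question: How many singers are there?
SQL: SELECT count(*) FROM singer

-- New task
Schema: {schema}
Question: {question}
SQL:"""

def few_shot_prompt(schema: str, question: str) -> str:
    """Few-shot adaptation: demonstrations plus the new schema and question."""
    return FEW_SHOT_PROMPT.format(schema=schema, question=question)

def finetuning_record(schema: str, question: str, gold_sql: str) -> dict:
    """Fine-tuning: each training example pairs the task input with its gold SQL."""
    return {"input": f"Schema: {schema}\nQuestion: {question}", "target": gold_sql}
```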

The experiments use four public benchmark datasets: the three used for CBR-ApSQL above (Spider, Spider-Syn, and Spider-DK) plus Spider-Realistic, which includes 508 text-SQL pairs from the Spider dev split with explicit column name mentions removed.

Two common evaluation metrics are used: execution accuracy (EX) and test-suite accuracy (TS). EX measures whether the executed SQL result matches the ground truth, while TS measures whether the query passes the EX check across multiple test database variants.

The proposed Few-shot SQL-PaLM model achieves strong performance on Text-to-SQL tasks compared to state-of-the-art methods. The Few-shot SQL-PaLM model attains 77.3% test-suite accuracy on the Spider dev split. This is 3.8% higher than the current fine-tuned state-of-the-art (SOTA — RESDSQL-3B + NatSQL) and 3.1% higher than a recent in-context learning SOTA (DIN-SQL (w/ GPT-4)) even though it’s trained on simpler datasets and uses a basic prompt. It also performs better than models using composite prompts for sub-tasks.

In the tables below, you can see how Few-shot SQL-PaLM outperforms other models and the differences in their performance. You can also see SQL-PaLM’s test-suite accuracy on the Spider dev split as well as the comparison between SQL-PaLM’s test-suite accuracy and execution accuracy compared to the RESDSQL model.

Table 2. Performance margin comparisons across models
Table 3. Test Suite accuracy comparisons on Spider Dev Split
Table 4. Execution and TestSuite accuracy comparison across models

It surpasses other fine-tuned models and demonstrates robustness across difficulty levels from easy to extra hard, averaging a TS accuracy of 77.3% for Few-shot and 78.2% for Fine-tuned SQL-PaLM. The analysis of different prompt design approaches shows that Few-shot SQL-PaLM performs better with “concise” prompts (EX 82.7%, TS 77.3%, see figure below). Additionally, few-shot performance is better than zero-shot, though only by a small margin.

Figure 3. Test-suite accuracy for different prompt design approaches in zero- and few-shot set-up on Spider Dev.
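To give a sense of what a “concise” schema serialisation might look like compared with a more verbose one, here are two common ways of putting a schema into the prompt; the exact formats used in the paper may differ, and the tables shown are hypothetical.

```python
# Two representative schema serialisations for Text-to-SQL prompts; the exact
# formats used by SQL-PaLM may differ. Tables here are illustrative only.

CONCISE_SCHEMA = """\
concert_singer | singer: singer_id, name, country, age | concert: concert_id, theme, year"""

VERBOSE_SCHEMA = """\
CREATE TABLE singer (singer_id INT PRIMARY KEY, name TEXT, country TEXT, age INT);
CREATE TABLE concert (concert_id INT PRIMARY KEY, theme TEXT, year INT);"""
```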

A qualitative analysis by human experts suggested that many examples classified as “errors” were in fact correct, which supports the idea that the model’s performance might be even higher than reported. The evaluation sometimes fails because of vague natural-language questions or the difficulty of evaluating creative solutions that deviate from the ground truth. For instance, some examples have multiple valid ground truths, experience type-related issues, or allow different acceptable formats.

The errors Few-shot SQL-PaLM does make are very similar to those a human expert would make, rather than simple syntax errors, which points to a deep-rooted understanding of SQL. It efficiently handles tasks like joining multiple tables using various keywords and employing nested SQL structures, and it generates creative, diverse SQL outputs that differ from the ground truth yet remain equally correct. This suggests an in-depth understanding of SQL content rather than simple recollection.

Few-shot SQL-PaLM can also infer relevant SQL phrases from semantic implication: it understands that “French singers” maps to “country = 'France'”, and translates concepts like “young to old” into “ORDER BY age ASC”. This capability is linked to the model’s large-scale language pre-training.
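As a small illustration of this kind of semantic inference, using a hypothetical singer table:

```python
# Illustrative mapping from natural-language cues to SQL clauses
# (hypothetical schema; not an example from the paper).
question = "Show the names of French singers from young to old."
inferred_sql = """
SELECT name
FROM singer
WHERE country = 'France'
ORDER BY age ASC;
"""
```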

The model’s evaluation in real applications is very significant and should be considered. Problems arising from differences in vocabulary, schema modifications, domain-specific language, and perturbed database schemas affect the model’s practical use. To assess robustness, the research considers the Spider variants (Spider-Syn, Spider-Realistic, and Spider-DK), which simulate these problems through synonym replacement, removal of explicit schema mentions, and incorporation of domain knowledge into the schema. Few-shot SQL-PaLM attains a significant improvement of 24% over ChatGPT and a slight improvement over the earlier fine-tuned SOTA on Spider-Realistic. Fine-tuned SQL-PaLM also performs slightly better than the earlier SOTA on Spider-Realistic and Spider-DK. Fine-tuning is still required for LLMs to learn how to consume feedback (Reinforcement Learning from Human Feedback, RLHF) and to make use of several rounds of additional feedback, which is yet to be explored.

This model has potential with extended research and refinement. It is necessary to understand the scalability of the model with further testing.

Symbolic memory framework

ChatDB (Hu et al.) is a framework designed to fulfil user requests expressed in natural language by manipulating a symbolic memory (an external database) based on user inputs and existing database tables. The framework consists of three main stages: input processing, chain-of-memory, and response summary. The researchers claim the results show that their model outperforms the baseline model ChatGPT. The experiments were conducted on a synthetic “Fruit Shop Dataset” for quantitative comparison, with 70 records arranged chronologically so as to stay within ChatGPT’s token limit, and 50 questions of varied difficulty, with answers, were collected for evaluation.

Unlike ChatGPT, ChatDB can handle an infinite number of records without performance loss, processing them one by one regardless of the total number. ChatDB stores the information from these records in its symbolic memory and uses SQL statements to answer questions; ChatGPT, by contrast, includes the records in the prompt and answers the questions directly.
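The sketch below illustrates the symbolic-memory idea with SQLite: records are inserted into an external database rather than the prompt, and each reasoning step is an SQL statement generated by the model. The call_llm helper and the shop schema are illustrative assumptions, not ChatDB’s actual prompts or schema.

```python
# A minimal sketch of symbolic memory: records live in an external SQL
# database, and each reasoning step is an SQL statement produced by the model.
# `call_llm` and the shop schema are illustrative assumptions.

import sqlite3

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a chat-completion endpoint."""
    raise NotImplementedError

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, fruit TEXT, quantity INT, price REAL)")

def record_sale(day: str, fruit: str, quantity: int, price: float) -> None:
    # Write each incoming record straight into symbolic memory; the prompt
    # never has to hold the full history, so the record count is not limited
    # by the context window.
    conn.execute("INSERT INTO sales VALUES (?, ?, ?, ?)", (day, fruit, quantity, price))

def answer(question: str) -> list:
    # Chain-of-memory: ask the model for the SQL step needed, execute it
    # against the database, and return the results for summarisation.
    sql = call_llm(f"Tables: sales(day, fruit, quantity, price)\nQuestion: {question}\nSQL:")
    return conn.execute(sql).fetchall()
```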

ChatDB significantly outperforms ChatGPT, particularly excelling in difficult, multi-hop reasoning questions (41/50 correct answers, an accuracy of 82%, vs. 11/50 correct answers, an accuracy of 22%, respectively). ChatDB’s symbolic memory prevents error accumulation and enhances reasoning and calculation capabilities. The advantages include decomposing complex problems into manageable steps using a chain-of-memory approach and leveraging symbolic memory for precise operations and calculations. By integrating external databases as symbolic memory, ChatDB demonstrates promising performance superiority over ChatGPT in various question-answering scenarios.

Though the researchers claim ChatDB can handle an infinite number of records without performance loss, it has been only tested using a limited dataset of 70 records. Further validation and evaluation are necessary to prove its capability.

Conclusion

We have explored the integration of Large Language Models into the conversion of natural language to SQL. However, this implementation comes with several challenges that need to be addressed for a successful outcome. The academic research discussed in this article serves as a valuable basis for delving into the practical applications of Text-to-SQL in industry.

I think the Case-Based Reasoning-based framework CBR-ApSQL (Guo et al.) and the LLM-based Text-to-SQL model SQL-PaLM (Sun et al.) have potential. Both used the Spider databases for testing. It might be easier to implement SQL-PaLM, since its approaches are few-shot prompting and fine-tuning, whereas CBR-ApSQL requires a three-stage process. While CBR-ApSQL reports around 99% valid-execution accuracy, that result would depend on how its adaptive retrieval mechanism performs on a larger dataset. Few-shot SQL-PaLM, meanwhile, significantly surpasses zero-shot ChatGPT, few-shot GPT-4, and few-shot CodeX-Davinci by margins of 17%, 9.9%, and 15.8% respectively, using a simple prompt. It has also been qualitatively evaluated by human experts, which suggested its true accuracy may be even higher than reported.

This exploration could involve testing the robustness of these models to efficiently handle a wide range of SQL operations and a much larger real dataset. Additionally, there’s potential to investigate methods that are less computationally intensive while maintaining effectiveness.

About the Author:
Rosemary Thomas is a Senior Technical Researcher here at Version 1.
