Comparing Open-source LLMs for NL-to-SQL

Mo Pourreza
Dataherald
Published Oct 31, 2023

Background

The NL-to-SQL (Natural Language to Structured Query Language) task is a hard problem in the field of natural language processing (NLP). It involves transforming natural language questions into SQL queries, which can then be executed against relational databases to answer them. The task is a specialized subfield of NLP and is closely related to the broader domain of natural language understanding (NLU) and to natural language interfaces to databases (NLIDB).

With recent advances in Large Language Models (LLMs) such as GPT-4, Llama2, and Falcon, the focus of NL-to-SQL work in industry and academia has shifted to leveraging these LLMs to generate SQL for real-world use cases. This would be immensely powerful, as it would allow non-technical users to find insights directly from data.

At Dataherald, we have built an open-source natural-language-to-SQL engine which can be used with different LLMs, though we use GPT-4-32K in our enterprise deployments. The current state-of-the-art research for NL-to-SQL, such as DAIL-SQL, C3, and DIN-SQL, also uses closed-source LLMs like GPT-4 and GPT-3.5-turbo. These models are expensive and raise data privacy concerns for enterprises. We therefore set out to see how open-source LLMs like Llama2 and Mistral stack up against OpenAI's models. The following are our results.

Open-source LLMs

In this blog post, we will explore the capabilities of open-source LLMs from several model families. The information shared here is derived from three recent papers, namely Battle of the Large Language Models, Text-to-SQL Empowered by Large Language Models, and Decomposed In-Context Learning of Text-to-SQL, as well as our own internal testing conducted on a Google Colab A100 GPU (a sketch of that setup is shown below).
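
For context on the setup: the open-source models in this list can be run on a single A100 with the Hugging Face transformers library. The snippet below is a minimal sketch of loading Mistral-7B for greedy SQL generation; the checkpoint name and generation settings are illustrative, not our exact benchmark configuration.

```python
# Illustrative only: load an open-source model with Hugging Face transformers on one GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # assumed Hub checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision fits comfortably in A100 memory
    device_map="auto",          # place the weights on the available GPU
)

def generate_sql(prompt: str, max_new_tokens: int = 256) -> str:
    """Greedily decode a SQL completion for a zero-shot prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```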

The LLMs we looked at were:

  1. Llama-7B
  2. Llama-33B
  3. MISTRAL-7B
  4. Alpaca-7B
  5. Llama-2-CHAT-7B
  6. Llama-2-CHAT-13B
  7. Vicuna-7B
  8. Vicuna-33B
  9. BARD-LAMDA
  10. BARD-PALM2
  11. GPT-3.5-turbo
  12. GPT-4

Some of these models, such as Llama, Llama2, and Mistral, are pre-trained models similar to GPT-3.5-Turbo, which has undergone supervised fine-tuning and contrastive fine-tuning. The other models have gone through an alignment process involving additional instruction tuning, while essentially sharing the same architecture as the pre-trained models. In particular, Vicuna, Guanaco, and Alpaca are aligned versions of the Llama model trained on specific datasets.

Zero-shot NL-to-SQL Performance

In this section, we evaluate the zero-shot performance of open-source LLMs and contrast it with that of their larger closed-source counterparts. Zero-shot performance refers to the LLM's ability to generate a SQL query based solely on the given question and the corresponding database schema, without any few-shot examples in the prompt.


To ensure a fair comparison, we maintained consistent prompts across all LLMs in our assessment. The prompt used for reporting zero-shot performance is based on the template proposed by Rajkumar et al., which has been shown to outperform other prompt formats.
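
For illustration, the Rajkumar et al. template serializes the database schema as CREATE TABLE statements and appends the question as a SQL comment, letting the model complete the query starting from SELECT. Below is a rough sketch of such a prompt builder; the instruction wording is paraphrased rather than copied verbatim from the paper.

```python
# Illustrative zero-shot prompt builder in the spirit of the Rajkumar et al. template.
# The instruction line is an approximation, not the exact wording used in our tests.
def build_zero_shot_prompt(create_table_statements: list[str], question: str) -> str:
    schema = "\n\n".join(create_table_statements)
    return (
        f"{schema}\n\n"
        "-- Using valid SQLite, answer the following question for the tables provided above.\n"
        f"-- {question}\n"
        "SELECT"
    )

prompt = build_zero_shot_prompt(
    ["CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"],
    "How many singers are older than 30?",
)
# The model's completion is appended to the trailing "SELECT" to form the final query.
```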

We evaluated the performance of these models using execution accuracy, which involves executing both the model-generated SQL query and the reference SQL query against the database and comparing their results. The results were obtained by running the LLMs on the development set of the Spider dataset.
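
In other words, a generated query counts as correct if it returns the same result set as the gold query on the same database. Below is a simplified sketch of that check using SQLite (the Spider databases are distributed as SQLite files); the official Spider evaluation script handles more edge cases, such as row ordering and value normalization.

```python
import sqlite3

def execution_match(db_path: str, predicted_sql: str, reference_sql: str) -> bool:
    """Return True if both queries yield the same result set on the given database."""
    conn = sqlite3.connect(db_path)
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
        reference_rows = conn.execute(reference_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as incorrect
    finally:
        conn.close()
    # Compare as unordered collections of rows.
    return sorted(map(repr, predicted_rows)) == sorted(map(repr, reference_rows))
```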

Here are the results:

Takeaways

The takeaways are clear:

  1. Closed-source models (the GPT models and BARD) significantly outperform their open-source counterparts in NL-to-SQL. It is safe to assume this is largely due to their much higher parameter counts.
  2. Models aligned with an additional supervised fine-tuning step exhibit a notable performance improvement when compared to their predecessor models. For example, the Alpaca-7B model showcases nearly a 16 percent improvement over its predecessor, Llama-7B. This underscores the potential of fine-tuning to achieve enhanced performance using the same underlying architecture.
  3. Newer open-source models such as Mistral-7B and Llama2 exhibit superior performance compared to their predecessors and are closing the gap with closed-source models.

Looking Ahead

For NL-to-SQL workloads, while newer open-source models are closing the gap with OpenAI's models, there is still a substantial difference in out-of-the-box accuracy. However, it seems that fine-tuning on a specific dataset can greatly improve accuracy even with the same architecture.
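
As a rough sketch of what such a fine-tuning setup could look like (illustrative only; the field names and file layout below are assumptions, not the Dataherald API format), each training example pairs the serialized schema and question with the gold SQL query:

```python
import json

# Illustrative only: convert (schema, question, SQL) triples into a JSONL file
# in a generic prompt/completion format for supervised fine-tuning.
examples = [
    {
        "schema": "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);",
        "question": "How many singers are older than 30?",
        "sql": "SELECT COUNT(*) FROM singer WHERE age > 30;",
    },
]

with open("nl2sql_finetune.jsonl", "w") as f:
    for example in examples:
        record = {
            "prompt": f"{example['schema']}\n\n-- {example['question']}\n",
            "completion": example["sql"],
        }
        f.write(json.dumps(record) + "\n")
```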

At Dataherald, we are planning to add API support to our engine to enable fine-tuning models for NL-to-SQL workloads. We will be sharing results from our fine-tuned models in future posts.

About Dataherald

  • Sign up for free and use the hosted version of Dataherald
  • Our open-source engine is available on Github.
  • Join our Discord server to learn more about the project.
