Fine-Tuning the LLM Mistral-7b for Text-to-SQL with SQL-Create-Context Dataset

How we fine-tuned the LLM Mistral-7b model using the b-mc2/sql-create-context dataset and deployed it on the Hugging Face Hub.

Frank Morales Aguilera · The Deep Hub · 6 min read · Mar 10, 2024


Frank Morales Aguilera, BEng, MEng, SMIEEE

Boeing Associate Technical Fellow / Engineer / Scientist / Inventor / Cloud Solution Architect / Software Developer @ Boeing Global Services

Introduction

The LLM Mistral-7b, a robust language model, has demonstrated remarkable performance across various natural language processing tasks. Notably, it has shown great promise in text-to-SQL, the task of transforming natural language queries into SQL queries. This essay will delve into the process of fine-tuning the LLM[1b] Mistral-7b model using the b-mc2/sql-create-context dataset and the subsequent deployment of the fine-tuned model on the Hugging Face Hub, a platform for sharing and collaborating on transformer models.


Fine-Tuning the LLM Mistral-7b

Fine-tuning[1a] involves training a pre-trained model on a new dataset to adapt it to a specific task. In this case, the task is Text-to-SQL, and the dataset is b-mc2/sql-create-context. This dataset contains pairs of natural language queries and their corresponding Structured Query Language (SQL) queries, which are ideal for this task.

The fine-tuning process involves several steps:

  1. Data Preparation: The b-mc2/sql-create-context dataset is first preprocessed to match the input format required by the LLM Mistral-7b model. This typically involves tokenization and formatting the data into input-output pairs (see the sketch after this list).
  2. Model Configuration: The LLM Mistral-7b model is configured for the Text-to-SQL task. This consists of setting up the model architecture and hyperparameters.
  3. Training: The model is then trained on the preprocessed dataset. During training, the model learns to map the natural language queries to their corresponding SQL queries.
  4. Evaluation: The model is evaluated on a separate validation set to measure its performance after training.
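A minimal data-preparation sketch using the Hugging Face datasets library. The instruction-style prompt template below is an illustrative choice, not one mandated by the dataset or the notebook:

from datasets import load_dataset

# Load the b-mc2/sql-create-context dataset from the Hugging Face Hub.
dataset = load_dataset("b-mc2/sql-create-context", split="train")

def to_prompt(sample):
    # Fold the schema context and question into one instruction-style
    # training string; this template is an assumption for illustration.
    return {
        "text": (
            "You are a text-to-SQL assistant.\n"
            f"### Context:\n{sample['context']}\n"
            f"### Question:\n{sample['question']}\n"
            f"### Answer:\n{sample['answer']}"
        )
    }

dataset = dataset.map(to_prompt, remove_columns=dataset.column_names)
dataset = dataset.train_test_split(test_size=0.1, seed=42)  # hold out a validation split
print(dataset["train"][0]["text"])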

Fine-Tuning with QLoRA:

The QLoRA (quantized LoRA) method combines 4-bit quantization of the base model with trainable LoRA adapters, which makes fine-tuning a 7B-parameter model feasible on a single GPU. The workflow looks like this (a model-loading sketch follows this list):

  1. Convert the dataset into an instruction-formatted train/test split.
  2. Fine-tune Mistral 7B using QLoRA.
  3. Evaluate the performance of the newly fine-tuned model.
  4. Optionally, compare it with the base model alongside GPT-4[1c].
  5. Merge the adapter with the base model if needed.
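A sketch of loading Mistral-7B in 4-bit for QLoRA fine-tuning; the quantization settings are common defaults, assumed here for illustration rather than taken from the notebook:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default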

PEFT Library:

  • Use the PEFT (parameter-efficient fine-tuning) library from Hugging Face to attach and train the LoRA adapters, as sketched below.
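A minimal adapter setup with PEFT, continuing from the quantized model above; the rank, alpha, and target modules are typical choices for Mistral-7B, not values taken from the notebook:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Make the 4-bit base model ready for adapter training.
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the small adapter matrices train

Training can then proceed over the formatted prompts with a standard Hugging Face Trainer or TRL's SFTTrainer.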

Text-to-SQL

Text-to-SQL is a task in natural language processing (NLP) that aims to automatically generate SQL queries from natural language text[1]. This task involves converting the text input into a structured representation and then using this representation to generate a semantically correct SQL query that can be executed on a database[1-2].

Several tools are available to help with this task. For example, Text2SQL.AI generates SQL queries with AI[3]. It can translate, explain, and fix complex SQL queries using your native language[3]. Another tool, EverSQL, also offers a text-to-SQL feature that allows users to easily convert plain text into SQL queries[4].

These tools can be handy for those who are less familiar with SQL syntax, as they allow users to query databases using natural language[5]. This makes querying databases more accessible to non-technical users[5].

SQL-Create-Context Dataset

The SQL-Create-Context Dataset is built from WikiSQL and Spider[5a-5b]. It contains 78,577 examples, each consisting of a natural language question, a SQL CREATE TABLE statement, and a SQL query answering the question using the CREATE statement as context[5a-5b].

This dataset was built with text-to-SQL large language models (LLMs) in mind, intending to prevent the hallucination of column and table names often seen in models trained on other text-to-SQL datasets[5b]. The CREATE TABLE statement can frequently be copied and pasted from different DBMSs and provides table names, column names, and their data types[5b]. By providing just the CREATE TABLE statement as context, we can offer better grounding for models without providing actual rows of data, limiting token usage and exposure to private, sensitive, or proprietary data[5b].

Here are a few examples of the data in the SQL-Create-Context Dataset[5a-5b]:

{
  "question": "Please show the themes of competitions with host cities having populations larger than 1000.",
  "context": "CREATE TABLE city (City_ID VARCHAR, Population INTEGER); CREATE TABLE farm_competition (Theme VARCHAR, Host_city_ID VARCHAR)",
  "answer": "SELECT T2.Theme FROM city AS T1 JOIN farm_competition AS T2 ON T1.City_ID = T2.Host_city_ID WHERE T1.Population > 1000"
},
{
  "question": "Please show the different statuses of cities and the average population of cities with each status.",
  "context": "CREATE TABLE city (Status VARCHAR, Population INTEGER)",
  "answer": "SELECT Status, AVG(Population) FROM city GROUP BY Status"
}

This dataset is an excellent resource for training models for the Text-to-SQL task[5a-5b].

Deploying the Model on Hugging Face Hub

Once the model is fine-tuned and evaluated, it can be deployed on the Hugging Face Hub, a platform for sharing[6] and collaborating on transformer models. Deploying the model involves the following steps:

  1. Model Conversion: The fine-tuned model is first converted into a format compatible with the Hugging Face Hub (for a QLoRA run, this can mean merging the adapter into the base weights).
  2. Uploading the Model: The converted model is then uploaded to the Hugging Face Hub. This involves creating a new model repository and pushing the model files to this repository.
  3. Model Documentation: After uploading, the model is documented in a model card that tells users what the model is for, how it was trained, and how to use it.
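A minimal merge-and-upload sketch; the adapter path and repository name are placeholders, and pushing requires being logged in to the Hub (e.g., via huggingface-cli login):

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the base model in full precision and merge the trained adapter.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1", torch_dtype="auto"
)
merged = PeftModel.from_pretrained(base, "path/to/adapter").merge_and_unload()

# Push the merged weights and tokenizer to a new Hub repository.
repo_id = "your-username/Mistral-7B-text-to-sql"  # hypothetical repo name
merged.push_to_hub(repo_id)
AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1").push_to_hub(repo_id)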

Case study

I developed a notebook[7] using the FlashAttention-2 library that covers the main topics of this article, including the deployment of the new model on the Hugging Face Hub[8-9].
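For reference, querying the published model could look roughly like this; the prompt template is an assumption, and if the repository hosts only a LoRA adapter it would need to be loaded via PEFT instead:

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "frankmorales2020/Mistral-7B-text-to-sql-flash-attention-2"  # see [8]
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

prompt = (
    "### Context:\nCREATE TABLE city (Status VARCHAR, Population INTEGER)\n"
    "### Question:\nShow the different statuses of cities.\n"
    "### Answer:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens (the SQL answer).
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))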

Our model demonstrated its capability when evaluated on 1,000 samples[10] from the evaluation dataset, achieving an accuracy of 82.60%. This result is a testament to the model's effectiveness and potential.
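One common way to compute such a figure is exact-match accuracy over held-out samples; the sketch below makes that assumption, and the notebook's actual evaluation logic may differ:

def exact_match_accuracy(model, tokenizer, samples):
    # Score a list of {question, context, answer} samples by comparing
    # the generated SQL string to the reference (a strict metric).
    hits = 0
    for s in samples:
        prompt = (f"### Context:\n{s['context']}\n"
                  f"### Question:\n{s['question']}\n### Answer:\n")
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
        pred = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True).strip()
        hits += int(pred == s["answer"].strip())
    return hits / len(samples)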

Conclusion

The process of fine-tuning the LLM Mistral-7b for Text-to-SQL with the b-mc2/sql-create-context dataset and deploying the model on the Hugging Face Hub holds immense promise for enhancing the performance of Text-to-SQL systems. This approach harnesses the power of the LLM Mistral-7b model and the richness of the b-mc2/sql-create-context dataset to create a model that can accurately convert natural language queries into SQL queries. The Hugging Face Hub then provides a dynamic platform for sharing and collaborating on this model, further amplifying its utility and impact.

References

1a. Fine-Tuning Mistral-7B with DialogSum Dataset and Flash Attention 2 | Frank Morales Aguilera | Artificial Intelligence in Plain English

1b. What is a large language model? | Narration Box

1c. Building a GPT-like Model from Scratch with Detailed Theory and Code Implementation | Habr

1. Text-To-SQL | Papers With Code

2. Architectural Patterns for Text-to-SQL: Leveraging LLMs for Enhanced BigQuery Interactions | Arun Shankar | Google Cloud - Community | Medium

3. Text2SQL.AI: Generate SQL queries with AI for Free!

4. Text to SQL for free | English to SQL | AI to SQL (eversql.com)

5. Text2SQL: A Powerful Tool for Querying Databases with Natural Language (ai2sql.io)

5a. b-mc2/sql-create-context · Datasets at Hugging Face

5b. README.md · b-mc2/sql-create-context at main (huggingface.co)

6. Using scvi-hub to download pre-trained scvi-tools models | scvi-tools

7. https://nbviewer.org/github/frank-morales2020/MLxDL/blob/main/FineTuning_LLM-Mistral-7B-Instruct-v0.1_for-text-to-SQL.ipynb

8. frankmorales2020/Mistral-7B-text-to-sql-flash-attention-2 · Hugging Face

9. frankmorales2020/Mistral-7B-text-to-sql-without-flash-attention-2 · Hugging Face

10. https://nbviewer.org/github/frank-morales2020/MLxDL/blob/main/upload_model_hf.ipynb



Written by Frank Morales Aguilera

Frank Morales is a Boeing Associate Technical Fellow. You can learn more here: https://www.linkedin.com/in/frank-morales1964/