Fine-Tuning the LLM Mistral-7B-Instruct-v0.3 for Text-to-SQL with SQL-Create-Context Dataset and Enhanced Training Techniques

Published in The Deep Hub · Jun 25, 2024

Frank Morales Aguilera, BEng, MEng, SMIEEE

Boeing Associate Technical Fellow / Engineer / Scientist / Inventor / Cloud Solution Architect / Software Developer @ Boeing Global Services

Introduction

In the rapidly evolving landscape of natural language processing, the ability to transform natural language queries into structured SQL queries is paramount. Large language models (LLMs) have shown promise in this domain, but fine-tuning them for specific tasks remains challenging. This article builds upon my previous work on fine-tuning the Mistral-7B model for Text-to-SQL tasks using the SQL-Create-Context dataset. We delve into enhanced techniques to further refine the model’s performance, leveraging readily available cloud resources like Google Colab’s GPUs and Google Cloud Storage. By incorporating the evaluation dataset directly into training, employing weight decay, and implementing early stopping, we aim to improve the model’s accuracy and generalization capabilities. Additionally, we explore how to optimize resource utilization within the Google Colab environment and discuss the scalability of our approach using Google Cloud Storage, making it accessible to a broader audience.

Enhanced Fine-Tuning with SFTTrainer

The core of our fine-tuning process revolves around the SFTTrainer class. In this updated approach, we’ve integrated the evaluation dataset directly into the training workflow, so the model’s loss on held-out examples is tracked throughout training and used to guide checkpoint selection, potentially leading to better generalization and performance on unseen examples.

Furthermore, we’ve introduced weight decay (weight_decay=0.01) to the optimizer. Weight decay acts as a regularization technique, preventing the model’s weights from becoming too large and thus mitigating overfitting.

We’ve incorporated early stopping to monitor the model’s progress and prevent overfitting. The EarlyStoppingCallback monitors the validation loss and halts training if the loss doesn’t improve for a specified number of evaluation steps (early_stopping_patience=3 in our case).
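The snippet below is a minimal sketch of how these pieces fit together, not the notebook verbatim: the model, tokenizer, and the train/eval splits are assumed to have been prepared earlier, the full training_args are covered in the next section, and the exact SFTTrainer keyword arguments can vary slightly between TRL versions.

```python
# Sketch: passing the evaluation split to SFTTrainer and attaching early stopping.
# `model`, `tokenizer`, `train_dataset`, `eval_dataset`, and `training_args`
# are assumed to have been prepared earlier in the notebook.
from transformers import EarlyStoppingCallback
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,   # evaluation data is now part of the training loop
    tokenizer=tokenizer,
)

# Halt training if the validation loss fails to improve for 3 consecutive evaluations.
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))
trainer.train()
```

Note that EarlyStoppingCallback only takes effect when an evaluation strategy, load_best_model_at_end=True, and metric_for_best_model are configured, which is exactly what the refined training configuration below provides.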

Refined Training Configuration

In addition to the changes above, we’ve refined the training configuration with the following settings (collected in the sketch after this list):

  • load_best_model_at_end=True: Ensures that the best-performing model checkpoint (based on the evaluation metric) is loaded at the end of training.
  • logging_dir="/content/gdrive/MyDrive/model/Mistral-7B-text-to-sql-flash-attention-2-dataeval/logs": Specifies the directory where training logs will be saved.
  • evaluation_strategy="steps", eval_steps=10: Evaluates the model on the validation set every ten steps.
  • save_strategy="steps", save_steps=10: Saves model checkpoints every ten steps.
  • metric_for_best_model="loss": Uses the validation loss as the metric to determine the best model checkpoint.
  • warmup_steps=15: Gradually increases the learning rate during the initial 15 training steps.
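Collected in one place, these settings map onto TrainingArguments roughly as follows. This is a sketch rather than the notebook’s exact code: the output_dir path is illustrative, and values not listed above (learning rate, batch size, number of epochs) are omitted.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    # output_dir is illustrative; the notebook's actual path may differ
    output_dir="/content/gdrive/MyDrive/model/Mistral-7B-text-to-sql-flash-attention-2-dataeval",
    logging_dir="/content/gdrive/MyDrive/model/Mistral-7B-text-to-sql-flash-attention-2-dataeval/logs",
    weight_decay=0.01,              # regularization on the optimizer, as discussed above
    evaluation_strategy="steps",
    eval_steps=10,                  # evaluate on the validation set every 10 steps
    save_strategy="steps",
    save_steps=10,                  # checkpoint every 10 steps
    load_best_model_at_end=True,    # reload the best checkpoint when training ends
    metric_for_best_model="loss",   # "best" = lowest validation loss
    warmup_steps=15,                # learning-rate warm-up over the first 15 steps
)
```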

Leveraging the Mistral-7B-Instruct-v0.3 Base Model

A significant change in this updated approach is utilizing the Mistral-7B-Instruct-v0.3 base model. This model likely incorporates advancements and refinements over its predecessor, potentially contributing to improved performance in our Text-to-SQL task.
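Loading the new base model is essentially a one-line change. Below is a hedged sketch: the 4-bit quantization settings are illustrative and the notebook’s exact configuration may differ, while the attn_implementation flag mirrors the “flash-attention-2” tag in the run name.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.3"   # updated base model

# 4-bit quantization keeps the 7B model within a Colab GPU's memory budget
# (illustrative settings; the notebook's exact values may differ).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",   # matches the "flash-attention-2" run name
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```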

Case study

I developed two notebooks to support this article: the first for fine-tuning the model, and a second (Notebook #2) for evaluating the fine-tuned model. Notebook #2 assesses the model’s inference capabilities, reporting a perplexity of 10.40 and an accuracy of 80.00% on a sample of 10 examples, measured by comparing the generated queries against the evaluation dataset. Notebook #2 also executes the generated queries whenever they match the original queries in the test dataset.
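Perplexity here is simply the exponential of the average evaluation loss, and accuracy is an exact-match comparison between generated and reference queries. The following is a minimal sketch of that accuracy check, not Notebook #2’s exact code: the prompt template, generation settings, and the test_set variable are illustrative, while the question/context/answer fields come from the SQL-Create-Context dataset.

```python
# Sketch of an exact-match accuracy check. `model` and `tokenizer` are the
# fine-tuned artifacts; `test_set` is a small sample (e.g. 10 rows) of the
# SQL-Create-Context evaluation split. Prompt format is illustrative.
def generate_sql(question: str, context: str, max_new_tokens: int = 128) -> str:
    prompt = f"Given the schema:\n{context}\nWrite a SQL query for: {question}\nSQL:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens (skip the prompt).
    completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
    return completion.strip()

matches = 0
for example in test_set:
    predicted = generate_sql(example["question"], example["context"])
    # Count whitespace-insensitive exact matches against the reference query.
    if predicted.lower().split() == example["answer"].lower().split():
        matches += 1

print(f"Accuracy: {100 * matches / len(test_set):.2f}%")
```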

Figure 1 displays four line graphs that track the progression of a machine learning model’s training process over epochs (iterations through the training dataset). Figure 2 displays four line graphs that monitor a machine learning model's evaluation (not training) performance over epochs.

Figure 1: Training metrics

Figure 2: Evaluation metrics

Table 1: Training results

Figure 3: Evolution of Training and Validation Loss During Model Optimization

Based on the combined analysis of the training and evaluation metrics (Table 1 and Figure 3), the following conclusions can be drawn:

Training:

  • The model learned effectively during training, as evidenced by the significant decrease and stabilization of the training loss.
  • The consistent learning rate, while unusual, has worked well for this model and dataset, as indicated by the smooth progression of the gradient norm.
  • Given the stable training loss and gradient norm, the model did not exhibit signs of overfitting during training.

Evaluation:

  • The model generalized well to unseen data, as demonstrated by the initial decrease in the evaluation loss.
  • However, the evaluation loss plateaued and slightly increased towards the end, suggesting potential overfitting or a limitation in the model’s capacity to generalize further.
  • The evaluation process became more efficient over time, potentially due to code or hardware optimizations.
  • The number of evaluation steps per second remained constant, indicating consistent batch processing.

Overall Conclusion:

The model demonstrates strong learning capabilities and good initial generalization performance. However, the plateau and a slight increase in the evaluation loss towards the end suggest potential overfitting or a limitation in generalizing further.

Recommendations:

  • To address the potential overfitting, consider:
  • Early stopping: Stop training when the evaluation loss increases or plateaus (already implemented in the notebook): trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))
  • Regularization techniques: Apply techniques like L1/L2 regularization or dropout to prevent the model from becoming too complex (see the sketch after this list).
  • Data augmentation: Increase the diversity of the training data to improve the model’s ability to generalize.
  • To investigate the efficiency gains in the evaluation process, analyze the code and hardware setup for potential optimizations that could be applied to the training process.
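On the regularization point: assuming the fine-tuning uses a LoRA/QLoRA adapter via the peft library, which is typical for 7B models on Colab-class GPUs though the notebook’s exact setup may differ, dropout can be applied directly in the adapter configuration. The values below are illustrative.

```python
from peft import LoraConfig

# Illustrative adapter config: lora_dropout regularizes the adapter weights,
# complementing the weight_decay=0.01 already applied by the optimizer.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,            # dropout on the LoRA layers to curb overfitting
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
    bias="none",
)
# Pass this to SFTTrainer(..., peft_config=peft_config) to apply it during training.
```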

Implementing these recommendations could further improve the model’s performance and generalization capabilities.

Conclusion

By integrating the evaluation dataset into training, employing weight decay, implementing early stopping, and leveraging the updated Mistral-7B-Instruct-v0.3 base model, we have significantly enhanced the fine-tuning process for text-to-SQL tasks. These refinements, achieved using accessible cloud resources like Google Colab and Google Cloud Storage, have resulted in a model demonstrating strong learning capabilities and good initial generalization performance. While there are indications of potential overfitting or limitations in further generalization, we have outlined practical recommendations to address these issues, such as early stopping and regularization techniques. The ability to fine-tune such powerful models using readily available cloud resources democratizes access to advanced NLP capabilities, potentially benefiting businesses and developers with limited computational resources. Future research could explore fine-tuning even larger language models, experimenting with diverse datasets and architectures, or applying this model to other text-to-SQL tasks, further pushing the boundaries of natural language understanding and database interaction.
