Natural Language to SQL: Experiments with CodeLlama on Amazon SageMaker — Part 2

Philipp Kaindl
9 min read · Apr 10, 2024


Experiments with CodeLlama for NL2SQL

Part 1 of the series explores fine-tuning a CodeLlama model for NL2SQL tasks using QLoRA on Amazon SageMaker. The blog guides you through setting up the development environment, preparing the fine-tuning dataset, and dives into the importance of tokenization and chunk length. It concludes with detailed steps on fine-tuning CodeLlama with QLoRA and deploying the model on SageMaker. Finally, we compare the improvements achieved over the base models with only a few epochs of training. If you missed part 1 and are looking for a practical how-to guide, follow this link [1].

In part 2, we will examine the experiments we undertook with CodeLlama more closely, as they will serve as helpful guidelines whenever we undertake such an endeavour with other models.

Here is a TL;DR of what you can find in this blog:

  • A comparison of different database descriptions and their pros and cons (SQL-PaLM-inspired and Data Definition Language), plus a quick check to see whether the model can generalize between the two database descriptions.
  • An investigation of the chunk length of the training batches and its link to the performance of the trained model.
  • The influence of the repetition penalty on the performance of CodeLlama.

Exploring Database Description Methods: DDL vs. SQL-PaLM

First, we’re going to examine the differences between the Data Definition Language (DDL) and the SQL-PaLM-based database schema descriptions that we used in Part 1.

DDL, or Data Definition Language, is a subset of SQL (Structured Query Language) used in database management systems. It focuses specifically on defining, modifying, and removing database structures but not on manipulating the data within these structures. If you are not too much into SQL — like I was at the beginning of this journey — here’s a brief overview:

Key Features of DDL for NL2SQL

DDL has three key features that we can use to our advantage:

Creating Database Structures:
DDL is primarily used to define the structure of a database and its elements, such as tables, indexes, and constraints. Here’s a simple example of a DDL command to create a table:

CREATE TABLE Students (
    StudentID int,
    Name varchar(255),
    Age int,
    PRIMARY KEY (StudentID)
);

This CREATE TABLE statement defines a new table named Students with columns for student ID, name, and age, and declares the StudentID column as the primary key.

Defining Constraints:
DDL statements include constraints to define rules for data in the database, such as primary keys, foreign keys, unique constraints, and check constraints. Extending the example from above, the Students and Courses tables are linked through the Enrollments table via foreign and primary keys. Therefore, the DDL contains all the information needed to construct our queries.

CREATE TABLE Students (
    StudentID int,
    Name varchar(255),
    Age int,
    PRIMARY KEY (StudentID)
);

CREATE TABLE Courses (
    CourseID int,
    CourseName varchar(255),
    PRIMARY KEY (CourseID)
);

CREATE TABLE Enrollments (
    EnrollmentID int,
    StudentID int,
    CourseID int,
    EnrollmentDate date,
    PRIMARY KEY (EnrollmentID),
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID),
    FOREIGN KEY (CourseID) REFERENCES Courses(CourseID)
);

Ease of Recreating a Valid DDL Statement:
DDL has a largely uniform syntax across different databases, making it easier to target a specific flavor of SQL than with the SQL-PaLM schema definition.
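
To make this concrete, below is a minimal sketch of how a valid DDL description can be recreated programmatically. It assumes the Spider databases are available as SQLite files (the file path is hypothetical); SQLite conveniently stores the original CREATE TABLE statements in its sqlite_master table:

import sqlite3

def get_ddl(db_path: str) -> str:
    """Recreate the DDL schema description for a SQLite database."""
    con = sqlite3.connect(db_path)
    # sqlite_master keeps the original CREATE TABLE statements verbatim
    rows = con.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    con.close()
    return ";\n".join(sql for (sql,) in rows) + ";"

# Hypothetical path to one of the Spider databases
print(get_ddl("spider/database/concert_singer/concert_singer.sqlite"))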

This ease of creation naturally leads to the subjective observation that more datasets utilize the DDL format for describing their databases. If your LLM has already been pre-trained on SQL data, there might be a higher chance that it has seen DDL than SQL-PaLM-style descriptions.

Additionally, you can extend the type definitions for the tables with example data or a description of the content of each table and column. This led to a significant improvement when experimenting with closed-source models like Anthropic Claude 2 or OpenAI GPT-4. One example of how to give extra context and business knowledge through comments can be seen below:

CREATE TABLE Students (
    StudentID int,          -- The unique identifier for each student
    Name varchar(255),      -- The name of the student
    Age int,                -- The age of the student
    PRIMARY KEY (StudentID) -- The primary key for the table, based on the StudentID column
);

Although this is only a qualitative anecdote, I encourage you to add such examples and comments to your training datasets and observe the difference. Please leave a comment if you make the same observation.
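
If you want to try this programmatically, a small helper along the following lines can merge a hand-written data dictionary into an existing DDL string; the helper and the example descriptions are only a sketch, not part of the original training pipeline:

def annotate_ddl(ddl: str, column_docs: dict) -> str:
    """Append an inline SQL comment with business context to each column line."""
    annotated = []
    for line in ddl.splitlines():
        first_word = line.strip().split(" ")[0]
        doc = column_docs.get(first_word)
        annotated.append(f"{line} -- {doc}" if doc else line)
    return "\n".join(annotated)

docs = {
    "StudentID": "The unique identifier for each student",
    "Age": "The age of the student",
}
print(annotate_ddl("CREATE TABLE Students (\n    StudentID int,\n    Age int\n);", docs))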

Token length for DDL and SQL-PaLM

An essential aspect of a database description framework for NL2SQL tasks is the number of tokens needed to describe a table. We therefore inspect and compare the token lengths for the databases in the training set of the Spider SQL dataset, once in the SQL-PaLM format and once in DDL.
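
As a rough sketch of how these token counts can be computed, assuming the CodeLlama tokenizer from the Hugging Face Hub (the SQL-PaLM-style schema line below is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

ddl = """CREATE TABLE Students (
    StudentID int,
    Name varchar(255),
    Age int,
    PRIMARY KEY (StudentID)
);"""
sql_palm = "students : studentid , name , age"  # illustrative SQL-PaLM-style schema line

# Compare the number of tokens each description format consumes
print(len(tokenizer(ddl)["input_ids"]), len(tokenizer(sql_palm)["input_ids"]))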

Token length for both the SQL-PaLM and the DDL descriptions of the databases in the Spider SQL training set. The DDL needs more tokens to describe the dataset.

The plot clearly indicates that the DDL description — just like the SQL-PaLM one — has distinct spikes for every database and the associated training questions (the training set has 180 unique databases). However, the DDL version consumes considerably more tokens to describe the underlying structure of the database.

To inspect the tokenization further, we visualize it by color-coding the tokens. First, for the SQL-PaLM notation:

SQL-PaLM description tokens, visualized by color-coding them. An underscore (_) indicates a space, for better readability.

And for the DDL notation:

DDL description tokens, visualized by color-coding them. The increased overall token count is easy to observe. <0x0A> is the hex byte for the "\n" newline. You can check the tokenizer vocabulary via the following link.

We can observe that the DDL needs far more tokens to represent the table. It is unclear whether it would be beneficial to strip out the \n characters and use a conversion function to migrate the schema to a more human-readable, natural-text format; more tests are necessary on this end.
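
A quick way to reproduce this kind of inspection yourself (minus the colors) is to map the token IDs back to their string form; in the Llama tokenizer, "▁" marks a leading space:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

ddl = "CREATE TABLE Students (\n    StudentID int\n);"
ids = tokenizer(ddl)["input_ids"]
# Each newline shows up as the <0x0A> byte token mentioned above
print(" | ".join(tokenizer.convert_ids_to_tokens(ids)))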

Model performance depending on the database description

First, let’s examine the performance of the models after training them for a few epochs with the different schema descriptions:

Validation-set execution-match accuracy of the different training runs

Analyzing the performance scores from the table above in the context of the database schema descriptions (DDL vs. SQL-PaLM) and the other training parameters, we can draw several insights:

Performance Based on Schema Description

  • SQL-PaLM Descriptions: The runs using SQL-PaLM descriptions show performance scores ranging from 38.59% to 60.35%. The lower end of this range is represented by the base models, with the 13B and 7B models achieving 38.59% and 38.88%, respectively. This finding indicates that neither of the base models underwent extensive training on Text-2-SQL.
  • DDL Descriptions: The runs with DDL descriptions show a marked improvement, with a score of 56.67% for the 7B model after five epochs and an even higher 59.38% for the 13B model after only three epochs. This suggests that the more verbose and detailed DDL format might provide a richer context or structure that the model can leverage to improve performance. However, this comes at the cost of higher token usage.

Influence of Training Duration

1 vs. 5 Epochs: Looking at the 7B model trained for one epoch with SQL-PaLM, we see a performance score of 49.71%. This score increases slightly to 50.87% when the model is trained for five epochs, indicating a modest improvement from additional training on SQL-PaLM data.

Effect of Model Size

There is a clear trend that the larger 13B models outperform the 7B models. Although they start from a similar baseline, the larger models significantly outperform the smaller models when fine-tuned. However, a further parameter study on the size of the LoRA adapter should be undertaken to judge this more closely.

Generalization from DDL to SQL-PaLM

The performance comparison between models trained on DDL and then used for inference with a SQL-PALM schema offers insightful conclusions. Specifically, the CodeLlama 7B model, which was trained exclusively on a DDL schema for five epochs and then tested with SQL-PALM for inference, exhibited a performance of 51.64%. This is a notable decrease from the 56.67% performance score achieved when inference was conducted with DDL on the same model configuration.

This decrease suggests that while the model has been well-trained to understand and generate predictions based on DDL, it may not generalize as effectively when presented with a schema in a different format, such as SQL-PALM, during inference. The difference in performance shows the importance of consistency between training and inference formats. It also suggests that for optimal performance, additional fine-tuning or a dedicated training regimen might be necessary for the model to become equally proficient with SQL-PALM formatted inputs as it is with DDL. This finding is crucial for practitioners who may consider switching between schema formats, indicating the potential need for format-specific fine-tuning to maintain high performance.
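
To illustrate what switching formats at inference time looks like in practice, here is a minimal sketch against a SageMaker endpoint as deployed in Part 1. The endpoint name and prompt template are hypothetical, and the exact response shape depends on the serving container:

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

def generate_sql(schema_description: str, question: str,
                 endpoint_name: str = "codellama-7b-nl2sql") -> str:
    # The schema block is the only part of the prompt that changes
    # between the DDL and SQL-PaLM variants.
    prompt = f"{schema_description}\n\n-- Question: {question}\n-- SQL:\n"
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 256}}
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    body = json.loads(response["Body"].read())
    # Hugging Face inference containers typically return [{"generated_text": ...}]
    return body[0]["generated_text"] if isinstance(body, list) else body["generated_text"]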

Conclusion on the formatting for NL2SQL fine-tuning

We can conclude that both SQL-PaLM and DDL yield comparable performance results. The availability of DDL-formatted data and the ease of generating it are the strongest drivers for my decision to continue with the DDL format only.

Analyzing Chunk Length

Building on our exploration from the first blog, where we navigated the intricacies of tokenization and chunk length for natural language processing in decoder-only models, we now turn our focus to the tangible impact these factors have on model performance, particularly for complex database schemas. This is a continuation of our deep dive into fine-tuning the CodeLlama model for Natural Language to SQL (NL2SQL) tasks, where the crux lies in the model’s ability to translate verbose instructions into precise SQL queries.

In Part 1, we discussed the challenges associated with lengthy instructions that exceed the chunk length, potentially leaving the model with incomplete context for learning. As we move forward, we investigate how varying the chunk lengths affects the model’s performance, especially when dealing with complex examples that are critical to our model’s understanding and generalization capabilities.
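
As a reminder of what chunk length means mechanically, here is a minimal sketch of the standard packing step used in many causal-LM fine-tuning scripts (a simplified stand-in for the chunking from Part 1): tokenized samples are concatenated and split into fixed-size blocks, so anything that falls beyond a block boundary is cut off:

from itertools import chain

def pack_into_chunks(tokenized_samples, chunk_length=2048):
    """Concatenate tokenized samples and split them into fixed-size chunks."""
    concatenated = list(chain.from_iterable(tokenized_samples))
    # Drop the remainder that does not fill a whole chunk
    total_length = (len(concatenated) // chunk_length) * chunk_length
    return [
        concatenated[i : i + chunk_length]
        for i in range(0, total_length, chunk_length)
    ]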

We conducted experiments with the 7B CodeLlama model, varying the chunk length across five different training jobs — 256, 512, 1024, 2048, and the notably larger 4096 tokens — to observe the impact on performance for our dataset. Below are the extracted results pertaining to these chunk lengths:

Performance over the chunk length for training CodeLlama 7B Model with DDL schema information.

The performance scores clearly illustrate that chunk length plays a significant role in model efficacy. The smaller chunk lengths (256 and 512 tokens) result in notably lower performance scores of 31.72% and 44.97%, respectively. This suggests that the model struggles to comprehend the full context of the instructions within these constrained token limits.

As we increase the chunk length to 2048 tokens, a standard size for many model training scripts, we see an improvement in performance across the board. The model’s ability to grasp the entire context — likely fitting most of the instruction and solution pairs within a single chunk — translates into better performance scores, ranging from 49.71% to 51.64%.

Interestingly, when we expand the chunk length further to 4096 tokens, we observe a significant leap in performance to 55.42%. This indicates that for complex queries, where the database schema and instructions are exceptionally verbose, a larger chunk length can provide the model with a more holistic view, thus enabling it to generate more accurate SQL queries.

These findings underscore a crucial consideration for those working with decoder-only models like GPT-style architectures: the trade-off between chunk length and performance. By strategically choosing the chunk length, we can optimize the training process, particularly for complex examples that are paramount for the model’s learning.

Conclusion

In conclusion, this part of our exploration reveals that while managing tokenization and chunking is a delicate balancing act, doing so effectively can have a substantial impact on the performance of an NL2SQL model. It offers a stepping stone for those seeking to fine-tune their models, providing a benchmark for how chunk length can influence the accuracy and reliability of the resulting model. It also creates a bridge to Part 1, reinforcing the importance of the model’s training environment on SageMaker and how these foundational decisions play a pivotal role in achieving superior model performance.


Philipp Kaindl

Sr. Applied Scientist Generative AI @ AWS. Opinions are my own.