Natural Language to SQL: Fine-Tuning CodeLlama with Amazon SageMaker — Part 1
Practical strategies for fine-tuning CodeLlama using Quantized Low-Rank Adaptation (QLoRA) on Amazon SageMaker.
Natural Language to SQL (NL2SQL) has been an open field of research for years, and the integration of large-scale language models into business applications is evolving rapidly.
With the advancements of Generative AI and Large Language Models (LLMs) for code generation, the focus of the community has partially shifted from using smaller fine-tuned models for NL2SQL to larger, commercially available LLMs like GPT-4, Anthropic Claude, or Amazon Titan. Although the zero-shot capabilities of those models are impressive on benchmark datasets (e.g., SPIDER), their performance sees substantial improvements when using few-shot prompting. These results are encouraging for ML practitioners to realize the dream of using natural language for data analytics. Still, many real-world challenges remain outside the benchmarks.
Small Language Models
Why would you go to the lengths of fine-tuning your own smaller model, such as CodeLlama or Mistral, on NL2SQL? There are many reasons, but the main ones are:
- Data privacy and full control over the model
Your data, your model. When you fine-tune an open-source model, you are in total control. Run it in the cloud? Sure! Are your databases still on-prem? Then you can take the model and run it close to them as well.
- Overfitting is good
In the current benchmarks, we try to test models out-of-sample by presenting them with unseen databases and tables. However, if you fine-tune your own model on your own data, it is a good thing if the model has already seen the tables and databases in the training data. Of course, this also introduces the problem of the training data becoming stale as the tables change. We will discuss strategies to combat this problem below and in the following blogs.
- LoRA Adapters
When you fine-tune your model with LoRA, you can load and unload the LoRA adapters at runtime. Because LoRA adapters can be loaded and unloaded through simple addition and subtraction of the weight matrices and are independent of each other, we can swap our fine-tuning knowledge in and out.
With this ability, we could train one LoRA adapter per database or business domain. If the underlying database schema changes and the generation of correct SQL statements degrades, we simply unload that adapter, retrain it, and load it again (see the sketch after this list).
- Finding the right information
In contrast to most benchmarks, where the correct database and tables are already supplied, you first need to find the correct information to construct a query. A well-oiled system for finding accurate information is crucial. However, it can still be improved by fine-tuning our model to our systems: especially if the table and column names are not descriptive, fine-tuning helps the model make the right choice and goes hand in hand with overfitting.
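To make the adapter-swapping idea concrete, here is a minimal sketch with the Hugging Face PEFT library; the adapter paths and names are hypothetical, and it assumes one LoRA adapter has already been trained per database.

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.float16, device_map="auto"
)

# Attach a first adapter (hypothetical local paths, one adapter per database)
model = PeftModel.from_pretrained(base_model, "adapters/sales_db", adapter_name="sales_db")
# Load a second adapter for another database next to it
model.load_adapter("adapters/hr_db", adapter_name="hr_db")

# Route a request by activating the adapter that matches the target database
model.set_adapter("sales_db")  # queries against the sales database
# ... generate ...
model.set_adapter("hr_db")     # queries against the HR database
```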
With that in mind, let’s look at a model family that has reportedly been able to outperform even GPT-4 on the task of NL2SQL — Llama2 (e.g., the blog post by Kourosh Hakhamaneshi and Rehaan Ahmad). One notable derivative of the Llama2 model is CodeLlama, which has been further trained on coding tasks. In the following blog, we will investigate how to fine-tune CodeLlama with Amazon SageMaker. More blogs about experiments on the chunk length, table schema format, repetition penalty, and synthetic data generation with Amazon Bedrock will follow.
TL;DR: what you can find in this blog:
- A recipe on how to prepare your data and how to fine-tune CodeLlama on Amazon SageMaker with QLoRA, a Parameter Efficient Fine Tuning (PEFT) technique.
- The influence of the chunk length on the performance of the fine-tuned model.
- Deployment to production-grade endpoint with Amazon SageMaker.
- Further improvement ideas for using the methodology on your own company’s data.
NL2SQL spans many interesting challenges (information retrieval, synthetic training data generation with LLMs etc.); however, our focus will not be on how to achieve the highest scores on benchmarks but on how to fine-tune a model for your company.
Why CodeLlama?
What makes CodeLlama unique is its specialization: it is fine-tuned on a massive 500 billion tokens related to coding tasks. Compared to its base model, it also processes this data with a substantially longer context length of 16K tokens during training. This additional context length allows it to take more surrounding code into account, e.g., when infilling code or when used for auto-completion.
To give you an overview of the training methodology for CodeLlama, please consult the original paper (Rozière et al., 2023) or the diagram below.
QLoRA and PEFT
Although Amazon SageMaker offers us the ability to scale our compute, we want to be frugal with the resources consumed.
Therefore, our fine-tuning approach with CodeLlama will utilize QLoRA on Amazon SageMaker.
QLoRA stands out for its ability to effectively quantize pre-trained language models to 4 bits without substantially compromising performance.
Parameter-Efficient Fine-Tuning (PEFT) methods are a way to adapt pre-trained language models for specific tasks while minimizing resource usage. Unlike traditional fine-tuning, which adjusts all of a model’s parameters, PEFT focuses on modifying only a small subset. This approach significantly reduces the computational and storage demands often associated with fine-tuning large-scale LLMs. Despite altering fewer parameters, PEFT methods can achieve performance levels comparable to those of full fine-tuning. The Huggingface library PEFT is particularly useful for adapting large models efficiently and supports tools like Accelerate, which facilitates their use with large-scale models through integrations like DeepSpeed and Big Model Inference.
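As a rough illustration of what QLoRA looks like in code (the exact configuration used later in run_clm.py may differ; the values below are assumptions), the base model is loaded in 4-bit precision and only small low-rank adapter matrices are trained on top of it:

```
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Only the small LoRA matrices are trained (values are illustrative)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well below 1% of all parameters
```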
Before continuing further, I would like to thank Phil Schmid from Huggingface for his excellent blog on fine-tuning LLaMA 2 models, which was the base for most of my work.
1. Setting up the development environment
To get started with your fine-tuning journey on Amazon SageMaker, all you need is a laptop with the correct permissions set up to interact with Amazon SageMaker.
First, we need to configure access to
1. Hugging Face — access the CodeLlama model
2. Weights & Biases — track training progress with the wandb library
3. Amazon SageMaker — run the actual training.
Furthermore, we need to make sure that the correct libraries are installed. CodeLlama, for instance, is available from transformers version 4.33.0. Use the following commands to ensure that you have the needed packages:
import sys
import subprocess
import os

required_packages = [
    "sagemaker>=2.192.0",
    "huggingface_hub",
    "hf-transfer",
    "transformers==4.33.0",
    "datasets",
    "wandb",
    "matplotlib",
    "python-dotenv",
]

# Check if the required packages are installed
not_installed = []
for package in required_packages:
    # strip version specifiers and map the pip name to the importable module name
    module_name = package.split("=")[0].split(">")[0].replace("python-", "").replace("-", "_")
    try:
        __import__(module_name)
    except ImportError:
        not_installed.append(package)

# Install the missing packages
if not_installed:
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", *not_installed, "--upgrade"]
    )
To access the CodeLlama assets, log into your Hugging Face account. Create a .env file that holds your Hugging Face Hub token. If you work locally, you can also specify your SageMaker execution role there. We will use dotenv to read the sensitive information from that file.
%matplotlib inline
from dotenv import load_dotenv
import matplotlib.pyplot as plt

# Load the environment variables from the .env file
load_dotenv()

# Get the Hugging Face token from the environment variable
huggingface_token = os.getenv("HUGGINGFACE_TOKEN")

# Check if the token is available
if huggingface_token is None:
    raise ValueError("Hugging Face token not found. Please check your .env file.")

# Login using the Hugging Face CLI with the token
subprocess.run(["huggingface-cli", "login", "--token", huggingface_token])
For tracking purposes, we are going to connect the SageMaker training container to Weights and Biases, where we can comfortably check the progress made, even when on the go.
Next, log into your wandb account from the cell output. wandb.sagemaker_auth(path="scripts") saves the login data to a scripts folder that we are going to pass on to the training container, and wandb will fetch the login details from there automatically. However, we still need to make sure that wandb is installed in the training container — which will be done via the requirements.txt.
import wandb
wandb.login()
wandb.sagemaker_auth(path="scripts")
If you are going to use SageMaker in a local environment, you need access to an AWS Identity and Access Management (IAM) role with the required permissions for SageMaker. You can find more information about it here. Next, we are going to set up all our connections to the Amazon SageMaker service.
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
local_mode = True

if not local_mode:
    if sagemaker_session_bucket is None and sess is not None:
        # set to default bucket if a bucket name is not given
        sagemaker_session_bucket = sess.default_bucket()
    try:
        role = sagemaker.get_execution_role()
    except ValueError:
        iam = boto3.client("iam")
        role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]
else:
    # Load .env file
    load_dotenv()
    # Get the SageMaker execution role
    role = os.getenv("SAGEMAKER_EXECUTION_ROLE")

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
print(f"SageMaker Version: {sagemaker.__version__}")
2. Load and prepare the dataset
Our experiment will be based on the Spider Dataset, which is an open-source dataset for natural language to SQL, licensed under the CC BY-SA 4.0 license. In a preprocessing step, I enriched the dataset that is available on the HuggingFace hub with more information on the table schemas, foreign and primary keys for fine-tuning Llama models. SQL-PaLM inspired this enrichment step and the dataset is available on the Huggingface dataset hub.
As the fine-tuned model would most probably be used as a tool in a chat setting, our dataset should reflect an instruction-tuning approach, which incorporates instruction, context, and answer, as shown in the example below.
In the answer, we want CodeLlama to include <SQL></SQL> tags for the SQL query for easier parsing of the output.
After formatting, a training example has the following parts:
- Instruction
An instruction helps the model to pick up the intent. This instruction has not been tuned. Feel free to improve it!
"""
### Instruction
Given an input question, use sqlite syntax to
generate a sql query by choosing one or multiple of the following tables.
The foreign and primary keys will be supplied.
Write query in between <SQL></SQL>.
Answer the following question with the context below:
"""
- Context
The aim is to include information about the table structure via the “context” of our query. This information can be extracted from the table itself. Below is a shortened example of a relatively simple database.
"""
### Context
[Schema (values) (types)]: | driving_school | Addresses : address_id (text) ,
line_1_number_building (number) , city (text) , zip_postcode (text) ,
state_province_county (text) , country (text) | Staff : staff_id (text) ,
staff_address_id (number) , nickname (text) , first_name (text) ,
middle_name (text) , last_name (text) , date_of_birth (text) ,
date_joined_staff (number) , date_left_staff (number) |
Vehicles : vehicle_id (text) , vehicle_details (number) |
Customers : customer_id (text) , customer_address_id (number) ,
customer_status_code (text) , date_became_customer (text) ,
date_of_birth (text) , first_name (text) , last_name (text) ,
amount_outstanding (number) , email_address (number) ,
phone_number (text) , cell_mobile_phone_number (text) |
Customer_Payments : customer_id (text) ...
"""
- Task
Experimentation with API-based and non-fine-tuned models has shown that controlling the output — to only include the SQL query — can be a challenging task. We want our model to follow our formatting rules precisely, even when running thousands of queries. For responses in JSON format, update the instructions accordingly.
"""
### Answer
<SQL> SELECT T1.state_province_county FROM Addresses
AS T1 JOIN Staff AS T2 ON T1.address_id = T2.staff_address_id
GROUP BY T1.state_province_county
HAVING count(*) BETWEEN 2 AND 4; </SQL>"""
To load the philikai/Spider-SQL-LLAMA2_train dataset, we use the load_dataset() method from the 🤗 Datasets library.
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("philikai/Spider-SQL-LLAMA2_train")

print(f"Train dataset size: {len(dataset['train'])}")
Inspecting the dataset, we find the following features:
'db_id': Value(dtype='string', id=None),
'query': Value(dtype='string', id=None),
'question': Value(dtype='string', id=None),
'schema': Value(dtype='string', id=None),
'primary_keys': Value(dtype='string', id=None),
'foreign_keys': Value(dtype='string', id=None),
To instruction-tune our model via the prompt parts outlined above, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function that takes a sample and returns a string with our format instruction.
# Play around with the instruction prompt to maximize the model performance further
def format_spider(sample):
    instruction_prompt = f"""Given an input question, use sqlite syntax to generate a sql query by choosing one or multiple of the following tables.
The foreign and primary keys will be supplied. Write query in between <SQL></SQL>.
Answer the following question with the context below: \n{sample['question']}"""

    instruction = f"### Instruction\n{instruction_prompt} "
    context = f"### Context\n{sample['schema']} | {sample['foreign_keys']} | {sample['primary_keys']}"
    response = f"### Answer\n<SQL> {sample['query']} </SQL>"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt
Applying the formatting function to a random sample (shortened):
"""
### Instruction
Given an input question, use sqlite syntax to generate a sql query by choosing one or multiple of the following tables.
The foreign and primary keys will be supplied. Write query in between <SQL></SQL>.
Answer the following question with the context below:
What are the different names of the colleges involved in the tryout in alphabetical order?
### Context
[Schema (values) (types)]: | soccer_2 | College : cname (text)...
[Foreign Keys]: tryout : cname = college : cname |
tryout : pid = player : pid |
[Primary Keys]: college : cname, player : pid, tryout : pid
### Answer
<SQL> SELECT DISTINCT cName FROM tryout ORDER BY cName </SQL>
"""
Although it is an integral part of any LLM application, the role of the tokenizer is often overlooked. Adding or forgetting to add an eos_token to your dataset's samples can make the difference between a successful and a failed fine-tuning job.
You can download the right tokenizer from the Hugging Face Hub. The pad_token is set to the EOS token to ensure that the model picks it up during the training process.
from transformers import AutoTokenizer
model_id = "codellama/CodeLlama-7b-hf" # or choose the size you want
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token
With that tokenizer, we will first tokenize the dataset and check the distribution of the token lengths.
To make sure that CodeLlama will stop predicting new tokens, we add an eos_token at the end of each example. This ensures that we not only pad our examples with it but also insert it when batching multiple examples together.
from random import randint
from itertools import chain
from functools import partial

dataset_train = dataset["train"]
dataset_validation = dataset["validation"]

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_spider(sample)}{tokenizer.eos_token}"
    return sample

# apply prompt template per sample
dataset_train_tokenized = dataset_train.map(
    template_dataset, remove_columns=list(dataset_train.features)
)
dataset_validation_tokenized = dataset_validation.map(
    template_dataset, remove_columns=list(dataset_validation.features)
)

print(dataset_train_tokenized[randint(0, len(dataset_train_tokenized) - 1)]["text"])
...
### Answer
<SQL> SELECT count(*) FROM Ref_locations </SQL></s>
An </s> token has been added at the end of the example solution. CodeLlama will learn throughout the fine-tuning that it has to close every example with </SQL>, followed by a </s>. As soon as the </s> has been predicted, the generation stops.
Importance of chunk length on NL2SQL fine-tuning
The next step would be to run a chunking and batching operation on our dataset. Many ML practitioners choose just to apply defaults to those operations. However, we want first to examine our dataset regarding the token length.
Observe how the token length has distinctive spikes stemming from the database schema description. As we supply the full database description, the model has to find the right tables and columns in the context.
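The plot itself is not reproduced here, but a minimal sketch of how such a token-length inspection could look, re-using the tokenizer and the formatted training split defined above, is the following:

```
# Tokenize the formatted prompts once just to measure their length in tokens
token_lengths = [
    len(tokenizer(sample["text"])["input_ids"])
    for sample in dataset_train_tokenized
]

plt.hist(token_lengths, bins=50)
plt.xlabel("tokens per training example")
plt.ylabel("count")
plt.title("Token length distribution of the formatted Spider samples")
plt.show()
```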
Next, we want to chunk our dataset and batch it. This is done via the following code snippet:
# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}
chunk_length = 2048

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {
        k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()
    }
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # get max number of chunks for batch
    if batch_total_length >= chunk_length:
        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {
        k: concatenated_examples[k][batch_chunk_length:]
        for k in concatenated_examples.keys()
    }
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result
# tokenize and chunk training dataset
lm_dataset = dataset_train_tokenized.map(
    lambda sample: tokenizer(sample["text"]),
    batched=True,
    remove_columns=list(dataset_train_tokenized.features),
).map(
    partial(chunk, chunk_length=chunk_length),
    batched=True,
)

# tokenize and chunk validation dataset
lm_dataset_validation = dataset_validation_tokenized.map(
    lambda sample: tokenizer(sample["text"]),
    batched=True,
    remove_columns=list(dataset_validation_tokenized.features),
).map(
    partial(chunk, chunk_length=chunk_length),
    batched=True,
)

# Print total number of chunks per split
print(f"Total number of training samples: {len(lm_dataset)}")
print(f"Total number of validation samples: {len(lm_dataset_validation)}")
In a scenario where you have a decoder-only model for natural language processing, such as GPT-style models, and you are working with instruction and solution pairs where the instruction is significantly longer than the solution (e.g., 3000 tokens for the instruction and 500 tokens for the solution) and the combined example is longer than the chunk length, consider the following aspects.
- Tokenization and Chunking
Instruction (e.g., 3000 tokens) and Solution (e.g., 500 tokens): Both the instruction and the solution are tokenized. Given an example chunk size of 2048 tokens, the instruction exceeds this limit significantly.
The instruction is, therefore, chunked into segments. The first chunk will contain 2048 tokens from the instruction. The remaining 952 tokens become part of the next chunk.
The solution, being only 500 tokens, can fit entirely in a chunk without needing to be split.
- Sequence Formation
In decoder-only models, the input is often formatted as a single sequence where the instruction and solution are concatenated, typically with a special token separating them (in our case, we train on ### Answer).
- Dealing with Length Discrepancy
First Chunk: The first 2048 tokens of the instruction are processed. But since there’s no room left in this chunk for the solution, the solution is not included here.
Second Chunk: The remaining 952 tokens of the instruction are placed in the next chunk, and here, the 500-token solution can also be included, as the total tokens (952 + 500) are within the 2048-token limit.
- Causal Language Modeling (CLM)
Decoder-Only Model: This type of model generates text one token at a time, predicting the next token based on the previous tokens. It does not use bidirectional context the way encoder models do.
Training: During training, the model learns to predict the next token in the sequence. For our dataset, it learns to continue the text from the instruction to the solution.
Attention Masking: The model uses attention mechanisms to weigh the importance of different tokens in the sequence when predicting the next token. In a chunk containing both instruction and solution, it learns the transition from the instructional context to the solution context.
Training Implications
The main challenge is that the model may not always see the instruction and solution together in the same chunk, especially for very long instructions. This can impact its ability to learn the relationship between specific instructions and their solutions.
Partial Context: In cases where the instruction is cut, the model gets only a part of the instruction in one chunk and the rest with the solution in the next. This affects the learning process, as the model doesn’t always see the complete instruction with the corresponding solution.
In many datasets, instruction lengths are often far below the chunk window. However, for complex databases, schema information can get very long and exceed the chunk length. The implication is that we are not training optimally on our most complex examples.
To understand the impact of the chunk length on our dataset, we kicked off four training jobs with chunk lengths of 256, 512, 1024, and 4096 tokens.
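To gauge how many examples would actually be split for a given setting, you can re-use the token_lengths list from the sketch above and check which share of the formatted samples exceeds each candidate chunk length:

```
# Share of formatted samples that do not fit into a single chunk
for candidate in [256, 512, 1024, 2048, 4096]:
    share = sum(length > candidate for length in token_lengths) / len(token_lengths)
    print(f"chunk length {candidate:>4}: {share:.1%} of samples are longer than one chunk")
```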
Finally, upload your datasets to Amazon S3:
# Save the chunked datasets to local directories before uploading
# (directory names are illustrative)
train_dataset_name = f"train_ddl_{chunk_length}_tokens"
validation_dataset_name = f"validation_ddl_{chunk_length}_tokens"
lm_dataset.save_to_disk(train_dataset_name)
lm_dataset_validation.save_to_disk(validation_dataset_name)

# Function to upload a directory to S3 bucket and verify upload
def upload_directory_to_s3(bucket_name, directory_path, s3_prefix):
    s3 = boto3.client("s3")
    # Walk through each file in the directory
    for root, dirs, files in os.walk(directory_path):
        for file in files:
            local_path = os.path.join(root, file)
            relative_path = os.path.relpath(local_path, directory_path)
            s3_path = os.path.join(s3_prefix, relative_path)

            # Upload file to S3
            s3.upload_file(local_path, bucket_name, s3_path)
            print(f"Uploaded {local_path} to s3://{bucket_name}/{s3_path}")

            # Verify the upload
            try:
                # Get the metadata of the uploaded file
                response = s3.head_object(Bucket=bucket_name, Key=s3_path)
                s3_file_size = response["ContentLength"]
                # Compare the file size
                local_file_size = os.path.getsize(local_path)
                if local_file_size != s3_file_size:
                    print(f"Size mismatch for file: {local_path}")
                else:
                    print(f"Successfully verified the upload of {local_path}")
            except Exception as e:
                print(f"Error verifying file upload: {e}")

# Define your bucket name
bucket_name = sess.default_bucket()

# Upload the directories to S3
upload_directory_to_s3(
    bucket_name, train_dataset_name, f"processed/codellama/nl2sql/train_ddl_{chunk_length}_tokens"
)
upload_directory_to_s3(
    bucket_name, validation_dataset_name, f"processed/codellama/nl2sql/validation_ddl_{chunk_length}_tokens"
)

# Define the S3 paths
training_input_path = f"s3://{bucket_name}/processed/codellama/nl2sql/train_ddl_{chunk_length}_tokens"
validation_input_path = f"s3://{bucket_name}/processed/codellama/nl2sql/validation_ddl_{chunk_length}_tokens"

print("Uploaded data to:")
print(f"Training dataset: {training_input_path}")
print(f"Validation dataset: {validation_input_path}")
3. Fine-Tuning CodeLlama with QLoRA on Amazon SageMaker
In order to optimize CodeLlama using the QLoRA methodology, we’ve drawn inspiration from the invaluable insights shared in Phil Schmid’s blog post on fine-tuning LLaMA models with QLoRA on SageMaker.
From the same source, we can reuse run_clm.py, which implements QLoRA using PEFT to train a model. Post-training, this script merges the LoRA weights into the model's architecture when merge_weights=True is set. For models that exceed memory capacity, temporary offloading to disk is implemented.
Please take a look at the code snippet that merges the adapter weights:
sagemaker_save_dir = "/opt/ml/model/"
if args.merge_weights:
    # merge adapter weights with base model and save
    # save int 4 model
    trainer.model.save_pretrained(output_dir, safe_serialization=False)
    # clear memory
    del model
    del trainer
    torch.cuda.empty_cache()

    from peft import AutoPeftModelForCausalLM

    # load PEFT model in fp16
    model = AutoPeftModelForCausalLM.from_pretrained(
        output_dir,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
    )
    # Merge LoRA and base model and save
    model = model.merge_and_unload()
    model.save_pretrained(
        sagemaker_save_dir, safe_serialization=True, max_shard_size="2GB"
    )
else:
    trainer.model.save_pretrained(sagemaker_save_dir, safe_serialization=True)
To initiate a SageMaker training job, we utilize a HuggingFace Estimator. This estimator simplifies the Amazon SageMaker training and deployment process by managing the necessary infrastructure. SageMaker orchestrates the setup and management of the required EC2 instances, supplies the appropriate HuggingFace container, uploads scripts, and downloads data from our S3 bucket to the container at /opt/ml/input/data before commencing the training job.
For even easier LLM fine-tuning, you can try Amazon SageMaker Jumpstart, which allows you to fine-tune a large set of models with a click of a button. At the time of writing, CodeLlama was not available on Amazon SageMaker Jumpstart.
An important aspect to keep in mind is that the HuggingFace Estimator version available on SageMaker may not always be in sync with the latest release of the Transformers library. To address this, make sure the training environment runs the desired version of the Transformers library by specifying it in a requirements.txt file, which allows you to upgrade to a specific version or even install the latest version directly from the GitHub repository. This gives you the flexibility to work with the current features and updates from the transformers library, ensuring that your SageMaker environment is equipped with the tools needed for your machine-learning projects.
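As an illustration, you could write such a requirements.txt into the scripts folder programmatically; the packages and pins below are assumptions and should be aligned with what run_clm.py actually imports:

```
from pathlib import Path

# Illustrative only: pin the libraries the training container should install.
requirements = """\
transformers==4.33.0
peft
accelerate
bitsandbytes
datasets
wandb
"""

Path("scripts").mkdir(exist_ok=True)
Path("scripts/requirements.txt").write_text(requirements)
print(Path("scripts/requirements.txt").read_text())
```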
Hardware requirements
Please take a look at a few selected options for running this training job.
You can also use a g5.2xlarge instead of the g5.4xlarge instance type, but then it is not possible to use the merge_weights parameter, since merging the LoRA weights into the model weights requires the model to fit into memory. You could, however, save the adapter weights and merge them with merge_adapter_weights.py after training.
import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define training job name; no special chars allowed, no "_"
job_name = f'huggingface-nltosql-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# hyperparameters, which are passed into the training job through CLI parsing
hyperparameters = {
    "model_id": model_id,  # pre-trained model
    "dataset_path": "/opt/ml/input/data/training",  # location of the dataset in the training container
    "epochs": 5,  # number of epochs
    "per_device_train_batch_size": 4,  # batch size for training
    "lr": 2e-4,  # learning rate
    "hf_token": HfFolder.get_token(),  # huggingface token to access CodeLlama
    "merge_weights": True,  # merge model weights with LoRA adapters after training
    "report_to": "wandb",  # inspection of the training process via Weights & Biases
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point="run_clm.py",  # train script
    source_dir="scripts",  # dir that will be copied over to the training container
    instance_type="ml.g5.12xlarge",  # instance type used for the training job
    instance_count=1,  # the number of instances used for training
    base_job_name=job_name,  # the name of the training job
    role=role,  # IAM role used in training job to access AWS resources; use your SageMakerExecutionRole
    volume_size=300,  # the size of the EBS volume in GB
    transformers_version="4.28",  # the transformers version of the container
    pytorch_version="2.0",  # the pytorch version used in the training job
    py_version="py310",  # the python version used in the training job
    hyperparameters=hyperparameters,
    environment={
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache"
    },  # set env variable to cache models in /tmp
    keepAlivePeriod=600,  # needed to allow time for downloading larger models
)
Start the training job with the .fit() method. The training container receives the training and validation datasets directly from S3.
# define a data input dictionary with our uploaded s3 uris
data = {"training": training_input_path, "validation": validation_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=False)
The training job will then start to run on AWS. Fine-tuning the 7B parameter model for 5 epochs on an ml.g5.4xlarge instance with a chunk length of 2048 took 17 hours in the us-east-1 region, which equates to a cost of roughly 35 USD.
As you might lose your kernel while the training job is executing on AWS, you can always re-attach to a training job like this:
# In case you can't remember the training job name, you can look it up via the management console or via the API
# Create a SageMaker client
sagemaker_client = boto3.client("sagemaker")

# List all training jobs and get the latest training job
response = sagemaker_client.list_training_jobs(
    SortBy="CreationTime", SortOrder="Descending"
)

# Print the training job names
for job in response["TrainingJobSummaries"]:
    print(job["TrainingJobName"])

TrainingJobName = "YOUR TRAINING JOB NAME"
huggingface_estimator = HuggingFace.attach(TrainingJobName)
In case you want to perform local testing, further analysis, or deployment of the model outside of SageMaker, you can, of course, download it. Please see the snippet below.
from sagemaker.s3 import S3Downloader
S3Downloader.download(
s3_uri=huggingface_estimator.model_data, # S3 URI where the trained model is located
local_path="./codellama/", # local path where *.targ.gz is saved
sagemaker_session=sess, # SageMaker session used for training the model
)
4. Deployment to an Amazon SageMaker Real-Time Endpoint
Now that we have laid the foundations on how to fine-tune Code Llama on your own dataset, we need to deploy the model.
With SageMaker, this is relatively simple, as it automatically saves a zipped version of your model to S3 when a training job completes successfully.
It is important that the archive directly contains all files and not a folder with the files — Amazon SageMaker takes care of that for you. If you train outside of SageMaker, your file should look like this:
```
model.tar.gz/
|- config.json
|- model-00001-of-00005.safetensors
|- tokenizer.json
|- …
```
You can use the pigz-python package to parallelize the archiving.
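If you do need to build the archive yourself, here is a minimal sketch using Python's standard tarfile module (the model_dir path is illustrative; the pigz-python package mentioned above can speed up the compression, but its API is not shown here):

```
import os
import tarfile

# Illustrative path: folder containing config.json, *.safetensors, tokenizer files, etc.
model_dir = "./codellama/model"

with tarfile.open("model.tar.gz", "w:gz") as tar:
    for file_name in os.listdir(model_dir):
        # arcname keeps the files at the archive root, without a wrapping folder
        tar.add(os.path.join(model_dir, file_name), arcname=file_name)
```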
Hugging Face TGI container in Amazon SageMaker
There are many options you can choose from to run inference on your model; e.g., when creating benchmarks, one could simply extend the run_clm.py script to include a pass on the benchmark dataset at the end.
Another way is to spin up a SageMaker endpoint with the Text Generation Inference (TGI) container from Hugging Face. First, retrieve the container URI that points to the desired Docker image. Amazon SageMaker's get_huggingface_llm_image_uri method, part of the SageMaker SDK, facilitates this. It allows for the acquisition of the Hugging Face LLM DLC's URI, tailored to your specific requirements, including backend, session, region, and version. To explore available versions, refer to the list of available images.
from sagemaker.huggingface import get_huggingface_llm_image_uri
# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri("huggingface", version="1.0.3")
# print ecr image uri
print(f"llm image uri: {llm_image}")
Deploying the Fine-Tuned Code Llama on Amazon SageMaker
To deploy models such as CodeLlama on Amazon SageMaker, we create a HuggingFaceModel class. This class forms the basis of our endpoint configuration, encompassing parameters like hf_model_id, instance_type, and others. We opt for the ml.g5.12xlarge instance type, boasting 4 NVIDIA A10G GPUs and 96 GB of GPU memory. Feel free to use a smaller instance for your deployment. You can review the instance details on the official AWS EC2 website.
As we have seen from our token-length plot, we need to be able to handle the longest question. Therefore, increase the MAX_INPUT_LENGTH parameter to accommodate that.
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 500

# S3 URI of the fine-tuned model artifact from the training job
s3_model_uri = huggingface_estimator.model_data

# Define Model and Endpoint configuration parameter
config = {
    "HF_MODEL_ID": "/opt/ml/model",  # path to where sagemaker stores the model
    "SM_NUM_GPUS": json.dumps(number_of_gpu),  # Number of GPUs used per replica
    "MAX_INPUT_LENGTH": json.dumps(3072),  # Max length of input text
    "MAX_TOTAL_TOKENS": json.dumps(4096),  # Max length of the generation (including input text)
    # 'HF_MODEL_QUANTIZE': "bitsandbytes",  # Comment in to quantize
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role, image_uri=llm_image, model_data=s3_model_uri, env=config
)

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    # volume_size=400,  # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
    container_startup_health_check_timeout=health_check_timeout,  # 10 minutes to be able to load the model
)
SageMaker will now create our endpoint and deploy the model to it. This can take up to 10 minutes.
Running a benchmark on the Spider validation dataset
We load the Spider SQL dataset for validation purposes and prepare to run a benchmark test. We prepare the dataset in the same way we did for fine-tuning; however, we exclude the answer to the question — it is our model's job to predict it.
# Load dataset from the hub
dataset = load_dataset("philikai/Spider-SQL-LLAMA2_train")

def format_spider_validation(sample):
    instruction_prompt = f"""Given an input question, use sqlite syntax to generate a sql query by choosing one or multiple of the following tables.
The foreign and primary keys will be supplied. Write query in between <SQL></SQL>.
Answer the following question with the context below: \n{sample['question']}"""

    instruction = f"### Instruction\n{instruction_prompt} "
    context = f"### Context\n{sample['schema']} | {sample['foreign_keys']} | {sample['primary_keys']}"
    response = "### Answer\n"  # LLM prediction comes here
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt
We test a single example from the validation set to ensure the formatting and model interaction are working as expected.
from random import randrange
example_nr = randrange(len(dataset["validation"]))
print(f"Example number: {example_nr}")
random_example = format_spider_validation(dataset["validation"][example_nr])
print(f"Picked this example: \n\n{random_example}")
print("*" * 190)
print(dataset["validation"][example_nr]["query"])
"""
Example number: 410
Picked this example:
### Instruction
Given an input question, use sqlite syntax to generate a sql query by choosing one or multiple of the following tables.
The foreign and primary keys will be supplied. Write query in between <SQL></SQL>.
Answer the following question with the context below:
What are the names of the teachers whose courses have not been arranged?
### Context
[Schema (values) (types)]: | course_teach | course : course_id (text) , staring_date (number) , course (text) | teacher : teacher_id (text) , name (number) , age (text) , hometown (text) | course_arrange : course_id (text) , teacher_id (number) , grade (text); | [Foreign Keys]: course_arrange : teacher_id = teacher : teacher_id | course_arrange : course_id = course : course_id | [Primary Keys]: course : course_id, teacher : teacher_id, course_arrange : course_id
### Answer
*******************************
SELECT Name FROM teacher WHERE Teacher_id NOT IN (SELECT Teacher_id FROM course_arrange)"""
Now we can send a request to our endpoint:
# hyperparameters for llm
payload = {
    "inputs": random_example,
    "parameters": {
        "do_sample": True,
        "top_p": 0.95,
        "temperature": 0.001,
        "max_new_tokens": 256,
        "repetition_penalty": 1.03,
        "stop": ["</s>"],
    },
}

# send request to endpoint
response = llm.predict(payload)

# print(response[0]["generated_text"][:-len("<human>:")])
print(response[0]["generated_text"])
<SQL> SELECT name FROM teacher WHERE teacher_id NOT IN (SELECT teacher_id FROM course_arrange) </SQL>
Exactly as we want it to be: the response is wrapped in <SQL> </SQL> tags for easier parsing.
Proceed to run the actual benchmark on the entire validation dataset. This process involves iterating over the dataset, sending each formatted sample to the model, and collecting the responses. At this point, we are calling the endpoint synchronously and waiting for the endpoint’s response to each query.
from time import time
from tqdm import tqdm

answers = []
dataset_val = dataset["validation"]

start = time()
for idx, sample in tqdm(enumerate(dataset_val), total=len(dataset_val)):
    formatted_sample = format_spider_validation(dataset_val[idx])

    # hyperparameters for llm execution
    payload = {
        "inputs": formatted_sample,
        "parameters": {
            "do_sample": True,
            "top_p": 0.95,
            "temperature": 0.001,
            "max_new_tokens": 256,
            "repetition_penalty": 1.03,
            "stop": ["</s>"],
        },
    }
    # send request to endpoint
    response = llm.predict(payload)
    answer = response[0]["generated_text"]
    answers.append(answer)

duration = time() - start
avg_duration = duration / len(dataset_val)
The benchmark takes only 16:25 minutes, which means the endpoint takes roughly a second on average per prediction.
Save the benchmark answers on disk.
import pickle
import os

# Specify the directory and filename
dir_name = "./results"
file_name = "yourfilename.pickle"

# Check if the directory exists, if not, create it
if not os.path.exists(dir_name):
    os.makedirs(dir_name)

# Write the data to the file
with open(os.path.join(dir_name, file_name), "wb") as fp:
    pickle.dump(answers, fp)
Analyze the results to determine how many of the model’s responses correctly follow the expected format.
# Initialize a count
count = 0

# Loop through the answers
for answer in answers:
    if answer.startswith("<SQL>") and answer.endswith("</SQL>"):
        count += 1

print(f"{count} answers have <SQL> at the beginning and </SQL> at the end.")
1034 answers have <SQL> at the beginning and </SQL> at the end.
For our experiment, we get 100% format alignment, as all of the 1034 samples have the <SQL> tags at the beginning and end.
Clean up
Finally, to manage resources effectively, we clean up by deleting the model and endpoint. This is a crucial step to avoid unnecessary charges and keep the environment tidy.
llm.delete_model()
llm.delete_endpoint()
5. Execution Match improvements
To judge the improvements we have made through fine-tuning, we run a SQL benchmarking script to calculate the execution match (EM) accuracy.
Execution Match Accuracy
To obtain the EM accuracy, we extract and execute each SQL query generated by our model. Next, we run the ground-truth "golden" query against the database. If the returned values of both queries match, we count it as one true positive. An exemplary script that evaluates the execution match accuracy for all results in a specified directory can be found below. The dev_df.feather file holds all the information needed to run all examples.
import pandas as pd
import sqlite3
import pickle
import os
import re
import sys
from pathlib import Path

path = Path(os.path.dirname(__file__))
sys.path.append(str(path.parent.parent.absolute()))


def extract_sql_content(text):
    pattern = r"<SQL>(.*?)</SQL>"
    matches = re.findall(pattern, text, re.DOTALL)
    return matches[0]


def clean_results(answerlist: list, return_failed: bool):
    if return_failed:
        failed_dict = {}
    # Clean the list of strings
    clean_list = []
    for idx, item in enumerate(answerlist):
        # extracting the SQL statement from the <SQL> tags
        try:
            item = extract_sql_content(item)
        except Exception as e:
            failed_dict[idx] = {"model_output": item}
        # Remove any leading or trailing whitespace
        item = item.strip()
        # Remove any trailing double quotes
        if item.startswith('"') and item.endswith('"'):
            item = item.rstrip('"')
            item = item.lstrip('"')
        # Remove any newlines
        item = item.replace("\n", " ")
        # Add the cleaned item to the clean_list
        clean_list.append(item)
    return clean_list, failed_dict


def run_exact_match_bench(df, model):
    results = []
    counter = 0
    for idx in range(0, df.shape[0]):
        sql_query = df.iloc[idx]["query"]
        prediction_query = df.iloc[idx][model]
        db_id = df.iloc[idx]["db_id"]
        db_file_path = (
            f"{path.parent.absolute()}/spider/database/{db_id}/{db_id}.sqlite"
        )
        conn = sqlite3.connect(db_file_path)
        cursor = conn.cursor()
        try:
            # Fetching the gold standard
            cursor.execute(sql_query)
            result_gold = cursor.fetchall()
            gold_query = f"gold Query: {sql_query}\n"
            gold_result = f"gold result: {result_gold}\n"
            try:
                # Fetching prediction results
                cursor.execute(prediction_query)
                result_preds = cursor.fetchall()
                pred_query = f"prediction Query: {prediction_query}\n"
                pred_result = f"prediction: {result_preds}\n"
            except Exception as e:
                pred_query = f"prediction Query: {prediction_query}\n"
                pred_result = f"error: {e}\n"
                results.append(gold_query + gold_result + pred_query + pred_result)
                continue
            # Comparing the results
            if result_gold == result_preds:
                match = "match\n"
                counter += 1
            else:
                match = "no match\n"
            results.append(gold_query + gold_result + pred_query + pred_result + match)
        except:
            error = "General error\n"
            results.append(error)
    exact_execution_match_accuracy = counter / df.shape[0]
    print(f"Accuracy: {exact_execution_match_accuracy}")
    return exact_execution_match_accuracy


def orchestrate_bench(df_eval, model_id: str, model_id_answers):
    model_id_answers_cleaned, queries_cleaning_failed = clean_results(
        model_id_answers, return_failed=True
    )
    df_eval[model_id] = model_id_answers_cleaned
    exact_match_accuracy = run_exact_match_bench(df_eval, model_id)
    return exact_match_accuracy, df_eval


def run_bench_on_folder(df_eval, results_folderpath):
    results_dir = {}
    df = df_eval.copy()
    for root, dirs, files in os.walk(results_folderpath):
        for result_file_name in files:
            filepath = os.path.join(root, result_file_name)
            if result_file_name.startswith("answers"):
                print(f"working on {filepath}")
                with open(filepath, "rb") as fp:  # Unpickling
                    model_id_answers = pickle.load(fp)
                exact_match_accuracy = orchestrate_bench(
                    df, result_file_name, model_id_answers
                )
                results_dir[result_file_name], df = exact_match_accuracy
    return results_dir


df_eval = pd.read_feather(f"{path.parent.absolute()}/data/dev_df.feather")
path_to_results_folder = f"{path.parent.absolute()}/results/"

results_dir = run_bench_on_folder(df_eval, path_to_results_folder)
print(results_dir)
Execution match accuracy of the models after fine-tuning
The following results are from models that have been fine-tuned with a chunk length of 2048 tokens and an inference setting of a repetition penalty of 1.03.
Substantial improvements can be achieved when fine-tuning CodeLlama models.
Conclusion
Throughout this first part of the blog series, you should have gained an intuition of the main components that need to be taken into account when fine-tuning CodeLlama on Amazon SageMaker for NL2SQL.
For more experiments and performance tests on chunk length, repetition penalty, training with database schemas through the Data Description Language (DDL), and more, stay tuned for part two of the blog.
For the code, head over to Github!