Notes on fine-tuning Llama 2 using QLoRA: A detailed breakdown

Ogban Ugot
23 min read · Sep 19, 2023


I’ve been following recent developments around LLMs but not really tinkering with any of the open-source models, libraries, etc. And if you have been following then you know there has been a lot of stuff coming out lately.

While trying to play with this code, I realized there is a lot to keep up with. I found myself asking a lot of questions: When did this library come out? What else does it offer? What technique does this class implement?

In this article, I will attempt to answer these questions in the style of detailed notes. The questions are specific to the code, so most of the notes address the open-source libraries involved and the methods and classes used. I will discuss some of the theoretical aspects of fine-tuning an LLM, but I won't dive too deep; just summaries. Along the way I will share many links to detailed discussions about topics like weight quantization, parameter-efficient fine-tuning techniques, library documentation, papers, etc. Be sure to check out all the links; if you cannot read them immediately, bookmark them for later.

By the end of the article, if you were, like me, feeling a bit lost with all the recent developments in the fine-tuning-open-source-LLMs space, you'll probably have a better understanding of how everything fits together.

Let me first summarize what exactly the code is about. It shows us how to fine-tune Llama 2-7B (you can learn more about Llama 2 here) on a small dataset using a fine-tuning technique called QLoRA, all in a Google Colab notebook with a T4 GPU. Both the model and the dataset are available from HuggingFace. I'll take the code in sections and present important notes on the relevant lines as well as any other context needed.

Libraries and Imports

The first thing I noticed was that there were some unfamiliar Python libraries required.

!pip install -q peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

As you can see above, we first have to install peft, bitsandbytes, transformers, and trl. Honestly, the only library I was familiar with here was transformers, so I'm going to provide some notes on each of these libraries and what they do.

transformers: The transformers library is definitely the oldest library here, with the earliest version (2.0.0) on PyPI dating back to 2019. It's a HuggingFace library for quickly accessing (downloading from HuggingFace's API) machine-learning models for text, image, and audio. It also provides functions for training or fine-tuning models and for sharing them on the HuggingFace model hub. The library doesn't have abstraction layers and modules for building neural networks from scratch like PyTorch or TensorFlow. Instead, it provides training and inference APIs that are optimized specifically for the models provided by the library. The repository has more info on why you should or shouldn't use transformers for your ML project, and you can see a list of the available models provided by transformers here. It is safe to say that transformers is one of the key Python libraries for LLM fine-tuning, simply because of its easy toolbox for accessing open-source models. Indeed, the HuggingFace community (and the company) deserve much credit for the work they are doing as one of the major players in democratizing open-source machine learning. I love this post that says that by sorting the models on HuggingFace by downloads, you can get a pretty good idea of which models are easiest to work with.

bitsandbytes: bitsandbytes is a relatively newer library, with the earliest release on PyPI dating back to 2021. It is a lightweight wrapper around CUDA custom functions, specifically designed for 8-bit optimizers, matrix multiplication, and quantization. It provides functionality for optimizing and quantizing models, particularly LLMs and transformers in general. It also offers features such as 8-bit Adam/AdamW, SGD momentum, LARS, LAMB, and more. I believe the goal of bitsandbytes is to make LLMs more accessible by enabling efficient computation and memory usage through 8-bit operations; by leveraging 8-bit optimization and quantization techniques, we can improve the performance and efficiency of models. If you have been following the conversation around open-source LLMs, you probably know that there is a memory bottleneck when running LLMs on smaller consumer GPUs such as the RTX 3090. This has led to a surge of interest in weight quantization techniques that attempt to reduce the memory requirements for running LLMs. The idea is to quantize the model's weights from a higher-precision format like FP32 down to a lower-precision one like INT8 (an 8-bit integer). There are techniques for quantizing, say, FP32 to INT8, including absmax and zero-point quantization, but due to limitations with these techniques, the creator of the bitsandbytes library co-authored the LLM.int8() paper as well as 8-bit Optimizers to provide efficient quantization methods for LLMs. The bitsandbytes library thus provides these quantization techniques as an open-source library. Earlier I said that we'll be running the code for fine-tuning Llama 2, a 7B-parameter model, on a free Colab T4 GPU; this is largely possible thanks to the quantization techniques provided by the bitsandbytes library. Here is a great article on weight quantization and another from HuggingFace.
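
To make the idea of quantization concrete, here is a minimal sketch of absmax quantization of a small weight tensor to INT8. This is just the core idea; bitsandbytes implements far more sophisticated variants like LLM.int8() and NF4.

import torch

def absmax_quantize(x: torch.Tensor):
    # Scale by 127 / max(|x|) so the largest-magnitude value maps to the int8 limit
    scale = 127 / torch.max(torch.abs(x))
    x_quant = (scale * x).round().to(torch.int8)
    # Dequantize to recover an approximation of the original weights
    x_dequant = x_quant.to(torch.float32) / scale
    return x_quant, x_dequant

weights = torch.randn(4, 4)          # pretend these are FP32 model weights
q, dq = absmax_quantize(weights)
print(weights, q, dq, sep="\n")      # dq is close to weights, but stored in 8 bits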

Peft: Weight quantization allows us to reduce the memory requirements of loading an LLM (or parts of it) into working memory for fine-tuning. However, there is still the problem of fine-tuning LLMs efficiently. Unlike transfer learning with smaller deep learning models, where you could simply freeze the lower layers of a network like AlexNet and fully fine-tune the classification layers on a new task, with LLMs such full fine-tuning comes at a prohibitive cost. Parameter-Efficient Fine-Tuning (PEFT) methods are a set of methods for adapting LLMs to downstream tasks such as summarization or question-answering on memory-constrained devices such as the T4 GPU (which provides 16 GB of VRAM). The motivation for these methods is that we can efficiently fine-tune parts of the LLM and still achieve results comparable to a full fine-tuning. A few of these methods, such as LoRA and Prefix Tuning, are quite successful and widely used in the literature. The peft library is a HuggingFace library that provides these fine-tuning methods; it's a new library dating back to January 2023. In this tutorial, we'll be using QLoRA, a low-rank adaptation (fine-tuning) technique for quantized LLMs.

trl: Another HuggingFace library. trl has been around since 2021, but active development picked up in January 2023 and has continued since. TRL stands for Transformer Reinforcement Learning, and the library provides implementations of the algorithms used in the various steps of training and fine-tuning an LLM, including the supervised fine-tuning (SFT) step, the reward modeling (RM) step, and the proximal policy optimization (PPO) step. trl also has peft as a dependency, so you can, for instance, use an SFT trainer with a PEFT method such as LoRA (see here for an example). The utility of using TRL (as described in the repository) is that you can perform the full end-to-end training of an LLM using the trainers provided.

datasets: Finally, although not included in our list of installed packages from earlier (it is pulled in as a dependency of the libraries above), the datasets library is another one from the HuggingFace ecosystem. It provides one-line dataloaders for many public datasets hosted on the HuggingFace dataset hub, as well as efficient data pre-processing.

These libraries complement each other and are certainly crucial for any sort of work with LLMs. I have made a simple image to summarize how these libraries fit together below.

How these libraries complement each other (a -> b means a complements b)

You can access all the HuggingFace library documentation from here.

Next, let's take a look at the imports.
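
Based on the classes discussed in this section, the import block looks roughly like this (a sketch; the exact gist may differ slightly):

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer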

Judging from the names of the imports and the context of what each library does, you might already have an idea of what we'll be using them for. But to be sure, let's take some notes on each import. We'll keep it simple for now, focusing only on what they do, and dive deeper into the initialization parameters later on.

torch: I'm sure you are familiar with the PyTorch machine learning library. Usually, you'd import torch to build neural networks and train them using torch's data utilities and optimizers. Here, however, we won't need that kind of low-level functionality because we are primarily using other libraries like transformers and trl for that. We'll use torch here to get dtypes (data types) like torch.float16 and to check device capability (more on this later). That's all.

load_dataset: load_dataset does exactly what the name implies, but one detail is that it loads our dataset from the HuggingFace dataset hub here. So it’s an online loader, but it’s efficient and simple, requiring just one line of code.

dataset_name = "mlabonne/guanaco-llama2-1k"  # the HuggingFace dataset used in this tutorial
dataset = load_dataset(dataset_name, split="train")

AutoModelForCausalLM: Recall that we are accessing the Llama-2-7b model that we'll be fine-tuning from the HuggingFace model hub using the transformers library; more specifically, we'll be accessing this particular model here. The transformers library provides a set of classes called Auto Classes that, given the name/path of a pre-trained model, can infer the correct architecture and retrieve the relevant model. AutoModelForCausalLM is a generic Auto Class for loading models for causal language modeling. Note that there are two types of language modeling, causal and masked. Causal language models include GPT-3 and Llama; these models predict the next token in a sequence in order to generate text that follows naturally from the input. The AutoModelForCausalLM class retrieves the causal model from the model hub and loads the model weights, thus initializing the model. The from_pretrained() method does this for us.

model_name = "NousResearch/Llama-2-7b-chat-hf"
device_map = {"": 0}  # example value: map the whole model onto GPU 0
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map
)

AutoTokenizer: The AutoTokenizer allows for easy tokenization of text data. It provides a convenient way to initialize and use tokenizers for different models without needing to specify the tokenizer class explicitly. Since it is also a generic Auto Class, it can automatically select the appropriate tokenizer based on the model name or path provided. The tokenizer converts input text into tokens, which are the basic units of text used by NLP models. It also provides additional features like padding, truncation, and attention masks. Overall, the AutoTokenizer simplifies the process of tokenizing text data for NLP tasks with transformer models. We can see how we initialize the AutoTokenizer below; later on, we'll see how the SFTTrainer takes the initialized AutoTokenizer as a parameter.

model_name = "NousResearch/Llama-2-7b-chat-hf"
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

BitsAndBytesConfig: As you know by now, we'll be using bitsandbytes for quantization. The transformers library recently added full support for bitsandbytes, so using BitsAndBytesConfig you can configure any of the quantization methods that bitsandbytes offers, such as LLM.int8, FP4, and NF4. You pass the quantization configuration to your AutoModelForCausalLM initializer so that it uses the configured quantization method when loading the model weights.

# bitsandbytes config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,  # pass to AutoModelForCausalLM
    device_map=device_map
)

TrainingArguments: TrainingArguments' utility is pretty straightforward. It is a data class for storing all the training arguments for the SFTTrainer. The SFTTrainer takes different types of arguments that are not necessarily specific to training, so TrainingArguments helps us organize all the related training arguments into a single data class and keeps the code clean and organized. Also, there are a bunch of nice utilities that can be used with TrainingArguments; for instance, using HfArgumentParser we can create an argument parser for TrainingArguments, which is useful for CLI applications. In the code below, we define training_arguments, which we'll pass to the SFTTrainer later.

# TrainingArguments
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

pipeline: We'll be using pipeline for inference after we are done with fine-tuning. I think the pipeline is a great utility; there is a list of pipeline tasks (see here) you can choose from, like "Image Classification", "Text Summarization", etc. You can also select a model to use for the task, or leave it out and let the pipeline use a default model for that task. You can add arguments that control some preprocessing, like tokenization or feature extraction. For reference, here is how we initialize a pipeline for inference later on:

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)

logging: The final import from transformers is logging. It's a centralized logging system for setting the verbosity of the library. There are various levels you can set, like CRITICAL, ERROR, INFO, etc. It's a global logging system that is very useful when debugging your transformers code.

logging.set_verbosity(logging.CRITICAL)

LoraConfig: From the peft library we import the LoraConfig data class. LoraConfig is a configuration class that stores the configuration required to initialize the LoraModel, which is an instance of a PEFT tuner. We'll pass this config to the SFTTrainer, which will use it to initialize the appropriate tuner, in this case the LoraModel.

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

PeftModel: Once we fine-tune a pre-trained transformer using one of the peft methods such as LoRA, we can save the LoRA adapter weights to disk and load them back into memory later. PS: adapters are basically the weights that PEFT methods fine-tune; they are separate from the base-model weights. Using PeftModel, you also have the option of loading the adapter weights into memory and then merging (adapting) the base-model weights with the newly fine-tuned adapter weights. This is precisely what we'll use PeftModel for: we'll use PeftModel.from_pretrained() to load the adapter weights and merge them with the base model using merge_and_unload(). Here is what this looks like in the code.

# Reload base_model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

SFTTrainer: Finally, the last import, and arguably the most important, is the SFTTrainer from trl. The SFTTrainer is a subclass of the transformers Trainer class. Trainer is a feature-complete training API for transformer models, and SFTTrainer builds on it with added support for parameter-efficient fine-tuning. The supervised fine-tuning step is a key step in adapting causal language models like Llama to downstream tasks like instruction-following. For example, the dataset for this tutorial is a small instruction-following dataset with 1k examples. The key idea behind supervised fine-tuning is that the model is trained on a set of validated responses that it can emulate, that is, a set of input-output pairs. Again, recall that SFTTrainer supports PEFT, so we will use the SFTTrainer with LoRA; the SFTTrainer will then perform the supervised fine-tuning using LoRA. We can then run the trainer (train()) and save the weights (save_pretrained()).

# Initialize the SFTTrainer object
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

Here's how each import fits together (a -> b means a is fed into b)

Parameters

Okay, so we now know what libraries we need to fine-tune Llama 2 (or any LLM really), we know the classes we need from these libraries and we have context about what those classes do. Next, let’s look at the parameters that go into all those classes we imported earlier. This will also shed more light on what they do.

Model and dataset names:

First, in lines 2, 5, and 8 we define the model_name, the dataset_name, and the new_model. These names follow the format of HuggingFace model and dataset names on their hub. For example, given a name like "NousResearch/Llama-2-7b-chat-hf", the first part, NousResearch, is a research organization with a HuggingFace account where they upload open-source models. Anyone can have an account, and anyone can upload a model; in the last part of this tutorial, you'll upload the fine-tuned model to HuggingFace under your own account. The second part is the model name, Llama-2-7b-chat-hf. It is good to give your models descriptive names that include useful info like the distinct model name (Llama-2), key parameter info (7b), and some other useful info about how the model works (chat-hf). We see the same thing with the new_model name llama-2-7b-miniguanaco, the name we assign to the fine-tuned model; here we append the name of the dataset that we fine-tuned on, miniguanaco. When you upload the fine-tuned model to HuggingFace, the name will look like <your-account-name>/llama-2-7b-miniguanaco. Together, the account name and the model name are referred to as the path. The dataset comes from the path mlabonne/guanaco-llama2-1k; notice that the dataset path follows the same format as the model path.
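
Concretely, the three names used in this tutorial look like this:

model_name = "NousResearch/Llama-2-7b-chat-hf"   # the base model to fine-tune
dataset_name = "mlabonne/guanaco-llama2-1k"      # the instruction dataset
new_model = "llama-2-7b-miniguanaco"             # the name of the fine-tuned model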

QLoRA Parameters:

In the code there are brief comments about what these parameters are, but why do we set those particular values? The QLoRA parameters go into your LoraConfig; there are other parameters not used here, so be sure to click the link and check them out in the docs. The parameters we'll make use of are r (lora_r), lora_alpha, and lora_dropout. These parameters are the most essential for LoRA, and to understand why, we'll have to dive into the LoRA paper here. Let me try to summarize. In neural networks, the backpropagation algorithm calculates the error e between the expected output and the actual output; this error is then used to compute ∆W, the adjustment to the weights that reduces it. So if you have the initial weights of a neural network W0, then with respect to the error e we compute ∆W and update the weights as W0 + ∆W in order to reduce e. LoRA proposes that ∆W can be decomposed into two low-rank matrices A and B such that W0 + ∆W = W0 + BA. Instead of learning the full ∆W update, we learn the much smaller low-rank update BA; this is how we achieve efficiency and lower computational requirements. If the size of ∆W is (d x k) (the size of W0), then we decompose ∆W into two matrices B and A with dimensions (d x r) and (r x k), where r is the rank. The parameter r (lora_r) in LoraConfig is the rank that determines the shape of the update matrices B and A. According to the paper, you can set a small rank and still get excellent results. When we update W0, we can control the impact of BA with a scaling factor α (the update is scaled by α/r); this scaling factor acts a bit like a learning rate. The scaling factor is our second parameter (lora_alpha). Finally, we set lora_dropout, which is a typical dropout rate for regularization.
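
Here is a minimal sketch of the low-rank update with toy dimensions, just to make the math concrete; peft's LoraModel handles all of this for you inside the model's linear layers.

import torch

d, k, r = 512, 512, 8          # W0 has shape (d x k); r is the LoRA rank, with r << min(d, k)
alpha = 16                     # lora_alpha, used to scale the update

W0 = torch.randn(d, k)         # frozen pre-trained weights (not trained)
B = torch.zeros(d, r)          # (d x r), initialized to zero so the update starts at zero
A = torch.randn(r, k) * 0.01   # (r x k); only A and B are trained

x = torch.randn(1, d)          # an input activation
h = x @ (W0 + (alpha / r) * (B @ A))   # same as x @ W0 + (alpha / r) * (x @ B @ A)

# The update has d*r + r*k trainable parameters instead of d*k for a full ∆W
print(d * r + r * k, "trainable params vs", d * k, "for a full update")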

BitsandBytes Parameters:

The bitsandbytes parameters go into the BitsAndBytesConfig. Recall that we are using a quantized version of LoRA known as QLoRA; you can read more about QLoRA in the paper here. What this means is that we want to use quantization for our LoRA fine-tuning: the frozen base-model weights are quantized, while the low-rank update weights we spoke about earlier remain in higher precision. To understand more about floating-point precision and the difference between the floating-point quantization types FP4 (4-bit float) and NF4 (4-bit NormalFloat), read these articles here and here. We set the param use_4bit (line 6) to True to use 4-bit fine-tuning, which was introduced in the QLoRA paper to achieve even lower memory requirements than the 8-bit quantization introduced in the LLM.int8 paper. We set bnb_4bit_compute_dtype (line 9), which is the data type (float16) that computation is performed in: while 4-bit quantization stores the weights in 4 bits to reduce memory usage, the computation happens in 16 or 32 bits, and any combination can be chosen (float16, bfloat16, float32, etc.). With bnb_4bit_quant_type (line 12) we set nf4, which has shown better theoretical and empirical performance according to the QLoRA paper. The final parameter, defined on line 15 as use_nested_quant, is set to False and passed to bnb_4bit_use_double_quant. Setting this parameter to True enables a second quantization after the first one to save an additional 0.4 bits per parameter, which is useful if you have serious memory problems. In our case, after choosing NF4 quantization with FP16 (float16) compute, we should have no memory constraints on the Colab T4 GPU (16 GB VRAM). To see why: loading Llama-2-7B (7 billion params) in FP16 (no quantization) requires 7B × 2 bytes = 14 GB of VRAM just for the weights, while 4-bit quantization requires 7B × 0.5 bytes = ~3.5 GB.
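
Putting the values described above together, the quantization config for this tutorial ends up looking roughly like this (NF4 storage, float16 compute, no double quantization):

import torch
from transformers import BitsAndBytesConfig

use_4bit = True                            # load the base model weights in 4-bit
bnb_4bit_quant_type = "nf4"                # NF4 rather than FP4
compute_dtype = getattr(torch, "float16")  # computation happens in float16
use_nested_quant = False                   # no double quantization

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)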

Training Arguments:

The training arguments contain the bulk of arguments defined. We’ll take brief notes on these arguments one after the other:

output_dir (line 6): here we set where the model predictions and checkpoints are stored. If you use a visualization tool like tensorboard, this is where training logs are stored and retrieved for visualization.

num_train_epochs (line 9): We can set a training epoch of 1 because our dataset is not very large (1K samples). Each epoch contains 250 steps.

fp16 and bf16 (lines 12 & 13): We set both of these to False because we won't be using mixed-precision training (read about mixed-precision training here) to reduce memory requirements; we have QLoRA for that already.

per_device_train_batch_size & per_device_eval_batch_size (lines 16 & 19): We set both of these to 4. Usually, you can set a higher batch size (>8) if you have enough memory, this will speed up training.

gradient_accumulation_steps (line 22): "Gradient accumulation steps" refers to the number of forward and backward passes you perform before actually updating the model weights. During each of these forward and backward passes, gradients are computed and accumulated over a batch of data. After accumulating gradients for the specified number of steps, you perform a single optimizer step that applies the accumulated (averaged) gradient to the model weights. This effectively simulates a larger batch size for gradient updates, which is beneficial when you have GPU memory constraints or want to stabilize training: it keeps the memory requirements of each individual forward and backward pass small while still updating the weights as if a bigger batch had been used. The default value is 1, but you can experiment with higher values to see how this affects training performance. A rough sketch of the mechanics is shown below.
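
Here is a toy sketch of what gradient accumulation does under the hood (the HuggingFace Trainer handles the equivalent internally when gradient_accumulation_steps > 1; the model and data here are made up just to illustrate the mechanics):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
loss_fn = nn.MSELoss()
data = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(8)]

accumulation_steps = 4
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so the accumulated gradient averages out
    loss.backward()                                   # gradients add up in .grad across iterations
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                              # one weight update per `accumulation_steps` batches
        optimizer.zero_grad()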

max_grad_norm (line 25): Gradient clipping involves scaling down the gradients if their norm (magnitude) exceeds a certain threshold, specified by the max_grad_norm parameter. If the gradient norm is greater than max_grad_norm, the gradients are scaled down so that their norm becomes equal to max_grad_norm; if the gradient norm is already below max_grad_norm, no scaling is applied. It's advisable to start with a higher value for max_grad_norm and then slowly scale it down over multiple training runs to see how this affects performance. However, for more conservative control over the training loss (especially since we have a small dataset), we start with a lower threshold of 0.3.
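
In plain PyTorch this is the same operation as torch.nn.utils.clip_grad_norm_, which the Trainer applies for you on each optimizer step; a quick sketch with a toy model:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss = model(torch.randn(4, 10)).sum()
loss.backward()

# If the global gradient norm exceeds 0.3, rescale all gradients so the norm equals 0.3
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.3)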

learning_rate (line 28): The learning rate for AdamW . AdamW is a variant of the popular Adam optimizer. It combines techniques from both the Adam optimizer and weight decay regularization. The learning rate is used by the AdamW optimizer to determine the step size for updating the model’s parameters during training.

weight_decay (line 31): Weight decay, also known as L2 regularization or weight regularization, is a technique commonly used in machine learning and deep learning to prevent a model from overfitting the training data. It works by adding a penalty term to the loss function that encourages the model's weights to stay small. Pairing AdamW with weight decay makes sense because weight decay is especially useful during fine-tuning: it helps prevent overfitting and ensures that the model adapts to the new task while retaining some of the knowledge from pre-training. Our weight decay value (0.001) imposes a relatively mild penalty on the loss function; larger values would impose a stronger penalty.

optim (line 34): Although AdamW is the default optimizer, here we specify a variant of it: the "paged_adamw_32bit" optimizer. Paged optimizers were introduced alongside QLoRA; they use NVIDIA unified memory to page optimizer states between GPU and CPU memory, which helps avoid out-of-memory spikes when the GPU momentarily runs out of space during training.

lr_scheduler_type (line 37): Typically we use a learning rate scheduler during the training of deep learning models to adjust the learning rate over time. The comment (line 36) says constant is a bit better than cosine. Why? A constant learning rate stays the same throughout training, while a cosine schedule decays the learning rate along a cosine curve between a maximum and minimum value. You can vary the training between the two schedules to see which one performs better.

warmup_ratio (line 40): Here we set the warmup_ratio to 0.03. Since each epoch has 250 training steps, the warm-up phase will last for the first ~8 steps (3% of 250), during which the learning rate increases linearly from 0 to the specified initial value of 2e-4. Warm-up phases are often used to stabilize training, prevent gradient explosions, and allow the model to start learning effectively.

group_by_length (line 44): We set this parameter to True, and the comment says it speeds up training considerably. Why? When group_by_length is True, samples of roughly the same length from the training dataset are grouped into the same batch, which minimizes the amount of padding required. The likely reason this improves training speed is better GPU utilization: batches with uniform sequence lengths waste less computation on padding tokens, so the GPU can process them more efficiently, leading to faster training times.

save_steps and logging_steps (lines 47 and 50): Here we set both params to 25 to control the interval steps at which to log training information and save checkpoints.

SFTTrainer Parameters:

The final batch of parameters is specific to the SFTTrainer.

max_seq_length: Setting max_seq_length to None allows us not to impose a maximum sequence length limit since our dataset contains sequences of different lengths. In this case, we don’t want to truncate or pad them to a fixed length so setting max_seq_length to None allows us to work with the full range of sequence lengths present in the data.

packing: According to the docs, this parameter tells the SFTTrainer whether to use a ConstantLengthDataset to pack the sequences of the dataset. When packing is set to True, the ConstantLengthDataset concatenates multiple short examples from the dataset into a single input sequence of (at most) max_seq_length tokens. This reduces the need for extensive padding and increases the efficiency of memory usage and computation, which is especially helpful when a dataset consists of many short examples. In this tutorial, packing is set to False, so each example is tokenized and trained on as its own sequence.

That's it for the required parameters. Seeing as we have already gone through the various imports (and what they do), and we've just covered the parameters required to initialize them, I believe we have covered the bulk of the work. Next, we'll see how everything fits together over the next few sections.

Load dataset, base model, and tokenizer

We'll start by loading the dataset on line 5. Then on line 9, we use the getattr function to set compute_dtype to torch.float16. On line 10 we initialize our BitsAndBytesConfig.

On line 17 we check the compatibility of the GPU with bfloat16 using the function torch.cuda.get_device_capability(). The function returns the compute capability of a CUDA-enabled GPU device, which represents the version and features supported by the GPU. It returns a tuple of two integers, (major, minor), representing the major and minor compute capability of the GPU. For example, if the function returns (8, 0), the GPU has a compute capability of version 8.0; a major version of 8 corresponds to the Ampere generation (such as the A100), which supports bfloat16, while the Colab T4 reports (7, 5). If the GPU is bfloat16-compatible, we set compute_dtype to torch.bfloat16 instead, because bfloat16 offers a wider dynamic range than float16 and is generally more stable for training (read this).
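
A minimal sketch of the dtype selection described above (the exact gist may differ slightly):

import torch

use_4bit = True
compute_dtype = getattr(torch, "float16")            # resolve the string "float16" to torch.float16

if compute_dtype == torch.float16 and use_4bit and torch.cuda.is_available():
    major, _minor = torch.cuda.get_device_capability()   # e.g. (7, 5) on a T4, (8, 0) on an A100
    if major >= 8:                                        # Ampere or newer GPUs support bfloat16
        compute_dtype = torch.bfloat16                    # prefer bfloat16 when the GPU supports it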

Next, we load our base model using AutoModelForCausalLM.from_pretrained, just like we discussed earlier. On line 31 we set model.config.use_cache to False. The cache stores key/value pairs from previous decoding steps to speed up autoregressive text generation; it isn't needed during training, so we disable it while fine-tuning (it is also incompatible with gradient checkpointing), see here and here. On line 32 we set model.config.pretraining_tp = 1; the tp there stands for tensor parallelism, and according to the tips for Llama 2 here:

setting config.pretraining_tp to a value different than 1 will activate the more accurate but slower computation of the linear layers, which should better match the original logits.

Next, we load our Llama tokenizer using the model_name. If you look at the files for NousResearch/Llama-2 here, you'll notice there is a tokenizer.model file; using the model_name, the AutoTokenizer is able to download that tokenizer file. On line 36 we call add_special_tokens({'pad_token': '[PAD]'}). This is another important step: since the texts in our dataset vary in length, sequences within a batch may have different lengths, and to ensure that all sequences in a batch have the same length, padding tokens are added to the shorter sequences. These padding tokens are typically tokens that don't carry any meaning, such as <pad> or [PAD]. On line 37 we set tokenizer.pad_token = tokenizer.eos_token; by doing this we align the pad token with the EOS token and make our tokenizer configuration more consistent. Both tokens have a role in indicating the end of a sequence, and when they are the same, the tokenization and padding logic is simplified. On line 38 we set the padding side, that is, the side on which padding tokens are added to make the batch sequences the same length. This may seem counterintuitive, but setting the padding side to the right fixes an overflow issue discovered here. Finally, on line 41, we initialize our LoraConfig; refer back to our discussion of LoraConfig from earlier if you need a refresher.
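
A minimal sketch of the tokenizer setup described above (the exact gist may differ slightly):

from transformers import AutoTokenizer

model_name = "NousResearch/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token   # reuse the EOS token for padding
tokenizer.padding_side = "right"            # pad on the right to avoid the overflow issue mentioned above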

Training

On line 2 we initialize our TrainingArguments using the parameters we discussed in detail earlier. Then we pass TrainingArguments into our SFTTrainer on line 30, along with the other relevant parameters we discussed. One parameter we didn't discuss, though, is dataset_text_field="text" on line 27. In summary, the dataset_text_field parameter indicates which field in the dataset contains the text that serves as input to the model. It lets the SFTTrainer automatically prepare the dataset from that field (including building a ConstantLengthDataset when packing is enabled), simplifying data preparation and ensuring efficient training. Mind you, this is all possible because we are using a HuggingFace-formatted dataset. If you haven't noticed by now, the HuggingFace ecosystem is a tight-knit ecosystem of libraries that automate a lot of work behind the scenes for you.

Inference

On line 6 we see the pipeline initialization we discussed earlier. Then we use the pipeline on line 7 by passing our input text, constructed using the prompt from line 5. We use <s> to indicate the start of the sequence, while [INST] and [/INST] are added as control tokens to indicate the start and end of a user message. Read this chat templating guide to learn more about constructing chat inputs using control tokens.
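
For reference, the call looks roughly like this. The prompt below is a hypothetical example, and model and tokenizer are the fine-tuned model and tokenizer from the previous steps.

prompt = "What is a large language model?"   # hypothetical example prompt
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")  # wrap the prompt in Llama 2's control tokens
print(result[0]["generated_text"])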

Reloading the base model with adapter weights

On line 2, we use AutoModelForCausalLM.from_pretrained to (re)load the base model; of course, we do this without any quantization configuration because we are not fine-tuning it, we just want to merge it with the adapters. Earlier, when we discussed PeftModel, I explained why we use it on line 10 to merge_and_unload the base_model with the new_model (the fine-tuned adapter weights). We also reload the tokenizer on line 13 and make the same modifications to it that we made earlier (lines 13-14).

Saving

Finally, we can then upload both our newly fine-tuned model and its tokenizer to HuggingFace.

model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)
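
Note that pushing to the Hub requires you to be authenticated with a HuggingFace access token that has write permission; in a notebook this is typically done with something like:

from huggingface_hub import notebook_login

notebook_login()   # paste an access token with write permission when prompted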

Conclusion

This was a long one, but I'm sure that if you made it all the way here, it was worth it. Even though it may seem like we covered a lot, we really just scratched the surface of many topics. But it is a good start, because we can take most of what we learned here and apply it to fine-tuning any LLM. With respect to fine-tuning Llama 2, you can find more info on some interesting next steps here. For me, some of the questions on my mind now are: How can we properly evaluate our fine-tuning performance? Can we fine-tune a larger model (maybe 70B) without spending too much? What about working with larger datasets? What about model deployment? I'll cover some of these next time.

Thanks for reading!
