Crafting Your Own Dataset for Fine-Tuning Llama2 in Google Colab: A Step-by-Step Guide (part 2)

Sadat Shahriar
5 min read · Feb 10, 2024


(This is a continuation from here)

In the last article, we built an instruction-response dataset on the movie Barbie. We will now do the fine-tuning.

We will make the Llama learn about Barbie

As part of our routine, let’s begin with some crucial installations. Don’t skip this step: the library ecosystem changes quickly, and Colab’s pre-installed package versions may not always work with the latest APIs.

!pip install -U transformers datasets trl peft accelerate bitsandbytes

And here are our important imports:

## important imports
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer
import warnings
import datasets
import pandas as pd
import numpy as np
import time

We upload the dataset we created in the last article. Feel free to select alternative datasets of your choice, ensuring you have a way to construct the instruction-response pairs.

from google.colab import files
uploaded = files.upload()
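The uploaded file then needs to be read into a pandas DataFrame. The snippet below is a minimal sketch assuming a CSV with the instruction and response columns from part 1; the filename is a placeholder, so use whatever you actually uploaded.

## read the uploaded file into a DataFrame
## "barbie_instructions.csv" is a placeholder -- replace it with your uploaded filename
df = pd.read_csv("barbie_instructions.csv")
print(df.columns)  ## expect the instruction and response columns from part 1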

Now, fine-tuning LLMs involves training the model to comprehend a specific text flow, allowing the embedding of crucial information (or knowledge). In this learning paradigm, unlike fine-tuning classifiers like BERT, there is no distinct input-output structure. As a result, our pseudo-input-output, presented as instruction and response, needs to be integrated into a single body of text. In other words, instead of separate Instruction and Response columns, we feed just one column containing both pieces of information. We name that column formatted_instruction.

df["formatted_instruction"] = df.apply(lambda x: f"### Instruction:\n{x['Intructions']}\n\n### Response:\n{x['Responses']}", axis=1)

To access the pre-trained Llama model from HuggingFace, make sure you’re logged in. If you don’t have an account, you can easily sign up and get your token when prompted.

!huggingface-cli login
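If you prefer an in-notebook widget over the CLI prompt, the huggingface_hub library offers notebook_login(), which does the same thing:

## alternative to the CLI: log in via an in-notebook widget
from huggingface_hub import notebook_login
notebook_login()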

Unfortunately, getting the official Llama release requires separate permission from Meta, which can take a while. Read this to get more details. Alternatively, you can use an unofficial version from the HuggingFace hub.

base_model = "meta-llama/Llama-2-7b-hf" ##official version
## base_model = "NousResearch/Llama-2-7b-chat-hf" ##unofficial version

Get the tokenizer and check the token length in your data. Keep in mind that Llama2 can handle up to 4k tokens of context, but with our limited GPU resources, using that much is impractical. Here’s a bigger question: do we really need that much context length?

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

## compute the token length of each formatted example
df["formatted_instruction_tok_len"] = df["formatted_instruction"].apply(
    lambda text: len(tokenizer(text)["input_ids"])
)

import seaborn as sns
import matplotlib.pyplot as plt

## Visualize the token length
sns.boxplot(x=df["formatted_instruction_tok_len"])
plt.xlabel("formatted_instruction_tok_len")
plt.title("token length")
plt.show()

We found that by choosing a context length of 128, we retained almost all of our data.

df = df[df["formatted_instruction_tok_len"]<=128]

Finally, we need to convert the pandas DataFrame to a HuggingFace Dataset object.

dataset = datasets.Dataset.from_pandas(df)
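A quick look at the converted object confirms the column names and how many rows survived the length filter (the exact counts depend on your own data):

## inspect the converted Dataset (row count depends on your data)
print(dataset)
print(dataset[0]["formatted_instruction"][:200])  ## peek at the first formatted example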

Now for the elephant in the room: loading the model! Researchers have figured out how to fit this huge model on everyday hardware using techniques like LoRA and QLoRA. The key idea is to reduce the precision of the numerical values in the model parameters (quantization).

compute_dtype = getattr(torch, "float16")

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

Let’s break down what we did here:

1. load_in_4bit: The Linear layers of the model architecture are replaced with 4-bit (FP4 or NF4) layers. If you choose load_in_8bit instead, the loaded model will be roughly twice as large (see the sketch after this list).

2. bnb_4bit_quant_type: You can choose either nf4 (normalized float 4) or fp4 (the regular 4-bit float); in practice there is no significant difference. Read https://arxiv.org/abs/2305.14314v1 for more.

3. bnb_4bit_compute_dtype: The dtype used for the actual matrix computations (float16 here), which speeds things up significantly.

4. bnb_4bit_use_double_quant: Quantizes the quantization constants a second time (nested quantization), saving a little extra memory. We keep it off here.
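For comparison, the 8-bit variant mentioned in point 1 would look like the sketch below; we stick with 4-bit in this article, so this is only for reference.

## 8-bit alternative (roughly double the memory footprint of the 4-bit config above)
quant_config_8bit = BitsAndBytesConfig(load_in_8bit=True)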

start = time.time()
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1
print("time for model load: {} seconds".format(time.time()-start))

Let’s check the model size:

def calculate_model_size(model):
    total_size = 0
    for param in model.parameters():
        param_size = param.numel() * param.element_size()  # size in bytes
        total_size += param_size

    total_size_in_mb = total_size / (1024**2)  # convert to megabytes
    return total_size_in_mb

# Example usage:
model_size_mb = calculate_model_size(model)
print(f"Model size: {model_size_mb:.2f} MB")

We also need the PEFT (LoRA) configuration.

peft_params = LoraConfig(
    lora_alpha=16,  ## The alpha parameter for LoRA scaling.
    lora_dropout=0.1,  ## The dropout probability for LoRA layers.
    r=64,  ## Most important hyperparameter: the LoRA attention dimension, or "rank".
    bias="none",
    task_type="CAUSAL_LM"  ## next-token prediction: the autoregressive approach.
)

Now, it’s time to load our training parameters. HuggingFace has made our lives so easy!

training_params = TrainingArguments(
    output_dir="./",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

Some important notes: if you are using Colab, you have just 15 GB of VRAM on the T4 GPU. If per_device_train_batch_size is too large, you might hit an out-of-memory error (see the sketch below for a lower-memory setup). I have set num_train_epochs to 1 for illustration, but in practice you can go up to about 5. Also, pay attention to the save_steps parameter; choosing a small value may fill up your disk with checkpoints.
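If you do run out of memory, a common workaround is to shrink the per-device batch size and compensate with gradient accumulation, so the effective batch size stays the same. This is only a sketch; the values are illustrative, not tuned.

## lower-memory variant: effective batch size = per_device_train_batch_size * gradient_accumulation_steps
## pass this as args=low_mem_params to SFTTrainer if needed
low_mem_params = TrainingArguments(
    output_dir="./",
    num_train_epochs=1,
    per_device_train_batch_size=1,   ## half the batch per step...
    gradient_accumulation_steps=2,   ## ...but accumulate two steps before each update
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    report_to="tensorboard"
)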

We can now set up the trainer. Thanks to HuggingFace’s trl library, we have a built-in supervised fine-tuning trainer (SFTTrainer). Make sure you pass the correct field in dataset_text_field, which is “formatted_instruction” in our case.

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_params,
    dataset_text_field="formatted_instruction",
    max_seq_length=128,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)
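With the trl version used here, passing peft_config means SFTTrainer wraps the model in a PeftModel for us. An optional sanity check is to print the trainable-parameter count and confirm that only the small LoRA adapter will be updated:

## sanity check: only the LoRA adapter parameters should be trainable
trainer.model.print_trainable_parameters()
## expect the trainable fraction to be a small percentage of the ~7B total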

It’s time to start the training process.

start = time.time()
output = trainer.train()
print("Time taken: ", time.time()-start)

It took 13 minutes to train (or fine-tune) my Llama2 model. Since logging is set to 25 steps, you will see the training loss every 25 steps. Ideally, you should also check the validation performance/loss; we skipped that part here. If you want to use a validation set, pass it to the eval_dataset argument of SFTTrainer.
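Here is a minimal sketch of what that could look like, assuming you are willing to hold out 10% of the examples for validation:

## hold out 10% of the examples for validation (the split ratio is arbitrary)
split = dataset.train_test_split(test_size=0.1, seed=42)

trainer = SFTTrainer(
    model=model,
    train_dataset=split["train"],
    eval_dataset=split["test"],   ## call trainer.evaluate() to get the validation loss
    peft_config=peft_params,
    dataset_text_field="formatted_instruction",
    max_seq_length=128,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)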

We can check the gradual decline in training loss using TensorBoard.

from tensorboard import notebook
log_dir = "runs"
notebook.start("--logdir {} --port 4000".format(log_dir))

Now that our model is trained, let’s check how it performs on our instruction format.

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer,
                max_length=100)
prompt = ("### Instruction:\nPlease tell me in brief about the movie "
          "called \"Barbie\"\n\n### Response:\n")
gen_text = pipe(prompt)
print(gen_text[0]['generated_text'][len(prompt):])

This is the output:

Barbie is a 2023 American fantasy comedy film directed by Greta Gerwig from a screenplay by Gerwig and Noah Baumbach. The film stars Margot Robbie as the title character.

This is fascinating, right?
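If you want to keep what you just trained, only the small LoRA adapter needs to be saved, not the full 7B model. Here is a sketch (the directory names are placeholders) that saves the adapter and, optionally, merges it back into the base model using peft’s PeftModel:

## save only the LoRA adapter ("llama2-barbie-adapter" is a placeholder directory name)
trainer.model.save_pretrained("llama2-barbie-adapter")
tokenizer.save_pretrained("llama2-barbie-adapter")

## optional: merge the adapter into the base model for standalone use
## (reload the base model in half precision first; merging directly into a 4-bit model is not the same operation)
## base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto")
## merged = PeftModel.from_pretrained(base, "llama2-barbie-adapter").merge_and_unload()
## merged.save_pretrained("llama2-barbie-merged")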

The code: https://github.com/sadat1971/AutoAnalysis/blob/main/Babrie_Llama.ipynb

References:

  1. Fine-Tuning LLaMA 2: A Step-by-Step Guide to Customizing the Large Language Model (much of my code is taken from here)
  2. Fine-Tune Your Own Llama 2 Model in a Colab Notebook
  3. A poor man’s guide to fine-tuning Llama 2 (helps build deeper intuition)
