Fine-Tune Large Language Model in a Colab Notebook

Step-by-Step Phi-2 Model Fine-tuning

Prasad Mahamulkar
6 min read · Jan 22, 2024

Large language models (LLMs) have taken the whole world by storm due to their capabilities to understand and generate text in a human-like fashion. LLMs are trained on vast amounts of text data using deep learning techniques. These models are capable of generating human-like text and performing various natural language processing (NLP) tasks. The best part is that you can use LLMs like Falcon, Llama-2, Mistral, Phi-2, etc. for research and commercial purposes. This creates a great opportunity for businesses to fine-tune these pre-trained models and use them in commercial applications with the ability to process confidential data.

In this article, we will learn how to fine-tune Microsoft’s Phi-2, a 2.7 billion-parameter language model.

Now before we jump into fine-tuning, let’s understand the LLM training process first.

Training a large language model involves two main stages: pretraining and fine-tuning.

Pretraining

Pretraining is the initial phase of training where the model is exposed to a vast amount of unlabeled text data. During this phase, the model learns to understand the structure, patterns, and relationships within the language by predicting the next word in a sequence or filling in missing words. This process allows the model to develop a foundational understanding of grammar, syntax, and semantics.
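
To make this concrete, here is a minimal sketch of the next-token-prediction objective using the Hugging Face transformers library. GPT-2 is used here purely as a small illustrative model, and the prompt is arbitrary; this is not part of the Phi-2 fine-tuning workflow itself.

# Minimal sketch of the pretraining objective: next-token prediction.
# GPT-2 is used only because it is small enough to run anywhere.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The patient was treated with", return_tensors="pt")
# Passing the input ids as labels makes the model return the
# cross-entropy loss over its next-token predictions.
outputs = lm(**inputs, labels=inputs["input_ids"])
print("loss:", outputs.loss.item())

# The most likely next token according to the model:
next_token_id = outputs.logits[0, -1].argmax().item()
print("next token:", tok.decode([next_token_id]))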

Fine-tuning

On the other hand, fine-tuning (or instruction tuning) is the process where the pre-trained model is further trained on a smaller dataset to adapt its knowledge to a specific task or domain. This process tweaks the model’s parameters to perform specific tasks. For example, a model pre-trained on a diverse set of web articles may not immediately perform well on a medical question-answering task. There are two main fine-tuning methods:

1. Supervised fine-tuning (SFT): In SFT, the model is trained on a labeled dataset. The labeled dataset typically contains examples of instruction (input) and response (output) pairs relevant to the task. In this process, the model learns how to respond to specific instructions.

2. Reinforcement Learning from Human Feedback (RLHF): In RLHF, the model interacts with users, generates responses, and receives feedback in the form of reinforcement signals. Basically, the model learns and improves its performance based on the feedback it receives.

RLHF is a more complex and expensive fine-tuning technique than SFT. However, it can be more effective for difficult tasks.

We are going to use the SFT method, and for that we need a dataset that has been pre-processed and transformed into the following chat template.

# This is the chat template we are going to use:
text = "### Instruction: {input} ### Assistant: {output}"

You can either transform your dataset using the above chat template or use the dataset from this article, which is ‘prsdm/medquad-phi2-1k,’ available on Hugging Face.
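
If you want to prepare your own data, here is a hypothetical sketch of how such a transformation might look with the datasets library. The file name and the "question"/"answer" column names are assumptions; adjust them to match your own dataset.

# Hypothetical sketch: format a raw question/answer dataset into the chat template
from datasets import load_dataset

raw = load_dataset("json", data_files="my_qa_data.json", split="train")

def to_chat_template(example):
    # Build the "text" column that the trainer will read later
    example["text"] = (
        f"### Instruction: {example['question']} "
        f"### Assistant: {example['answer']}"
    )
    return example

dataset = raw.map(to_chat_template)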

Fine-tune Phi-2

Fine-tuning an LLM is computationally expensive and requires hundreds of gigabytes of VRAM to train billion-parameter models, which makes it a huge challenge to run or train them on consumer hardware. To solve this problem, we use a parameter-efficient fine-tuning (PEFT) technique called QLoRA (Quantized Low-Rank Adaptation), an extension of LoRA (Low-Rank Adaptation). This technique reduces memory usage by fine-tuning a small number of extra model parameters while freezing the existing ones, which allows the model to run in 4-bit precision while maintaining high performance.

Let’s start the model fine-tuning process.

The code is available as a Google Colab notebook. Run it using a T4 GPU.

First, install and import the transformers, accelerate, peft, trl, and bitsandbytes libraries.

# Install and import the necessary libraries
!pip install torch peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 accelerate einops

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer

Then, set up the base model, load the dataset, and configure the tokenizer so that every sample is padded to a uniform length.

# Model
base_model = "microsoft/phi-2"
new_model = "phi-2-medquad"

# Dataset
dataset = load_dataset("prsdm/medquad-phi2-1k", split="train")

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "right"

Next, configure bitsandbytes to enable the quantization of Phi-2 parameters. This involves specifying parameters such as the 4-bit quantization type, the compute data type, and so on.

# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

# Load base model in 4-bit precision
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map={"": 0}
)

# Disable the KV cache during training and set tensor parallelism to 1
model.config.use_cache = False
model.config.pretraining_tp = 1

Then set up the LoRA configuration using peft parameters. LoRA configuration involves specifying parameters like rank, alpha, bias, task type, and target modules. These parameters determine how the model adapts during fine-tuning.

# LoRA configuration
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,  # 0.1
    bias="none",
    task_type="CAUSAL_LM",
    # target_modules=["Wqkv", "out_proj"]  # ["Wqkv", "fc1", "fc2"]  # ["Wqkv", "out_proj", "fc1", "fc2"]
)

Training parameters include setting up the directory for results, number of training epochs, batch size, optimization strategy, learning rate, and more. These parameters influence how Phi-2 adapts to the specific task during fine-tuning.

# Set training arguments
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    fp16=False,
    bf16=False,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    learning_rate=2e-4,
    weight_decay=0.001,
    optim="paged_adamw_32bit",
    lr_scheduler_type="cosine",
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    save_steps=0,
    logging_steps=25,
)

After that, initialize a supervised fine-tuning trainer and train the model on the labeled dataset.

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_arguments,
)

# Train model
trainer.train()

Then, save the newly fine-tuned model.

# Save trained model
trainer.model.save_pretrained(new_model)

You can check the fine-tuned model’s performance using TensorBoard to visualize training results.

%load_ext tensorboard
%tensorboard --logdir results/runs

Then, test the model using the following text-generation pipeline.


# Run text generation pipeline with our new model
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

prompt = "What are the treatments for Gastrointestinal Carcinoid Tumors?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"### Instruction: {prompt}")
print(result[0]['generated_text'])
Output:
There are several treatments available for Gastrointestinal Carcinoid Tumors, including:

1. Surgery: The primary treatment for Gastrointestinal Carcinoid Tumors is surgery to remove the tumor.

2. Chemotherapy: Chemotherapy is a treatment that uses drugs to kill cancer cells. It is often used in combination with surgery to treat Gastrointestinal Carcinoid Tumors.

3. Radiation therapy: Radiation therapy is a treatment that uses high-energy radiation to kill cancer cells. It is often used in combination with surgery and chemotherapy to treat Gastrointestinal Carcinoid Tumors.

4. Targeted therapy: Targeted therapy is a treatment that targets specific molecules or pathways that are involved in the growth and spread of cancer cells.

As you can see, the model performs quite well with only 2.7 billion parameters. You can test it further by asking more difficult questions, or tweak the training parameters to see how it performs.
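
For example, you could probe the model with another question from the same domain (the prompt below is just an illustration):

# Ask the fine-tuned model another question using the same pipeline
prompt = "What are the symptoms of Type 2 Diabetes?"
result = pipe(f"### Instruction: {prompt}")
print(result[0]['generated_text'])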

After training and testing, clear the memory. If you want, you can also push everything to the Hugging Face Hub to save the model, using the following code.

# Clear the memory
del model, pipe, trainer
torch.cuda.empty_cache()


# Reload the base model and merge it with the LoRA parameters
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
model = PeftModel.from_pretrained(model, new_model)
model = model.merge_and_unload()

# Reload tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


# Push the model and tokenizer to the Hugging Face Hub
!huggingface-cli login

model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

Uploading the model to Hugging Face is optional.
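
If you do push it, a minimal sketch of loading the model back from the Hub for inference might look like this. The repository name "your-username/phi-2-medquad" is a placeholder; replace it with your own.

# Sketch: load the fine-tuned model back from the Hub and run inference
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

repo_id = "your-username/phi-2-medquad"  # placeholder repository name
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_length=200)
print(pipe("### Instruction: What causes anemia?")[0]["generated_text"])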

In this article, we covered the fundamental concepts of LLMs and the pretraining and fine-tuning processes, and then fine-tuned the Phi-2 large language model step by step.

By experimenting with different base models and training parameters, you can further enhance your understanding of fine-tuning and optimize models for specific tasks.
