Fine-tuning Llama 2 for news category prediction: A step-by-step comprehensive guide to fine-tuning any LLM (Part 2)

Kshitiz Sahay
8 min read · Aug 7, 2023


A step-by-step comprehensive guide to fine-tuning any LLM.

In this blog, I will guide you through the process of fine-tuning Meta’s Llama 2 7B model for news article categorization across 18 different categories. I will utilize a news classification instruction dataset that I previously created using GPT-3.5. If you’re interested in how I generated that dataset and the motivation behind this mini-project, you can refer to my earlier blog or notebook where I discuss the details.

The purpose of this notebook is to provide a comprehensive, step-by-step tutorial for fine-tuning any LLM (Large Language Model). Unlike many tutorials available, I’ll explain each step in a detailed manner, covering all classes, functions, and parameters used.

This guide will be divided into two parts:

Part 1: Setting up and Preparing for Fine-Tuning

  1. Installing and loading the required modules
  2. Steps to get approval for Meta’s Llama 2 family of models
  3. Setting up Hugging Face CLI and user authentication
  4. Loading a pre-trained model and its associated tokenizer
  5. Loading the training dataset
  6. Preprocessing the training dataset for model fine-tuning

Part 2: Fine-Tuning and Open-Sourcing [This blog]

  1. Configuring PEFT (Parameter Efficient Fine-Tuning) method QLoRA for efficient fine-tuning
  2. Fine-tuning the pre-trained model
  3. Saving the fine-tuned model and its associated tokenizer
  4. Pushing the fine-tuned model to the Hugging Face Hub for public usage

Note that running this on a CPU is practically impossible. If running on Google Colab, go to Runtime > Change runtime type. Change Hardware accelerator to GPU. Change GPU type to T4. Change Runtime shape to High-RAM.

Let’s get started!

Creating PEFT Configuration

Fine-tuning pretrained LLMs on downstream datasets results in huge performance gains when compared to using the pretrained LLMs out-of-the-box. However, as models get larger and larger, full fine-tuning becomes infeasible to train on consumer hardware. In addition, storing and deploying fine-tuned models independently for each downstream task becomes very expensive, because fine-tuned models are the same size as the original pretrained model. Parameter-Efficient Fine-tuning (PEFT) approaches are meant to address both problems!

PEFT approaches only fine-tune a small number of (extra) model parameters while freezing most parameters of the pretrained LLMs, thereby greatly decreasing the computational and storage costs. It also helps in portability, wherein users can tune models using PEFT methods to get tiny checkpoints worth a few MB compared to the large checkpoints of full fine-tuning.

In short, PEFT approaches enable you to get performance comparable to full fine-tuning while only having a small number of trainable parameters.

Hugging Face offers the PEFT library, which provides the latest parameter-efficient fine-tuning techniques, seamlessly integrated with Hugging Face Transformers and Hugging Face Accelerate.

There are several PEFT methods. In the next cell, we will use QLoRA, one of the latest methods, which reduces the memory usage of LLM fine-tuning without sacrificing performance, via the LoraConfig class from the peft library.

QLoRA uses 4-bit quantization to compress a pretrained language model. The LM parameters are then frozen, and a relatively small number of trainable parameters are added to the model in the form of Low-Rank Adapters. During finetuning, QLoRA backpropagates gradients through the frozen 4-bit quantized pretrained language model into the Low-Rank Adapters. The LoRA layers are the only parameters being updated during training.
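Part 1 covered loading the base model in 4-bit precision, which is the quantization step QLoRA relies on. For reference, here is a minimal sketch of that loading step, assuming the same model_name variable used in Part 1 (the exact settings there may differ slightly):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization configuration (NF4 with bfloat16 compute), as used by QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_use_double_quant = True,
)

# model_name is assumed to be the Llama 2 7B checkpoint set up in Part 1
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config = bnb_config,
    device_map = "auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)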

def create_peft_config(r, lora_alpha, target_modules, lora_dropout, bias, task_type):
    """
    Creates Parameter-Efficient Fine-Tuning configuration for the model

    :param r: LoRA attention dimension
    :param lora_alpha: Alpha parameter for LoRA scaling
    :param target_modules: Names of the modules to apply LoRA to
    :param lora_dropout: Dropout probability for LoRA layers
    :param bias: Specifies if the bias parameters should be trained
    :param task_type: Type of the task (here, causal language modeling)
    """
    config = LoraConfig(
        r = r,
        lora_alpha = lora_alpha,
        target_modules = target_modules,
        lora_dropout = lora_dropout,
        bias = bias,
        task_type = task_type,
    )

    return config

Finding Modules for LoRA Application

In the next cell, we will define the find_all_linear_names function to find the modules to apply LoRA to. This function gets the module names from model.named_modules() and stores them in a set to keep only distinct module names.

def find_all_linear_names(model):
    """
    Find modules to apply LoRA to.

    :param model: Quantized pre-trained model
    """

    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    # Exclude the output head; it should remain in higher precision
    if 'lm_head' in lora_module_names:
        lora_module_names.remove('lm_head')
    print(f"LoRA module names: {list(lora_module_names)}")
    return list(lora_module_names)
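On the 4-bit quantized Llama 2 7B model, this typically returns the attention and MLP projection layers. An illustrative call is shown below; since a set is used, the exact order (and, depending on how the model was loaded, the exact contents) may vary:

# Illustrative call on the 4-bit quantized Llama 2 model loaded earlier
target_modules = find_all_linear_names(model)
# Expected output (order may vary):
# LoRA module names: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']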

Calculating Trainable Parameters

We can use the print_trainable_parameters function to find out the number and percentage of trainable model parameters. This function sums all parameters returned by model.named_parameters() and separately counts those that will be updated during training (i.e., those with requires_grad set to True).

def print_trainable_parameters(model, use_4bit = False):
    """
    Prints the number of trainable parameters in the model.

    :param model: PEFT model
    :param use_4bit: Whether the trainable parameters are stored in 4-bit precision
    """

    trainable_params = 0
    all_param = 0

    for _, param in model.named_parameters():
        num_params = param.numel()
        # DeepSpeed ZeRO-3 shards parameters; fall back to ds_numel when numel() is 0
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel
        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params

    # 4-bit layers pack two parameters per value, so halve the count (integer division keeps the :,d format valid)
    if use_4bit:
        trainable_params //= 2

    print(
        f"All Parameters: {all_param:,d} || Trainable Parameters: {trainable_params:,d} || Trainable Parameters %: {100 * trainable_params / all_param}"
    )
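Once the model has been wrapped with get_peft_model (done inside fine_tune below), calling this function prints a line like the one sketched below; the figures shown here are the ones reported later in this post's training log.

# Illustrative call after wrapping the model with get_peft_model
print_trainable_parameters(model)
# Example output from this run (approximate percentage):
# All Parameters: 3,540,389,888 || Trainable Parameters: 39,976,960 || Trainable Parameters %: 1.13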

Fine-tuning the Pre-trained Model

We will create fine_tune, our final function, to wrap everything we have done so far and initiate the fine-tuning process. This function will perform the following model preprocessing operations to prepare it for training:

  1. Enable gradient checkpointing to reduce memory usage during fine-tuning.
  2. Use the prepare_model_for_kbit_training function from PEFT to prepare the model for fine-tuning.
  3. Call find_all_linear_names to get the module names to apply LoRA to.
  4. Create LoRA configuration by calling the create_peft_config function.
  5. Wrap the base Hugging Face model as a PEFT model using the get_peft_model function.
  6. Print the trainable parameters.

For training, we will instantiate a Trainer() object within the fine_tune function. This class requires the model, preprocessed dataset, and training arguments, listed below.

per_device_train_batch_size: The batch size per GPU/TPU/CPU for training.

gradient_accumulation_steps: Number of update steps to accumulate the gradients for before performing a backward/update pass (a quick effective batch size check follows these descriptions).

warmup_steps: Number of steps used for a linear warmup from 0 to learning_rate.

max_steps: If set to a positive number, the total number of training steps to perform.

learning_rate: The initial learning rate for the AdamW optimizer.

fp16: Whether to use 16-bit (mixed) precision training (through NVIDIA apex) instead of 32-bit training.

logging_steps: Number of update steps between two logs.

output_dir: The output directory where the model predictions and checkpoints will be written.

optim: The optimizer to use for training.
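As a quick sanity check of how the batch size and gradient accumulation settings interact, the effective batch size per optimizer update is the per-device batch size multiplied by the number of accumulation steps (and the number of GPUs). With the values we set shortly on a single T4 GPU:

# Effective batch size per optimizer update (single T4 GPU in this Colab setup)
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_gpus = 1
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 4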

Next, we will use the train method on the trainer object to start training and to log and save the model metrics on the training dataset. Finally, we will save the model checkpoint (model weights, configuration file, and tokenizer) in the output directory and delete the model to free up memory. You can load the model for inference later using its saved checkpoint.

def fine_tune(model,
              tokenizer,
              dataset,
              lora_r,
              lora_alpha,
              lora_dropout,
              bias,
              task_type,
              per_device_train_batch_size,
              gradient_accumulation_steps,
              warmup_steps,
              max_steps,
              learning_rate,
              fp16,
              logging_steps,
              output_dir,
              optim):
    """
    Prepares and fine-tunes the pre-trained model.

    :param model: Pre-trained Hugging Face model
    :param tokenizer: Model tokenizer
    :param dataset: Preprocessed training dataset
    """

    # Enable gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # Prepare the model for training
    model = prepare_model_for_kbit_training(model)

    # Get LoRA module names
    target_modules = find_all_linear_names(model)

    # Create PEFT configuration for these modules and wrap the model to PEFT
    peft_config = create_peft_config(lora_r, lora_alpha, target_modules, lora_dropout, bias, task_type)
    model = get_peft_model(model, peft_config)

    # Print information about the percentage of trainable parameters
    print_trainable_parameters(model)

    # Training parameters
    trainer = Trainer(
        model = model,
        train_dataset = dataset,
        args = TrainingArguments(
            per_device_train_batch_size = per_device_train_batch_size,
            gradient_accumulation_steps = gradient_accumulation_steps,
            warmup_steps = warmup_steps,
            max_steps = max_steps,
            learning_rate = learning_rate,
            fp16 = fp16,
            logging_steps = logging_steps,
            output_dir = output_dir,
            optim = optim,
        ),
        data_collator = DataCollatorForLanguageModeling(tokenizer, mlm = False)
    )

    # Disable the KV cache; it is not needed for training and conflicts with gradient checkpointing
    model.config.use_cache = False

    do_train = True

    # Launch training and log metrics
    print("Training...")

    if do_train:
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()
        print(metrics)

    # Save model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok = True)
    trainer.model.save_pretrained(output_dir)

    # Free memory for merging weights
    del model
    del trainer
    torch.cuda.empty_cache()

Next, we initialize the QLoRA and TrainingArguments parameters for training.

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 16

# Alpha parameter for LoRA scaling
lora_alpha = 64

# Dropout probability for LoRA layers
lora_dropout = 0.1

# Bias
bias = "none"

# Task type
task_type = "CAUSAL_LM"

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Batch size per GPU for training
per_device_train_batch_size = 1

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 4

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Optimizer to use
optim = "paged_adamw_32bit"

# Number of training steps (overrides num_train_epochs)
max_steps = 20

# Linear warmup steps from 0 to learning_rate
warmup_steps = 2

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = True

# Log every X updates steps
logging_steps = 1

Finally, we call the fine_tune function to fine-tune, or instruction-tune, the pre-trained model on our preprocessed news classification instruction dataset.

fine_tune(model, tokenizer, preprocessed_dataset,
          lora_r, lora_alpha, lora_dropout, bias, task_type,
          per_device_train_batch_size, gradient_accumulation_steps, warmup_steps,
          max_steps, learning_rate, fp16, logging_steps, output_dir, optim)

With these steps, we have fine-tuned a popular open-source pre-trained model, Llama 2 7B, on an instruction dataset that we created for news classification!

We can see from the log that there are 3,540,389,888 parameters in the model, of which 39,976,960 are trainable. That’s approximately 1.1% of the total parameters. The model trained for 20 steps and reached a loss value of about 1.4. It is possible that the final weights are not the best weights. We can address this by adding an EarlyStoppingCallback to the trainer, which would regularly evaluate the model on a validation dataset and keep only the best checkpoint (a rough sketch follows below).
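As a rough sketch of that idea, assuming you have carved out a held-out validation split (the eval_dataset variable below is hypothetical and not created in this post), the Trainer could be configured roughly as follows:

from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

# Hypothetical sketch: eval_dataset is a validation split you would need to create yourself
trainer = Trainer(
    model = model,
    train_dataset = dataset,
    eval_dataset = eval_dataset,
    args = TrainingArguments(
        output_dir = output_dir,
        per_device_train_batch_size = per_device_train_batch_size,
        gradient_accumulation_steps = gradient_accumulation_steps,
        max_steps = max_steps,
        learning_rate = learning_rate,
        fp16 = fp16,
        optim = optim,
        evaluation_strategy = "steps",   # evaluate on the validation set at regular intervals
        eval_steps = 5,
        save_strategy = "steps",
        save_steps = 5,
        load_best_model_at_end = True,   # required for early stopping
        metric_for_best_model = "eval_loss",
        greater_is_better = False,
    ),
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm = False),
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)],
)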

Merging Weights & Pushing to Hugging Face

After saving the fine-tuned weights, we can create our final model by merging the LoRA weights into the base model and saving it, along with its tokenizer, to a new directory. This gives us a self-contained, fine-tuned model and tokenizer ready for inference. We will also push the fine-tuned model and its associated tokenizer to the Hugging Face Hub for public usage.

# Load fine-tuned weights
model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map = "auto", torch_dtype = torch.bfloat16)
# Merge the LoRA layers with the base model
model = model.merge_and_unload()

# Save fine-tuned model at a new location
output_merged_dir = "results/news_classification_llama2_7b/final_merged_checkpoint"
os.makedirs(output_merged_dir, exist_ok = True)
model.save_pretrained(output_merged_dir, safe_serialization = True)

# Save tokenizer for easy inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_merged_dir)

# Fine-tuned model name on Hugging Face Hub
new_model = "sahayk/news-classification-18-llama-2-7b"

# Push fine-tuned model and tokenizer to Hugging Face Hub
model.push_to_hub(new_model, use_auth_token = True)
tokenizer.push_to_hub(new_model, use_auth_token = True)
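To sanity-check the merged model, you can load it back and run a quick generation. Below is a minimal sketch; the prompt string here is hypothetical, and in practice you would reuse the exact instruction template from Part 1.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged fine-tuned model and tokenizer (from the local checkpoint or the Hub)
model = AutoModelForCausalLM.from_pretrained(output_merged_dir, device_map = "auto", torch_dtype = torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(output_merged_dir)

# Hypothetical prompt; reuse the instruction format the model was trained on
prompt = "Categorize the news article into one of the 18 categories:\n\n<news article text>\n\nCategory:"

inputs = tokenizer(prompt, return_tensors = "pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens = 20)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens = True))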

Check out the fine-tuned model on Hugging Face: https://huggingface.co/sahayk/news-classification-18-llama-2-7b

Check out my Google Colab Notebook for full code.

Conclusion

With these steps, we have fine-tuned Llama 2 7B on a specific problem statement: classifying news articles across 18 categories. In fact, you should be able to fine-tune any LLM on any publicly available or custom dataset and open-source it on the Hugging Face Hub by following these steps.

Thank you for reading!

References: Hugging Face, Philipp Schmid, OVHcloud

Kshitiz Sahay is a senior data scientist at Dun & Bradstreet Inc.