LLaMA 2: A Detailed Guide to Fine-Tuning the Large Language Model

Gobi Shangar
10 min read · Feb 13, 2024


Large Language Models (LLMs):

  • Trained using massive datasets and models with a large number of parameters (e.g., GPT-3 with 175B parameters).
  • Commonly known as foundational models.

Generalization of LLMs:

  • Models like ChatGPT showcase strong generalization when solving common problems.
  • In-context learning is achieved through a few examples (few-shot): the model interprets new input based on the context provided by a small set of demonstrations, without any weight updates. A sketch of such a prompt follows below.
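
As a quick illustration (the review texts and labels below are invented for this example), a few-shot sentiment prompt might look like this:

# Hypothetical few-shot prompt: the two labelled examples act as in-context
# "training data"; the model infers the pattern and completes the third line.
prompt = (
    "Review: The battery dies within an hour. Sentiment: negative\n"
    "Review: Crisp display and fast shipping. Sentiment: positive\n"
    "Review: The hinge broke after a week. Sentiment:"
)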

Fine-tuning Challenges:

  • While few-shot learning is possible, fine-tuning becomes essential for better results in specific domains.
  • Directly fine-tuning all parameters of large models incurs significant costs.

Parameter-Efficient Fine-Tuning (PEFT):

  • Researchers focus on efficient fine-tuning strategies to overcome costs.
  • PEFT aims for optimal fine-tuning with fewer parameters.

Low-Rank Adaptation (LoRA):

  • Proposed by a Microsoft research team.
  • Freezes the pre-trained model’s weights (e.g., Llama) and trains only a small set of injected parameters.
  • Low-rank adaptation (LoRA) is a novel technique for fine-tuning large language models (LLMs) that significantly reduces the number of trainable parameters while maintaining or even improving their performance.
  • This is achieved by injecting smaller, trainable matrices into each layer of the LLM’s architecture, instead of directly modifying the original weights.

Achieving Excellent Results with LoRA:

  • Similar to the Adapter concept.
  • Small LoRA network inserted into specific layers.
  • Adapts the model to different tasks efficiently.

Benefits of LoRA:

  • Reduced memory footprint: By significantly reducing the number of trainable parameters, LoRA makes it possible to fine-tune large LLMs on devices with limited memory, such as mobile phones or edge devices.
  • Faster training and adaptation: With fewer parameters to update, LoRA training is much faster compared to traditional fine-tuning methods. This can be especially beneficial for tasks that require frequent adaptation to new data.
  • Improved performance: In some cases, LoRA can even lead to improved performance compared to traditional fine-tuning, especially for tasks that require few-shot learning or adaptation to small datasets.

How Low-Rank Adaptation (LoRA) Works:

Freezing Original Weights:

  • The pre-trained Large Language Model’s (LLM) weights are frozen.
  • These weights remain constant and are not updated during subsequent training steps.

Injecting Rank-Decomposition Matrices:

  • Small, trainable matrices are inserted into each layer of the LLM.
  • These matrices have a much lower rank than the original weight matrices: a d×k weight is adapted through a d×r and an r×k matrix with r ≪ d, k.
  • The lower rank reduces the number of trainable parameters in the model.

Training Rank-Decomposition Matrices:

  • During training, only the parameters within the injected matrices are updated.
  • These matrices learn to adapt the pre-trained weights to the specifics of the new task or data.
  • The adaptation occurs with significantly fewer parameters than directly fine-tuning all weights.

By combining frozen original weights with trainable low-rank matrices, low-rank adaptation efficiently fine-tunes Large Language Models, making them adaptable to different tasks with reduced computational costs.
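
A minimal PyTorch sketch of this mechanism (illustrative only; the dimensions and the alpha/r scaling follow the standard LoRA formulation, W' = W + (alpha/r)·B·A, not this article's exact configuration):

import torch

# The frozen pre-trained weight W is adapted by a trainable low-rank
# product B @ A, scaled by alpha / r, as in the original LoRA paper.
d, k, r, alpha = 4096, 4096, 8, 16
W = torch.randn(d, k)          # frozen: receives no gradient updates
A = torch.randn(r, k) * 0.01   # trainable, small random initialization
B = torch.zeros(d, r)          # trainable, zero init so the update starts at 0

def lora_forward(x):
    # x: (batch, k). Output is x @ W.T plus the scaled low-rank correction.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = torch.randn(2, k)
print(lora_forward(x).shape)   # torch.Size([2, 4096])

Only A and B receive gradients during fine-tuning; the full-rank update to W is never materialized.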

01. Configuring Models

The necessary libraries will be installed first.

pip install pyarrow 
pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

Next, we load the required modules from these libraries.

import os
import torch
from datasets import Dataset, load_dataset
import pandas as pd
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

  1. torch: Core library for tensor computation and deep learning.
  2. peft: Enables parameter-efficient fine-tuning of large language models via techniques such as LoRA, reducing the number of trainable parameters for faster fine-tuning on resource-constrained hardware.
  3. bitsandbytes: Provides quantization techniques (e.g., 8-bit and 4-bit) for neural networks, reducing memory and compute requirements and making deployment on devices with limited resources more feasible.
  4. transformers: Simplifies working with large language models, offering pre-trained models, tokenizers, and training utilities.
  5. trl: Provides trainers for fine-tuning and optimizing language models, including supervised fine-tuning (SFT).
  6. accelerate: Speeds up training and inference across different hardware platforms.
  7. datasets: Simplifies dataset loading and preparation for machine learning tasks.
  8. pipeline: Streamlines the use of pre-trained models for common NLP tasks without custom training code.
  9. pyarrow: Columnar data backend used by datasets for efficient data loading and processing.
  10. LoraConfig: Holds configuration parameters for LoRA-based fine-tuning.
  11. SFTTrainer: Handles supervised fine-tuning, optimization, and evaluation.

02. Data Prepping

Large Language Models like Llama 2 benefit from various dataset types: Instruction, Raw Completion, and Preference. The instruction dataset, especially for Supervised Fine Tuning, is commonly used. In this tutorial, we’ll utilize the “garage-bAInd/Open-Platypus” instruction dataset from Hugging Face for fine-tuning.

os.environ["HF_TOKEN"] = "hf-"
dataset = load_dataset("garage-bAInd/Open-Platypus")
dataset["train"].to_pandas()
(Preview of the training dataset as a pandas DataFrame.)

Convert a dataset containing ‘instruction’ and ‘output’ columns into a new dataset suitable for fine-tuning Llama.

def convert_dataset(data):
    instruction = data["instruction"]
    output = data["output"]
    # Wrap each example in Llama 2's chat prompt template
    prompt = f"<s>[INST] {instruction} [/INST] {output} </s>"
    return {'text': prompt}

converted_data = [convert_dataset(row) for row in dataset["train"]]
train_dataset = Dataset.from_pandas(pd.DataFrame(converted_data))
print(train_dataset[:5])

03. Setting up the Model and Tokenizer

In refining the Llama 2 model into our improved version, llama-2-7b-finetune-enhanced (the name is chosen arbitrarily), we take several steps to ensure compatibility and efficiency. First, we configure the tokenizer for half-precision (fp16) training, which reduces memory consumption and speeds up training; because padding behaves oddly under fp16, the tokenizer's padding side is adjusted to compensate. The model setup then involves the following key steps:

a. Loading the Pre-trained Llama 2 Model:

  • We initiate the loading process for the pre-trained Llama 2 model, incorporating customized quantization configurations.
  • Caching functionality is deactivated, and the pretraining tensor-parallelism setting (pretraining_tp) is specified.

b. Quantization for Model Size Reduction:

  • To enhance inference speed and reduce the model size, a 4-bit quantization approach is employed, utilizing the `BitsAndBytesConfig`.
  • This quantization method entails representing model weights in a manner that conserves memory resources.

c. Quantization Type Configuration:

  • The quantization configuration opts for the 'nf4' (4-bit NormalFloat) type, a data type tailored to the distribution of pre-trained weights.

By adhering to these meticulous steps, we effectively optimize the model, striking a balance between efficient memory utilization, expedited inference speed, and sustained high performance.

# Model and tokenizer names
base_model_name = "NousResearch/Llama-2-7b-chat-hf"
refined_model = "llama-2-7b-finetune-enhanced"
# Tokenizer
llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)

  • Loads a pre-trained tokenizer for the Llama 2 model, crucial for text preprocessing.

llama_tokenizer.pad_token = llama_tokenizer.eos_token

  • Sets the padding token to the end-of-sentence token, ensuring uniform sequence length for batch processing.

llama_tokenizer.padding_side = "right"

  • Specifies padding token addition to the right side, essential for compatibility with half-precision floating-point operations.

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False
)

quant_config = BitsAndBytesConfig()

  • Instantiates BitsAndBytesConfig for quantization, reducing model size and computational cost.

load_in_4bit=True

  • Loads the weights in 4-bit quantized format, cutting memory usage roughly four-fold compared with fp16 and often speeding up inference.

bnb_4bit_quant_type="nf4"

  • Specifies the "nf4" (4-bit NormalFloat) quantization type, a data type designed for normally distributed weights that preserves accuracy well at 4-bit precision.

bnb_4bit_compute_dtype=torch.float16

  • Sets the computation data type to torch.float16, leveraging half-precision for faster computation. Reduces memory footprint and speeds up computations, though it might slightly affect accuracy.

bnb_4bit_use_double_quant=False

  • Disables "double quantization," which would additionally quantize the quantization constants themselves to save a little more memory. Leaving it off keeps the configuration simpler.

Quantization enhances model deployability on resource-limited devices, balancing size, performance, and accuracy. Experimentation with settings is crucial for optimal performance.
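
As a rough back-of-envelope illustration of what 4-bit loading buys (weights only; activations, optimizer state, and quantization overhead are ignored):

# Approximate weight memory for a 7B-parameter model at different precisions.
params = 7e9
for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("4-bit", 0.5)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.1f} GB")
# fp32: 28.0 GB, fp16: 14.0 GB, 4-bit: 3.5 GB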

# Load Llama2
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=quant_config,
    device_map={"": 0}
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

base_model_name

  • Specifies the pre-trained Llama 2 model’s name, defining its architecture and weights.

quantization_config=quant_config

  • Sets quantization configuration for model compression, potentially enhancing performance by reducing memory usage and speeding up computations.

device_map={"": 0}

  • Maps the entire model (the empty-string key means "all modules") onto GPU 0, keeping every parameter on a single device.

Additional Configurations:

base_model.config.use_cache = False

  • Disables the key/value activation cache used during autoregressive generation.
  • The cache speeds up inference but is unnecessary during training and can interfere with techniques such as gradient checkpointing, so it is switched off for fine-tuning.

base_model.config.pretraining_tp = 1

  • Despite appearances, this is not a sampling temperature: pretraining_tp controls the tensor-parallelism slicing that was used when Llama 2 was pretrained.
  • Setting it to 1 uses the standard (faster) linear-layer computation; values greater than 1 slice the computation to reproduce the pretraining logits exactly, at a speed cost.

04. Configuring LoRA

peft_parameters = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM"
)

This code defines a LoraConfig object using the peft library for fine-tuning the loaded Llama 2 model with Low-Rank Adaptation (LoRA). Here’s a breakdown of each parameter:

lora_alpha=16

  • Purpose: Scales the low-rank update; the adapter's output is multiplied by lora_alpha / r before being added to the frozen weights.
  • Importance: A higher alpha gives the LoRA matrices more influence, potentially speeding up adaptation but raising the risk of overfitting; a lower alpha yields more cautious adaptation.

lora_dropout=0.1

  • Purpose: Applies dropout regularization to the low-rank matrices, preventing them from overfitting to the specific training data.
  • Importance: Dropout randomly drops out a certain proportion of connections during training, encouraging the model to learn more generalizable features and reducing overfitting.

r=8

  • Purpose: Defines the rank of the low-rank matrices used in LoRA.
  • Importance: A higher rank allows for capturing more complex relationships in the original weights, but also increases the number of trainable parameters. A lower rank reduces the number of parameters but might limit the model’s ability to adapt effectively. The optimal rank depends on the specific task and dataset.

bias="none"

  • Purpose: Determines how bias terms are handled in LoRA.
  • "none" trains no bias terms at all. The other options in peft are "all" (make every bias trainable) and "lora_only" (train only the biases of the LoRA-adapted modules).

task_type="CAUSAL_LM"

  • Purpose: Specifies the type of task the model is being fine-tuned for.
  • “CAUSAL_LM” indicates a causal language modeling task, where the model predicts the next token in a sequence based on the previous ones. This helps the library adapt the LoRA configuration to the specific task type.
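
To make the effect of r concrete, here is a quick parameter count for a single 4096×4096 projection at r=8 (illustrative arithmetic; the actual trainable total depends on which modules LoRA targets):

# Trainable parameters for LoRA on one 4096x4096 projection with r=8.
d, k, r = 4096, 4096, 8
full = d * k          # 16,777,216 params if the layer were fine-tuned directly
lora = r * (d + k)    # 65,536 params in the injected A and B matrices
print(f"LoRA trains {lora:,} params ({lora / full:.2%} of the layer)")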

05. Training Configuration

train_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

This code defines a TrainingArguments object using the transformers library to configure various aspects of the fine-tuning process for the Llama 2 model. Here’s a breakdown of each parameter:

output_dir="./results"

  • Purpose: Specifies the directory to save training results, such as checkpoints and logs.

num_train_epochs=2

  • Purpose: Sets the number of training epochs, i.e., how many full passes are made over the training dataset (two here).

per_device_train_batch_size=4

  • Purpose: Defines the batch size for each GPU during training, impacting memory usage and training speed.

gradient_accumulation_steps=1

  • Purpose: Accumulates gradients across multiple batches before updating model weights, which simulates a larger effective batch size within limited memory (see the arithmetic sketch after this breakdown).

optim="paged_adamw_32bit"

  • Purpose: Selects the optimizer for updating model weights during training, using a memory-efficient variant of the AdamW optimizer with 32-bit precision.

save_steps=25

  • Purpose: Specifies how often to save model checkpoints during training, balancing performance and data safety.

logging_steps=25

  • Purpose: Determines logging frequency for training metrics like loss and accuracy, influencing training visibility and overhead.

learning_rate=2e-4

  • Purpose: Sets the learning rate, controlling the magnitude of weight updates based on training data.

weight_decay=0.001

  • Purpose: Applies L2 regularization to prevent overfitting.

fp16=False and bf16=False

  • Purpose: Disables mixed-precision training using 16-bit floating-point formats, opting for 32-bit precision.

max_grad_norm=0.3

  • Purpose: Clips the gradient norm to prevent excessively large updates to model weights.

max_steps=-1

  • Purpose: Caps the total number of training steps; -1 means no cap, so training length is governed by num_train_epochs instead.

warmup_ratio=0.03

  • Purpose: Gradually increases the learning rate over the first 3% of training steps to stabilize early training.

group_by_length=True

  • Purpose: Groups training examples by length during batching to improve efficiency.

lr_scheduler_type="constant"

  • Purpose: Defines the learning rate scheduler, maintaining a constant rate throughout training.

report_to="tensorboard"

  • Purpose: Specifies where to report training metrics, in this case, TensorBoard for visualization and analysis.
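
Under these settings the effective batch size works out as follows (assuming a hypothetical single-GPU run; multiply by the device count for multi-GPU training):

# Effective batch size = per-device batch * gradient accumulation * GPU count.
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_gpus = 1
print(per_device_train_batch_size * gradient_accumulation_steps * num_gpus)
# -> 4 sequences per optimizer step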

fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    peft_config=peft_parameters,
    dataset_text_field="text",
    tokenizer=llama_tokenizer,
    args=train_params
)

This code snippet defines a SFTTrainer object from the trl library to handle the fine-tuning process of your Llama 2 model. Here’s a breakdown of its components:

Parameters:

  • model: Pre-trained Llama 2 model with quantization and configuration.
  • train_dataset: Training dataset containing text data for fine-tuning.
  • peft_config: LoRA configuration controlling model adaptation.
  • dataset_text_field: Field name in training dataset containing text data.
  • tokenizer: Pre-trained tokenizer used to prepare text data.
  • args: Training configuration parameters.

What does the SFTTrainer do?

The SFTTrainer efficiently trains large language models by:

  • Preprocessing data: Converts raw text data into numerical representations.
  • Batching: Groups data examples for efficient processing.
  • Loss calculation: Computes loss between model predictions and targets.
  • Gradient calculation: Determines gradients of the loss function.
  • Parameter updates: Updates model parameters using gradients and optimizer.
  • LoRA adaptation: Adjusts low-rank matrices based on gradients.
  • Logging and reporting: Tracks and reports training metrics.

fine_tuning.train()
fine_tuning.model.save_pretrained(refined_model)

Running fine_tuning.train() initiates the fine-tuning process iteratively over the dataset. Effectiveness depends on data quality, configuration, and hardware.

print(fine_tuning.model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(
            in_features=4096, out_features=4096, bias=False
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=64, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=64, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(
            in_features=4096, out_features=4096, bias=False
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=64, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=64, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
%load_ext tensorboard
%tensorboard --logdir results/runs
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run a text-generation pipeline with our fine-tuned model
prompt = "What is the definition of \"unusual event\""
pipe = pipeline(task="text-generation", model=fine_tuning.model, tokenizer=llama_tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

# Free GPU memory before reloading the base model for merging
del base_model
del pipe
del fine_tuning
import gc
gc.collect()
gc.collect()

06. Merge the Base Model with the Trained Adapter

# Reload the base model in FP16 and merge it with the LoRA weights
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
# Load the trained QLoRA adapters on top, then merge them into the weights
model = PeftModel.from_pretrained(model, refined_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
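
To persist the merged model for later use, a short follow-up (the output directory name here is arbitrary):

# Save the merged full-precision model and its tokenizer to disk.
merged_dir = "llama-2-7b-finetune-enhanced-merged"
model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)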
