Parameter-Efficient Fine-Tuning using LoRA

Mansi
Techsalo Infotech
Jul 5, 2024

What are Large Language Models?

Large Language Models (LLMs) are a type of artificial intelligence model designed to understand and generate human-like text based on vast amounts of data. They are trained on diverse datasets that include books, articles, websites, and other text sources, enabling them to perform various language-related tasks such as translation, summarization, question answering, and text generation. LLMs use advanced machine learning techniques, particularly deep learning and neural networks, to predict the likelihood of a word or sequence of words in context, allowing them to generate coherent and contextually relevant responses. Examples of LLMs include OpenAI’s GPT-4, Google’s BERT, and Facebook’s RoBERTa.

Why do we need Fine-tuning?

1. Domain Specialization: LLMs are pre-trained on a broad range of text from the internet, but specific applications often require domain-specific knowledge. Fine-tuning on targeted datasets helps the model become more proficient in specialized areas, such as medical terminology, legal language, or technical jargon.

2. Improved Accuracy: By fine-tuning on task-specific data, the model’s performance improves on particular tasks like sentiment analysis, machine translation, or question answering. This targeted training helps reduce errors and increase the model’s accuracy.

3. Customization: Fine-tuning allows the adaptation of LLMs to specific organizational needs or preferences. For example, a company might fine-tune a model to align with its brand voice, industry-specific language, or proprietary data.

4. Bias Reduction: Pre-trained LLMs can inadvertently learn biases present in the training data. Fine-tuning on carefully curated datasets can help mitigate these biases and ensure the model generates more fair and balanced outputs.

5. Resource Efficiency: Fine-tuning is often more resource-efficient than training a model from scratch. Leveraging pre-trained LLMs as a foundation and fine-tuning them on smaller, task-specific datasets saves computational resources and time.

6. Regulatory Compliance: Certain industries have specific regulatory requirements regarding data handling and processing. Fine-tuning LLMs on compliant datasets ensures that the models adhere to industry regulations and standards.

What are the different types of Fine-Tuning?

1. Full Fine-Tuning: All the model’s parameters are updated during training on a new, task-specific dataset.

from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
# Load dataset
dataset = load_dataset('glue', 'mrpc')
# Load pre-trained model and tokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenize dataset
def tokenize(batch):
    return tokenizer(batch['sentence1'], batch['sentence2'], padding=True, truncation=True)
train_dataset = dataset['train'].map(tokenize, batched=True)
test_dataset = dataset['validation'].map(tokenize, batched=True)
# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
# Fine-tune model
trainer.train()

Pros: Can lead to high performance on the target task.

Cons: Requires significant computational resources and time. May overfit if the dataset is small.

2. Feature-based Transfer Learning: The pre-trained LLM is used to extract features, which are then fed into a simpler model (e.g., a linear classifier) that is trained on the task-specific data.

from transformers import BertModel, BertTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Load pre-trained model and tokenizer
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenize and extract mean-pooled BERT features
def extract_features(texts):
    inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).detach().numpy()
# train_texts/train_labels and test_texts/test_labels are assumed to be
# prepared beforehand (e.g. lists of sentences and their labels)
train_features = extract_features(train_texts)
test_features = extract_features(test_texts)
# Train logistic regression model on extracted features
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(train_features, train_labels)
# Evaluate model
accuracy = clf.score(test_features, test_labels)
print(f'Accuracy: {accuracy}')

Pros: Computationally efficient and easy to implement.

Cons: May not capture complex task-specific nuances as well as other methods.

3. Fine-Tuning with Freezing: Some layers of the pre-trained model are frozen (i.e., their parameters are not updated) while others are fine-tuned on the new data.

from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
# Load pre-trained model and tokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Freeze all layers except the classifier head
for param in model.bert.parameters():
    param.requires_grad = False
# Tokenize dataset (dataset and tokenize() as defined in the full fine-tuning example above)
train_dataset = dataset['train'].map(tokenize, batched=True)
test_dataset = dataset['validation'].map(tokenize, batched=True)
# Training arguments and trainer
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
# Fine-tune model
trainer.train()

Pros: Reduces the risk of overfitting and saves computational resources.

Cons: Requires careful selection of which layers to freeze.

4. Domain Adaptation: The model is fine-tuned on a large corpus from a specific domain (e.g., legal texts) before being fine-tuned on the task-specific data.

# Assume domain-specific pre-training has been done and saved as 'bert-domain-adapted'
# (a hypothetical local checkpoint; dataset and tokenize() as defined above)
model = BertForSequenceClassification.from_pretrained('bert-domain-adapted')
tokenizer = BertTokenizer.from_pretrained('bert-domain-adapted')
# Tokenize dataset
train_dataset = dataset['train'].map(tokenize, batched=True)
test_dataset = dataset['validation'].map(tokenize, batched=True)
# Training arguments and trainer
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
# Fine-tune model
trainer.train()

Pros: Improves performance on domain-specific tasks.

Cons: Requires access to a large domain-specific dataset.

5. Parameter-Efficient Fine-Tuning (PEFT): Fine-tunes only a small subset of the model’s parameters or adds a small number of task-specific parameters, keeping most of the model’s weights fixed.

# Note: AdapterConfig and the adapter methods below require the adapter-transformers / adapters library
from transformers import BertTokenizer, BertForSequenceClassification, AdapterConfig, Trainer, TrainingArguments
# Load pre-trained model and tokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Add and activate adapter
adapter_config = AdapterConfig.load("pfeiffer")
model.add_adapter("mrpc", config=adapter_config)
model.train_adapter("mrpc")
# Tokenize dataset
train_dataset = dataset['train'].map(tokenize, batched=True)
test_dataset = dataset['validation'].map(tokenize, batched=True)
# Training arguments and trainer
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
# Fine-tune model
trainer.train()

Pros: Significantly reduces the computational resources and time needed for fine-tuning. Can be highly effective when data is limited.

Cons: May not achieve the same level of performance as full fine-tuning for some tasks.

Why is PEFT preferred over other fine-tuning methods?

Parameter-Efficient Fine-Tuning (PEFT) offers significant advantages over traditional fine-tuning methods, particularly in resource efficiency, scalability, and overall practicality. In conventional full fine-tuning, all the model’s parameters are updated, which can be computationally expensive and time-consuming, especially for large models. For instance, a model like GPT-3 with 175 billion parameters would require immense computational resources and storage to fine-tune entirely. In contrast, PEFT methods such as adapters, LoRA (Low-Rank Adaptation), and prompt tuning significantly reduce the number of parameters that need updating. Typically, PEFT involves fine-tuning only 1–10% of the model’s parameters, leading to faster training times and lower memory requirements.

For example, using adapters can lead to a 90% reduction in trainable parameters while maintaining comparable performance to full fine-tuning. This efficiency does not come at the cost of performance. In practice, PEFT methods often achieve accuracy levels close to or even matching full fine-tuning. In a comparative study, adapters applied to BERT for the GLUE benchmark suite resulted in a minimal drop in performance (less than 1% in most cases) while reducing the computational cost by an order of magnitude.
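To get a feel for these numbers yourself, you can wrap a model with LoRA and compare its trainable and total parameter counts. This is a minimal sketch, assuming the transformers and peft packages are installed; the exact fraction depends on the rank and target modules you choose.

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Wrap a BERT classifier with LoRA adapters and compare parameter counts
base = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
lora_cfg = LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=32, target_modules=["query", "value"])
peft_model = get_peft_model(base, lora_cfg)

trainable = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in peft_model.parameters())
print(f"Trainable: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")
# peft also prints this summary directly:
peft_model.print_trainable_parameters()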

Moreover, PEFT is highly advantageous in scenarios with limited labeled data. Full fine-tuning on small datasets can lead to overfitting, whereas PEFT’s reduced parameter space mitigates this risk, providing more robust generalization. Additionally, PEFT allows for rapid adaptation of models to multiple tasks without extensive retraining, facilitating more flexible deployment in real-world applications. For instance, a PEFT-tuned model can be quickly adapted to new domains such as legal or medical text processing with minimal computational overhead.

In summary, PEFT methods offer a more efficient, scalable, and practical approach to fine-tuning large language models. By significantly reducing the number of trainable parameters and computational costs, while maintaining high performance, PEFT stands out as an optimal solution for deploying LLMs in diverse and resource-constrained environments.

Parameter-Efficient Fine-Tuning (PEFT) Methods

  1. Adapters
  2. Low-Rank Adaptation (LoRA)
  3. Prompt Tuning

Adapters are small bottleneck layers inserted within the pre-trained model’s layers. During fine-tuning, only the parameters of these adapters are updated, while the rest of the model remains fixed. This approach significantly reduces the number of trainable parameters.
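Conceptually, an adapter is just a small bottleneck block with a residual connection, inserted after a transformer sub-layer. The sketch below is purely illustrative (the exact layout differs between adapter libraries), but it shows why so few parameters are trained.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""
    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Only these few weights are trained; the surrounding transformer stays frozen
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Example: a 768-dimensional hidden state (BERT-base size) passes through unchanged in shape
adapter = Adapter(hidden_size=768)
x = torch.randn(2, 16, 768)   # (batch, sequence, hidden)
print(adapter(x).shape)       # torch.Size([2, 16, 768])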

LoRA adds trainable low-rank matrices alongside the existing weight matrices, so the model can be adapted by modifying only a small fraction of its parameters. Instead of updating a full weight matrix, it learns a low-rank decomposition of the update, which drastically reduces the number of trainable parameters, speeds up training, and lowers resource consumption, while typically giving up little performance.
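In code, the core idea looks roughly like this: the pre-trained weight W is frozen, and a trainable rank-r update scaled by alpha/r is added on top. This is a simplified sketch of the mechanism, not the peft library's internal implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W_eff = W + (alpha / r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                    # freeze the pre-trained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at the start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(768, 768), r=4, alpha=32)
print(layer(torch.randn(2, 768)).shape)   # torch.Size([2, 768])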

QLoRA (Quantized Low-Rank Adaptation) is an extension of LoRA in which the frozen base model is quantized (typically to 4-bit precision) while the low-rank adapter matrices are trained on top, cutting memory requirements even further.
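A typical QLoRA setup loads the frozen base model in 4-bit precision via bitsandbytes and then trains LoRA adapters on top. This is a minimal sketch, assuming a CUDA GPU and the transformers, peft and bitsandbytes packages; the model name and target modules here are just examples.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base weights to 4-bit NF4 and compute in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",            # small model used purely for illustration
    quantization_config=bnb_config,
    device_map="auto",
)
# Only the LoRA matrices are trained; the 4-bit base weights stay frozen
lora_config = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=32,
                         lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
qlora_model = get_peft_model(base_model, lora_config)
qlora_model.print_trainable_parameters()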

PEFT Tuning Using LoRA

Step 1. Import Libraries

# For any HF basic activities like loading models
# and tokenizers for running inference
# upgrade is a must for the newest Gemma model
!pip install --upgrade datasets
!pip install --upgrade transformers
# For doing efficient stuff - PEFT
!pip install --upgrade peft
!pip install --upgrade trl
!pip install bitsandbytes
!pip install accelerate
# for logging and visualizing training progress
!pip install tensorboard
# If creating a new dataset, useful for creating *.jsonl files
!pip install jsonlines
  • datasets and transformers: Essential for working with Hugging Face models and datasets.
  • peft and trl: For implementing parameter-efficient fine-tuning techniques.
  • bitsandbytes: Helps in reducing memory usage by using 8-bit precision.
  • accelerate: Optimizes training speed and handles distributed training.
  • tensorboard: Provides tools for visualizing training metrics.
  • jsonlines: Useful for handling datasets in JSON lines format

Step 2. Load Model

load_model = 'distilbert-base-uncased'

Specifies the pre-trained model to be used (DistilBERT in this case), which is a lightweight version of BERT.

Step 3. Define Labels for mapping

Id_to_Label = {0: 'Neg', 1: 'Pos'}
Label_to_Id = {'Neg': 0, 'Pos': 1}

Maps numerical labels to human-readable labels and vice versa. This is useful for interpreting the model’s predictions.

Step 4. Load the model from the original checkpoint

from transformers import AutoConfig, AutoModelForSequenceClassification
config = AutoConfig.from_pretrained(load_model, num_labels=2)
model_loading = AutoModelForSequenceClassification.from_pretrained(load_model, config=config)
# If you need to use the label mappings later
model_loading.config.id2label = Id_to_Label
model_loading.config.label2id = Label_to_Id

Loads the pre-trained model and its configuration, setting up the model for sequence classification with two labels (negative and positive sentiment).

Step 5. Load Dataset

from datasets import load_dataset
dataset = load_dataset("glue", "sst2")

Loads the SST-2 dataset from the GLUE benchmark, which is a standard dataset for sentiment analysis tasks.
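For orientation, you can inspect the splits and a single record before tokenizing; each SST-2 example carries a 'sentence', a binary 'label', and an 'idx' field.

print(dataset)               # DatasetDict with 'train', 'validation' and 'test' splits
print(dataset["train"][0])   # e.g. {'sentence': '...', 'label': 0 or 1, 'idx': ...}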

Step 6. Preprocess the Data: Tokenize and Map

from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(load_model, add_prefix_space=True)
# Add a pad token if the tokenizer does not define one
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model_loading.resize_token_embeddings(len(tokenizer))
# Create custom tokenize function
def tokenize_function(examples):
    text = examples["sentence"]
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )
    return tokenized_inputs
# Call the function and map it to all texts
tokenized_dataset = dataset.map(tokenize_function, batched=True)

Tokenizes the dataset, ensuring that the text data is properly formatted for the model. Adjusts the tokenizer to add a pad token if it’s missing.

Step 7. Import the Data Collator from transformers

from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Uses a data collator to handle padding dynamically, ensuring that input sequences in a batch have the same length.

Step 8. Performance Evaluation Metrics

import evaluate
import numpy as np
accuracy = evaluate.load("accuracy")
# Define an evaluation function
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)
    # accuracy.compute already returns a dict like {"accuracy": value}
    return accuracy.compute(predictions=predictions, references=labels)

Loads the accuracy metric for evaluating model performance and defines a function to compute accuracy during evaluation.

Step 9. Sample data for testing

import torch

# Move model to the GPU
model_loading.to('cuda')
# Sample texts for testing
test_texts = [
    "The movie was not that fantastic!",
    "I had a terriblly rocking day.",
    "The new phone has an amazing camera.",
    "I'm not feeling well today.",
    "The food is great yet costly",
    ", though many of the actors throw off a spark or two when they first appear , they ca n't generate enough heat in this cold vacuum of a comedy to start a reaction .",
    "khouri manages , with terrific flair , to keep the extremes of screwball farce and blood-curdling family intensity on one continuum .",
    "alternating between facetious comic parody and pulp melodrama , this smart-aleck movie ... tosses around some intriguing questions about the difference between human and android life",
    "it is freezing outside like winterfell"
]


Moves the model to the GPU and defines a set of sample texts that will be used to compare the model's predictions before and after fine-tuning.

Step 10: Predictions Before Parameter-Efficient Fine-Tuning

# Base model predictions before applying the PEFT technique
print("Untrained model predictions:")
model_loading.eval()
for text in test_texts:
    inputs = tokenizer.encode(text, return_tensors="pt").to('cuda')
    with torch.no_grad():
        logits = model_loading(inputs).logits
    predictions = torch.argmax(logits)
    print(f"{text} - {Id_to_Label[predictions.item()]}")

Evaluates the untrained model’s predictions to establish a baseline performance before applying PEFT.

Step 11: Create the Parameter-Efficient Fine-Tuning Config

from peft import LoraConfig, get_peft_model
peft_config = LoraConfig(task_type="SEQ_CLS",
                         r=4,
                         lora_alpha=32,
                         lora_dropout=0.01,
                         target_modules=["q_lin", "k_lin", "v_lin"])  # DistilBERT attention projections
# For other architectures the module names differ, e.g.:
# target_modules=["q", "v"]
# target_modules=["query_key_value"]

Configures the PEFT with LoRA settings, specifying the target modules for low-rank adaptations, and setting parameters such as rank (r), scaling factor (lora_alpha), and dropout rate (lora_dropout).

Step 12: Load the Parameter Efficient Fine Tuning Model and define hyperparameters

model_loading = get_peft_model(model_loading, peft_config)
model_loading.print_trainable_parameters()
print("Loading PEFT model",model_loading)
# Training Parameters
lr = 1e-3
batch_size = 4
num_epochs = 1

Loads the model with the PEFT configuration and sets hyperparameters for training, including learning rate, batch size, and number of epochs.

Step 13: Create Training Arguments

from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir=load_model + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

Specifies training arguments for the fine-tuning process, including output directory, learning rate, batch size, number of epochs, and strategies for evaluation and saving the best model.

Step 14: Load trainer for training the Parameter Efficient Fine Tuning Model

trainer = Trainer(
    model=model_loading,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
# Train function calling
trainer.train()

Sets up the Trainer with the model, training arguments, datasets, tokenizer, data collator, and evaluation metrics, and initiates the training process.

Step 15. Inference

import torch

# Move model to the GPU
model_loading.to('cuda')

# Sample texts for testing
test_texts = [
    "The movie was not that fantastic!",
    "I had a terriblly rocking day.",
    "The new phone has an amazing camera.",
    "I'm not feeling well today.",
    "The food is great yet costly",
    ", though many of the actors throw off a spark or two when they first appear , they ca n't generate enough heat in this cold vacuum of a comedy to start a reaction .",
    "khouri manages , with terrific flair , to keep the extremes of screwball farce and blood-curdling family intensity on one continuum .",
    "alternating between facetious comic parody and pulp melodrama , this smart-aleck movie ... tosses around some intriguing questions about the difference between human and android life",
    "it is freezing outside like winterfell"
]

# Predict sentiment for each text with the fine-tuned model
print("Trained model predictions:")
model_loading.eval()  # make sure the model is in evaluation mode
for text in test_texts:
    # Tokenize and move inputs to the GPU
    inputs = tokenizer.encode(text, return_tensors="pt").to('cuda')
    with torch.no_grad():
        logits = model_loading(inputs).logits
    predictions = torch.argmax(logits)
    print(f"{text} - {Id_to_Label[predictions.item()]}")

Conclusion: In this walkthrough, a pre-trained language model (DistilBERT) was fine-tuned on the SST-2 sentiment dataset using parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA), updating only a small fraction of the model's parameters.
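After training, only the small LoRA adapter needs to be persisted; it can later be re-attached to the base model (or merged into it) for inference. Below is a minimal sketch using the peft API; the directory name is just an example.

# Save only the LoRA adapter weights (a few MB, not the full model)
model_loading.save_pretrained("distilbert-sst2-lora-adapter")

# Later: rebuild the base model and attach the saved adapter
from transformers import AutoModelForSequenceClassification
from peft import PeftModel

base = AutoModelForSequenceClassification.from_pretrained(load_model, num_labels=2)
inference_model = PeftModel.from_pretrained(base, "distilbert-sst2-lora-adapter")
# Optionally fold the adapter into the base weights for deployment
merged_model = inference_model.merge_and_unload()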

Note: We at Techsalo Infotech are a team of engineers solving complex data engineering and machine learning problems. Please reach out to us at sales@techsalo.com for any query on how to build these systems at scale and in the cloud.
