Kailash Thiyagarajan
5 min read · Jan 26, 2024

Fine-Tuning Large Language Models with LoRA: Demystifying Efficient Adaptation

Introduction

Large Language Models (LLMs) such as OpenAI's GPT-3 and GPT-3.5, Google's PaLM, and Meta's LLaMA are revolutionizing text generation, translation, and writing assistance. But unleashing their full potential for specific tasks often requires fine-tuning. This article explores how LoRA (Low-Rank Adaptation) and its derivative, QLoRA (Quantized LoRA), offer a faster, more efficient way to adapt LLMs to your needs.

What is Fine Tuning?

Think of training a model like learning a new skill. You start with basic knowledge (initializing parameters), practice (forward pass), and compare your results to the target (desired output). Then, you refine your technique (backward pass) and repeat. Eventually, you become proficient (trained model).

But what if you want to specialize? That’s where fine-tuning comes in. It’s like focusing on a specific aspect of the skill: taking your existing knowledge and adapting it to a new, narrower context. You basically keep practicing and refining, but with a different target in mind.

Customizing LLMs for Specific Tasks

Imagine fine-tuning an LLM as a process of customizing a versatile writing template for a particular genre or audience. Traditional fine-tuning methods often involve adjusting a large portion of the template’s elements, which can take time and effort. LoRA and QLoRA offer a more focused approach, akin to refining specific sections of the template for maximum impact.

There are various methods of fine-tuning:

Layer-wise Fine-Tuning: This is the most commonly used method, in which either the early layers of the pre-trained model (closest to the input) are frozen and only the later layers are fine-tuned, or the entire model is fine-tuned.

Parameter Selective Fine-tuning: This method identifies and updates only a subset of parameters deemed relevant for the task.

Adapter-based Fine-tuning: This technique introduces lightweight “adapter” modules alongside the LLM, containing task-specific adjustments.

Welcome to LoRA

Low-Rank Adaptation (LoRA) is a fine-tuning method introduced by a team of Microsoft researchers in 2021. LoRA takes the idea quoted in the first paper below one step further.

INTRINSIC DIMENSIONALITY EXPLAINS THE EFFECTIVENESS OF LANGUAGE MODEL FINE-TUNING

We empirically show that common pre-trained models have a very low intrinsic dimension; in other words, there exists a low dimension reparameterization that is as effective for fine-tuning as the full parameter space

LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach

Why LoRA?

LoRA is designed to fine-tune large-scale models efficiently by learning low-rank updates for a small subset of the model’s weight matrices, typically the ones with the most significant impact on the task at hand. This contrasts with traditional fine-tuning, where many more weights might be updated. LoRA achieves this by:

  • Tracking changes to weights instead of updating them directly.
  • Decomposing large matrices of weight changes into smaller matrices that contain the “trainable parameters.”

This approach offers several advantages:

  • Significant reduction in trainable parameters, leading to faster and more efficient fine-tuning.
  • Preservation of the original pre-trained weights, allowing for multiple lightweight models for different tasks.
  • Compatibility with other parameter-efficient methods, enabling further optimization.
  • Comparable performance to fully fine-tuned models in many cases.
  • No additional inference latency, as adapter weights can be merged with the base model.

LoRA Shrinks and Speeds Up Fine-Tuning Through Matrix Decomposition

  • This is more for conceptual understanding. Imagine a 5x5 matrix as a storage unit with 25 spaces. LoRA breaks it down, via matrix decomposition with rank r (here r = 1), into two smaller matrices: a 5x1 matrix (5 spaces) and a 1x5 matrix (5 spaces). This reduces the total storage requirement from 25 to just 10 trainable values, making the update far more compact (a small numeric sketch follows this list).
  • Not only does this save space, but it also accelerates computations. Working with smaller matrices involves fewer calculations, leading to faster fine-tuning.
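Here is that toy example in code. This is purely illustrative (NumPy, random values, rank 1); it only demonstrates the parameter-count arithmetic, not the actual LoRA training procedure:

```python
import numpy as np

d = 5   # width/height of the original weight matrix
r = 1   # chosen LoRA rank

# Full weight update: d x d = 25 values to learn
delta_W_full = np.random.randn(d, d)

# LoRA instead learns two small matrices, B (d x r) and A (r x d)
B = np.random.randn(d, r)   # 5 values
A = np.random.randn(r, d)   # 5 values

# Their product reconstructs a d x d (rank-r) update from only 10 values
delta_W_lora = B @ A
print(delta_W_lora.shape)           # (5, 5)
print(d * d, "vs", d * r + r * d)   # 25 vs 10 trainable values
```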

Strategic Focus on Attention Blocks for Maximum Efficiency

  • While LoRA can potentially be applied to different parts of a neural network, it’s often strategically used on attention blocks within Transformer models.
  • Attention blocks play a key role in LLMs, focusing on the most relevant information during language processing. By selectively adapting these blocks, LoRA achieves significant efficiency gains without compromising overall performance.

QLoRA: A Quantized Step for Even More Efficiency

QLoRA takes LoRA a step further by quantizing the frozen base model’s weights to 4-bit precision, while the small set of LoRA adapter parameters is trained on top in higher precision. This reduces memory requirements even more, potentially enabling fine-tuning and deployment on hardware with limited memory and computational resources.
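A minimal sketch of how this typically looks with Hugging Face Transformers, bitsandbytes, and PEFT. The checkpoint name and hyperparameters here are illustrative assumptions, not a prescription:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit NF4 precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # example checkpoint, swap in your own
    quantization_config=bnb_config,
)
base_model = prepare_model_for_kbit_training(base_model)

# The LoRA adapters themselves stay in higher precision and remain trainable
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)
```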

Sample Implementation

Enabling LoRA:

  • Tell LoRA which parts to train: use LoraConfig to specify which parts of the model to update with LoRA, targeting the “query” and “value” matrices in the attention blocks.
  • Wrap the model: enclose the base model with PeftModel to enable LoRA (a hedged sketch follows this list).
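Here is a sketch of those two steps with the PEFT library. The base model, label count, and LoRA hyperparameters are illustrative assumptions; the module names "query" and "value" match BERT-style attention layers:

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # example base model and label count
)

# 1. Tell LoRA which parts to train: the query and value projections
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],
)

# 2. Wrap the model: get_peft_model returns a PeftModel with LoRA layers injected
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()   # reports the small fraction of trainable params
```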

Training both LoRA and classifier:

  • By default, only the LoRA parameters are trained: the pre-trained weights stay frozen, and the newly added classifier head won’t learn either.
  • To train the classifier as well, use the modules_to_save setting to include it in training; this also saves those parameters when you save the model (see the sketch after this list).
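Continuing the sketch above, the classifier head can be included via modules_to_save. The module name "classifier" is the usual name in BERT-style sequence-classification models, but it depends on the architecture:

```python
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],
    modules_to_save=["classifier"],   # train and save the classification head too
)
model = get_peft_model(base_model, lora_config)
```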

Then train using the parameter-efficient model.
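For completeness, a minimal training loop with the Hugging Face Trainer might look like this. The dataset variables train_ds and eval_ds are placeholders for your own tokenized data, and the hyperparameters are just examples:

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="lora-finetuned-model",
    learning_rate=1e-3,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,            # the PEFT-wrapped model from above
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()

# Optionally fold the LoRA weights back into the base model,
# so inference runs with no adapter overhead
merged_model = model.merge_and_unload()
```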

I hope you found this article useful, and if you did, consider giving claps. 👏 :)

Appendix:

Precision:

  • It refers to how accurately numbers are represented in a computer.
  • Neural networks typically use 32-bit floating-point numbers, which offer high precision but can be memory-intensive.
  • Example: Imagine storing the number 3.14159 as a 32-bit float.

Quantization:

  • It’s a technique to reduce the size of a neural network by representing numbers with fewer bits.
  • Common quantization methods include half precision (16 bits) and integer representations (e.g., 8 bits).
  • Example: in 16-bit half precision, 3.14159 is stored as approximately 3.1406 (see the snippet below).
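A quick way to see this loss of precision, using NumPy purely for illustration:

```python
import numpy as np

x32 = np.float32(3.14159)   # 32-bit single precision
x16 = np.float16(3.14159)   # 16-bit half precision

print(float(x32))   # ~3.14159   (nearly unchanged)
print(float(x16))   # 3.140625   (noticeably rounded)
```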

Benefits of quantization:

  • Smaller models: Reduced memory usage and faster inference.
  • Efficient deployment: Can run on less powerful devices like smartphones.

Trade-offs:

  • Accuracy loss: Quantization can introduce small errors, potentially affecting model performance.

Example:

  • Model Size: A model with 1 million weights in 32-bit floats would take 4 MB of memory.
  • Using 16-bit half-precision would reduce it to 2 MB.
  • Further quantizing to 8-bit integers could reduce it to 1 MB (the snippet after this list walks through the arithmetic).
  • Accuracy: The accuracy impact of quantization varies, but it’s often minimal for many tasks.
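As a quick check of that arithmetic (plain Python, decimal megabytes):

```python
weights = 1_000_000

for bits in (32, 16, 8):
    megabytes = weights * bits / 8 / 1_000_000   # bits -> bytes -> MB
    print(f"{bits}-bit: {megabytes:.0f} MB")
# 32-bit: 4 MB, 16-bit: 2 MB, 8-bit: 1 MB
```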

Key Point:

  • Precision and quantization are essential techniques for optimizing model size and speed while balancing potential accuracy trade-offs.