Demystifying LoRA & Q-LoRA

Lokesh Todwal
Sep 23, 2023


In today’s data science universe, large language models (LLMs) are attracting enormous attention, and that attention is drawing users to try them for all kinds of use cases: writing a prompt and asking questions like “Why did Kattappa kill Bahubali?” or “Why not buy an iPhone?”, or fine-tuning them for tasks such as chat assistance, tumor classification, and much more.

Fine-tuning: an approach in which the model is initialized with pre-trained weights and biases, and all (or some) of its parameters then undergo gradient updates.

The real problem arises when we need to fine-tune a model for a use case but do not have the resources (and compute) it takes to update all of its parameters. To give a rough idea of the scale we are talking about: GPT-3 has 175 billion parameters, and even the “small” Llama-2 variant used later in this article has 7 billion.

So now the question arises: does that mean fine-tuning costs a huge amount of money and is not for everyone to explore?

My answer would be no, and in this article I will talk about a technique with which we can fine-tune such big models with very few resources (even the free GPU in Google Colab can fine-tune a model on a dataset of modest size). So let's demystify the heading 😁

LoRA

LoRA stands for Low-Rank Adaptation. The idea is to freeze the pre-trained weights and biases of the model and inject trainable low-rank decomposition matrices into each layer of the Transformer architecture.

The dimensions of these added decomposition matrices are chosen so that the number of trainable parameters drops dramatically (in the paper, the authors report reducing the trainable parameters of GPT-3 by a factor of 10,000 without hurting performance), so we can imagine how game-changing this technique is.

LoRA Explained

The inspiration for this technique came from this paper, where the authors show that pre-trained models have a very low intrinsic dimension and can still learn efficiently when projected into a much smaller random subspace; in simple words, there exists a low-dimensional parameterization that is as effective for fine-tuning as the full parameter space.
Building on that, LoRA hypothesizes that the weight updates during adaptation also have a low intrinsic rank.

So far we have talked a lot about hypotheses; now let us talk math.

Let us pick a layer with pre-trained weight matrix W, and let x be the input to this layer. Our equation is h = Wx.

As mentioned above, we are going to freeze the pre-trained weights, i.e. we don't want them updated during backpropagation, so we introduce another term called ∆W and let all the updates happen on ∆W instead. Our equation now becomes:

h = (W+∆W)x
= Wx + ∆Wx
where W ∈ ℝᵈˣᵏ, ∆W ∈ ℝᵈˣᵏ

So now there are two paths for ∆W:
1. Update the whole (d, k) matrix (which is just ordinary full fine-tuning), or
2. Decompose ∆W into 2 smaller matrices (the LoRA technique)

The first path is nothing new compared to the basic approach and still needs a huge amount of resources and compute. So let's talk about the latter: the matrix ∆W is decomposed into 2 matrices such that

∆W = BA
where B ∈ ℝᵈˣʳ, A ∈ ℝʳˣᵏ and r ≪ min(d, k)

Since both A and B are far smaller than ∆W, updating them requires much less memory and compute during backpropagation. And because the weight updates have an intrinsically low rank, we do not need much information to represent them, so we use the rank r as a hyperparameter to decide what rank these decomposition matrices should have.
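As a quick sanity check on the savings, here is a small back-of-the-envelope calculation (the 4096 × 4096 shape is only an illustrative assumption, roughly the size of one attention projection in a 7B model):

# Rough parameter count for one weight matrix: full fine-tuning vs. LoRA
d, k, r = 4096, 4096, 8          # illustrative dimensions; r is the LoRA rank

full_update = d * k              # training ∆W directly: 16,777,216 parameters
lora_update = d * r + r * k      # training B (d x r) and A (r x k): 65,536 parameters

print(f"full: {full_update:,} | LoRA: {lora_update:,} "
      f"| reduction: {full_update / lora_update:.0f}x")
# full: 16,777,216 | LoRA: 65,536 | reduction: 256x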

Things to note:

  1. At training time, matrix A is initialized with a random Gaussian and matrix B with zeros, so ∆W = BA is zero at the start of training. BAx is also scaled by α/r, where α is a constant: in practice it is simply set to the first value of r that is tried and is not tuned further. This scaling reduces the need to retune other hyperparameters when r is changed (see the sketch after this list).
  2. At inference time, we can simply merge the updated weights into the base pre-trained weights, which means there is no additional inference latency.
  3. If multiple use cases are fine-tuned on top of the same base model, then instead of loading the base model again and again, we can just swap out the adapter weights of one use case for those of another.
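To make points 1 and 2 concrete, here is a minimal PyTorch sketch of a single LoRA-augmented linear map, with toy dimensions and no real training loop (the values of d, k, r and α below are made up for illustration):

import torch

torch.manual_seed(0)
d, k, r, alpha = 16, 16, 4, 8              # toy sizes; the scaling factor is alpha / r

W = torch.randn(d, k)                      # frozen pre-trained weight
A = torch.randn(r, k) * 0.01               # random Gaussian init (trainable)
B = torch.zeros(d, r)                      # zero init (trainable), so BA = 0 at the start

x = torch.randn(k)
assert torch.allclose(W @ x + (alpha / r) * (B @ (A @ x)), W @ x)  # no change at step 0

# Pretend training has updated B, then merge the adapter for inference:
B = torch.randn(d, r) * 0.01
W_merged = W + (alpha / r) * (B @ A)       # fold BA into W once -> no extra latency
h_train = W @ x + (alpha / r) * (B @ (A @ x))
assert torch.allclose(W_merged @ x, h_train, atol=1e-5)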

Now you might be thinking that this is all nice in theory, but how do we code it? Good news: there is already a package called peft that takes care of all the maths internally; the only thing we need to do is pass in the parameters. For the demo I will be using Python, so let's explore it step by step:

1. Import the required packages (the peft and transformers libraries need to be installed first):
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

2. Load a model:

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

3. Create the LoRA config and get the PEFT model:

config = LoraConfig(
    r=8,                # rank of the decomposition matrices A and B
    lora_alpha=32,      # α; the update is scaled by α/r
    target_modules=["self_attn.q_proj", "self_attn.k_proj",
                    "self_attn.v_proj", "self_attn.o_proj"],  # specific to Llama models
    lora_dropout=0.05,  # dropout proportion (same idea as dropout in any NN)
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Transitioning original model to have LoRA layers.
model = get_peft_model(model, config)

# Prints the total trainable parameters
model.print_trainable_parameters()
# ###
# OUTPUT: trainable params: 8388608 || all params: 3508801536 || trainable%: 0.23907331075678143
# ###
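A quick aside: the target_modules list above is specific to Llama-style attention blocks. For a different architecture, one informal way to shortlist candidate module names is to inspect the base model; the snippet below is just a convenience sketch and should be run before get_peft_model:

# Print the unique leaf-module names of the base model so we can pick
# sensible target_modules (e.g. q_proj, k_proj, v_proj, o_proj for Llama).
leaf_names = {name.split(".")[-1]
              for name, module in model.named_modules()
              if len(list(module.children())) == 0}
print(sorted(leaf_names))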

Voilà! We now have a model with far fewer trainable parameters than the original, so we can fine-tune it with much less memory and compute.
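From here, the PEFT model can be fine-tuned like any other Hugging Face model. Below is a minimal sketch using the standard transformers Trainer; the train_dataset and tokenizer variables, the output path, and the hyperparameters are placeholders you would replace with your own:

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="lora-llama2-demo",   # illustrative output path
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
)

trainer = Trainer(
    model=model,                     # the PEFT-wrapped model from above
    args=training_args,
    train_dataset=train_dataset,     # assumed: an already tokenized dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # assumed tokenizer
)
trainer.train()

# Saves only the small adapter weights (a few MB), not the full base model.
model.save_pretrained("lora-llama2-demo/adapter")

Because only the adapter is saved, the checkpoint stays small and can be swapped onto the same base model for different use cases, exactly as point 3 above described.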

Although we have successfully reduced the number of trainable parameters, is there a way to reduce the memory consumption of the model itself even further? To answer that, let's demystify the second part of the heading: Q-LoRA.

Q-LoRA

Q-LoRA stands for Quantized Low-Rank Adaptation.
Quantization is the process of discretizing an input from a representation that holds more information to one that holds less; in simple words, it means taking a data type with more bits and converting it to one with fewer bits.
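As a toy illustration of the idea, here is plain absmax quantization to 8-bit integers (note this is not the NF4 scheme QLoRA actually uses, just a simple example of trading bits for precision):

import torch

x = torch.tensor([0.12, -1.73, 0.56, 3.41])      # 32-bit floats

scale = x.abs().max() / 127                      # absmax scaling constant
x_int8 = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
x_back = x_int8.float() * scale                  # dequantize for computation

print(x_int8)   # tensor([   4,  -64,   21,  127], dtype=torch.int8)
print(x_back)   # approximately the original values, with some rounding error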

QLoRA introduces three techniques to reduce memory without hurting performance:

  1. 4-bit NormalFloat: an information-theoretically optimal quantization data type for normally distributed data, which gives better results than 4-bit integers and 4-bit floats.
  2. Double Quantization: quantizing the quantization constants themselves. Though this step looks trivial on its own, the paper reports that it saves around 3 GB overall for a 65B-parameter model.
  3. Paged Optimizers: these use NVIDIA unified memory to avoid the memory spikes caused by gradient checkpointing. In simple words, optimizer states are automatically paged back and forth between CPU and GPU so that processing can continue without errors when the GPU runs out of memory.

Now let's see how it is done in Python:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# This config tells transformers to load the model weights quantized to 4-bit NormalFloat
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits
    bnb_4bit_use_double_quant=True,         # double quantization
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16   # dtype used for computation
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map={"": 0}
)

The rest is the same as the code shown in the LoRA section. With this approach we reduce memory consumption as well, on top of the savings in trainable parameters and compute.
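To put the two halves together: before attaching the LoRA adapters, the quantized model is typically passed through peft's prepare_model_for_kbit_training helper (available in recent peft versions), and then wrapped exactly as before:

from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit model for training (gradient checkpointing, dtype fixes, etc.)
model = prepare_model_for_kbit_training(model)

# Reuse the same LoRA configuration as in the LoRA section
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["self_attn.q_proj", "self_attn.k_proj",
                    "self_attn.v_proj", "self_attn.o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()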

Conclusion:

In this blog we learned why we need approaches like LoRA and Q-LoRA, delved into the hypotheses and mathematical intuition behind them, and finally showed how they are used in Python.
