The intuitive idea behind Low-Rank Adaptation (LoRA)

Ogban Ugot

There are these brilliant yet simple ideas you come across in ML research that make you wonder: this is so intuitive, why didn’t I think of it? I think LoRA is one of those, and in this article, I’ll explain why. I’ll be referencing the LoRA paper¹, so be sure to have it on standby.

The motivation for LoRA

The concept of fine-tuning pre-trained neural networks has been around for a while now. It’s all a form of transfer learning. We take a pre-trained neural network (one that has been trained on some general task) and fine-tune it on a specific downstream task (question answering, summarization, etc.). In theory, any neural network can be fine-tuned this way.

Fine-tuning deep neural networks usually involves freezing the weights of a part of the model so that they are not adjusted during training and then fine-tuning (adjusting the weights of) other parts of the model. For example, when you fine-tune convnets like VGG16, you usually freeze the convolutional layers, and then fine-tune the fully-connected layers to learn to classify a new image set.
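As a concrete example, here is a minimal PyTorch sketch of that recipe, assuming torchvision’s pretrained VGG16 and a hypothetical 10-class downstream task (the class count and learning rate are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn
import torchvision

# Load VGG16 pre-trained on ImageNet (torchvision's weights API).
model = torchvision.models.vgg16(weights="IMAGENET1K_V1")

# Freeze the convolutional feature extractor so its weights are not adjusted.
for param in model.features.parameters():
    param.requires_grad = False

# Swap the final classifier layer for a hypothetical 10-class downstream task.
model.classifier[6] = nn.Linear(4096, 10)

# Only the unfrozen (classifier) parameters are handed to the optimizer.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```

Only the classifier parameters receive gradient updates; the frozen convolutional stack acts as a fixed feature extractor.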

It’s typical for the “unfrozen” parts of the model whose weights are adjusted during fine-tuning to have a smaller number of trainable parameters compared to the full model. As mentioned earlier, in the VGG16 model we typically fine-tune only the fully connected layers. There are three fully connected layers (also referred to as dense layers) and we can estimate the number of parameters in these layers:

The first fully connected layer (fc1) has 4096 neurons. This layer connects to the output of the last convolutional layer. Each neuron has connections to all the neurons in the previous layer, so this layer has:

Number of parameters in fc1 = (Number of input features) * (Number of neurons) + (Number of neurons) = (7 * 7 * 512) * 4096 + 4096 = 102,764,544 parameters.

The second fully connected layer (fc2) also has 4096 neurons, and it connects to the output of the first fully connected layer. Like the previous layer, it has:

Number of parameters in fc2 = 4096 * 4096 + 4096 = 16,781,312 parameters.

The final fully connected layer is the output layer, which is used for the specific downstream classification task at hand. In the original VGG16 model, this layer typically has 1000 neurons (corresponding to 1000 classes in the ImageNet dataset). So, the number of parameters in this layer is:

The number of parameters in the output layer = 4096 * 1000 + 1000 = 4,097,000 parameters.

So, the total number of parameters in the fully connected layers of VGG16 is 102,764,544 (fc1) + 16,781,312 (fc2) + 4,097,000 (output layer) = 123,642,856 parameters. You can initialize and fine-tune the fully connected layers of VGG16 on a single RTX 3090 (24 GB) without any problems.
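If you would rather check these numbers than trust the arithmetic, a short sketch with torchvision’s VGG16 (assuming the default layout, where the three Linear layers sit at classifier indices 0, 3, and 6) counts them directly:

```python
import torchvision

# A randomly initialized VGG16 is fine for counting parameters.
model = torchvision.models.vgg16()

# Total parameters in the fully connected block (model.classifier).
print(sum(p.numel() for p in model.classifier.parameters()))  # 123642856

# Per-layer breakdown: the Linear layers sit at classifier indices 0, 3, and 6.
for idx in (0, 3, 6):
    layer = model.classifier[idx]
    print(idx, sum(p.numel() for p in layer.parameters()))
# 0 102764544
# 3 16781312
# 6 4097000
```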

Consider a large language model like GPT-3 with 175B trainable parameters. In the LoRA paper, the authors focus on fine-tuning only the attention weights for downstream tasks and freeze the MLP layers, LayerNorm layers, and biases. In a standard Transformer model, the self-attention mechanism consists of query (Q), key (K), and value (V) weight matrices. Typically, these matrices have the following shapes:

  • Wq (Query Matrix): [d_model, d_k]
  • Wk (Key Matrix): [d_model, d_k]
  • Wv (Value Matrix): [d_model, d_v]

Here, d_model is the model’s hidden dimension, and d_k and d_v are the dimensions of the key and value vectors, respectively.
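To make these shapes concrete, here is a toy single-head illustration in PyTorch (d_model = 512 and d_k = d_v = 64 are arbitrary choices, not GPT-3’s real sizes):

```python
import torch

# Toy, single-head attention projections with made-up dimensions.
d_model, d_k, d_v = 512, 64, 64

W_q = torch.randn(d_model, d_k)   # query projection  [d_model, d_k]
W_k = torch.randn(d_model, d_k)   # key projection    [d_model, d_k]
W_v = torch.randn(d_model, d_v)   # value projection  [d_model, d_v]

x = torch.randn(10, d_model)      # 10 token embeddings
Q, K, V = x @ W_q, x @ W_k, x @ W_v
print(Q.shape, K.shape, V.shape)  # [10, 64] for each projection
```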

Back to GPT-3 175B: the model has 175 billion parameters, and a substantial portion of them sit in the attention weights. For this variant, the GPT-3 paper reports 96 Transformer layers with d_model = 12288 and 96 attention heads, so d_k = d_v = 128 per head.

We won’t repeat the exact layer-by-layer count we did for the fully connected layers of VGG16, but a back-of-the-envelope estimate with those published hyperparameters (see the sketch below) puts the attention projection matrices at roughly 58 billion parameters, about a third of the model. Fine-tuning even that subset of an LLM like GPT-3 is cost-prohibitive and requires a large amount of specialized GPU hardware, usually running in parallel. What if there were a way to fine-tune these enormous models far more cheaply and still achieve strong results? That is the motivation for LoRA.
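The LoRA paper notes that the self-attention module actually has four weight matrices (Wq, Wk, Wv, and an output projection Wo), so the rough estimate below counts all four, treating each as a full d_model × d_model matrix. The layer count and d_model are the values reported in the GPT-3 paper; treat the result as a ballpark figure, not an exact breakdown of the 175B parameters.

```python
# Back-of-the-envelope estimate of how many GPT-3 175B parameters sit in the
# attention weights. Assumes 96 layers and d_model = 12288 (from the GPT-3
# paper) and that Wq, Wk, Wv, Wo are each d_model x d_model matrices.
n_layers = 96
d_model = 12288

params_per_layer = 4 * d_model * d_model          # Wq, Wk, Wv, Wo
attention_params = n_layers * params_per_layer

print(f"{attention_params:,}")                     # 57,982,058,496 (~58 billion)
print(f"{attention_params / 175e9:.0%} of 175B")   # ~33%
```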

The intuitive idea behind LoRA

In neural networks, the backpropagation algorithm computes the error e between the expected output and the actual output; this error is then used to work out how much each weight contributed to e and, therefore, how each weight should change. So if the initial weights of a network are W0, then with respect to the error e we compute an update ∆W and apply it as W0 + ∆W in order to reduce e. LoRA proposes that ∆W can be decomposed into two low-rank matrices A and B such that W0 + ∆W = W0 + BA.

LoRA re-parametrizes the training of the attention weights to A and B only

Instead of using the full ∆W update, we use the smaller low-rank update BA; this is how we achieve efficiency and lower computational requirements. If ∆W has size (d x k) (the same size as W0), we decompose it into two matrices: B with dimensions (d x r) and A with dimensions (r x k), where r is the rank and is much smaller than both d and k. Let’s tie all this to how it is described in the paper. Take the following quote from the introduction:

We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen

Drawing inspiration from Li et al.³ and Aghajanyan et al.², LoRA hypothesizes that the change in weights ∆W during adaptation also has a low intrinsic rank, and can therefore be represented by decomposing ∆W into the two low-rank matrices B and A. LoRA then backs this up empirically: using the low-rank decomposition BA in place of the full ∆W for the weight updates matches the quality of full fine-tuning while training far fewer parameters.
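To make the idea concrete, here is a minimal PyTorch sketch of a LoRA-adapted linear layer. It is not the authors’ official loralib implementation, just the core idea under the paper’s setup: W0 is frozen, only A and B are trained, A starts from a small random Gaussian while B starts at zero (so BA = 0 at the beginning of training), and the update is scaled by alpha / r.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch: y = W0 x + (alpha / r) * B A x, with W0 frozen."""

    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen pre-trained weight W0 (kept fixed during adaptation).
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features), requires_grad=False
        )
        # Trainable low-rank factors: A is (r x k), B is (d x r).
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # small Gaussian init
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))        # zero init, so BA = 0 at start
        self.scaling = alpha / r

    def forward(self, x):
        frozen_path = x @ self.weight.T                    # W0 x (W0 is never updated)
        lora_path = (x @ self.lora_A.T) @ self.lora_B.T    # B A x (only A and B are trained)
        return frozen_path + self.scaling * lora_path
```

The savings are easy to quantify: for a single 12288 × 12288 attention matrix with r = 8, B and A together hold 8 × (12288 + 12288) ≈ 197K trainable parameters, versus roughly 151M for the full ∆W, a reduction of 768×, repeated for every weight matrix you adapt.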

Conclusion

The idea behind LoRA is simple and intuitive: we apply a matrix decomposition to the update matrix ∆W, reducing its size considerably, and yet the low-rank form BA is enough to adapt the attention weights of the Transformer and still get great results. So instead of using the full ∆W to adapt the full attention weights (which are very large), we train the much smaller B and A, adjusting only a small number of parameters.

[1] LoRA: Low-Rank Adaptation of Large Language Models. https://arxiv.org/abs/2106.09685

[2] Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. https://arxiv.org/abs/2012.13255

[3] Measuring the Intrinsic Dimension of Objective Landscapes. https://arxiv.org/abs/1804.08838
