Shubhagyta Jayswal
Walmart Global Tech Blog
8 min read · May 17, 2024


LLM Fine Tuning Series

2nd in series: Reparameterization Tuning (Theory)

Understanding LoRA: Balancing Efficiency and Adaptability in Fine-Tuning of Large Language Models

Link to the previous article in the LLM fine tuning series (1st in series): In-Context Learning

Introduction

Fine-tuning has been a common practice in machine learning and natural language processing (NLP) to specialize models for specific tasks. However, it is both storage-intensive and time-consuming. To combat these challenges, the concept of prompt tuning came into the picture. It enhances input prompts to steer the pretrained model toward a region of its latent space that aligns with the user’s intent.

But when it comes to highly specialized tasks, like critical health care advice or domain-specific customization, prompt tuning might not be sufficient. The solution lies in fine-tuning. However, fine-tuning comes with its own set of challenges, including the risk of overfitting and the phenomenon of catastrophic forgetting.

How to choose between prompt tuning and fine-tuning?

The decision between the two revolves around several factors, but before that, one needs to be aware of their requirements and expectations of the model. If the user wants to perform a general QnA task, prompt tuning may be preferred; but if a more specialized task is required, such as critical health care advice, domain-specific customization, or critical document analysis, then fine-tuning is the way to go. However, in practice, both techniques can be combined to balance customization and efficiency.

Based on the current developments, here’s a summary of the situations in which you might prefer one method over the other:

Consider customization vs. generalization: if you need a model that is tailored to specific requirements and preferences, fine-tuning would be more suitable. For example, in the retail and e-commerce domain, you might want the model to understand specific product categories, brands, or customer preferences to provide personalized recommendations. Fine-tuning allows you to train the model on your specific dataset and optimize it for your unique needs.

The choice between fine tuning and prompt tuning depends on the task, available resources and desired output.

In a previous blog in this series, we introduced in-context learning. Here, we will explore another technique: reparameterization-based fine-tuning.

Quick refresher from our previous article:

Full fine tuning is a process where all the model’s weights are updated, resulting in a new version of the model with updated weights. Like pre-training, full fine tuning requires enough memory and compute budget to store and process all the updated gradients during training. Additionally, when fine-tuning on a single task, it may lead to catastrophic forgetting.

To avoid this, a method known as Parameter Efficient Fine Tuning (PEFT) was developed. Unlike full fine tuning where every model weight is updated during supervised learning, PEFT only updates a small subset of parameters. There are several PEFT methods — selective, re-parametrization, and additive.

1. Selective Fine-tuning: Only a subset of the model’s weights is updated. This is computationally efficient but may not fully utilize the pre-trained model’s capacity.

2. Additive Fine-tuning: New layers are added on top of the pre-trained model. Although it could capture task-specific features more effectively, it might require a lot of data to avoid overfitting and increases the model size and computational cost.

3. Re-parametrization: A low-rank matrix is introduced to adapt the pre-existing model layers. This approach strikes a balance between adaptation and preservation and is computationally efficient. A commonly used technique of this type is LoRA, which we’ll explore in detail in this blog.

In this blog, our attention will be solely on an approach to fine-tuning known as Low-Rank Adaptation (LoRA). This method strives to achieve a balance between the computational and storage efficiency seen in prompt tuning and the adaptability and personalization capabilities of fine tuning.

What is LoRA?

LoRA is a low-rank adaptation algorithm for efficient and computationally inexpensive fine-tuning. The main concept involves incorporating low-rank parameters into existing model layers during the fine-tuning process, which enhances the model’s ability to learn task-specific features.

The method involves freezing the pre-trained model’s weights and embedding trainable rank decomposition matrices into each layer of the transformer architecture. This substantially decreases the number of trainable parameters for downstream tasks.

Exploring LoRA in Depth:

To simplify, imagine having a large, complex puzzle (which is the LLM). Instead of attempting to rearrange the entire puzzle (which would be computationally costly), we introduce a few new pieces (the low-rank matrices) that enable us to adapt the puzzle to a new image (the specific task) more efficiently.

The concept involves integrating a low-rank matrix into existing model layers during the fine-tuning process. This can be represented mathematically as follows:

Suppose W is the original weight matrix of a layer in the pre-trained model. During fine-tuning, instead of directly modifying W, we introduce a low-rank matrix L as an adaptation to W. L is the product of two smaller matrices, B and A.

L = B * A

Here, B is a matrix of size d x r, and A is a matrix of size r x n, where d and n are the dimensions of the original weight matrix W and r is the rank of L. The rank r is typically chosen such that r << d, n; hence L is referred to as a “low-rank” matrix.

The adapted weight matrix (W’) for the layer is then given by,

W’ = W + L

This indicates that the original weight matrix W is being adapted by the low-rank matrix L.

So, the output can be computed as h = (W + L) * x = W*x + B*A*x, where x represents the input, as illustrated below:

Figure 1: Low-rank adaptation of a weight matrix (source: Hugging Face LoRA documentation)
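To make the algebra concrete, here is a minimal, self-contained PyTorch sketch of a single LoRA-adapted layer. It is an illustration of the equations above rather than any library’s implementation: the class name is made up for this example, while the zero/random initialization of B and A and the alpha/r scaling follow the original LoRA paper. The input x is treated as a batch of row vectors, so the adapted output is computed as x(W + BA).

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Illustrative sketch of one LoRA-adapted layer: h = x @ (W + B @ A), with W frozen."""

    def __init__(self, d: int = 512, n: int = 64, r: int = 8, alpha: int = 16):
        super().__init__()
        # Pre-trained weight W (d x n) stays frozen during fine-tuning.
        self.W = nn.Parameter(torch.randn(d, n), requires_grad=False)
        # Trainable low-rank factors: B (d x r) and A (r x n), so L = B @ A has the same shape as W.
        self.B = nn.Parameter(torch.zeros(d, r))           # zero init => L = 0 at the start of training
        self.A = nn.Parameter(torch.randn(r, n) * 0.01)
        self.scaling = alpha / r                           # lora_alpha / r scaling, as in the LoRA paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d) -> output: (batch, n); original path plus the scaled low-rank update.
        return x @ self.W + (x @ self.B @ self.A) * self.scaling

layer = LoRALayer(d=512, n=64, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 512*8 + 8*64 = 4,608 trainable parameters, versus 32,768 in W
```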

Example: To illustrate with a realistic example, consider the base Transformer model presented by Vaswani et al. (2017):

Let’s say the transformer weights have dimensions of d by n = 512 by 64.

  • No of trainable parameters in weight matrix = 512*64 = 32,768

Now, apply LoRA with a rank of r = 8. According to the above explanation:

  • Matrix A (r by n) = (8 by 64) => Trainable parameters = 8 * 64 = 512
  • Matrix B (d by r) = (512 by 8) => Trainable parameters = 512 * 8 = 4,096

Hence, by modifying the weights of these newly formed low-rank matrices instead of the initial weights, you will be training 4,608 parameters as opposed to 32,768, marking an 86% reduction.
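The arithmetic above is easy to sanity-check with a few lines of Python:

```python
# Parameter counts from the example above: d = 512, n = 64, r = 8.
d, n, r = 512, 64, 8

full_params = d * n          # original weight matrix W: 32,768
lora_params = d * r + r * n  # B (d x r) plus A (r x n): 4,096 + 512 = 4,608

print(full_params, lora_params)                                           # 32768 4608
print(f"{1 - lora_params / full_params:.1%} fewer trainable parameters")  # 85.9% fewer trainable parameters
```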

The integration of this low-rank matrix doesn’t significantly alter the model’s structure but enables it to adapt more efficiently to the specific task during fine-tuning. The rank of the matrix L (denoted as r) determines the complexity and flexibility of the adaptation. A higher rank provides more flexibility but requires more computational resources.

Furthermore, this low-rank adaptation enables the model to capture task-specific features more effectively since the adaptation matrix L is learned from the task-specific fine-tuning data.

The significant advantage of LoRA is that it allows a substantial cut in the number of trainable parameters, which often permits the execution of this method of efficient fine-tuning on a single GPU, eliminating the need for a distributed GPU cluster.

  • According to the paper “LoRA: Low-Rank Adaptation of Large Language Models” (Hu et al., 2021): compared to traditional fine-tuning methods, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. Moreover, it allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation, while keeping the pre-trained weights frozen.
  • The paper “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning” (Aghajanyan et al., 2020) states that pretrained models have a very low intrinsic dimension; essentially, they can be described as accurately, or almost as accurately, using far fewer dimensions than they nominally have.
  • For instance (as stated in reference 1), using GPT-3 175B as an example, a very low rank suffices even when the full rank is as high as 12,288, making LoRA both storage- and compute-efficient. Furthermore, it introduces no additional latency during the inference phase.

Key parameters of LoRA (a minimal configuration sketch follows the list):

  • r (int): Determines the rank of the low-rank matrices that are learned. This rank value plays a crucial role in the adaptation process.
  • target_modules (Optional[Union[List[str], str]]): During fine-tuning with LoRA, it is possible to selectively target specific modules in the model architecture using the target_modules parameter. The targeted modules are then adapted with the learned matrices. Note that targeting more modules increases the training time and computational resources required. It is common to focus on the attention blocks of the transformer, although recent research suggests that targeting all linear layers can lead to better adaptation quality.
  • lora_alpha (int): Scales the learned weight matrices during the adaptation process, allowing you to adjust the importance of the learned matrices.
  • lora_dropout (float): Sets the dropout probability for the LoRA layers. This helps regularize training and prevent overfitting.
  • bias: Determines the type of bias used in LoRA, which can be ‘none’, ‘all’, or ‘lora_only’. If ‘all’ or ‘lora_only’ is selected, the corresponding biases are updated during training. Be aware that, when biases are updated, disabling the adapters can still produce model output that differs from the base model without adaptation.
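Putting these parameters together, here is a minimal configuration sketch using the Hugging Face PEFT library. The checkpoint name is a placeholder, and the target module names (“q_proj”, “v_proj”) assume a LLaMA-style attention block; the right names depend on your model’s architecture.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Placeholder checkpoint; substitute the model you are fine-tuning.
base_model = AutoModelForCausalLM.from_pretrained("your-base-model")

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank matrices
    lora_alpha=16,                         # scaling factor for the learned update
    target_modules=["q_proj", "v_proj"],   # modules to adapt; names depend on the architecture
    lora_dropout=0.05,                     # dropout applied to the LoRA layers
    bias="none",                           # 'none', 'all', or 'lora_only'
    task_type=TaskType.CAUSAL_LM,
)

# Wrap the frozen base model with the trainable LoRA adapters.
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()    # reports trainable vs. total parameter counts
```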

How to choose the optimal rank r?

Selecting the rank for the LoRA matrices is a crucial aspect and is currently a subject of extensive research. The lower the rank, the fewer the trainable parameters, which reduces computation; however, the trade-off with model performance should be considered. Microsoft researchers, in their initial paper on LoRA, explored different ranks and their impact on language generation tasks. The table below presents the rank of the LoRA matrices, the final loss value, and scores for various metrics such as BLEU and ROUGE.

Source: Hu et al. 2021, “LoRA: Low-Rank Adaptation of Large Language Models”

The best scores for each metric are highlighted. The research identified a plateau in the loss value for ranks over 16, indicating that larger LoRA matrices did not enhance performance. Therefore, using ranks between 4–32 could offer a good balance between reducing trainable parameters and maintaining performance.
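Before committing to full training runs, one quick sanity check is to instantiate the configuration at a few candidate ranks and compare the resulting trainable parameter counts. This is a sketch under the same placeholder assumptions as above; it only measures adapter size, so judging quality at each rank still requires fine-tuning and evaluating on your task.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

for r in (4, 8, 16, 32):
    # Reload the base model each time so each LoRA config wraps a fresh copy.
    base = AutoModelForCausalLM.from_pretrained("your-base-model")   # placeholder checkpoint
    config = LoraConfig(
        r=r,
        lora_alpha=2 * r,                      # illustrative choice; alpha is often set relative to r
        target_modules=["q_proj", "v_proj"],   # adjust to your architecture
        task_type=TaskType.CAUSAL_LM,
    )
    model = get_peft_model(base, config)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"r={r}: {trainable:,} trainable / {total:,} total ({trainable / total:.3%})")
```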

Next Step: Research on the optimal rank selection

Conclusion

LoRA presents a promising approach to fine-tuning large language models, offering a balance between computational efficiency, storage demands, and model adaptability. It allows us to leverage the full potential of pretrained models while making them more task-specific without the need for heavy computational resources.

In our upcoming blogs, we’re going to dive into the practical side of LoRA. We’ll explore a variety of use cases, demonstrating how this innovative technique can be applied to solve complex tasks. And of course, we’ll provide code snippets, allowing you to implement this solution in your own projects.

References

  1. Hu et al., 2021. “LoRA: Low-Rank Adaptation of Large Language Models.” https://arxiv.org/pdf/2106.09685.pdf
  2. Aghajanyan et al., 2020. “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning.” https://arxiv.org/abs/2012.13255
  3. Hugging Face PEFT documentation, conceptual guide on LoRA: https://huggingface.co/docs/peft/main/en/conceptual_guides/lora

Note

  1. Figure 1 has been taken from Hugging Face LoRA
  2. Title image has been taken from MS Word Stock Images

#LoRA #LLMFineTuning #ReparametrizationMethod #PromptTuning
