Low-Rank Adapter (LoRA) Explained

Sheli Kohan
6 min read · Oct 3, 2023


Paper | GitHub | HuggingFace Models

The paper “LoRA: Low-Rank Adaptation of Large Language Models,” authored by a team from Microsoft, including Edward Hu and Yelong Shen, has garnered attention in the field of NLP. Published in 2021, it introduces a groundbreaking approach to adapting massive language models efficiently.

As language models have grown in size, traditional fine-tuning methods have become impractical. LoRA addresses this issue by freezing pre-trained model weights and introducing trainable rank decomposition matrices, significantly reducing parameters while maintaining model quality.

LoRA’s impact on NLP is noteworthy, enabling cost-effective utilization of large models like GPT-3. This article explores LoRA’s principles, architecture, and impact on language model adaptation.

Aren’t Existing Solutions Good Enough?

As LLMs grow in size, the objective is to alter as few of their trained parameters as possible. Best practices include employing strong regularization, small learning rates, and limiting the number of training epochs; in addition, typically only the last layer or a few layers are fine-tuned to prevent catastrophic forgetting. A related family of techniques, “adapter-tuning”, goes one step further: rather than modifying the base model’s parameters at all, it inserts small trainable “adapter” layers into the frozen network.

Adapter layers suffer from inference latency. Waiting more than 10 seconds for an LLM answer isn’t pleasant. The latency arises because adapter layers are added one after another in the forward path: they must be processed sequentially and cannot be parallelized away. You can prune layers or use multi-task settings to reduce the overhead, but you cannot completely eliminate the extra computation in adapter layers. The problem is worst with small batch sizes, such as single-GPU inference on models like GPT-2, and worsens further when the model is sharded across devices.
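
To make the latency point concrete, here is a minimal sketch of the kind of bottleneck adapter layer being described. This is my own PyTorch illustration, not code from the paper; the class name, bottleneck size, and activation are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small trainable block inserted after a frozen transformer sub-layer."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project down
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, d_model)    # project back up

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # The adapter sits in the forward path, so this extra computation
        # runs sequentially at inference time and adds latency.
        return h + self.up(self.act(self.down(h)))

adapter = BottleneckAdapter(d_model=768)
h = torch.randn(1, 10, 768)   # (batch, sequence, hidden)
out = adapter(h)              # same shape, but one extra sequential step
```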

Another fine-tuning method involves tweaking the input layer’s activations, i.e., tuning the prompt itself. In the LoRA paper, the authors point out that directly fine-tuning the prompt is hard: it is difficult to optimize, and its performance changes non-monotonically with the number of trainable parameters. Moreover, reserving part of the sequence length for prompt adjustments reduces the sequence length available for the downstream task, which can make prompt tuning less effective than alternative approaches.

One potential solution to these issues is Parameter-Efficient Fine-Tuning (PEFT), and in particular LoRA, which introduces no additional layers and requires no prompt fine-tuning; instead, it modifies the values of existing parameters through a learned update. LoRA has gained prominence for its remarkable efficiency in adapting pre-trained language models to diverse tasks.

Introducing LoRA

LoRA, which stands for “Low-Rank Adaptation”, distinguishes itself by freezing all the pre-trained model weights and training and storing the additional weight changes in a separate matrix. It is not called an “adapter” method because it does not add adapter layers; the term “adaptation” instead describes the process of adapting the frozen model to new domain data and tasks.

Besides, the term rank is a concept many of us encountered in linear algebra classes. In simple words, the rank of a matrix counts how many of its rows are “unique,” meaning they are not linear combinations of the other rows (the same applies to columns).
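
As a quick illustration (my own example, not from the paper), NumPy can compute a matrix rank directly:

```python
import numpy as np

# The third row equals the first row plus the second row,
# so only two rows are linearly independent.
M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [5.0, 7.0, 9.0]])

print(np.linalg.matrix_rank(M))  # prints 2
```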

Now that we’ve covered the idea behind LoRA’s name, let’s get into its mathematics.

LoRA’s Architecture

To grasp the core concept, let’s use some notation:

  • W0: a pre-trained weight matrix of the model (frozen during fine-tuning).
  • ∆W: the matrix of additional weight changes learned during adaptation.
  • The final weights are obtained by adding the two matrices: W0 + ∆W.

The key innovation of LoRA lies in decomposing the weight change matrix ∆W into two low-rank matrices, A and B. Instead of directly training the parameters in ∆W, LoRA trains the parameters in the A and B matrices. Let’s take a closer look at this process.

Figure from “LoRA: Low-Rank Adaptation of Large Language Models” [1]

Assume the pre-trained weight matrix is W0 with dimension d × d; it is frozen during training. The update matrix ∆W has dimension d × d as well and is decomposed as ∆W = BA, where A has dimension r × d and B has dimension d × r, with r much smaller than d. When training is completed, W0 and ∆W can be stored separately. When a new input x (a d-dimensional vector) enters the LoRA fine-tuned model, it is multiplied by W0 and by ∆W separately, and each product is again a d-dimensional vector. The two output vectors are summed coordinate-wise to give the final output h, so that h = W0x + ∆Wx = W0x + BAx.
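
To make this concrete, here is a minimal sketch of a LoRA-augmented linear layer. This is my own PyTorch illustration rather than the paper’s reference implementation; the class name and initialization scale are assumptions, though starting B at zero (so that ∆W = BA is zero at the beginning of training) follows the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen weight matrix W0 plus a trainable low-rank update BA."""

    def __init__(self, d: int, r: int):
        super().__init__()
        # Frozen pre-trained weight W0 (d x d); in practice this comes
        # from the checkpoint rather than a random initialization.
        self.W0 = nn.Parameter(torch.randn(d, d), requires_grad=False)
        # Trainable low-rank factors: A (r x d) and B (d x r).
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))  # zeros => delta_W starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + B A x: the frozen path and the low-rank path are
        # computed separately and summed coordinate-wise.
        return x @ self.W0.T + x @ (self.B @ self.A).T

layer = LoRALinear(d=200, r=2)
x = torch.randn(1, 200)   # a single d-dimensional input
h = layer(x)              # shape (1, 200)
```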

To illustrate the redundancy in the number of parameters, let’s look at an example. Consider a scenario where ∆W contains 40,000 parameters, arising from a matrix of size 200 × 200. In this case, we can construct the two matrices B and A with dimensions 200 × 2 and 2 × 200 respectively, where d = 200 and the rank r = 2. Remarkably, A and B together require tuning only 800 parameters in total, which dramatically reduces the number of trainable parameters.
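
A quick back-of-the-envelope check (my own illustration) of how the trainable parameter count scales with the rank r:

```python
d = 200

def lora_params(d: int, r: int) -> int:
    # Trainable parameters in B (d x r) plus A (r x d)
    return d * r + r * d

print(d * d)  # full delta_W: 40000 parameters
for r in (1, 2, 4, 8):
    print(r, lora_params(d, r))  # 400, 800, 1600, 3200 parameters
```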

LoRA’s adapters: A and B [2]

Note that the low rank is controlled by the hyperparameter r. A small r means fewer parameters to tune; while this shortens training time, it can also cause information loss and degrade model performance as r becomes smaller.
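
If you fine-tune with the HuggingFace PEFT library, r is exposed directly in the LoRA configuration. The concrete values below (the rank, scaling factor, and target module names) are illustrative assumptions, not recommendations from the paper:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # rank of the low-rank matrices A and B
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # GPT-2's attention projection layer
    lora_dropout=0.05,
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the LoRA parameters are trainable
```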

It’s worth mentioning that the d × d notation is used here to mirror the paper’s focus on fine-tuning the attention layers, where the input dimension equals the output dimension. In practice, however, LoRA does not require the input and output dimensions to be the same; the square matrix above is an illustration rather than a strict requirement.

For further explanations on LoRA’s architecture and code implementation of fine-tuning GPT, I recommend reading this detailed Medium Article.

Empirical Experiments

In the paper, the authors tested LoRA alongside other adaptation methods across various pre-trained models, including RoBERTa, DeBERTa, GPT-2, and GPT-3 175B, covering a range of natural language understanding and generation tasks. The methods under scrutiny included full fine-tuning, BitFit, prefix-embedding tuning, prefix-layer tuning, and adapter tuning. The results showed that LoRA consistently outperformed or closely matched the other methods, enabling efficient model deployment without compromising performance across models and tasks. These experiments highlight LoRA’s potential as an effective adaptation strategy for LLMs. For example, the table below compares fine-tuning methods on GPT-3:

Performance of different adaptation methods on GPT-3 175B (from the paper [1])

With the results covered, let’s wrap up and look at the developments that have occurred since LoRA’s publication.

Looking Ahead: Future Directions and Key Takeaways

In summary, LoRA is a groundbreaking solution for LLM adaptation, effectively addressing some major challenges in fine-tuning neural networks while reducing computational and storage costs. It excels in versatile applications with fewer trainable parameters, substantial reductions in parameters and GPU memory, enhanced efficiency through its low-rank design and modular architecture, and seamless deployment without inference latency. Moreover, it offers flexibility for customization and task switching with shared pre-trained models.
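
One reason deployment adds no inference latency: once training is done, the low-rank update can be merged into the frozen weights, so inference uses a single matrix multiply exactly as in the original model, and the update can be subtracted again to switch tasks. A minimal sketch of that merge (my own illustration, reusing the notation from above; the values are placeholders):

```python
import torch

d, r = 200, 2
W0 = torch.randn(d, d)          # frozen pre-trained weights
B = torch.randn(d, r) * 0.01    # learned low-rank factors (placeholder values)
A = torch.randn(r, d) * 0.01

W_merged = W0 + B @ A           # deploy: one matrix, no extra compute per token
W_restored = W_merged - B @ A   # switch tasks: recover W0, then add another update
```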

Because of these innovative features, LoRA has garnered significant attention within the data science community, leading to the emergence of several noteworthy extensions since 2021. Among these extensions, two have particularly piqued my interest.

To start, let’s take a closer look at QLoRA, released in May 2023, which reduces memory usage enough to fine-tune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning task performance.

Moreover, LongLoRA, released in September 2023, extends the context sizes of pre-trained LLMs without incurring significant additional computational cost.

I look forward to sharing more about these developments in my upcoming articles on LongLoRA. Stay tuned!

References

[1] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, & Weizhu Chen. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.

[2] LoRA explained (and a bit about precision and quantization). YouTube.

[3] Fine-tuning a GPT — LoRA. Medium.
