Sharpen Your LLMs with Efficient Fine-Tuning Techniques

Computationally Efficient LLM Fine-Tuning Methods: LoRA, DoRA, and ReFT

Maddie Lupu
9 min read · Apr 14, 2024

Large Language Models (LLMs) can do surprisingly well with zero-shot inference: they can attempt to solve problems without knowing the context of your task or what exactly the output should look like. They perform even better if you show them what you expect with a few relevant examples. The latter technique is known as few-shot inference, or few-shot learning.

Both of these methods involve interacting with the LLM through a prompt, an instruction message sent to the model that can be tweaked to optimise the LLM’s output. This practice is known as prompt engineering. While powerful, it has its limitations. Here are some key prompt engineering challenges:

  • Limited context window — LLMs typically have a pre-defined context window. This means everything you throw at it in the input text needs to fit within this space.
  • Smaller models may struggle — Smaller LLMs might have less capacity to effectively process in-context examples and understand the desired output, especially for complex tasks.
  • Domain knowledge — If the task is highly specialised or niche, even multiple and diverse examples may not provide sufficient context for the model to perform well.

Fortunately, there are other methods practitioners can use to optimise and fine-tune LLM results!

Full Parameter Fine-Tuning

Full Parameter Fine-Tuning takes a pre-trained model and continues training all of the model’s layers on a smaller, task-specific dataset, with the goal of improving performance on a particular task.

We are going to be diving into several Fine-Tuning methods. To make things easier, let’s use an education analogy.

Imagine you want to become a Generative AI expert — Full Fine-Tuning would be like enrolling in a specialised Generative AI program with an extensive curriculum. It’s in-depth, but takes time and significant resources.

While Full Fine-Tuning can be performant, it comes with significant drawbacks.

Training all the model’s layers can become computationally expensive, especially for very large models. This can translate to longer training times and higher resource consumption. Catastrophic forgetting can occur during Full Fine-Tuning — in this scenario, the model forgets the general knowledge it learned during pre-training in favour of the new, specific task. This throws away the value of the pre-trained model.

Since all the model’s weights are updated during Full Fine-Tuning, a completely new copy of the model is created. This can significantly increase the storage requirements of the LLM.

Parameter-efficient fine-tuning (PEFT) techniques offer an effective alternative for adapting LLMs. These methods can achieve results as good as, or even better than, Full Fine-Tuning while keeping costs down.

What is PEFT?

Unlike Full Fine-Tuning, Parameter-Efficient Fine-Tuning (PEFT) updates a limited subset of the model’s existing parameters, often in the self-attention layers, using a relatively small dataset of thousands of data points.

What makes PEFT stand out?

Reduced memory footprint — PEFT only updates a limited subset of the model’s parameters, while the original weights remain frozen. This significantly reduces the model’s memory requirements, making it easier to train and deploy on even single GPUs.

Faster training — with fewer parameters to train, PEFT can be performed on a single GPU. This avoids the need for expensive and complex distributed clusters of GPUs, leading to faster training times and lower overall costs.

Improved portability — the small size of the PEFT-trained weights makes them easy to move across different hardware configurations. This is especially beneficial if you plan to fine-tune the model for multiple tasks.

PEFT Methods

There are several variations of PEFT: some adjust just a few existing layers, while others add a small number of new parameters and fine-tune only those new components. Here is an overview of the main PEFT methods:

PEFT methods by the author

Adapter-based methods train additional layers on top of the frozen pre-trained model. Because these new components cannot easily be merged into the existing model layers, this approach is less efficient: it introduces an additional burden at inference time.

Soft Prompts add randomly initialised soft tokens to the input prompt and train their embeddings while keeping the LLM weights frozen. This method tends to perform worse than other PEFT methods and comes with significant inference overhead.

LoRA injects trainable low-rank matrices and adds no inference overhead, as the trained weights can easily be merged back into the original layers.

DoRA is a brand-new method that refines LoRA’s low-rank adaptation by decomposing the weights into magnitude and direction, allowing for more focused training than LoRA.

Selective methods retrain only a subset of the original LLM weights. This approach has shown mixed results in the past.

LoRA is one of the most performant PEFT methods and has been around for a while, so let’s dive in and see how it works!

Low Rank Adaptation of LLM

Back to the story of becoming a Generative AI expert: LoRA is like learning with flashcards. You focus on key concepts, making learning faster (faster training) and less resource-intensive (lower memory usage). However, there’s a risk you might miss crucial details.

How to apply LoRA

Let’s put things in perspective. Inside the traditional Transformer with dense layers, you have two kinds of neural networks: the self-attention network and the feed-forward network. The weights of both are learned during pre-training. Check the Transformer architecture below for a refresher.

Transformers Architecture adapted by the author

After the embedding vectors are created, they are fed into the self-attention layers, where attention scores are calculated. It turns out that applying LoRA just to the self-attention layers of the LLM works well enough to yield good results and save compute.

Essentially, you can apply it to the feed-forward layers as well, but since the self-attention layers hold a large share of the LLM’s parameters, applying LoRA to these weight matrices gives you the biggest savings in trainable parameters. The sketch below shows what this looks like in practice.
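
To make this concrete, here is a minimal sketch using the Hugging Face peft library. The model name is a placeholder, and the "q_proj"/"v_proj" module names are assumptions that vary by architecture:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a pre-trained model ("your-base-model" is a placeholder).
model = AutoModelForCausalLM.from_pretrained("your-base-model")

# Apply LoRA only to the self-attention query and value projections;
# the module names below vary by architecture.
lora_config = LoraConfig(
    r=8,                                  # rank of the decomposition matrices
    lora_alpha=32,                        # scaling applied to the LoRA update
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the LoRA matrices are trainable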

LoRA can be applied to both the encoder and decoder. However, to illustrate its workings more clearly, let’s focus on the encoder component and develop an intuition of how LoRA is applied there.

Add LoRA to the self-attention layer

The LoRA process at a high level:

1. Freeze the original LLM weights

2. Inject two rank-decomposition matrices

3. Train the weights of the smaller matrices (see the sketch below)
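
Here is a minimal PyTorch sketch of these three steps, wrapping a single linear layer. The dimensions, rank, and scaling are illustrative, not a definitive recipe:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        # Step 1: freeze the original weights.
        for p in self.base.parameters():
            p.requires_grad = False
        # Step 2: inject two rank-decomposition matrices A and B.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Step 3: only A and B receive gradients during training.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Note that B starts at zero, so the adapted layer initially behaves exactly like the pre-trained one; training then moves only the small A and B matrices.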

Let’s take a weight matrix from the base Transformer with dimensions 512 × 64 and illustrate how to apply low-rank decomposition to it.

In linear algebra, a low-rank decomposition expresses a matrix as the product of two smaller matrices. Let’s choose a rank of 8 to decompose the original weight matrix, which has 32,768 trainable parameters. Decomposing it with rank 8 yields a matrix A of dimensions 8 × 64 (512 parameters) and a matrix B of dimensions 512 × 8 (4,096 parameters), for a total of only 4,608 trainable parameters.
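
The arithmetic behind those numbers, as a quick sanity check:

```python
d, k, r = 512, 64, 8

full_rank = d * k        # 512 x 64 = 32,768 trainable parameters
lora = r * k + d * r     # A: 8 x 64 = 512, plus B: 512 x 8 = 4,096
print(full_rank, lora)   # 32768 4608 -> roughly 86% fewer parameters
```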

Steps to update the model for inference

The resulting model has the same number of parameters as the original one so there is no latency overhead.

An essential advantage of LoRA is its flexibility. Suppose you train a pair of LoRA matrices for a specific task A. To run inference for this task, you multiply the two matrices, add the product to the frozen weights, and replace the corresponding weights in the original model. If you have a task B, you repeat the same process with a different pair. The memory required to store these LoRA matrices is very small, so you can switch the weights in and out as you need them, as sketched below.
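
Sketching that switching process with plain tensors (the LoRA pairs for tasks A and B are hypothetical):

```python
import torch

d, k, r = 512, 64, 8
W = torch.randn(d, k)                  # frozen pre-trained weight matrix

# Hypothetical LoRA pairs trained separately for tasks A and B.
B_a, A_a = torch.randn(d, r), torch.randn(r, k)
B_b, A_b = torch.randn(d, r), torch.randn(r, k)

W_task_a = W + B_a @ A_a               # merge and deploy for task A
W_task_b = W + B_b @ A_b               # later, swap in the task-B pair
```

Because only the small (B, A) pairs need to be stored per task, a single base model can cheaply serve many tasks.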

How to choose the rank of a matrix?

The smaller the rank → the fewer the parameters → the bigger the savings

In the paper that first introduced LoRA, Microsoft researchers found a plateau in the loss value for ranks greater than 16. See below how rank affects performance across a set of metrics.

The authors suggest that a 4 to 32 rank range provides a good trade-off between reducing the number of trainable parameters and preserving performance.

Source: LoRA paper

Fine-tuning an LLM with LoRA yields performance good enough to justify the efficiency trade-off: models fine-tuned with LoRA achieve results almost as good as full fine-tuning.

For an even more memory-efficient and potentially device-agnostic approach, QLoRA takes LoRA’s adaptation a step further by leveraging quantization. Explore more about QLoRA here and delve deeper into quantization in my previous article.
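
As a rough sketch of that combination, assuming the Hugging Face transformers/bitsandbytes integration (the model name and target module names are again placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit NF4 quantization (QLoRA-style).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "your-base-model", quantization_config=bnb_config
)

# Train full-precision LoRA adapters on top of the quantized weights.
lora_config = LoraConfig(
    r=8, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
```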

Weight-Decomposed Low-Rank Adaptation (DoRA)

One of the newest methods, published in February this year, is DoRA, which builds upon LoRA by introducing a weight-decomposition technique that separates the magnitude and direction of the weight updates. DoRA promises similar benefits to LoRA (faster training, lower memory usage) while potentially achieving better performance thanks to its finer-grained control over weight updates.
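
In tensor terms, a simplified sketch of the decomposition described in the DoRA paper: the adapted weight is rebuilt from a trainable magnitude vector and the column-wise direction of the LoRA-updated weights. Shapes and initialisations here are illustrative:

```python
import torch

d, k, r = 512, 64, 8
W = torch.randn(d, k)                          # frozen pre-trained weights
B, A = torch.zeros(d, r), torch.randn(r, k)    # LoRA-style trainable pair
m = W.norm(p=2, dim=0, keepdim=True)           # trainable magnitude per column

V = W + B @ A                                  # directional component
W_dora = m * V / V.norm(p=2, dim=0, keepdim=True)  # magnitude x unit direction
```

Initialising m to the column norms of W means the decomposition reproduces the pre-trained weights exactly before training begins.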

To build intuition for this method, back to becoming a Generative AI expert: DoRA builds on LoRA by creating improved flashcards. Your flashcards have weights based on their importance to your subject, enabling more nuanced and focused learning.

How DoRA works. Source: DoRA research paper

Representation Fine-tuning (ReFT)

Representation Fine-tuning (ReFT) is a brand-new method, introduced this month, which takes a fresh approach to fine-tuning LLMs. Like other PEFT methods, it operates on a frozen base model and learns task-specific interventions on hidden representations.

Instead of adapting the model’s weights, ReFT methods train interventions that manipulate a small fraction of the model’s representations in order to nudge the model to solve downstream tasks.

Low-rank Linear Subspace ReFT (LoReFT) is one instance of this family and promises to be 10x to 50x more parameter-efficient than prior PEFT methods. It intervenes on hidden representations within a linear subspace defined by a low-rank projection matrix. The intuition is that these interventions steer the model towards accurate predictions for the task at hand.
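
A simplified sketch of such an intervention on a single hidden state, loosely following the LoReFT formulation; the hidden size, rank, and initialisations are illustrative:

```python
import torch

d, r = 768, 4                    # hidden size and intervention rank (illustrative)
h = torch.randn(d)               # one hidden representation from the frozen model

# Low-rank projection with orthonormal rows, plus a learned linear map.
R = torch.linalg.qr(torch.randn(d, r)).Q.T    # shape (r, d)
W, b = torch.randn(r, d), torch.zeros(r)

# Edit h only inside the r-dimensional subspace spanned by the rows of R.
h_edited = h + R.T @ (W @ h + b - R @ h)
```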

LoReFT shows promising results, outperforming other PEFT methods on a set of tasks while requiring fewer parameters.

LoReFT outperforming other PEFT methods on several tasks. Source: ReFT paper

Back to the becoming-a-Generative-AI-expert analogy: chances are you already have some AI knowledge and want to take it to the next level. ReFT is like targeting the weak spots in your AI knowledge, much like deliberate learning. This can result in faster and more focused learning than flashcards (LoRA/DoRA).

While promising, DoRA and ReFT are still under development, and more research is needed to fully understand their capabilities compared to other fine-tuning methods.

Closing words

In this article, we delved into various techniques for computationally efficient fine-tuning of large language models (LLMs). These PEFT techniques, including LoRA, DoRA, and ReFT, aim to achieve good performance while minimising training time and memory usage.

To solidify our understanding of these techniques, let’s recap them quickly through the lens of the learning process to become a Generative AI expert:

  • Full Fine-Tuning — this is like enrolling in a comprehensive AI class and going through the entire curriculum. While in-depth, it takes a long time and requires significant memory.
  • LoRA (Low-Rank Adaptation) — this is like learning using flashcards. You focus on key concepts, making learning faster (faster training) and less resource-intensive (lower memory usage). However, there’s a risk of missing crucial details.
  • DoRA (Weight-Decomposed Low-Rank Adaptation) — this builds on LoRA by creating improved flashcards. Your flashcards have weights based on their importance to the AI subject, enabling more nuanced and focused learning.
  • ReFT (Representation Fine-tuning) — this takes a different approach: it’s like practicing deliberate learning on your weak spots. This can be even faster and potentially more effective than flashcards (LoRA) or improved flashcards (DoRA).

Making LLMs accessible and efficient is an active area of research, with methods such as DoRA and ReFT being introduced at a fast pace. These rapid, incremental innovations show great potential to democratize AI and make it accessible across a variety of industries.

Thank you for reading!

Find me on LinkedIn!

References

LoRA: https://arxiv.org/pdf/2106.09685.pdf

DoRA: Weight-Decomposed Low-Rank Adaptation: https://arxiv.org/pdf/2402.09353.pdf

ReFT: Representation Finetuning for Language Models: https://arxiv.org/pdf/2404.03592.pdf

Generative AI with LLMs: https://www.coursera.org/learn/generative-ai-with-llms/home/week/2
