PEFT: The Key to Efficient LLMs

Shashank Agarwal
8 min read · Aug 24, 2024


Fine-tuning Large Language Models (LLMs) is essential for tailoring them to excel at specific tasks. However, full fine-tuning of these massive models comes with significant challenges. Not only do you need memory to store the model itself, but you also need additional resources to handle the optimizer states, gradients, forward activations, and other parameters required during training. These extra demands can quickly overwhelm available computational hardware, creating a bottleneck in the fine-tuning process.

This is where Parameter Efficient Fine-Tuning (PEFT) offers a smart solution. Instead of fine-tuning the entire model, PEFT allows you to adjust only a small subset of parameters, making the process much more memory-efficient. During inference, the fine-tuned weights from PEFT are seamlessly combined with the original, frozen model weights.

Because LLMs are often used for multiple tasks, PEFT enables the creation of task-specific weights. Each set of PEFT weights is fine-tuned for a particular task and can be selectively applied during inference, ensuring the model performs optimally across different tasks without the need for full retraining.

PEFT weights are individually trained for different tasks.
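As a rough illustration of how few parameters end up trainable, here is a minimal sketch using the Hugging Face peft library with LoRA as the adapter type; the library, checkpoint, and hyperparameters are illustrative assumptions, not something this article prescribes.

```python
# Minimal sketch: wrap a pretrained model with a PEFT adapter so that only a
# tiny fraction of parameters receives gradients during fine-tuning.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")  # example checkpoint

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # task type for an encoder-decoder LM
    r=8,                              # rank of the low-rank update matrices
    lora_alpha=32,                    # scaling factor applied to the update
    lora_dropout=0.05,
)

peft_model = get_peft_model(base_model, peft_config)

# Reports the trainable parameter count, typically well under 1% of the total;
# the frozen base weights are untouched and only the adapter weights are updated.
peft_model.print_trainable_parameters()
```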

The PEFT techniques can be broadly classified into three categories:

  • Selective: Freeze most of the model layers and fine-tune only a small subset of them.
  • Reparameterization: Reduce the number of trainable parameters by introducing new low-rank transformations of the original network weights. Low-Rank Adaptation (LoRA) is the best-known PEFT technique in this category.
  • Additive: Freeze all the layers of the model, add a small number of new parameters or layers, and fine-tune only these new components. It has two main approaches:

• Adapters: Add new trainable layers to the model architecture, typically after the attention or feed-forward layers (a minimal sketch of such an adapter appears at the end of this section).

Adapters technique in PEFT

• Soft Prompts: This technique focuses on manipulating the input to achieve better performance while keeping the model architecture fixed. It requires storing only a small task-specific prompt for each task, which makes it easy to reuse a single frozen model for multiple downstream tasks, unlike model tuning, which requires a task-specific copy of the entire pre-trained model for each task.

Note: Prompt tuning is not prompt engineering!

Source: https://medium.com/@aabhi02/prompt-engineering-vs-prompt-tuning-a-detailed-explanation-19ea8ce62ac4
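
To make the adapter idea above more concrete, here is a minimal PyTorch sketch of a bottleneck adapter block; the hidden and bottleneck sizes are illustrative assumptions, not values from this article.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, apply a non-linearity, up-project, and add a residual connection."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()
        # Zero-initialize the up-projection so the adapter starts as an identity mapping.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# During fine-tuning, the base model's layers stay frozen and only small adapter
# blocks like this one, inserted after attention or feed-forward layers, are trained.
```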

LoRA

How does LoRA help?

We saw in the previous section that fine-tuning LLMs for specific tasks can be resource-intensive because these models have millions or even billions of parameters. LoRA (Low-Rank Adaptation) provides a more efficient way to fine-tune these models without needing to update all their parameters.

How does LoRA work?

  1. Start with a Pretrained Model: You begin with a large language model (LLM) that has already been trained on a massive amount of data. This model has millions or billions of parameters.
  2. Identify Key Layers: The model has many layers, each with parameters (like weights). LoRA focuses on a specific part of these layers, usually the weights in linear layers (which are matrix multiplications).
  3. Decompose the Weight Matrix: Instead of updating the entire weight matrix (which is big and resource-intensive), LoRA decomposes this matrix into two smaller matrices that, when multiplied together, approximate the original matrix. These smaller matrices have fewer parameters (this is called “low-rank” because the matrices have a lower rank, or complexity).
Decomposition of weight matrix into low rank matrices

4. Insert the Low-Rank Matrices: These low-rank matrices are inserted into the model’s architecture. One is typically initialized with small random values and the other with zeros, so their product starts at zero and the model’s predictions are unchanged at the start of fine-tuning.

5. Fine-Tune Only the Low-Rank Matrices: During fine-tuning, instead of updating the whole model, only the parameters in these low-rank matrices are updated. The rest of the model’s parameters remain fixed.

6. Combine with the Original Weights: After training, the updated low-rank matrices are combined with the original weight matrices. This allows the model to benefit from the new information captured during fine-tuning without having to adjust all of its parameters.

Combine low rank matrices with original matrix

7. Final Model: The result is a fine-tuned model that performs well on the new task but with much less computational cost and memory usage than if the entire model were fine-tuned.
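
Putting these steps together, here is a minimal PyTorch sketch of a LoRA-augmented linear layer; the rank, scaling, and initialization scheme follow common practice and are illustrative rather than prescriptive.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = Wx + (alpha / r) * B(Ax)."""
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # the original weights stay frozen
        self.base.bias.requires_grad_(False)

        # Two small matrices whose product approximates the weight update (steps 3-4).
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # small random init
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))        # zero init, so no change at first
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only lora_A and lora_B receive gradients during fine-tuning (step 5).
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

    @torch.no_grad()
    def merge(self) -> None:
        # Fold the learned low-rank update back into the original weight matrix (step 6),
        # so inference costs the same as the unmodified layer.
        self.base.weight += self.scaling * (self.lora_B @ self.lora_A)
```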

Let’s use the Transformer architecture as an example to illustrate how LoRA makes computation more efficient during fine-tuning:

Scenario: In a Transformer model, each attention head has its own set of weight matrices for processing input (query, key, and value). If each of these matrices is 768 x 768 in size, there are a lot of parameters to fine-tune.

Parameters to Update: For one attention head, updating its query, key, and value matrices means fine-tuning about 1.77 million parameters. With 12 heads, this adds up to over 21 million parameters.

Computation Cost: Fine-tuning all these parameters is very resource-intensive, requiring lots of memory and processing power.

Instead of fine-tuning the entire 768 x 768 matrices, LoRA represents the update to each one with two smaller matrices, for example, one 768 x 8 and one 8 x 768. This drastically reduces the number of parameters: each pair of low-rank matrices has only about 12,288 parameters, and across the query, key, and value matrices of all 12 heads this totals around 442,368 trainable parameters, far fewer than 21 million.
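
The arithmetic behind these numbers, as a quick sanity check:

```python
d = 768                 # hidden size in this example
r = 8                   # LoRA rank
heads = 12
matrices_per_head = 3   # query, key, and value projections

full_per_matrix = d * d                                   # 589,824
full_per_head = matrices_per_head * full_per_matrix       # 1,769,472 (about 1.77 million)
full_total = heads * full_per_head                        # 21,233,664 (over 21 million)

lora_per_matrix = d * r + r * d                           # 12,288 (one 768x8 plus one 8x768)
lora_total = heads * matrices_per_head * lora_per_matrix  # 442,368

print(full_total, lora_total)  # 21233664 442368
```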

LoRA can be used efficiently for different sets of tasks by training separate low-rank matrices for each task and, at inference time, combining the relevant trained low-rank matrices with the original weight matrix for the task at hand. The memory required to store these low-rank matrices is very small, which makes the approach quite efficient.

Adapt LoRA for different tasks
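
As a sketch of what this looks like in practice with the Hugging Face peft library (the base checkpoint and adapter directories below are hypothetical placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # example base checkpoint

# Hypothetical directories containing LoRA weights trained separately for each task.
model = PeftModel.from_pretrained(base_model, "lora-summarization/", adapter_name="summarization")
model.load_adapter("lora-translation/", adapter_name="translation")

# Activate the adapter that matches the task at hand; the base weights never change.
model.set_adapter("translation")
```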

LoRA can be further enhanced by combining it with model quantization techniques, resulting in an even more efficient approach known as QLoRA. This method reduces the number of parameters needed during fine-tuning, making the process more resource-efficient. To learn more about how to quantize deep learning models and leverage the benefits of QLoRA, refer to this article.
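
For reference, here is a minimal sketch of how QLoRA is commonly set up with the Hugging Face stack; the specific arguments reflect typical usage and are assumptions, not details taken from this article.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit quantized weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("gpt2", quantization_config=bnb_config)

# Attach small LoRA matrices on top of the frozen, quantized weights.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=32, task_type="CAUSAL_LM"))
model.print_trainable_parameters()
```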

Prompt tuning

How does prompt tuning help?

When you have a pre-trained LLM, it already has a vast amount of knowledge from its training data. However, to get the model to perform well on a specific task, like summarizing news articles, you typically need to give it a prompt that guides it to produce the right kind of output.

If you simply give the model a general prompt like, “Summarize this article,” the model might produce varying quality in its summaries. It may not always understand exactly what you’re asking for, or it might give summaries that are too long, too short, or not focused on the key points.

Instead of retraining the whole model on a large dataset (which is computationally expensive), you can fine-tune the prompt itself. This means you iteratively adjust the wording, structure, or even the length of the prompt to find a version that consistently leads the model to produce high-quality summaries.

Let’s say you start with the prompt, “In summary, the article discusses…”. You might find that the model gives better summaries if you tweak this to, “To briefly summarize, the article focuses on…”. Over time, through experimentation or even automated optimization, you discover the prompt that works best for summarization. Prompt tuning automates this search, but instead of trying out discrete words by hand, it learns the prompt directly in the model’s embedding space, as described next.

How does Prompt tuning work?

The goal of prompt tuning is to find the best version of the prompt that leads to the most accurate or useful output for your specific task, while keeping the pre-trained weights of the model fixed.

  1. Start with a Pretrained Model: Begin with a LLM that has been pretrained on a wide range of text.
  2. Design a Task-Specific Prompt: For the task you want the model to perform, you create an initial prompt. This could be a few words, a sentence, or even a paragraph that provides context or a question for the model to respond to.
  3. Optimize the Prompt: Instead of adjusting the model’s internal parameters, prompt tuning tweaks the prompt itself. Additional trainable tokens are prepended to the original prompt, and their optimal values are learned through the supervised training process. These additional parameters attached to the actual prompt are called Soft Prompts.

These soft prompts are not interpretable, i.e., they are not fixed, discrete words of a natural language; instead, you can think of them as virtual tokens that can take any value within the continuous, multi-dimensional embedding space. However, once the soft prompts are learned, a nearest-neighbor search can show which actual tokens in the embedding space they are closest to.

4. Use the Tuned Prompt: Once the prompt is optimized, it can be used with the pretrained model to perform the task. The model, guided by the tuned prompt, generates responses that are more aligned with the task requirements.
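
To make the soft-prompt mechanism concrete, here is a minimal PyTorch sketch of prepending trainable virtual tokens to a frozen model's input embeddings; the number of virtual tokens and the embedding size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable virtual tokens prepended to the frozen model's input embeddings."""
    def __init__(self, num_virtual_tokens: int = 20, embed_dim: int = 768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) from the frozen embedding layer
        batch_size = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Only soft_prompt.prompt is updated during training; the LLM itself stays frozen.
# After training, a nearest-neighbor search over the model's token embeddings can
# show which real tokens each learned virtual token ends up closest to.
```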

Similar to LoRA, Prompt tuning can also be used for multiple tasks by training different sets of soft prompts for different types of tasks. Later, during the inference stage, the soft prompts trained for a specific task can be prepended to the actual prompts to carry out that task.

The figure below shows that prompt tuning performs comparably to full fine-tuning of LLMs across multiple tasks as the model size increases.

Performance of prompt tuning vs full fine tuning. (https://arxiv.org/pdf/2104.08691)

In this article, we looked at different PEFT techniques that reduce the computational burden of fine-tuning the full model architecture. Such approaches make these powerful LLMs more accessible and adaptable for a variety of tasks. As the demand for specialized applications continues to grow, PEFT provides a scalable, resource-efficient pathway to unlocking the full potential of LLMs.

Thanks for reading, hope it helps!
