PEFT — Parameter Efficient Fine Tuning

kanika adik
Jul 16, 2023

Parameter Efficient Fine-Tuning (PEFT) updates only a small subset of a model's parameters, which helps prevent catastrophic forgetting.

PEFT methods specifically address several of the challenges of performing full fine-tuning:

Computational constraints: Because most parameters are frozen, we typically only need to train 15-20% of the original LLM weights, which makes the training process less expensive and requires less memory.

Storage requirements: With PEFT, only a small number of parameters change during fine-tuning, so at inference time you can combine the original model with the new parameters instead of storing a duplicate of the entire model for every task you fine-tune on.

Catastrophic forgetting: Full fine-tuning can lead to catastrophic forgetting because it changes all of the model's parameters. With PEFT, most parameters of the LLM are left unchanged, which makes it far less prone to this effect.
Training LLMs is computationally intensive. Full fine-tuning requires memory not just to store the model, but also for everything else needed during the training process:
- optimizer states
- gradients
- forward activations
- temporary memory throughout training
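As a rough sketch of why this adds up (a minimal example, assuming FP32 weights and the Adam optimizer, which keeps two extra states per parameter; activations are left out because they depend on batch size and sequence length):

```python
# Rough memory estimate for full fine-tuning.
# Assumptions: FP32 everywhere (4 bytes per value), Adam optimizer with two
# states per parameter; activations and temporary buffers are ignored here.
params = 1_000_000_000                      # a hypothetical 1B-parameter model

weights     = params * 4                    # 4 GB of model weights
gradients   = params * 4                    # 4 GB of gradients
adam_states = params * 4 * 2                # 8 GB (momentum + variance)

total_gb = (weights + gradients + adam_states) / 1e9
print(f"~{total_gb:.0f} GB before activations")   # ~16 GB
```

Even before activations, a 1B-parameter model already needs roughly four times the memory of the weights alone.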

In PEFT, most if not all of the LLM weights are kept frozen, so the number of trained parameters is much smaller than the number of parameters in the original LLM, in some cases just 15-20% of the original LLM weights. This makes the memory requirements for training much more manageable.

The new parameters are combined with the original LLM weights for inference. The PEFT weights are trained for each task and can be easily swapped out for inference, allowing efficient adaptation of the original model to multiple tasks. There are several methods you can use for parameter-efficient fine-tuning, each with trade-offs on parameter efficiency, memory efficiency, training speed, model quality, and inference costs.
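As one concrete illustration, here is a minimal sketch using the Hugging Face peft library with a LoRA configuration (one of the methods covered below); the model name and hyperparameter values are placeholder assumptions, not recommendations:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Load a base model whose original weights will stay frozen.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Describe the PEFT method to apply; only the new adapter weights are trained.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # rank of the low-rank update matrices
    lora_alpha=32,    # scaling factor applied to the update
    lora_dropout=0.05,
)
model = get_peft_model(base_model, lora_config)

# Prints trainable vs. total parameters, typically a small fraction of the model.
model.print_trainable_parameters()
```

The trained adapter weights can then be saved separately from the base model and swapped in per task.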

There are three main classes of PEFT methods.

Selective methods fine-tune only a subset of the original LLM parameters. There are several approaches you can take to identify which parameters you want to update: you can train only certain components of the model, specific layers, or even individual parameter types. Researchers have found that the performance of these methods is mixed and that there are significant trade-offs between
- parameter efficiency
- compute efficiency

Reparameterization methods also work with the original LLM parameters, but reduce the number of parameters to train by creating new low-rank transformations of the original weights. LoRA, covered below, is a commonly used technique in this class.

Additive methods carry out fine-tuning by keeping all of the original LLM weights frozen and introducing new trainable components.

Adapter methods add new trainable layers to the architecture of the model, typically inside the encoder or decoder components after the attention or feed-forward layers.
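For intuition, here is a minimal sketch of a bottleneck adapter as a PyTorch module; the dimensions and activation are illustrative assumptions, and concrete adapter designs vary:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted after an attention or feed-forward layer.

    Projects the hidden state down to a small bottleneck, applies a
    non-linearity, projects back up, and adds a residual connection.
    Only these new weights are trained; the surrounding layers stay frozen.
    """
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```

Because the bottleneck is small, the adapter adds only a tiny number of parameters per layer.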

Soft prompt methods keep the model architecture fixed and frozen, and focus on manipulating the input to achieve better performance.
This can be done by
— adding trainable parameters to the prompt embeddings
— keeping the input fixed and retraining the embedding weights

LoRA — Low-Rank Adaptation

The goal is to find an efficient way to update the weights of the model without having to train every single parameter again.

As a reminder, the basic Transformer structure is built from encoder and decoder blocks, each containing self-attention and feed-forward networks. LoRA makes a small modification to how these networks are fine-tuned: the original weights are frozen, and each weight matrix you want to adapt is paired with two small rank-decomposition matrices whose product has the same dimensions as the frozen matrix. During training, only these small matrices are updated.

Since most of the parameters of LLMs are in the attention layers, you get the biggest savings in trainable parameters by applying LoRA to these weight matrices.
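Here is a minimal sketch of the idea in PyTorch: the original linear layer is frozen and only the two small matrices A and B are trained. The rank and scaling values are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # original weights stay frozen

        in_f, out_f = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)   # rank x in
        self.B = nn.Parameter(torch.zeros(out_f, rank))         # out x rank
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to (W + scaling * B @ A) x, without materializing the sum.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

In practice, you would wrap the query and value projection layers of each attention block with a module like this.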

How does LoRA reduce the number of parameters to train?

Because the rank r is chosen to be much smaller than the dimensions of the original weight matrix, the two rank-decomposition matrices together contain far fewer parameters than the matrix they adapt. Since LoRA allows you to significantly reduce the number of trainable parameters, you can often perform this method of parameter-efficient fine-tuning on a single GPU and avoid the need for a distributed cluster of GPUs.
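To make the reduction concrete, here is a back-of-the-envelope count for a single projection matrix, using illustrative dimensions of 512 x 64 and a rank of 8:

```python
# Trainable parameters for one 512 x 64 projection matrix (illustrative sizes).
d, k, r = 512, 64, 8

full = d * k                 # full fine-tuning: 32,768 trainable parameters
lora = (r * k) + (d * r)     # LoRA: A is 8x64 = 512, B is 512x8 = 4,096 -> 4,608

print(full, lora, f"{100 * (1 - lora / full):.0f}% fewer trainable parameters")
# 32768 4608 86% fewer trainable parameters
```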

Since the rank-decomposition matrices are small, you can fine-tune a different set for each task and then switch them out at inference time by updating the weights.

You can use LoRA to train for many tasks:

Suppose you train a pair of LoRA matrices for a specific task; let's call it Task A.
To carry out inference on this task, you multiply these matrices together and then add the resulting matrix to the original frozen weights.
You then take this new summed weight matrix and replace the original weights where they appear in your model.

You can then use this model to carry out inference on Task A. If instead, you want to carry out a different task, say Task B, you simply take the LoRA matrices you trained for this task, calculate their product, and then add this matrix to the original weights and update the model again. The memory required to store these LoRA matrices is very small.
So in principle, you can use LoRA to train for many tasks.

You switch out the weights whenever you need them, and avoid having to store multiple full-size versions of the LLM.
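A minimal sketch of this swap using the peft library; the adapter names and paths here are hypothetical:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# The frozen base model is loaded once and reused for every task.
base = AutoModelForCausalLM.from_pretrained("gpt2")

# Load the small LoRA weights trained for Task A (path and names are placeholders).
model = PeftModel.from_pretrained(base, "lora-task-a", adapter_name="task_a")

# Load the Task B matrices alongside and switch between the two as needed;
# only the tiny LoRA weights change, never the full model.
model.load_adapter("lora-task-b", adapter_name="task_b")
model.set_adapter("task_b")    # inference now uses the Task B update
model.set_adapter("task_a")    # switch back to Task A
```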

Soft prompts

Prompt tuning is a parameter-efficient fine-tuning method that works by adding additional trainable tokens, known as soft prompts, to the prompt.

The goal is to help the model understand the nature of the task you’re asking it to carry out and to generate a better completion.

Prompt tuning is not the same as prompt engineering, which has some limitations: it can require a lot of manual effort to write and try different prompts, you're limited by the length of the context window, and at the end of the day you may still not achieve the performance you need for your task.
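A minimal sketch of the prompt-tuning idea in PyTorch; the number of virtual tokens and the embedding dimension are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable virtual tokens prepended to the input embeddings.

    The LLM's own weights stay frozen; only these embedding vectors are trained.
    """
    def __init__(self, num_virtual_tokens: int = 20, embed_dim: int = 768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.01)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim), e.g. the output of the
        # frozen model's token embedding layer.
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)
```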

One potential issue to consider is the interpretability of the learned virtual tokens. Because the soft prompt tokens can take any value within the continuous embedding vector space, the trained tokens don't correspond to any known token, word, or phrase in the vocabulary of the LLM. However, an analysis of the nearest-neighbor tokens to the soft prompt locations shows that they form tight semantic clusters.
In other words, the words closest to the soft prompt tokens have similar meanings.

The words identified usually have some meaning related to the task, suggesting that the prompts are learning word-like representations.

We explored two PEFT methods in this post: LoRA, which uses rank-decomposition matrices to update the model parameters in an efficient way, and prompt tuning, where trainable tokens are added to your prompt and the model weights are left untouched.

Both methods enable you to fine-tune models with the potential for improved performance on your tasks while using much less compute than full fine-tuning.

LoRA is broadly used in practice because it delivers performance comparable to full fine-tuning for many tasks and datasets.
