PEFT: Making Big Things Happen with Small Changes

First, what is fine-tuning? Fine-tuning improves a model’s performance by training it on specific examples of prompts and desired responses, and it serves as a powerful lever for refining the abilities of LLMs. Famous models like BERT begin their journey with initial training on massive datasets encompassing vast swaths of internet text. That pre-training imparts a broad understanding of language, but it is fine-tuning that tailors them to excel in specific domains or tasks. Notably, smaller LLMs fine-tuned on a specific use case often outperform larger ones that were trained more generally.
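To make that concrete, here is a minimal sketch of ordinary full fine-tuning on a single prompt/response pair, using GPT-2 from Hugging Face Transformers as a small stand-in model. The example text and learning rate are illustrative assumptions, not a recipe:

```python
# A minimal sketch of full fine-tuning a causal LM on one prompt/response pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One toy training example: a prompt and its desired response.
text = "Q: What is PEFT?\nA: Parameter-efficient fine-tuning."
batch = tokenizer(text, return_tensors="pt")

# Standard language-modeling objective: the labels are the input ids themselves.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()   # gradients for *every* parameter in the model
optimizer.step()
print(f"loss: {outputs.loss.item():.3f}")
```

Note that the optimizer here tracks state for every parameter in the model, which is exactly the cost PEFT tries to avoid.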

What is PEFT?

PEFT is like giving your pre-trained model a mini makeover rather than a complete overhaul: it’s about making the most of what you already have. Rather than updating every weight, PEFT trains only a small set of parameters, which could be a subset of the existing model parameters or a set of newly added parameters. These approaches vary in parameter efficiency, memory efficiency, training speed, final model quality, and (if any) additional inference costs.

Fine-tuning large language models comes with several challenges: high computational cost, a large memory footprint, and the difficulty of finding optimal hyperparameters. Parameter-efficient fine-tuning (PEFT) methods aim to address these challenges by training only a small set of parameters, which reduces the computational cost and memory footprint. Additionally, some PEFT methods introduce new parameter-efficient modules or architectures that allow large models to be fine-tuned more efficiently. But there are still unresolved challenges in PEFT, such as the gap between PEFT and full fine-tuning performance.
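Before diving into the taxonomy, here is the core PEFT idea in a few lines of PyTorch: freeze the pre-trained weights and train only a small set of new parameters. The backbone, head, and all sizes below are toy assumptions for illustration:

```python
# A minimal sketch of the core PEFT idea: freeze the pre-trained weights
# and train only a tiny set of newly added parameters.
import torch
import torch.nn as nn

# Stand-in for a large pre-trained backbone (assumption: any nn.Module works here).
backbone = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))
head = nn.Linear(768, 2)  # newly added, task-specific parameters

# Freeze every pre-trained parameter; only the new head stays trainable.
for p in backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

x = torch.randn(8, 768)          # toy batch
y = torch.randint(0, 2, (8,))    # toy labels
loss = nn.functional.cross_entropy(head(backbone(x)), y)
loss.backward()                   # gradients flow only into the head
optimizer.step()

trainable = sum(p.numel() for p in head.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```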

Taxonomy of PEFT

Parameter-efficient fine-tuning methods taxonomy (Lialin, Vladislav, et al. “Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning.” arXiv, 2023, https://arxiv.org/abs/2303.15647)

We will follow the taxonomy of PEFT presented in the paper “Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning,” which gives a systematic overview and comparison of 30 PEFT methods, with an in-depth discussion of 20 of them. I really recommend going over the paper to learn more. As the figure above shows, the taxonomy can be broadly categorized into:

1. Additive Methods

Additive methods predominantly focus on augmenting the existing pre-trained model with additional parameters or layers, training only the newly introduced parameters. This category is further bifurcated into:

  • Adapters: Small fully connected networks inserted after Transformer sub-layers. Adapters have proven remarkably parameter-efficient, achieving competitive performance while tuning less than 4% of the total model parameters (a minimal sketch follows this list).
  • Soft Prompts: Soft prompts move the search for a good prompt from a discrete token space to a continuous optimization problem, fine-tuning a part of the model’s input embeddings via gradient descent (see the second sketch below).
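Here is a minimal sketch of a bottleneck adapter in PyTorch, assuming a Houlsby-style down-project/up-project block with a residual connection; the dimensions are illustrative:

```python
# A minimal sketch of a bottleneck adapter: down-project, nonlinearity,
# up-project, plus a residual connection, inserted after a sub-layer.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual keeps the pre-trained representation intact;
        # the adapter only learns a small correction on top of it.
        return x + self.up(self.act(self.down(x)))

adapter = Adapter()
hidden = torch.randn(2, 16, 768)   # (batch, sequence, hidden)
print(adapter(hidden).shape)        # torch.Size([2, 16, 768])
```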

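And a minimal sketch of soft prompting, assuming a frozen embedding table and a handful of learnable “virtual token” vectors prepended to the input; all sizes are illustrative:

```python
# A minimal sketch of soft prompting: learnable continuous prompt vectors
# are prepended to the (frozen) input embeddings and tuned by gradient descent.
import torch
import torch.nn as nn

vocab_size, d_model, n_virtual = 50_000, 768, 20

embedding = nn.Embedding(vocab_size, d_model)   # frozen pre-trained embeddings
embedding.weight.requires_grad = False

# The only trainable parameters: n_virtual continuous prompt vectors.
soft_prompt = nn.Parameter(torch.randn(n_virtual, d_model) * 0.02)

input_ids = torch.randint(0, vocab_size, (4, 32))     # toy batch of token ids
tok_emb = embedding(input_ids)                        # (4, 32, 768)
prompt = soft_prompt.unsqueeze(0).expand(4, -1, -1)   # (4, 20, 768)
inputs = torch.cat([prompt, tok_emb], dim=1)          # (4, 52, 768)
print(inputs.shape)
```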
2. Selective Methods

Selective methods, one of the earliest exemplars of PEFT, involve fine-tuning only select top layers of a network or individual parameters. This category encompasses:

  • Sparse Update Methods: These methods select parameters individually for tuning but pose several engineering and efficiency challenges.
  • Fine-tuning Top Layers: Only a few top layers of a network are fine-tuned, as in the sketch after this list.
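A minimal sketch of the top-layers variant, assuming a toy 12-layer stack: freeze everything, then unfreeze only the last two layers:

```python
# A minimal sketch of selective fine-tuning of top layers.
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(768, 768) for _ in range(12)])

# Freeze everything, then unfreeze only the top two layers.
for layer in layers:
    for p in layer.parameters():
        p.requires_grad = False
for layer in layers[-2:]:
    for p in layer.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for l in layers for p in l.parameters() if p.requires_grad)
total = sum(p.numel() for l in layers for p in l.parameters())
print(f"trainable fraction: {trainable / total:.1%}")  # ~16.7%
```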

3. Reparametrization-based Methods

Reparametrization-based methods leverage low-rank representations to minimize the number of trainable parameters, with notable methods like:

  • Low-Rank Adaptation (LoRA): LoRA parametrizes the weight update with a simple low-rank matrix decomposition and has been effectively applied to models with up to 175 billion parameters (see the sketch below). For more, read the LoRA paper (Hu et al., 2021, https://arxiv.org/abs/2106.09685).
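Here is a minimal sketch of a LoRA layer in PyTorch, following the Wx + (alpha/r)·BAx formulation from the paper; the dimensions, rank, and scaling values are illustrative assumptions:

```python
# A minimal sketch of LoRA: the frozen weight W is augmented with a
# low-rank update B @ A, and only A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in: int = 768, d_out: int = 768, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)           # pre-trained, frozen
        self.base.weight.requires_grad = False
        self.base.bias.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # trainable down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))         # trainable, zero-initialized
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + (alpha/r) * B A x; since B starts at zero, the update is a no-op at init.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear()
print(layer(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```

After training, the update BA can be merged into W, which is why LoRA adds no inference overhead.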

4. Hybrid Methods

Hybrid methods amalgamate ideas from various categories of PEFT, such as:

  • MAM Adapter: Incorporates both Adapters and Prompt tuning.
  • UniPELT: Adds LoRA to the mixture.
  • Compacter and KronAB: Reparametrize the adapters to reduce their parameter count.
  • S4: Combines all PEFT classes to maximize accuracy at 0.5% of extra parameter count.

Other Methods

  • BitFit: Fine-tunes only the bias terms of the network, and has been compared with popular approaches like LoRA and Adapters (a minimal sketch follows this list). A good paper to explore is “BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models” (Ben Zaken et al., 2021, https://arxiv.org/abs/2106.10199).
  • DiffPruning: Trains a binary mask with the same number of parameters as the model, so it is considered only storage-efficient.
  • Kronecker-product Reparametrizations: Decrease the number of trainable parameters while requiring minimal extra computation.
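A minimal sketch of BitFit, assuming a toy stand-in network: mark only the bias parameters as trainable and freeze the rest:

```python
# A minimal sketch of BitFit: train only the bias terms.
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))

# Only parameters whose name ends in "bias" stay trainable.
for name, p in model.named_parameters():
    p.requires_grad = name.endswith("bias")

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['0.bias', '2.bias']
```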

Comparing PEFT Methods

1. Storage Efficiency

Storage efficiency pertains to the method’s ability to minimize the storage requirements for the model parameters. For instance, methods like Adapters and Soft Prompts, which fall under the additive methods category, introduce additional parameters to the network, thereby impacting storage efficiency.

2. Memory Efficiency

Memory efficiency revolves around the method’s capacity to reduce RAM usage during fine-tuning. While additive methods like Adapters introduce additional parameters, they also facilitate significant memory efficiency improvements by reducing the size of the gradients and optimizer states.
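As a back-of-the-envelope illustration of this point (the 7B model size and 1% trainable fraction below are assumptions), Adam keeps two fp32 moment tensors per trainable parameter, so shrinking the trainable set shrinks the optimizer state proportionally:

```python
# Rough arithmetic: Adam's optimizer state scales with *trainable* parameters.
total_params = 7_000_000_000     # hypothetical 7B-parameter model
trainable_fraction = 0.01        # hypothetical: PEFT trains 1% of parameters
bytes_per_adam_state = 8         # two fp32 moments, 4 bytes each

full_ft = total_params * bytes_per_adam_state / 1e9
peft = total_params * trainable_fraction * bytes_per_adam_state / 1e9
print(f"Adam states, full fine-tuning: ~{full_ft:.0f} GB")   # ~56 GB
print(f"Adam states, PEFT:             ~{peft:.1f} GB")      # ~0.6 GB
```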

3. Computational Efficiency

Computational efficiency, particularly reducing the cost of backpropagation, is pivotal in ensuring that fine-tuning remains computationally feasible. Methods like LoRA, which fall under reparametrization-based methods, add no backpropagation overhead, thereby enhancing computational efficiency.

4. Accuracy

Accuracy is undeniably one of the most critical factors in comparing PEFT methods. It is imperative to evaluate how each method impacts the model’s performance on downstream tasks and whether the parameter efficiency is achieved at the cost of model accuracy.

5. Inference Overhead

Inference overhead pertains to any additional computational cost incurred during inference as a result of the PEFT method. For instance, additive methods like Adapters introduce extra fully connected networks, thereby incurring additional inference overhead.

PEFT Methods Under the Microscope

  • Adapters: While they introduce additional parameters (impacting storage efficiency), Adapters achieve notable memory efficiency improvements by reducing the size of gradients and optimizer states, albeit with additional inference overhead due to the extra fully connected networks.
  • LoRA: A reparametrization-based method that employs a simple low-rank matrix decomposition to parametrize the weight update, offering computational efficiency with no additional backpropagation overhead.
  • Selective Methods: These methods, which involve fine-tuning only select top layers of a network or individual parameters, must be evaluated in terms of how the selective tuning impacts accuracy and whether it offers tangible benefits in terms of computational and memory efficiency.

I hope you’ve found this a useful place to start breaking the ice with the major concepts of PEFT. If you want to go deeper, I’d suggest reading the papers linked above.

See you next time, when we dive deeper into Low-Rank Adaptation and use it to efficiently fine-tune a large language model.
