EMNLP 2022 Tutorial — “Modular and Parameter-Efficient Fine-Tuning for NLP Models”

Oscar Shih
5 min read · Jun 16, 2023

Abstract

As state-of-the-art NLP models grow ever larger, fine-tuning an entire model consumes a great deal of computational resources, which puts the power of large language models out of reach for those with limited resources. The emergence of parameter-efficient fine-tuning relieves this limitation.

Efficient Methods for Natural Language Processing: A Survey (https://arxiv.org/abs/2209.00099)

Parameter-Efficient Fine-Tuning (PEFT)

Compared to full fine-tuning, PEFT updates only a small subset of the model's parameters.

Why Modularity?

  • Models are increasing in size → PEFT strategies
  • Unseen scenarios → out-of-distribution generalization through module composition
  • Catastrophic interference → modularity as an inductive bias
  • Efficient updating of models through added components

Computation Functions

Parameter Composition

  1. Sparse Sub-networks
  • Sparsity is a common inductive bias on the module parameters.
  • Pruning is the most common way to enforce sparsity.
  • The most common pruning criterion is weight magnitude (https://arxiv.org/abs/1607.04381).
During pruning, a fraction of the lowest-magnitude weights is removed and the remaining weights are re-trained; pruning over multiple iterations is more common than pruning in a single shot. [Frankle & Carbin, 2019]
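
As a rough illustration, here is a minimal PyTorch-style sketch of one round of magnitude pruning; the layer size, sparsity level, and number of iterations are placeholders, not values from the tutorial.

```python
import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Linear, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the lowest-magnitude weights in-place and return the binary mask."""
    with torch.no_grad():
        magnitudes = layer.weight.abs().flatten()
        k = max(int(sparsity * magnitudes.numel()), 1)
        threshold = magnitudes.kthvalue(k).values      # k-th smallest magnitude
        mask = (layer.weight.abs() > threshold).float()
        layer.weight.mul_(mask)                        # prune the smallest weights
    return mask

# Iterative magnitude pruning: prune, re-train the surviving weights, repeat.
layer = nn.Linear(768, 768)
for _ in range(3):                                     # placeholder iteration count
    mask = magnitude_prune(layer, sparsity=0.5)
    # ... re-train here, multiplying weight gradients by `mask`
    #     so that pruned weights stay at zero.
```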

2. Structured Composition

Bias-only Fine-tuning
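
Bias-only fine-tuning (known as BitFit) freezes everything except the bias terms and, usually, the task head. A minimal sketch, assuming a Hugging Face Transformers model; the checkpoint name is a placeholder.

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Train only bias parameters and the classification head; freeze the rest.
for name, param in model.named_parameters():
    param.requires_grad = ("bias" in name) or ("classifier" in name)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable}/{total} ({100 * trainable / total:.2f}%)")
```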

3. Low-rank Composition

LoRA Model Architecture
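
A minimal sketch of a LoRA layer: the frozen pre-trained weight W is augmented with a low-rank update B·A, with A initialized randomly and B initialized to zero so that training starts from the pre-trained behavior. The rank r and scaling alpha below are placeholder values.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # freeze the pre-trained layer
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W x + (alpha / r) * B A x ; only A and B are trained.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```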

Input Composition

  1. Prompt Tuning
  • Standard prompting can be seen as finding a discrete text prompt that, when embedded using the model’s embedding layer, steers the model toward the desired output; prompt tuning instead learns these prompt embeddings directly as continuous parameters (a sketch follows the figures below).
  • Models are sensitive to the formulation and initialization of the prompt. [Webson & Pavlick, 2022]
Fine-tuning vs Prompt tuning [Li & Liang, 2021]
Prompt tuning vs standard fine-tuning and prompt design across T5 models of different sizes. [Lester et al., 2021]
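
A minimal sketch of prompt tuning: a small set of continuous prompt embeddings is learned and prepended to the embedded input while the model itself stays frozen. The prompt length and the initialization scheme are placeholders.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, embedding: nn.Embedding, prompt_len: int = 20):
        super().__init__()
        self.embedding = embedding
        # Initialize the continuous prompt from randomly chosen vocabulary embeddings.
        init_ids = torch.randint(0, embedding.num_embeddings, (prompt_len,))
        self.prompt = nn.Parameter(embedding(init_ids).detach().clone())

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tokens = self.embedding(input_ids)                       # (batch, seq, dim)
        prompt = self.prompt.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return torch.cat([prompt, tokens], dim=1)                # prepend the prompt
```

The resulting embeddings are passed to the frozen model (e.g. via inputs_embeds in Hugging Face Transformers), and only self.prompt receives gradient updates.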

2. Multi-Layer Prompt Tuning

Left: Prompt Tuning; Right: Multi-layer Prompt Tuning
  • In practice, continuous prompts are concatenated with the keys and values in the self-attention layer. [Li & Liang, 2021]
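
A minimal sketch of this per-layer variant (prefix tuning): learned prefix vectors are concatenated with the keys and values inside each self-attention layer. The number of heads, head dimension, and prefix length are placeholders.

```python
import torch
import torch.nn as nn

class Prefix(nn.Module):
    """Learned key/value prefixes for a single self-attention layer."""
    def __init__(self, n_heads: int = 12, head_dim: int = 64, prefix_len: int = 10):
        super().__init__()
        self.k = nn.Parameter(torch.randn(n_heads, prefix_len, head_dim) * 0.02)
        self.v = nn.Parameter(torch.randn(n_heads, prefix_len, head_dim) * 0.02)

    def extend(self, keys: torch.Tensor, values: torch.Tensor):
        # keys/values: (batch, heads, seq, head_dim); the prefix is prepended
        # along the sequence dimension (the attention mask must be extended too).
        b = keys.size(0)
        keys = torch.cat([self.k.unsqueeze(0).expand(b, -1, -1, -1), keys], dim=2)
        values = torch.cat([self.v.unsqueeze(0).expand(b, -1, -1, -1), values], dim=2)
        return keys, values
```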

Function Composition

  1. Adapters
  • The main purpose of the functions f added to a pre-trained model is to adapt it, so these functions are also known as ‘adapters’.
  • An adapter in a Transformer layer typically consists of a feed-forward down-projection, an activation function, and a feed-forward up-projection, combined with a residual connection (sketched below). [Houlsby et al., 2019]
Adapters in a Transformer-based model. [Houlsby et al., 2019]
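
A minimal sketch of such an adapter block; the hidden size and bottleneck dimension are placeholders.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)    # feed-forward down-projection
        self.activation = nn.GELU()                       # activation function
        self.up = nn.Linear(bottleneck, hidden_size)      # feed-forward up-projection

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Bottleneck transformation plus a residual connection.
        return hidden + self.up(self.activation(self.down(hidden)))
```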

2. Compact Adapter (Compacter)

  • Compacter reduces the number of adapter parameters by a factor of 10 while achieving similar or better performance.
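
It does so mainly by replacing the adapter's dense projections with parameterized hypercomplex multiplication (PHM) layers, whose weights are sums of Kronecker products of much smaller matrices; Compacter additionally shares and low-rank-factorizes some of these factors, which is omitted in the rough sketch below. All dimensions are placeholders.

```python
import torch
import torch.nn as nn

class PHMLinear(nn.Module):
    """A projection whose weight is a sum of Kronecker products of small matrices."""
    def __init__(self, in_dim: int = 768, out_dim: int = 64, n: int = 4):
        super().__init__()
        assert in_dim % n == 0 and out_dim % n == 0
        self.A = nn.Parameter(torch.randn(n, n, n) * 0.01)                      # n small (n x n) factors
        self.B = nn.Parameter(torch.randn(n, in_dim // n, out_dim // n) * 0.01) # n larger factors

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W = sum_i kron(A_i, B_i) has shape (in_dim, out_dim),
        # but is parameterized by far fewer trainable values than a dense matrix.
        W = sum(torch.kron(self.A[i], self.B[i]) for i in range(self.A.size(0)))
        return x @ W
```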

3. Sequential and Parallel Adapters

  • Adapters can be placed sequentially or in parallel (a sketch follows the figure captions below).
  • Sequential adapters are inserted after an existing function: h = adapter(f(x)).
  • Parallel adapters are applied alongside it: h = f(x) + adapter(x).
Two parallel adapters [Stickland & Murray, 2019]
A sequential adapter [Houlsby et al., 2019]
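
A minimal sketch of the two placements, where f stands for a frozen Transformer sub-layer and adapter for a bottleneck module like the one above.

```python
import torch

def sequential_adapter(f, adapter, x: torch.Tensor) -> torch.Tensor:
    # Sequential: the adapter transforms the output of the frozen function.
    return adapter(f(x))

def parallel_adapter(f, adapter, x: torch.Tensor) -> torch.Tensor:
    # Parallel: the adapter sees the same input and its output is added back.
    return f(x) + adapter(x)
```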

4. Benefits of Adapters:

  • More stable training: adapters are less sensitive to the choice of learning rate than full fine-tuning.
BERT test performance distributions over 20 runs with different learning rates [He et al., 2021]
  • Increased sample efficiency.
Results on GLUE with different numbers of training samples per task [Mahabadi et al., 2021]

Overall Comparisons

Parameter Composition

  1. Pros:
  • Methods like diff pruning require < 0.5% of the model’s parameters per task, which gives high parameter efficiency.
  • Pruning does not increase the model size, which preserves inference efficiency.
  • Some methods achieve strong performance.
  • Subnetworks can be composed.

2. Cons:

  • Pruning requires re-training iterations, which reduces the training efficiency.

Input Composition

  1. Pros:
  • Prompts add only a small number of parameters (relative to the model size), which is beneficial for parameter efficiency.
  • Continuous prompts can be composed.

2. Cons:

  • Prompt tokens take up part of the model’s context window and lengthen the input sequence, which reduces both training and inference efficiency.
  • Prompt tuning performs well on large models, but with smaller models it cannot match the performance of other PEFT methods.

Function Composition

  1. Pros:
  • Adapters do not require gradient updates for the frozen parameters, which increases training efficiency.
  • Adapters sometimes can match or outperform standard fine-tuning.
  • Adapters can be composed.

2. Cons:

  • The number of adapter parameters scales with the model’s hidden size, which limits parameter efficiency.
  • The added functions increase the number of operations, which reduces inference efficiency.
Overall comparison of all three composition methods.

References

Appendix

Papers (Suggested Readings)

Useful Links
