EMNLP 2022 Tutorial — “Modular and Parameter-Efficient Fine-Tuning for NLP Models”

Oscar Shih
5 min read · Jun 16, 2023

Abstract

As state-of-the-art NLP models grow ever larger, fine-tuning an entire model consumes a great deal of computational resources, which puts the power of large language models out of reach for those with limited resources. The emergence of parameter-efficient fine-tuning relieves this limitation.

Efficient Methods for Natural Language Processing: A Survey (https://arxiv.org/abs/2209.00099)

Parameter-Efficient Fine-Tuning (PEFT)

Compared to full fine-tuning, PEFT updates only a small subset of the model's parameters.

Why Modularity?

  • Models are increasing in size → PEFT strategies
  • Unseen scenarios → out-of-distribution generalization through module composition
  • Catastrophic interference → modularity as an inductive bias
  • Efficient updating of models through added components

Computation Functions

Parameter Composition

  1. Sparse Sub-networks
  • Sparsity is a common inductive bias on the module parameters.
  • Pruning is the most common way to enforce sparsity.
  • The most common pruning criterion is weight magnitude (https://arxiv.org/abs/1607.04381).
During pruning, a fraction of the lowest-magnitude weights is removed and the remaining weights are re-trained; pruning over multiple iterations is more common than pruning in a single shot. [Frankle & Carbin, 2019]
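
As a rough illustration, here is a minimal PyTorch-style sketch of one round of magnitude pruning; the layer size, sparsity level, and number of iterations are placeholders, not values from the tutorial.

```python
import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Linear, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the lowest-magnitude weights in-place and return the binary mask."""
    with torch.no_grad():
        magnitudes = layer.weight.abs().flatten()
        k = max(int(sparsity * magnitudes.numel()), 1)
        threshold = magnitudes.kthvalue(k).values      # k-th smallest magnitude
        mask = (layer.weight.abs() > threshold).float()
        layer.weight.mul_(mask)                        # prune the smallest weights
    return mask

# Iterative magnitude pruning: prune, re-train the surviving weights, repeat.
layer = nn.Linear(768, 768)
for _ in range(3):                                     # placeholder iteration count
    mask = magnitude_prune(layer, sparsity=0.5)
    # ... re-train here, multiplying weight gradients by `mask`
    #     so that pruned weights stay at zero.
```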

2. Structured Composition

Bias-only Fine-tuning
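
Bias-only fine-tuning (known as BitFit) freezes everything except the bias terms and, usually, the task head. A minimal sketch, assuming a Hugging Face Transformers model; the checkpoint name is a placeholder.

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Train only bias parameters and the classification head; freeze the rest.
for name, param in model.named_parameters():
    param.requires_grad = ("bias" in name) or ("classifier" in name)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable}/{total} ({100 * trainable / total:.2f}%)")
```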

3. Low-rank Composition

LoRA Model Architecture
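
A minimal sketch of a LoRA layer: the frozen pre-trained weight W is augmented with a low-rank update B·A, with A initialized randomly and B initialized to zero so that training starts from the pre-trained behavior. The rank r and scaling alpha below are placeholder values.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # freeze the pre-trained layer
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W x + (alpha / r) * B A x ; only A and B are trained.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```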

Input Composition

  1. Prompt Tuning
  • Standard prompting can be seen as finding a discrete text prompt that, when embedded using the model’s embedding layer, steers the model toward the desired output; prompt tuning instead learns these prompt embeddings directly as continuous parameters (a sketch follows the figures below).
  • Models are sensitive to the formulation and initialization of the prompt. [Webson & Pavlick, 2022]
Fine-tuning vs Prompt tuning [Li & Liang, 2021]
Prompt tuning vs standard fine-tuning and prompt design across T5 models of different sizes. [Lester et al., 2021]
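
A minimal sketch of prompt tuning: a small set of continuous prompt embeddings is learned and prepended to the embedded input while the model itself stays frozen. The prompt length and the initialization scheme are placeholders.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, embedding: nn.Embedding, prompt_len: int = 20):
        super().__init__()
        self.embedding = embedding
        # Initialize the continuous prompt from randomly chosen vocabulary embeddings.
        init_ids = torch.randint(0, embedding.num_embeddings, (prompt_len,))
        self.prompt = nn.Parameter(embedding(init_ids).detach().clone())

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tokens = self.embedding(input_ids)                       # (batch, seq, dim)
        prompt = self.prompt.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return torch.cat([prompt, tokens], dim=1)                # prepend the prompt
```

The resulting embeddings are passed to the frozen model (e.g. via inputs_embeds in Hugging Face Transformers), and only self.prompt receives gradient updates.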

2. Multi-Layer Prompt Tuning

Left: Prompt Tuning; Right: Multi-layer Prompt Tuning
  • In practice, continuous prompts are concatenated with the keys and values in the self-attention layer. [Li & Liang, 2021]
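
A minimal sketch of this per-layer variant (prefix tuning): learned prefix vectors are concatenated with the keys and values inside each self-attention layer. The number of heads, head dimension, and prefix length are placeholders.

```python
import torch
import torch.nn as nn

class Prefix(nn.Module):
    """Learned key/value prefixes for a single self-attention layer."""
    def __init__(self, n_heads: int = 12, head_dim: int = 64, prefix_len: int = 10):
        super().__init__()
        self.k = nn.Parameter(torch.randn(n_heads, prefix_len, head_dim) * 0.02)
        self.v = nn.Parameter(torch.randn(n_heads, prefix_len, head_dim) * 0.02)

    def extend(self, keys: torch.Tensor, values: torch.Tensor):
        # keys/values: (batch, heads, seq, head_dim); the prefix is prepended
        # along the sequence dimension (the attention mask must be extended too).
        b = keys.size(0)
        keys = torch.cat([self.k.unsqueeze(0).expand(b, -1, -1, -1), keys], dim=2)
        values = torch.cat([self.v.unsqueeze(0).expand(b, -1, -1, -1), values], dim=2)
        return keys, values
```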

Function Composition

  1. Adapters
  • The main purpose of the functions f added to a pre-trained model is to adapt it, so these functions are also known as ‘adapters’.
  • An adapter in a Transformer layer typically consists of a feed-forward down-projection, an activation function, and a feed-forward up-projection, combined with a residual connection (sketched below). [Houlsby et al., 2019]
Adapters in a Transformer-based model. [Houlsby et al., 2019]
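
A minimal sketch of such an adapter block; the hidden size and bottleneck dimension are placeholders.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)    # feed-forward down-projection
        self.activation = nn.GELU()                       # activation function
        self.up = nn.Linear(bottleneck, hidden_size)      # feed-forward up-projection

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Bottleneck transformation plus a residual connection.
        return hidden + self.up(self.activation(self.down(hidden)))
```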

2. Compact Adapter (Compacter)

  • Compacter reduces the number of adapter parameters by a factor of 10 while achieving similar or better performance.
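
It does so mainly by replacing the adapter's dense projections with parameterized hypercomplex multiplication (PHM) layers, whose weights are sums of Kronecker products of much smaller matrices; Compacter additionally shares and low-rank-factorizes some of these factors, which is omitted in the rough sketch below. All dimensions are placeholders.

```python
import torch
import torch.nn as nn

class PHMLinear(nn.Module):
    """A projection whose weight is a sum of Kronecker products of small matrices."""
    def __init__(self, in_dim: int = 768, out_dim: int = 64, n: int = 4):
        super().__init__()
        assert in_dim % n == 0 and out_dim % n == 0
        self.A = nn.Parameter(torch.randn(n, n, n) * 0.01)                      # n small (n x n) factors
        self.B = nn.Parameter(torch.randn(n, in_dim // n, out_dim // n) * 0.01) # n larger factors

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W = sum_i kron(A_i, B_i) has shape (in_dim, out_dim),
        # but is parameterized by far fewer trainable values than a dense matrix.
        W = sum(torch.kron(self.A[i], self.B[i]) for i in range(self.A.size(0)))
        return x @ W
```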

3. Sequential and Parallel Adapters

  • Adapters can be placed sequentially or in parallel (a sketch follows the figure captions below).
  • Sequential adapters are inserted after an existing function: h = adapter(f(x)).
  • Parallel adapters are applied alongside it: h = f(x) + adapter(x).
Two parallel adapters [Stickland & Murray, 2019]
A sequential adapter [Houlsby et al., 2019]
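
A minimal sketch of the two placements, where f stands for a frozen Transformer sub-layer and adapter for a bottleneck module like the one above.

```python
import torch

def sequential_adapter(f, adapter, x: torch.Tensor) -> torch.Tensor:
    # Sequential: the adapter transforms the output of the frozen function.
    return adapter(f(x))

def parallel_adapter(f, adapter, x: torch.Tensor) -> torch.Tensor:
    # Parallel: the adapter sees the same input and its output is added back.
    return f(x) + adapter(x)
```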

4. Benefits of Adapters:

  • More stable training: adapters are less sensitive to the choice of learning rate than full fine-tuning.
BERT test performance distributions over 20 runs with different learning rates [He et al., 2021]
  • Increased sample efficiency.
Results on GLUE with different numbers of training samples per task [Mahabadi et al., 2021]

Overall Comparisons

Parameter Composition

  1. Pros:
  • Methods like diff pruning require < 0.5% of the model’s parameters per task, which gives high parameter efficiency.
  • Pruning does not increase the model size, which preserves inference efficiency.
  • Some methods achieve strong performance.
  • Subnetworks can be composed.

2. Cons:

  • Pruning requires re-training iterations, which reduces the training efficiency.

Input Composition

  1. Pros:
  • Prompts add only a small number of parameters (relative to the model size), which is beneficial for parameter efficiency.
  • Continuous prompts can be composed.

2. Cons:

  • Prompt tokens take up part of the model’s context window and lengthen the input sequence, which reduces both training and inference efficiency.
  • Prompt tuning performs well on large models, but with smaller models it cannot match the performance of other PEFT methods.

Function Composition

  1. Pros:
  • Adapters do not require gradient updates for the frozen parameters, which increases training efficiency.
  • Adapters sometimes can match or outperform standard fine-tuning.
  • Adapters can be composed.

2. Cons:

  • The number of adapter parameters scales with the model’s hidden size, which limits parameter efficiency.
  • The added functions increase the number of operations, which reduces inference efficiency.
Overall comparison of all three composition methods.

References

Appendix

Papers (Suggested Readings)

Useful Links
