EMNLP 2022 Tutorial — “Modular and Parameter-Efficient Fine-Tuning for NLP Models”
Jun 16, 2023
- Instructors: Sebastian Ruder, Jonas Pfeiffer, Ivan Vulić
- Venue: EMNLP 2022
- Paper Link: https://aclanthology.org/2022.emnlp-tutorials.5.pdf
Abstract
As state-of-the-art NLP models grow ever larger, fine-tuning an entire model consumes substantial computational resources, which puts the power of large language models out of reach for practitioners with limited resources. Parameter-efficient fine-tuning relieves this limitation.
Parameter-Efficient Fine-Tuning (PEFT)
Compared to full fine-tuning, PEFT updates only a small subset of the model's parameters (or a small set of newly added parameters) while keeping the rest frozen.
Why Modularity?
- Models are increasing in size -> PEFT strategies
- Unseen scenarios -> Out-of-distribution generalization through module composition
- Catastrophic interference -> Modularity as inductive bias
- Efficient updating of models through added components
Computation Functions
The tutorial organizes modular and parameter-efficient methods by how a module composes with the model's computation: parameter composition, input composition, and function composition.
Parameter Composition
1. Sparse Sub-networks
- Sparsity: A common inductive bias on the module parameters.
- Pruning is the most common sparsity method
- Most common pruning criterion: weight magnitude https://arxiv.org/abs/1607.04381
2. Structured Composition
- Only modify the weights that are associated with a pre-defined group
- Example: bias-only fine-tuning such as BitFit [Ben-Zaken et al., 2022]
3. Low-rank Composition
- Models can be optimized in a low-dimensional, randomly oriented subspace rather than the full parameter space. Li et al. [2018]
- Example: Low-Rank Adaptation (LoRA) [Hu et al., 2022]; see the sketch after this list
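To make parameter composition concrete, here is a minimal PyTorch-style sketch (my own illustration, not code from the tutorial; all names and sizes are made up) showing how a magnitude-selected sparse update and a LoRA-style low-rank update both realize the same pattern: a frozen weight matrix plus a small trainable delta.

```python
import torch
from torch import nn

d_in, d_out, rank = 16, 16, 4
W = torch.randn(d_out, d_in)                   # frozen pre-trained weight

# (1) Sparse composition: only a magnitude-selected subset of entries
#     receives a trainable update (cf. magnitude pruning).
mask = (W.abs() >= W.abs().median()).float()   # keep the largest-magnitude entries
delta_sparse = nn.Parameter(torch.zeros_like(W))

# (2) Low-rank composition (LoRA-style): the update is B @ A with rank << d,
#     initialized so that the delta starts at zero.
A = nn.Parameter(0.01 * torch.randn(rank, d_in))
B = nn.Parameter(torch.zeros(d_out, rank))

x = torch.randn(2, d_in)
y_sparse = x @ (W + mask * delta_sparse).T     # sparse sub-network update
y_low_rank = x @ (W + B @ A).T                 # low-rank (LoRA-style) update
```

In both variants only the new parameters (delta_sparse, or A and B) receive gradients; W stays frozen, and the learned delta can be merged back into W after training, so inference cost is unchanged.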
Input Composition
1. Prompt Tuning
- Standard prompting can be seen as finding a discrete text prompt that, when embedded using the model's embedding layer, steers the frozen model toward the task; prompt tuning instead learns these prompt vectors directly in embedding space (a minimal sketch follows this list).
- Models are sensitive to the formulation and initialization of the prompt. [Webson & Pavlick, 2022]
- Prompt Tuning only works well at scale. [Mahabadi et al., 2021; Liu et al., 2022]
2. Multi-Layer Prompt Tuning
- Instead of learning parameters only at the input layer, we can learn them at every layer of the model [Li & Liang, 2021; Liu et al., 2022]
- In practice, continuous prompts are concatenated with the keys and values in the self-attention layer [Li & Liang, 2021]
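As an illustration of input composition, below is a small sketch of single-layer prompt tuning (again a simplification with made-up names and dimensions, not the tutorial's code): a handful of trainable vectors in embedding space are prepended to the embedded input of an otherwise frozen model.

```python
import torch
from torch import nn

vocab_size, d_model, prompt_len = 1000, 64, 8

embedding = nn.Embedding(vocab_size, d_model)        # frozen embedding layer
embedding.weight.requires_grad_(False)

# Trainable continuous prompt: prompt_len free vectors in embedding space,
# here initialized from the embeddings of the first prompt_len vocabulary items.
soft_prompt = nn.Parameter(embedding.weight[:prompt_len].clone())

input_ids = torch.randint(0, vocab_size, (2, 20))    # a batch of token ids
x = embedding(input_ids)                             # (2, 20, d_model)
prompt = soft_prompt.unsqueeze(0).expand(x.size(0), -1, -1)
x = torch.cat([prompt, x], dim=1)                    # (2, 28, d_model)
# x is then fed to the frozen Transformer; only soft_prompt receives gradients.
```

Multi-layer variants such as prefix tuning learn analogous vectors that are concatenated to the keys and values inside every self-attention layer instead of only at the input.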
Function Composition
1. Adapters
- The main purpose of functions f added to a pre-trained model is to adapt it → such functions are also known as ‘adapters’
- An adapter in a Transformer layer typically consists of a feed-forward down-projection W_down, a feed-forward up-projection W_up, and an activation function, applied with a residual connection: adapter(h) = h + W_up · σ(W_down · h). [Houlsby et al., 2019] See the sketch after this list.
2. Compact Adapter (Compacter)
- Compacter [Mahabadi et al., 2021] reparameterizes the adapter's W matrices as a sum of Kronecker products, W = A₁ ⊗ B₁ + … + Aₙ ⊗ Bₙ, where the small Aᵢ are shared across all layers and each Bᵢ is further factorized into two low-rank matrices.
- Compacter reduces adapter parameters by a factor of 10 and achieves similar or better performance
3. Sequential and Parallel Adapters
- Adapters can be routed sequentially or in parallel.
- Sequential adapters are inserted between functions: the adapter g transforms the output of the preceding function, i.e. y = g(f(x)).
- Parallel adapters are applied in parallel: the adapter's output is added to that of the function, i.e. y = f(x) + g(x).
4. Benefits of Adapters:
- Increase model robustness. [He et al., 2021; Han et al., 2021]
- Increase sample efficiency.
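To tie the adapter pieces together, here is a brief PyTorch sketch of a Houlsby-style bottleneck adapter and the two insertion patterns; it is a simplified illustration with my own naming, not the tutorial's reference code.

```python
import torch
from torch import nn

class BottleneckAdapter(nn.Module):
    """Bottleneck adapter: down-projection, nonlinearity, up-projection."""
    def __init__(self, d_model: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()
        nn.init.zeros_(self.up.weight)   # start close to the identity mapping
        nn.init.zeros_(self.up.bias)

    def delta(self, h: torch.Tensor) -> torch.Tensor:
        # Adapter transformation without the residual connection.
        return self.up(self.act(self.down(h)))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Residual form used when the adapter is inserted sequentially.
        return h + self.delta(h)

d_model = 64
f = nn.Linear(d_model, d_model)          # stand-in for a frozen sub-layer f
for p in f.parameters():
    p.requires_grad_(False)
g = BottleneckAdapter(d_model)           # trainable adapter g

x = torch.randn(2, 10, d_model)
y_sequential = g(f(x))                   # sequential: y = g(f(x))
y_parallel = f(x) + g.delta(x)           # parallel:   y = f(x) + g(x)
```

Only the adapter's parameters are trained; because the up-projection is zero-initialized, the adapted model initially computes the same function as the frozen one.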
Overall Comparisons
Parameter Composition
1. Pros:
- Methods like diff pruning modify < 0.5% of the parameters per task, which gives high parameter efficiency.
- Pruning does not increase the model size, so inference efficiency is preserved.
- Some methods achieve strong performance.
- Subnetworks can be composed.
2. Cons:
- Pruning requires re-training iterations, which reduces the training efficiency.
Input Composition
1. Pros:
- Prompts add only a small number of parameters (relative to the model size), which benefits parameter efficiency.
- Continuous prompts can be composed.
2. Cons:
- Prompt tokens lengthen the input sequence and take up part of the model’s context window, which reduces training and inference efficiency.
- Prompt tuning performs well with large models, but with smaller models it cannot match the performance of other PEFT methods.
Function Composition
1. Pros:
- Adapters do not require gradients for the frozen parameters, which increases training efficiency.
- Adapters can sometimes match or outperform standard fine-tuning.
- Adapters can be composed.
2. Cons:
- Adapter size depends on the model’s hidden dimension, which limits parameter efficiency.
- The added functions increase the number of operations, which reduces inference efficiency.
Appendix
Papers (Suggested Readings)
- [1902.00751] Parameter-Efficient Transfer Learning for NLP (arxiv.org)
- [2005.14165] Language Models are Few-Shot Learners (arxiv.org)
- [2109.01652] Finetuned Language Models Are Zero-Shot Learners (arxiv.org)
- [2101.00190] Prefix-Tuning: Optimizing Continuous Prompts for Generation (arxiv.org)
- [2109.04332] PPT: Pre-trained Prompt Tuning for Few-shot Learning (arxiv.org)
- [2110.07602] P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks (arxiv.org)
- [2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (arxiv.org)
- [2205.11916] Large Language Models are Zero-Shot Reasoners (arxiv.org)