An alternative to fine-tuning? Prefix-tuning may be your answer!

Vincent Chen
6 min read · Feb 18, 2022


A summary of the paper "Prefix-Tuning: Optimizing Continuous Prompts for Generation"

Fine-tuning illustration from the Prefix-Tuning paper

Fine-tuning has been the standard paradigm for adapting pre-trained language models (LMs) to downstream tasks: simply continue training the model on the downstream dataset, which generally yields good results. This got me thinking: is fine-tuning the only way to learn with language models? A recent paper accepted at ACL 2021, Prefix-Tuning: Optimizing Continuous Prompts for Generation, answers this question with a new approach. This article introduces the relevant background and summarizes the paper. Feel free to discuss with me or correct my mistakes ~

What can we improve over fine-tuning LMs? Space and Data Efficiency
Fine-tuning a language model on a downstream task requires updating all of its parameters. If we want to save the adapted model for every task, each copy costs as much storage as the pre-trained LM itself, which can be gigantic. In recent years, alternative methods have been introduced to address this problem. One example is shown below.

  • Adapter-tuning: This method adapts the LM by inserting small task-specific layers into it while keeping the original LM parameters frozen. These task-specific layers can be bottleneck feed-forward networks that add only approximately 2% of the original LM's parameters. Adapter-tuning is therefore space-efficient while achieving performance comparable to fine-tuning; a minimal sketch of such an adapter block follows the illustration below.
Adapter illustration from Parameter-Efficient Transfer Learning for NLP
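To make this concrete, here is a minimal PyTorch sketch of an adapter block (the class name, bottleneck size, and activation are my own choices for illustration, not the exact architecture from the paper): a small bottleneck feed-forward network with a residual connection, inserted into an otherwise frozen LM.

```python
# A minimal adapter block sketch: down-project, non-linearity, up-project, residual.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down to a small bottleneck
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up to the LM's hidden size
        self.act = nn.ReLU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # The residual connection means the adapter only needs to learn a small
        # task-specific correction on top of the frozen layer's output.
        return h + self.up(self.act(self.down(h)))

# Only these small adapter layers are trained per task; the backbone LM stays frozen,
# so each new task costs roughly the size of the bottleneck layers.
```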

Hence, keeping the original LM parameters frozen solves the problem of duplicating storage across downstream tasks. However, achieving good performance with fine-tuning also requires sufficient downstream data, which calls into question how well fine-tuning performs in low-resource scenarios.

  • Prompting: GPT-3 is a language model trained on an extremely large corpus. It can perform few-shot learning when given only natural language instructions and a handful of examples, generating the output for the downstream task without any parameter updates. In low-resource settings it can still produce coherent output, unlike fine-tuning. A detailed visualization can be found at https://jalammar.github.io/how-gpt3-works-visualizations-animations/, and a toy prompt in this style is sketched after the illustration below.
Illustration from https://jalammar.github.io/
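To make the idea concrete, a few-shot prompt in the style used by the GPT-3 paper might look like the snippet below: the task description and a few demonstrations are simply prepended to the query, and no model parameters are updated.

```python
# A toy few-shot prompt in the GPT-3 style; the "training data" lives entirely in the prompt.
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)
# A sufficiently large LM is expected to continue this prompt with "fromage".
```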

Motivation for Prefix-tuning

If we want to steer the language model toward generating a certain target word, prepending the right phrase can increase the conditional probability of that output. For example, if we want the language model to generate the word "Hogwarts", we can prepend a phrase like "Harry Potter graduated from" as a prefix to raise its probability. Hence, with the proper context (prompt) prepended, we can likely steer the LM toward a downstream task.
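As a quick sanity check of this intuition, here is a small sketch (assuming the Hugging Face transformers package and the public gpt2 checkpoint; the helper function is my own) that compares how likely the target word is with and without the steering prefix.

```python
# Compare the next-token probability of a target word with and without a steering prefix.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def next_token_prob(context: str, target: str) -> float:
    """Probability that the first BPE sub-token of `target` follows `context`."""
    input_ids = tokenizer(context, return_tensors="pt").input_ids
    target_id = tokenizer(" " + target).input_ids[0]  # leading space matters for GPT-2's BPE
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]       # logits for the next-token position
    return torch.softmax(logits, dim=-1)[target_id].item()

# The steering prefix should make "Hogwarts" far more probable than a generic context does.
print(next_token_prob("Harry Potter graduated from", "Hogwarts"))
print(next_token_prob("He graduated from", "Hogwarts"))
```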

Here, the authors argue that the best way to optimize this context is to define the prompt as continuous vectors rather than discrete words or fixed word embeddings. As a result, adapting to a downstream task reduces to finding a way to optimize these continuous prompts.

Optimizing the Prompts

Prefix-tuning illustration from the Prefix-Tuning paper

For an autoregressive model like GPT-2, prefix-tuning prepends trainable prefixes (continuous prompts) in front of the input x and output y, producing prefix activations h1 and h2; in this illustration the prefix length is 2. These prefix activations are stored in a trainable matrix P_θ whose dimensions are the prefix length times the dimension of the activation vector. For all other indices, the activations are computed exactly as in the regular fine-tuned model, using the LM parameters φ.

Equation for the activations in the LM
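Written out (my reconstruction of the equation pictured above, following the paper's notation), the activation rule is:

```latex
h_i =
\begin{cases}
  P_\theta[i, :] & \text{if } i \in \mathrm{P_{idx}} \\
  \mathrm{LM}_\phi(z_i, h_{<i}) & \text{otherwise}
\end{cases}
```

Here z = [PREFIX; x; y] is the concatenated sequence and P_idx is the set of prefix indices: prefix positions read their activations directly from the trainable matrix P_θ, while every other position is computed by the frozen LM with parameters φ.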

For an encoder-decoder model like BART, we need to prepend prefixes to both the encoder and the decoder. Intuitively, the prefix added to the encoder guides what information to extract from the input, while the prefix added to the decoder guides the text generation.

Prefix-tuning uses the same loss as fine-tuning, but only the θ parameters of the prefix matrix are trained; the φ parameters belong to the original language model and stay frozen. In other words, the training objective is optimized with respect to the prefix alone! A sketch of this setup is shown below.
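Here is a minimal PyTorch + Hugging Face sketch (the class and variable names are mine, not the authors' code) of this setup for GPT-2: the frozen LM receives the trainable prefix as virtual past key/value states, and the optimizer only ever sees the prefix parameters θ.

```python
# Prefix-tuning sketch for GPT-2: freeze the LM (phi), train only a small prefix (theta),
# and pass the prefix to the model as if it were cached key/value states of earlier tokens.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class PrefixTunedGPT2(nn.Module):
    def __init__(self, prefix_len: int = 10):
        super().__init__()
        self.lm = GPT2LMHeadModel.from_pretrained("gpt2")
        for p in self.lm.parameters():          # freeze phi: the original LM weights
            p.requires_grad = False
        cfg = self.lm.config
        self.prefix_len = prefix_len
        head_dim = cfg.n_embd // cfg.n_head
        # theta: one trainable key and value vector per layer, head, and prefix position.
        self.prefix = nn.Parameter(
            0.02 * torch.randn(cfg.n_layer, 2, cfg.n_head, prefix_len, head_dim)
        )

    def forward(self, input_ids, labels=None):
        bsz = input_ids.size(0)
        # Expand the prefix over the batch and hand it to GPT-2 as past_key_values.
        past = tuple(
            (layer[0].unsqueeze(0).expand(bsz, -1, -1, -1),
             layer[1].unsqueeze(0).expand(bsz, -1, -1, -1))
            for layer in self.prefix
        )
        attention_mask = torch.ones(
            bsz, self.prefix_len + input_ids.size(1), device=input_ids.device
        )
        return self.lm(input_ids, past_key_values=past,
                       attention_mask=attention_mask, labels=labels)

model = PrefixTunedGPT2()
# The optimizer only ever sees theta; the LM weights phi are never updated.
optimizer = torch.optim.AdamW([model.prefix], lr=5e-5)
```

For training stability, the paper actually reparametrizes the prefix through a small MLP from a lower-dimensional matrix and discards the MLP after training; that detail is omitted in this sketch.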

Experiment Settings

The authors chose table-to-text generation and summarization to test prefix-tuning on text generation tasks. For table-to-text they used GPT-2 as the base model, and for summarization they used BART.

  • Table-to-text Datasets: E2E, DART, WebNLG
  • Summarization Dataset: XSUM

Main Results

  • Table-to-text results: For all evaluation metrics except TER, higher is better. Prefix-tuning only needs to store an additional 0.1% of the parameters while remaining competitive, even surpassing state-of-the-art results.
Results for Table-to-text task
  • Summarization results: Summarization is evaluated with ROUGE, where higher scores also indicate better performance. In this setting, however, even increasing the prefix parameters to 2% of the whole LM still underperforms fine-tuning. The suspected reasons are as follows:
  1. The dataset is more complex: Compared with table-to-text, summarization is more complex and harder to learn, so the prompts may lack the expressiveness needed to guide the LM on this task.
  2. The documents are 17x longer: Lengthening the prefix also shortens the input the model can attend to, so summarization datasets with much longer inputs may further limit the prefix's expressiveness.
Results for Summarization task

Low Data Scenario Results

Comparing the summarization and table-to-text results, one big difference between the two settings is dataset size. The authors therefore subsample E2E (table-to-text) and XSUM (summarization) to create smaller datasets. In the figure below, the top-right part shows the results for table-to-text.

Results for low data settings

Looking at the generated examples on the left-hand side, both methods tend to under-generate because of the small dataset size, but prefix-tuning is still more faithful than fine-tuning: in FT (100, 200), the fine-tuned model claims that the customer rating is low, while the input states that the customer rating is average.

Extrapolation Results

The authors also investigated extrapolation to unseen topics for both table-to-text and summarization, splitting the training and test sets by topic. For example, the model may be trained on the arts category and then tested on the architecture category. Each category has its own unique nouns and predicates, so this extrapolation setting is a meaningful test of generalization. In the results, S stands for training on seen topics and testing on seen topics; the column to watch is U, where the model is tested on unseen topics.


Conclusion and Discussion

One speculation for prefix-tuning's strong performance is that preserving the language model parameters is important for generalizing to new downstream tasks. The pre-trained language model contains extensive knowledge from diverse corpora, and this knowledge helps it understand downstream tasks, reducing the need for extra parameters and boosting performance.

Prefix-tuning may also have potential for multi-task learning and for further improvements on low-resource or complex tasks. It is really exciting to see a chance to put powerful language models to even more use!
