Prompt-based Learning
A paradigm shift in Natural Language Processing
With the advent of the language model GPT-3¹ in 2020 [1], it became clear that you no longer need to be a Data Scientist to create powerful applications with Natural Language Processing (NLP). In fact, you might not even need to be a programmer, at least not in the traditional sense. In this article, I will give a brief introduction to the concept of Prompt-based Learning, an NLP approach that saw a surge of interest during 2021. I will also provide two examples of how it can be used, and at the end of this post I give a glimpse of the current state of the art.
In Prompt-based Learning, the textual input data is modified by concatenating it with a carefully written, human-readable instruction text. The resulting text string is called the prompt. With this prompt we can make a pre-trained language model perform tasks it was never explicitly trained for, such as question answering, machine translation or text generation. And this can be done simply by telling the model what to do, in plain natural language!
¹ Generative Pre-trained Transformer 3
The Prompt-based Learning method
First, the input data x is passed to a prompting function, which inserts x into a prompt template: a string with some additional information and an “unfilled slot” [Z] that is to be predicted by the model. This yields a prompt x′, which is then passed to a pre-trained language model. Given a set of pre-defined answers z and the pre-trained parameters Θ, the model computes P(z | x′; Θ), the probability that each answer should fill the slot [Z]. The most probable answer ẑ is then mapped to a class label y, giving the final predicted output ŷ. This is the most basic form of the method, as described in the 2021 research paper where the term Prompt-based Learning was first coined [2]. Figure 1 shows the different steps of the Prompt-based method for classifying a news article into a topic.
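To make these steps concrete, here is a minimal sketch in Python of how such a pipeline could look with a masked language model from the HuggingFace transformers library. The template, answer set and label mapping are hypothetical choices for illustration, not the ones used in [2]:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical prompt template, answer set and answer-to-label mapping.
TEMPLATE = "{x} This article is about [Z]."
ANSWERS = ["sports", "politics", "business", "tech"]
LABEL_MAP = {"sports": "Sports", "politics": "Politics",
             "business": "Business", "tech": "Technology"}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def classify(x: str) -> str:
    # Prompting function: insert x into the template; the [Z] slot becomes BERT's [MASK] token.
    prompt = TEMPLATE.format(x=x).replace("[Z]", tokenizer.mask_token)
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]  # scores over the vocabulary at [Z]
    # P(z | x'; Θ): compare only the pre-defined answers and pick the most probable ẑ.
    answer_ids = [tokenizer.convert_tokens_to_ids(a) for a in ANSWERS]
    z_hat = ANSWERS[logits[answer_ids].argmax().item()]
    return LABEL_MAP[z_hat]  # answer mapping: ẑ -> ŷ

print(classify("The central bank raised interest rates again this quarter."))
```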
Under the previous “Pre-train, Fine-tune” paradigm in NLP, pre-trained language models were fine-tuned with additional training data for each downstream task. The focus was on the question “How should we fine-tune language models?”, a job that requires some expertise in NLP and Deep Learning. Fine-tuning is a supervised task and therefore requires labeled data: datapoints annotated with the correct outputs, usually by humans. This is a problem, because labeled data might not be available to us and can be expensive to collect. Prompt-based Learning, however, essentially shifts the focus to:
“How should we design the input to language models?”
By clever design of the prompt templates, i.e. prompt engineering or prompt design, we can adapt pre-trained language models to perform new tasks without any extensive fine-tuning. This way, we circumvent the need for additional labeled data. Using a language model without fine-tuning is called Zero-/Few-Shot Learning, because we provide zero or just a few extra training datapoints to the model before using it for prediction. Smaller fine-tuned models like T5¹ still outperform a larger Few-Shot GPT-3 on NLP benchmarks like SuperGLUE, but removing the need for extra labeled data is an advantage that could make Prompt-based Learning preferable in many applications.
¹ Text-to-Text Transfer Transformer
Prompt-based Zero-Shot Text Classification
In a paper from 2021 [3], the authors showed how a text classification task can be transformed into a Masked Language Modelling (MLM) problem. MLM is the task of predicting words that have been “masked” away in sentences. MLM training data can be generated at large scale in an unsupervised fashion, by programmatically masking words from sentences in a large body of text; no human labelling effort is needed. This makes MLM a popular pre-training objective for language models such as BERT¹. It also means that those models are already implicitly trained for our Prompt-based text classification!
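As a simplified sketch of how such training pairs can be generated from raw text without any labels (BERT's actual masking recipe is slightly more involved, e.g. it sometimes keeps or randomly replaces the chosen tokens instead of always masking them):

```python
import random

def make_mlm_example(tokens, mask_token="[MASK]", mask_prob=0.15):
    # Hide a random subset of tokens; the hidden originals become the targets.
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)  # the model sees the mask...
            targets.append(tok)        # ...and must predict the original word
        else:
            inputs.append(tok)
            targets.append(None)       # no prediction needed at this position
    return inputs, targets

print(make_mlm_example("the cat sat on the mat".split()))
```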
Later in 2021, a Master's thesis [4] from KTH², carried out at RISE³, further explored this methodology and found that small variations in the design of the prompt templates can have a significant impact on the result: merely omitting a punctuation mark could decrease the prediction accuracy by ten percentage points.
Notebook 1 shows how the MLM method can be used to classify news articles into topics with a pre-trained BERT from the HuggingFace 🤗 library. This is the method described in Figure 1, except that the answer mapping step has been skipped. The notebook is hosted on both Kaggle and Deepnote; if you're familiar with those platforms, follow a link and try modifying the prompts or adding an answer mapper to see how it affects the classification.
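Condensed to its essence, and skipping the answer mapping just like the notebook does, the approach can be expressed with HuggingFace's fill-mask pipeline. The template and candidate answers below are illustrative; the notebook's exact choices may differ:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
prompt = "Stocks rallied after the strong earnings report. This article is about [MASK]."
# Restrict the predictions to our candidate topics and print their scores.
for pred in fill_mask(prompt, targets=["sports", "politics", "business"]):
    print(pred["token_str"], round(pred["score"], 4))
```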
¹ Bidirectional Encoder Representations from Transformers
² Royal Institute of Technology in Stockholm, Sweden
³ Research Institutes of Sweden
Prompt-based Few-Shot Text Generation with GPT-3
Here I’ve made a notebook with an example of how OpenAI’s GPT-3 API can be used to create a headline for this Medium article. The first two lines of the prompt template hold some information about the article. The learned parameters of GPT-3 are “frozen”, so we do not update the model, but since we pass it some additional datapoints in the prompt, this is Few-Shot learning. The third line is where the article text is inserted, and the bottom line is the instruction to the model:
“Suggest an attention catching headline for this article:”
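Under the hood, the notebook makes a completion call along these lines. This is a minimal sketch using the openai Python package as it looked at the time of writing; the engine name and sampling parameters are illustrative assumptions, not necessarily the notebook's exact settings:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # obtained after registering at OpenAI

prompt = (
    "Title: Prompt-based Learning\n"
    "Subtitle: A paradigm shift in Natural Language Processing\n"
    "<article text goes here>\n"
    "Suggest an attention catching headline for this article:"
)

response = openai.Completion.create(
    engine="davinci",   # assumption: one of the available GPT-3 engines
    prompt=prompt,
    max_tokens=20,
    temperature=0.7,
    n=3,                # return the three best suggestions
)
for choice in response.choices:
    print(choice.text.strip())
```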
I set the API to return the three best suggestions, and as you can see, the model comes up with some decent headlines. With some more work on the prompt design, the headlines might even pass the Turing test. The full notebook can be found on Deepnote. Note that you will need to register at OpenAI and that the API calls cost money, but there is a free trial, so I highly recommend trying it out.
The next step: Prompt Tuning
A natural next question is whether there are any systematic methods for finding the best prompt templates. It turns out there are: in a technique called Prompt Tuning, the prompts themselves have learnable parameters, which are learned through the classic Deep Learning approach of backpropagation. A paper from 2021 [5] showed that this method outperforms basic prompt engineering, and that as the size (number of parameters) of the language model increases, its performance catches up with regular fine-tuning. Prompt Tuning is a highly interesting topic that could be explored in a future article here.
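To give a flavour of the idea, here is a minimal PyTorch sketch of such learnable “soft” prompt vectors. The sizes and initialization are illustrative; a real implementation would feed the concatenated embeddings through the frozen language model and backpropagate a task loss into the prompt vectors only:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    # Learnable prompt embeddings prepended to the input embeddings.
    def __init__(self, n_prompt_tokens=20, hidden_dim=768):
        super().__init__()
        # The only trainable parameters: one vector per "soft token".
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, hidden_dim) * 0.02)

    def forward(self, input_embeds):
        # input_embeds: (batch, seq_len, hidden_dim), from the frozen model's embedding layer
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

soft_prompt = SoftPrompt()
# Only the prompt parameters are optimized; the language model stays frozen.
optimizer = torch.optim.Adam(soft_prompt.parameters(), lr=0.3)
```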
Author
Vilhelm Gustavsson is an MSc student in Machine Learning at the KTH Royal Institute of Technology in Stockholm.
References
[1] Brown, Tom B. et al. "Language Models are Few-Shot Learners" (2020). https://papers.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[2] Liu, Pengfei et al. "Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing" (2021). https://arxiv.org/pdf/2107.13586.pdf
[3] Schick, Timo and Schütze, Hinrich. "Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference" (2021). https://aclanthology.org/2021.eacl-main.20.pdf
[4] Åslund, Jacob. "Zero/Few-Shot Text Classification: A Study of Practical Aspects and Applications" (2021). https://www.diva-portal.org/smash/get/diva2:1613200/FULLTEXT01.pdf
[5] Lester, Brian et al. "The Power of Scale for Parameter-Efficient Prompt Tuning" (2021). https://aclanthology.org/2021.emnlp-main.243.pdf