Multitask Prompted Learning

How are large language models trained?

Jesse Jing
4 min read · Feb 15, 2022

Background

In the modern NLP field, it's all about transfer learning. The main advantage of current neural-network-based models is scalability: we can simply train a larger model on a larger dataset. Thankfully, we have mature self-supervised learning frameworks, and text data is beyond abundant on the internet. For example, the Common Crawl project produces about 20TB of text extracted from web pages each month.

Thus, in recent years, NLP researchers have focused on developing transfer learning methodology. It all started with this paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (link).

To put it briefly, it's a unifying framework that casts every text-processing problem as a "text-to-text" problem: given a sequence of text as input, the model outputs a sequence of text. This allows the same model, objective, training procedure, and decoding process to be applied to every common NLP task (including translation, question answering, classification, etc.).

T5 stands for "Text-to-Text Transfer Transformer" (link)
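To make the text-to-text idea concrete, here is a minimal sketch (mine, not from the paper) using the Hugging Face transformers library. The task prefixes below ("translate English to German:", "cola sentence:") are among those T5 was trained with, so one checkpoint handles both tasks with the same generate-and-decode procedure.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

examples = [
    "translate English to German: The house is wonderful.",  # machine translation
    "cola sentence: The book did not sold well.",             # grammatical acceptability
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Nothing task-specific changes between the two calls except the input text itself; the task is encoded in the prompt.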

Prompt

The development of text-to-text pre-trained models such as T5 makes prompts a particularly useful tool for multitask learning. As shown above, we take datasets from different tasks and prepend instructional descriptions to the raw training examples. The pre-processed training data is called a prompt.

What matters is that the input is human-readable(!), and we have come a long way: current large NLP models read like humans and answer like humans. Think of the large pre-trained model as baby Yoda: it is very powerful with respect to the general-purpose knowledge contained inside it, but we don't know what kind of input (prompt) will trigger that superpower on a downstream task.


For each kind of dataset (or category of tasks), the approach chooses a prompt template to convert raw training examples into prompts.

From the paper (link)
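As an illustration, here is my own toy template for an NLI-style record (not one of the paper's templates; those are collected in the authors' PromptSource repository). It shows how raw fields are rewritten into a natural-language input/target pair.

```python
def apply_nli_template(example):
    """Rewrite a raw (premise, hypothesis, label) record into a prompted pair."""
    prompt = (
        f"{example['premise']}\n\n"
        f"Question: Does this imply that \"{example['hypothesis']}\"? Yes, no, or maybe?"
    )
    # Common NLI label encoding: 0 = entailment, 1 = neutral, 2 = contradiction.
    target = ["Yes", "Maybe", "No"][example["label"]]
    return prompt, target


raw = {
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A person is performing music.",
    "label": 0,
}
print(apply_nli_template(raw))
```

The paper collects many such templates per dataset, written by different contributors, so the model sees a variety of wordings for the same underlying task.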

Zero-shot Task Generalization

Building on this previous work, the paper Multitask Prompted Training Enables Zero-Shot Task Generalization (link) focuses on explicitly training language models in a supervised and massively multitask fashion. It tries to answer two questions:

  1. Does multitask prompted training improve generalization to unseen tasks?
  2. Does training on a wider range of prompts improve robustness to prompt wording?

It conducts experiments in the following fashion: group the datasets by task, train on some of the groups, and test the model on the held-out datasets. By making sure that no data leakage happens, it can test the model's ability to generalize zero-shot.
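A toy sketch of that protocol, with illustrative dataset names rather than the paper's exact mixture:

```python
# Group datasets by task; hold out whole task groups for zero-shot evaluation.
task_groups = {
    "sentiment": ["imdb", "rotten_tomatoes", "amazon_polarity"],
    "paraphrase": ["qqp", "paws", "mrpc"],
    "natural_language_inference": ["anli", "cb", "rte"],  # held out
    "coreference": ["wsc", "winogrande"],                 # held out
}
held_out_tasks = {"natural_language_inference", "coreference"}

train_datasets = [d for task, ds in task_groups.items() if task not in held_out_tasks for d in ds]
eval_datasets = [d for task, ds in task_groups.items() if task in held_out_tasks for d in ds]

# No dataset appears on both sides of the split, so there is no data leakage.
assert not set(train_datasets) & set(eval_datasets)
print("train on:", train_datasets)
print("evaluate zero-shot on:", eval_datasets)
```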

What is a “task”?

From the paper (link): "We use the term 'task' to refer to a general NLP ability that is tested by a group of specific datasets. To evaluate zero-shot generalization to new tasks, we train on a subset of tasks and evaluate on a held-out group of tasks."

Yellow datasets are in the training mixture. Green datasets are held out and represent tasks that were not seen during training. Zero-shot task generalization experiments are evaluated on green datasets. (link)
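For intuition, here is a minimal zero-shot inference sketch with the T0 checkpoint released alongside the paper (bigscience/T0_3B on the Hugging Face Hub). The NLI-style prompt is my own wording for a task type that was held out of T0's multitask training.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B")

# An NLI-style query: this kind of task was never seen during T0's training.
prompt = (
    "A man is playing a guitar on stage.\n"
    "Question: Does this mean that a person is performing music? Yes or no?"
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```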

Model Training Details

Keywords: encoder-decoder architecture, autoregressive decoding, maximum likelihood training

From the paper: "All models we trained are based on T5, a Transformer-based encoder-decoder language model pre-trained with a masked language modeling-style objective on 1T tokens from C4 (Raffel et al., 2020). Since T5's pretraining objective involves filling in tokens from the input text that has been removed, it is quite different from the conditional text generation format used in our prompted datasets. We therefore use the publicly available LM-adapted T5 model from Lester et al. (2021) (referred to as T5+LM), which was produced by training T5 on 100B additional tokens from C4 on a standard language modeling objective. Unless specified otherwise, we use the XXL version which has 11B parameters."
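To illustrate the training setup (encoder-decoder, autoregressive decoding, maximum likelihood), here is a sketch of one training step on a single prompted example. This is not the paper's training code: a small LM-adapted checkpoint stands in for the 11B-parameter T5+LM so the snippet runs on modest hardware, and the checkpoint name google/t5-small-lm-adapt is an assumption about the public Hub release.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed name of a small LM-adapted T5 checkpoint; the paper uses the 11B (XXL) version.
checkpoint = "google/t5-small-lm-adapt"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

prompt = (
    "Review: The movie was a delight from start to finish.\n"
    "Is this review positive or negative?"
)
target = "positive"

inputs = tokenizer(prompt, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# The encoder reads the prompt; the decoder is trained autoregressively to
# maximize the likelihood of the target tokens (cross-entropy loss).
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print("loss:", float(loss))
```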

(To be continued on Performance)

