Summary of Text To Text Transfer Transformer — T5

A brief overview of Google's T5 transformer

Gundluru Chadrasekhar
Scavs.ai
3 min read · Sep 1, 2020


The basic idea behind the Text-to-Text Transfer Transformer (T5) is to cast every NLP problem as a text-to-text task, much like a sequence-to-sequence model.

Text-to-Text framework:

T5 uses the same model for all of these tasks; we tell the model which task to perform by prepending a task-specific prefix, which is itself just text.

For example, if we want to use T5 for the classification task of predicting whether a sentence is grammatically acceptable (CoLA), adding the prefix "cola sentence: " takes care of it, and the model returns one of two texts as output: 'acceptable' or 'unacceptable'.

Interestingly, T5 also handles the two-sentence similarity regression task (STS-B) in the text-to-text framework. The authors posed it as a classification problem with 21 classes (scores from 1 to 5 in 0.2 increments, e.g. '1.0', '1.2', '1.4', ..., '5.0') and asked the model to predict the score as a string; T5 achieved SOTA results on this task too.

Similarly for other tasks: the prefix 'summarize: ' returns a summary of an article, and for machine translation the prefix is 'translate English to German: '. A minimal sketch of this interface is shown below.
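
To make the interface concrete, here is a minimal sketch using the Hugging Face transformers library and the public t5-base checkpoint (both are assumptions here, not something the original walkthrough specifies); the prompts follow the task prefixes described above.

```python
# A minimal sketch of T5's text-to-text interface, assuming the Hugging Face
# `transformers` library and the public "t5-base" checkpoint.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Every task is text in, text out; only the prefix tells the model what to do.
prompts = [
    "cola sentence: The books is on the table.",   # acceptability -> 'acceptable' / 'unacceptable'
    "stsb sentence1: A man is playing a guitar. sentence2: A man plays guitar.",  # similarity -> a score string, e.g. '4.6'
    "translate English to German: The house is wonderful.",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The point is that classification, regression, and translation all go through exactly the same generate-text interface; only the prefix changes.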

T5 Pretraining and Finetuning:

Q) What's new in T5?

Ans) Nothing

Yeah, it's true: T5 uses the vanilla Transformer architecture. Then how did they get SOTA results? The main motivation behind the T5 work is ...

Given the current landscape of transfer learning for NLP, what works best, and how far can we push the tools we have? (Colin Raffel)

T5-Base, with BERT-Base-sized encoder and decoder stacks totaling about 220 million parameters, was used to experiment with a wide variety of NLP techniques during pre-training and fine-tuning. To see all the experiments performed on T5-Base, check out the paper.

Summing up the best outcomes from the T5 experiments:

  1. Large dataset for pre-training: an important ingredient for transfer learning is the unlabeled dataset used for pre-training. T5 uses text extracted from Common Crawl (the C4 corpus), which comes to roughly 750 GB after cleaning: deduplication, discarding incomplete sentences, and removing offensive or noisy content.
  2. Architectures: they experimented with encoder-decoder models and decoder-only language models similar to GPT, and found that encoder-decoder models did best.
  3. Unsupervised objectives: T5 uses masked language modeling (a BERT-style denoising objective, implemented as span corruption) as its pre-training objective, and it worked best; they also experimented with alternatives such as the permutation language modeling objective that XLNet uses (see the sketch after this list).
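
To illustrate item 3, here is a toy sketch of the denoising objective T5 settles on, span corruption: contiguous spans of the input are replaced with sentinel tokens, and the target is the sequence of sentinels followed by the tokens they replaced. The sentinel names (<extra_id_N>) follow the Hugging Face convention, the example sentence mirrors the paper's illustration, and the span positions are hand-picked here for clarity, whereas the paper samples them randomly (about 15% of tokens, mean span length 3).

```python
# Toy illustration of T5's span-corruption objective: contiguous spans of the
# input are replaced with sentinel tokens, and the decoder's target is each
# sentinel followed by the tokens it replaced, plus a final closing sentinel.

def span_corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, end) index pairs to drop."""
    corrupted, target = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        corrupted.extend(tokens[cursor:start])  # keep unmasked tokens
        corrupted.append(sentinel)              # replace the span with a sentinel
        target.append(sentinel)                 # target: sentinel + dropped tokens
        target.extend(tokens[start:end])
        cursor = end
    corrupted.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")   # closing sentinel ends the target
    return corrupted, target

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, spans=[(2, 4), (8, 9)])  # drop "for inviting" and "last"
print(" ".join(inp))  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(tgt))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```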

Finally,

Insights + Scale = State-of-the-Art

T5 further explores scaling the model up, with d_model = 1024, 24-layer encoder and decoder stacks, and d_kv = 128. The T5-3B variant uses d_ff = 16,384 and 32-headed attention, which results in around 2.8 billion parameters.

T5-11B has d_ff = 65,536 and 128-headed attention, producing a model with about 11 billion parameters.
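
For reference, the scaled-up variants described above can be written down as Hugging Face T5Config objects (the library is an assumption here; the original work used the authors' own TensorFlow codebase). Only the configs are built, since instantiating the models would allocate billions of randomly initialized parameters; the parameter counts in the comments are the figures reported in the paper.

```python
# Sketch of the scaled-up T5 variants as Hugging Face T5Config objects.
from transformers import T5Config

t5_3b = T5Config(
    d_model=1024,    # hidden size
    d_kv=128,        # per-head key/value dimension
    d_ff=16384,      # feed-forward width
    num_layers=24,   # 24-layer encoder (decoder defaults to the same depth)
    num_heads=32,    # 32-headed attention -> ~2.8B parameters
)

t5_11b = T5Config(
    d_model=1024,
    d_kv=128,
    d_ff=65536,      # wider feed-forward layer
    num_layers=24,
    num_heads=128,   # 128-headed attention -> ~11B parameters
)
```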

The largest T5 model had 11 billion parameters and achieved SOTA on the GLUE, SuperGLUE, SQuAD, and CNN/Daily Mail benchmarks. One particularly exciting result was that T5 achieved a near-human score on the SuperGLUE natural language understanding benchmark, which was specifically designed to be difficult for machine learning models but easy for humans.

T5 Code walkthrough:

Check out this walkthrough of abstractive summarization with T5; a minimal sketch is included below.
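
Since the linked walkthrough is external, here is a minimal summarization sketch, again assuming the Hugging Face transformers library and the t5-base checkpoint; the article text is just a placeholder.

```python
# Minimal abstractive summarization sketch with T5, assuming the Hugging Face
# `transformers` library and the public "t5-base" checkpoint.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

article = (
    "The Text-to-Text Transfer Transformer (T5) casts every NLP task as "
    "text-to-text. It is pre-trained on the C4 corpus with a denoising "
    "objective and fine-tuned with task prefixes such as 'summarize:'."
)

# The 'summarize:' prefix tells T5 to produce an abstractive summary.
inputs = tokenizer("summarize: " + article, return_tensors="pt",
                   max_length=512, truncation=True)
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,        # beam search tends to give more fluent summaries
    max_length=60,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```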

A blog post on BERTSUM and BERTSUMABS

References:

Raffel et al., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (https://arxiv.org/abs/1910.10683)

https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
