[Part-1] Which Attention (architecture) do you need?

Venkata Dikshit
Published in ETHER Labs
6 min read · Aug 5, 2019

When we started working on text tasks at Ether Labs, neural transfer learning for text was still in its nascent stages, and we relied on multiple training paradigms for different tasks — summarization, text similarity, key-phrase extraction, NER etc. With the source data being the same, this looked redundant, and we experimented with approaches to build a reusable feature extractor that can be used across multiple tasks. Word/paragraph embeddings are flexible enough to address some of these tasks, but their learning capabilities are no match for deeper architectures. Task-specific language model fine-tuning has brought ImageNet-like fine-tuning capabilities to text, paving the way for general text understanding. We were quick to adopt this approach, thanks to Howard and Ruder (2018), and achieved significant improvements across multiple tasks.

Our natural next step in adopting cutting-edge learning paradigms is Transformer-based architectures. The improvements are evident, and there were no second thoughts about replacing our existing language model. Of late, Google, Microsoft and OpenAI have been pushing the performance limits on GLUE tasks with their transformer-based learning paradigms. In this post and the next, we highlight the differences between recently proposed transformer-based architectures to help NLP practitioners choose the right architecture for their tasks.

Introduction

The Transformer, a.k.a. the self-attention architecture, proposed by Vaswani et al. in “Attention Is All You Need”, is an attention-only architecture that draws global dependencies between input and output. This self-attention/intra-attention mechanism works by relating different positions of a single sequence in order to compute a representation of that sequence. For a detailed explanation of transformers, refer to The Illustrated Transformer and The Annotated Transformer.
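For reference, the scaled dot-product attention at the core of every architecture discussed here is defined in Vaswani et al. as:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

where Q, K and V are the query, key and value projections of the token representations and d_k is the key dimension; multi-head attention runs several such operations in parallel and concatenates their outputs.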

[fig-1] Transformer-based models segregated by their architectural design

The snapshot above shows some of the recently proposed transformer-based architectures. While every architecture is based on multi-headed self-attention at its core, they differ in the following aspects:

  • Left-to-right vs. bi-directional contexts
  • Pre-training tasks
  • Usage of auxiliary tasks
  • Generalization strategies adopted

The models in the same quadrant are not necessarily similar in all aspects — they can be similar in the way they are different from models in other quadrants. For example, GPT-2 is architecturally close to GPT and is similar to XLNet for its emphasis on generalization. This is the best possible grouping I could think of at this point.

In the literature, left-to-right context networks are called “Transformer Decoders” and bi-directional context networks are called “Transformer Encoders”.

GPT and GPT-2

The GPT/GPT-2 architecture is a decoder-only architecture where every token can attend only to the tokens to its left in the self-attention layers. Pre-training is based on a typical language modelling setup: maximizing the likelihood of token sequences using a multi-layer Transformer decoder.

Given an unsupervised corpus of tokens U = {u1, . . . , un}, we use a standard language modeling objective to maximize the following likelihood
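This likelihood is the standard auto-regressive objective from the GPT paper, computed over a context window of size k with model parameters Θ:

L_1(U) = \sum_{i} \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)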

The pre-trained parameters can be adapted to supervised tasks by feeding the inputs through the pre-trained model to obtain the final transformer block’s activations, which are then fed into an added linear output layer to predict the label. Including language modelling as an auxiliary objective during fine-tuning helped learning by improving the generalization of the supervised model and speeding up convergence.

Learning objective for supervised tasks
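As defined in the GPT paper, the supervised objective over a labelled corpus C (inputs x^1, …, x^m with label y), and the combined objective with the auxiliary LM term weighted by λ, are:

L_2(C) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)

L_3(C) = L_2(C) + \lambda \cdot L_1(C)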

Architecturally, GPT-2 remains the same as GPT, with the following modifications:

  • Position of layer-normalization
  • Modified initialization
  • Expanded vocabulary and increased context and batch sizes

Usability

The GPT architecture is effective at featurizing longer text sequences thanks to its language model training objective. This can be handy for generating paragraph/document-level embeddings, as in the sketch below.
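A minimal sketch of this idea, assuming a recent version of the Hugging Face transformers library and mean-pooling over the final block’s hidden states (the pooling choice is ours for illustration, not something prescribed by the paper):

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

def document_embedding(text: str) -> torch.Tensor:
    # GPT-2 can featurize up to 1024 tokens of context in one pass.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model(**inputs)
    hidden_states = outputs.last_hidden_state    # shape: (1, seq_len, 768)
    # Mean-pool token representations into a single document-level vector.
    return hidden_states.mean(dim=1).squeeze(0)  # shape: (768,)

vector = document_embedding("Transformers draw global dependencies between input and output.")
print(vector.shape)  # torch.Size([768])
```

Averaging is only the simplest pooling strategy; the last token’s state or a learned pooling layer are equally valid choices.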

Limitations

GPT and GPT-2 fail to capture the full representation of a word due to their use of left-to-right-only context. This yields sub-optimal performance on token-level tasks such as question answering, where bi-directional context is required.

BERT and MT-DNN

BERT (Bidirectional Encoder Representations from Transformers) has pushed the SOTA across various text tasks by addressing the limitations of GPT: it uses bi-directional context and two novel pre-training tasks:

  • Masked LM — Randomly mask some of the tokens from the input; the objective is to predict the original vocabulary id of each masked word based only on its context. Masked LM is not restricted to left-to-right context, as a masked token is predicted from all the unmasked tokens irrespective of their position in the sentence.
  • Next Sentence Prediction — A binary classification pre-training task that predicts whether a sentence pair occurs in sequence. This helps address NLP tasks involving two or more inputs — Question Answering, Natural Language Inference etc.

Task-specific fine-tuning is the same as in GPT, with minor modifications.

[fig-2] BERT Pre-training input for Masked LM + NSP
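To make [fig-2] concrete, here is a small sketch, again assuming the Hugging Face transformers library, of how a sentence pair is packed into a single BERT input with [CLS]/[SEP] markers and segment ids (the actual pre-training pipeline additionally applies the masking procedure discussed under Limitations):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence_a = "the man went to the store"
sentence_b = "he bought a gallon of milk"

# Sentence pairs are packed as: [CLS] sentence A [SEP] sentence B [SEP]
encoded = tokenizer(sentence_a, sentence_b)

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'man', ..., '[SEP]', 'he', 'bought', ..., '[SEP]']
print(encoded["token_type_ids"])  # segment ids: 0 for sentence A, 1 for sentence B
```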

MT-DNN (Multi-Task Deep Neural Networks) takes inspiration from BERT for pre-training and uses a novel multi-task learning approach for fine-tuning.

Multi-Task Learning (MTL) is inspired by human learning activities where people often apply the knowledge learned from previous tasks to help learn a new task. For example, it is easier for a person who knows how to ski to learn skating than the one who does not. Similarly, it is useful for multiple (related) tasks to be learned jointly so that the knowledge learned in one task can benefit other tasks.

[fig-3] Overview of MT-DNN Learning strategy

[fig-3] shows the learning strategy of MT-DNN: during fine-tuning, each batch optimizes one of the pre-selected objective functions rather than a single fixed objective. The performance gains of MT-DNN come from this fine-tuning strategy, which gives MT-DNN the ability to use learnings from one fine-tuning task in others, unlike GPT and BERT, where task-specific fine-tuning is independent of other tasks. This generalized learning strategy aligns with the task-agnostic learning paradigm (a minimal sketch follows below). Please refer to this link for more details.
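A minimal sketch of this batch-level task sampling, assuming a shared encoder with one small head per task (the names `encoder`, `heads` and `train_step`, and the stand-in linear encoder, are ours for illustration; MT-DNN shares a full BERT body across tasks):

```python
import random
import torch
import torch.nn as nn

hidden_size = 768

# Stand-in for the shared BERT body that MT-DNN fine-tunes across all tasks.
encoder = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU())

# One small output head per fine-tuning task.
heads = {
    "sentiment": nn.Linear(hidden_size, 2),   # single-sentence classification
    "similarity": nn.Linear(hidden_size, 1),  # pairwise regression (STS-B style)
}
losses = {"sentiment": nn.CrossEntropyLoss(), "similarity": nn.MSELoss()}

params = list(encoder.parameters()) + [p for h in heads.values() for p in h.parameters()]
optimizer = torch.optim.Adam(params, lr=5e-5)

def train_step(task, features_in, labels):
    # Every task updates the shared encoder; only the sampled task's head is used.
    features = encoder(features_in)
    outputs = heads[task](features)
    loss = losses[task](outputs.squeeze(-1) if task == "similarity" else outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Each batch optimizes exactly one of the pre-selected objectives.
for step in range(100):
    task = random.choice(list(heads))
    if task == "sentiment":
        batch = (torch.randn(8, hidden_size), torch.randint(0, 2, (8,)))
    else:
        batch = (torch.randn(8, hidden_size), torch.rand(8))
    train_step(task, *batch)
```

Because the encoder’s gradients come from all tasks over the course of training, representations learned for one task can benefit the others.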

Model Agnostic Meta-learning approach

MAML trains over a wide range of tasks. It learns a representation that can be quickly adapted to a new task via a few gradient steps. The meta-learner seeks to find an initialization that is not only useful for adapting to various problems but can also be adapted quickly (in a small number of steps) and efficiently (using only a few examples).
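To give a flavour of the algorithm, here is a minimal first-order sketch (the lighter FOMAML approximation rather than full second-order MAML); the toy regression tasks and the name `sample_task` are purely illustrative and are not part of any of the models discussed here:

```python
import copy
import torch
import torch.nn as nn

# Shared initialization that the meta-learner optimizes.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
meta_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
inner_lr = 0.01

def sample_task():
    # Illustrative stand-in: each "task" is a random linear regression problem.
    w = torch.randn(10, 1)
    def make_batch(n=16):
        x = torch.randn(n, 10)
        return x, x @ w
    return make_batch

for meta_step in range(1000):
    meta_optimizer.zero_grad()
    for _ in range(4):  # a handful of tasks per meta-update
        make_batch = sample_task()
        x_support, y_support = make_batch()

        # Inner loop: adapt a copy of the shared initialization in a few steps.
        adapted = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(3):
            inner_opt.zero_grad()
            loss_fn(adapted(x_support), y_support).backward()
            inner_opt.step()

        # Outer step (first-order approximation): evaluate the adapted copy on
        # query data from the same task and accumulate its gradients onto the
        # shared initialization.
        x_query, y_query = make_batch()
        adapted.zero_grad()
        loss_fn(adapted(x_query), y_query).backward()
        for p, p_adapted in zip(model.parameters(), adapted.parameters()):
            grad = p_adapted.grad.detach().clone()
            p.grad = grad if p.grad is None else p.grad + grad
    meta_optimizer.step()
```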

Usability

BERT addresses the limitations of GPT by using both left-to-right and right-to-left context for learning token representations in a sentence. This makes BERT effective for most text tasks — Q&A, sentence similarity, POS tagging etc.

Limitations

  • BERT suffers from a training-inference mismatch, as the [MASK] token used during Masked LM training does not appear during inference. This is alleviated to some extent using the following strategy (see the sketch after this list):

A downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time. Then, Ti will be used to predict the original token with cross entropy loss.

  • One more constraint on the learning capabilities of BERT is that it assumes conditional independence between the masked tokens, and this limits the quality of the sentence representations learnt by BERT.
  • BERT is not effective for encoding longer text sequences, as its pre-training tasks are restricted to sentence-level inputs.
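A minimal sketch of the 80/10/10 replacement strategy quoted above, assuming the Hugging Face BERT tokenizer; the function name and the per-position sampling (which only approximates choosing exactly 15% of positions) are ours:

```python
import random
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
vocab_tokens = list(tokenizer.vocab)  # token strings used for random replacement

def mask_tokens(tokens, mask_prob=0.15):
    """BERT-style masking: for roughly 15% of positions, use [MASK] 80% of the
    time, a random token 10% of the time, and the original token 10% of the time."""
    masked = list(tokens)
    labels = [None] * len(tokens)  # only the selected positions are predicted
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = token
            roll = random.random()
            if roll < 0.8:
                masked[i] = "[MASK]"
            elif roll < 0.9:
                masked[i] = random.choice(vocab_tokens)
            # else: keep the original token unchanged (10% of the time)
    return masked, labels

tokens = tokenizer.tokenize("the quick brown fox jumps over the lazy dog")
print(mask_tokens(tokens))
```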

Contd.

In Part-2, we will cover the rest of the Transformer-based architectures from [fig-1] — discussing how they address the limitations of GPT- and BERT-based approaches, the path forward for generalized text understanding, and a lot more.

Check out EtherMeet, an AI-enabled video conferencing service for teams who use Slack.

Sign up at etherlabs.io
