Comparison between BERT, GPT-2 and ELMo

Gaurav Ghati
May 3, 2020 · 7 min read


Recent progress in NLP model architecture has led to breakthroughs such as BERT. Among the ideas that changed how we build subsequent models are large-scale pre-trained language models like OpenAI GPT and BERT, and deep contextualized word representations, i.e. ELMo.

(Cover image: Aaron Burden on Unsplash)

In this article, we will look at how these three approaches were proposed:

  1. ELMo
    - Bidirectional Language Model
    - ELMo Representations
    - ELMo for Downstream Tasks
  2. OpenAI GPT-2
    - Transformer Decoder as Language Model
    - Supervised Fine-Tuning
  3. BERT
    - Pre-training
    - Input Representation
    - BERT in Downstream Tasks
  4. Comparison of BERT, GPT-2 and ELMo

ELMo

ELMo stands for Embeddings from Language Models. As the name suggests, in this model deeply contextualized word embeddings are derived from a language model (LM).

ELMo uses a bidirectional language model (biLM), pre-trained on a large text corpus, to learn both the characteristics of word use (e.g., syntax and semantics) and how those uses vary across linguistic contexts (i.e., to model polysemy). The biLM captures context-dependent aspects of word meaning.

Figure: biLM used in ELMo (source: “Neural Networks, Types, and Functional Programming” by Christopher Olah)

Bidirectional Language Model
Given a sequence of n tokens, (x₁, x₂, …, xₙ), a forward language model computes the probability of the sequence by modeling the probability of token xₖ given its history (x₁, x₂, …, xₖ₋₁):

In the forward pass, the history contains the words before the target token.

In the backward pass, the history contains the words after the target token.

A biLM combines a forward and a backward LM, and the model is trained to jointly maximize the log-likelihood in both directions (i.e., to minimize the negative log-likelihood).
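
Written out (restating the formulation in the ELMo paper), the forward and backward factorizations and the joint biLM objective are:

```latex
% Forward LM: predict each token from its left context
p(x_1, \dots, x_n) = \prod_{k=1}^{n} p(x_k \mid x_1, \dots, x_{k-1})

% Backward LM: predict each token from its right context
p(x_1, \dots, x_n) = \prod_{k=1}^{n} p(x_k \mid x_{k+1}, \dots, x_n)

% biLM objective: jointly maximize the log-likelihood of both directions
\mathcal{L} = \sum_{k=1}^{n} \Big( \log p(x_k \mid x_1, \dots, x_{k-1})
            + \log p(x_k \mid x_{k+1}, \dots, x_n) \Big)
```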

ELMo Representation
For each token xₖ, an L-layer biLM computes a set of 2L + 1 representations.

The outputs from the forward and backward LSTMs at each layer are concatenated to form hₖ,ⱼ, with layer 0 being the context-independent token embedding.
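
In the ELMo paper’s notation, the set of representations for token xₖ and the concatenated hidden states are:

```latex
% 2L + 1 representations per token: the token embedding plus L forward and L backward hidden states
R_k = \{\, x_k^{LM},\ \overrightarrow{h}_{k,j}^{LM},\ \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L \,\}
    = \{\, h_{k,j}^{LM} \mid j = 0, \dots, L \,\}

% Each layer's forward and backward outputs are concatenated; layer 0 is the token embedding
h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM};\ \overleftarrow{h}_{k,j}^{LM}], \qquad h_{k,0}^{LM} = x_k^{LM}
```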

The weights sᵗᵃˢᵏ in the linear combination are softmax-normalized and learned for each end task. The scaling factor γᵗᵃˢᵏ corrects the misalignment between the distribution of biLM hidden states and the distribution of task-specific representations. Both sᵗᵃˢᵏ and γᵗᵃˢᵏ are learned while training the end task, not the language model.
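
Putting it together, the task-specific ELMo vector for token xₖ is:

```latex
% Softmax-normalized layer weights s^{task} and a task-specific scale \gamma^{task}
\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}
```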

Here ELMoₖᵗᵃˢᵏ is the task-specific weighting of all biLM layers for token xₖ; in a downstream model it is typically concatenated with the context-independent token representation xₖ.
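
A minimal sketch of this weighting in PyTorch. The class name ScalarMix, the dimensions, and the fake biLM states are illustrative assumptions, not the authors’ code:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighted sum of biLM layer outputs (illustrative sketch)."""
    def __init__(self, num_layers: int):
        super().__init__()
        # s^task: one scalar per layer, softmax-normalized at forward time
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))
        # gamma^task: scales the whole ELMo vector for the end task
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_outputs: torch.Tensor) -> torch.Tensor:
        # layer_outputs: (num_layers, batch, seq_len, dim)
        weights = torch.softmax(self.scalar_weights, dim=0)           # s^task
        mixed = (weights.view(-1, 1, 1, 1) * layer_outputs).sum(0)    # sum_j s_j * h_{k,j}
        return self.gamma * mixed                                     # gamma^task * (...)

# Usage: 3 layers (token embedding + 2 biLSTM layers), batch of 8, 20 tokens, dim 1024
mix = ScalarMix(num_layers=3)
fake_bilm_states = torch.randn(3, 8, 20, 1024)
elmo_task = mix(fake_bilm_states)   # (8, 20, 1024), ready to concatenate with x_k
```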

ELMo representations from different layers of the biLM are useful for different kinds of tasks:

  • For a semantics-intensive task (e.g., word sense disambiguation), the top layer is better than the first layer.
  • For a syntax-intensive task (e.g., POS tagging), the first layer is better than the top layer.

ELMo for Downstream Tasks
This representation can easily be added to existing models, and it significantly improves the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis.

OpenAI GPT-2

OpenAI GPT-2 is the successor of the GPT model. It is a large transformer-based language model: a language model is generatively pre-trained on a diverse corpus of unlabeled text and then discriminatively fine-tuned on each specific task.

GPT has two major differences from ELMo:

  1. The model architecture: ELMo uses the concatenation of independently trained forward and backward LSTMs, while GPT uses a multi-layer transformer decoder.
  2. The use of contextualized embeddings: ELMo feeds its embeddings to task-specific models as additional features (an unsupervised, feature-based approach), while GPT fine-tunes the same base model for all end tasks.

Transformer Decoder as Language Model
Unlike the original transformer architecture, the transformer-decoder model discards the encoder part, so there is only a single input sequence rather than two separate source and target sequences.

Figure: From Lil’Log by Lilian Weng.

Each transformer block contains a masked multi-head self-attention layer followed by a pointwise feed-forward layer, with layer normalization in between. The final output, after a softmax, is a distribution over target tokens.
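
A minimal sketch of such a decoder block in PyTorch. The class name, dimensions, and post-norm layout are illustrative assumptions, not OpenAI’s implementation:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One transformer-decoder block: masked self-attention + feed-forward (sketch)."""
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Causal mask: position k may only attend to positions <= k (True = blocked)
        mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + attn_out)        # residual + layer norm
        x = self.ln2(x + self.ff(x))      # residual + layer norm
        return x

# Usage: batch of 2 sequences, 10 tokens, hidden size 768
block = DecoderBlock()
hidden = block(torch.randn(2, 10, 768))
# A final linear layer over the vocabulary + softmax would give p(next token).
```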

The model maximizes the likelihood of the next word given its preceding context, the same forward objective as in ELMo, but without the backward computation.
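
With a context window of size C, the objective from the GPT paper (rewritten here in this article’s notation) is:

```latex
% Maximize the log-likelihood of each token given its left context only
L_1(X) = \sum_{i} \log P(x_i \mid x_{i-C}, \dots, x_{i-1};\ \Theta)
```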

Supervised Fine-Tuning
After pre-training the model, we reuse its parameters for the supervised task. We assume a labelled dataset where each instance consists of a sequence of input tokens (x₁, x₂, …, xₙ) and an output label y.

The pre-trained transformer (GPT) can then be fine-tuned on different tasks: all structured inputs are first converted into token sequences that the pre-trained model can process, and a linear + softmax layer on top produces the prediction.
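
Concretely, the GPT paper feeds the final hidden state of the last token, hₗᵐ, into the new linear layer W_y, and keeps language modeling as an auxiliary objective during fine-tuning:

```latex
% Task head: linear + softmax on the last token's final hidden state
P(y \mid x^1, \dots, x^m) = \mathrm{softmax}(h_l^m W_y)

% Supervised objective over the labelled dataset C
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)

% Combined objective: fine-tuning loss plus a weighted language-modeling loss
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda\, L_1(\mathcal{C})
```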

The basic model architecture of various NLP tasks is shown below.

(Image source: original paper)

In the first stage, generative pre-training of a language model absorbs as much free text as possible (unsupervised learning). In the second stage, the model is fine-tuned on specific tasks with a small labelled dataset and a minimal set of new parameters to learn. This method is semi-supervised sequence learning (original paper).

Drawback: GPT is uni-directional; the model is only trained to predict the next token from the left-to-right context.

BERT

BERT stands for Bidirectional Encoder Representations from Transformers. As the name suggests, the model learns bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, BERT is one of the biggest breakthrough ideas of the last few years.

Compared to GPT, the largest difference and improvement in BERT is that it makes training bi-directional. The paper claims that:

“bidirectional nature of our model is the single most important new contribution”

(Image source: BERT original paper)

Pre-Training BERT
BERT is pre-trained with two unsupervised tasks: Masked LM (MLM) and Next Sentence Prediction (NSP).

Task 1: Masked Language Model (MLM)
Learning from the context on both sides of a word, rather than only the words that come before (or after) it, captures its meaning better, both syntactically and semantically. You can read more about this cloze procedure in this paper.

The training data generator chooses 15% of the token positions at random for prediction. If the iᵗʰ token is chosen, we replace it with:

1. The [MASK] token 80% of the time
2. A random token 10% of the time
3. The unchanged iᵗʰ token 10% of the time

The final hidden vector Tᵢ is then used to predict the original token with a cross-entropy loss.
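
A minimal sketch of this 80/10/10 masking procedure in Python. The vocabulary, the helper name mask_tokens, and the special-token handling are simplified assumptions for illustration:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return (masked_tokens, labels); labels is None at positions not chosen for prediction."""
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() >= mask_prob:        # only 15% of positions are predicted
            continue
        labels[i] = token                       # T_i must recover the original token
        r = random.random()
        if r < 0.8:                             # 80%: replace with [MASK]
            masked[i] = MASK_TOKEN
        elif r < 0.9:                           # 10%: replace with a random token
            masked[i] = random.choice(vocab)
        # remaining 10%: keep the token unchanged
    return masked, labels

# Usage
vocab = ["my", "dog", "is", "hairy", "cute", "the"]
print(mask_tokens(["my", "dog", "is", "hairy"], vocab))
```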

Task 2: Next Sentence Prediction (NSP)
Many important downstream tasks such as Question Answering (QA) are based on the relationship between two sentences, which is not directly captured by language modeling.

BERT therefore also trains a binary classifier to tell whether one sentence actually follows the other; this pre-training improves performance on tasks like QA and NLI.
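
A minimal sketch of how NSP training pairs can be built. The corpus format and the helper name make_nsp_pairs are illustrative assumptions; real preprocessing also handles sequence lengths and special tokens:

```python
import random

def make_nsp_pairs(documents):
    """documents: list of documents, each a list of sentences.
    Returns (sentence_a, sentence_b, is_next) triples: 50% true next sentence, 50% random."""
    pairs = []
    for doc in documents:
        for i in range(len(doc) - 1):
            if random.random() < 0.5:
                pairs.append((doc[i], doc[i + 1], True))    # IsNext
            else:
                # NotNext: a random sentence (ideally from a different document)
                other_doc = random.choice(documents)
                pairs.append((doc[i], random.choice(other_doc), False))
    return pairs

# Usage
docs = [["The man went to the store.", "He bought a gallon of milk."],
        ["Penguins are flightless birds.", "They live in the Southern Hemisphere."]]
print(make_nsp_pairs(docs))
```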

Input Representation
BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.

(Image source: BERT original paper)

Position embeddings encode where each token sits in the whole sequence, while segment embeddings encode which sentence (A or B) each token belongs to.
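
A minimal sketch of this input representation in PyTorch. The class name, the omission of dropout, and the example token ids are illustrative simplifications; sizes follow BERT-base:

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Token + segment + position embeddings, summed and layer-normalized (sketch)."""
    def __init__(self, vocab_size=30522, max_len=512, num_segments=2, dim=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, dim)
        self.segment = nn.Embedding(num_segments, dim)   # sentence A = 0, sentence B = 1
        self.position = nn.Embedding(max_len, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token(token_ids) + self.segment(segment_ids) + self.position(positions)
        return self.norm(x)

# Usage: one sequence of 7 token ids (101 = [CLS], 102 = [SEP]; the rest are arbitrary here),
# with the first four tokens in segment A and the rest in segment B.
emb = BertInputEmbeddings()
token_ids = torch.tensor([[101, 7, 8, 102, 9, 10, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])
print(emb(token_ids, segment_ids).shape)   # torch.Size([1, 7, 768])
```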

BERT in Downstream Tasks
BERT is pre-trained on a huge corpus, English Wikipedia (2,500M words) and BooksCorpus (800M words), which is a large part of what makes it a state-of-the-art model.

Various Architecture of BERT model for different NLP tasks, (Image source: BERT original paper)

BERT achieves state-of-the-art results on many NLP tasks, such as:

  • Multi-Genre Natural Language Inference (MNLI)
  • Quora Question Pairs (QQP)
  • Question Natural Language Inference (QNLI)
  • The Stanford Sentiment Treebank (SST-2)
  • The Corpus of Linguistic Acceptability (CoLA)
  • The Semantic Textual Similarity Benchmark (STS-B)
  • Microsoft Research Paraphrase Corpus (MRPC)
  • Recognizing Textual Entailment (RTE) etc.

For more information on each task, see the Appendix of the original paper.
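
As a minimal sketch of fine-tuning BERT on one of these sentence-classification tasks, assuming the Hugging Face transformers library is available (the checkpoint name, label count, and example sentence are placeholders, not part of the original article):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load pre-trained BERT and add a classification head (e.g., 2 labels for SST-2)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("this movie was surprisingly good", return_tensors="pt")
labels = torch.tensor([1])              # placeholder label: 1 = positive

outputs = model(**inputs, labels=labels)
outputs.loss.backward()                 # fine-tune all weights plus the new head
print(outputs.logits)
```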

Comparison of BERT, GPT-2 and ELMo

The comparisons between the model architectures are shown visually below. Note that, in addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

Comparison of BERT, OpenAI GPT and ELMo, (Image source: BERT original paper)
  • BERT and GPT are transformer-based architectures, while ELMo is a bi-LSTM language model.
  • BERT is deeply bidirectional, GPT is unidirectional, and ELMo is only shallowly bidirectional (forward and backward LMs trained separately, then concatenated).
  • GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).
  • GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training.
  • GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.
  • GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set.

Summary of the various models, Figure: From Lil’Log by Lilian Weng.

References:

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (BERT original paper)
  • Generalized Language Models (original post) by Lilian Weng
  • Deep contextualized word representations (ELMo original paper)
  • Improving Language Understanding with Unsupervised Learning (GPT original paper)
  • Language Models are Unsupervised Multitask Learners (GPT-2 original paper)

Thanks for reading! You can connect with me on LinkedIn, Twitter, or my portfolio.
