Introduction to NLP Transformers and Spacy

The Data Reader
13 min read · Dec 15, 2019

Hi!

This article is based on a presentation I gave for my work colleagues back in August 2019. Things are moving very fast in this space at the moment and I can’t guarantee that I’ve been able to keep up to date on all the goings-on in the transformers world, but I believe the contents of this article are still valid and useful, especially for those just getting started with transformers.

What this article will cover:

  • ‘Attention is all you need’
  • Take a look at some of the latest models to be released (BERT, XLNet, RoBERTa)
  • Look into Spacy’s integration of these models for fine-tuning

Attention is all you need

One paper to rule them all

Transformers were largely introduced and strongly argued for in the paper “Attention is all you need”. The paper addressed known structural disadvantages that recurrent neural networks (RNNs) possess when attempting to model natural language. These can be summarised as follows:

  • RNNs have to be heavily modified in order to take the full context of a sequence into account and perform well on most NLP tasks. That is to say, special memory management techniques such as LSTMs came into existence solely because of an inherent weakness of the recurrent model for modelling natural language.
  • RNNs will never be able to take full advantage of parallel computing because of the sequential nature of the networks. It is impractical to segment an RNN’s computation and process the segments independently, because each step depends on the output of the previous step. Since RNNs can’t take full advantage of innovations in parallel computing, this becomes a bottleneck to progress, as SOTA performance is now often a function of available data and compute power.

This paper proposed a new category of network that is “attention” based. This network has become known as the Transformer model.

The Transformer

The above image is the transformer model in its most basic state. For an excellent explanation of this architecture and the paper, ‘Attention is All You Need’, please watch the video below.

At its core, the concept of attention is defined as the following (this is the scaled dot-product attention from the paper):
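\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

where Q, K and V are the Query, Key and Value matrices and d_k is the dimension of the Key vectors.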

We can see in the above function definition that it takes three parameters: Query, Key and Value. The Value tensor contains the information from our sample that will be mapped to the target; for example, this could be a representation of the sentence that we wish to translate. The Query tensor contains information about the different sections of the Value that the network wants to pay attention to. However, the Query by itself doesn’t map directly onto the Value tensor; it needs to go through a transformation before the information it contains can be applied. This is where the Key tensor comes in. The Key tensor, when combined with the Query tensor via a dot-product, produces an output that can be mapped onto the Value tensor. The softmax function is applied to this dot-product to normalise the weights. When the result is multiplied with the Value tensor, the sections of the Value tensor that the Query wished to concentrate on are made more prominent and the other sections are suppressed. Thus the network only pays attention to the parts of the input deemed most relevant when producing predictions.
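To make the mechanics concrete, here is a minimal PyTorch sketch of scaled dot-product attention; the function name and tensor shapes are my own, for illustration only:

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # query, key, value: tensors of shape (batch, seq_len, d_k)
    d_k = query.size(-1)
    # dot-product of the Query with the Key, scaled by sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    # softmax normalises the scores into attention weights that sum to 1
    weights = F.softmax(scores, dim=-1)
    # weighted sum over the Value tensor: attended positions are amplified
    return torch.matmul(weights, value)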

Transformers in the wild

A note on the types of Transformers

Two stages of training

Before I go into detail about the different types of transformers, I wanted to spend a moment talking about the similarities and differences between them. This will help you quickly understand any new types of transformers that appear in the future.

Firstly, all transformers that I’m aware of follow a two-stage training process.

The first stage is known as the pre-training stage. During pre-training, the model is trained on one or more general NLP tasks. These are normally self-supervised tasks such as predicting a masked or next word and next sentence prediction. The tasks normally come as a pair: one trying to predict a missing or next word/sentence, and the other trying to predict a classification of a sentence or document. This allows the model to generalise to a larger number of tasks for the second phase.

The second phase is known as fine-tuning. Without going into too much detail, this is the stage where the ML practitioner continues to update the weights learned during pre-training, using the dataset for their specific problem.

The keen-eyed might recognise this process as transfer learning, and you’d be right. This is another huge bonus that transformers provide. We are already in a position where teams with large resources are pre-training transformers to near-SOTA performance on general NLP tasks and then publishing them for the ML community to use.

Network Structure

The Networks can also largely be split into two camps:

  • Auto-encoder
  • Autoregressive

The Auto-encoder style networks read the input all in one go and then try to create a representation of that input. This represents a large step away from the world of RNNs we talked about at the beginning. The Autoregressive models, however, keep a more traditional perspective and apply the network to the input in a sequential manner.

Examples of Transformers

I will now present a brief summary of three transformer networks that have been published for public use and for which Spacy provides an interface.

BERT (Bidirectional Encoder Representations from Transformers)

This model takes a vector of tokens as input and outputs predictions dependent on the task objective. This is an example of an Auto-encoding model.

Above, you can see three examples of transformer model architectures, with BERT on the far left. Here we see that OpenAI GPT has connections that run in one direction only, and that ELMo actually has two distinct collections of hidden layers that are connected separately. Both of these models attempt to connect the layers in such a way that the sequential nature of the input is respected. That is to say, there are directional dependencies between words in a sentence, so when it comes to, for example, predicting the next word, the model must learn these dependencies. BERT proposed an alternative network structure, involving far less computation to train, where these dependencies were baked into the input.

Input to BERT

The input to BERT is a vector that represents the sequence(s). This vector is the sum of three embeddings (see the sketch after this list):

  • Token Embeddings: These are the original vector representations of the tokens.
  • Segment Embeddings: These are vectors that indicate if a token belongs to sentence 𝐴 or sentence 𝐵.
  • Position Embeddings: These vectors indicate the position of the token in the sequence.
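As a rough illustration of how these three embeddings are combined, here is a toy sketch; the sizes and token IDs below are illustrative and not taken from the article or any particular checkpoint:

import torch
import torch.nn as nn

vocab_size, hidden_size, max_positions = 30522, 768, 512   # illustrative sizes

token_embed = nn.Embedding(vocab_size, hidden_size)
segment_embed = nn.Embedding(2, hidden_size)                # sentence A or sentence B
position_embed = nn.Embedding(max_positions, hidden_size)

token_ids = torch.tensor([[101, 7592, 2088, 102]])          # toy token IDs
segment_ids = torch.zeros_like(token_ids)                   # every token belongs to sentence A
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

# BERT's input representation is the element-wise sum of the three embeddings
bert_input = token_embed(token_ids) + segment_embed(segment_ids) + position_embed(position_ids)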

Training a BERT model

Pre-Training

There are two pre-training tasks that are performed on BERT. They are:

  • Masked Language Model (MLM)
  • Next Sentence Prediction (NSP)

Task 1: Masked Language Model

Given a sentence with obscured words, the model must be able to predict the identity of the missing words.

The input sequence is fed as a vector into a fully connected neural network. This allows the model to be bidirectional, as it can observe words either to the left or the right of the target. To avoid the model trivially echoing the input value for the target, the target words are masked. This is called a Cloze task in the literature.

Issue with Masking

Masking presents one issue with pre-training: the mask token will not appear in the data during fine-tuning, causing an inconsistency between the two training stages.

To fix this, they do the following (sketched in code after this list):

  • If a token has been selected for masking, one of the following actions is performed on it with the given probabilities:
  1. the token is replaced with the mask token 80% of the time
  2. the token is replaced by a random token from the corpus 10% of the time
  3. the token is left untouched 10% of the time
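Here is a toy sketch of that rule; the helper name and vocabulary are made up for illustration:

import random

MASK = "[MASK]"
toy_vocab = ["the", "cat", "sat", "on", "mat"]   # stand-in for the real corpus vocabulary

def corrupt_selected_token(token):
    # Applies the 80/10/10 rule to a token that has already been chosen for masking
    r = random.random()
    if r < 0.8:
        return MASK                        # 80%: replace with the mask token
    elif r < 0.9:
        return random.choice(toy_vocab)    # 10%: replace with a random token
    else:
        return token                       # 10%: leave the token unchanged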

Task 2: Next Sentence Prediction

Given a pair of sentences, the model must be able to distinguish between a pair of sentences that logically follow one another and a pair that do not.

The data for NSP is constructed by creating a collection of (sentence pair, classification) tuples. The first two items are sentences 𝐴 and 𝐵, and the third is an indicator, 𝑖𝑠𝑁𝑒𝑥𝑡, that is True when 𝐵 logically follows 𝐴 and False when it doesn’t. The data is constructed so that in 50% of the collection sentence 𝐵 is chosen randomly and does not follow sentence 𝐴.
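As a toy sketch of how such tuples might be constructed (the function and argument names are my own; a real implementation would work over a full corpus):

import random

def make_nsp_example(doc_sentences, all_sentences):
    # doc_sentences: consecutive sentences from one document
    # all_sentences: a pool to sample random (negative) second sentences from
    i = random.randrange(len(doc_sentences) - 1)
    sentence_a = doc_sentences[i]
    if random.random() < 0.5:
        return (sentence_a, doc_sentences[i + 1], True)        # B really follows A
    return (sentence_a, random.choice(all_sentences), False)   # B is a random sentence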

Fine Tuning BERT

By design, BERT’s architecture allows a large variety of tasks to be performed during the fine-tuning process.

The Pre-Training activities were chosen to reflect the requirements of most NLP tasks so that the model could easily be updated during the fine-tuning process.

At the input, sentence A and sentence B from pre-training are analogous to:

  • (1) sentence pairs in paraphrasing
  • (2) hypothesis-premise pairs in entailment
  • (3) question-passage pairs in question answering
  • (4) a degenerate text-∅ pair in text classification or sequence tagging.

At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.
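As a minimal sketch of these two kinds of output head (a random tensor stands in for the real BERT output; the sizes are illustrative):

import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2
sequence_output = torch.randn(1, 128, hidden_size)   # pretend BERT output: (batch, seq_len, hidden)

# Token-level head (e.g. sequence tagging): a linear layer applied to every token
token_head = nn.Linear(hidden_size, num_labels)
token_logits = token_head(sequence_output)            # shape (1, 128, num_labels)

# Sentence-level head (e.g. sentiment): a linear layer applied to the [CLS] position only
cls_head = nn.Linear(hidden_size, num_labels)
cls_logits = cls_head(sequence_output[:, 0])          # shape (1, num_labels)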

XLNET

XLNet is described as a generalised autoregressive pre-training method for language understanding.

Problem with Auto-Regression

The original paper also discusses the drawbacks of other auto-regressive models, namely that they only see the context on one side of the target, whereas having a bidirectional view of the input vastly improves performance on standard NLP tasks.

However, instead of offering an auto-encoder alternative, they propose a generalised auto-regressive model.

Problem with BERT

There are two major drawbacks to the mask approach taken by BERT:

  1. The Mask tokens don’t exist during the fine-tuning stage of training, leading to an inconsistency that can compromise performance during fine-tuning.
  2. BERT selects the tokens to be Masked randomly. This means that there will be occurrences where two tokens that were both Masked are dependent on each other. As both tokens have been masked, the model will not be able to capture this dependency.

XLNet’s proposed fix

XLNet proposes that, in order to capture the bidirectional nature of the information flow in sequences, we train the model on permutations of the sequence as well as the original sequence.

Using permutations, the model is exposed to the tokens that appear on either side of the target during the pre-training process. It achieves this while also avoiding the drawbacks of the masked input approach used by BERT.

Permutation Procedure

Partial Predictions

It is not always sensible to attempt to train using all possible permutations. For example, say we have the input sentence:

The cat climbed up the tree

The word tree is the target to be predicted. Suppose we fed the model the following permutation:

_ _ _ The climbed cat the up

The model would not have any information to go on before the first obscured token, which we know is tree.

To tackle this issue, XLNet selects only the last 𝑁 tokens of any permutation order to be the target tokens. This means that there is always a minimum amount of information about the sequence available to the model when predicting the next word.
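A toy sketch of partial prediction; N and the example sentence are illustrative, and the real model works on factorisation orders over token positions rather than raw strings:

import random

tokens = ["The", "cat", "climbed", "up", "the", "tree"]
N = 2   # only the last N positions of the sampled order are prediction targets

order = list(range(len(tokens)))
random.shuffle(order)             # one sampled factorisation (permutation) order

context_positions = order[:-N]    # positions the model may attend to
target_positions = order[-N:]     # positions the model must predict

print([tokens[i] for i in context_positions], "->", [tokens[i] for i in target_positions])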

Two-Stream Self-Attention for Target-Aware Representations

The largest issue with the setup described so far is that the model has no way of knowing which token in the sequence is the target token. As a result, the model will likely just output the same distribution over and over again.

To ensure that the model is aware of the position of the target token, XLNet uses a special target-aware input representation.

It is important that, while the position of the target is made known to the model, the model remains uninformed about the content of the target token; otherwise the prediction problem becomes trivial.

Dual Input Representation: Content Stream and Query Stream

Again, for a detailed explanation of streams and XLNet in general, please watch the video below:

A high-level view of the model

  • Top Left — A traditional self-attention transformer
  • Bottom Left — A two-stream approach where g represents the query stream and h is the content stream
  • Right — The role of the attention mask in creating the permutations, and how the query / content streams encode the target word without revealing its content.

A quick bonus model: RoBERTa (A Robustly Optimised BERT Pre-Training Approach)

RoBERTa is a model that was built by Facebook in an effort to demonstrate that BERT had been under-trained in the original paper.

The Optimisations

In their paper they mention the following changes:

Dynamic Masking

In the original BERT paper, a one-time masking procedure was performed over the data during the preprocessing stage, creating a statically masked dataset. RoBERTa instead masks the data dynamically at the input stage, exposing the model to a much larger variety of masking scenarios.

Full-Sentence without NSP

In BERT, one of the pre-training tasks is to predict whether sentence 𝐵 logically follows sentence 𝐴. RoBERTa instead takes as input several sentences sampled from one or more documents, such that their total length is at most 512 tokens. They also only performed the Masked Language Model task, dropping the second pre-training task (NSP).

Modified Training

  • Used a text encoding that could represent a large vocabulary
  • Increased the amount of data used to train
  • Increased the size of the mini-batches
  • Increased the number of training steps

These are the results that the paper published:

Spacy

Now for the practical stuff :)

What has been amazing about these transformers is that people have already created their own implementations and achieved pretty good performance. One such example is the set of transformers implemented by the group at Hugging Face. This group is made up of awesome people who are interested in NLP and in getting these models into the real world. Thanks!

These models have been reviewed and incorporated into the Spacy library for NLP tasks in Python. You can read more about these transformers and how to use them here.

I had a go at implementing a basic fine-tuning pipeline off the back of these examples. I’ve detailed the code below as a starting point for you to start playing around with the transformers.

Plug and Play: using a pre-trained model for sentence similarity

Import the modules

import spacy
import torch

These transformers are built using PyTorch, so you have the option to use GPUs to speed things up if you have them available.

is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    torch.set_default_tensor_type("torch.cuda.FloatTensor")

Download the pre-trained transformers

$ python -m spacy download <model name>  # e.g. en_trf_bertbaseuncased_lg

Load the model

# Loading in Model
nlp = spacy.load("en_trf_bertbaseuncased_lg")

Create an example corpus comprising two sentences taken from a mathematical article and one from a historical article.

# Taken from "https://en.wikipedia.org/wiki/Derivative"math_sent_a = """The most common approach to turn this intuitive idea into a """\
"""precise definition is to define the derivative as a limit of """\
"""difference quotients of real numbers"""
math_sent_b = """The fundamental theorem of calculus relates antidifferentiation """\
"""with integration"""
# Taken from "https://en.wikipedia.org/wiki/John_Major"history_sent_c = """He went on to lead the Conservatives to a record fourth """\
"""consecutive electoral victory, winning the most votes in """\
"""British electoral history with over 14 million votes at """\
"""the 1992 general election, albeit with a reduced majority """\
"""in the House of Commons"""

Run these through Spacy’s pipeline:

math_nlp_a = nlp(math_sent_a)
math_nlp_b = nlp(math_sent_b)
history_nlp_c = nlp(history_sent_c)

Comparing two mathematical sentences

print(math_nlp_a[0].similarity(math_nlp_b[0]))
# > 0.6418499

Comparing a sentence about maths and a sentence about history

print(math_nlp_a[0].similarity(history_nlp_c[0]))
# > 0.31078288

Fine Tuning a model

Import modules

import spacy
from spacy.util import minibatch
import random
import torch

Check for GPUs

is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    torch.set_default_tensor_type("torch.cuda.FloatTensor")

Import model

nlp = spacy.load("en_pytt_bertbaseuncased_lg")

Create corpus (sentence sentiment in this case)

TRAIN_DATA = [
    ("He told us a very exciting adventure story.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("She wrote him a long letter, but he didn't read it.", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("I am never at home on Sundays.", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("He ran out of money, so he had to stop playing poker.", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]

Example pipeline taken from here

print(nlp.pipe_names) # ["sentencizer", "pytt_wordpiecer", "pytt_tok2vec"]textcat = nlp.create_pipe("pytt_textcat", config={"exclusive_classes": True})
for label in ("POSITIVE", "NEGATIVE"):
textcat.add_label(label)
nlp.add_pipe(textcat)
optimizer = nlp.resume_training()
for i in range(10):
random.shuffle(TRAIN_DATA)
losses = {}
for batch in minibatch(TRAIN_DATA, size=8):
texts, cats = zip(*batch)
nlp.update(texts, cats, drop=0.2, sgd=optimizer, losses=losses)
print(i, losses)

For me, this produced the following output:

['sentencizer', 'pytt_wordpiecer', 'pytt_tok2vec']
0 {'pytt_textcat': 0.03125}
1 {'pytt_textcat': 0.02763683721423149}
2 {'pytt_textcat': 0.02363896369934082}
3 {'pytt_textcat': 0.008534247055649757}
4 {'pytt_textcat': 0.002479486633092165}
5 {'pytt_textcat': 0.0003107782104052603}
6 {'pytt_textcat': 0.02999699115753174}
7 {'pytt_textcat': 0.00012167354725534096}
8 {'pytt_textcat': 3.1756288080941886e-05}
9 {'pytt_textcat': 1.2619821063708514e-05}

Here are links to all the resources I used to understand transformers myself. All the media used in this article has been taken from these sources.

Links

Papers on Transformers

General Articles about Transformers

Resources for the implementations of Transformers in Spacy
