Day 9: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Francisco Ingham
A paper a day avoids neuron decay
8 min read · Apr 3, 2019

[Jan 18, 2019] How to improve your Transformer’s performance through increased long-term coherence by using recurrence

XL will beat a regular Transformer anytime

TL;DR

Is Attention really All You Need? Attention is nice, but most of us need more than that: we need some sense of meaning. The original Transformer architecture is no exception. By computing self-attention over arbitrary fixed-length segments and not using recurrence, the original architecture breaks sentences apart and cannot use previous segments to predict the first few symbols of the current segment. This hurts performance greatly.

The authors propose a new architecture that uses recurrence to preserve context across segments.

Note: if you haven't yet, I suggest you read the Dissecting BERT series I wrote with Miguel Romero Calvo to understand the original Transformer architecture.

What is the problem again?

The problem is the following. Imagine we are training on the script of The Matrix, and our text is:

Morpheus: “This is your last chance. After this, there is no turning back. You take the blue pill — the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill — you stay in Wonderland and I show you how deep the rabbit-hole goes.”

Say our segment length is 16 tokens and our batch size is 1, for simplicity. Depending on the tokenization, the first segment would be ‘Morpheus: “This is your last chance. After this, there is no turning back. You take the’. We feed these tokens to the network and ask it to predict the next token at every position (masking all the tokens that come after it). In the next iteration we feed: ‘blue pill — the story ends, you wake up in your bed and believe whatever you’.

Do you see the problem? When processing the second segment, how does the network know what it is talking about? How could it have predicted that the first word is ‘blue’? Even for the second word, how does it know we are talking about a ‘pill’?
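To make the setup concrete, here is a minimal sketch of the vanilla fixed-length batching (plain Python, with a naive whitespace split standing in for a real tokenizer; the variable names are mine, not the authors'):

```python
# Vanilla Transformer data preparation: the text is simply chopped into
# fixed-length segments and each segment is modelled in isolation.
text = ('Morpheus: "This is your last chance. After this, there is no turning '
        'back. You take the blue pill - the story ends, you wake up in your bed '
        'and believe whatever you want to believe. You take the red pill - you '
        'stay in Wonderland and I show you how deep the rabbit-hole goes."')

tokens = text.split()      # stand-in for a real subword tokenizer
segment_length = 16        # fixed context length

segments = [tokens[i:i + segment_length]
            for i in range(0, len(tokens), segment_length)]

for segment in segments:
    # Each segment is fed to the model independently. When predicting the
    # second token of a segment, the model only sees the first token of that
    # same segment; everything before the segment boundary is invisible to it.
    print(segment)
```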

Notice how x_5 cannot use x_4, x_3 … as part of its context
In the evaluation phase we have limited context and we need to re-compute the activations time and time again

XL stands for recurrence 😰

The name of the paper is misleading. The crux of this paper is the introduction of recurrence between the hidden states of consecutive segments. Let’s dive right into it.

To address the limitations of using a fixed-length context, we propose to introduce a recurrence mechanism to the Transformer architecture. During training, the hidden state sequence computed for the previous segment is fixed and cached to be reused as an extended context when the model processes the next new segment.

The gradient remains within a segment but the additional history allows the network to model long-term dependency and avoid context fragmentation.

The model can use previous segments’ information to compute next segments

The equations that define this recurrence are the following:

Equation defining the combined hidden state
The combined hidden state is used for computing the keys and values, the queries are computed using the segment’s hidden state
Finally we compute the layer’s output by running self-attention and feed-forward

where ∘ represents concatenation, SG is the stop-gradient operator, n is the layer number and τ is the segment number.
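Written out in the paper’s notation (with W_q, W_k, W_v the query, key and value projection matrices), the three equations read:

```latex
\tilde{\mathbf{h}}_{\tau+1}^{\,n-1} = \left[\,\mathrm{SG}\!\left(\mathbf{h}_{\tau}^{\,n-1}\right) \circ \mathbf{h}_{\tau+1}^{\,n-1}\,\right]

\mathbf{q}_{\tau+1}^{\,n},\; \mathbf{k}_{\tau+1}^{\,n},\; \mathbf{v}_{\tau+1}^{\,n}
  = \mathbf{h}_{\tau+1}^{\,n-1}\mathbf{W}_{q}^{\top},\;
    \tilde{\mathbf{h}}_{\tau+1}^{\,n-1}\mathbf{W}_{k}^{\top},\;
    \tilde{\mathbf{h}}_{\tau+1}^{\,n-1}\mathbf{W}_{v}^{\top}

\mathbf{h}_{\tau+1}^{\,n} = \text{Transformer-Layer}\!\left(\mathbf{q}_{\tau+1}^{\,n},\, \mathbf{k}_{\tau+1}^{\,n},\, \mathbf{v}_{\tau+1}^{\,n}\right)
```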

Notice that we have a recurrence between the hidden states. Also notice that the recurrence connects h^n of one segment to h^(n-1) of the previous segment, i.e. it shifts one layer down for every segment we look back. This means that the effective context taken into account grows linearly with both the number of layers and the segment length, O(N × L): with N layers and segments of length L, information can in principle flow from roughly N × L tokens back. This can be seen in the shaded area in the image below.

(…) during evaluation, the representations from the previous segments can be reused instead of being computed from scratch as in the case of the vanilla model. In our experiments on enwiki8, Transformer-XL is up to 1,800+ times faster than the vanilla model during evaluation.

Previous segment hidden vectors are reused and this saves time

Notice that the number of hidden states from previous segments that can be cached is not fixed to a single segment as in the example; in practice the authors cache as many previous hidden states as fit into GPU memory.
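A minimal PyTorch-style sketch of the mechanism (the function signature, the assumed layer interface and all names here are mine, not the authors’ code):

```python
import torch

def segment_step(layers, x_embed, memory):
    """Run one segment through the model with segment-level recurrence.

    layers  : list of Transformer layers; each is assumed to take
              (current_hidden, extended_context) and return new hidden states.
    x_embed : [segment_len, batch, d_model] embeddings of the current segment.
    memory  : per-layer cached hidden states from the previous segment(s),
              or None on the very first segment.
    """
    hidden = x_embed
    new_memory = []
    for n, layer in enumerate(layers):
        # Cache the input to this layer for the next segment; detach() plays
        # the role of the stop-gradient SG(.) in the equations above.
        new_memory.append(hidden.detach())

        if memory is not None:
            # Extended context: cached hidden states (no gradient) concatenated
            # with the current segment along the sequence dimension. Queries are
            # computed from `hidden` only; keys and values from `context`.
            context = torch.cat([memory[n], hidden], dim=0)
        else:
            context = hidden

        hidden = layer(hidden, context)
    return hidden, new_memory
```

During evaluation the same cache is what buys the speed-up: instead of recomputing every previous position from scratch for each new prediction, the stored hidden states are simply reused, and the cache can grow to span several previous segments if GPU memory allows.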

Positional Encodings

There is a problem with using the default positional encodings. Can you guess what it is?

Remember that the original Transformer uses absolute positional encodings: each position in the segment (which is padded to a fixed length) gets a specific vector that is added to the token embedding. See the problem yet? If we did the same in Transformer-XL, every segment would have the very same vectors added to it, irrespective of where the segment sits in the overall text; the token at position 0 of the cached segment and the token at position 0 of the current segment would be indistinguishable. The positional information would no longer tell the network which tokens are closer to each other, and that would hurt performance, since the proximity of words is an important signal when computing attention.

Original Transformer Positional Encodings would give the same value to words in different segments
Mathematical representation of using the encodings from the original Transformer, where E is the embedding and U the positional encoding
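That equation, reconstructed from the paper (s_τ and s_τ+1 are two consecutive segments and f is the Transformer), is:

```latex
\mathbf{h}_{\tau+1} = f\left(\mathbf{h}_{\tau},\; \mathbf{E}_{s_{\tau+1}} + \mathbf{U}_{1:L}\right)
\qquad
\mathbf{h}_{\tau} = f\left(\mathbf{h}_{\tau-1},\; \mathbf{E}_{s_{\tau}} + \mathbf{U}_{1:L}\right)
```

Both segments receive the exact same positional matrix U_1:L, so the model has no way to tell a token’s position in s_τ apart from the token at the same offset in s_τ+1.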

What do they propose? A new scheme based on relative encodings. The authors note that the network only needs to know the relative position of each key vector with respect to the query vector. To understand the specific encoding they suggest, let’s start from the original one. Since both the query matrix and the key matrix are multiplied by the matrix containing the embeddings plus the encodings, we can expand this multiplication to get:

Original Transformer query times key computation
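In LaTeX, the expansion shown above (the attention score between query position i and key position j, with E the word embeddings and U the absolute positional encodings, and the four terms labelled (a) to (d) so the substitutions below are easier to follow) is:

```latex
\mathbf{A}_{i,j}^{\mathrm{abs}}
  = \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\,\mathbf{E}_{x_j}}_{(a)}
  + \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\,\mathbf{U}_{j}}_{(b)}
  + \underbrace{\mathbf{U}_{i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\,\mathbf{E}_{x_j}}_{(c)}
  + \underbrace{\mathbf{U}_{i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_k\,\mathbf{U}_{j}}_{(d)}
```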

They then replace the four terms that contain the absolute encoding as follows:

New equation with the new encodings
  1. All the appearances of the absolute positional encoding U_j are replaced by the relative encoding R_i-j, where R is a sinusoidal encoding matrix with no learnable parameters.
  2. They introduce trainable parameters u and v to replace U_i*W_q. This is because the relative position of each key vector with respect to the query vector is already captured by the R_i-j term, so the attentive bias towards different positions should be the same for every query vector (that is to say, a word that is three positions behind word A has the same positional-encoding effect as a word that is three positions behind word C).
  3. They separate the key weight matrix into W_k,E and W_k,R for producing the content-based key vectors and the location-based key vectors, respectively.

Each term has an intuitive meaning here: (a) is content-based addressing, (b) is a content-dependent positional bias, (c) is a global content bias, and (d) is a global positional bias.
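Putting the three modifications together, the relative attention score from the paper reads:

```latex
\mathbf{A}_{i,j}^{\mathrm{rel}}
  = \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_{k,E}\,\mathbf{E}_{x_j}}_{(a)}
  + \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_{k,R}\,\mathbf{R}_{i-j}}_{(b)}
  + \underbrace{\mathbf{u}^{\top}\mathbf{W}_{k,E}\,\mathbf{E}_{x_j}}_{(c)}
  + \underbrace{\mathbf{v}^{\top}\mathbf{W}_{k,R}\,\mathbf{R}_{i-j}}_{(d)}
```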

Results

Transformer-XL achieves SOTA on the most important large-corpus language modelling datasets in English.

On WikiText-103, the largest word-level language modelling benchmark with long-term dependency, Transformer-XL improves the SOTA by 11%.

TXL achieved SOTA in WikiText-103

On enwiki8, Transformer-XL broke the 1.0 bpc barrier for the first time among widely-studied character-level benchmarks. Notably, the 12-layer Transformer-XL matches the result of the 64-layer vanilla Transformer while using only 17% of the parameters.

TXL achieved SOTA in enwiki8 breaking the 1.0 bpc

text8 is very similar to enwiki8 and again the model outperforms the previous SOTA (the same Vanilla Transformer).

TXL achieved SOTA in text8

The One Billion Word dataset does not preserve long-term dependency because its sentences have been shuffled. Although Transformer-XL was not designed for this setting, it still improved the single-model SOTA from 23.7 to 21.8 perplexity and outperformed the vanilla Transformer, suggesting that the advantage of Transformer-XL generalizes to short sequences as well.

TXL achieved single-model SOTA in One Billion Word dataset

Finally, the Penn Treebank dataset is very small (about 1M tokens). On this dataset Transformer-XL achieved SOTA among models without fine-tuning, suggesting that the model also generalizes well to small datasets.

TXL achieved non-finetuned SOTA in Penn-Treebank dataset

Ablation Studies

The authors performed two ablation studies: the first to understand whether the relative positional encodings make a difference, and the second to understand how much of the performance gain comes from solving context fragmentation and how much from preserving long-term coherence.

To test the effect of the positional encodings, the authors compared models with different encodings under both full and half losses. Half loss means computing the loss only over the second half of the segment, while full loss means computing it over the whole segment. They find that absolute positional encodings (Vaswani et al. and Al-Rfou et al.) only work well with half losses, because half losses deliberately exclude the positions with very short attention lengths. Their proposed encoding, together with recurrence and a long attention length, is needed to reach the lowest loss.

Different encodings with different losses

To test for context fragmentation, the authors evaluated on One Billion Word, a dataset with no long-term coherence, so that any improvement must come from addressing context fragmentation. They find that adding recurrence significantly decreases perplexity on this dataset, showing that the model does help solve context fragmentation.

TXL improves performance in One Billion Word, a dataset with shuffled sentences and no long-term coherence

Relative Effective Context Length

The authors devise the Relative Effective Context Length (RECL), a measure of how much a model benefits from an increased context span, relative to other models. They find that Transformer-XL learns dependencies that are about 80% longer than RNNs and 450% longer than the original Transformer. Both the recurrence mechanism and the new positional encodings contribute to this improvement, as we can see in the next figure.

RECL across models

Evaluation Speed

As discussed before, Transformer-XL achieves a 1,874 times speedup compared to the Vanilla Transformer due to state reuse (see Figure (b) Evaluation Phase).

Speed-up for different receptive attention lengths
