Transformer-XL: Unlocking Long-term Dependencies for Effective Language Modelling

Dhairya Patel
Saarthi.ai
Jun 25, 2019

Language modeling has become an important NLP technique, thanks to its application in tasks such as Machine Translation (MT) and topic classification. Previously, Recurrent Neural Networks (RNNs) and their variant, Long Short-Term Memory (LSTM) networks, formed the backbone of most Natural Language Processing (NLP) systems. More recently, however, the field has shifted towards language models built on unsupervised pre-training, such as ELMo and BERT.

Although the above-mentioned architectures have achieved impressive results, their main limitation is capturing long-term dependencies, e.g. using important words from the beginning of a document to predict words in a later part.

Transformer-XL tackles language modeling in an innovative way, overcoming the issues with long-term dependencies. Transformer-XL achieves state-of-the-art (SoTA) results on multiple language modeling datasets, such as WikiText-103 (word-level) and enwik8 and text8 (character-level), while being significantly faster during inference (roughly 300x-1,800x) than the previous SoTA Transformer architecture. In this article, we will learn about the key features of Transformer-XL and how it achieves such remarkable results.

Let’s get started!

A Little Background on Transformers

Before jumping straight into Transformer-XL, we need to understand the basics of the Transformer and how it works.

The Transformer, introduced in 2017, brought a new approach — attention modules. Instead of processing tokens one by one, attention modules receive a segment of tokens and learn the dependencies between all of them at once, using three learned weight matrices (Query, Key and Value) which together form an Attention Head.
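
To make this concrete, here is a minimal NumPy sketch of a single attention head. The shapes, weight names and toy data below are purely illustrative, not taken from any particular implementation:

```python
import numpy as np

def attention_head(X, W_q, W_k, W_v):
    """One attention head over a segment of token embeddings X (seq_len x d_model)."""
    Q = X @ W_q                      # queries: what each token is looking for
    K = X @ W_k                      # keys: what each token offers
    V = X @ W_v                      # values: the content that gets mixed
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V               # each output is a weighted mix of values

# Toy usage: a segment of 4 tokens, model dim 8, head dim 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = attention_head(X, W_q, W_k, W_v)   # shape (4, 4)
```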

The Transformer network consists of multiple stacked layers, each with several Attention Heads (plus feed-forward sub-layers), used to learn different relationships between tokens.

Transformer architecture. Source: Analytics India

As we have seen in several NLP models, the input tokens are first embedded into vectors. Because the attention module processes all tokens of a segment concurrently, the model also needs information about the order of the tokens. This step, called positional encoding, helps the network learn each token's position. It is usually done with a sinusoidal function that generates a vector from the token's position, without any learned parameters.
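
For reference, the standard sinusoidal encoding can be sketched as follows (a minimal NumPy version, assuming an even model dimension):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed (non-learned) positional encodings, assuming an even d_model."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)                      # sine on even dimensions
    enc[:, 1::2] = np.cos(angles)                      # cosine on odd dimensions
    return enc

# The encoding is simply added to the token embeddings:
# inputs = token_embeddings + sinusoidal_positions(seq_len, d_model)
```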

While Transformers were originally used for Machine Translation (with an encoder-decoder mechanism), Al-Rfou et al. presented an architecture for language modeling. The goal is to predict a character in a segment based on its previous characters: for example, the model predicts character xₙ using x₁ … xₙ₋₁, while the characters to its right are masked.

This 64-layer vanilla Transformer model is limited to relatively short inputs of only 512 characters, so it splits the input into separate segments and learns from each one independently. To process longer inputs at evaluation time, it predicts one character at a time, shifting the input window by one position at each step.
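
The training and evaluation scheme just described can be sketched roughly like this; `model` and its `log_prob` method are placeholders for illustration, not part of any real library:

```python
SEG_LEN = 512

def training_segments(text):
    # Training: chop the corpus into disjoint 512-character segments;
    # each segment is processed on its own, so no context crosses a boundary.
    return [text[i:i + SEG_LEN] for i in range(0, len(text), SEG_LEN)]

def evaluate(model, text):
    # Evaluation: slide the window one character at a time, re-encoding a
    # full 512-character context just to predict a single next character.
    # `model.log_prob` is a hypothetical scorer of one character given its context.
    log_probs = []
    for i in range(SEG_LEN, len(text)):
        context = text[i - SEG_LEN:i]
        log_probs.append(model.log_prob(next_char=text[i], context=context))
    return log_probs
```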

Vanilla Transformer with a fixed-length context at training time. Source: Google blog

This model outperforms RNN models on popular benchmarks; however, it still suffers from two shortcomings:

  1. Limited context-dependency — The maximum dependency distance between characters is limited to the length of the input. For example, the model can’t “use” a word that appeared several sentences ago.
  2. Context fragmentation — The model lacks the contextual information needed to predict the first few symbols of a segment, because segments are chosen without respect to sentence or semantic boundaries. For texts longer than 512 characters, every segment of that size is trained separately from scratch, so there is no context (no dependencies) at all for the first tokens of each segment or between segments. This leads to inefficient training and can hurt model performance.
Vanilla Transformer with a fixed-length context at evaluation time. Source: Google blog

Note: A detailed explanation of the Transformer can be found here.

A look into Transformer-XL

Transformer-XL builds heavily on the vanilla Transformer, but introduces two innovative techniques — Recurrence Mechanism and Relative Positional Encoding — to overcome the limitations of its predecessor. An additional advantage of Transformer-XL is that it can be used for both word-level and character-level language modeling.

What is the Recurrence Mechanism?

The goal of the recurrence mechanism is to enable long-term dependencies by using information from previous segments. In particular, instead of computing the hidden states from scratch for each new segment, it reuses the hidden states obtained for previous segments.

The reused hidden states serve as memory for the current segment, which builds a recurrent connection between segments. As a result, modeling very long-term dependencies becomes possible because information can propagate through the recurrent connections without disrupting temporal coherence. At the same time, passing along the information acquired in the previous segment also resolves the problem of context fragmentation.

Like the vanilla Transformer, Transformer-XL processes the first segment of tokens, but it keeps the outputs of the hidden layers to be used with the following segments. When the next segment is processed, each hidden layer receives two inputs: the output of the previous hidden layer of that segment, and the output of the previous hidden layer from the previous segment. This allows the model to create long-term dependencies.
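
In rough sketch form, the segment-level recurrence might look like the Python below. The `layers` interface is hypothetical; the point is only to show how each layer's input is cached and reused as memory when the next segment arrives:

```python
def forward_with_recurrence(layers, segments):
    """Process segments in order, reusing each layer's previous-segment states as memory."""
    memory = [None] * len(layers)              # one cached state per layer
    outputs = []
    for segment in segments:
        hidden = segment                       # embedded tokens of the current segment
        new_memory = []
        for i, layer in enumerate(layers):
            new_memory.append(hidden)          # cache this layer's input for the next segment
            hidden = layer(hidden, memory[i])  # attend over current input + cached memory
        # In a real implementation the cached tensors are detached (stop-gradient),
        # so gradients never flow back into previous segments.
        memory = new_memory
        outputs.append(hidden)
    return outputs
```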

Transformer-XL with segment-level recurrence at training time. Source: Google blog

Technically, the two inputs are concatenated and then used to calculate the Key and Value matrices of the current segment. This gives the network more information about the relevance (weighting) of each token. The Query matrix, however, is still computed from the current segment only, so each current token attends over both the cached memory and the current segment.
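
Here is a simplified sketch of how the cached memory enters the attention computation (single head, no causal mask and no relative positions yet; names and shapes are illustrative):

```python
import numpy as np

def attention_with_memory(h_current, h_memory, W_q, W_k, W_v):
    """Single attention head where keys/values also see the cached previous segment.

    h_current: hidden states of the current segment (cur_len x d)
    h_memory:  cached hidden states from the previous segment (mem_len x d)
    """
    extended = np.concatenate([h_memory, h_current], axis=0)  # memory + current
    Q = h_current @ W_q          # queries come from the current segment only
    K = extended @ W_k           # keys cover the cached memory as well
    V = extended @ W_v           # values cover the cached memory as well
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V           # each current token mixes information from both segments
```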

This concept can be extended to incorporate even longer dependencies by using information from several previous segments in the same way (within the limits of GPU memory). An added advantage of the recurrence mechanism is its speed at evaluation time: at each step, it can advance by an entire segment (rather than by one token, as in the vanilla Transformer) and use the previous segments' data to predict the current segment's tokens.

Relative Positional Encoding

The recurrence mechanism introduces a new challenge: the original positional encoding handles each segment separately, so tokens from different segments end up with the same positional encoding. For example, the first token of the first segment and the first token of the second segment will have the same encoding, even though their positions and importance differ. This ambiguity can confuse the network.

To properly reuse the hidden states, the authors propose relative positional encodings, which avoid this temporal confusion. Models with absolute encodings cannot distinguish the positional difference between inputs in different segments at different layers. Relative positional encoding addresses this by injecting the positional bias into the hidden states at every layer, which differs from other approaches that add this information only at the input level.

Since a Transformer architecture is involved, this is achieved by computing the relative distance between each key vector and query vector and injecting it into the attention score. With a new parameterization of the terms used to derive the attention score between a query and a key, the relative position information can be incorporated. The recurrence mechanism, equipped with this relative positional encoding, forms the key contribution of the proposed Transformer-XL architecture.
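
Schematically, the relative attention score for a single query-key pair decomposes into four terms, as described in the paper. The sketch below spells them out for one pair; real implementations vectorize this with an efficient shift trick, and the variable names here are illustrative:

```python
import numpy as np

def relative_attention_score(q_i, e_j, r_rel, W_kE, W_kR, u, v):
    """Attention score between query position i and key position j (single head).

    q_i:   query vector for position i (already projected by the query matrix)
    e_j:   content embedding of the key token at position j
    r_rel: sinusoidal encoding of the relative distance i - j
    u, v:  learned global bias vectors shared across all positions
    """
    k_content  = W_kE @ e_j      # content-based key
    k_position = W_kR @ r_rel    # position-based key, depends only on i - j
    return (q_i @ k_content      # (a) content-based addressing
            + q_i @ k_position   # (b) content-dependent positional bias
            + u @ k_content      # (c) global content bias
            + v @ k_position)    # (d) global positional bias
```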

Transformer-XL with segment-level recurrence at evaluation time. Source: Google blog

Results

Transformer-XL has obtained new state-of-the-art (SoTA) results on a variety of major Language Modeling (LM) benchmarks, including character-level and word-level tasks on both long and short sequences. Empirically, Transformer-XL enjoys three benefits:

  1. Transformer-XL learns dependencies that are approximately 80% longer than those of RNNs and 450% longer than those of vanilla Transformers, which generally perform better than RNNs but are not ideal for long-range dependency modeling because of their fixed-length contexts.
  2. Transformer-XL is more than 1,800 times faster than a vanilla Transformer during the evaluation on language modeling tasks because no re-computation is needed.
  3. Transformer-XL has better performance in perplexity (more accurate at predicting a sample) on long sequences because of long-term dependency modeling, and also on short sequences by resolving the context fragmentation problem.

Transformer-XL has achieved the State of The Art(SoTA) perplexity on several benchmark datasets.

  1. On WikiText-103, a large word-level dataset, the 18-layer Transformer-XL model with 257M parameters reached a perplexity of 18.3, compared to the previous SoTA of 20.5.
  2. On enwik8, a character-level dataset, the 24-layer Transformer-XL achieved 0.99 bpc (bits per character), an improvement over the previous SoTA of 1.06 bpc.
  3. Transformer-XL also achieved impressive results on a dataset with only short-term dependencies, the One Billion Word corpus, which contains only individual sentences: a perplexity of 21.8, compared to the previous SoTA of 23.7.
  4. Transformer-XL also achieves SoTA perplexity on a small dataset, Penn Treebank, with only 1M tokens and without fine-tuning: 54.5, compared to the previous best of 55.3.

Conclusion

The two key features of Transformer-XL, the recurrence mechanism and relative positional encoding, have helped it overcome the existing issues of context fragmentation and limited long-term dependency. This has resulted in better perplexity than RNNs and the vanilla Transformer. It is also substantially faster during evaluation and is able to generate coherent text articles.

Transformer-XL is the Optimus Prime of the Transformer. Bigger and Better…

Thanks for reading!

If you have any questions, feel free to comment; I'd be really happy to help you out. If this article has been helpful to you in any way, don't hesitate to click the clap button!

For more articles around various research topics in NLP, follow the dialogue.
