Summary of a novel technique for attentive language modeling that supports longer-term dependency.

Jan 11


Language modeling has recently been addressed with unsupervised pre-training methods such as ELMo and BERT. However, it remains a challenge to properly equip neural networks with long-term dependency.

Recent models were designed with an attention mechanism that helps ease optimization (by mitigating vanishing gradients) and enables the learning of long-term dependency. However, in these models the context is of fixed length, so the model cannot capture longer-term dependency and suffers from a problem known as context fragmentation.

Context fragmentation refers to the situation where the model lacks the contextual information needed to predict the first few symbols of a segment because of the way the context was selected, usually without respect to sentence or other semantic boundaries.

Moreover, previous models don't support information flow across segments during training and employ a fixed context length, which leaves no room for the model to capture longer-term dependency.

In the context of language modeling, hidden states can be reused to allow information flow across segments (a kind of memory). This could help to support longer-term dependency and deal with context fragmentation. However, for the architecture to support state reuse, temporal coherence must be managed, as we discuss next.


During training, vanilla language models don't make effective use of context information and segments are treated individually. In addition, semantic boundaries are usually not respected during segmentation, since most methods employ standard chunked sequences of fixed lengths. During evaluation, fixed-length contexts are used and each segment is processed from scratch, which becomes expensive, even though context fragmentation is somewhat addressed. This paper tackles these inefficiencies by better modeling longer-term dependency.
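To make the fixed-length segmentation concrete, here is a minimal, hypothetical sketch of the vanilla chunking step. The `chunk` helper is an illustration of mine, not code from the paper: it splits a token stream into fixed-size segments with no regard for sentence boundaries, so the first tokens of each segment lose their preceding context.

```python
def chunk(tokens, seg_len):
    """Split a token list into consecutive fixed-length segments."""
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]

tokens = "the cat sat on the mat and then it slept".split()
segments = chunk(tokens, 4)
# Each segment is processed independently, with no information flow between
# them: ['the', 'cat', 'sat', 'on'], ['the', 'mat', 'and', 'then'], ['it', 'slept']
```

Note how the second segment starts mid-sentence: a vanilla model predicting "the" at its start sees no context at all, which is exactly the fragmentation problem described above.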

In language modeling, Transformer networks are limited by a fixed-length context and thus can be improved through learning longer-term dependency. The paper proposes a novel method called Transformer-XL (meaning extra long) for language modeling, which enables a Transformer architecture to learn longer-term dependency — via a recurrence mechanism — beyond a fixed length without disrupting temporal coherence.

The method differs from previous approaches that rely on other strategies to support long-term dependency, such as additional loss signals and augmented memory structures.

A segment-level recurrence mechanism is introduced which enables the model to reuse previous hidden states at training time, addressing both the fixed-length context and context fragmentation issues. In other words, the historical information can be reused, and the cached context can be extended as far as GPU memory allows. See the training and evaluation phases in the figure below.
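The recurrence idea can be sketched as follows. This is a minimal toy version of mine, not the paper's implementation: attention is reduced to a single unprojected dot-product layer, and the cached states are stored as plain constant arrays (in the real model they are cached without gradient flow). The point it illustrates is that queries come from the current segment, while keys and values span the cached memory plus the current segment.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_with_memory(segment, memory):
    """One attention step over [memory ; segment] (toy sketch)."""
    d_model = segment.shape[-1]
    context = np.concatenate([memory, segment], axis=0)  # keys/values span both
    scores = segment @ context.T / np.sqrt(d_model)      # queries: current segment
    out = softmax(scores) @ context                      # attended output
    return out, segment.copy()                           # cache states as next memory

rng = np.random.default_rng(0)
d_model, seg_len, mem_len = 8, 4, 4
memory = np.zeros((mem_len, d_model))
for _ in range(3):  # three consecutive segments
    segment = rng.normal(size=(seg_len, d_model))
    out, memory = layer_with_memory(segment, memory)
# Each segment now attends over states from the previous one, so the
# effective context grows beyond a single fixed-length window.
```

In the actual architecture this caching happens per layer, so with stacked layers the effective context grows linearly with depth as well as with segment length.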

Transformer-XL — training and evaluation phase (figure source)

To properly reuse hidden states, the authors propose a mechanism called relative positional encodings which helps to avoid temporal confusion. Previous models can't distinguish the positional difference between inputs in different segments at different layers. Relative positional encoding addresses this problem by encoding the positional information bias in the hidden states, which differs from other approaches that do this at the input level.
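A minimal sketch of what such an encoding can look like, assuming the same sinusoid formula as the original Transformer but indexed by relative distance rather than absolute position (the function name and shapes here are my own illustration):

```python
import numpy as np

def relative_positional_encoding(max_dist, d_model):
    """Return R of shape (max_dist, d_model); row r encodes relative distance r."""
    dist = np.arange(max_dist)[:, None]                          # distances 0..max_dist-1
    inv_freq = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    angles = dist * inv_freq[None, :]                            # (max_dist, d_model/2)
    R = np.zeros((max_dist, d_model))
    R[:, 0::2] = np.sin(angles)                                  # even dims: sine
    R[:, 1::2] = np.cos(angles)                                  # odd dims: cosine
    return R

R = relative_positional_encoding(max_dist=16, d_model=8)
# R[0] encodes "same position", R[1] a distance of one token, and so on.
```

Because the encoding depends only on the distance between two positions, the same row of `R` is valid no matter which segment the tokens came from, which is what keeps the cached states temporally coherent.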

Since a Transformer architecture is involved, the process above is achieved by computing the relative distance between each key vector and query vector and injecting it into the attention score. With a new parameterization of the terms used to derive the attention score between query and key, the relative position information can be incorporated. The recurrence component is now equipped with the proposed relative positional embedding, and this whole procedure constitutes the proposed Transformer-XL architecture.
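The reparameterized attention score decomposes into four terms (content-content, content-position, and two global biases). The sketch below computes that score for a single query/key pair with random stand-ins for the learned projections; `u` and `v` are the trainable bias vectors that replace the absolute-position terms in the query:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
E_i = rng.normal(size=d)          # content embedding of the query position i
E_j = rng.normal(size=d)          # content embedding of the key position j
R_ij = rng.normal(size=d)         # relative position embedding for distance i - j
W_q, W_kE, W_kR = (rng.normal(size=(d, d)) for _ in range(3))
u, v = rng.normal(size=d), rng.normal(size=d)

q = W_q @ E_i
score = (
    q @ (W_kE @ E_j)              # (a) content-content
    + q @ (W_kR @ R_ij)           # (b) content-dependent positional bias
    + u @ (W_kE @ E_j)            # (c) global content bias
    + v @ (W_kR @ R_ij)           # (d) global positional bias
)
# `score` would then be scaled and softmaxed across all key positions.
```

Grouping the terms as `(q + u)` against content keys and `(q + v)` against positional keys shows how the four terms can still be computed with two matrix products in practice.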


Transformer-XL obtains strong results for both word-level and character-level language modeling applied to a variety of datasets such as WikiText-103, text8, and One Billion Word.

The proposed model is compared with a vanilla model that was recently used for character-level language modeling (Al-Rfou et al., 2018), which also leverages deeper self-attention. Note that the vanilla model cannot support dependency lengths larger than the upper bound segment length.

Transformer-XL improves on the previous SoTA results on several datasets such as text8, enwik8, One Billion Word, and WikiText-103. Besides the SoTA performance, the authors claim that the method is more flexible, faster during evaluation (up to 1,874 times speedup), generalizes well on small datasets, and is effective at modeling both short and long sequences. See a summary of some of the results obtained on the different datasets in the tables below.

You can check the rest of the results in the full paper linked below.

Other Benefits

An ablation study to examine the effects of both the recurrence mechanism and the proposed positional encoding scheme is provided in the paper as well.

The authors also propose a new metric called Relative Effective Context Length (RECL) that provides a fair way to compare models that are tested with increased context lengths.

Further Reading

If enough interest is expressed, I may feel tempted to prepare a code walkthrough for this work. It contains many different components that could be interesting and useful for NLP practitioners and researchers.

Diverse Artificial Intelligence Research & Communication


Written by


ML-NLP Research Scientist | Ph.D. | Educator | Speaker | Find me on Twitter and LinkedIn.