Understanding Language using XLNet with autoregressive pre-training

An explanation of the background behind XLNet and why it outperforms BERT on NLP tasks

Maggie Xiao
11 min read · May 4, 2020

Not long after BERT, developed by Google, took the Natural Language Processing (NLP) community by storm, researchers from Carnegie Mellon University and the Google AI Brain team presented XLNet in a NeurIPS 2019 conference paper, leaving quite an impression on the NLP community. XLNet outperforms BERT on 20 NLP benchmark tasks, often by a large margin, which makes it not only exciting for researchers but also important for NLP practitioners.

XLNet leverages the best of both autoregressive (AR) language modeling and autoencoding (AE), the two most well-known pretraining objectives, while avoiding their limitations. The method can be applied to a variety of NLP downstream language tasks including question answering, sentiment analysis, natural language inference, document ranking and so on.

Considered one of 2019’s most important developments in NLP, XLNet combines the autoregressive language model Transformer-XL and the bidirectional capability of BERT to unleash the power of this important language modeling tool.

As DeepMind researcher Sebastian Ruder put it: “The king is dead. Long live the king. BERT’s reign might be coming to an end. XLNet, a new model by people from CMU and Google outperforms BERT on 20 tasks.”

In this blog post, we will look at the exciting development of XLNet from Z. Yang et al.’s paper, the new go-to technique for transfer learning, and why it performs better than BERT, the previous state-of-the-art pretraining approach. Since much previous work has paved the way for XLNet, we will also cover the important background related to it.

State-of-the-Art

Natural language processing (NLP) tools are becoming increasingly important in machine translation, reading comprehension and summarization, question answering, and sentiment analysis. The typical approach has been supervised learning on task-specific datasets. In recent years, unsupervised representation learning methods such as BERT (Bidirectional Encoder Representations from Transformers), CoVe (Context Vectors), and ELMo (Embeddings from Language Models) have gained increasing attention. Language modeling is essentially predicting the next word in a sentence given the previous words. These methods pretrain neural networks on large-scale unlabeled text corpora before fine-tuning the models for downstream tasks. In other words, language modeling involves two phases: the pretraining phase and the fine-tuning phase.

XLNet combines the advantages of AR and AE

For the pretraining phase, the two most successful objectives have been autoregressive (AR) language modeling and autoencoding (AE). Before seeing how XLNet achieves its performance, we will dive into these two pretraining approaches, whose advantages XLNet combines. Here, we will see how they work and what their limitations are:

1. Autoregressive (AR) Language Modeling

A conventional AR model encodes unidirectional context, in either the forward or the backward direction of a text sequence. This is useful for generative NLP tasks that produce text in the forward direction. However, AR falls short when bidirectional context needs to be utilized simultaneously, which becomes problematic for downstream language understanding tasks where bidirectional context information is required.

Generative Pre-Training (GPT) and GPT-2 from OpenAI are both standard AR models.

Google DeepMind’s WaveNet illustrates the feed-forward fashion of an autoregressive model.

In AR, a parametric model such as a neural network is trained to model the probability distribution of a text corpus, factorized into either a forward or a backward product of conditionals, where each token is conditioned on the tokens before (or after) it.
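Concretely, for a sequence x = (x_1, …, x_T), a forward AR model maximizes the likelihood under the forward factorization (equation (1) in the XLNet paper), where h_θ(x_{1:t-1}) is the context representation produced by the network and e(x) denotes the embedding of x:

\max_\theta \; \log p_\theta(x) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) = \sum_{t=1}^{T} \log \frac{\exp\big(h_\theta(x_{1:t-1})^{\top} e(x_t)\big)}{\sum_{x'} \exp\big(h_\theta(x_{1:t-1})^{\top} e(x')\big)}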

2. Autoencoding (AE) Language Modeling

An AE-based model can model bidirectional contexts by reconstructing the original text from a corrupted input (with some tokens replaced by [MASK]). An AE model is therefore better than an AR model at capturing bidirectional context.

A notable example of AE is BERT, which is based on denoising autoencoding. However, it suffers from a pretrain-finetune discrepancy: the [MASK] symbol used in the pretraining stage is absent from the real data used in downstream tasks, including the fine-tuning stage. Moreover, although high-order, long-range dependency is characteristic of natural language, BERT oversimplifies the problem by assuming that the predicted (masked) tokens are independent of each other given the unmasked tokens.

While an AR model can factorize the probability of a sequence exactly into a forward or backward product of conditional distributions, BERT cannot model the joint probability with the product rule because of its independence assumption over the masked tokens.
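Writing the two objectives side by side makes the difference explicit. With x̂ the corrupted input, x̄ the set of masked tokens, and m_t = 1 when x_t is masked, BERT maximizes (equation (2) in the paper):

\max_\theta \; \log p_\theta(\bar{x} \mid \hat{x}) \approx \sum_{t=1}^{T} m_t \, \log p_\theta(x_t \mid \hat{x})

The ≈ sign is exactly where the independence assumption enters: the masked tokens are reconstructed separately given the corrupted input, whereas the AR objective above is an equality because it applies the product rule with no approximation.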

3. How does XLNet differ from conventional AR and AE (BERT)?

The authors of XLNet propose to retain the benefits of AR language modeling while letting the model learn from bidirectional context, as AE models (e.g., BERT) do, during the pretraining phase. Unlike in BERT, the interdependency between tokens is preserved. The proposed new objective is called “Permutation Language Modeling”.

Illustration of the permutation language modeling objective from the paper. The “mem” nodes hold read-only cached activations from the previous segment; gradients are not back-propagated into them.

The basic idea behind this modeling is “permutations”. In the illustration above from the paper, we see an example of predicting the token x3 given the same input sequence x1 → x2 → x3 → x4 with 4 tokens. For a sentence with N tokens, there are N! permutations of the factorization order; in this case there are a total of 24, of which the illustration demonstrates 4. In each permutation (factorization order), the (t-1) tokens that precede the token of interest (at the t-th position) are fed forward through the hidden layers to predict that t-th token. In this example, we are predicting x3. The benefit of permutation language modeling is that, by varying the factorization order, the model captures information from both sides. Note that the input sequence itself is not randomly permuted, since we need to preserve the natural order during finetuning; only the factorization order is permuted.
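A toy sketch in plain Python (hypothetical token names, no model involved) makes this concrete: across the 4! = 24 factorization orders, x3 ends up being predicted from contexts on both its left and its right.

```python
from itertools import permutations

tokens = ["x1", "x2", "x3", "x4"]
target_index = 2  # we want to predict x3

# For each factorization order, the target is predicted from whatever
# tokens happen to precede it in that order (not in the natural order).
for order in permutations(range(len(tokens))):
    position = order.index(target_index)
    context = [tokens[i] for i in order[:position]]
    print(f"order {[tokens[i] for i in order]} -> predict x3 from {context}")
```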

Here, the goal is to maximize the expected log-likelihood of a sequence over all possible permutations of the factorization order. The following permutation language modeling objective formalizes this idea, where the first (t-1) tokens in the factorization order are used to predict the t-th token.

Permutation language modeling objective, with the following notation:
  • Z_T: the set of all possible permutations (factorization orders) of a length-T index sequence
  • z ~ Z_T: a factorization order sampled from Z_T
  • p_θ: the likelihood function
  • x_{z_t}: the t-th token in the factorization order
  • x_{z<t}: the first (t-1) tokens that precede the t-th position in the factorization order
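Written out, the objective from the paper is:

\max_\theta \; \mathbb{E}_{z \sim Z_T} \left[ \sum_{t=1}^{T} \log p_\theta\big(x_{z_t} \mid x_{z<t}\big) \right]

Because the parameters θ are shared across all factorization orders, each token effectively learns to be predicted from every possible context, which is how bidirectional information is captured while keeping the AR product rule intact.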

However, as the paper points out, a naive implementation with the standard Transformer parameterization does not work, because the standard Transformer fails to fulfill the following two requirements:

  1. predict the token x_{z_t} based only on the position z_t, not on the content of x_{z_t};
  2. predict the token x_{z_t} with the content of all tokens before x_{z_t} (in the factorization order) encoded.

One key property of the Transformer is that it adds the position encoding into the token embedding; hence, the position information is inseparable from the token content. For permutation language modeling this poses a problem: the representation of the context does not depend on which position is being predicted, so the standard parameterization would produce an identical prediction distribution for different target positions.

Transformer with positional encoding vectors added to the embeddings. Illustration from The Illustrated Transformer.

Different from BERT and other Transformer-based models that combine position embedding and content embedding for prediction, XLNet predicts the next-token distribution by additionally taking the target position z_t as input. This brings us to the Two-Stream Self-Attention architecture that XLNet proposes.
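In the paper’s notation, the next-token distribution is re-parameterized so that the hidden representation g_θ is conditioned on the target position z_t as well as on the preceding content:

p_\theta\big(X_{z_t} = x \mid x_{z<t}\big) = \frac{\exp\big(e(x)^{\top} g_\theta(x_{z<t}, z_t)\big)}{\sum_{x'} \exp\big(e(x')^{\top} g_\theta(x_{z<t}, z_t)\big)}

Computing g_θ is the non-trivial part: it must know where z_t is without seeing what x_{z_t} is, which is precisely what the two streams below provide.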

XLNet uses Two-Stream Self-Attention Architecture to be target-aware

The Two-Stream Self-Attention architecture is employed to address the problems the traditional Transformer poses. Just as the name suggests, the architecture consists of two types of self-attention. The first is the content stream representation, the same as standard self-attention in the Transformer, which considers both the content (x_{z_t}) and the position information (z_t). The other is the query representation, which essentially replaces the [MASK] from BERT; it is learned by query stream attention to predict x_{z_t} using only the position information, not the content. Only the position information of the target token and the context information before that token are available to it.

XLNet with two sets of hidden representations. The content representation uses the content stream, as in the standard Transformer(-XL). The query representation helps compute the target-position-aware next-token distribution.
Two-Stream Self-Attention for Target-Aware Representations, from XLNet paper (with annotations)

The end result of two-stream attention is a target-aware prediction distribution. The main difference between XLNet and BERT is that XLNet is not based on data corruption as BERT is, so it avoids BERT’s limitations arising from masking, described earlier for the AE model.
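The following is a heavily simplified sketch (single attention head, random vectors standing in for learned representations, no factorization-order masks) of how the two streams differ in what they are allowed to attend to; it illustrates the idea rather than the actual XLNet implementation.

```python
import torch
import torch.nn.functional as F

def attend(q, k, v):
    # plain scaled dot-product attention
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

seq_len, d = 4, 8
h = torch.randn(seq_len, d)  # content representations (token content + position)
g = torch.randn(seq_len, d)  # query representations (position only, no content)

t = 2  # factorization step whose token we want to predict

# Content stream: may attend to itself and to earlier steps in the
# factorization order, using full content information.
h_t = attend(h[t:t + 1], h[:t + 1], h[:t + 1])

# Query stream: the query g_t carries the target position but not the
# token's content, and it may only attend to the *earlier* content
# representations, so the model cannot peek at the token it predicts.
g_t = attend(g[t:t + 1], h[:t], h[:t])
```

Here g_t is what feeds the target-aware softmax above, while h_t is passed on to the next layer as in a standard Transformer.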

Comparison of standard transformer and target-aware two-stream self-attention.

Transformer-XL

XLNet integrates the relative positional encoding scheme and the segment recurrence mechanism from Transformer-XL to capture dependencies farther away than RNNs and the Transformer could. The blog post here gives a light introduction to Transformer-XL. The relative positional encoding is applied based on the original sequence. The segment-level recurrence mechanism avoids the context fragmentation caused by processing fixed-length segments in isolation: it allows hidden states from past segments to be reused with the new segment. Transformer-XL realizes this by carrying segment-level recurrence in the hidden states. The following illustration shows the major difference between the (vanilla) Transformer and Transformer-XL.

Comparison between vanilla Transformer and Transformer-XL.

As mentioned earlier in this blog post, the standard Transformer carries positional information in the positional encodings, matrix U, using absolute positions. Transformer-XL, on the other hand, encodes the relative distance dynamically into the attention score by introducing a matrix R of relative positional encodings. In the attention score for Transformer-XL, the four terms represent content-based addressing, content-dependent positional bias, global content bias, and global positional bias, respectively. With Transformer-XL, coherent text articles can be generated, and there is also a substantial speedup during evaluation compared to RNNs and the standard Transformer.
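For reference, the relative attention score between query position i and key position j in the Transformer-XL paper decomposes into exactly those four terms, labelled (a)–(d) in the same order:

A^{rel}_{i,j} = \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j}}_{(a)} + \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j}}_{(b)} + \underbrace{u^{\top} W_{k,E} E_{x_j}}_{(c)} + \underbrace{v^{\top} W_{k,R} R_{i-j}}_{(d)}

Here E_{x_i} is the token embedding, R_{i-j} the relative positional encoding, and u, v are learned global bias vectors that replace the absolute position of the query.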

In XLNet, Transformer-XL is integrated into the pretraining framework. The recurrence mechanism from Transformer-XL is incorporated into the proposed permutation setting so that hidden states from previous segments can be reused. The factorization order used for a previous segment is not cached or reused later; only the content representation of the segment is retained in the hidden states.
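A minimal sketch of segment-level recurrence, assuming a single attention layer with no relative encoding and no causal or permutation masks: keys and values are drawn from the cached memory concatenated with the current segment, queries only from the current segment, and the cache is never back-propagated into.

```python
import torch
import torch.nn.functional as F

def segment_attention(h_curr, mems, w_q, w_k, w_v):
    """One attention pass with Transformer-XL-style segment recurrence."""
    # Cached states from the previous segment are reused but detached,
    # so no gradient flows back into the older segment.
    context = torch.cat([mems.detach(), h_curr], dim=0)
    q = h_curr @ w_q                     # queries: current segment only
    k, v = context @ w_k, context @ w_v  # keys/values: memory + current segment
    scores = q @ k.T / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

d = 16
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
mems = torch.randn(8, d)    # hidden states cached from the previous segment
h_curr = torch.randn(8, d)  # hidden states of the current segment
out = segment_attention(h_curr, mems, w_q, w_k, w_v)  # shape (8, d)
```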

XLNet Results

XLNet combines the bidirectional capability of BERT and the autoregressive technology of Transformer-XL to achieve a substantial improvement; it beats BERT on more than a dozen tasks. Empirically, XLNet outperforms BERT on:

  • GLUE language understanding tasks
  • Reading comprehension tasks (SQuAD and RACE)
  • Text classification tasks (Yelp and IMDB)
  • ClueWeb09-B document ranking task
  • etc.

To see which design choices affect XLNet’s performance the most, the authors carried out an interesting ablation study on Wikipedia and BooksCorpus. The results are shown in the following table.

Ablation studies show the superior performance of XLNet compared to BERT.

The pretrained models are evaluated on downstream tasks to justify the design choices in XLNet. In particular, the Transformer-XL backbone and the permutation LM play a major role in improving XLNet’s performance over that of BERT.

  • RACE (ReAding Comprehension from Examinations) dataset is a challenging benchmark for long text understanding.
  • SQuAD (Stanford Question Answering Dataset) is a large-scale reading comprehension dataset with paragraphs and corresponding questions.
  • GLUE (General Language Understanding Evaluation) benchmark consists of 9 natural language understanding tasks. GLUE/SST-2 (Stanford Sentiment Treebank) consists of sentences from movie reviews with their sentiment labels. GLUE/MNLI (Multi-Genre Natural Language Inference Corpus) is used for entailment analysis.

Beyond NLP, XLNet may also be able to unleash its power in computer vision tasks and reinforcement learning problems.

Implementation code is available from:

The authors of the XLNet paper have released the pretrained models and code. A simple XLNet implementation with PyTorch wrapper is also available from GitHub developers.
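As a quick illustration, here is a minimal usage sketch assuming the Hugging Face transformers package (which wraps the released pretrained weights) and its XLNet classes; check the official repository or the library documentation for the exact, up-to-date API.

```python
# Minimal sketch: load a pretrained XLNet and run a forward pass.
import torch
from transformers import XLNetTokenizer, XLNetLMHeadModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")

inputs = tokenizer(
    "XLNet combines autoregressive pretraining with bidirectional context.",
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)
```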

Main paper: XLNet: Generalized Autoregressive Pretraining for Language Understanding — (Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le)

Appendix: Related Work and Further Readings

  • [Transformer] Attention Is All You Need (2017) Since two main attention papers (content-based attention in Graves et al., 2014, and location-based attention in Luong et al., 2015) were published, attention mechanisms have been actively researched in many areas. Attention is particularly useful for handling long source sentences, unlike the plain encoder-decoder seq2seq model. In 2017, researchers from Google Brain and Google Research made significant improvements by introducing the Transformer architecture. Rather than using recurrent models (e.g., RNN, LSTM) within encoder-decoder architectures, which can be fairly complex, the Transformer relies on attention alone to draw global dependencies between the input and output. On two machine translation tasks, the Transformer model proved superior in quality while requiring much less training time.

The following diagram illustrates how the transformer works on a high-level.

The Transformer allows a degree of parallelization that was unprecedented in RNN-based models. In the paper, a stack of six encoders and six decoders is used, replacing the recurrent layers that had commonly been used in encoder-decoder architectures. This is the first sequence transduction model based entirely on attention.
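At the heart of the architecture is scaled dot-product self-attention, which for queries Q, keys K, and values V of key dimension d_k computes:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

Because every position attends to every other position in a single matrix operation, the whole sequence can be processed in parallel, which recurrent layers prevented.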

This blog post gives a good summary on self-attention at a high level. If you want to go deeper, this post from Google AI blog presents an excellent summary on Transformer.

  • [Transformer-XL] Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (2019)

The segment recurrence mechanism and relative encoding scheme of Transformer-XL are incorporated into XLNet’s pretraining, as discussed above.

  • [GPT-2] Language Models are Unsupervised Multitask Learners (2019)

GPT-2 is a large-scale unsupervised language model. It is a huge Transformer-based model with 1.5 billion parameters, trained on WebText, a collection of 45 million webpages. The model outperforms 3 out of 4 baseline systems without using the 127,000+ training examples. The pretrained model can be used for downstream tasks directly, without modification or supervised adaptation. It is capable of generating synthetic yet coherent paragraphs of text and performing a variety of NLP tasks across diverse domains with state-of-the-art results. GPT-2 is a direct scale-up of GPT. Another take-home message is that unsupervised learning techniques with sufficient unlabeled data can be useful for building language processing systems.

  • [BERT] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)

Before XLNet was introduced, BERT provided state-of-the-art results on eleven NLP tasks. Conceptually simple and empirically powerful, it pretrains deep bidirectional representations with a bidirectional Transformer on unlabeled text to learn language representations. More on BERT was discussed earlier in this blog post.

  • [RoBERTa] (A Robustly Optimized BERT Pretraining Approach from Facebook AI)

RoBERTa is a retraining of BERT with improved performance. It removes the Next Sentence Prediction task that BERT uses and applies dynamic masking, whereas BERT uses fixed masked tokens during training. The pretraining stage of RoBERTa uses 160 GB of text and runs on 1024 Tesla V100 GPUs. RoBERTa performs better than both BERT and XLNet on the GLUE benchmark.

RoBERTa achieving state-of-the-art results on the GLUE task development sets.

In the XLNet paper, the authors also compare XLNet to NADE. NADE models are neural network architectures for unsupervised distribution and density estimation. NADE is also a permutation-based model, but it differs from XLNet in a few ways. In NADE, an “orderless” inductive bias is baked into the model to improve density estimation, whereas XLNet has a different goal: learning bidirectional contexts with an AR language model. NADE also relies on an implicit position in its multilayer perceptron network, while XLNet uses two-stream attention to incorporate the target position into the hidden state.

[Current Leaders of the GLUE benchmark]:


Maggie Xiao

Ph.D. researcher, UCLA Electrical & Computer Engineering. Experienced in data science, machine learning, and AI. Also, in magnetism and nanotechnology.