CMU & Google XLNet Tops BERT; Achieves SOTA Results on 18 NLP Tasks

Published in SyncedReview, Jun 24, 2019

In 2018 Google released BERT (Bidirectional Encoder Representations from Transformers), a large-scale natural language pretraining model that achieved state-of-the-art performance on 11 NLP tasks and stimulated NLP research across academia and industry. A team of researchers from Carnegie Mellon University and Google Brain has now proposed XLNet, a new language model that outperforms BERT on 20 language tasks, including SQuAD, GLUE, and RACE, and achieves SOTA results on 18 of them. XLNet’s training code and pretrained model have been open-sourced on GitHub.

The CMU and Google researchers note that pretraining models such as BERT, which are based on denoising auto-encoding, model bidirectional context better than pretraining methods based on auto-regressive language modeling. However, models like BERT mask part of the input during pretraining. Because these artificial mask tokens never appear in the downstream data, masking creates a pretrain-finetune discrepancy between the generic pretrained model and the model fine-tuned on task-specific data.

XLNet is a generalized autoregressive pretraining model that combines the advantages of auto-regressive (AR) language modeling and auto-encoding (AE) while avoiding the shortcomings of both. (Existing unsupervised pretraining objectives each have their own advantages and disadvantages; AR and AE have been the two most successful among them.) Instead of using the fixed forward or backward factorization order of traditional AR models, XLNet maximizes the expected log-likelihood of a sequence over all possible permutations of the factorization order. As a result, the context for each position can include tokens from both its left and its right, so each position learns to use information from all other positions, which is to say it captures bidirectional context.
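The idea of permuting the factorization order can be illustrated with a small sketch. The Python below is not XLNet's implementation; the function names are hypothetical and it enumerates every order for a tiny sequence, whereas a real model would sample orders and realize them through attention masks. It only shows which positions a token may condition on under a given order.

```python
from itertools import permutations


def context_for_order(order):
    """Map each position to the positions it may condition on under one
    factorization order, i.e. those that come earlier in `order`."""
    seen = set()
    context = {}
    for pos in order:
        context[pos] = frozenset(seen)
        seen.add(pos)
    return context


def union_of_contexts(seq_len):
    """Union, over every factorization order, of each position's context."""
    union = {pos: set() for pos in range(seq_len)}
    for order in permutations(range(seq_len)):
        for pos, ctx in context_for_order(order).items():
            union[pos] |= ctx
    return union


# Standard left-to-right order: position 2 sees only positions 0 and 1.
left_to_right = context_for_order((0, 1, 2, 3))

# A permuted order: position 2 now conditions on position 3, to its right.
permuted = context_for_order((3, 0, 2, 1))

# Across all orders, position 2 has conditioned on every other position.
all_orders = union_of_contexts(4)
```

Under the left-to-right order, position 2 never sees position 3; under the permuted order it does. Taken over all orders, every position ends up conditioning on every other position, which is the sense in which the objective is bidirectional while each individual order remains autoregressive.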

As a generalized AR language model, XLNet does not rely on corrupting the input data and so does not suffer from the pretrain-finetune discrepancy that affects BERT. At the same time, the AR objective uses the product rule to factorize the joint probability of the predicted tokens, eliminating BERT’s assumption that the masked tokens are independent of one another, so that dependencies between predicted tokens are also modeled.
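A toy calculation may make the difference between the two factorizations concrete. The joint distribution below is invented purely for illustration (the probabilities are hypothetical, not taken from any model): two predicted tokens are strongly correlated, and the sketch compares an independence-style estimate with a product-rule estimate of the same pair.

```python
# Hypothetical joint distribution over two predicted tokens.
joint = {
    ("New", "York"): 0.45,
    ("Los", "Angeles"): 0.45,
    ("New", "Angeles"): 0.05,
    ("Los", "York"): 0.05,
}


def marginal(index, token):
    """Probability that the token at `index` (0 or 1) equals `token`."""
    return sum(p for pair, p in joint.items() if pair[index] == token)


def conditional_second(second, first):
    """p(second | first), read directly off the joint table."""
    return joint[(first, second)] / marginal(0, first)


def independent_estimate(first, second):
    """Independence assumption: p(first) * p(second)."""
    return marginal(0, first) * marginal(1, second)


def chain_rule_estimate(first, second):
    """Product rule: p(first) * p(second | first)."""
    return marginal(0, first) * conditional_second(second, first)


indep = independent_estimate("New", "York")
chain = chain_rule_estimate("New", "York")
```

With these numbers the independence estimate for ("New", "York") is about 0.25, while the product-rule estimate recovers the true joint value of 0.45. The gap is the information about the dependency between the two tokens that an independence assumption discards.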

Furthermore, XLNet improves the architectural design for pretraining by integrating the relative positional encoding scheme and the segment recurrence mechanism of Transformer-XL, the SOTA autoregressive model. Experiments show that this improves XLNet’s performance, especially on language tasks involving long text sequences.
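The segment recurrence idea can be sketched roughly as follows. This is a deliberately simplified illustration, not Transformer-XL's or XLNet's code: the function names and the `seg_len` and `mem_len` parameters are hypothetical, and raw tokens stand in for the cached hidden states that a real model would reuse. It shows only how a fixed-size memory carried from earlier segments extends the context available to the current one.

```python
def split_into_segments(tokens, seg_len):
    """Cut a long sequence into consecutive segments of at most seg_len tokens."""
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]


def visible_context_per_segment(tokens, seg_len, mem_len):
    """For each segment, list what it can attend to: cached memory from
    earlier segments followed by the segment itself.

    Assumption: raw tokens stand in for cached hidden states.
    """
    memory = []
    visible = []
    for segment in split_into_segments(tokens, seg_len):
        visible.append(memory + segment)
        # Keep only the most recent mem_len items for the next segment.
        combined = memory + segment
        memory = combined[-mem_len:] if mem_len > 0 else []
    return visible


contexts = visible_context_per_segment(list("abcdefgh"), seg_len=3, mem_len=2)
no_memory = visible_context_per_segment(list("abcdefgh"), seg_len=3, mem_len=0)
```

With `mem_len=2`, the second segment sees the last two items of the first segment in addition to its own tokens; with `mem_len=0`, each segment is processed in isolation, which is the situation the recurrence mechanism is meant to avoid on long texts.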

Together, these features allow XLNet to surpass BERT on 20 tasks and to achieve SOTA results on 18 of them, including question answering, natural language inference, sentiment analysis, and document ranking.