Understand How XLNet Outperforms BERT in Language Modelling

Karan Purohit · Published in Saarthi.ai · 7 min read · Jul 10, 2019

The world of NLP was shaken by XLNet. This new approach to language modeling outperformed the mighty BERT on 20 NLP tasks and achieved state-of-the-art results on 18 of them.


XLNet will probably transform language modeling, and that’s why it is an important addition to the arsenal of any NLP practitioner. In this article, we will discuss the secret sauce behind XLNet and what makes it better than BERT. To understand it better, we will also take a look at the related techniques that came before it.

An Introduction to Language Modeling


In the year 2018, the world of NLP witnessed momentous advancements, with language modelling tasks at the center of research.

Language modeling is the task of predicting the next word in a sentence, given all previous words. Language models have become a vital part of the NLP pipeline, as they provide the backbone to various downstream tasks. Language models capture general aspects of the input text that are almost universally useful.

Earlier language models such as ULMFiT and ELMo were both based on LSTMs. Indeed, both ULMFiT and ELMo were a massive success, producing state-of-the-art results on numerous tasks. But we’ll see how XLNet achieves unprecedented results.

Autoregressive Model (AR) for Language Modelling

XLNet is a generalized autoregressive pre-training model. An autoregressive model is merely a feed-forward model which predicts the future word from a given set of context words. But here, the context is constrained to a single direction, either forward or backward.

The autoregressive model can be run sequentially to generate a new sequence! Start with your seed x1, x2, …, xk and predict xk+1. Then use x2, x3, …, xk+1 to predict xk+2, and so on. GPT and GPT-2 are both autoregressive language models, which is why they work so well at text generation.
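To make that sequential loop concrete, here is a minimal sketch in Python. The `predict_next` function is a hypothetical stand-in for any AR model such as GPT; only the loop structure is the point.

```python
def generate(predict_next, seed_tokens, num_new_tokens):
    """Run an autoregressive language model sequentially.

    `predict_next` is a hypothetical stand-in for an AR model: it takes the
    tokens seen so far and returns a likely next token.
    """
    tokens = list(seed_tokens)
    for _ in range(num_new_tokens):
        next_token = predict_next(tokens)   # context is strictly one-directional
        tokens.append(next_token)           # the prediction becomes part of the context
    return tokens

# Usage: start from the seed x1..xk and extend it one token at a time, e.g.
# generate(my_ar_model, ["Sometimes", "you", "have"], num_new_tokens=5)
```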

The problem with the autoregressive language model is that it can only use forward context or backward context; it can’t use both at the same time, and is therefore limited in its understanding of context and in its predictions.

Autoencoder (AE) Language Modelling

Unlike the AR language model, BERT uses an autoencoding (AE) language model. The AE language model aims to reconstruct the original data from a corrupted input.

In BERT, the pre-training input data is corrupted by adding [MASK] tokens. For example, ‘Goa has the most beautiful beaches in India’ becomes ‘Goa has the most beautiful [MASK] in India’, and the objective of the model is to predict the [MASK]ed word based on the context words. The advantage of the autoencoding language model is that it can see context in both the forward and backward directions. However, the addition of [MASK] tokens to the input data introduces a discrepancy between pre-training and fine-tuning.
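To illustrate the corruption step, here is a simplified masking sketch in Python. The real BERT recipe is more involved (it sometimes keeps the chosen token or swaps in a random word); this only shows the general idea.

```python
import random

def corrupt_for_mlm(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Simplified BERT-style corruption: hide a fraction of the input tokens.

    Returns the corrupted input and the positions the model must reconstruct.
    (The real BERT recipe also sometimes keeps or randomly replaces the chosen
    tokens; that detail is omitted in this sketch.)
    """
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            corrupted.append(mask_token)
            targets[i] = tok          # the model is trained to predict these
        else:
            corrupted.append(tok)
    return corrupted, targets

sentence = "Goa has the most beautiful beaches in India".split()
print(corrupt_for_mlm(sentence))
# e.g. (['Goa', 'has', 'the', 'most', 'beautiful', '[MASK]', 'in', 'India'], {5: 'beaches'})
```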

What went wrong with BERT?

Although BERT achieved SOTA on almost all NLP tasks by using AE language modeling, there are still some loopholes in its approach. The BERT model has two major drawbacks:

  1. The discrepancy in fine-tuning due to masking

BERT is trained to predict tokens replaced with the special [MASK] token. The problem is that the [MASK] tokens never appear while fine-tuning BERT on downstream tasks. In most cases, BERT simply copies non-masked tokens to the output.

So, would it really learn to produce meaningful representations for non-masked tokens? It is also not clear what happens if there are no [MASK] tokens in the input sentence.

  2. Predicted tokens are independent of each other

BERT assumes the predicted (masked) tokens are independent of each other, given the unmasked tokens. To understand this, let’s go through an example.

Whenever she goes to the [MASK] [MASK] she buys a lot of [MASK].

This can be filled as

Whenever she goes to the shopping center, she buys a lot of clothes.

Or

Whenever she goes to the cinema hall she buys a lot of popcorn.

but, the sentence

Whenever she goes to the cinema hall she buys a lot of clothes.

is not valid. BERT predicts all masked positions in parallel, which means that during training it does not learn to handle dependencies between simultaneously masked tokens. In other words, it does not learn dependencies between its own predictions; each masked token is predicted independently of the others. This is a problem because it reduces the number of dependencies BERT learns at once, making the learning signal weaker than it could be.
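To see this with numbers: BERT scores each masked slot separately, so every combination of fillers looks equally plausible to it. The toy probabilities below are entirely made up, just to illustrate the independence assumption:

```python
# Toy, made-up probabilities for the two masked slots, each conditioned only
# on the unmasked words. For simplicity, the two-word location ("shopping
# center" / "cinema hall") is treated as a single slot.
p_location = {"shopping center": 0.5, "cinema hall": 0.5}
p_item = {"clothes": 0.5, "popcorn": 0.5}

# Under BERT's independence assumption every combination is equally likely,
# including the implausible ("cinema hall", "clothes") pairing.
for loc in p_location:
    for item in p_item:
        print(loc, "->", item, ":", p_location[loc] * p_item[item])

# An autoregressive factorization p(item | location, context) could instead
# assign ("cinema hall", "clothes") a much lower probability.
```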

The Secret Sauce of XLNet: Permutation Language Modeling

What made BERT stand out among all traditional language models was its ability to capture bidirectional context. Its major flaw, however, was the introduction of [MASK] tokens and parallel independent predictions during pre-training.

If we could somehow build a model that incorporates bidirectional context while avoiding the [MASK] token and parallel independent predictions, then that model would surely outperform BERT and achieve state-of-the-art results.

That, essentially, is what XLNet achieves.

XLNet does this by using a variant of language modeling called “permutation language modeling”. A permutation language model is trained to predict one token given the preceding context, like a traditional language model, but instead of predicting tokens in sequential order, it predicts them in some random order. To make this clear, let’s take the following sentence as an example:

“Sometimes you have to be your own hero.”

A traditional language model would predict the tokens in the order

“Sometimes”, “you”, “have”, “to”, “be”, “your”, “own”, “hero”

where each token uses all previous tokens as context.

In permutation language modeling, the order of prediction is not necessarily left to right. For instance, it could be

“own”, “Sometimes”, “to”, “be”, “your”, “hero”, “you”, “have”

where “Sometimes” would be conditioned on seeing “own”, “to” would be conditioned on seeing “own” and “Sometimes”, and so on.

Notice how the model is forced to model bidirectional dependencies with permutation language modeling. In expectation, the model should learn to model the dependencies between all combinations of inputs in contrast to traditional language models that only learn dependencies in one direction.
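A rough sketch of this training objective, assuming a hypothetical `log_prob_fn` that can score a token given an arbitrary set of visible (position, token) pairs, might look like this. The only point is that the prediction order is a random permutation and each target attends to whatever came before it in that permutation:

```python
import random

def permutation_lm_step(tokens, log_prob_fn):
    """One conceptual permutation-LM training step.

    `log_prob_fn(target_pos, visible)` is a hypothetical model call that
    returns the log-probability of the token at `target_pos`, given only the
    (position, token) pairs in `visible`. Positions are kept, so the model
    still knows where each context word sits in the original sentence.
    """
    order = list(range(len(tokens)))
    random.shuffle(order)                      # a random factorization order

    total_log_prob = 0.0
    visible = []                               # tokens revealed so far, in permutation order
    for pos in order:
        total_log_prob += log_prob_fn(pos, visible)
        visible.append((pos, tokens[pos]))     # now this token may serve as context
    return total_log_prob                      # maximized during training

# tokens = "Sometimes you have to be your own hero".split()
```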

XLNet utilizes the Transformer XL

Aside from using permutation language modeling, XLNet utilizes the Transformer XL, which improves its results further.

Key ideas behind the Transformer XL Model:

  • Relative positional embeddings
  • Recurrence mechanism

The hidden states from the previous segment are cached and frozen while conducting the permutation language modeling for the current segment. Since all the words from the previous segment are used as input, there is no need to know the permutation order of the previous segment.
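A minimal sketch of this recurrence idea, assuming a hypothetical `transformer_layer` function: the cached states act as extra context for the next segment, but no gradients flow back into them.

```python
def forward_with_memory(segments, transformer_layer):
    """Conceptual Transformer-XL style recurrence over consecutive segments.

    `transformer_layer(segment, memory)` is a hypothetical call that encodes
    the current segment while also attending to `memory`, the cached hidden
    states of the previous segment.
    """
    memory = None          # no previous segment for the first chunk of text
    outputs = []
    for segment in segments:
        hidden = transformer_layer(segment, memory)
        outputs.append(hidden)
        # Cache the states as "frozen" context for the next segment; in a real
        # framework you would also detach them so no gradients flow back.
        memory = hidden
    return outputs
```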

The Power of Two-Stream Self-Attention

For language models built on the Transformer, when predicting the token at position i, the entire embedding for that word is masked out, including the positional embedding. This means that the model is cut off from knowledge regarding the position of the token it is predicting.

This can be problematic, especially for positions like the beginning of the sentence, which have a considerably different distribution from other positions in the sentence. To address this problem, the authors introduce a second set of representations that incorporate positional information, but mask the actual token, just for the sake of pre-training. This second set of representations is called the query stream. The model is trained to predict each token in the sentence using information from just the query stream.

The original set of representations that includes both the positional embedding and token embedding is called the content stream. This set of representations is used to incorporate all the information relevant to a certain word during pre-training. The content stream is used as input to the query stream, but not the other way around. This scheme is called “Two-Stream Self-Attention”.

For each word, the query stream uses the content stream which encodes all the available contextual information for words up to the current word.

For example, suppose we are predicting the word “calm” in the sentence below,

“Keep calm and read papers”

where the previous words in the permutation were “and” and “papers”. The content stream would encode information for the words “and” and “papers”. The query stream would encode the positional information of “calm” along with the information from the content stream, which would then be used to predict the word “calm”.
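The difference between the two streams boils down to what each one is allowed to attend to for a given permutation order. The sketch below only builds the two attention masks; the actual attention computation and all other model details are omitted:

```python
import numpy as np

def two_stream_masks(order):
    """Build content-stream and query-stream attention masks for one permutation.

    `order[i]` is the step at which position i is predicted. Entry [i, j] is
    True if position i may attend to position j.
    """
    n = len(order)
    content_mask = np.zeros((n, n), dtype=bool)
    query_mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if order[j] < order[i]:
                content_mask[i, j] = True   # earlier in the permutation: visible to both streams
                query_mask[i, j] = True
        content_mask[i, i] = True           # the content stream sees its own token...
        # ...but the query stream must NOT see the token it is predicting;
        # it only knows that token's position.
    return content_mask, query_mask

# e.g. for "Keep calm and read papers", with "and" and "papers" revealed
# before "calm" (as in the example above):
# order = [3, 2, 0, 4, 1]   # step at which each position is predicted
```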

Conclusion

XLNet is a truly exciting addition to the world of Natural Language Processing, and will surely become a topic of discussion among researchers. It shows that there is still a lot to be explored in language modeling and transfer learning in NLP.

Further Readings

XLNet paper

Transformers blog post

My previous blog post on ELMo

Blog post on BERT

I hope this article helped you in understanding the key concepts of XLNet. For any doubts, feel free to reach out to me through the comments section, and I’ll be thrilled to help you. If you liked this article, please give it a clap.

For more articles like this, follow our Facebook page

