XLNet speaks. Comparison with GPT-2

Aman Rusia
Jun 30 · 6 min read

This was not me, but the XLNet model talking (prompt text is in bold). For more samples and quick usage, go to https://github.com/rusiaaman/XLNet-gen.

Introduction to XLNet

Three of the most successful and effective strategies of Language Modelling are:

  1. Unidirectional/Causal Language Modelling: Words are fed in an auto-regressive manner, from left to right or right to left. LSTMs and Transformers alike can employ this strategy, which also mimics how humans speak: sequentially.
  2. BERT [Nov 2018]: Better described as “Bidirectional Masked Language Modelling”, it models the probability of only a few masked words in a sentence. It was a huge milestone for the NLP community because of the benefits obtained from large-scale pre-training. However, there has been only limited success in language generation using BERT, since it is not straightforward to generate sentences through it.
  3. Generalized permutation language modelling [XLNet, Jun 2019]: The idea is that the probability of any sequence can be modelled auto-regressively under any permutation of its factorization order. The key development is in how this is done: XLNet uses the Transformer architecture and introduces a novel two-stream attention mechanism to achieve it.
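
To make the third idea concrete, here is a minimal sketch of why any factorization order works: on a toy joint distribution over three binary "tokens", multiplying the conditionals in any permutation gives back the same sequence probability. The distribution, the function names and the example sequence are all made up for illustration; this is not XLNet code.

```python
from itertools import permutations, product

# Toy joint distribution over 3 binary "tokens" (weights are made up, normalized to sum to 1).
weights = [1, 2, 3, 4, 5, 6, 7, 8]
outcomes = list(product([0, 1], repeat=3))
joint = {x: w / sum(weights) for x, w in zip(outcomes, weights)}

def marginal(assignment):
    """Probability that the positions in `assignment` (pos -> value) take those values."""
    return sum(p for x, p in joint.items()
               if all(x[pos] == val for pos, val in assignment.items()))

def chain_rule_probability(sequence, order):
    """Multiply the conditionals p(x[z_t] | x[z_<t]) following factorization order `order`."""
    prob = 1.0
    seen = {}
    for pos in order:
        before = marginal(seen)           # probability of the context seen so far
        seen[pos] = sequence[pos]
        prob *= marginal(seen) / before   # conditional of this token given the context
    return prob

sequence = (1, 0, 1)
results = {order: chain_rule_probability(sequence, order)
           for order in permutations(range(3))}
for order, prob in results.items():
    print(order, round(prob, 6))
```

Every one of the six orders prints the same value, which is the joint probability of the sequence. A permutation language model exploits this by training one network to predict tokens under many such orders.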

Peculiarities of XLNet training

We don’t thank Google enough. They (CMU/Google Brain) released a pre-trained model the day they introduced XLNet to the world through the arXiv preprint. However, the way XLNet is trained for permutation language modelling puts a few challenges in the way of text generation.

During training, 85 tokens out of 512 are set as targets for prediction. Target and non-target tokens are handled differently. All the non-target tokens can attend to each other. All the target tokens can attend to all the non-target tokens too, but they attend only to those target tokens which come before them in the [permuted] sequence.
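
The attention rules described above can be sketched as a boolean mask. This is a simplified illustration, not the XLNet implementation: the function name, the example permutation and the target set are hypothetical, and two details are assumptions on my part (a target token does not see itself, and non-target tokens do not attend to target tokens).

```python
def build_attention_mask(perm_order, target_positions):
    """Return mask[i][j] = True if position i may attend to position j.

    perm_order: list of positions in the (permuted) factorization order.
    target_positions: set of positions chosen as prediction targets.
    """
    rank = {pos: r for r, pos in enumerate(perm_order)}
    n = len(perm_order)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j not in target_positions:
                # Non-target tokens are visible to every token.
                mask[i][j] = True
            elif i in target_positions:
                # A target sees another target only if it comes earlier in the permutation.
                mask[i][j] = rank[j] < rank[i]
            # Assumption: otherwise (non-target looking at a target) attention is blocked.
    return mask

perm_order = [2, 0, 4, 1, 3]   # hypothetical permutation of 5 positions
targets = {1, 3}               # hypothetical target positions
mask = build_attention_mask(perm_order, targets)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this example position 3 comes after position 1 in the permutation, so target 3 may attend to target 1 but not the other way round.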

Another peculiarity of the training procedure is the presence of context around each target token. Specifically, the targets are prepared by masking n-grams with about (alpha-1)*n tokens of context surrounding the masked tokens, where alpha is set to 6. That is why, on average, 2.2 consecutive tokens are set for prediction while being surrounded by 11 non-target tokens, which can attend to all other non-target tokens.
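
As a back-of-the-envelope check of these numbers (not a description of the actual data pipeline), the sketch below plugs the quoted values into the (alpha-1)*n formula. The helper name is hypothetical; the only inputs are the figures stated above.

```python
def context_per_span(n, alpha):
    """Non-target context tokens around a masked span of n tokens."""
    return (alpha - 1) * n

alpha = 6        # value quoted above
avg_span = 2.2   # average number of consecutive target tokens quoted above
seq_len = 512    # training sequence length quoted above

context = context_per_span(avg_span, alpha)
target_fraction = avg_span / (avg_span + context)   # simplifies to 1 / alpha
expected_targets = seq_len * target_fraction

print(round(context, 1))            # 11.0
print(round(expected_targets, 1))   # 85.3
```

The result is consistent with the figures above: about one token in six is a target, which for 512 tokens gives roughly the 85 targets mentioned earlier.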

Edit: removed a section which was found to be inaccurate due to a bug in my code. Apologies to anyone who was misled. XLNet can generate language in an autoregressive fashion with good accuracy.

Comparison with GPT-2

The differences between how GPT-2 and XLNet were trained, relevant to language modelling, are as follows:

  1. GPT-2 uses a novel byte pair encoding which operates on UTF-8 byte sequences themselves, whereas XLNet uses the byte pair encoding of the SentencePiece library, which operates on Unicode strings. Because of this, GPT-2 can assign a probability to any sequence of characters. XLNet has a restricted vocabulary and doesn’t handle multilingual characters or emojis. This is why <unk> is generated from time to time by XLNet-gen.
  2. GPT-2 is trained on web-scraped text (curated via Reddit) which amounts to 40 GB of data. XLNet is trained on multiple datasets which amount to 136 GB of data.
  3. The GPT-2 pre-trained model with 345M parameters has roughly the same number of parameters as the largest released XLNet model (340M).
  4. GPT-2 models text left to right, but XLNet can model it in any permutation possible. However, during generation the current implementation of XLNet-gen uses only left-to-right decoding.
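
The first difference can be illustrated with a toy comparison of base alphabets. This is only a sketch: neither function is a real byte pair encoding, the tiny vocabulary is hypothetical, and the character-level lookup merely stands in for a fixed vocabulary built over Unicode strings.

```python
UNK = "<unk>"

def byte_level_symbols(text):
    """Base symbols for a byte-level tokenizer: the UTF-8 bytes of the text."""
    return list(text.encode("utf-8"))

def fixed_vocab_symbols(text, vocab):
    """Map each character to itself if it is in the vocabulary, else to <unk>."""
    return [ch if ch in vocab else UNK for ch in text]

# Hypothetical tiny vocabulary: lowercase ASCII letters and space.
toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")

text = "mars 🚀"
byte_ids = byte_level_symbols(text)
char_tokens = fixed_vocab_symbols(text, toy_vocab)

print(byte_ids)      # every value is in range(256), so nothing is out of vocabulary
print(char_tokens)   # the emoji falls outside the toy vocabulary and becomes <unk>
```

With bytes as the base alphabet, 256 symbols are enough to cover any string, which is why a byte-level model never needs an unknown token. A vocabulary fixed over Unicode characters has to fall back to <unk> for anything it did not include.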

XLNet

Before boarding your rocket to Mars, remember to pack these items. Mars, of course, is not that kind of space station. This site has some links to a number of Mars-related resources, but I think you find these items a little bit more useful or helpful looking at them. In any case, don’t leave some “off-net” resource in a room or drawer. You may not have time to go to it. If you do get to visit it, know you would want to see the materials and “feel” them, rather than reading them. A good resource would be a few short “books” of materials for a different location. This would be handy for visiting locations in the United States — not necessarily in the same city as some of the other sites mentioned, but in a different state or rural area. Another book of the same material would be best of use for visiting the same site. If you are using a travel guide from a similar company, you would not need any other materials if it was located outside of the United States. These materials were published before the advent of the Internet. Remember to check these publications before you embark on a journey to that site.

This resource links back to other sites relating to Mars, which they are, but some links do not carry much information about Mars. Remember to check the links provided so you don’t have to go to the original sites for everything you need for your Mars-related experience. You may even find some information that you find helpful, but not in these sites. Remember to note that we are here for space, not science.

GPT-2

Before boarding your rocket to Mars, remember to pack these items with the following:

Rockets. A rocket, even a small size, will fill up your space pack (even though it’s a much larger pack than a backpack).

A rocket, even a small size, will fill up your space pack (even though it’s a much larger pack than a backpack). Survival essentials. Just like on Earth, you’ll likely be doing a lot of exploring and survival before reaching Mars (unless otherwise noted).

Just like on Earth, you’ll likely be doing a lot of exploring and survival before reaching Mars (unless otherwise noted). A couple of survival snacks. It’s okay to skip meals, like these. Food can be hard to find on Mars, so having a snack will help you eat during your time on the planet.

It’s okay to skip meals, like these. Food can be hard to find on Mars, so having a snack will help you eat during your time on the planet. Some water. A quick trip through the desert wastes a little water during your flight, so have a couple bottled (or just a few cups) of water.

A quick trip through the desert wastes a little water during your flight, so have a couple bottled (or just a few cups) of water. Luggage. Luggage comes in handy every once in awhile

Observations

  1. GPT-2 can generate newline characters, whereas only end-of-paragraph and end-of-document tokens are available for XLNet to generate. GPT-2 can thus retain the structure of the articles it was trained on, while XLNet, because of the way its training text was pre-processed, doesn’t model newline characters. Double newline characters are thus absent from its samples.
  2. After looking at multiple samples, I feel XLNet is more coherent in its generation, even though its samples have grammatical errors more frequently than GPT-2’s.

Accuracy of unsupervised learning

Language models can learn facts just by being trained on large amounts of text. I present my early, non-rigorous findings on the difference in their performance on unsupervised question answering, à la “Language Models are Unsupervised Multitask Learners”:

GPT-2 345M Score: 8/17
XLNet 340M Score: 6/17

Examples of questions asked and the answers:

Q: Panda is a national animal of which country?
XLNet: united states
GPT-2: china

Q: Who came up with the theory of relativity?
XLNet: Einstein
GPT-2: Albert Einstein

Q: When was the first star wars film released?
XLNet: 1977
GPT-2: Star Wars: Episode IV A New Hope.

The above results were generated by prefacing each question with other sample questions and answers, a trick first used in the GPT-2 paper. Note: a better approach would be to use beam search for decoding the answers; it is not used here, so results may vary.
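
The prefacing trick can be sketched as simple prompt construction. The "Q:"/"A:" layout, the function name and the sample pairs below are assumptions for illustration; the pairs actually used for the scores above are not listed in this post.

```python
def build_qa_prompt(examples, question):
    """Preface `question` with sample question-answer pairs, leaving the final answer open."""
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)

# Hypothetical sample pairs, chosen only to show the pattern.
examples = [
    ("What is the capital of France?", "Paris"),
    ("How many days are there in a week?", "Seven"),
]
prompt = build_qa_prompt(examples, "Panda is a national animal of which country?")
print(prompt)
```

The model is then asked to continue the text after the final "A:", and whatever it generates up to the next line break is taken as its answer.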

Conclusions

The invention of XLNet is a new milestone for the NLP community. It has shown impressive results in tasks like extractive question answering (SQuAD), sentiment classification, natural language inference and so on. It derives its benefit from the deep bidirectional representation it obtains through permutation language modelling and from efficient training using the novel two-stream attention.

Its benefit is, unfortunately, not apparent in language generation tasks, where GPT-2 beats it by a slight margin. However, a proper scientific comparison and results on LM tasks will be needed to say this conclusively.

It will be interesting to see how permutation LM is used to improve the text generation process, but until then GPT-2 remains the most accurate text generation model.
