Perplexity of Language Models

Nov 26, 2022

Perplexity is an evaluation metric that measures the quality of language models. In this post, we will discuss what perplexity is and how it is calculated for the popular model GPT2.

What is a Language Model?

You might have already heard of large language models (LLMs) such as BERT and GPT2, which have changed the face of Natural Language Processing. There are different types of language models, such as statistical language models (SLMs) and neural language models. SLMs are based on statistics of the given text, whereas neural language models are trained using neural network architectures.

At its core, a language model (LM) is a probability distribution over sequences of words drawn from a fixed set known as the vocabulary of the model. It gives the probability of each word in the vocabulary occurring next, given all the previous words. In the simplest decoding scheme, the word with the maximum probability is selected as the next predicted word in the sequence.

The probability of a whole sequence can be calculated by multiplying the conditional probabilities of each word given its previous words, which gives the likelihood of the sequence.

For example, the joint likelihood of the sentence “It is a beautiful day” is written as shown below. Calculating this probability helps us predict the next or missing words in a sequence. In learning to do so, the model picks up the nuances of the language, hence the term language model.

P(It, is, a, beautiful, day) = P(day | It, is, a, beautiful) *
P(beautiful | It, is, a) * P(a | It, is) * P(is | It) * P(It)
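
As a minimal sketch of this chain rule, the snippet below multiplies a set of conditional probabilities, including the probability of the first word, to obtain the likelihood of the sentence. The probability values are invented for illustration and do not come from any real model.

```python
import math

# Hypothetical conditional probabilities for "It is a beautiful day".
# These values are invented for illustration only.
conditional_probs = {
    "P(It)": 0.2,
    "P(is | It)": 0.5,
    "P(a | It, is)": 0.4,
    "P(beautiful | It, is, a)": 0.1,
    "P(day | It, is, a, beautiful)": 0.5,
}

# Chain rule: the joint likelihood is the product of the conditionals.
joint_likelihood = math.prod(conditional_probs.values())

# In practice the log-probabilities are summed instead, to avoid underflow
# when sequences get long.
log_likelihood = sum(math.log(p) for p in conditional_probs.values())

print(joint_likelihood)          # about 0.002
print(math.exp(log_likelihood))  # the same value, recovered from the log
```

Working in log space is also what makes the loss and perplexity computations later in this post convenient.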

Language models have been successfully used for many NLP tasks such as speech recognition, text classification, and text generation.

In the next sections, we will discuss some important terms that are used to calculate Perplexity.


Entropy

Entropy is a measure that quantifies uncertainty. The surprise of a single event is the negative log of its probability, and entropy is the average surprise over all events in the distribution. The higher the probability of an event, the lower its uncertainty. Hence, a language model is trained to assign high probability, and thus low uncertainty, to sequences of words that are similar to the training sequences. The formula for calculating entropy is given below, where P(x) is the probability of the word x.

$$H(P) = -\sum_{x} P(x) \log P(x)$$

Formula for Entropy of a Probability Distribution
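
To make the entropy formula concrete, here is a small sketch in PyTorch. The helper name `entropy` and the three example distributions are assumptions made for illustration; the natural logarithm is used, so the result is in nats.

```python
import torch

def entropy(p: torch.Tensor) -> torch.Tensor:
    """Entropy in nats of a discrete distribution p (entries sum to 1)."""
    # Skip zero-probability entries, since 0 * log(0) is taken to be 0.
    nonzero = p[p > 0]
    return -(nonzero * torch.log(nonzero)).sum()

# Three made-up distributions over five outcomes.
uniform = torch.tensor([0.2, 0.2, 0.2, 0.2, 0.2])      # most uncertain
peaked = torch.tensor([0.96, 0.01, 0.01, 0.01, 0.01])  # fairly certain
certain = torch.tensor([1.0, 0.0, 0.0, 0.0, 0.0])      # no uncertainty

print(entropy(uniform))  # log(5), about 1.609
print(entropy(peaked))   # smaller than the uniform case
print(entropy(certain))  # zero: the outcome is never a surprise
```

The more probability mass is concentrated on one outcome, the lower the entropy.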


Cross Entropy

Cross entropy compares two probability distributions P(x) and Q(x). In the context of language models, we compare the predicted probability distribution over the words with the actual probability distribution. Here, P(x) is the actual probability distribution and Q(x) is the distribution predicted by the model. The cross entropy is then calculated as shown below, and it can be used as a loss function to train language models.

$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$

Let’s say we have a language model that has been trained with a vocabulary of only 5 words: “sunny”, “day”, “beautiful”, “scenery”, and “clouds”. Now, we want to calculate the perplexity of the model when it sees the phrase “beautiful scenery”.

Let us calculate the cross entropy using a simple example in PyTorch.

# Get the needed libraries
import torch
from torch.nn import functional as F

Let us say that the two actual words in the target phrase are “beautiful” and “scenery”. Assume a language model has generated the logits (raw outputs) shown below for the given input.

These logits are then passed to a softmax function that normalizes the values and converts them into a probability distribution. This is the predicted probability distribution.

# Logits for the two positions, one score per word in the 5-word vocabulary
input = torch.tensor([[-0.7891,  1.3421,  0.4929,  0.0715, -0.0910],
                      [ 0.9024, -0.8675,  0.8498, -1.0331,  0.5531]])

F.softmax(input, dim = -1)


tensor([[0.0575, 0.4841, 0.2071, 0.1359, 0.1155],
[0.3369, 0.0574, 0.3196, 0.0486, 0.2375]])
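
For readers who want to see what softmax does under the hood, here is a short sketch that computes it by hand for the same logits. The variable names are chosen for illustration.

```python
import torch
from torch.nn import functional as F

logits = torch.tensor([[-0.7891, 1.3421, 0.4929, 0.0715, -0.0910],
                       [0.9024, -0.8675, 0.8498, -1.0331, 0.5531]])

# Softmax by hand: exponentiate every logit, then divide by the row sum.
exp_logits = torch.exp(logits)
manual_probs = exp_logits / exp_logits.sum(dim=-1, keepdim=True)

# Each row is now a probability distribution that sums to 1.
print(manual_probs)
print(manual_probs.sum(dim=-1))
```

Library implementations such as `F.softmax` compute the same quantity in a numerically safer way, which matters when logits are large.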

In our vocabulary, the target words “beautiful” and “scenery” are represented by the indices 2 and 3. Let us also represent the targets as a probability distribution, which translates to one-hot vectors of 0s and 1s.

target = torch.tensor([2, 3])

# The same targets as one-hot probability distributions
F.one_hot(target, num_classes = 5)

tensor([[0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0]])

Applying the formula, we multiply each true probability by the log of the corresponding predicted probability and negate the result. The total loss is the mean of the per-word losses, taken over the number of target words (two here), as shown below.

Loss for First Word:
(- 0 * log(0.0575)) + (- 0 * log(0.4841)) + (- 1 * log(0.2071)) +
(- 0 * log(0.1359)) + (- 0 * log(0.1155)) = 1.5745535105805986

Loss for Second Word:
(- 0 * log(0.3369)) + (- 0 * log(0.0574)) + (- 0 * log(0.3196)) +
(- 1 * log(0.0486)) + (- 0 * log(0.2375)) = 3.024131748075689

Loss = (1.5745535105805986 + 3.024131748075689) / 2 = 2.299
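
The hand calculation above can also be sketched in a few lines of PyTorch. This is only an illustration of the arithmetic, with variable names chosen here; it works from the logits directly rather than from the rounded probabilities, so the digits differ slightly in the later decimal places.

```python
import torch
from torch.nn import functional as F

logits = torch.tensor([[-0.7891, 1.3421, 0.4929, 0.0715, -0.0910],
                       [0.9024, -0.8675, 0.8498, -1.0331, 0.5531]])
target = torch.tensor([2, 3])  # indices of "beautiful" and "scenery"

# Log of the predicted probabilities for every word in the vocabulary.
log_probs = F.log_softmax(logits, dim=-1)

# The one-hot targets zero out every term except the target word's
# log-probability, exactly as in the sums written out above.
one_hot = F.one_hot(target, num_classes=5).float()
per_word_loss = -(one_hot * log_probs).sum(dim=-1)

# Average over the two target words.
mean_loss = per_word_loss.mean()

print(per_word_loss)  # roughly 1.57 and 3.02
print(mean_loss)      # roughly 2.30
```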

Verifying this loss with the CrossEntropyLoss function provided by the PyTorch library gives a matching value.

loss = torch.nn.CrossEntropyLoss()

output = loss(input, target)

Loss: tensor(2.299)


Perplexity

Intuitively, perplexity means to be surprised. We measure how much the model is surprised by seeing new data. The lower the perplexity, the better the model predicts that data.

Perplexity is calculated as the exponential of the loss obtained from the model. In the above example, the perplexity of our example model with regard to the phrase “beautiful scenery” is exp(2.299) ≈ 9.97. In general, perplexity is the exponential of the negative mean log-likelihood of all the words in an input sequence.

$$\text{PPL}(X) = \exp\left(-\frac{1}{t} \sum_{i=1}^{t} \log p_\theta(x_i \mid x_{<i})\right)$$

Formula of Perplexity from HuggingFace
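
As a quick sketch, the perplexity of the toy 5-word model on “beautiful scenery” can be computed from the same logits used in the cross entropy example. The comparison against a model that guesses uniformly is an addition made here for intuition.

```python
import torch
from torch.nn import functional as F

logits = torch.tensor([[-0.7891, 1.3421, 0.4929, 0.0715, -0.0910],
                       [0.9024, -0.8675, 0.8498, -1.0331, 0.5531]])
target = torch.tensor([2, 3])  # indices of "beautiful" and "scenery"

# Mean negative log-likelihood of the target words (the cross-entropy loss).
nll = F.cross_entropy(logits, target)

# Perplexity is the exponential of that mean.
ppl = torch.exp(nll)
print(ppl)  # roughly 9.97

# For comparison: equal logits mean a uniform guess over the 5 words,
# which gives a perplexity equal to the vocabulary size.
uniform_logits = torch.zeros(2, 5)
uniform_ppl = torch.exp(F.cross_entropy(uniform_logits, target))
print(uniform_ppl)  # 5
```

A perplexity of about 9.97 on a 5-word vocabulary means this toy model does worse than uniform guessing on the phrase, which is expected given that it assigns low probability to both target words.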

Now, let us compare the perplexity of two sentences with GPT2 and see how perplexed it is. We first load a tokenizer and a GPT2 model with a causal language modeling head from HuggingFace:

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("ABC is a startup based in New York City and Paris", return_tensors = "pt")
loss = model(input_ids = inputs["input_ids"], labels = inputs["input_ids"]).loss
ppl = torch.exp(loss)

Output: 29.48

inputs_wiki_text = tokenizer("Generative Pretrained Transformer is an opensource artificial intelligence created by OpenAI in February 2019", return_tensors = "pt")
loss = model(input_ids = inputs_wiki_text["input_ids"], labels = inputs_wiki_text["input_ids"]).loss
ppl = torch.exp(loss)

Output: 211.81

As you can see, the perplexity of the first sentence is much lower than that of the second. The first sentence likely resembles the text on which the model was trained, while the second sentence is less familiar to the model, and hence GPT2 is more perplexed by it.
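
When `labels` is passed to a Hugging Face causal language model as in the snippets above, the returned `loss` is the mean cross entropy of predicting each token from the tokens before it, and the labels are shifted by one position inside the model. The sketch below reproduces that computation on random stand-in logits. No real model is loaded, and `vocab_size`, `seq_len`, and both tensors are made up, so the resulting numbers carry no meaning beyond showing the steps.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)

# Stand-ins for real model outputs: random logits for a sequence of
# 6 tokens over a hypothetical vocabulary of 10 tokens.
vocab_size, seq_len = 10, 6
input_ids = torch.randint(0, vocab_size, (1, seq_len))
logits = torch.randn(1, seq_len, vocab_size)

# The logits at position i predict the token at position i + 1,
# so drop the last logit and the first label before comparing them.
shift_logits = logits[:, :-1, :]
shift_labels = input_ids[:, 1:]

# Mean next-token cross entropy over the sequence.
loss = F.cross_entropy(shift_logits.reshape(-1, vocab_size),
                       shift_labels.reshape(-1))

# Perplexity is the exponential of that mean loss.
ppl = torch.exp(loss)
print(loss, ppl)
```

Because of the shift, the first token of a sentence is never scored: there is nothing before it to predict it from.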

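Perplexity is an intrinsic metric: it tells us how well a model predicts a given text, whether that text is the training set or held-out data. Task-specific metrics such as BLEU and ROUGE are used alongside it to measure performance on downstream tasks like translation and summarization.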