Two minutes NLP — Perplexity explained with simple probabilities

Language models, sentence probabilities, and entropy

Fabio Chiusano
NLPlanet
5 min read · Jan 27, 2022


In general, perplexity is a measurement of how well a probability model predicts a sample. In the context of Natural Language Processing, perplexity is one way to evaluate language models.

A language model is a probability distribution over sentences: a good one can both generate plausible, human-like sentences and assess the quality of sentences that are already written. Presented with a well-written document, a good language model should assign it a higher probability than a badly written one, i.e. it should not be “perplexed” by the well-written document.

Thus, the perplexity metric in NLP is a way to capture the degree of ‘uncertainty’ a model has in predicting (i.e. assigning probabilities to) text.

Now, let’s try to compute the probabilities assigned by language models to some example sentences and derive an intuitive explanation of what perplexity is.

Computing perplexity from sentence probabilities

Suppose we have trained a small language model over an English corpus. The model predicts the probability of the next word in a sentence from a vocabulary of only six words: “a”, “the”, “red”, “fox”, “dog”, and “.”.

Let’s compute the probability of the sentence W, which is “a red fox.”.

The probability of a generic sentence W, made of the words w1, w2, up to wn, can be expressed as the following:

P(W) = P(w1, w2, …, wn)

Using our specific sentence W, the probability can be extended as the following:

P(“a red fox.”) =

P(“a”) * P(“red” | “a”) * P(“fox” | “a red”) * P(“.” | “a red fox”)

Suppose these are the probabilities assigned by our language model to a generic first word in a sentence:

Probabilities assigned by a language model to a generic first word w1 in a sentence. Image by the author.

As can be seen from the chart, the probability of “a” as the first word of a sentence is:

P(“a”) = 0.4

Next, suppose these are the probabilities given by our language model to a generic second word that follows “a”:

Probabilities assigned by a language model to a generic second word w2 in a sentence. Image by the author.

The probability of “red” as the second word in the sentence after “a” is:

P(“red” | “a”) = 0.27

Similarly, these are the probabilities of the next words:

Probabilities assigned by a language model to a generic third word w3 in a sentence. Image by the author.
Probabilities assigned by a language model to a generic fourth word w4 in a sentence. Image by the author.

Finally, the probability assigned by our language model to the whole sentence “a red fox.” is:

P(“a red fox.”) =

P(“a”) * P(“red” | “a”) * P(“fox” | “a red”) * P(“.” | “a red fox”)

= 0.4 * 0.27 * 0.55 * 0.79

= 0.0469
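As a quick sanity check, here is a minimal Python sketch of the same chain-rule product. The variable names are mine, and the conditional probabilities are the hypothetical values read from the charts above:

```python
# Chain-rule probability of the sentence "a red fox.".
# The conditional probabilities are the example values from the charts above.
cond_probs = [
    0.4,   # P("a")                 -- first word of the sentence
    0.27,  # P("red" | "a")
    0.55,  # P("fox" | "a red")
    0.79,  # P("." | "a red fox")
]

p_sentence = 1.0
for p in cond_probs:
    p_sentence *= p

print(round(p_sentence, 4))  # 0.0469
```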

It would be nice to compare the probabilities assigned to different sentences to see which sentences are better predicted by the language model. However, since the probability of a sentence is obtained from a product of probabilities, the longer the sentence, the lower its probability (it is a product of factors smaller than one). We need a way of measuring sentence probabilities that is not influenced by sentence length.

This can be done by normalizing the sentence probability by the number of words in the sentence. Since the probability of a sentence is obtained by multiplying many factors, we can average them using the geometric mean.

Let’s call Pnorm(W) the normalized probability of the sentence W. Let n be the number of words in W. Then, applying the geometric mean:

Pnorm(W) = P(W) ^ (1 / n)

Using our specific sentence “a red fox.”:

Pnorm(“a red fox.”) = P(“a red fox.”) ^ (1 / 4) = 0.465
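In Python terms, a minimal sketch that reuses the probability computed in the previous step:

```python
# Length-normalized sentence probability (geometric mean of the word probabilities).
n_words = 4          # "a", "red", "fox", "."
p_sentence = 0.0469  # P("a red fox.") from the previous step

p_norm = p_sentence ** (1 / n_words)
print(round(p_norm, 3))  # 0.465
```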

Great! This number can now be used to compare the probabilities of sentences with different lengths. The higher this number is for a well-written sentence, the better the language model.

So, what does this have to do with perplexity? Well, perplexity is just the reciprocal of this number.

Let’s call PP(W) the perplexity computed over the sentence W. Then:

PP(W) = 1 / Pnorm(W)

= 1 / (P(W) ^ (1 / n))

= (1 / P(W)) ^ (1 / n)

This is the formula of perplexity. Since perplexity is just the reciprocal of the normalized probability, the lower the perplexity on a well-written sentence, the better the language model.
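Continuing the sketch, the perplexity of our trained model on the example sentence comes out to roughly 2.15:

```python
# Perplexity as the reciprocal of the normalized probability:
# PP(W) = (1 / P(W)) ** (1 / n)
p_sentence = 0.0469
n_words = 4

pp = (1 / p_sentence) ** (1 / n_words)
print(round(pp, 2))  # ~2.15
```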

Let’s try computing the perplexity with a second language model that assigns equal probability to each word at each prediction. Since the language model can predict only six words, the probability of each word will be 1/6.

P(“a red fox.”) = (1/6) ^ 4 = 0.00077

Pnorm(“a red fox.”) = P(“a red fox.”) ^ (1/4) = 1/6

PP(“a red fox.”) = 1 / Pnorm(“a red fox.”) = 6

…which, as expected, is a higher perplexity than the roughly 2.15 (= 1 / 0.465) produced by the well-trained language model.
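The same few lines confirm the uniform baseline: a model that spreads probability evenly over a six-word vocabulary has a perplexity equal to the vocabulary size.

```python
# Uniform model over a 6-word vocabulary: perplexity equals the vocabulary size.
vocab_size = 6
n_words = 4

p_sentence = (1 / vocab_size) ** n_words           # 0.00077
pp_uniform = (1 / p_sentence) ** (1 / n_words)
print(round(p_sentence, 5), round(pp_uniform, 2))   # 0.00077 6.0
```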

Perplexity and Entropy

Perplexity can also be computed starting from the concept of Shannon entropy. Let’s call H(W) the entropy of the language model when predicting a sentence W, i.e. the average number of bits the model needs to encode each word: H(W) = -(1/n) * log2(P(W)). Then, it turns out that:

PP(W) = 2 ^ (H(W))

This means that, when we optimize our language model, the following statements are all more or less equivalent:

  • We are maximizing the normalized sentence probabilities given by the language model over well-written sentences.
  • We are minimizing the perplexity of the language model over well-written sentences.
  • We are minimizing the entropy of the language model over well-written sentences.
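To check the relation numerically, here is one last sketch that uses the probability from our running example and computes H(W) as the average negative log2 probability per word:

```python
import math

# Per-word entropy of the model on the sentence: H(W) = -(1/n) * log2(P(W)).
p_sentence = 0.0469
n_words = 4

entropy = -math.log2(p_sentence) / n_words
pp_from_entropy = 2 ** entropy
pp_direct = (1 / p_sentence) ** (1 / n_words)

print(round(entropy, 2))          # ~1.10 bits per word
print(round(pp_from_entropy, 2))  # ~2.15
print(round(pp_direct, 2))        # ~2.15 -- the two formulas agree
```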
