No need to be perplexed by perplexity

Shweta Goyal · Published in Analytics Vidhya · Nov 29, 2019

I was intrigued by the name "perplexity" when I first heard the term in natural language processing, so I thought of writing an article about it. Trust me, perplexity is not at all what it sounds like.

In everyday use, perplexity is a state of confusion, or a complicated and difficult situation or thing. Technically, perplexity is used to measure how good a language model is, where a language model estimates the probability of a sentence, a sequence of words, or an upcoming word. In this article, you will get to know what perplexity really is, and see that it is actually quite simple to understand.

Introduction

Perplexity is a measurement of how well a probability model predicts test data. A language model is a probability distribution over sentences, phrases, or any sequence of words, and perplexity is the metric we use to evaluate such models. A low perplexity indicates the probability distribution is good at predicting the sample. Let’s see how it is used while evaluating language models.

Maths behind perplexity

Basically, the best model is the one that best predicts an unseen test set. The perplexity of a language model on a test set is the inverse probability of the test set, normalized by the number of words. For a test set with words W = w_1, w_2, …, w_N, the perplexity of the model on the test set is:
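PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}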

Written this way, the longer the sentence, the less probable it is, which is why we normalize by the number of words N. Expanding the joint probability with the chain rule gives:
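PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}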

Perplexity is thus a function of the probability of the sentence. Because of the inversion, minimizing the perplexity is the same as maximizing the probability of the test set.
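As a minimal sketch of the computation (the word_prob interface below is hypothetical, standing in for whatever conditional probabilities your language model provides):

import math

def perplexity(sentence, word_prob):
    # word_prob(word, history) is assumed to return P(word | history)
    # under some language model.
    log_prob = 0.0
    for i, word in enumerate(sentence):
        log_prob += math.log(word_prob(word, sentence[:i]))
    # Inverse probability, normalized by the number of words N:
    return math.exp(-log_prob / len(sentence))

# Toy check: if every word gets probability 0.25, the perplexity is 4.
sentence = ["i", "like", "natural", "language"]
print(perplexity(sentence, lambda w, history: 0.25))  # 4.0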

The perplexity of a probability distribution

The perplexity of a discrete probability distribution p is defined as the exponentiation of the entropy:
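PP(p) = 2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)}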

Source: https://en.wikipedia.org/wiki/Perplexity

where H(p) is the entropy of the distribution p(x), and x is a random variable ranging over all its possible values.

Entropy is a measure of the expected (average) number of bits required to encode the outcome of the random variable. Entropy can be seen as a quantity of information, whereas perplexity can be seen as the number of choices the random variable effectively has.

Example:

Consider tossing a fair coin, which comes up either heads or tails.

The entropy of the unknown result of the next toss is maximized when heads and tails are equally likely, i.e. each with probability 1/2.

So the entropy is H = -(1/2 * log2(1/2) + 1/2 * log2(1/2)) = 1 bit.

And the perplexity is 2^1 = 2.

Entropy lives on a logarithmic scale; exponentiating it (2^H for entropy in bits, or e^H for entropy in nats) brings it back to a linear scale. A good language model should assign high probabilities to the words that actually occur, so the smaller the perplexity, the better.
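To make the coin example concrete, here is a small sketch (plain Python, nothing model-specific) that computes entropy in bits and the corresponding perplexity, with a biased coin for comparison:

import math

def entropy_bits(dist):
    # H(p) = -sum_x p(x) * log2(p(x))
    return -sum(p * math.log2(p) for p in dist if p > 0)

def perplexity(dist):
    # Perplexity is the exponentiated entropy: 2^H(p)
    return 2 ** entropy_bits(dist)

fair_coin = [0.5, 0.5]
print(entropy_bits(fair_coin))  # 1.0 bit
print(perplexity(fair_coin))    # 2.0 effective choices

biased_coin = [0.9, 0.1]
print(round(entropy_bits(biased_coin), 3))  # 0.469 bits
print(round(perplexity(biased_coin), 3))    # 1.384 -- less surprising, fewer effective choices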

Perplexity as a branching factor

Any exponentiated entropy measure can be interpreted as a branching factor: the weighted average number of choices a random variable has. Entropy measures uncertainty in bits, but in exponentiated form that uncertainty is expressed as the size of an equally weighted (uniform) distribution with equivalent uncertainty. That is, exp(H(p)) is how many sides you would need on a fair die to get the same uncertainty as the distribution p.

Entropy differs by a constant factor depending on whether you measure it with base-2 or natural logarithms, but perplexity comes out the same whichever base you use, as long as the exponentiation matches the logarithm.
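A quick sanity check of that claim, using an arbitrary toy distribution:

import math

dist = [0.7, 0.2, 0.1]

h_bits = -sum(p * math.log2(p) for p in dist)  # entropy in bits
h_nats = -sum(p * math.log(p) for p in dist)   # entropy in nats

print(2 ** h_bits)       # perplexity from base-2 entropy
print(math.exp(h_nats))  # same value from natural-log entropy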

How are perplexity and probability related?

Minimizing perplexity is the same as maximizing the probability.

  • Higher probability means lower perplexity.
  • The more information the model has about the upcoming words, the lower the perplexity.
  • Lower perplexity means a better model.
  • The lower the perplexity, the closer we are to the true model.

References:

Perplexity (Wikipedia): https://en.wikipedia.org/wiki/Perplexity

Thank you for reading this far. Stay tuned for more!
