Perplexity and accuracy in classification

yvesborenstein · Published in unpack · Feb 1, 2021

Perplexity is a metric used mainly for language models. But since it is defined as the exponential of the model’s cross entropy, why not think about what perplexity can mean for the fundamental case most of us learn about before language models, the one that uses cross entropy as its loss function: classification. Think image classification, for example ImageNet, or different species of animals, or sports logos. In classification, the usual metric is accuracy. So let’s reflect on the possible relationship between perplexity and accuracy.

Definitions

Let’s first briefly recall the definitions of entropy, cross-entropy, and perplexity and how they are related to each other, without going into depth, and with almost no reference to language models. More detailed accounts, with good intuitive explanations, can be found in the references below.

For a model m modeling a reality, and the real probability distribution p of that reality, say the probability of a sentence following a sequence of sentences, the cross entropy is defined as

H(p, m) = -\sum_{i=1}^{n} p(x_i) \log m(x_i)

where n is the number of possible states, assuming we have discrete states.

The entropy of that ‘reality’ is then its cross entropy with itself:

H(p) = H(p, p) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)

The perplexity of the model m is the exponential of its cross entropy:

PP(m) = e^{H(p, m)}
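To make these definitions concrete, here is a minimal sketch in Python; the helper names and the toy distributions are mine, not the article’s.

```python
import numpy as np

def cross_entropy(p, m):
    """H(p, m) = -sum_i p_i * log(m_i), using the natural log."""
    p, m = np.asarray(p, dtype=float), np.asarray(m, dtype=float)
    mask = p > 0                      # by convention 0 * log(0) = 0
    return float(-np.sum(p[mask] * np.log(m[mask])))

def entropy(p):
    """H(p) = H(p, p): the distribution's cross entropy with itself."""
    return cross_entropy(p, p)

def perplexity(p, m):
    """The exponential of the cross entropy."""
    return float(np.exp(cross_entropy(p, m)))

p = [0.5, 0.25, 0.25]                 # the 'reality'
m = [0.4, 0.4, 0.2]                   # the model
print(entropy(p), cross_entropy(p, m), perplexity(p, m))
```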

Perplexity for Classification

First, let’s see what perplexity works out to be for classification, and evaluate the perplexity of the “perfect model”.
Actually, first, let’s define the perfect model. It’s the model whose prediction probabilities are always 1 at the correct label and 0 at the others:

m = (0, \ldots, 0, 1, 0, \ldots, 0)

with all 0’s except a 1 in the place of the correct label.

For a given sample (e.g. an image) in the test set, the ‘reality’ p is the one-hot distribution concentrated on its correct label, so its cross entropy with our candidate model is:

H(p, m) = -\sum_{i=1}^{n} p(x_i) \log m(x_i)

Each term vanishes except the term for the correct label, so it’s:

H(p, m) = -\log m(x_{\text{correct}})

The perplexity is then:

PP = e^{-\log m(x_{\text{correct}})} = \frac{1}{m(x_{\text{correct}})}

The perplexity of the whole test set is then the product of the perplexities of its samples, normalized by taking the number-of-samples-th root:

PP = \left( \prod_{s=1}^{S} \frac{1}{m_s(x_{\text{correct}})} \right)^{1/S}, \quad \text{where } S \text{ is the number of samples}

Each term is ≥ 1, as it is the inverse of a probability. The higher the probability the model assigns to the correct label of each sample, whether that sample ends up classified correctly or not, the lower the perplexity.
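As a quick illustration, here is a minimal sketch of that computation in Python (the function name and toy numbers are mine); it computes exp(mean cross-entropy), which is the same geometric mean written in a numerically safer way.

```python
import numpy as np

def classification_perplexity(probs, labels):
    """Geometric mean of 1 / (probability assigned to the correct label),
    computed as exp(average negative log-probability)."""
    probs = np.asarray(probs, dtype=float)   # shape: (num_samples, num_classes)
    p_correct = probs[np.arange(len(labels)), labels]
    return float(np.exp(-np.mean(np.log(p_correct))))

# Tiny example with 3 classes; the third sample is misclassified.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.4, 0.3]]
labels = [0, 1, 2]
print(classification_perplexity(probs, labels))   # ~1.81
```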

Perfect Perplexity

As a first reference point, what is the perplexity of the perfect model?

First, a note for the more meticulous among us who may object to the case of an absolute 0 probability in the formulas above: what about the term

0 \cdot \log(0)

as is the case for the perfect model? Let us recall that this term is still taken to be 0, as

\lim_{x \to 0^+} x \log x = 0
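For completeness, one standard way to verify this limit (my addition, via L’Hôpital’s rule):

\lim_{x \to 0^+} x \log x = \lim_{x \to 0^+} \frac{\log x}{1/x} = \lim_{x \to 0^+} \frac{1/x}{-1/x^2} = \lim_{x \to 0^+} (-x) = 0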

As a reminder, the graph of x log(x) dips slightly below zero between 0 and 1 and returns to 0 at both endpoints; in particular it approaches 0 as x approaches 0 from the right.

So for the perfect model, every term vanishes, even the ‘surviving’ term above, because it is

-1 \cdot \log(1)

which equals 0. Thus, the cross entropy of the perfect model is 0!

Unlike a language model, where perfection is unachievable, here we do have a theoretical notion of a perfect model, and it actually achieves a perplexity of 1, because each term in the above product is 1; equivalently, its cross entropy is 0.
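Reusing the hypothetical helpers from the first sketch above, this is easy to check numerically:

```python
# For a one-hot 'reality' and a perfect model that matches it exactly,
# the cross entropy is 0 and the perplexity is 1.
p_perfect = [0.0, 1.0, 0.0]
print(cross_entropy(p_perfect, p_perfect))  # 0.0
print(perplexity(p_perfect, p_perfect))     # 1.0
```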

Good Perplexity, relative to Accuracy

Of course, there is no perfect model in practice. Let’s instead think of a reasonable standard for perplexity and how it jibes with accuracy.
We generally like to measure the quality of a classification model by its accuracy. We know that accuracy is unfortunately not differentiable, and therefore cannot be used as a loss function. Conversely, we have all learned that the loss function, which is generally cross entropy for classification, is not a metric. But perplexity is just the exponential of the loss function. So how can it be useful as a metric?

What I propose here is a rough way to measure the quality of a model, given its accuracy, even if that accuracy is low. Suppose I told you: here is a model with 68% accuracy and it’s all you can use in your work (say, culling images to include only those of your own products for display on your website). Or suppose I gave you two different models, each with 68% accuracy: which one would you choose? 68% is lousy, but what could you still hope for from such a model in order to use it with confidence?

Here is what I would wish:

- when the model’s prediction is correct, that prediction should be confident (a high probability);
- when the model’s prediction is confident, it should be correct;
- when the model gets it wrong, its prediction probabilities should be safely spread, with the correct label still ranked near the top.

Another way of saying this intuitively is: the model should still be on the right track despite getting it wrong much of the time. It should know what it knows, and have confident predictions in those cases, and know what it doesn’t know, and have safely spread predictions in those cases. I would call it: confident when correct and correct when confident, as well as safe when incorrect. How about calling that 2CWC.

Now, what does the last condition above mean? What does it translate to quantitatively? How about this: the first and second highest predictions of the model should total fairly high when one of them is correct, say 0.9. The remaining predictions must then average ≤ 0.1/(N − 2), where N is the number of labels. And in those cases where the top 2 predictions are both incorrect, which are hopefully a minority of the cases, the prediction for the correct label should still be ≥ 1/(N − 2).
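Written out, with p_{(1)} and p_{(2)} denoting the two highest predicted probabilities for a sample (notation mine), the requirement reads roughly:

p_{(1)} + p_{(2)} \ge 0.9 \ \text{when the correct label is among the top 2}, \qquad m(x_{\text{correct}}) \ge \frac{1}{N-2} \ \text{otherwise}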

Let’s give a quantified example of such requirements. Perfecting it is left to the reader according to their preferences, or better, according to heuristic findings over many experiments. It could be formalized by replacing the bounds below with less arbitrary definitions, but this should give you the idea.

One could also develop this further into the top 3, top 4, or top k predictions, rather than the top 2. You would then be dividing by N − k at the end rather than N − 2, but let’s leave it at top 2 for simplicity. Does this sound reasonable? I hope so. We can then evaluate what it implies for the perplexity.

Let’s say we have 1000 labels, as in the ILSVRC (1) ImageNet competition. Then N − 2 = 998.

Contribution from correct (highest) predictions (68% of them):
At most

Contribution from correct second highest predictions:
At most

Contribution from incorrect first and second highest prediction cases:
At most

Given the constraints proposed above, the last 2 contributions will be at their greatest when the proportions are at their defined limits, of 75/100 and 25/100 respectively.

Finally, multiplying the three contributions together, we get an upper bound for the perplexity of a “good” 68%-accuracy model:

3.1147017916565107 ≈ 3.1147
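To make the bookkeeping above reproducible, here is a minimal sketch of how such a bound can be computed from assumed case proportions and probability floors. The specific floors in the example (0.75 for correct top predictions, 0.15 for correct second predictions, 0.1/(N − 2) otherwise) are hypothetical placeholders of mine, not the exact bounds behind the 3.1147 figure.

```python
import math

def perplexity_upper_bound(buckets):
    """Upper bound on test-set perplexity (the geometric mean of 1/p_correct),
    given (proportion_of_samples, lower_bound_on_p_correct) for each bucket."""
    log_pp = 0.0
    for proportion, p_correct_floor in buckets:
        log_pp += proportion * -math.log(p_correct_floor)
    return math.exp(log_pp)

# Hypothetical floors for a 68%-accuracy model with N = 1000 labels:
N = 1000
buckets = [
    (0.68, 0.75),           # top prediction correct, assumed p_correct >= 0.75
    (0.24, 0.15),           # second prediction correct, assumed p_correct >= 0.15
    (0.08, 0.1 / (N - 2)),  # correct label outside the top 2, assumed >= 0.1/(N-2)
]
print(perplexity_upper_bound(buckets))  # ~4.0 with these placeholder floors
```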

So there we have it. If we are happy with the criteria above for our 68%-accuracy model, then we can compare its perplexity with 3.1147 to see whether we are comfortable that it is sufficiently 2CWC!
(Confident when correct and correct when confident.)

Caveat

I had to choose lower bounds on the predictions in order to get an upper bound on the perplexity. However, I wonder: if you have a model whose highest predictions are low, say below 0.5 on average, but whose accuracy is high, say it’s correct 98.6% of the time, and the correct label is almost always within the first 2 or 3 highest predictions, that would still be a reliable model, but with a high perplexity.

I (pseudo-)defined a way to measure confidence and self-awareness in a model, but what I was really after is what I just mentioned: a model whose correct label is almost always within the top FEW predictions, even if THE top prediction is incorrect much of the time. Perhaps another measure should be introduced, in addition to accuracy and perplexity, to cover the case of good top predictions but low confidence.

In my example of wanting to display only images containing your company’s products or logo, that’s actually all you need. Even if your model has low accuracy (for the top prediction) and low prediction values, i.e. low confidence, as long as it has high accuracy among the top 2 or 3 predictions, you can confidently select the images whose top 3 predictions are all your company’s.
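As a sketch of what that selection might look like in Python (the function names and the product_labels set are my own illustration, not part of the article):

```python
import numpy as np

def top_k_accuracy(probs, labels, k=3):
    """Fraction of samples whose correct label is among the k highest predictions."""
    probs = np.asarray(probs, dtype=float)
    topk = np.argsort(probs, axis=1)[:, -k:]   # indices of the k largest probabilities
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

def select_images(probs, product_labels, k=3):
    """Indices of images whose top-k predicted labels all belong to product_labels."""
    probs = np.asarray(probs, dtype=float)
    topk = np.argsort(probs, axis=1)[:, -k:]
    return [i for i in range(len(probs)) if set(topk[i]) <= set(product_labels)]
```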

Perplexity is a good way to measure the kind of confidence I called 2CWC, but that confidence may not be necessary for your needs, in which case you need to look at more than perplexity.

References:

Referenced sources:

(1) ILSVRC ImageNet competition: http://www.image-net.org/challenges/LSVRC/

Sources on entropy, cross-entropy, and perplexity:

Article: Entropy, Perplexity and Its Applications: https://leimao.github.io/blog/Entropy-Perplexity/

A short piece on perplexity for classification: https://jamesmccaffrey.wordpress.com/2016/08/16/what-is-machine-learning-perplexity/

The Wikipedia article on perplexity: https://en.wikipedia.org/wiki/Perplexity#References

Documents by Dan Jurafsky, introducing perplexity but focused on NLP:

Stanford presentation: Language Models, Introduction to N-grams: https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf

Book: Speech and Language Processing — An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition: https://web.stanford.edu/~jurafsky/slp3/ed3book_dec302020.pdf
