Perplexity: Your PP Metric

Jinbin · Published in unpack · Feb 1, 2021 · 3 min read
(Image: Chinese Whispers)

People these days are crazy about Natural Language Processing (NLP), because it marks a major milestone in the AI field. It is widely adopted in speech recognition, spelling and grammatical error correction, machine translation and so on. But when it comes to evaluating language models, many practitioners find it taxing. That is why perplexity comes to the rescue.

Before we jump into perplexity, let’s have a brief look at how language models work. Like most machine learning (ML) models, a language model still goes through the training, validation and testing process; however, the data set in this case is a given corpus. After tokenisation, the sentences from the corpus are split into individual words, kept in order. Then, the model estimates the probability that a certain word follows a given word or phrase. Given the sentence “I’d like to eat ______” to fill in, the probability of the next word after ‘eat’ is, by common sense, higher for ‘pizza’ than for ‘and’. This lets us predict the next word or phrase in technologies like spell checking or autocomplete, and in some cases even generate articles automatically. In short, the goal of a good language model is, after training, to predict or generate a next word that makes as much linguistic sense as possible. (A minimal count-based sketch of this idea follows below.)
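
To make this concrete, here is a minimal Python sketch of a count-based bigram model, not code from the original article: the toy corpus, the `next_word_probability` helper and the resulting numbers are all invented for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "i'd like to eat pizza",
    "we love to eat pizza",
    "they like to eat pasta",
    "we like to eat and talk",
]

# Tokenise each sentence and count how often each word follows another.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_probability(prev, nxt):
    """Estimate P(nxt | prev) from relative bigram frequencies."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

# In this toy corpus, 'pizza' is more likely than 'and' after 'eat'.
print(next_word_probability("eat", "pizza"))  # 0.5
print(next_word_probability("eat", "and"))    # 0.25
```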

So, what role does perplexity play in a language model? Simply put, it acts as an evaluation metric. Ideally, we would evaluate a language model by running it inside an application and seeing how much improvement it brings compared with another model; this is called extrinsic evaluation. The downside of this method is cost: it can be expensive and time-consuming to run an extrinsic evaluation for every change made to the model. Thus, we turn to the other side, intrinsic evaluation, and this is where perplexity comes in.

Like other ML models, language models need a metric to evaluate their performance, and here we use the perplexity metric. Treating a language model as a table of probabilities between a word and the next word occurring in the training corpus, perplexity, written PP, is “the inverse probability of the test set, normalised by the number of words”. In the perplexity equation below, a sentence W contains N words, each word is written w, and P is the probability of each word given the ones before it. We can also expand the probability of W using the chain rule, as follows.

$$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}$$

Expanding the probability of W with the chain rule:

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \cdots w_{i-1})}}$$
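
As a rough sketch of how this formula could be computed, here is my own illustration rather than code from the article: the `perplexity` function below takes the conditional probability a hypothetical model assigns to each word of a test sentence and returns the geometric mean of their inverses, summing logs to avoid numerical underflow. The probability values are made up.

```python
import math

def perplexity(word_probabilities):
    """PP(W) from the per-word probabilities P(w_i | w_1 ... w_{i-1})."""
    n = len(word_probabilities)
    # Sum log probabilities instead of multiplying raw probabilities,
    # so long sentences don't underflow to zero.
    log_prob = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_prob / n)

# Hypothetical probabilities a model might assign to a 4-word sentence.
probs = [0.2, 0.5, 0.1, 0.4]
print(perplexity(probs))  # ~3.98
```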

Given the equation above, the more accurate the language model, the higher the probability it assigns to the word sequence, and the lower the perplexity. In other words, we try to minimise the value of PP(W) to get a better model.
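
A quick worked example of my own, for intuition: if a model assigned every one of the N words in a test sentence the same probability of 1/10, then

$$PP(W) = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-\frac{1}{N}} = 10$$

So perplexity can be read as the average number of equally likely next words the model is choosing between at each step; the fewer the choices, the more confident the model.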
