In the previous articles 1 2, we learned how to calculate the probability of a given sentence using an n-gram language model. Today, let’s discuss how to calculate the perplexity of a corpus using a language model. To follow along, it is recommended that you first go through the exercises in the previous articles.
Perplexity is the standard metric for measuring the quality of a language model. Qualitatively, perplexity measures the average branching factor per token predicted by the language model, that is, the effective number of equally likely choices the model is weighing for each token. Let’s take a look at the two extreme ends of the metric.
- perplexity of 1: the language model is 100% certain when predicting the next token. This occurs if the language model is severely over-fit to the evaluation corpus. In practice, this should never happen.
- perplexity of V, where V is the size of the vocabulary: the language model assumes a uniform distribution, i.e., a completely random guess. If this is what we get from the language model, we might as well roll a die to predict the next token.
So, perplexity should lie somewhere between 1 and V. A better language model will generally show lower perplexity.
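To make the two extremes concrete, here is a minimal Python sketch. It assumes the per-token view of perplexity (2 raised to the average negative log2 probability the model assigns to each observed token). The function name `token_perplexity` and the vocabulary size of 6 are illustrative choices, not something from the earlier articles.

```python
import math

def token_perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    # Average negative log2 probability per token (cross-entropy in bits).
    cross_entropy = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** cross_entropy

V = 6  # hypothetical vocabulary size (think of a six-sided die)

certain = token_perplexity([1.0] * 10)      # model is always 100% sure
uniform = token_perplexity([1.0 / V] * 10)  # model guesses uniformly at random

print(certain, uniform)
```

Up to floating-point rounding, the first value comes out as 1 and the second as V, matching the two bullet points above.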
Quantitatively, perplexity is given by

$$\text{perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{n} \log_2 q(s_i)}$$

where q(s_i) is the probability of the i-th sentence s_i, n is the number of sentences, and N is the number of tokens in the corpus. Let’s calculate the perplexity of our 2-gram model from before using two evaluation sentences. Create an eval.txt file with the following contents:
that is not the question
that is that
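With the evaluation sentences in hand, the calculation itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions: perplexity is taken as 2 raised to the negative sum of log2 sentence probabilities divided by N, N counts whitespace-separated words only (no sentence-boundary markers, which your model may or may not count), and the values in `placeholder_probs` are made up. Substitute the q(s) values your own 2-gram model produces.

```python
import math

def corpus_perplexity(sentence_probs, num_tokens):
    """2 ** (-(1/N) * sum of log2 q(s)) over all evaluation sentences."""
    log_prob_sum = sum(math.log2(q) for q in sentence_probs)
    return 2 ** (-log_prob_sum / num_tokens)

# The two evaluation sentences from eval.txt.
sentences = ["that is not the question", "that is that"]

# Assumption: N counts words only, without start/end-of-sentence markers.
num_tokens = sum(len(s.split()) for s in sentences)

# Hypothetical sentence probabilities, NOT the output of the 2-gram model.
placeholder_probs = [0.01, 0.05]

print(corpus_perplexity(placeholder_probs, num_tokens))
```

If your model also scores an end-of-sentence token, add one token per sentence to `num_tokens` so that N matches what the probabilities actually cover.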