Perplexity calculation in NLP

Ayushman Pranav
3 min read · Mar 14, 2024


Perplexity is a measure used in natural language processing to evaluate how well a probabilistic model predicts a sample. It’s commonly used to assess the performance of language models. The perplexity (PP) of a language model for a given text is defined as the inverse probability of the test set, normalized by the number of words.
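In symbols, for a test set W = w1 w2 … wN of N words, that definition is:

PP(W) = P(w1 w2 … wN)^(−1/N)

and for a bigram model the joint probability factors as P(w1) × P(w2 | w1) × P(w3 | w2) × … × P(wN | wN−1).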

For a bigram model, the perplexity is calculated using the bigram probabilities of the test sentences. Here’s how you can compute it, step by step, with a simple example:

Example Corpus

Let’s consider a tiny sample corpus and a test set:

  • Training Corpus: “the cat sat on the mat the cat likes the mat”
  • Test Corpus: “the cat sat on the mat”

We will build a bigram model from the training corpus and then calculate the perplexity for the test corpus.

Steps to Calculate Perplexity

  1. Build the Bigram Model: First, count the occurrences of all bigrams (pairs of consecutive words) and the individual words in the training corpus.
  2. Calculate Bigram Probabilities: Use these counts to estimate the conditional probabilities of each bigram (i.e., the probability of the second word given the first word).
  3. Compute Perplexity: Use the bigram probabilities from the model to calculate the perplexity for the test corpus, which involves taking the inverse probability of the test set, normalized by the number of words.

Calculation

  1. Bigram Counts from Training Corpus:
  • the cat: 2
  • cat sat: 1
  • sat on: 1
  • on the: 1
  • the mat: 2
  • mat the: 1
  • cat likes: 1
  • likes the: 1

2. Unigram Counts:

  • the: 4
  • cat: 2
  • sat: 1
  • on: 1
  • mat: 2
  • likes: 1

3. Bigram Probabilities:

Each bigram probability is the bigram count divided by the count of its first word, P(w2 | w1) = C(w1 w2) / C(w1):

  • P(cat | the) = 2/4 = 0.5
  • P(sat | cat) = 1/2 = 0.5
  • P(on | sat) = 1/1 = 1
  • P(the | on) = 1/1 = 1
  • P(mat | the) = 2/4 = 0.5
  • P(the | mat) = 1/2 = 0.5
  • P(likes | cat) = 1/2 = 0.5
  • P(the | likes) = 1/1 = 1
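These maximum-likelihood estimates can also be computed programmatically; a minimal sketch (the helper name `bigram_prob` is illustrative):

```python
from collections import Counter

training = "the cat sat on the mat the cat likes the mat".split()
unigram_counts = Counter(training)
bigram_counts = Counter(zip(training, training[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = C(w1 w2) / C(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob("the", "cat"))  # 0.5
print(bigram_prob("sat", "on"))   # 1.0
```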

4. Calculate Perplexity for the Test Corpus:

First, compute the probability of the entire test sentence:

Assuming the probability of seeing “the” as the first word is 1 (for simplicity, as we’re not given a specific initial distribution):

P(the cat sat on the mat) = 1 × 0.5 × 0.5 × 1 × 1 × 0.5 = 0.125
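That product can be checked directly; a minimal sketch using the bigram estimates from the counts above, with P(the) = 1 for the first word as assumed in the text:

```python
from collections import Counter

training = "the cat sat on the mat the cat likes the mat".split()
test = "the cat sat on the mat".split()

unigram_counts = Counter(training)
bigram_counts = Counter(zip(training, training[1:]))

# Start from P(first word) = 1, per the simplifying assumption above.
prob = 1.0
for w1, w2 in zip(test, test[1:]):
    prob *= bigram_counts[(w1, w2)] / unigram_counts[w1]

print(prob)  # 0.125
```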

Normalizing by the N = 6 words of the test sentence gives PP = 0.125^(−1/6) = 8^(1/6) = √2 ≈ 1.414. So, the perplexity of the given test corpus with respect to the bigram model trained on the training corpus is approximately 1.414. This means, on average, the model is as confused as if it had to choose uniformly and independently among about 1.414 words each time it makes a prediction. Lower perplexity indicates a better predictive model.
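The final normalization step, as a sketch continuing from the sentence probability computed above:

```python
prob = 0.125  # P(test sentence) from the bigram model above
N = 6         # number of words in "the cat sat on the mat"

# Perplexity: inverse probability, normalized by the number of words.
perplexity = prob ** (-1 / N)
print(round(perplexity, 3))  # 1.414, i.e. the square root of 2
```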

Where can you find more resources and support for NLP?
