Using Perplexity to Evaluate a Natural-Language Model

David Masse
8 min read · Sep 11, 2018

Having built a word-prediction model (please see link below), one might ask how well it works.

https://medium.com/@idontneedtoseethat/predicting-the-next-word-back-off-language-modeling-8db607444ba9

The simplest answer, as with most machine learning, is accuracy on a test set: the percentage of the time the model predicts the nth word (i.e. the last word, or completion) of an n-gram, given its first n-1 words (i.e. the prefix), for n-grams drawn from the same corpus but not used in training the model. In the case of stupid backoff, the model actually generates a ranked list of predicted completions for each test prefix. We can check whether the test completion matches the top-ranked prediction (top-1 accuracy) or use a looser metric: is the actual completion among the top-3-ranked predictions? Is the right answer in the top 10?
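(As a concrete illustration of these three metrics, here is a minimal sketch using a hypothetical ranked list of predicted completions for a single test prefix; the real evaluation loop appears further down.)

predictions = ["the", "a", "his", "this", "year"]  # hypothetical ranked predictions for one prefix
actual = "the"                                     # the test completion
top1 = (predictions[0] == actual)    # does the top-ranked prediction match?
top3 = (actual in predictions[:3])   # is the completion in the top 3?
top10 = (actual in predictions[:10]) # ...or in the top 10?
print(top1, top3, top10)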

These accuracies naturally increase as more training data is used, so this time I took a sample of 100,000 lines of news articles (from the SwiftKey-provided corpus), reserving 25% of them to draw upon for test cases. The training text was count-vectorized into 1-, 2-, 3-, 4- and 5-grams (12,628,355 instances in all, including repeats) and then pruned to keep only those n-grams that appeared more than twice. This still left 31,950 unique 1-grams, 126,906 unique 2-grams, 77,099 unique 3-grams, 19,655 unique 4-grams and 3,859 unique 5-grams. The test set was count-vectorized only into 5-grams that appeared more than once (3,629 unique 5-grams). The final word of a 5-gram that appears more than once in the test set is a bit easier to predict than that of a 5-gram that appears only once (evidence that the latter is rarer in general), but I think the case is still illustrative.
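(The full preprocessing code appears at the end of this post. As a toy illustration of the vectorize-then-prune step, the sketch below counts 1- to 5-grams in a single made-up sentence and keeps only those above a count cutoff; the mini-corpus, and the cutoff of 1 rather than 2, are assumptions chosen just to keep the example small.)

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import WhitespaceTokenizer
toy_corpus = ["the cat sat on the mat the cat sat on the rug"]  # made-up mini-corpus
vec = CountVectorizer(tokenizer = WhitespaceTokenizer().tokenize, ngram_range = (1, 5))
counts = pd.DataFrame(vec.fit_transform(toy_corpus).toarray())
counts.columns = vec.get_feature_names_out()  # get_feature_names() on older scikit-learn, as in the listing below
sums = counts.sum(axis = 0)
print(sums[sums > 1])  # prune: keep only n-grams seen more than once in this toy corpus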

The code below selects 75 test 5-grams at random (only 75 because it takes about 6 minutes to evaluate each one). It then splits off the last word of each 5-gram and checks whether the model predicts the actual completion as its top choice, as one of its top-3 predictions, or as one of its top-10 predictions. Accuracy is quite good (44%, 53% and 72%, respectively) as language models go, since the corpus consists of fairly uniform news-related prose.

It’s worth noting that when the model fails, it fails spectacularly. The average prediction rank of the actual completion was 588, despite a mode of 1. This is because, if, for example, the last word of the prefix has never been seen, the predictions are simply the most common 1-grams in the training data. I have not addressed smoothing, so three of the test completions had never been seen in training at all and were assigned a probability of zero (i.e. they had no rank).
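(A minimal sketch of those rank statistics, using a made-up ranks list in place of the one collected by the evaluation loop below, so that the snippet runs on its own:)

import numpy as np
from collections import Counter
ranks = [1, 1, 2, 7, 1, 3500]  # illustrative values only; the real list is built in the loop below
print(np.mean(ranks))                       # the mean is dragged up by the rare, spectacular misses
print(Counter(ranks).most_common(1)[0][0])  # the mode: the most frequent rank (1 in the experiment above)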

These measures are extrinsic to the model — they come from comparing the model’s predictions, given prefixes, to actual completions. Thanks to information theory, however, we can measure the model intrinsically. (See Claude Shannon’s seminal 1948 paper, A Mathematical Theory of Communication.) We can answer not just how well the model does with particular test prefixes (comparing predictions to actual completions), but also how uncertain it is given particular test prefixes (i.e. unlabeled data).

To understand this we could think about the case where the model predicts all of the training 1-grams (let’s say there are M of them) with equal probability. We could place all of the 1-grams in a binary tree, and then, by asking log (base 2) of M questions of someone who knew the actual completion, we could find the correct prediction. This quantity (log base 2 of M) is known as entropy (symbol H) and in general is defined as H = -∑ (p_i * log(p_i)), where i goes from 1 to M and p_i is the predicted probability score for 1-gram i. (If p_i is always 1/M, we have H = -∑ ((1/M) * log(1/M)) for i from 1 to M. This is just M * -((1/M) * log(1/M)), which simplifies to -log(1/M), which further simplifies to log(M).) Entropy is expressed in bits (if the log chosen is base 2) since it is the number of yes/no questions needed to identify a word.
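(A quick numerical check of that special case; this snippet is a sketch and not part of the model code:)

import numpy as np
M = 31950                     # number of unique 1-grams kept in training
p = np.full(M, 1.0 / M)       # every word predicted with equal probability
H = -np.sum(p * np.log2(p))   # entropy in bits
print(H, np.log2(M))          # the two agree (up to floating-point error): H = log2(M), about 14.96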

If some of the p_i values are higher than others, entropy goes down, since we can structure the binary tree to place more common words in the top layers, thus finding them faster as we ask questions. (Mathematically, low-probability words contribute little to H because p_i * log(p_i) tends to 0 as p_i tends to 0, and high-probability words also contribute little because log(p_i) is close to 0 when p_i is close to 1; most of the entropy comes from words of middling probability.)
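(A small numerical illustration of that effect, with made-up probabilities:)

import numpy as np
uniform = np.full(4, 0.25)                   # four equally likely words
skewed = np.array([0.85, 0.05, 0.05, 0.05])  # one word dominates
entropy = lambda p: -np.sum(p * np.log2(p))
print(entropy(uniform))  # 2.0 bits: two yes/no questions needed
print(entropy(skewed))   # about 0.85 bits: much less uncertainty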

To encapsulate the uncertainty of the model, we can use a metric called perplexity, which is simply 2 raised to the power H, as calculated for a given test prefix. We can then take the average perplexity over the test prefixes to evaluate our model (as compared to models trained under similar conditions). In our special case of equal probabilities assigned to each prediction, perplexity would be 2^log(M), i.e. just M. This means that perplexity is at most M: the model is “M-ways uncertain” and can’t make a choice among M alternatives. If the probabilities are less uniformly distributed, entropy (H) and thus perplexity are lower. For our model below, average entropy was just over 5, so average perplexity was 160. (The average of the per-prefix perplexities comes out higher than 2 raised to the average entropy because the mean of 2^H is dominated by the few high-entropy prefixes.) On average, the model was uncertain among 160 alternative predictions, which is quite good for natural-language models, again thanks to the uniformity of the domain of our corpus (news collected within a year or two).

# Evaluate the model on a random sample of test 5-grams. This relies on test_ngrams,
# score_output and bf, which are defined by the model code further down.
import random
import numpy as np
num_test_grams = len(test_ngrams)
r = random.sample(range(num_test_grams), 75)
test_gram_sample = [test_ngrams[i] for i in r]
entropies = []
top1 = []
top3 = []
top10 = []
ranks = []
for gram in test_gram_sample:
    tgs = gram.split(' ')
    ending = tgs[4]                       # the actual completion (fifth word)
    pref = ' '.join(tgs[0:4])             # the four-word prefix
    so = score_output(pref, fact = 0.4)   # ranked pseudo-probabilities of candidate completions
    entropy = sum(-1 * so * np.log2(so))  # entropy of the predicted distribution, in bits
    entropies.append(entropy)
    top1.append(so.index[0] == ending)
    top3.append(ending in so.index[0:3])
    top10.append(ending in so.index[0:10])
    if ending in bf.index:                # completions never seen in training get no rank
        ranks.append(so.index.get_loc(ending) + 1)
perplexities = []
for ent in entropies:
    perplexities.append(2**ent)
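The summary figures quoted above come from averaging those lists; a minimal sketch of that last step (these few lines are not in the original listing):

print(np.mean(top1), np.mean(top3), np.mean(top10))  # top-1, top-3 and top-10 accuracy (about 0.44, 0.53 and 0.72)
print(np.mean(ranks))                                # average rank of the actual completion (about 588)
print(np.mean(entropies))                            # average entropy in bits (just over 5)
print(np.mean(perplexities))                         # average perplexity (about 160)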

Below, for reference, is the code used to generate the model:

# Python library imports:
import re
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import word_tokenize, WhitespaceTokenizer, TweetTokenizer
np.random.seed(seed=234)
# The below reads in N lines of text from the 40-million-word news corpus I used (provided by SwiftKey for educational purposes) and divides it into training and test text.
N = 100000
with open("news.txt") as myfile:
    all_articles = [next(myfile) for x in range(N)]
articles = all_articles[0:75000]
test_articles = all_articles[75000:N]
joined_articles = [" ".join(articles)]
joined_test_articles = [" ".join(test_articles)]
# The below takes out apostrophes (don't becomes dont), replacing anything that's not a letter with a space. Any single letter that is not the pronoun "I" or the article "a" is also replaced with a space, even at the beginning or end of a document.
def clean_article(article):
    art0 = re.sub("'", '', article)
    art1 = re.sub("[^A-Za-z]", ' ', art0)
    art2 = re.sub(r"\s[B-HJ-Zb-hj-z]\s", ' ', art1)
    art3 = re.sub(r"^[B-HJ-Zb-hj-z]\s", ' ', art2)
    art4 = re.sub(r"\s[B-HJ-Zb-hj-z]$", ' ', art3)
    return art4.lower()
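# Quick sanity check of clean_article on a made-up sentence (this line is illustrative and not part of the original listing):
print(clean_article("I don't see a U turn here!"))  # prints "i dont see a turn here " - the apostrophe and the lone U are gone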
# The below breaks up the training words into n-grams of length 1 to 5 and puts their counts into a Pandas dataframe with the n-grams as column names. The maximum number of n-grams can be specified if a large corpus is being used. The penultimate line can be used to limit the n-grams used to those with a count over a cutoff value.
ngram_bow = CountVectorizer(stop_words = None, preprocessor = clean_article, tokenizer = WhitespaceTokenizer().tokenize, ngram_range=(1,5), max_features = None, max_df = 1.0, min_df = 1, binary = False)
ngram_count_sparse = ngram_bow.fit_transform(joined_articles)
ngram_count = pd.DataFrame(ngram_count_sparse.toarray())
ngram_count.columns = ngram_bow.get_feature_names()
sums = ngram_count.sum(axis = 0)
sums = sums[sums > 2]
ngrams = list(sums.index.values)
# The below similarly breaks up the test words into n-grams of length 5.
test_bow = CountVectorizer(stop_words = None, preprocessor = clean_article, tokenizer = WhitespaceTokenizer().tokenize, ngram_range=(5,5), max_features = None, max_df = 1.0, min_df = 1, binary = False)
ngram_count_sparse_test = test_bow.fit_transform(joined_test_articles)
test_ngram_count = pd.DataFrame(ngram_count_sparse_test.toarray())
test_ngram_count.columns = test_bow.get_feature_names()
test_sums = test_ngram_count.sum(axis = 0)
test_sums = test_sums[test_sums > 1]
test_ngrams = list(test_sums.index.values)
# The helper functions below give the number of occurrences of n-grams in order to explore and calculate frequencies
def number_of_unique_ngrams(n, ngrams):
    grams = 0
    for ng in ngrams:
        ng_split = ng.split(" ")
        if len(ng_split) == n:
            grams += 1
    return grams
def number_of_ngrams(n, ngrams):
    grams = 0
    for ng in ngrams:
        ng_split = ng.split(" ")
        if len(ng_split) == n:
            grams += sums[ng]
    return grams
def base_freq(n, ngrams):
    tot_ngrams = number_of_ngrams(n, ngrams)
    freqs = pd.Series()
    for ng in ngrams:
        ng_split = ng.split(" ")
        if len(ng_split) == n:
            freqs[ng] = sums[ng] / tot_ngrams
    return freqs
# For use in later functions so as not to re-calculate multiple times:
bf = base_freq(1, ngrams)
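# Optional check, not in the original listing: the unique n-gram counts quoted in the text
# (31,950 1-grams down to 3,859 5-grams) can be reproduced with the helper above.
for n in range(1, 6):
    print(n, number_of_unique_ngrams(n, ngrams))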
# The function below finds any n-grams that are completions of a given prefix phrase with a specified number (could be zero) of words 'chopped' off the beginning. For each, it calculates the count ratio of the completion to the (chopped) prefix, tabulating them in a series to be returned by the function. If the number of chops equals the number of words in the prefix (i.e. all prefix words are chopped), the 1-gram base frequencies are returned.
def find_completion_scores(prefix, chops, factor = 0.4):
    cs = pd.Series()
    prefix_split = prefix.split(" ")
    l = len(prefix_split)
    prefix_split_chopped = prefix_split[chops:l]
    new_l = l - chops
    if new_l == 0:
        return factor**chops * bf
    prefix_chopped = ' '.join(prefix_split_chopped)
    for ng in ngrams:
        ng_split = ng.split(" ")
        if (len(ng_split) == new_l + 1) and (ng_split[0:new_l] == prefix_split_chopped):
            cs[ng_split[-1]] = factor**chops * sums[ng] / sums[prefix_chopped]
    return cs
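# Example call (illustrative only; this prefix is made up, not from the original post): score completions
# of "at the end" after chopping one word, i.e. completions of "the end", each down-weighted by 0.4.
print(find_completion_scores("at the end", chops = 1, factor = 0.4).sort_values(ascending = False).head())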
# The below tries different numbers of 'chops' up to the length of the prefix to come up with a (still unordered) combined list of scores for potential completions of the prefix.
def score_given(given, fact = 0.4):
    sg = pd.Series()
    given_split = given.split(" ")
    given_length = len(given_split)
    for chops in range(given_length + 1):
        fcs = find_completion_scores(given, chops, fact)
        for word in fcs.index:
            if word not in sg.index:  # keep the score from the longest matching context (fewest chops)
                sg[word] = fcs[word]
    return sg
# The below takes the potential completion scores, puts them in descending order and re-normalizes them as a pseudo-probability (from 0 to 1).
def score_output(given, fact = 0.4):
    sg = score_given(given, fact)
    ss = sg.sum()
    sg = sg / ss
    sg.sort_values(axis=0, ascending=False, inplace=True)
    return sg
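Finally, a short usage sketch (the prefix here is made up and the exact output depends on the training sample):

top = score_output("at the end of", fact = 0.4)
print(top.head(10))   # the ten highest-scoring completions and their pseudo-probabilities
print(top.index[0])   # the model's single best guess for the next word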
