All About N-gram Language Models

Ananya Singh
MLPurdue
Feb 13, 2023


Introduction

With large language models like OpenAI’s ChatGPT and Google’s Bard in the spotlight, it is easy to overlook much simpler language models (which can work quite well!). One such family of language models is the n-gram model.

For humans, it is quite easy to predict the next word someone is about to say. As an example, think about what word most likely comes after:

After I woke up in the…

You probably came up with something like morning or afternoon depending on your preference for waking up, or even the word bed, but likely not the or microwave.

This task of prediction can also be accomplished by language models!

A language model (LM) in the context of Natural Language Processing (NLP) is a statistical model that assigns probabilities to sequences of words.

Conceptually, such probabilities are meant to represent how likely a string of words is in a particular language.

So, for the example earlier, the LM would assign higher probabilities to the word sequences After I woke up in the [morning / afternoon / bed] and lower probabilities to others like After I woke up in the [the / microwave].

Apart from prediction, language models are also useful for speech recognition, spelling / grammar correction, and machine translation. Let us illustrate how language models might be helpful in these tasks. Firstly, for speech recognition, it is helpful to know that piece of cake is more probable than peace of cake in order to recognize that someone said Here is a piece of cake and not Here is a peace of cake. For spelling / grammar correction, we need to identify and correct errors like Their is a great view from the top of a mountain, which requires knowing that There is is more probable than Their is. Lastly, as part of a machine translation task, we could have a set of potential English translations of a sentence from another language. Knowing the relative probabilities of each sentence suggests which one is the best translation.

One of the earliest forms of language models is the n-gram model. While they are much simpler than the state-of-the-art models, they are an important tool for understanding the fundamental concepts of language modeling.

N-gram Models

An n-gram is a sequence of n words. A 2-gram, commonly known as a bigram, is a two-word sequence like after I, I woke, and woke up. A 3-gram, commonly known as a trigram, is a three-word sequence like after I woke, I woke up, and woke up in.

Before we dive into the mathematical details of n-gram models, let’s talk about the probability of a word w given some history h, written P(w | h). For example, consider:

P(cool | Natural Language Processing is)

One way we can estimate this probability is from relative frequency counts. Take a large corpus (a collection of text), count how often Natural Language Processing is appears, and count how often it is followed by cool. In other words, we want to know how many times the history h was followed by the word w:

P(cool | Natural Language Processing is) = count(Natural Language Processing is cool) / count(Natural Language Processing is)
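To make this concrete, here is a minimal Python sketch of the counting approach, assuming a whitespace-tokenized corpus; the function name and the toy corpus are made up purely for illustration.

```python
def estimate_next_word_prob(corpus_tokens, history, word):
    """Estimate P(word | history) by relative frequency counts."""
    h = len(history)
    history_count = 0
    history_then_word_count = 0
    for i in range(len(corpus_tokens) - h):
        if tuple(corpus_tokens[i:i + h]) == tuple(history):
            history_count += 1
            if corpus_tokens[i + h] == word:
                history_then_word_count += 1
    if history_count == 0:
        return 0.0  # the history never appears in the corpus
    return history_then_word_count / history_count

# Toy usage with a made-up corpus:
corpus = "natural language processing is cool and natural language processing is fun".split()
history = ("natural", "language", "processing", "is")
print(estimate_next_word_prob(corpus, history, "cool"))  # 0.5: cool follows the history 1 time out of 2
```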

Now, if we wanted to know the joint probability of the entire sentence Natural Language Processing is cool, we would have to count all occurrences of this sentence and divide by the count of all possible five-word sequences. That seems like a lot!

A cleverer way to estimate the joint probability of n words is to use the chain rule of probability:

P(w_1 w_2 ... w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ... P(w_n | w_1 ... w_(n-1))

Applying this to the sentence from earlier:

P(Natural Language Processing is cool) = P(Natural) P(Language | Natural) P(Processing | Natural Language) P(is | Natural Language Processing) P(cool | Natural Language Processing is)

Essentially, we are multiplying conditional probabilities of a word given the prior words (history) for each word in the sentence to find the joint probability.

With an enormous corpus, we can compute counts and thereby estimate these probabilities. While this can work in some cases, language is creative and ever-changing: new sentences and combinations of words appear all the time, so many perfectly plausible histories never occur in our corpus. Getting good estimates this way would require an unreasonable amount of data, which is obviously not feasible.

The solution? Make simplifying assumptions. And that’s where n-gram models come in.

The main intuition behind the n-gram model is that instead of computing P(w | h), we can approximate the history by the last few words. More specifically, the last n-1 words.

For example, the bigram model approximates the probability of a word given all prior words by only using the conditional probability of the preceding word. So, instead of computing the probability of:

P(cool | Natural Language Processing is)

We approximate it by:

P(cool | is)

Essentially, when we use a bigram model to predict the conditional probability of the next word, we are making the following approximation:

P(w_i | w_1 ... w_(i-1)) ≈ P(w_i | w_(i-1))

Generalizing this to any n-gram model, we make the following approximation, where N is the size of the n-gram (N = 2 for bigrams and N = 3 for trigrams):

P(w_i | w_1 ... w_(i-1)) ≈ P(w_i | w_(i-N+1) ... w_(i-1))

We estimate these new, simplified conditional probabilities in the same way described earlier, from relative frequency counts. For a bigram model:

P(w_i | w_(i-1)) = count(w_(i-1) w_i) / count(w_(i-1))
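As a rough sketch (not a full implementation), bigram probabilities can be estimated from counts like this, assuming sentences arrive as token lists already padded with START / END markers; the function name train_bigram_model is just an illustrative choice.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Build P(w_i | w_(i-1)) = count(w_(i-1) w_i) / count(w_(i-1)) from token lists."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(word, prev_word):
        if unigram_counts[prev_word] == 0:
            return 0.0  # we have never seen the previous word at all
        return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

    return prob

# Toy usage: sentences are already padded with START / END markers.
prob = train_bigram_model([["START", "NLP", "is", "fun", "END"],
                           ["START", "NLP", "is", "cool", "END"]])
print(prob("is", "NLP"))  # 1.0: "NLP" is always followed by "is" in this tiny corpus
```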

One thing to note is that in unigram models (a.k.a. 1-gram), the joint probability is the product of the probability of each word, since there is no history to account for. This is equivalent to making the assumption of independence in probability and saying that the order of the words does not matter; clearly, this is not a helpful model for language!

Let us now work through a small example with a mini-corpus of three sentences. We need to augment each sentence with symbols like START and END so that we have context for each bigram.

START An N-gram model is cool END
START NLP is fun END
START An N-gram model is used in NLP END

What would the probability be for the sentence NLP is cool?

P(NLP is cool) = P(NLP | START) P(is | NLP) P(cool | is) = 1/3 * 1/2 * 1/3 = 1/18 ≈ 0.056
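We can double-check this arithmetic with a small standalone snippet (the same maximum-likelihood counting as above, nothing more):

```python
from collections import Counter

# The three-sentence mini-corpus, padded with START / END markers.
corpus = [
    "START An N-gram model is cool END".split(),
    "START NLP is fun END".split(),
    "START An N-gram model is used in NLP END".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(b for sent in corpus for b in zip(sent, sent[1:]))

def bigram_prob(word, prev):
    # Maximum-likelihood estimate: count(prev word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

p = (bigram_prob("NLP", "START")   # 1/3
     * bigram_prob("is", "NLP")    # 1/2
     * bigram_prob("cool", "is"))  # 1/3
print(p)  # ≈ 0.056, i.e. 1/18
```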

What about Machine learning is fun?

We never see Machine learning in our corpus, so our probability estimate would be 0, even though this sentence is clearly a plausible one. Zero probabilities are also a practical problem: a single unseen bigram drives the probability of the whole sentence to zero, and metrics that divide by the probability (like perplexity, discussed later) become undefined. Since this is a common problem in statistical models like n-grams, we would like them to generalize to new, unseen sentences.

Generalization

An important quality of Machine Learning models is generalization, because we want good predictions in real-world applications. Here are some ways to help n-gram models generalize better.

Unknown Words

We can keep a fixed word list, called a vocabulary, and convert any word not in the vocabulary to a symbol like UNKNOWN. We then estimate the probabilities for UNKNOWN from its counts, just like any other word in the corpus. This way we won’t encounter as many zeros.

An alternative to a predefined vocabulary is to replace all words that occur fewer than some small number of times in the corpus (say, once or twice) with UNKNOWN.
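A minimal sketch of this rare-word replacement might look like the following; the threshold of 2 and the function name replace_rare_words are arbitrary choices for illustration.

```python
from collections import Counter

def replace_rare_words(sentences, min_count=2):
    """Replace words seen fewer than min_count times with an UNKNOWN token."""
    counts = Counter(w for sent in sentences for w in sent)
    return [[w if counts[w] >= min_count else "UNKNOWN" for w in sent]
            for sent in sentences]

# Toy usage: "fun" and "cool" each occur only once, so they become UNKNOWN.
print(replace_rare_words([["NLP", "is", "fun"], ["NLP", "is", "cool"]]))
# [['NLP', 'is', 'UNKNOWN'], ['NLP', 'is', 'UNKNOWN']]
```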

Smoothing

While the technique described above deals with unknown words, there remains the problem of words that are in the vocabulary but appear in an unseen context, i.e. following a history that was never seen in training. A common and simple way to address this is a technique called smoothing.

There are many types of smoothing, but we will discuss Laplace and Add-k smoothing in this article.

Laplace smoothing means adding one to all the counts in the probability estimations. For a unigram model, the smoothed estimate is:

P_Laplace(w_i) = (c(w_i) + 1) / (N + V)

where c(w_i) is the count of word w_i, N is the total number of words in the corpus, and V is the size of the vocabulary.

Add-k smoothing is similar to Laplace smoothing, but instead of adding one, we add a fractional count k:

P_Add-k(w_i) = (c(w_i) + k) / (N + kV)
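As an illustration, here is a sketch of the same idea applied to bigram counts, where the smoothed estimate becomes (count(w_(i-1) w_i) + k) / (count(w_(i-1)) + kV); with k = 1 this reduces to Laplace smoothing. The function name and the toy usage are assumptions for demonstration.

```python
from collections import Counter

def make_smoothed_bigram_prob(sentences, k=1.0):
    """Add-k smoothed bigram estimate:
    P(word | prev) = (count(prev word) + k) / (count(prev) + k * V)."""
    unigrams = Counter(w for sent in sentences for w in sent)
    bigrams = Counter(b for sent in sentences for b in zip(sent, sent[1:]))
    vocab_size = len(unigrams)

    def prob(word, prev):
        return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

    return prob

# Toy usage on the mini-corpus from earlier (k = 1, i.e. Laplace smoothing):
corpus = [
    "START An N-gram model is cool END".split(),
    "START NLP is fun END".split(),
    "START An N-gram model is used in NLP END".split(),
]
prob = make_smoothed_bigram_prob(corpus, k=1.0)
print(prob("cool", "is"))           # seen bigram: no longer exactly 1/3
print(prob("learning", "Machine"))  # unseen bigram: small but non-zero
```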

Other techniques include discounting, interpolation, and back-off.

Now that we have learned what n-gram models are and how we can help them generalize better, we can move on to the final topic: evaluating different n-gram models to help us choose the best n.

Evaluation

Now that we have a language model, how can we tell if it’s a good model? And if we have multiple language models, how can we tell which one is a better choice?

One way to evaluate a model is to use it for the task at hand and observe its performance; this is known as qualitative evaluation. For example, one could try to generate Shakespeare-like text after training on a corpus of Shakespeare’s works.

The best way to judge the performance of a language model is to integrate it with an application and measure its improvement. This is called extrinsic evaluation and is part of quantitative evaluation.

Unfortunately, doing this is often expensive. A metric to quickly and quantitatively evaluate the performance of a language model seems like a more attractive option. An intrinsic evaluation metric is one such way.

For intrinsic evaluation, we need a training set and a test set. The training set, or training corpus, is the corpus we use to compute the probabilities; the unseen data we use to measure the quality of the n-gram model is the test set, or test corpus. We can then compare two models by evaluating their performance on this test corpus with a metric.

A commonly used metric is perplexity. For a test set W consisting of N words, it is defined as the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w_1 w_2 ... w_N)^(-1/N)

We compute the joint probability using the chain rule described earlier in this article based on the n-gram model used.

For the NLP is cool example, recall that the joint probability was 1/18 (about 0.056). If we treat this sentence as our test set of N = 3 words, its perplexity value would be (1/18)^(-1/3) ≈ 2.62.
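In code, that calculation is just a couple of lines (using the joint probability of 1/18 computed earlier):

```python
# Perplexity for the worked example: PP(W) = P(W) ** (-1 / N)
joint_probability = 1 / 18   # P(NLP is cool) under the bigram model above
num_words = 3                # NLP, is, cool

perplexity = joint_probability ** (-1 / num_words)
print(round(perplexity, 2))  # ≈ 2.62
```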

A model with a lower perplexity is better! Generally, as n increases, perplexity decreases and eventually tends to plateau.

But it is not as simple as that. It is also necessary to consider the number of parameters of the model, because as n increases, the number of parameters increases substantially.

As an illustration, let us consider unigrams, bigrams, and trigrams, and assume the size of our vocabulary is V. A unigram model has V parameters, because that is how many 1-gram word possibilities there are. By the same reasoning, bigram and trigram models have V x V and V x V x V parameters, because that is how many 2-gram word pairs and 3-gram word triplets exist, respectively. As you can see, the number of parameters grows very quickly.
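As a quick back-of-the-envelope check, assuming a hypothetical vocabulary of 10,000 words:

```python
V = 10_000  # hypothetical vocabulary size

for n in (1, 2, 3):
    # Number of possible n-grams, i.e. parameters the model might need to estimate
    print(f"{n}-gram model: {V ** n:,} parameters")
# 1-gram: 10,000 | 2-gram: 100,000,000 | 3-gram: 1,000,000,000,000
```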

Too many parameters means a more complex model and that can lead to over-fitting, i.e. bad news for generalization. So, it is important to have a balance between perplexity and the parameters in an n-gram model!

Limitations

Considering the relative simplicity of n-gram models, it is no surprise that they have limitations.

  1. The probability estimates depend heavily on which words occur in the training corpus, so the model performs poorly when the test corpus looks vastly different. For example, an n-gram model trained on the works of Shakespeare would not generate good predictions for the works of another playwright or poet.
  2. N-grams can have too many parameters, as mentioned before. Using a better-performing n-gram model, which usually means a higher n, requires estimating many more parameters, especially when using larger corpora for better probability estimates. This often requires better feature selection approaches.
  3. N-gram models struggle to capture longer-distance context. It has been shown that beyond 6-grams, the gain in performance is limited. This is a problem for tasks where long-range context is particularly important.

Conclusion

Now that you know all about n-gram language models and their applications, I recommend exploring some mini-projects if you found this interesting (and let me know if you do!). Here is a step-by-step implementation of n-gram models in Python. And here is an application of n-grams in text generation for you to get started!

I hope you learned more about n-gram models through this article! I’d love to hear your thoughts and questions, so feel free to comment below. 😊

Sources

[1] https://web.stanford.edu/~jurafsky/slp3/3.pdf

[2] https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf

[3] CS 577 Natural Language Processing course @ Purdue

[4] https://devopedia.org/n-gram-model
