Natural Language Processing

N-gram language models

Part 3: Optimize model interpolation

Khanh Nguyen
MTI Technology

--

  • To see the code I wrote for this project, you can check out its Github repo
  • For other parts of the project: part 1, part 2, part 3

Background

In previous parts of my project, I built different n-gram models to predict the probability of each word in a given text. This probability is estimated using an n-gram — a sequence of words of length n — that contains the word. The formula below shows how the probability of the word “dream” is estimated as part of the trigram “have a dream”:

The vertical line denotes the probability of “dream” given the previous words “have a”
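Roughly, and setting aside the Laplace smoothing used in the earlier parts, that count-based estimate is:

P(\text{dream} \mid \text{have a}) = \frac{\text{count}(\text{have a dream})}{\text{count}(\text{have a})}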

We train the n-gram models on the book “A Game of Thrones” by George R. R. Martin (called train). We then evaluate the models on two texts: “A Clash of Kings” by the same author (called dev1), and “Gone with the Wind” — a book from a completely different author, genre, and time (called dev2).

The metric to evaluate the language model is average log likelihood: the average of the log probability that the model assigns to each word in the evaluation text.

N_eval: total number of words in the evaluation text
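Written out with the natural log that will be used in this part, the metric is simply:

\text{average log likelihood} = \frac{1}{N_{\text{eval}}} \sum_{i=1}^{N_{\text{eval}}} \ln P(\text{word}_i)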

Often, a base-2 logarithm is applied to each probability, as was the case in the first two parts of the project. In this part, however, I will use the natural log, as it makes it simpler to derive the formulas that we will be using.

Problem

In part 2, the various n-gram models — from unigram to 5-gram — were evaluated on the evaluation texts (dev1 and dev2, see graphs below).

From this, we notice that:

  • The bigram model performs slightly better than the unigram model. This is because the previous word in the bigram provides important context for predicting the probability of the next word.
  • Surprisingly, the trigram model and above are much worse than the bigram or unigram models. This is largely due to the high number of trigrams, 4-grams, and 5-grams that appear in the evaluation texts but nowhere in the training text; their predicted probability is therefore zero.
  • For most n-gram models, performance improves slightly when we interpolate their predicted probabilities with the uniform model. This seems rather counter-intuitive, since the uniform model simply assigns equal probability to every word. However, as explained in part 1, interpolating with this “dumb” model reduces the overfitting and variance of the n-gram models, helping them generalize better to the evaluation texts.

In this part of the project, we extend model interpolation even further: instead of separately combining each n-gram model with the uniform model, we interpolate different n-gram models with one another, along with the uniform model.

What to interpolate?

The first question to ask when interpolating multiple models together is:

Which n-gram models should we interpolate together to give the best result on the evaluation texts?

To answer this question, we use the simple strategy outlined below:

  1. First, we start with the uniform model. This model will have very low average log likelihoods on the evaluation texts, since it assigns every word in the text the same probability.
  2. Next, we interpolate this uniform model with the unigram model and re-evaluate it on the evaluation texts. We naively assume that the models will have equal contribution to the interpolated model. As a result, each model will have the same interpolation weight of 1/2.
  3. We then add the bigram model to the mix. Similarly, in this 3-model interpolation, each model will simply have the same interpolation weight of 1/3.
  4. We keep adding higher n-gram models to the mix, while keeping the mixture weights the same across models, until we reach the 5-gram model. After each addition, the combined model will be evaluated against the two evaluation texts, dev1 and dev2.

Coding the interpolation

In part 2, each evaluation text had a corresponding probability matrix. This matrix has 6 columns — one for each model — and each row holds the probability estimates of a single word under the 6 models. However, since we want to optimize the model performance on both evaluation texts, we will vertically concatenate these probability matrices into one big evaluation probability matrix (803176 rows × 6 columns):

All probability matrices are represented as numpy arrays. Each probability in the matrix was estimated from the training text “Game of Thrones”.
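In code, this concatenation is a numpy one-liner. The per-text matrix names below are my own, for illustration; the combined matrix is the dev_prob_matrix used in the snippets later on:

import numpy as np

# Stack the two evaluation probability matrices row-wise into one
# (803176 x 6) matrix: rows from dev1 first, then rows from dev2.
dev_prob_matrix = np.vstack([dev1_prob_matrix, dev2_prob_matrix])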

Given this combined matrix, computing the average log likelihood of each interpolated model takes just two steps:

  1. Add the relevant columns of the matrix with equal weights to get a single interpolated probability for each row/word
  2. Average the log of these probabilities across rows/words

For example, here is the code for calculating the average log likelihood of the 3-model interpolation (uniform + unigram + bigram), which came out to be -6.26:
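The exact snippet lives in the repo; a minimal numpy sketch of the same calculation, assuming columns 0, 1, and 2 hold the uniform, unigram, and bigram probabilities, could look like this:

import numpy as np

def avg_log_likelihood(prob_matrix, weights):
    # One column of word probabilities per model; the weights must sum to 1
    interpolated = prob_matrix @ np.asarray(weights)  # step 1: one interpolated probability per word
    return np.mean(np.log(interpolated))              # step 2: average the log across words

# 3-model interpolation (uniform + unigram + bigram) with equal weights of 1/3
avg_log_likelihood(dev_prob_matrix[:, [0, 1, 2]], [1/3, 1/3, 1/3])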

Result

Numbers on the x-axis indicate the length of the n-gram (e.g. 1 = unigram)

When more and more models are interpolated into the mix, we see that:

  • The average log likelihood of the training text always increases. This is consistent with our observation in part 2, in which longer n-grams provide important context for predicting the probability of the current word in the training text itself.
  • In contrast, for the evaluation texts — dev1, dev2, and the combined dev — the performance of the interpolated model only improves when the unigram and bigram models are added to the mix. Adding the trigram, 4-gram, and 5-gram models does not improve the interpolated model in any way, and can actually degrade its performance. This is consistent with the terrible performance of these models when evaluated separately in part 2.
  • The relative performance of these models — unigram and bigram good, higher n-grams bad — suggests that we should not naively combine these models with equal weights. Rather, we should optimize the interpolation weights of these models to give the best result on the combined evaluation text.

How to interpolate?

Grid search

Finding the optimal interpolation weights can be done naively as a grid search:

We try different interpolation weights and see which ones have the highest average log likelihood on the combined evaluation text.
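As a small sketch for the two-model case (uniform + unigram), a grid search would simply sweep the unigram weight over a grid, with the uniform weight tied to it, and keep the best value. This reuses the avg_log_likelihood helper sketched earlier:

import numpy as np

# Sweep the unigram weight from 0 to 1; the uniform weight is 1 minus it
best_weight, best_ll = None, -np.inf
for w1 in np.linspace(0, 1, 101):
    ll = avg_log_likelihood(dev_prob_matrix[:, [0, 1]], [1 - w1, w1])
    if ll > best_ll:
        best_weight, best_ll = w1, ll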

However, there are 2 major disadvantages with grid search for this problem:

  • Grid search on 6 model hyper-parameters (one for each model weight) can be computationally expensive, given the high number of weight combinations we would have to evaluate.
  • The hyper-parameters in a grid search are usually independent of one another (for example, gamma and C for an SVM with a radial basis kernel). In contrast, our interpolation weights always have to sum to 1 so that the interpolated probabilities form a valid probability distribution. This makes it trickier to choose candidate interpolation weights to evaluate.

Optimization algorithms

Alternatively, we can find the optimal interpolation weights using optimization algorithms:

We use an algorithm to automatically find the interpolation weights with the highest average log likelihood on the combined evaluation text.

To set up the algorithm, the objective function and constraint are written below:

We maximize the average log probability across words in the combined evaluation text

where:

  • N_word: total number of words in the combined evaluation text
  • P̂(word): interpolated probability of each word
  • a_i: interpolation weight of model i, from uniform (i=0) to 5-gram (i=5)
  • P_i(word): estimated probability of each word under model i
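Putting these pieces together, the optimization problem can be written as:

\max_{a_0, \ldots, a_5} \; \frac{1}{N_{\text{word}}} \sum_{\text{word}} \ln \hat{P}(\text{word}), \qquad \hat{P}(\text{word}) = \sum_{i=0}^{5} a_i \, P_i(\text{word}), \qquad \text{subject to} \;\; \sum_{i=0}^{5} a_i = 1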

In the next section, I will go over two algorithms to solve this optimization problem: gradient descent, and the expectation-maximization (EM) algorithm.

Gradient descent

Gradient descent is typically used for unconstrained optimization problems. In contrast, our optimization problem has an additional constraint: all 6 interpolation weights have to sum to one.

Nevertheless, we can incorporate this constraint into the objective function by replacing the weight of the uniform model with 1 minus the sum of the other n-gram model weights:

As a result, the objective function becomes an unconstrained optimization problem of the n-gram model weights:
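Concretely, the substitution and the resulting unconstrained objective (writing F for it, a label I introduce here for later reference) are:

a_0 = 1 - \sum_{i=1}^{5} a_i \qquad \Longrightarrow \qquad F(a_1, \ldots, a_5) = \frac{1}{N_{\text{word}}} \sum_{\text{word}} \ln\!\left( \Big(1 - \sum_{i=1}^{5} a_i\Big) P_0(\text{word}) + \sum_{i=1}^{5} a_i \, P_i(\text{word}) \right)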

We can then use gradient descent to update each n-gram model weight:

1. Initialize each n-gram model weight with some value between 0 and 1, while ensuring that their sum is still lower than 1. For example, we can initialize the 6 models with equal weights of 1/6 each.

2. Differentiate the objective function with respect to each n-gram model weight (a₁ to a₅). This derivative is often called the gradient of the objective function with respect to that weight.

j ∈ {1, 2, 3, 4, 5}
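Carrying out this differentiation on the unconstrained objective F above gives:

\frac{\partial F}{\partial a_j} = \frac{1}{N_{\text{word}}} \sum_{\text{word}} \frac{P_j(\text{word}) - P_0(\text{word})}{\hat{P}(\text{word})}, \qquad j \in \{1, \ldots, 5\}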

3. Update each model weight by some fraction of its gradient from step 2. This fraction often goes by the name of learning rate (λ), since it controls how fast the model weights are updated in each step.

The weight of the uniform model can then be updated by subtracting the updated n-gram model weights from 1:

Finally, we repeat steps 2 and 3 until the average log likelihood converges to its maximum value.

The above 3 steps for gradient descent are implemented in the Python function optimize_gd below. It takes in the probability matrix, a specified learning rate, the number of iterations that we want to run our gradient descent, and optional initialized weights. It then returns the model weights after the gradient descent has finished:
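The exact implementation is in the repo; a minimal sketch that matches the description above and the gradient derived earlier might look like this:

import numpy as np

def optimize_gd(prob_matrix, learning_rate, n_iter, init_weights=None):
    # prob_matrix: (n_words, n_models) word probabilities, column 0 = uniform model
    # Returns all model weights (uniform included) after n_iter gradient steps
    n_models = prob_matrix.shape[1]
    weights = (np.full(n_models, 1 / n_models) if init_weights is None
               else np.array(init_weights, dtype=float))
    for _ in range(n_iter):
        interpolated = prob_matrix @ weights  # interpolated probability of each word
        # Gradient of the average log likelihood w.r.t. each n-gram weight
        grads = np.mean((prob_matrix[:, 1:] - prob_matrix[:, [0]]) / interpolated[:, None], axis=0)
        weights[1:] += learning_rate * grads  # step in the gradient direction (we are maximizing)
        weights[0] = 1 - weights[1:].sum()    # uniform weight = 1 minus the n-gram weights
    return weights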

Result for two-model interpolation

To visualize how the gradient descent algorithm works, we will use it to interpolate only 2 models together: the uniform and unigram models.

Uniform weight = 1 - unigram weight

From the graph below, we see that the average log likelihood of the combined evaluation text reaches the maximum at about 15% of the uniform model + 85% of the unigram model.

Alternatively, these optimized weights can be automatically found using the above optimize_gd function:

optimize_gd(prob_matrix=dev_prob_matrix[:, [0, 1]],
            learning_rate=0.1,
            n_iter=10,
            init_weights=[0.9, 0.1])
# array([0.14948199, 0.85051801])

The above result shows that, with initial weights of 0.9 uniform + 0.1 unigram, gradient descent arrives quite close to the optimal values after only 10 iterations (with a learning rate of 0.1). The animated plot below shows the progress of the algorithm during those first 10 iterations:

  • At the beginning, the gradient of the average log likelihood with respect to the unigram model weight is very high — the slope of the tangent line to the curve is quite steep. As a result, the unigram weight takes large steps to the right, since it is updated by a fraction of its gradient.
  • However, as the unigram weight approaches its optimal value of 85%, the slope of the curve levels out. Hence, the unigram weight crawls more and more slowly as it gets closer to the optimal value.

Now that we know how to use gradient descent to optimize the model weights, let’s move on to another optimization algorithm that can achieve the same result: the expectation-maximization (EM) algorithm.

Expectation-maximization (EM) algorithm

If you were to look up resources for the EM algorithm on the Internet, you’d get the impression that the algorithm is extremely complex. However, at the core of the algorithm lies a very simple strategy!

To demonstrate this strategy, I will apply the algorithm to the earlier uniform + unigram interpolation. This consists of the following 4 steps:

1. Initialize values for the uniform and unigram weights. For this example, I again chose the initial unigram weight to be 0.1, which corresponds to a uniform weight of 0.9.

2. Construct a lower bound of the objective function (average log likelihood of the combined evaluation text). Furthermore, this lower bound needs to be tight against the objective function at the initial weights from step 1. In other words, the value of the objective function and the lower bound have to be equal at this point. This is often called the E-step of the algorithm, for reasons that will be explained later.

3. Find the model weights that optimize this lower bound, and update the model weights to these values. This is often called the M-step in the algorithm, whose name will again be explained later.

4. Repeat step 2 to construct a lower bound of the objective function at the updated weights, then step 3 to find the weights that maximize this new lower bound. We keep repeating steps 2 and 3 — E-step and M-step — until we converge to the maximum of the objective function.

Why must the lower bound be tight?

In the E-step, the constructed lower bound needs to be tight against the objective function at the current model weights. The graph below shows what can happen when the lower bound is tight (left panel) versus when it is not (right panel):

A, B: objective function at previous weights and updated weights. A’, B’: lower bound at previous weights and updated weights
  • When the lower bound is tight, the objective function at the updated weights (B) is always higher than that at the current weights (A).
  • In contrast, when the lower bound is not tight, the objective function at the updated weights might be lower than that at the current weights (A > B).

In short, constructing a tight lower bound ensures that the objective function always increases after each iteration of the algorithm, i.e. that the algorithm will converge.

Now that we understand the high-level strategy of the EM algorithm, let’s dive deeper into how we can accomplish its two core steps: the E-step and the M-step.

E-step

In the E-step, we construct the lower bound of the objective function, which for our two-model example becomes:

This lower bound is constructed by first multiplying and dividing each term in the summation by a factor z. For our two-model interpolation, the factors are z₀ for the uniform model and z₁ for the unigram model. Note that these factors are unique to each word in the evaluation text, i.e. each word will have a different z₀ and z₁ associated with it.

Written this way, we can see that the interpolated probability for each word is no longer an interpolation between the uniform and unigram probabilities — P₀ and P₁. Instead, it becomes an interpolation between a₀P₀/z₀ and a₁P₁/z₁, weighted by z₀ and z₁ respectively.

This rewriting of the objective function seems quite tortured, but it allows us to apply an elegant theorem called Jensen’s inequality. In simple terms, it states that:

A concave function applied to a weighted combination of two values, x₀ and x₁, is always greater than or equal to the same weighted combination of the function applied to each value, f(x₀) and f(x₁).

Think of concave functions as those with “frowning ☹️” curves, although a more rigorous definition can be found here. An example of such a function is the logarithm. Jensen’s inequality for the logarithm is therefore illustrated below:

z₀, z₁: combination weights
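In symbols, with combination weights z₀ + z₁ = 1 (and both non-negative):

\ln(z_0 x_0 + z_1 x_1) \;\ge\; z_0 \ln x_0 + z_1 \ln x_1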

Adapting this inequality to each term of our modified objective function, we have:

As a result, this inequality helps us create a lower bound of the objective function, which is the final expression in the above derivation.

Lower bound as expectation

Finding the lower bound of the objective function using Jensen’s inequality is straightforward enough, but where does the name E-step come into play?

As shown above, the lower bound of the objective function can be seen as a sum over words of a weighted combination of two logarithms:

However, this weighted combination can also be viewed as the expectation (the “E” of E-step) of a random variable Z that can take two values:

  • log(a₀P₀/z₀) with probability z₀
  • log(a₁P₁/z₁) with probability z₁

As a result, each word will have a different distribution of Z depending on its z₀ and z₁. One question remains: how can we determine the probabilities z₀ and z₁ for each word?

Finding z₀ and z₁

We can answer this question by looking at the most important requirement of the lower bound: it must be tight against the objective function at the weights of the current iteration.

For our two-model problem, I will call the weight of the uniform model at iteration k a₀ᵏ, and the weight of the unigram model at that iteration a₁ᵏ. For the first iteration (k=0), these iteration weights are simply the weights that we initialized in step 1.

Given these iteration-k weights, our goal is to find z₀ and z₁ such that the objective function (left-hand side) is equal to the lower bound (right-hand side) at precisely these weights:

It turns out that, for the left-hand side to equal the right-hand side, we need our z₀ and z₁ to be:
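Namely, anticipating the Bayes’ theorem interpretation given below, for each word:

z_0 = \frac{a_0^{k} \, P_0(\text{word})}{a_0^{k} \, P_0(\text{word}) + a_1^{k} \, P_1(\text{word})}, \qquad z_1 = \frac{a_1^{k} \, P_1(\text{word})}{a_0^{k} \, P_0(\text{word}) + a_1^{k} \, P_1(\text{word})}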

To see why, let’s plug these values of z₀ and z₁ into the lower bound (RHS) to confirm that it becomes equal to the objective function (LHS):

Interpretation of z₀ and z₁

Given the above formulas for z₀ and z₁, we can find a more intuitive explanation for why they take these values. Namely, for each word in the text:

  • z₀ is the posterior probability that the uniform model was the one that generated that word, given the prior a₀ᵏ and the likelihood P₀
  • z₁ is the posterior probability that the unigram model was the one that generated that word, given the prior a₁ᵏ and the likelihood P₁

This is simply Bayes’ theorem in action, and can be seen more clearly in the diagram below:

In other words:

z₀ and z₁ represent the probabilities that the uniform or unigram model was the one that generated a given word, given our prior belief about the relative plausibility of each model at the current iteration (a₀ᵏ and a₁ᵏ), as well as the probability of that word being generated by each model (P₀ and P₁).

Since both the priors and the likelihoods are known, these posterior probabilities can be calculated for every single word using the above formulas. Once z₀ and z₁ are calculated, the lower bound of the objective function becomes a function of only a₀ and a₁, as can be seen in the earlier graph for the E-step.

M-step

After the lower bound of the objective function is constructed, we can find the model weights — a₀ and a₁ — that maximize this lower bound (hence the M in M-step).

For this, we can use the earlier trick of setting the uniform weight a₀ as 1 minus the n-gram model weights, which in this case is just 1 - a₁.

With only one variable to optimize, we can simply differentiate the lower bound with respect to a₁:

We can further simplify this expression by replacing z₀ with 1 - z₁, since the two posterior probabilities are complements of each other:

Setting this derivative to zero, we can directly find the value of a₁ that maximizes the lower bound:

The above result is quite interesting: despite the tedious derivations, the unigram weight that maximizes the lower bound is simply the average of the posterior probabilities for the unigram model! Deriving the uniform weight reveals a similar pattern:

As a result, we simply update the model weights to these lower bound-maximizing values:
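In symbols, the updates going into iteration k+1 are:

a_1^{k+1} = \frac{1}{N_{\text{word}}} \sum_{\text{word}} z_1(\text{word}), \qquad a_0^{k+1} = \frac{1}{N_{\text{word}}} \sum_{\text{word}} z_0(\text{word})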

Repeat E-step and M-step

These updated model weights can then be used to re-calculate the posterior probabilities z₀ and z₁ in the E-step that follows. This, in effect, creates a new lower bound at the updated weights, as illustrated earlier in step 4:

We keep repeating the E-step and M-step until the maximum value of the objective function is reached. The weights for the uniform and unigram model at the last iteration will be taken as the weights to interpolate the two models together, as they correspond to the highest average log likelihood on the combined evaluation text.

Extend EM algorithm for more than 2 models

Construct lower bound

When deriving the EM algorithm for more than 2 models, we end up with a similar strategy. Namely, given the uniform model (i=0) and the n-gram models (from the unigram, i=1, to the 5-gram, i=5), we can construct a lower bound to the objective function using Jensen’s inequality for the logarithm:

E-step

In the E-step, zⱼ is still the posterior probability that each word is generated by model j, given the prior aⱼᵏ and the likelihood Pⱼ. This ensures that the lower bound is tight against the objective function at the 6 current model weights of iteration k.

M-step

In the M-step, we again use the trick of setting the uniform weight a₀ as 1 minus the n-gram model weights:

Here is where it gets tricky, since the derivative of the lower bound with respect to the weight of model j (aⱼ) now also involves the weights of the other n-gram models:

To resolve this, I will introduce a₀ back into the derivative and solve for aⱼ when setting the derivative to zero. Hence, the aⱼ that maximizes the lower bound is now a function of a₀ — the uniform weight at that maximum point.

We now run into a chicken-and-egg problem: we need to know every aⱼ to calculate a₀ (1 minus the sum of the n-gram model weights), yet the value of each aⱼ requires a₀ itself!

To get ourselves out of this conundrum, we can exploit the fact that all model weights always sum to 1. This allows us to calculate a₀ directly without using other aⱼ’s:

With a₀ known, we can use it to calculate the remaining aⱼ’s:

Finally, despite the extremely tedious derivation of the M-step for more than 2 models, we end up with the same result as the two-model case! Namely, for every model j, the weight that maximizes the lower bound is simply the average of the posterior probabilities zⱼ across all words in the evaluation text.
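In other words, for every model j from 0 (uniform) to 5 (5-gram), each EM iteration boils down to:

\text{E-step:}\;\; z_j(\text{word}) = \frac{a_j^{k} \, P_j(\text{word})}{\sum_{i=0}^{5} a_i^{k} \, P_i(\text{word})}, \qquad \text{M-step:}\;\; a_j^{k+1} = \frac{1}{N_{\text{word}}} \sum_{\text{word}} z_j(\text{word})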

Coding the EM algorithm

This simple result means the EM algorithm is extremely easy to implement, despite its apparent complexity:

  • The E-step calculates the posterior probabilities from the model weights of the current step.
  • The M-step uses those posterior probabilities to update the weights for the E-step that follows.

The function optimize_em below demonstrates how these steps are implemented. Note that the core loop of the EM algorithm can be accomplished in just 4 lines of code!
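The exact function is in the repo; a minimal sketch that follows the E-step and M-step formulas above could be:

import numpy as np

def optimize_em(prob_matrix, n_iter, init_weights=None):
    # prob_matrix: (n_words, n_models) word probabilities, one column per model
    # Returns all model weights after n_iter EM iterations
    n_models = prob_matrix.shape[1]
    weights = (np.full(n_models, 1 / n_models) if init_weights is None
               else np.array(init_weights, dtype=float))
    for _ in range(n_iter):
        # E-step: posterior probability that each word was generated by each model
        weighted = prob_matrix * weights
        posteriors = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: new weight of each model = average posterior across all words
        weights = posteriors.mean(axis=0)
    return weights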

From an initial weight of 90% uniform + 10% unigram, the EM algorithm converges to the maximum values of 15% uniform + 85% unigram at 10 iterations:

optimize_em(prob_matrix=dev_prob_matrix[:, [0, 1]],
            n_iter=10,
            init_weights=[0.9, 0.1])
# array([0.14950068, 0.85049932])

Compare EM to gradient descent

Given that gradient descent and the EM algorithm both arrive at the optimal model interpolation weights, why should we pick EM over standard gradient descent?

Left: gradient descent with learning rate of 0.1. Right: EM algorithm

There are two reasons why EM is preferred over gradient descent, especially when interpolating probabilities:

1. We don’t need to set a learning rate for the EM algorithm. In contrast, for gradient descent, too small a learning rate means the algorithm converges slowly, while too large a learning rate can make the algorithm unstable:

EM vs gradient descent (GD) for uniform-unigram interpolation. Models were initialized with equal weights.
  • At a low learning rate, such as the current rate of 0.1 (blue line), gradient descent converges more slowly than EM (black line). This can also be seen in the previous animation, in which EM had already converged after around 5 iterations, while gradient descent arrived at the maximum point later, at nearly 10 iterations.
  • In contrast, at a high learning rate such as 0.4 (red line), gradient descent does not seem to converge: its average log likelihood keeps bouncing around, never quite reaching the maximum value.

2. The weights at each EM iteration are always between 0 and 1, and their sum is always 1.

Updated weights follow valid probability distribution
  • This is because the weights after each iteration are simply the averages of the posterior probabilities across words. Therefore, they always form a valid probability distribution (see accompanying image).
  • In contrast, the weights in gradient descent can be updated so much that they go below 0 or above 1. These weights will still sum to 1, but their values are obviously nonsensical.

In fact, when scaling the interpolation to all 6 models, gradient descent often fails due to these nonsensical weights, especially at high learning rates:

EM vs gradient descent (GD) for interpolation of all 6 models (unigram + 5 n-gram). Models were initialized with equal weights.
  • For example, at the learning rate of 0.01, the average log likelihood curve cuts off before 30 iterations; this is because negative weights cause some words to have negative interpolated probabilities, which are impossible to take the logarithm of! Reducing the learning rate can alleviate this problem, but then training becomes very slow.
  • In contrast, EM still happily converges to the maximum in far fewer than 10 iterations. This shows that EM is much better suited than gradient descent for interpolating probabilities.

Disadvantage of EM

On the other hand, the EM algorithm often converges to a mere local maximum in practice. Fortunately, this is not a problem in our case, since the objective function is concave. This can be seen from its second derivative with respect to each model weight aⱼ, which is always negative (the first derivative was derived earlier for gradient descent):

j ∈ {1, 2, 3, 4, 5}
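Differentiating the earlier gradient once more with respect to aⱼ gives:

\frac{\partial^2 F}{\partial a_j^2} = -\frac{1}{N_{\text{word}}} \sum_{\text{word}} \left( \frac{P_j(\text{word}) - P_0(\text{word})}{\hat{P}(\text{word})} \right)^{\!2} \;\le\; 0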

Since the objective function is concave, the optimized model weights returned by the EM algorithm form a global maximum: they are the best possible model weights we can find to maximize the average log likelihood of the evaluation text.

Result

When we apply the EM algorithm to find the best interpolation weights among our 6 models (uniform + 5 n-gram models), we see that:

Models were initialized with equal weights (1/6)
  1. The algorithm converges quite fast: within 5 iterations, the average log likelihood of the combined evaluation text has essentially leveled out at the maximum value.
  2. After 10 iterations, we obtain the corresponding weights for each model. Specifically, the best possible interpolation of our 6 models is one with:
  • 15.6% of the uniform probability
  • 34.4% of the unigram probability
  • 43.1% of the bigram probability
  • 6.5% of the trigram probability
  • 0.4% of the 4-gram probability
  • 0.01% of the 5-gram probability

Interpretation

This result aligns with our results from the previous parts, namely:

  • The lower n-gram models, such as the unigram model and especially the bigram model, are the best-performing models when evaluated separately. Therefore, it is no surprise that they have very large interpolation weights in our 6-model combination.
  • In contrast, the higher n-gram models — trigram, 4-gram, and 5-gram — were much worse, largely due to the large number of those n-grams that appear in the evaluation text but not in the training text. As a result, they have much lower weights in comparison.
  • Interpolating the n-gram models with the uniform model helps them generalize better to the evaluation text. This corresponds to the small percentage of the uniform probability in the interpolation.

When compared with the interpolated model with equal weights (which were the initial weights of the EM algorithm), we see that the model with EM-optimized weights improves the average log likelihood of the combined evaluation text from -6.52 to -6.15.

Testing EM-optimized model

Despite the optimistic result of the EM-optimized weights, these weights were learned by directly optimizing the average log likelihood of the evaluation texts themselves. As a result, we should evaluate these weights on a new set of texts to ensure that our EM algorithm did not just overfit on those evaluation texts.

For this new set of texts, I again chose a book that is similar to train (and dev1): “A Storm of Swords”, the third book of the Game of Thrones series by George R. R. Martin. In contrast, I also chose a book that is completely different from the training text: “Scarlett”, the sequel to “Gone with the Wind” (dev2).

dev1, dev2: texts on which EM was trained. test1, test2: new texts used to evaluate the interpolated models

Fortunately, when the interpolated model with EM-optimized weights is applied to the new texts, the average log likelihoods also improve compared to those of the model with equal weights.

This makes sense, as we already know that the lower n-gram models perform much better than the higher n-gram models. Therefore, a model that puts more weight on the former compared to the latter is likely to perform better than a model that just assigns the same weights to all models, regardless of which text it is applied on.

Current landscape of language models

Varieties of n-gram models

The two methods that we have used thus far in this project, Laplace smoothing and model interpolation with the expectation-maximization algorithm, are just two of many methods for combining n-gram models.

  • For example, here are some smoothing techniques for n-gram models, whose names often curiously involve a pair of researchers: Good-Turing, Jelinek-Mercer, Church-Gale, Witten-Bell, or Kneser-Ney (currently the best smoothing technique). Similar to Laplace smoothing, these methods all try to deal with the problem of unknown n-grams in the evaluation text, and involve interpolating different n-gram models together.
Interpolated Kneser-Ney smoothing for bigrams (Source)
  • In contrast, an alternative to interpolation is backoff, as in models such as Katz backoff and stupid backoff. These models deal with unknown n-grams not by interpolating n-gram probabilities together. Instead, if a higher-order n-gram is not found (say, a 4-gram), they back off to an existing shorter n-gram (such as a trigram) to estimate the probability, and so on.
Katz backoff (Source)

Problems of n-gram models

Despite the proliferation of n-gram language models, they all face two serious problems:

1. They are big: for a given text, the number of unique n-grams (from unigram to 5-gram in our project) can be several times the total number of words in the text, as seen below for our training text:

783k unique n-grams vs 321k words in our training text

As a result, if a similar n-gram model were trained on the same data as BERT (2.5 billion words from Wikipedia + 800 million words from a book corpus), it would have to store the counts of more than 8 billion unique n-grams. That is many times more parameters than even BERT-large, which has only 340 million parameters in total!

2. They are dumb: an n-gram model can only estimate the probabilities of n-grams that it has seen. When it encounters a new n-gram, the model completely breaks down, no matter how many n-grams it has stored.

This is also called the sparsity problem of n-gram models, since no matter how large the training text is, the n-grams within it can never cover the seemingly infinite variations of n-grams in the English language.


Therefore, one could picture an n-gram language model as the Hulk: despite its gigantic size, it is deep down quite simplistic. It cannot, for example, give a good estimate for “Hulk crash” if the n-grams in the training text are all “Hulk smash”.

The rise of neural language models

Viewed another way, n-gram models just count the surface features of the English language — the n-grams that we see on the page. They have little understanding of the deeper features of natural language — such as semantics — that humans possess: we know that “Hulk crash” is quite plausible given the repeated occurrences of the violent “Hulk smash”, even if we have never encountered such a phrase before.

Left: recurrent neural network (hidden state hᵢ, source). Right: BERT transformer (hidden state Oᵢ, source)

This limitation of n-gram models helps explain the current dominance of neural networks as language models, such as recurrent neural networks or, more recently, transformers. Roughly speaking, these models try to imbue the complexity of natural language into latent features — the hidden states of a neural network. As a result, a neural network will assign a much higher probability to “Hulk crash” than an n-gram model does, since “smash” and “crash” will be trained to have very similar hidden states.

Final remarks

Although n-gram models are outdated, I have learned so much about this most fundamental of language models throughout this project. Not only that, it has helped me connect many machine learning concepts, such as overfitting and the bias-variance trade-off, to the context of natural language processing.

Lastly, this project has taught me that while some concepts might appear highly complex at first, such as the expectation-maximization algorithm, they can have very intuitive explanations underneath. This really motivates me to keep learning more about other seemingly complicated machine learning concepts and breaking them down so that others can learn from them as well.

Reference

  • As mentioned in the previous parts of the project, anyone wishing to learn more about n-gram models would be remiss to skip the relevant chapter in Jurafsky & Martin’s NLP book. Many of the ideas in this project started out from that book.
  • Nevertheless, most NLP resources, including the book above, do not include any implementation of the expectation-maximization algorithm. On the other hand, most derivations of the algorithm focus on applying it to Gaussian mixture models. Since we are mixing n-gram models, not Gaussians, the only instruction I could find for applying the EM algorithm to combining n-gram probabilities is in this lecture note.
  • However, that lecture note does not explain how each step of the EM algorithm is derived. Therefore, I had to derive all the steps on my own, with the help of this very readable lecture note from the famous CS229 Stanford class by Andrew Ng. If you notice anything amiss in my derivation, please feel free to leave feedback on this blog post 🙏
