Hidden Markov Models: The Secret Sauce in Natural Language Processing

om pramod
16 min read · Dec 30, 2023


Part 8: Cracking the Decoding Problem in HMMs: A Viterbi Algorithm Primer

Decoding problem: In Hidden Markov Models (HMMs), decoding refers to the process of determining the most probable sequence of hidden states given an observed sequence of emissions, which in Natural Language Processing (NLP) is typically a sequence of words. This is also known as the “decoding task”, and it is solved with the Viterbi algorithm, a dynamic programming algorithm. In simple words, given a sequence of observations and a trained HMM, the Viterbi algorithm finds the most probable sequence of hidden states that could have produced the observed sequence.

The most likely sequence of hidden states is known as the Viterbi path. Suppose you have the sentence “Time flies like an arrow” and your task is to tag each word with its part of speech. There are many possible combinations of tags for this sentence; the Viterbi algorithm finds the tag sequence with the highest probability of being correct.

The goal is to maximize the likelihood of the observed sequence given the model parameters. In the context of an HMM, the observed sequence is the sequence of observed events or measurements, and the model parameters include the initial state probabilities, transition probabilities, and emission probabilities. The likelihood of the observed sequence given the model parameters is a measure of how probable the observed sequence is under the given model.

For example, in a weather prediction system, the observed sequence could be a series of temperature readings, and the hidden states could be different weather conditions (like sunny, or rainy). The model parameters would include the probabilities of transitioning from one weather condition to another, and the probabilities of observing certain temperatures given each weather condition. The goal would be to find the sequence of weather conditions that best explains the observed temperature readings.
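As a sketch, such a weather model might be written down like this in Python. Every probability below is invented purely for illustration, not estimated from data.

```python
# A toy weather HMM. Every number here is made up for illustration.
initial = {"sunny": 0.6, "rainy": 0.4}            # initial state probabilities

transition = {                                    # P(next state | current state)
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

emission = {                                      # P(temperature reading | state)
    "sunny": {"hot": 0.7, "mild": 0.25, "cold": 0.05},
    "rainy": {"hot": 0.1, "mild": 0.4, "cold": 0.5},
}

# Sanity check: each distribution sums to 1.
for dist in [initial, *transition.values(), *emission.values()]:
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Decoding would then mean: given a reading sequence such as ["hot", "mild", "cold"], find the weather sequence that best explains it.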

If you listed every possible state sequence and computed the probability of each one, you would notice that many of the same products are computed more than once.

Without the Viterbi algorithm, we would have to calculate the probability of every possible sequence of states, which would be computationally expensive: the number of possible sequences grows exponentially with the length of the sequence.

With the Viterbi algorithm, we only need to keep track of the most probable sequence of states so far at each time step. At each time step, we compare the probability of transitioning from each possible previous state to each possible current state, and choose the transition that maximizes the total probability. This way, we only need to store the most probable sequence of states up to the current time step, making the algorithm much more efficient. By the end of the sequence, we have the most likely sequence of states that produced the observed sequence. Hence, the Viterbi algorithm is a powerful tool for handling long sequences in HMMs, breaking the problem into simpler subproblems and using dynamic programming to efficiently find the most likely sequence of hidden states.
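To see the exponential blow-up concretely, here is a brute-force scorer for a tiny hypothetical 2-state model. It enumerates and scores all N^T candidate state sequences, which is exactly the work the Viterbi algorithm avoids. All numbers are made up for illustration.

```python
from itertools import product

# Toy 2-state HMM; every probability below is invented for illustration.
Pi = [0.7, 0.3]                        # initial state probabilities
A = [[0.9, 0.1], [0.3, 0.7]]           # A[j][i]: transition from state j to i
B = [[0.8, 0.2], [0.1, 0.9]]           # B[i][o]: emission of symbol o in state i
obs = [0, 0, 1]                        # observed symbol indices

def sequence_prob(states):
    """Probability that this exact hidden-state sequence produced obs."""
    p = Pi[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return p

# Brute force: score every one of the 2**3 = 8 candidate sequences.
candidates = list(product(range(len(Pi)), repeat=len(obs)))
best = max(candidates, key=sequence_prob)
print(len(candidates), best)
```

With 40 states and a 20-word sentence this enumeration would already need 40^20 scores, while Viterbi needs only on the order of 40 * 40 * 20 multiplications.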

Before we dive deeper into the steps, let’s define the two important symbols used in the Viterbi algorithm.

In the Viterbi algorithm, two matrices, delta and psi, are initialized. Both are 2D matrices of size (number of states) x (number of observations).

The Viterbi algorithm uses the transition probabilities and emission probabilities of the hidden Markov model to fill in these two matrices: one holds the intermediate optimal probabilities, and the other holds the indices of the states that achieved them. (Some references call them C, or best_probs, and D, or best_paths; here we call them delta and psi.)


Delta Matrix: The Delta matrix, denoted as δ, stores the maximum probabilities for each state at each time step. These probabilities represent the highest likelihood of being in each state up to the current time step, considering all possible paths taken so far. Each entry delta[i][t] represents the maximum probability of being in state i at time t. The entries in delta are calculated as follows:

For the first observation, δ[i][1] is calculated as P(i) * P(o1|i), where P(i) is the initial state probability and P(o1|i) is the emission probability of the first observation given state i.

For subsequent observations, δ[i][t] is calculated as max(δ[j][t-1] * A[j][i] * P(ot|i)), where A[j][i] is the transition probability from state j to state i and P(ot|i) is the emission probability of the t-th observation given state i. The maximum is taken over all possible states j.

Psi Matrix: The psi matrix, denoted as ψ, stores, for each state at each time step, the previous state that led to the maximum probability in the delta matrix. Each entry ψ[i][t] represents the state that contributed most to the maximum probability of being in state i at time t. The entries in ψ are determined together with the entries in δ during the calculation of δ[i][t]:

For the first observation, there is no previous state to point back to, so ψ[i][1] is set by convention; here we simply set ψ[i][1] = i (some implementations use 0 instead).

For subsequent observations, ψ[i][t] is the state j such that δ[j][t-1] * A[j][i] * P(ot|i) is the maximum over all possible states j. If several states tie, we break the tie in favour of the lowest state index.
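The two update rules can be sketched for a single cell as follows. The values of prev, A_into_i, and b_i are hypothetical stand-ins for the previous delta column δ[·][t-1], the transition column A[·][i], and the emission probability P(ot|i).

```python
# Computing one cell delta[i][t] and psi[i][t] from the previous column.
prev = [0.3, 0.0, 0.0]        # hypothetical delta[j][t-1] for j = 1..3
A_into_i = [0.7, 0.4, 0.0]    # hypothetical A[j][i]: transitions into state i
b_i = 0.5                     # hypothetical P(o_t | i): emission probability

scores = [prev[j] * A_into_i[j] * b_i for j in range(len(prev))]
delta_it = max(scores)                   # delta[i][t]
psi_it = scores.index(delta_it) + 1      # psi[i][t], reported 1-indexed
print(delta_it, psi_it)
```

Since the emission factor b_i is the same in every term, the argmax is decided purely by δ[j][t-1] * A[j][i].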

Here is an example of how these matrices are filled in a simple scenario with three states (1, 2, 3) and three observation symbols (1, 2, 3):

Initial state probabilities (Pi):

Pi = [0.6, 0.2, 0.2]

Transition probabilities (A):

A = [[0.7, 0.3, 0.0],

[0.4, 0.6, 0.0],

[0.0, 0.0, 1.0]]

Emission probabilities (B):

B = [[0.5, 0.5, 0.0],

[0.0, 0.5, 0.5],

[0.0, 0.0, 1.0]]

Observations:

O = [1, 2, 3]
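For convenience, here are the example’s parameters transcribed into Python. The walkthrough uses 1-based indices, while Python lists are 0-based, so state i and symbol o in the text correspond to indices i-1 and o-1 below.

```python
Pi = [0.6, 0.2, 0.2]            # initial state probabilities

A = [[0.7, 0.3, 0.0],           # A[j][i]: transition from state j+1 to state i+1
     [0.4, 0.6, 0.0],
     [0.0, 0.0, 1.0]]

B = [[0.5, 0.5, 0.0],           # B[i][o]: emission of symbol o+1 from state i+1
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]

O = [0, 1, 2]                   # observed symbols 1, 2, 3 as 0-based indices

# Sanity check: Pi and every row of A and B is a probability distribution.
for row in [Pi, *A, *B]:
    assert abs(sum(row) - 1.0) < 1e-9
```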

First, initialize the Delta and Psi matrices to zeros (shown here with one row per time step and one column per state):

Delta = [[0, 0, 0],

[0, 0, 0],

[0, 0, 0]]

Psi = [[0, 0, 0],

[0, 0, 0],

[0, 0, 0]]

Then, fill in the Delta and Psi matrices:

For the first observation (O[1] = 1):

Delta[1][1] = Pi[1] * B[1][1] = 0.6 * 0.5 = 0.3

Delta[2][1] = Pi[2] * B[2][1] = 0.2 * 0.0 = 0

Delta[3][1] = Pi[3] * B[3][1] = 0.2 * 0.0 = 0

Psi[1][1] = 1

Psi[2][1] = 2

Psi[3][1] = 3

After the first observation (O[1] = 1), the Delta and Psi matrices become:

Delta = [[0.3, 0, 0],

[0, 0, 0],

[0, 0, 0]]

Psi = [[1, 2, 3],

[0, 0, 0],

[0, 0, 0]]

Explanation: In the first step of the Viterbi algorithm, we calculate the initial probability for each state given the first observation. This is done by multiplying the initial state probabilities (Pi) with the emission probabilities (B) for the first observation.

Let’s break down the calculations for the first observation (O[1] = 1):

Delta[1][1] = Pi[1] * B[1][1]

This calculates the initial probability for state 1 at time 1: the product of the initial probability of state 1 (Pi[1]) and the emission probability of the first observation given state 1 (B[1][1]). In this case, Pi[1] = 0.6 and B[1][1] = 0.5, so Delta[1][1] = 0.6 * 0.5 = 0.3.

Delta[2][1] = Pi[2] * B[2][1]

This calculates the initial probability for state 2 at time 1. Here Pi[2] = 0.2, but B[2][1] = 0.0 (state 2 cannot emit symbol 1), so Delta[2][1] = 0.2 * 0.0 = 0.

Delta[3][1] = Pi[3] * B[3][1]

Likewise, Pi[3] = 0.2 and B[3][1] = 0.0, so Delta[3][1] = 0.2 * 0.0 = 0.

Next, we fill in the Psi entries for time 1. At time 1 there is no previous state to point back to, so by convention we set ψ[i][1] = i:

Psi[1][1] = 1

Psi[2][1] = 2

Psi[3][1] = 3

In summary, the initial probabilities for each state at time 1 are stored in the Delta matrix, and the Psi entries for time 1 are filled with the convention ψ[i][1] = i.
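The initialization column can be verified in a few lines of Python (Pi and B are the example’s parameters, 0-indexed):

```python
Pi = [0.6, 0.2, 0.2]
B = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
first_obs = 0  # observation symbol 1 as a 0-based column index

# delta[i][1] = Pi[i] * B[i][o1]; only state 1 can emit symbol 1.
delta_col1 = [Pi[i] * B[i][first_obs] for i in range(3)]
psi_col1 = [i + 1 for i in range(3)]   # convention: psi[i][1] = i

print(delta_col1)  # [0.3, 0.0, 0.0]
```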

Note: When calculating the Delta and Psi matrices for the first observation, only the first row of each matrix (the entries for time 1) is updated. The remaining rows stay at zero because they correspond to later time steps.

  1. For each state at time 1, we calculate the initial probability as the product of the initial state probability (Pi) and the emission probability (B) for the first observation. This gives us the first row of the Delta matrix.
  2. The entries for times 2 and 3 depend on observations we have not processed yet, so both the Delta and Psi rows for those time steps remain zeros until the recursion reaches them.

Once we have more observations, we can fill in the rest of the Delta and Psi matrices. For each subsequent observation, we calculate the maximum probability for each state by considering all possible previous states, and then we update the Psi matrix accordingly.

For the second observation (O[2] = 2):

Delta[1][2] = max(Delta[1][1] * A[1][1] * B[1][2], Delta[2][1] * A[2][1] * B[1][2], Delta[3][1] * A[3][1] * B[1][2]) = max(0.3 * 0.7 * 0.5, 0 * 0.4 * 0.5, 0 * 0 * 0.5) = 0.105

Delta[2][2] = max(Delta[1][1] * A[1][2] * B[2][2], Delta[2][1] * A[2][2] * B[2][2], Delta[3][1] * A[3][2] * B[2][2]) = max(0.3 * 0.3 * 0.5, 0 * 0.6 * 0.5, 0 * 0 * 0.5) = 0.045

Delta[3][2] = max(Delta[1][1] * A[1][3] * B[3][2], Delta[2][1] * A[2][3] * B[3][2], Delta[3][1] * A[3][3] * B[3][2]) = max(0.3 * 0 * 0, 0 * 0 * 0, 0 * 1 * 0) = 0

Psi[1][2] = argmax(Delta[1][1] * A[1][1] * B[1][2], Delta[2][1] * A[2][1] * B[1][2], Delta[3][1] * A[3][1] * B[1][2]) = 1

Psi[2][2] = argmax(Delta[1][1] * A[1][2] * B[2][2], Delta[2][1] * A[2][2] * B[2][2], Delta[3][1] * A[3][2] * B[2][2]) = 1

Psi[3][2] = argmax(Delta[1][1] * A[1][3] * B[3][2], Delta[2][1] * A[2][3] * B[3][2], Delta[3][1] * A[3][3] * B[3][2]) = 1 (all terms are 0, so the tie is broken in favour of state 1)

After the second observation (O[2] = 2), the Delta and Psi matrices become:

Delta = [[0.3, 0, 0],

[0.105, 0.045, 0],

[0, 0, 0]]

Psi = [[1, 2, 3],

[1, 1, 1],

[0, 0, 0]]

Explanation: In the second step of the Viterbi algorithm, we calculate the maximum probabilities for each state at time 2, given the second observation. This is done by taking the maximum, over all possible previous states, of the product of the previous state’s maximum probability, the transition probability from the previous state to the current state, and the emission probability of the second observation given the current state.

Let’s break down the calculations for the second observation (O[2] = 2):

  1. Delta[1][2] = max(Delta[1][1] * A[1][1] * B[1][2], Delta[2][1] * A[2][1] * B[1][2], Delta[3][1] * A[3][1] * B[1][2])

This calculates the maximum probability for state 1 at time 2. The maximum is taken over all possible previous states. Here, the maximum value is 0.3 * 0.7 * 0.5 = 0.105, which corresponds to the path from state 1 at time 1 to state 1 at time 2. So, Delta[1][2] = 0.105.

  2. Delta[2][2] = max(Delta[1][1] * A[1][2] * B[2][2], Delta[2][1] * A[2][2] * B[2][2], Delta[3][1] * A[3][2] * B[2][2])

This calculates the maximum probability for state 2 at time 2. The maximum value here is 0.3 * 0.3 * 0.5 = 0.045, which corresponds to the path from state 1 at time 1 to state 2 at time 2. So, Delta[2][2] = 0.045.

  3. Delta[3][2] = max(Delta[1][1] * A[1][3] * B[3][2], Delta[2][1] * A[2][3] * B[3][2], Delta[3][1] * A[3][3] * B[3][2])

This calculates the maximum probability for state 3 at time 2. Since B[3][2] = 0 (state 3 cannot emit symbol 2), every term is 0, so Delta[3][2] = 0.

Next, we determine the state that led to the maximum probability in the Delta matrix for each state at time 2:

  1. Psi[1][2] = 1: the maximum probability for state 1 at time 2 comes from the path from state 1 at time 1.

  2. Psi[2][2] = 1: the maximum probability for state 2 at time 2 also comes from the path from state 1 at time 1.

  3. Psi[3][2] = 1: all candidate terms are 0, so the tie is broken in favour of the lowest state index.

In summary, the maximum probabilities for each state at time 2 are stored in the Delta matrix, and the state that led to each maximum is stored in the Psi matrix.
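The same few lines of Python reproduce the second column (A and B are the example’s matrices, 0-indexed; prev is the first Delta column):

```python
A = [[0.7, 0.3, 0.0],
     [0.4, 0.6, 0.0],
     [0.0, 0.0, 1.0]]
B = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
prev = [0.3, 0.0, 0.0]   # delta[.][1] from the initialization step
o2 = 1                   # observation symbol 2 as a 0-based column index

delta_col2, psi_col2 = [], []
for i in range(3):
    scores = [prev[j] * A[j][i] * B[i][o2] for j in range(3)]
    delta_col2.append(max(scores))
    psi_col2.append(scores.index(max(scores)) + 1)  # ties -> lowest state

print([round(d, 6) for d in delta_col2], psi_col2)  # [0.105, 0.045, 0.0] [1, 1, 1]
```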

The Viterbi algorithm continues filling in the Delta and Psi matrices for the remaining observations:

For the third observation (O[3] = 3):

Delta[1][3] = max(Delta[1][2] * A[1][1] * B[1][3], Delta[2][2] * A[2][1] * B[1][3], Delta[3][2] * A[3][1] * B[1][3]) = max(0.105 * 0.7 * 0, 0.045 * 0.4 * 0, 0 * 0 * 0) = 0

Delta[2][3] = max(Delta[1][2] * A[1][2] * B[2][3], Delta[2][2] * A[2][2] * B[2][3], Delta[3][2] * A[3][2] * B[2][3]) = max(0.105 * 0.3 * 0.5, 0.045 * 0.6 * 0.5, 0 * 0 * 0.5) = 0.01575

Delta[3][3] = max(Delta[1][2] * A[1][3] * B[3][3], Delta[2][2] * A[2][3] * B[3][3], Delta[3][2] * A[3][3] * B[3][3]) = max(0.105 * 0 * 1, 0.045 * 0 * 1, 0 * 1 * 1) = 0

Psi[1][3] = 1 (all terms are 0; tie broken in favour of state 1)

Psi[2][3] = argmax(Delta[1][2] * A[1][2] * B[2][3], Delta[2][2] * A[2][2] * B[2][3], Delta[3][2] * A[3][2] * B[2][3]) = 1

Psi[3][3] = 1 (all terms are 0; tie broken in favour of state 1)

After the third observation (O[3] = 3), the Delta and Psi matrices become:

Delta = [[0.3, 0, 0],

[0.105, 0.045, 0],

[0, 0.01575, 0]]

Psi = [[1, 2, 3],

[1, 1, 1],

[1, 1, 1]]

Explanation: In the third step of the Viterbi algorithm, we calculate the maximum probabilities for each state at time 3, given the third observation, in exactly the same way as for time 2.

Let’s break down the calculations for the third observation (O[3] = 3):

  1. Delta[1][3] = max(Delta[1][2] * A[1][1] * B[1][3], Delta[2][2] * A[2][1] * B[1][3], Delta[3][2] * A[3][1] * B[1][3])

Since B[1][3] = 0 (state 1 cannot emit symbol 3), every term is 0, so Delta[1][3] = 0.

  2. Delta[2][3] = max(Delta[1][2] * A[1][2] * B[2][3], Delta[2][2] * A[2][2] * B[2][3], Delta[3][2] * A[3][2] * B[2][3])

The maximum value here is max(0.105 * 0.3 * 0.5, 0.045 * 0.6 * 0.5, 0) = max(0.01575, 0.0135, 0) = 0.01575, which corresponds to the path from state 1 at time 2 to state 2 at time 3. So, Delta[2][3] = 0.01575.

  3. Delta[3][3] = max(Delta[1][2] * A[1][3] * B[3][3], Delta[2][2] * A[2][3] * B[3][3], Delta[3][2] * A[3][3] * B[3][3])

The transitions into state 3 from states 1 and 2 have probability 0 (A[1][3] = A[2][3] = 0), and Delta[3][2] = 0, so every term is 0 and Delta[3][3] = 0.

Next, we determine the state that led to the maximum probability in the Delta matrix for each state at time 3:

  1. Psi[1][3] = 1: all candidate terms are 0, so the tie is broken in favour of state 1.

  2. Psi[2][3] = 1: the maximum probability for state 2 at time 3 comes from the path from state 1 at time 2.

  3. Psi[3][3] = 1: again all candidate terms are 0, so the tie goes to state 1.

In summary, the maximum probabilities for each state at time 3 are stored in the Delta matrix, and the state that led to the maximum probability for each state at time 3 is stored in the Psi matrix.
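And likewise for the third column; note how the zeros in A and B force two of the three entries to 0:

```python
A = [[0.7, 0.3, 0.0],
     [0.4, 0.6, 0.0],
     [0.0, 0.0, 1.0]]
B = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
prev = [0.105, 0.045, 0.0]   # delta[.][2] from the previous step
o3 = 2                       # observation symbol 3 as a 0-based column index

delta_col3, psi_col3 = [], []
for i in range(3):
    scores = [prev[j] * A[j][i] * B[i][o3] for j in range(3)]
    delta_col3.append(max(scores))
    psi_col3.append(scores.index(max(scores)) + 1)  # ties -> lowest state

print([round(d, 6) for d in delta_col3], psi_col3)  # [0.0, 0.01575, 0.0] [1, 1, 1]
```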

The final step in the Viterbi algorithm involves finding the most probable path through the states. This is achieved by identifying the state with the highest probability in the last column of the Delta matrix and then retracing the steps from the end of the sequence to the beginning using the Psi matrix.

Maximum probability = max(Delta[1][3], Delta[2][3], Delta[3][3]) = max(0, 0.01575, 0) = 0.01575

State with maximum probability = argmax(Delta[1][3], Delta[2][3], Delta[3][3]) = 2

Traceback path:

State 2 at time 3 was reached from state 1 at time 2, because Psi[2][3] = 1

State 1 at time 2 was reached from state 1 at time 1, because Psi[1][2] = 1

So, the most probable state sequence, read forward in time, is [1, 1, 2]

Explanation: After filling in the Delta and Psi matrices for all observations, the Viterbi algorithm determines the most probable path through the states. This is done by identifying the state with the highest probability in the last column of the Delta matrix and then tracing back through the Psi matrix to find the path that led to this state.

Let’s break down the calculations:

We first identify the state with the highest probability in the last column of the Delta matrix:

  1. Maximum probability = max(Delta[1][3], Delta[2][3], Delta[3][3]) = max(0, 0.01575, 0) = 0.01575

This calculates the maximum probability among all states at the last time step (time 3). The maximum value is 0.01575, which corresponds to state 2. This tells us that the most probable final state is state 2, with a path probability of 0.01575.

  2. State with maximum probability = argmax(Delta[1][3], Delta[2][3], Delta[3][3]) = 2

This determines the state that has the maximum probability at the last time step: state 2.

The traceback process in the Viterbi algorithm is used to determine the most likely sequence of states that produced the observed sequence of emissions. Starting from the state with the highest probability in the last column of the Delta matrix, we use the Psi matrix to trace back to the initial state:

  • We start with state 2 at time 3. According to the Psi matrix, state 2 at time 3 was reached from state 1 at time 2 (Psi[2][3] = 1).
  • We then move to state 1 at time 2. According to the Psi matrix, state 1 at time 2 was reached from state 1 at time 1 (Psi[1][2] = 1).

Reading the traced states forward in time gives the most probable state sequence: [1, 1, 2].

So, the Viterbi algorithm provides a way to find the most likely sequence of states given a sequence of observations, by keeping track of the maximum probabilities and the paths leading to them.
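Putting the initialization, recursion, and traceback together, here is a minimal complete implementation in Python. It is 0-indexed internally and reports 1-indexed states to match the walkthrough; argmax ties are broken toward the lowest state index.

```python
def viterbi(Pi, A, B, obs):
    """Return (delta, psi, best_path) for an HMM and an observation sequence."""
    n, T = len(Pi), len(obs)
    delta = [[0.0] * T for _ in range(n)]   # delta[i][t]
    psi = [[0] * T for _ in range(n)]       # psi[i][t]

    # Initialization: delta[i][0] = Pi[i] * B[i][o1]; psi[i][0] = i by convention.
    for i in range(n):
        delta[i][0] = Pi[i] * B[i][obs[0]]
        psi[i][0] = i
    # Recursion: maximize over the previous state j.
    for t in range(1, T):
        for i in range(n):
            scores = [delta[j][t - 1] * A[j][i] * B[i][obs[t]] for j in range(n)]
            delta[i][t] = max(scores)
            psi[i][t] = scores.index(delta[i][t])  # ties -> lowest index
    # Termination and traceback.
    state = max(range(n), key=lambda i: delta[i][T - 1])
    path = [state]
    for t in range(T - 1, 0, -1):
        state = psi[state][t]
        path.append(state)
    path.reverse()
    return delta, psi, [s + 1 for s in path]  # report 1-indexed states

Pi = [0.6, 0.2, 0.2]
A = [[0.7, 0.3, 0.0], [0.4, 0.6, 0.0], [0.0, 0.0, 1.0]]
B = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
delta, psi, path = viterbi(Pi, A, B, [0, 1, 2])
print(path)  # [1, 1, 2]
```

Real implementations usually work with log probabilities instead of raw products, so that long sequences do not underflow to zero.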

Ending note: As we wrap up Part 8 of our series, it’s time to reflect on the insights and knowledge we’ve gathered so far. I hope you’ve been finding this journey as enlightening as I have.

But don’t go just yet! The adventure continues in Part 9: The Viterbi Algorithm in Action: Practical Examples, where we’ll dive even deeper into our topic. If you’ve enjoyed our journey so far, I assure you, the best is yet to come. Your curiosity and eagerness to learn are what drive this series forward, and I can’t wait to see where our collective thirst for knowledge takes us next.

Your enthusiasm is contagious. Looking forward to seeing you in our next journey! Until next time, keep exploring, keep learning, and remember — the journey is just as important as the destination…
