Hidden Markov Models: The Secret Sauce in Natural Language Processing

om pramod
16 min read · Dec 30, 2023


Part 6: The Likelihood Problem in HMM: A Step-by-Step Guide

The likelihood problem in Hidden Markov Models (HMMs) refers to the task of determining the probability of observing a particular sequence of outputs given a sequence of hidden states. The algorithm commonly used to solve this problem is the Forward Algorithm.

Let’s consider a simple example. Suppose we have an imaginary friend named Om who does one of four activities each day: painting, cleaning, shopping, or biking. These are our observable events. We don’t know whether each day is sunny or rainy, but we believe the weather influences Om’s choice of activity. The weather conditions are our hidden states.

Here are the components of our HMM:

  1. Hidden States (S): Sunny and Rainy.
  2. Observables (O): Painting, Cleaning, Shopping, and Biking.
  3. Initial Probabilities (π): The likelihood of the state at time t = 0. Let’s say the likelihood that it is Sunny on the first day is 0.6, while the likelihood that it is Rainy is 0.4. So, π = |0.6, 0.4|.
  4. Transition Probabilities (A): These represent the probability of transitioning to another state given the current state. For example, if the current state is Sunny, the probability that the next day is Sunny is 0.8, whereas the probability that the next day is Rainy is 0.2. Similarly, if today is Rainy, the probability that tomorrow is Rainy is 0.6, while the probability that tomorrow is Sunny is 0.4.
  5. Emission Probabilities (B): These represent the probability of seeing a specific observable given a hidden state. For example, the probability of Cleaning on a Sunny day is 0.1, whereas the probability of Cleaning on a Rainy day is 0.45.
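To make these components concrete, here is how they might be written down in Python. This is a small sketch with our own variable names; only the Cleaning emission values (0.1 and 0.45) are stated above, so the remaining emission values are illustrative placeholders chosen to sum to 1.

```python
# Components of our weather/activity HMM as plain Python dictionaries.
states = ["Sunny", "Rainy"]                                   # hidden states S
observables = ["Painting", "Cleaning", "Shopping", "Biking"]  # observables O

pi = {"Sunny": 0.6, "Rainy": 0.4}                # initial probabilities
A = {"Sunny": {"Sunny": 0.8, "Rainy": 0.2},      # transition probabilities
     "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
# Emission probabilities; only the Cleaning entries are given in the text,
# the rest are illustrative values chosen so each row sums to 1.
B = {"Sunny": {"Painting": 0.4, "Cleaning": 0.1, "Shopping": 0.2, "Biking": 0.3},
     "Rainy": {"Painting": 0.3, "Cleaning": 0.45, "Shopping": 0.2, "Biking": 0.05}}

# Sanity check: every probability distribution must sum to 1.
assert abs(sum(pi.values()) - 1.0) < 1e-9
for s in states:
    assert abs(sum(A[s].values()) - 1.0) < 1e-9
    assert abs(sum(B[s].values()) - 1.0) < 1e-9
```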

The likelihood problem in an HMM is to compute the probability of an observation sequence given the model. For example, if we observe the sequence Painting, Cleaning, Shopping, Biking, we want to compute the probability of this sequence given our model of Om’s behavior and the weather. Given a sequence of observations, we combine the emission probabilities (how likely each activity is under each weather state) with the transition probabilities (how likely we are to move from one state to another), summing over all possible hidden-state sequences, to obtain the overall likelihood of the observations.

The evaluation problem in a Hidden Markov Model (HMM) is about determining the likelihood of a sequence of observations given a particular HMM. This problem is typically solved using the Forward Algorithm. Conceptually, a brute-force solution would enumerate every state sequence that could have produced the observation sequence, compute the probability of the observations under each, and add them all up; the Forward Algorithm computes the same sum efficiently using dynamic programming.

The “forward probabilities” are the key quantities in this calculation. The algorithm works by computing forward probabilities, each of which represents the joint probability of the observations seen so far and of being in a certain state at a certain time, given the model parameters.

Note — an HMM is represented by λ = (A, B, π). The parameters of the model are usually denoted by A, B, and π. Here’s what they mean:

  • A: This is the state transition matrix, which defines the probabilities of moving from one state to another. Each element A[i][j] represents the probability of transitioning from state i to state j.
  • B: This is the emission probability matrix, which defines the likelihood of observing a particular output given the current state. Each element B[i][j] represents the probability of output j given state i.
  • π: This is the initial state distribution, which defines the probabilities of being in each state at the beginning of the process.

The forward probabilities represent the joint probability of observing a particular sequence up to a certain point and being in a certain state at that point. Normalizing these values across states gives the probability of each state given the observations so far, a computation known as filtering. The forward probabilities are calculated using the forward algorithm, a dynamic programming algorithm with three main steps: initialization, recursion, and termination.

In the context of Hidden Markov Models (HMMs), joint probabilities refer to the probability of a combination of events. In HMMs, this often involves the probability of a particular sequence of observations and states. For example, consider an HMM that models the weather (Sunny or Rainy) and the corresponding activities (Play, Study, Work). Let’s say we have an observation sequence of activities (“Play, Study, Work”) and we want to calculate the joint probability of this sequence and the corresponding sequence of states (“Sunny, Rainy, Sunny”). The joint probability of the sequence of states and observations can be calculated using the following formula:

p(x, y) = p(y1) * p(x1|y1) * p(y2|y1) * p(x2|y2) * … * p(yT|yT-1) * p(xT|yT)

Here, x is the sequence of observations, y is the sequence of states, p(y1) is the initial state probability, p(x1|y1) is the emission probability of the first observation given the first state, p(y2|y1) is the transition probability from the first state to the second state, and p(x2|y2) is the emission probability of the second observation given the second state. This continues for all time steps T.

Let’s illustrate this with an example. Suppose we have an HMM with two states, ‘Rain’ and ‘Dry’, and two possible observations, ‘Umbrella’ and ‘No Umbrella’. The transition probabilities are P(‘Rain’|’Rain’)=0.3, P(‘Dry’|’Rain’)=0.7, P(‘Rain’|’Dry’)=0.2, P(‘Dry’|’Dry’)=0.8. The emission probabilities are P(‘Umbrella’|’Rain’)=0.9, P(‘No Umbrella’|’Rain’)=0.1, P(‘Umbrella’|’Dry’)=0.2, P(‘No Umbrella’|’Dry’)=0.8. The initial probabilities are P(‘Rain’)=0.6 and P(‘Dry’)=0.4. Now, suppose we observe the sequence of states {Rain, Dry} and the sequence of observations {Umbrella, No Umbrella}. We can calculate the joint probability of these sequences as follows:

p({Umbrella, No Umbrella}, {Rain, Dry}) = p(‘Rain’) * p(‘Umbrella’|’Rain’) * p(‘Dry’|’Rain’) * p(‘No Umbrella’|’Dry’)

Substituting the probabilities we defined earlier, we get:

p({Umbrella, No Umbrella}, {Rain, Dry}) = 0.6 * 0.9 * 0.7 * 0.8 = 0.3024

So, the joint probability of the observation sequence {Umbrella, No Umbrella} and the state sequence {Rain, Dry} is 0.3024.
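The chain-rule formula above translates directly into code. Here is a minimal Python sketch (the function and variable names are our own) that reproduces the 0.3024 result:

```python
def joint_probability(obs, states_seq, pi, A, B):
    """Joint probability p(x, y) of observations x and state sequence y."""
    p = pi[states_seq[0]] * B[states_seq[0]][obs[0]]  # p(y1) * p(x1|y1)
    for t in range(1, len(obs)):
        # multiply in p(yt|yt-1) * p(xt|yt) for each subsequent step
        p *= A[states_seq[t - 1]][states_seq[t]] * B[states_seq[t]][obs[t]]
    return p

pi = {"Rain": 0.6, "Dry": 0.4}
A = {"Rain": {"Rain": 0.3, "Dry": 0.7}, "Dry": {"Rain": 0.2, "Dry": 0.8}}
B = {"Rain": {"Umbrella": 0.9, "No Umbrella": 0.1},
     "Dry": {"Umbrella": 0.2, "No Umbrella": 0.8}}

p = joint_probability(["Umbrella", "No Umbrella"], ["Rain", "Dry"], pi, A, B)
# p = 0.6 * 0.9 * 0.7 * 0.8 ≈ 0.3024
```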

The joint probability of a sequence of states can be calculated using the chain rule of probability. For example, if we want to calculate the joint probability of the state sequence {Rain, Dry}, we can compute it as follows:

P({Rain, Dry})

= P(‘Rain’) * P(‘Dry’|’Rain’)

= 0.6 * 0.7

= 0.42

Thus, the joint probability of the state sequence {Rain, Dry} is 0.42.
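The same chain rule, restricted to the states alone, is a one-loop computation (again a sketch with our own names):

```python
def state_sequence_probability(states_seq, pi, A):
    """Probability of a state sequence alone, via the chain rule."""
    p = pi[states_seq[0]]                         # P(y1)
    for t in range(1, len(states_seq)):
        p *= A[states_seq[t - 1]][states_seq[t]]  # P(yt | yt-1)
    return p

pi = {"Rain": 0.6, "Dry": 0.4}
A = {"Rain": {"Rain": 0.3, "Dry": 0.7}, "Dry": {"Rain": 0.2, "Dry": 0.8}}

p = state_sequence_probability(["Rain", "Dry"], pi, A)
# p = 0.6 * 0.7 ≈ 0.42
```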

Now, let’s explore these stages individually to understand their roles within the algorithm:

Initialization: The forward algorithm in a Hidden Markov Model (HMM) begins with the initialization step, where the forward probabilities for each state at the first observation are determined. For each state i, the forward variable α1(i) at time t=1 is the product of the initial state probability πi and the emission probability βi(o1), where o1 is the first observation. This is represented mathematically as:

α1(i) = πi * βi(o1)

In this formula:

  • α1(i) is the forward probability of state i at time t=1
  • πi is the initial probability of state i
  • βi(o1) is the emission probability of the observation o1 given the state i.

In other words, the forward variable at time t=1 for each state i is equal to the initial state probability times the emission probability of the first observation given that state. Let’s illustrate this with an example. Suppose we have an HMM with two states, ‘Rain’ and ‘Dry’, and two possible observations, ‘Umbrella’ and ‘No Umbrella’. The initial state probabilities are P(‘Rain’)=0.6 and P(‘Dry’)=0.4. The emission probabilities are P(‘Umbrella’|’Rain’)=0.9 and P(‘Umbrella’|’Dry’)=0.2.

If the first observation is ‘Umbrella’, the forward variables at time t=1 would be calculated as:

α1(Rain) = πRain * βRain(Umbrella) = 0.6 * 0.9 = 0.54

α1(Dry) = πDry * βDry(Umbrella) = 0.4 * 0.2 = 0.08

So, the forward variables at time t=1 are 0.54 for ‘Rain’ and 0.08 for ‘Dry’.

We can also write this formula as — α[s, 1] = π[s] * β[s, o1] for all s in states

In this formula, α[s, 1] is the forward probability of state s at time t=1, π[s] is the initial probability of state s, and β[s, o1] is the emission probability of the first observation o1 given state s.

Suppose we have an HMM that models the weather with two states: ‘Sunny’ and ‘Rainy’. The possible observations are ‘Cloudy’ and ‘Clear’. The initial state probabilities are P(‘Sunny’)=0.6 and P(‘Rainy’)=0.4. The emission probabilities are P(‘Cloudy’|’Sunny’)=0.5 and P(‘Cloudy’|’Rainy’)=0.8.

If the first observation is ‘Cloudy’, the forward variables at time t=1 would be calculated as:

α[Sunny, 1] = π[Sunny] * β[Sunny, Cloudy] = 0.6 * 0.5 = 0.3

α[Rainy, 1] = π[Rainy] * β[Rainy, Cloudy] = 0.4 * 0.8 = 0.32

So, the forward variables at time t=1 are 0.3 for ‘Sunny’ and 0.32 for ‘Rainy’.
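Both initializations can be computed with the same one-line helper (a sketch; the function name is ours, and only the emission entries needed here are filled in):

```python
def forward_init(first_obs, states, pi, B):
    """Initialization step: alpha_1(s) = pi[s] * B[s][o1] for every state s."""
    return {s: pi[s] * B[s][first_obs] for s in states}

# Rain/Dry model, first observation 'Umbrella'
alpha_rain_dry = forward_init(
    "Umbrella", ["Rain", "Dry"],
    {"Rain": 0.6, "Dry": 0.4},
    {"Rain": {"Umbrella": 0.9}, "Dry": {"Umbrella": 0.2}})
# ≈ {'Rain': 0.54, 'Dry': 0.08}

# Sunny/Rainy model, first observation 'Cloudy'
alpha_weather = forward_init(
    "Cloudy", ["Sunny", "Rainy"],
    {"Sunny": 0.6, "Rainy": 0.4},
    {"Sunny": {"Cloudy": 0.5}, "Rainy": {"Cloudy": 0.8}})
# ≈ {'Sunny': 0.3, 'Rainy': 0.32}
```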

Let’s consider a simple example of an HMM that models the weather (Sunny or Rainy) and the corresponding activities (Play, Study, Work). Suppose we have the following initial state probabilities (π), transition probabilities (A), and emission probabilities (B):

π = {‘Sunny’: 0.6, ‘Rainy’: 0.4}

A = {‘Sunny’: {‘Sunny’: 0.7, ‘Rainy’: 0.3}, ‘Rainy’: {‘Sunny’: 0.4, ‘Rainy’: 0.6}}

B = {‘Sunny’: {‘Play’: 0.7, ‘Study’: 0.2, ‘Work’: 0.1}, ‘Rainy’: {‘Play’: 0.1, ‘Study’: 0.3, ‘Work’: 0.6}}

And the observation sequence is [‘Play’, ‘Study’].
Case 1: First Observation is ‘Play’

Let’s calculate the initial forward probability for each state. This is done by multiplying the initial state probability by the emission probability of the first observation.

α[s, 1] = π[s] * B[s][‘Play’] for all s in states

So, if the first observation is o1 = Play, the initial forward probabilities would be:

α = {‘Sunny’: π[‘Sunny’] * B[‘Sunny’][‘Play’], ‘Rainy’: π[‘Rainy’] * B[‘Rainy’][‘Play’]}

The initial forward probabilities are calculated as follows:

α[‘Sunny’, 1] = π[‘Sunny’] * B[‘Sunny’][‘Play’] = 0.6 * 0.7 = 0.42

α[‘Rainy’, 1] = π[‘Rainy’] * B[‘Rainy’][‘Play’] = 0.4 * 0.1 = 0.04

So, the initial forward probabilities are:

α = {‘Sunny’: 0.42, ‘Rainy’: 0.04}

Case 2: First Observation is ‘Study’

Now, let’s consider another case where the first observation is o1 = Study. We again multiply the initial state probability by the emission probability of the first observation:

α[s, 1] = π[s] * B[s][‘Study’] for all s in states

So, the initial forward probabilities would be:

α = {‘Sunny’: π[‘Sunny’] * B[‘Sunny’][‘Study’], ‘Rainy’: π[‘Rainy’] * B[‘Rainy’][‘Study’]}

The initial forward probabilities are calculated as follows:

α[‘Sunny’, 1] = π[‘Sunny’] * B[‘Sunny’][‘Study’] = 0.6 * 0.2 = 0.12

α[‘Rainy’, 1] = π[‘Rainy’] * B[‘Rainy’][‘Study’] = 0.4 * 0.3 = 0.12

So, the initial forward probabilities are:

α = {‘Sunny’: 0.12, ‘Rainy’: 0.12}
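Both cases can be verified with the same initialization function (a minimal sketch using the model parameters defined above):

```python
def forward_init(first_obs, states, pi, B):
    """Initialization step: alpha_1(s) = pi[s] * B[s][o1] for every state s."""
    return {s: pi[s] * B[s][first_obs] for s in states}

pi = {"Sunny": 0.6, "Rainy": 0.4}
B = {"Sunny": {"Play": 0.7, "Study": 0.2, "Work": 0.1},
     "Rainy": {"Play": 0.1, "Study": 0.3, "Work": 0.6}}

alpha_play = forward_init("Play", ["Sunny", "Rainy"], pi, B)
# ≈ {'Sunny': 0.42, 'Rainy': 0.04}
alpha_study = forward_init("Study", ["Sunny", "Rainy"], pi, B)
# ≈ {'Sunny': 0.12, 'Rainy': 0.12}
```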

Recursion: After the initialization step, the algorithm moves on to the recursive step for each subsequent observation. At each time step, the forward probability of each state is updated given the observations seen up to that time. Mathematically, the recursive step is represented as:

αt(i) = [ Σj αt-1(j) * Aji ] * Bi(ot)

Here, αt-1(j) is the forward probability of being in state j at time t-1,

Aji is the transition probability from state j to state i,

Bi(ot) is the emission probability of state i emitting the t-th observation ot.

This equation says that the forward probability at time t for state i is the sum, over all possible states j, of the forward probability at time t-1 for state j times the transition probability from state j to state i, all multiplied by the emission probability of the t-th observation given state i.

This step is repeated for each time step until the last observation. After this, the algorithm has calculated the forward probabilities for each state at each time step, which can be used to calculate the total probability of the observed sequence under the model, or to perform other operations such as Viterbi decoding or smoothing.

In the previous section, we walked through the initialization step of the Forward Algorithm, where we computed the initial forward probabilities for each state given the first observation. In this stage, we’ll calculate the forward probabilities for each state for all subsequent observations. Let’s pick up where we left off in our example and see how this plays out.

Case 1: First Observation is ‘Play’, Second Observation is ‘Study’

The forward probabilities for the second observation are calculated as follows:

α[‘Sunny’, 2]

= (α[‘Sunny’, 1] * A[‘Sunny’][‘Sunny’] + α[‘Rainy’, 1] * A[‘Rainy’][‘Sunny’]) * B[‘Sunny’][‘Study’]

= (0.42 * 0.7 + 0.04 * 0.4) * 0.2

= 0.062

α[‘Rainy’, 2]

= (α[‘Sunny’, 1] * A[‘Sunny’][‘Rainy’] + α[‘Rainy’, 1] * A[‘Rainy’][‘Rainy’]) * B[‘Rainy’][‘Study’]

= (0.42 * 0.3 + 0.04 * 0.6) * 0.3

= 0.045

So, the forward probabilities after the second observation are:

α = {‘Sunny’: 0.062, ‘Rainy’: 0.045}

Therefore, the total probability of the observation sequence [‘Play’, ‘Study’] is the sum of the final forward probabilities, which is 0.062 + 0.045 = 0.107. This means that, given the model parameters, the probability of observing the sequence [‘Play’, ‘Study’] is 0.107.
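The recursive update for this case can be checked with a few lines of Python (a sketch using the model parameters defined earlier; the function name is ours):

```python
def forward_step(alpha_prev, obs, states, A, B):
    """One recursion step: alpha_t(i) = (sum_j alpha_{t-1}(j) * A[j][i]) * B[i][o_t]."""
    return {i: sum(alpha_prev[j] * A[j][i] for j in states) * B[i][obs]
            for i in states}

A = {"Sunny": {"Sunny": 0.7, "Rainy": 0.3},
     "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
B = {"Sunny": {"Play": 0.7, "Study": 0.2, "Work": 0.1},
     "Rainy": {"Play": 0.1, "Study": 0.3, "Work": 0.6}}

alpha1 = {"Sunny": 0.42, "Rainy": 0.04}   # from the initialization step
alpha2 = forward_step(alpha1, "Study", ["Sunny", "Rainy"], A, B)
total = alpha2["Sunny"] + alpha2["Rainy"]
# alpha2 ≈ {'Sunny': 0.062, 'Rainy': 0.045}; total ≈ 0.107
```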

Case 2: First Observation is ‘Study’, Second Observation is ‘Play’

The forward probabilities for the second observation are calculated as follows:

α[‘Sunny’, 2]

= (α[‘Sunny’, 1] * A[‘Sunny’][‘Sunny’] + α[‘Rainy’, 1] * A[‘Rainy’][‘Sunny’]) * B[‘Sunny’][‘Play’]

= (0.12 * 0.7 + 0.12 * 0.4) * 0.7

= 0.0924

α[‘Rainy’, 2]

= (α[‘Sunny’, 1] * A[‘Sunny’][‘Rainy’] + α[‘Rainy’, 1] * A[‘Rainy’][‘Rainy’]) * B[‘Rainy’][‘Play’]

= (0.12 * 0.3 + 0.12 * 0.6) * 0.1

= 0.0108

So, the forward probabilities after the second observation are:

α = {‘Sunny’: 0.0924, ‘Rainy’: 0.0108}

Therefore, the total probability of the observation sequence [‘Study’, ‘Play’] is 0.0924 + 0.0108 = 0.1032.

Let’s extend the example to include more observations. Say the observation sequence is now [‘Play’, ‘Study’, ‘Work’, ‘Play’, ‘Study’].

We have already calculated the forward probabilities for the first two observations [‘Play’, ‘Study’]: α = {‘Sunny’: 0.062, ‘Rainy’: 0.045}. Now, let’s calculate the forward probabilities for the next three observations [‘Work’, ‘Play’, ‘Study’].

Third Observation: ‘Work’

α[‘Sunny’, 3]

= (α[‘Sunny’, 2] * A[‘Sunny’][‘Sunny’] + α[‘Rainy’, 2] * A[‘Rainy’][‘Sunny’]) * B[‘Sunny’][‘Work’]

= (0.062 * 0.7 + 0.045 * 0.4) * 0.1

= 0.00614

α[‘Rainy’, 3]

= (α[‘Sunny’, 2] * A[‘Sunny’][‘Rainy’] + α[‘Rainy’, 2] * A[‘Rainy’][‘Rainy’]) * B[‘Rainy’][‘Work’]

= (0.062 * 0.3 + 0.045 * 0.6) * 0.6

= 0.02736

Fourth Observation: ‘Play’

α[‘Sunny’, 4]

= (α[‘Sunny’, 3] * A[‘Sunny’][‘Sunny’] + α[‘Rainy’, 3] * A[‘Rainy’][‘Sunny’]) * B[‘Sunny’][‘Play’]

= (0.00614 * 0.7 + 0.02736 * 0.4) * 0.7

= 0.0106694

α[‘Rainy’, 4]

= (α[‘Sunny’, 3] * A[‘Sunny’][‘Rainy’] + α[‘Rainy’, 3] * A[‘Rainy’][‘Rainy’]) * B[‘Rainy’][‘Play’]

= (0.00614 * 0.3 + 0.02736 * 0.6) * 0.1

= 0.0018258

Fifth Observation: ‘Study’

α[‘Sunny’, 5]

= (α[‘Sunny’, 4] * A[‘Sunny’][‘Sunny’] + α[‘Rainy’, 4] * A[‘Rainy’][‘Sunny’]) * B[‘Sunny’][‘Study’]

= (0.0106694 * 0.7 + 0.0018258 * 0.4) * 0.2

= 0.00163978

α[‘Rainy’, 5]

= (α[‘Sunny’, 4] * A[‘Sunny’][‘Rainy’] + α[‘Rainy’, 4] * A[‘Rainy’][‘Rainy’]) * B[‘Rainy’][‘Study’]

= (0.0106694 * 0.3 + 0.0018258 * 0.6) * 0.3

= 0.00128889

Therefore, the total probability of the observation sequence [‘Play’, ‘Study’, ‘Work’, ‘Play’, ‘Study’] is the sum of the final forward probabilities, which is 0.00163978 + 0.00128889 = 0.00292867. This means that, given the model parameters, the probability of observing this sequence is about 0.0029.
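Chaining the three stages together gives the complete Forward Algorithm, which we can use to check the five-observation result end to end (a sketch; the function name is ours):

```python
def forward(obs_seq, states, pi, A, B):
    """Forward algorithm: total probability of obs_seq under the model."""
    alpha = {s: pi[s] * B[s][obs_seq[0]] for s in states}   # initialization
    for o in obs_seq[1:]:                                   # recursion
        alpha = {i: sum(alpha[j] * A[j][i] for j in states) * B[i][o]
                 for i in states}
    return sum(alpha.values())                              # termination

pi = {"Sunny": 0.6, "Rainy": 0.4}
A = {"Sunny": {"Sunny": 0.7, "Rainy": 0.3},
     "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
B = {"Sunny": {"Play": 0.7, "Study": 0.2, "Work": 0.1},
     "Rainy": {"Play": 0.1, "Study": 0.3, "Work": 0.6}}

p = forward(["Play", "Study", "Work", "Play", "Study"],
            ["Sunny", "Rainy"], pi, A, B)
# p ≈ 0.00292867
```

Note that the algorithm only ever keeps the current vector of forward probabilities, so its cost grows linearly with the length of the observation sequence, unlike the brute-force enumeration of state sequences.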

Let’s consider another example using the recursive step of the Forward Algorithm, this time for the sequence of observations ‘Umbrella’, ‘No Umbrella’, ‘Umbrella’.

Let’s say we have a Hidden Markov Model (HMM) with two states: ‘Rainy’ and ‘Sunny’.

The transition probabilities are as follows:

The emission probabilities are:

Initialization: At time t=1 (the first observation), the forward probability for each state i is calculated as the product of the initial state probability πi and the emission probability βi(o1):

α1(i) = πi * βi(o1)

Recursion: For each subsequent observation at time t and for each state i, we compute the forward probability as follows:

αt(i) = [ Σj αt-1(j) * Aji ] * Bi(ot)

Now, let’s calculate the forward probabilities for the second observation, ‘No Umbrella’, at time t=2. For each state i:

α2(i) = [ Σj α1(j) * Aji ] * Bi(o2)

Here, Aji is the transition probability from state j to state i, Bi(o2) is the probability of the second observation given state i, and the sum is over all states j.

So, the forward probabilities at the second time step are 0.0388 for the ‘Rainy’ state and 0.1428 for the ‘Sunny’ state.

We have already calculated the forward probabilities for the first two observations (‘Umbrella’, ‘No Umbrella’) using the Forward Algorithm. Now, let’s calculate the forward probabilities for the third observation, ‘Umbrella’. For each state i at time t=3:

α3(i) = [ Σj α2(j) * Aji ] * Bi(o3)

Here, Bi(o3) is the probability of the third observation given state i, and the sum is again over all states j.

So, the forward probabilities at the third time step are 0.04518 for the ‘Rainy’ state and 0.016632 for the ‘Sunny’ state.

This process is repeated for each observation in the sequence. At each step, the forward probabilities are updated based on the previous forward probabilities, the transition probabilities, and the emission probabilities. This allows us to efficiently compute the likelihood of the observation sequence given the HMM.

Termination: After the initialization and recursion steps, the final step in the forward algorithm is to calculate the total probability of the observed sequence under the model. This is done by summing the forward probabilities of all states at the final time step. In mathematical terms, if we denote the set of all states as S and the forward probability of state i at time T (the last time step) as αT(i), the total probability P of the observed sequence O = {o1, o2, …, oT} given the model λ is:

P(O|λ) = Σ over all states i in S of αT(i)

In the first example we discussed in the earlier section, this sum gives P([‘Play’, ‘Study’]) = 0.062 + 0.045 = 0.107. For the second example, P([‘Study’, ‘Play’]) = 0.0924 + 0.0108 = 0.1032.

Let’s discuss another example. Consider the HMM for Om’s activities from the beginning of this article:

State Space (S) = {‘Sunny’, ‘Rainy’}

Observation Space (O) = {‘Painting’, ‘Cleaning’, ‘Shopping’, ‘Biking’}

Initial probabilities (π) = |0.6 0.4| (Sunny, Rainy)

Transition probabilities (A): Sunny→Sunny = 0.8, Sunny→Rainy = 0.2, Rainy→Sunny = 0.4, Rainy→Rainy = 0.6

Emission probabilities (B): for Sunny, {Painting: 0.4, Cleaning: 0.1, Shopping: 0.2, Biking: 0.3}; for Rainy, {Painting: 0.3, Cleaning: 0.45, Shopping: 0.2, Biking: 0.05}

Suppose the observation sequence is [‘Painting’, ‘Cleaning’, ‘Shopping’, ‘Biking’].

Initialization:

α1(Sunny) = 0.6 * 0.4 = 0.24

α1(Rainy) = 0.4 * 0.3 = 0.12

Recursion:

The forward variable of the ‘Sunny’ state at time 2 is determined by adding the products of two calculations:

  1. The previous forward variable of the ‘Sunny’ state, 0.24, multiplied by the transition probability from ‘Sunny’ to ‘Sunny’, 0.8, and then multiplied by the emission probability of ‘Cleaning’ given ‘Sunny’, 0.1. The result is 0.0192.
  2. The previous forward variable of the ‘Rainy’ state, 0.12, multiplied by the transition probability from ‘Rainy’ to ‘Sunny’, 0.4, and then multiplied by the emission probability of ‘Cleaning’ given ‘Sunny’, 0.1. The result is 0.0048.

According to the equation, we add these results together to obtain our forward variable. Therefore, α2(Sunny) = 0.0192 + 0.0048 = 0.024.

Similarly, the forward variable for the ‘Rainy’ state at time 2 is α2(Rainy) = (0.24 * 0.2 + 0.12 * 0.6) * 0.45 = 0.054.

This process continues until all forward variables have been calculated:

α3(Sunny) = (0.024 * 0.8 + 0.054 * 0.4) * 0.2 = 0.00816

α3(Rainy) = (0.024 * 0.2 + 0.054 * 0.6) * 0.2 = 0.00744

α4(Sunny) = (0.00816 * 0.8 + 0.00744 * 0.4) * 0.3 = 0.0028512

α4(Rainy) = (0.00816 * 0.2 + 0.00744 * 0.6) * 0.05 = 0.0003048

Termination:

The final equation indicates that to find the probability of an observation sequence O originating from an HMM model λ, we need to sum up all the forward variables at time T, meaning all the variables of every state at the end of the observation sequence. In our example,

P(O|λ) = 0.0028512 + 0.0003048 = 0.003156
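As a final check, the whole example can be run through a compact implementation of the algorithm. The parameter values below are the ones used in the step-by-step numbers above; the function name is ours, so treat this as a sketch rather than a canonical implementation:

```python
def forward(obs_seq, states, pi, A, B):
    """Forward algorithm: P(O | model) via initialization, recursion, termination."""
    alpha = {s: pi[s] * B[s][obs_seq[0]] for s in states}   # initialization
    for o in obs_seq[1:]:                                   # recursion
        alpha = {i: sum(alpha[j] * A[j][i] for j in states) * B[i][o]
                 for i in states}
    return sum(alpha.values())                              # termination

pi = {"Sunny": 0.6, "Rainy": 0.4}
A = {"Sunny": {"Sunny": 0.8, "Rainy": 0.2},
     "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
B = {"Sunny": {"Painting": 0.4, "Cleaning": 0.1, "Shopping": 0.2, "Biking": 0.3},
     "Rainy": {"Painting": 0.3, "Cleaning": 0.45, "Shopping": 0.2, "Biking": 0.05}}

p = forward(["Painting", "Cleaning", "Shopping", "Biking"],
            ["Sunny", "Rainy"], pi, A, B)
# p ≈ 0.003156
```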

Ending note: As we reach the end of Part 6, I am filled with gratitude for your continued engagement and support, and I’d like to extend a heartfelt thank you to everyone who has joined me on this journey. Our journey doesn’t come to a halt here. Instead, we’re about to set sail on a new voyage in Part 7: Navigating the Backward Algorithm in HMM. I’m thrilled to unveil more revelations and broaden our understanding further.

Before we part ways, I would like to kindly request you to give this post a clap if you found it helpful. This simple gesture signifies your approval and support, and it lets me know that my work is valued. Additionally, if you found this content beneficial, I would be honored if you could share it with others. By sharing, you enable more people to benefit from the knowledge and insights we’ve gathered together.
