Hidden Markov Models: The Secret Sauce in Natural Language Processing

om pramod
10 min read · Dec 30, 2023


Part 10: The Learning Problem and the Baum-Welch Algorithm in HMM

The learning problem in a Hidden Markov Model (HMM) is the task of estimating the parameters of the model from a set of observations. The parameters of an HMM include the initial state probabilities, the transition probabilities, and the emission probabilities. This problem arises when you have a sequence of observed data and know the number of hidden states, but not the exact transition and emission probabilities; it is also known as model estimation or parameter estimation. Formally: given an observation sequence O consisting of o1, o2, …, oT, compute the parameters of an HMM model λ = (A, B, π) that maximize the probability P(O|λ) of observing O under the model λ.

Solving the learning problem allows the HMM to ‘learn’ from the data and improve its predictions. It is typically solved with the Baum-Welch algorithm, a special case of the Expectation-Maximization (EM) algorithm. EM is an iterative method for estimating the parameters of statistical models that depend on unobserved latent variables (in an HMM, the hidden states). It alternates between two steps: the Expectation (E) step and the Maximization (M) step.

Let’s illustrate this with an example. Suppose we have a simple weather prediction system with three states: Sunny, Cloudy, and Rainy. We denote the states as q1 (Sunny), q2 (Cloudy), and q3 (Rainy). We also have two possible observations: Walk and Shop. Initially, we do not know the transition probabilities (the probabilities of moving from one state to another) or the emission probabilities (the probabilities of observing Walk or Shop given each weather condition). So, we initialize these probabilities randomly. We then observe a sequence of activities over several days: [Walk, Shop, Walk, Walk, Shop]. Using this observed sequence, we aim to estimate the model parameters that would make this sequence of observations most likely.
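As a sketch of this setup (the state and observation names come from the example; everything else, including the random seed, is an illustrative assumption), the random initialization might look like:

```python
import numpy as np

# Illustrative sketch: randomly initialize HMM parameters for the example.
# States and observations follow the text; the seed is an arbitrary choice
# so that the run is reproducible.
rng = np.random.default_rng(seed=0)

states = ["Sunny", "Cloudy", "Rainy"]                    # q1, q2, q3
symbols = ["Walk", "Shop"]                               # possible observations
obs_sequence = ["Walk", "Shop", "Walk", "Walk", "Shop"]  # observed activities

def random_stochastic(rows, cols):
    """Return a random matrix whose rows each sum to 1 (valid probabilities)."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

pi = random_stochastic(1, len(states))[0]           # initial state probabilities
A = random_stochastic(len(states), len(states))     # transition probabilities
B = random_stochastic(len(states), len(symbols))    # emission probabilities
```

Baum-Welch would then refine pi, A, and B until the observed activity sequence becomes as likely as possible under the model.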

The likelihood is maximized by iterating the E-step and the M-step until the model parameters converge. Each iteration is guaranteed not to decrease the likelihood of the observed sequence, so the algorithm settles on a (locally) optimal set of parameters. Along the way it also produces the quantities needed to estimate the most likely sequence of hidden states, the probabilities of transitioning between states, and the probabilities of emitting specific observations.

Suppose we have an HMM with unknown transition probabilities, emission probabilities, and initial state probabilities, and we observe a sequence of outputs. The goal is to learn the model parameters that make this observed sequence most likely. Let’s walk through a simplified example of how the Baum-Welch algorithm works. Consider a weather forecasting system with two hidden states, ‘Rainy’ and ‘Sunny’; the observation is whether it rained the next day or not. Suppose we have the following observation sequence: [‘Rainy’, ‘Sunny’, ‘Rainy’].

  1. Initialization: Start with an initial guess for the model parameters. This can be done randomly or based on some heuristic. Let’s say we assume that both states are equally likely at the start (π = [0.5, 0.5]), and that the transition probabilities are equal for both states (A = [[0.5, 0.5], [0.5, 0.5]]) and the emission probabilities are also equal for both states (B = [[0.5, 0.5], [0.5, 0.5]]).
  2. Expectation step (E-step): The forward-backward algorithm is used to compute the statistics needed for this step. In the Baum-Welch algorithm, the E-step amounts to computing the expected counts of transitions and emissions.
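The uniform starting guess from step 1 can be written out directly as arrays (a minimal sketch; the ‘Rainy’ = 0, ‘Sunny’ = 1 encoding is an arbitrary assumption):

```python
import numpy as np

# Uniform starting guess from the initialization step.
pi = np.array([0.5, 0.5])                # initial state probabilities
A = np.array([[0.5, 0.5],
              [0.5, 0.5]])               # transition probabilities a_ij
B = np.array([[0.5, 0.5],
              [0.5, 0.5]])               # emission probabilities b_j(o)

# Observation sequence ['Rainy', 'Sunny', 'Rainy'] encoded as symbol indices,
# with 'Rainy' = 0 and 'Sunny' = 1.
obs = [0, 1, 0]
```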

Forward Procedure:

At each step of the forward procedure, we sum the products of the forward probabilities at the previous step and the transition probabilities, and then multiply this sum by the emission probability of the current observation.

These calculations give us the probabilities of being in a certain state at a certain time, given the observed data up to that point (Forward), and the probability of observing the remaining observations given the current state (Backward).

Let’s have a look at the general form of the same concept:

αₜ(j) = [ Σᵢ αₜ₋₁(i) · aᵢⱼ ] · bⱼ(oₜ),  with initialization α₁(j) = πⱼ · bⱼ(o₁)

Here is the breakdown of each term in the forward step formula of an HMM:

  • αₜ(j): the forward probability, i.e. the probability of observing o1, …, oₜ and being in state j at time t
  • aᵢⱼ: the probability of transitioning from state i to state j
  • bⱼ(oₜ): the probability of emitting observation oₜ from state j
  • πⱼ: the initial probability of state j
  • Σᵢ: a sum over all N hidden states
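A minimal sketch of the forward procedure, applied to the uniform two-state guess from our example (the variable names and the ‘Rainy’ = 0 encoding are assumptions):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward procedure: alpha[t][j] = P(o_1..o_t, state j at time t)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                    # alpha_1(j) = pi_j * b_j(o_1)
    for t in range(1, T):
        # alpha_t(j) = [ sum_i alpha_{t-1}(i) * a_ij ] * b_j(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

# Uniform starting guess; obs encodes ['Rainy', 'Sunny', 'Rainy'] as indices.
pi = np.array([0.5, 0.5])
A = np.array([[0.5, 0.5], [0.5, 0.5]])
B = np.array([[0.5, 0.5], [0.5, 0.5]])
obs = [0, 1, 0]

alpha = forward(pi, A, B, obs)
print(alpha)            # rows: (0.25, 0.25), (0.125, 0.125), (0.0625, 0.0625)
print(alpha[-1].sum())  # P(O | lambda) = 0.125
```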

Backward Procedure:

The Backward procedure calculates the probability of observing the remaining observations given the current state.

This completes the backward procedure for the given observation sequence [‘Rainy’, ‘Sunny’, ‘Rainy’]. The backward probabilities βt(i) represent the probability of the observed sequence from time t+1 to T given that we are in state i at time t.

Let’s have a look at the general form of the same concept:

βₜ(i) = Σⱼ aᵢⱼ · bⱼ(oₜ₊₁) · βₜ₊₁(j),  with termination βT(i) = 1 for every state i

Here is the breakdown of each term in the backward step formula of an HMM:

  • βₜ(i): the backward probability, i.e. the probability of observing oₜ₊₁, …, oT given that we are in state i at time t
  • aᵢⱼ: the probability of transitioning from state i to state j
  • bⱼ(oₜ₊₁): the probability of emitting the next observation oₜ₊₁ from state j
  • Σⱼ: a sum over all N hidden states
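The backward procedure can be sketched the same way for the uniform two-state guess (again, the names and the observation encoding are assumptions):

```python
import numpy as np

def backward(A, B, obs):
    """Backward procedure: beta[t][i] = P(o_{t+1}..o_T | state i at time t)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                   # beta_T(i) = 1 by convention
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

A = np.array([[0.5, 0.5], [0.5, 0.5]])
B = np.array([[0.5, 0.5], [0.5, 0.5]])
obs = [0, 1, 0]  # ['Rainy', 'Sunny', 'Rainy'] as indices

beta = backward(A, B, obs)
print(beta)  # rows: (0.25, 0.25), (0.5, 0.5), (1.0, 1.0)
```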

So far, we have calculated the forward probabilities (α) and the backward probabilities (β) for each state at each time step: not single matrices, but one vector of values per time step. For the observation sequence [‘Rainy’, ‘Sunny’, ‘Rainy’] and the uniform starting guess, they are:

α₁ = (0.25, 0.25), α₂ = (0.125, 0.125), α₃ = (0.0625, 0.0625)
β₁ = (0.25, 0.25), β₂ = (0.5, 0.5), β₃ = (1, 1)

These probabilities are used in the Baum-Welch algorithm to estimate the model parameters: the algorithm iteratively updates the parameters to maximize the likelihood of the observed data. We have not performed these updates yet; they require the remaining Expectation-Maximization steps, so let’s continue with the Baum-Welch algorithm.

We have now computed two important quantities: the forward probability (α) and the backward probability (β). These are used to compute the expected number of transitions between states and the expected number of visits to each state, represented by the variables ξₜ(i,j) and γₜ(i).

The Expectation step involves calculating the expected number of transitions between states and the expected number of emissions from each state. This is done using the forward and backward probabilities.

ξₜ(i,j) = αₜ(i) · aᵢⱼ · bⱼ(oₜ₊₁) · βₜ₊₁(j) / P(O|λ)

γₜ(i) = αₜ(i) · βₜ(i) / P(O|λ)

Let’s break down each term in these formulas:

  • ξₜ(i,j): the probability of being in state i at time t and in state j at time t+1, given the full observation sequence
  • γₜ(i): the probability of being in state i at time t, given the full observation sequence
  • P(O|λ): the total probability of the observation sequence under the current model, obtained by summing the final forward probabilities
  • αₜ(i), aᵢⱼ, bⱼ(oₜ₊₁), and βₜ₊₁(j): as defined in the forward and backward procedures
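In code, the two E-step quantities can be computed from the forward and backward probabilities of our example (a sketch; the α and β values used here are the ones derived for the uniform starting guess):

```python
import numpy as np

A = np.array([[0.5, 0.5], [0.5, 0.5]])
B = np.array([[0.5, 0.5], [0.5, 0.5]])
obs = [0, 1, 0]  # ['Rainy', 'Sunny', 'Rainy'] as indices

# Forward and backward probabilities for the uniform starting guess.
alpha = np.array([[0.25, 0.25], [0.125, 0.125], [0.0625, 0.0625]])
beta = np.array([[0.25, 0.25], [0.5, 0.5], [1.0, 1.0]])
likelihood = alpha[-1].sum()  # P(O | lambda) = 0.125

T, N = len(obs), A.shape[0]
xi = np.zeros((T - 1, N, N))
for t in range(T - 1):
    # xi_t(i,j) = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j) / P(O|lambda)
    xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / likelihood

# gamma_t(i) = alpha_t(i) * beta_t(i) / P(O|lambda)
gamma = alpha * beta / likelihood

print(xi)     # every entry is 0.25 under this fully symmetric guess
print(gamma)  # every entry is 0.5
```

Because the starting guess is perfectly symmetric, every transition is equally expected (ξ = 0.25 everywhere) and each state is equally likely at every time step (γ = 0.5 everywhere); the M-step would then turn these expected counts into updated parameters.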

Continuing with our example, let’s calculate the expected number of transitions ξₜ(i,j) from state i to state j.

In a Hidden Markov Model (HMM), the values ξ₁(1,1) and ξ₂(1,1) represent the expected number of transitions from state 1 to state 1 at time steps 1 and 2, respectively. Similarly, ξ₁(1,2) and ξ₂(1,2) signify the expected number of transitions from state 1 to state 2 at time steps 1 and 2.

Let’s calculate these values for time steps t = 1 and t = 2, using the parameters and observations above and the formula for ξₜ(i,j):

Now, to calculate ξ₁(1,2) and ξ₂(1,2), we apply the same formula, using the transition and emission probabilities at the respective time steps.

In the context of the given calculations for a weather prediction system with two states (‘Sunny’ and ‘Rainy’), these values indicate the model’s estimation of the likelihood of transitioning from one weather state to another at specific time steps given the observed sequence.

For instance:

  • ξ₁(1,1) and ξ₂(1,1) represent the expected number of transitions from state 1 (‘Sunny’) to itself at time steps 1 and 2, respectively. This reflects the model’s assessment of how likely the weather is to remain ‘Sunny’ over consecutive days.
  • ξ₁(1,2) and ξ₂(1,2) indicate the expected number of transitions from state 1 (‘Sunny’) to state 2 (‘Rainy’) at time steps 1 and 2, respectively. This estimates how often the weather transitions from ‘Sunny’ to ‘Rainy’ based on the observed sequence.

Here are the interpretations of all four calculated values of ξ in the context of the weather prediction system. The observed sequence [‘Rainy’, ‘Sunny’, ‘Rainy’] provides one observation per time step: ‘Rainy’ at time step 1, ‘Sunny’ at time step 2, and ‘Rainy’ again at time step 3.

To summarize, the ξ values below indicate the model’s estimated transition probabilities between the states (‘Rainy’ and ‘Sunny’) at each time step, based on the observed sequence [‘Rainy’, ‘Sunny’, ‘Rainy’], within the Hidden Markov Model:

For our specific observed sequence [‘Rainy’, ‘Sunny’, ‘Rainy’], if you are only analyzing the observed transitions up to time step 2, calculating ξ₃ values is not necessary. However, if you intend to extend the model’s predictions beyond the observed sequence, computing ξ₃ would be relevant for estimating transitions to the next time step. Next, let’s calculate the expected number of times each state i is visited, γₜ(i).

Hence, the calculated values of the expected number of times each state is visited, γₜ(i), for the observed sequence [‘Rainy’, ‘Sunny’, ‘Rainy’] are γ₁ = (0.5, 0.5) and γ₂ = (0.5, 0.5).

The values of γₜ(i) calculated at time steps 1 and 2 reflect the expected number of times each state i is visited within those time steps, based on the observed sequence [‘Rainy’, ‘Sunny’, ‘Rainy’]. Let’s break down what these values mean:
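As a sanity check on these E-step values (using the numbers derived for the uniform guess), summing ξₜ(i,j) over the destination state j should recover γₜ(i), and γ should sum to 1 over the states at every time step:

```python
import numpy as np

# xi_t(i,j) for t = 1, 2 and gamma_t(i) for t = 1, 2, 3, as derived for the
# uniform starting guess on ['Rainy', 'Sunny', 'Rainy'].
xi = np.full((2, 2, 2), 0.25)
gamma = np.full((3, 2), 0.5)

# For t = 1 .. T-1, gamma_t(i) should equal sum_j xi_t(i,j).
print(np.allclose(xi.sum(axis=2), gamma[:-1]))  # True

# gamma_t(i) sums to 1 over the states at every time step.
print(gamma.sum(axis=1))  # [1. 1. 1.]
```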

Ending note- And that brings us to the end of this journey. I hope you found the insights shared illuminating and the guidance practical. It’s been a pleasure to delve into these topics with you, and I’m grateful for the time you spent with me on this blog. If you’ve found this content valuable, do consider sharing it with your peers and colleagues. Your support and encouragement are what make this knowledge-sharing journey worthwhile.

But wait, our journey isn’t over yet! As we bid farewell to this chapter, I encourage you to stay tuned for the continuation of our journey in Part 11: Continuing our Discussion on the Learning Problem and the Baum-Welch Algorithm. It promises to be equally exciting and insightful. Remember, the more we learn, the more we grow. So, buckle up and get ready for another thrilling ride. See you on the other side!
