Hidden Markov Models: The Secret Sauce in Natural Language Processing

om pramod
9 min read · Dec 30, 2023


Part 7: Navigating the Backward Algorithm in HMM

The Backward Algorithm in Hidden Markov Models (HMMs) calculates the backward probabilities: the probability of observing all future observations, given that the model is in a certain state at a certain time. It is the mirror image of the Forward Algorithm. The key difference between the two is the direction in which they traverse the sequence of observations: the Forward Algorithm goes from the start to the end of the sequence, while the Backward Algorithm starts with the last observation and works backwards to the first.

Here’s a step-by-step explanation of the Backward Algorithm. Like the Forward Algorithm, it consists of an Initialization, a Recursion, and a Termination step.

Initialization: We start by setting the backward probability of each state at the final time step to 1. The reason is that the backward probability is defined as the probability of the future observations given the current state. At the final time step there are no future observations, because we’ve reached the end of the sequence; the probability of observing the “empty” sequence of future observations, given any current state, is therefore 1. This convention simplifies the calculations and provides a base case for the recursion. In other words, it’s like asking, “Given that I’m in a particular state at the last time step, what’s the probability of seeing no more observations?” Since there are indeed no more observations to see, the probability is 1.

Mathematically, this can be represented as: β[i, T] = 1 for all states i, where T is the final time step.

Let’s consider a simple Hidden Markov Model (HMM) with two states: ‘Rainy’ and ‘Sunny’.

Let’s say we have an observation sequence [‘Play’, ‘Study’, ‘Work’]. We want to find the backward probabilities for this sequence.

We start by setting the backward probability of each state at the final time step to 1. So, β[‘Rainy’, 3] = β[‘Sunny’, 3] = 1.
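In code, initialization is simply a matter of setting the β value of every state at the final time step to 1. Here is a minimal Python sketch (the variable names are my own, not part of any standard API):

```python
states = ["Rainy", "Sunny"]
observations = ["Play", "Study", "Work"]
T = len(observations)  # T = 3

# beta[state][t] holds the backward probability of `state` at time step t
# (1-indexed, matching the text). Initialization: beta at the final step is 1.
beta = {s: {T: 1.0} for s in states}
```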

Recursion: For each preceding time step t (from T−1 down to 1), we calculate the backward probability of each state i. This is done by summing, over each possible next state j, the product of the backward probability of state j at time t+1, the transition probability from the current state i to that next state j, and the emission probability of the next observation given state j. Using the article’s notation, where A holds the transition probabilities and B the emission probabilities:

β[i, t] = sum(A[i][j] * B[j][O(t+1)] * β[j, t+1]) for all next states j

For time step 2, where the next observation is ‘Work’, we calculate the backward probabilities for each state:

β[i, 2] = sum(A[i][j] * B[j][‘Work’] * β[j, 3]) for all next states j

For example, to calculate β[‘Rainy’, 2], we consider both possible next states (‘Rainy’ and ‘Sunny’), and sum over:

  • β[‘Rainy’, 3] * A[‘Rainy’][‘Rainy’] * B[‘Rainy’][‘Work’] = 1 * 0.7 * 0.6 = 0.42
  • β[‘Sunny’, 3] * A[‘Rainy’][‘Sunny’] * B[‘Sunny’][‘Work’] = 1 * 0.3 * 0.1 = 0.03

So, β[‘Rainy’, 2] = 0.42 + 0.03 = 0.45.

We can calculate β[‘Sunny’, 2] in a similar way. We consider both possible next states (‘Rainy’ and ‘Sunny’), and sum over:

  • β[‘Rainy’, 3] * A[‘Sunny’][‘Rainy’] * B[‘Rainy’][‘Work’] = 1 * 0.4 * 0.6 = 0.24
  • β[‘Sunny’, 3] * A[‘Sunny’][‘Sunny’] * B[‘Sunny’][‘Work’] = 1 * 0.6 * 0.1 = 0.06

So, β[‘Sunny’, 2] = 0.24 + 0.06 = 0.30.
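These two sums are easy to verify in Python. Below is a small sketch using the transition probabilities (A) and emission probabilities (B) quoted in this example:

```python
# Transition and emission probabilities used in the running example.
A = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
     "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
B = {"Rainy": {"Play": 0.1, "Study": 0.3, "Work": 0.6},
     "Sunny": {"Play": 0.7, "Study": 0.2, "Work": 0.1}}

beta_3 = {"Rainy": 1.0, "Sunny": 1.0}  # initialization at the final step

# beta[i, 2] = sum over next states j of A[i][j] * B[j]['Work'] * beta[j, 3]
beta_2 = {i: sum(A[i][j] * B[j]["Work"] * beta_3[j] for j in A) for i in A}
# beta_2["Rainy"] ≈ 0.45, beta_2["Sunny"] ≈ 0.30, as derived above
```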

This means that, given we are in the ‘Rainy’ state on the second day, the probability of observing the remaining sequence from day 3 onwards is 0.45; given we are in the ‘Sunny’ state on the second day, that probability is 0.30.

We then proceed to calculate the backward probabilities for time step 1, where the next observation is ‘Study’:

β[i, 1] = sum(A[i][j] * B[j][‘Study’] * β[j, 2]) for all next states j

To calculate β[‘Rainy’, 1], we consider both possible next states (‘Rainy’ and ‘Sunny’), and sum over:

  • β[‘Rainy’, 2] * A[‘Rainy’][‘Rainy’] * B[‘Rainy’][‘Study’] = 0.45 * 0.7 * 0.3 = 0.0945
  • β[‘Sunny’, 2] * A[‘Rainy’][‘Sunny’] * B[‘Sunny’][‘Study’] = 0.30 * 0.3 * 0.2 = 0.018

So, β[‘Rainy’, 1] = 0.0945 + 0.018 = 0.1125.

Similarly, to calculate β[‘Sunny’, 1], we consider both possible next states (‘Rainy’ and ‘Sunny’), and sum over:

  • β[‘Rainy’, 2] * A[‘Sunny’][‘Rainy’] * B[‘Rainy’][‘Study’] = 0.45 * 0.4 * 0.3 = 0.054
  • β[‘Sunny’, 2] * A[‘Sunny’][‘Sunny’] * B[‘Sunny’][‘Study’] = 0.30 * 0.6 * 0.2 = 0.036

So, β[‘Sunny’, 1] = 0.054 + 0.036 = 0.09.

This means that the probability of observing the remaining sequence [‘Study’, ‘Work’] given that we are in state ‘Rainy’ at time 1 is 0.1125, and the probability of observing the remaining sequence [‘Study’, ‘Work’] given that we are in state ‘Sunny’ at time 1 is 0.09.

This step is repeated for each time step, working backwards from the last observation. Once complete, the algorithm has calculated the backward probabilities for each state at each time step, which can be used to calculate the total probability of the observed sequence under the model, or to perform other operations such as smoothing.
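The full recursion can be sketched as a single backward loop over the observation sequence (same A and B as in the example above; this is an illustrative sketch, not an optimized implementation):

```python
states = ["Rainy", "Sunny"]
obs = ["Play", "Study", "Work"]
A = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},   # transition probabilities
     "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
B = {"Rainy": {"Play": 0.1, "Study": 0.3, "Work": 0.6},  # emission probabilities
     "Sunny": {"Play": 0.7, "Study": 0.2, "Work": 0.1}}

T = len(obs)
# beta[s][t] is the backward probability of state s at time t (1-indexed).
beta = {s: [0.0] * (T + 1) for s in states}
for s in states:
    beta[s][T] = 1.0  # initialization: no future observations remain

for t in range(T - 1, 0, -1):  # t = 2, then t = 1
    for i in states:
        # Sum over every possible next state j; obs[t] is the observation
        # at time t+1 (the list is 0-indexed).
        beta[i][t] = sum(A[i][j] * B[j][obs[t]] * beta[j][t + 1]
                         for j in states)

# beta["Rainy"][1] ≈ 0.1125 and beta["Sunny"][1] ≈ 0.09, as derived above.
```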

Termination: Let’s return to the observation sequence [‘Play’, ‘Study’, ‘Work’], and assume the initial state probabilities are:

  • Probability of ‘Rainy’ = 0.6
  • Probability of ‘Sunny’ = 0.4

The total probability of the observed sequence is calculated by summing over the products of the initial state probabilities, the emission probabilities of the first observation given each state, and the backward probabilities of each state at the first time step.


For state ‘Rainy’:

  • Initial state probability = P(‘Rainy’) = 0.6
  • Emission probability of ‘Play’ given ‘Rainy’ = B(‘Play’|’Rainy’) = 0.1
  • Backward probability of ‘Rainy’ at time 1 = β1(‘Rainy’) = 0.1125 (calculated in the previous steps)

So, the total contribution from state ‘Rainy’ is P(‘Rainy’) * B(‘Play’|’Rainy’) * β1(‘Rainy’) = 0.6 * 0.1 * 0.1125 = 0.00675.

Similarly, for state ‘Sunny’:

  • Initial state probability = P(‘Sunny’) = 0.4
  • Emission probability of ‘Play’ given ‘Sunny’ = B(‘Play’|’Sunny’) = 0.7
  • Backward probability of ‘Sunny’ at time 1 = β1(‘Sunny’) = 0.09 (calculated in the previous steps)

So, the total contribution from state ‘Sunny’ is P(‘Sunny’) * B(‘Play’|’Sunny’) * β1(‘Sunny’) = 0.4 * 0.7 * 0.09 = 0.0252.

Finally, we sum these contributions to get the total probability of the observed sequence:

Total probability = P(‘Rainy’) * B(‘Play’|’Rainy’) * β1(‘Rainy’) + P(‘Sunny’) * B(‘Play’|’Sunny’) * β1(‘Sunny’) = 0.00675 + 0.0252 = 0.03195

So, the probability of observing the sequence [‘Play’, ‘Study’, ‘Work’] given this HMM is approximately 0.03195.
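The termination step can be verified with the same approach, plugging in the initial state probabilities and the β values computed in the recursion:

```python
pi = {"Rainy": 0.6, "Sunny": 0.4}           # initial state probabilities
b_play = {"Rainy": 0.1, "Sunny": 0.7}       # emission of 'Play' from each state
beta_1 = {"Rainy": 0.1125, "Sunny": 0.09}   # backward probabilities at t = 1

total = sum(pi[s] * b_play[s] * beta_1[s] for s in pi)
# total ≈ 0.00675 + 0.0252 = 0.03195
```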

Continuing with Om’s example from the previous part, let’s proceed further.

Initialization:

This straightforward equation informs us that at the end of the observation sequence (time T), the backward variables of every state are equal to 1.

Reference

Recursion:

The equation above instructs us to sum up the results of all multiplications involving the transition probability from the current state i to the next state j, the emission probability of the observable O at time t+1 from the next state j, and the backward variable of the next state j at time t+1.

While it may not seem immediately obvious, let’s take a closer look at the diagram provided.

Reference

We’re moving backwards in time. At time T, where the last observable in the sequence was ‘Bike’, the backward variables are initially set to 1.

By working backwards, we calculate the backward variable for the previous ‘Sunny’ state. This involves adding two multiplications:

  1. Transition probability from ‘Sunny’ to ‘Sunny’, 0.8, multiplied by the emission probability of ‘Bike’ derived from the ‘Sunny’ state at time t+1, 0.3, and then multiplied by the backward variable of the ‘Sunny’ state at time t+1, 1. The result is 0.24.
  2. Transition probability from ‘Sunny’ to ‘Rainy’, 0.2, multiplied by the emission probability of ‘Bike’ derived from the ‘Rainy’ state at time t+1, 0.05, and then multiplied by the backward variable of the ‘Rainy’ state at time t+1, 1. The result is 0.01.

According to the equation, we add these results to get our backward variable. Therefore, β = 0.24 + 0.01 = 0.25.

Similarly, for the ‘Rainy’ state, the process would be repeated.

Reference

Just as before, we’re going to sum up:

  1. The transition probability from ‘Rainy’ to ‘Sunny’, 0.4, multiplied by the emission probability of ‘Bike’ from the ‘Sunny’ state at time t+1, 0.3, and then multiplied by the backward variable of the ‘Sunny’ state at time t+1, 1. The result is 0.12.
  2. The transition probability from ‘Rainy’ to ‘Rainy’, 0.6, multiplied by the emission probability of ‘Bike’ from the ‘Rainy’ state at time t+1, 0.05, and then multiplied by the backward variable of the ‘Rainy’ state at time t+1, 1. The result is 0.03.

Therefore, β = 0.12 + 0.03 = 0.15.
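Both of these recursion steps can be checked with a few lines of Python, using the transition and emission probabilities quoted above:

```python
# Probabilities quoted in the example above.
A = {"Sunny": {"Sunny": 0.8, "Rainy": 0.2},
     "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
b_bike = {"Sunny": 0.3, "Rainy": 0.05}  # emission of 'Bike' from each state
beta_T = {"Sunny": 1.0, "Rainy": 1.0}   # backward variables at time T

# One recursion step: sum over both possible next states j.
beta_prev = {i: sum(A[i][j] * b_bike[j] * beta_T[j] for j in A) for i in A}
# beta_prev["Sunny"] ≈ 0.25, beta_prev["Rainy"] ≈ 0.15, as derived above
```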

The process continues in the same way for the earlier time steps until all backward variables have been calculated.

Reference

Termination:

This is the final step in the backward algorithm of HMM. The equation tells us to sum, over all states i, the product of the initial probability π of state i, the emission probability b of the observable O at time t=1 from state i, and the backward variable of state i at time t=1. The result is the probability of the observation sequence (O) given the model parameters (λ).

Let’s illustrate this using the diagram provided below:

Reference

As illustrated, I’ve highlighted the probabilities we need to compute our final probability.

According to the equation, we sum up the following multiplications:

  1. The initial probability π of the ‘Sunny’ state, 0.6, multiplied by the emission probability from the ‘Sunny’ state at time t=1 to the observable ‘Paint’, 0.4, and then multiplied by the backward variable of the ‘Sunny’ state at time t=1, 0.0071. The result is 0.001704.
  2. The initial probability π of the ‘Rainy’ state, 0.4, multiplied by the emission probability from the ‘Rainy’ state at time t=1 to the observable ‘Paint’, 0.3, and then multiplied by the backward variable of the ‘Rainy’ state at time t=1, 0.0121. The result is 0.001452.

So, P(O|λ) = 0.001704 + 0.001452 = 0.003156.
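This final sum is easy to verify in code, using the values from the diagram (the β values at t = 1 are taken as given from the recursion):

```python
pi = {"Sunny": 0.6, "Rainy": 0.4}            # initial probabilities
b_paint = {"Sunny": 0.4, "Rainy": 0.3}       # emission of 'Paint' at t = 1
beta_1 = {"Sunny": 0.0071, "Rainy": 0.0121}  # backward variables at t = 1

p_obs = sum(pi[s] * b_paint[s] * beta_1[s] for s in pi)
# p_obs ≈ 0.001704 + 0.001452 = 0.003156
```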

As expected, the result matches that of the Forward Algorithm.

Ending note: As we reach the conclusion of Part 7, I hope you’ve found the journey thus far both enlightening and enjoyable. Your engagement has been integral to making this experience meaningful. If you’ve found this part useful, please consider giving it a clap. It’s a small gesture, but it goes a long way in boosting my confidence.

However, our journey doesn’t end here. As we move forward, we’ll continue to explore and learn in Part 8: Cracking the Decoding Problem in HMMs: A Viterbi Algorithm Primer. I’m eager to share more insights and delve deeper into our topic.

Remember, the real beauty of learning lies not only in gaining new knowledge but also in the process of discovery itself. So, keep exploring, keep learning, and stay curious!!!
