Hidden Markov Models: The Secret Sauce in Natural Language Processing

Om Pramod · 11 min read · Dec 30, 2023


Part 4: Calculating Sequence Probabilities in Hidden Markov Models and Understanding the Hidden Nature of HMMs

An HMM consists of two types of variables: hidden states and observations. The relationship between the hidden states and the observations is modeled using two sets of probabilities: the transition probabilities and the emission probabilities.


The chain rule states that the joint probability of observing multiple events can be expressed as the product of the conditional probabilities of each event given the previous ones. For two events A and B, the chain rule of probability is written as:

P(A, B) = P(A | B) ⋅ P(B)

Let’s calculate the joint probability of a sequence of states Si1, Si2, …, Sik:

P(Si1, Si2, …, Sik)

This can be broken down into the product of the conditional probability of Sik given the previous states and the joint probability of the previous states:

P(Sik | Si1, Si2, …, Sik-1) ⋅ P(Si1, Si2, …, Sik-1)

According to the Markov property, the probability of Sik depends only on the previous state Sik-1, not on the entire history. Therefore, the conditional probability can be simplified to:

P(Sik | Sik-1)

The same principle can be applied recursively to the remaining states until we get the product of the conditional probabilities for each state given its predecessor:

P(Sik | Sik-1)P(Sik-1 | Sik-2)…P(Si2 | Si1)P(Si1)

This is the Markov chain property, which states that the probability of transitioning to any particular state depends solely on the current state and not on the sequence of states that preceded it.

This formula represents the probability of a sequence of states Si1, Si2, …, Sik in a Markov chain, where:

  • P(Si1): the initial probability of starting at state Si1.
  • Sik: the k-th (current) state in the sequence — the state we are interested in predicting.
  • Sik-1: the state that immediately precedes Sik.
  • Sik-2: the state that precedes Sik-1.
  • Si1: the first state in the chain, from which the sequence begins.

Let’s consider a Hidden Markov Model (HMM) for predicting the weather conditions in a city. The transition probabilities and initial probabilities are given as follows:

State-space: {Rainy, Sunny}

Initial probability: {Rainy: 0.6, Sunny: 0.4}

Transition table:

              Rainy    Sunny
  Rainy        0.7      0.3
  Sunny        0.4      0.6

(Each row gives the probability of tomorrow’s weather given today’s, so each row sums to 1; P(Sunny | Rainy) = 0.3 and P(Sunny | Sunny) = 0.6 are the values used below.)

Now, what is the probability that the weather will be Sunny on the next two days, given that it starts out Rainy?

We denote the sequence of weather conditions as Q = {Rainy, Sunny, Sunny}.

The probability of a sequence of states is calculated by multiplying the probabilities of each transition in the sequence. This is known as the chain rule of probability and it’s based on the Markov property, which states that the probability of each subsequent state depends only on the previous state.

We can calculate the probability as follows:

P( Rainy, Sunny, Sunny)

= P(Rainy) * P(Sunny | Rainy) * P(Sunny | Sunny)

= 0.6 * 0.3 * 0.6

= 0.108

Let’s break down this formula:

  • P(Rainy) = 0.6 (initial probability of Rainy)
  • P(Sunny | Rainy) = 0.3 (transition probability from Rainy to Sunny)
  • P(Sunny | Sunny) = 0.6 (transition probability from Sunny to Sunny)

Therefore, the probability that the weather will be “Sunny, Sunny” when the weather started as “Rainy” is 0.108, or 10.8%.

To solidify our understanding, let’s consider another example:

We can extend this to longer sequences. For example, if we have a sequence

Q = {Rainy, Sunny, Sunny, Rainy, Sunny}, we would compute:

P(Rainy) * P(Sunny | Rainy) * P(Sunny | Sunny) * P(Rainy | Sunny) * P(Sunny | Rainy)

This is simply a product of the initial probability and the transition probabilities!
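This running product is easy to wrap in a small helper. The sketch below assumes the full transition table from the example (P(Sunny | Rainy) = 0.3 and P(Sunny | Sunny) = 0.6 as given; since each row must sum to 1, P(Rainy | Rainy) = 0.7 and P(Rainy | Sunny) = 0.4):

```python
def sequence_probability(sequence, initial, transition):
    # P(S1) * P(S2 | S1) * ... * P(Sk | Sk-1)
    prob = initial[sequence[0]]
    for prev, curr in zip(sequence, sequence[1:]):
        prob *= transition[prev][curr]
    return prob

initial = {'Rainy': 0.6, 'Sunny': 0.4}
transition = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}

print(sequence_probability(['Rainy', 'Sunny', 'Sunny'], initial, transition))  # ≈ 0.108
print(sequence_probability(['Rainy', 'Sunny', 'Sunny', 'Rainy', 'Sunny'],
                           initial, transition))  # ≈ 0.01296
```

The same helper works for a sequence of any length, because each step only ever looks one state back.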

To emphasize, an HMM consists of five parts: a set of hidden states, a set of possible observations, the transition probabilities, the emission probabilities, and the initial state distribution.

Here are the two simplifying assumptions in a first-order Hidden Markov Model (HMM): the Markov assumption, which says that each state depends only on the immediately preceding state, and the output-independence assumption, which says that each observation depends only on the state that produced it.

Let’s consider a simple example of a weather prediction system.

Suppose we have two states in our HMM: sunny and rainy. These states represent the weather conditions. We also have two possible observations: walk and drive.

We know that if the weather is sunny, people are more likely to walk. If it’s rainy, people are more likely to drive. So, our emission probabilities could be defined as follows:

  • b_sunny(walk) is the probability of observing a walk given that the weather is sunny. Let’s say this is 0.8.
  • b_sunny(drive) is the probability of observing a drive given that the weather is sunny. Let’s say this is 0.2.
  • b_rainy(walk) is the probability of observing a walk given that the weather is rainy. Let’s say this is 0.1.
  • b_rainy(drive) is the probability of observing a drive given that the weather is rainy. Let’s say this is 0.9.

Now, suppose we have a sequence of observations: [walk, walk, drive]. Given this sequence, we want to know the most likely sequence of hidden states (weather conditions).

By applying the Viterbi algorithm, we can calculate the most likely sequence of hidden states. In this case, the most likely sequence is [sunny, sunny, rainy] because it maximizes the product of the transition probabilities and the emission probabilities for each state.

This is a simplified example, but it illustrates how the emission probabilities bi(ot) are used in practice in an HMM.
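The Viterbi computation described above can be sketched in a few lines. The emission probabilities are the ones listed earlier, but the transition and initial probabilities are not specified in the text, so the values below are illustrative assumptions:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               the predecessor state on that path)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        step = {}
        for s in states:
            p, prev = max(
                (best[t - 1][r][0] * trans_p[r][s] * emit_p[s][obs[t]], r)
                for r in states
            )
            step[s] = (p, prev)
        best.append(step)
    # Backtrack from the most probable final state.
    state = max(best[-1], key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return path[::-1]

start_p = {'sunny': 0.5, 'rainy': 0.5}          # assumed
trans_p = {'sunny': {'sunny': 0.7, 'rainy': 0.3},  # assumed
           'rainy': {'sunny': 0.4, 'rainy': 0.6}}
emit_p = {'sunny': {'walk': 0.8, 'drive': 0.2},    # from the text
          'rainy': {'walk': 0.1, 'drive': 0.9}}

print(viterbi(['walk', 'walk', 'drive'], ['sunny', 'rainy'],
              start_p, trans_p, emit_p))  # ['sunny', 'sunny', 'rainy']
```

With these assumed numbers the decoded path is indeed [sunny, sunny, rainy], matching the sequence claimed above.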

In a Hidden Markov Model (HMM), the final probability distribution over states refers to the probabilities of being in each state at the end of the sequence. This is also known as the final state distribution, and it is used in the calculation of the probability of a particular sequence of states in the HMM. It is a row vector with one element per state; the probabilities are non-negative and sum to 1.

Let’s take an example to illustrate this. Suppose we have a weather forecasting system with three states: ‘Sunny’, ‘Cloudy’, and ‘Rainy’, and the observation sequence [‘Sunny’, ‘Cloudy’, ‘Rainy’]. Suppose the resulting final state distribution is [‘Sunny’: 0.2, ‘Cloudy’: 0.3, ‘Rainy’: 0.5]. This tells us that, given the observed sequence of weather conditions, there’s a 20% chance of the weather being sunny, a 30% chance of it being cloudy, and a 50% chance of it being rainy at the end of the sequence.

The Forward-Backward algorithm is a powerful tool in HMMs that allows us to calculate the posterior probability of each state given the observed sequence of data. Although we haven’t discussed it yet, it’s worth mentioning that it operates in two stages: the Forward stage, which computes the forward probabilities, and the Backward stage, which computes the backward probabilities. The posterior state probabilities, including the final state distribution, are obtained by combining the forward and backward probabilities. We will delve deeper into the Forward-Backward algorithm in subsequent sections of this blog post.

An HMM consists of two types of variables: hidden states and observations such that

A set of N hidden states, S = {s1, s2, …, sN}.

A set of M observations, O = {o1, o2, …, oM}.

In the context of PoS tagging, the states represent the PoS tags, and the observations are the words in the sentence. The HMM for PoS tagging works by treating the PoS tags as the hidden states and the individual words as the observations. For example, for the sentence “The cat is on the mat”, the hidden states could be ‘DET’ (for ‘The’ and ‘the’), ‘NOUN’ (for ‘cat’ and ‘mat’), ‘VERB’ (for ‘is’), and ‘ADP’ (for ‘on’). The observations would be the individual words ‘The’, ‘cat’, ‘is’, ‘on’, ‘the’, and ‘mat’. The goal is to predict the grammatical role of each word based on its context within the sentence. HMM is a probabilistic sequence model: for a given sequence of words, it computes a probability distribution over possible sequences of PoS labels and chooses the best label sequence. This makes the HMM a reliable probabilistic approach to finding PoS tags for a sequence of words.
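To make the tag-as-state view concrete, the sketch below scores one (word sequence, tag sequence) pair as initial × transition × emission products, exactly as in the weather examples. All of the probabilities here are hypothetical, chosen only for illustration:

```python
# Hypothetical, hand-picked probabilities for illustration only.
initial = {'DET': 0.6, 'NOUN': 0.3, 'VERB': 0.1}
transition = {'DET': {'NOUN': 0.8, 'VERB': 0.1, 'DET': 0.1},
              'NOUN': {'VERB': 0.5, 'NOUN': 0.3, 'DET': 0.2},
              'VERB': {'DET': 0.5, 'NOUN': 0.3, 'VERB': 0.2}}
emission = {'DET': {'the': 0.7}, 'NOUN': {'cat': 0.1}, 'VERB': {'sleeps': 0.05}}

def joint_probability(words, tags):
    # P(tags, words) = P(t1) e(w1|t1) * prod over k of P(tk|tk-1) e(wk|tk)
    p = initial[tags[0]] * emission[tags[0]][words[0]]
    for k in range(1, len(words)):
        p *= transition[tags[k - 1]][tags[k]] * emission[tags[k]][words[k]]
    return p

print(joint_probability(['the', 'cat', 'sleeps'], ['DET', 'NOUN', 'VERB']))  # ≈ 0.00084
```

A tagger would compute this score for every candidate tag sequence (efficiently, via Viterbi) and pick the highest-scoring one.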

Why Hidden Markov Models are Called “Hidden”?

In this section, we’ll unravel why Hidden Markov Models are termed “hidden” and work through an illustrative example that sheds light on the name.

A Hidden Markov Model (HMM) is a statistical model used to describe the probabilistic relationship between a sequence of observations and a sequence of hidden states. The term “hidden” is used because the states themselves are not directly observable, but only the outputs (or observations) produced by these states are visible. This is a key characteristic that distinguishes HMMs from Markov chains.

To understand why an HMM is called “hidden”, let’s consider an example. Suppose Jack lives in a hypothetical town with varying weather conditions and moods. Jack’s mood depends only on the current day’s weather, not on the previous day’s mood. In this scenario, the hidden states represent the weather conditions (e.g., “rainy”, “sunny”, “cloudy”) and the observations represent Jack’s mood (e.g., “happy”, “sad”).


Note that, in the figure above, the green arrows indicate the transition probabilities and the red arrows the emission probabilities.

The transition probabilities between the hidden states (weather conditions) can be estimated from historical data. For example, if it’s rainy today, it’s likely to be rainy tomorrow, and if it’s sunny today, it’s likely to be sunny tomorrow. The emission probabilities represent the likelihood of observing a particular mood given the current weather condition; for instance, if it’s rainy, Jack is likely to be sad.

In this setup, the weather conditions (‘rainy’, ‘sunny’, ‘cloudy’) are the hidden states, and Jack’s moods (‘happy’, ‘sad’) are the observations. The hidden states are not directly observable; they are inferred from the observations, which is why the model is called a “hidden” Markov model.

To generate a sequence of Jack’s moods from this model, we start with an initial weather state, use the transition probabilities to sample each next day’s weather, and use the emission probabilities to produce a mood for each day. Repeating this process might yield a sequence such as:

happy -> happy -> sad -> happy -> sad -> happy -> sad-> sad -> …
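This generation process can be sketched in a few lines. The article gives no numbers for Jack’s town, so all the probabilities below are illustrative assumptions:

```python
import random

# Illustrative probabilities (assumed; the text gives no numbers).
weather_init = {'rainy': 0.3, 'sunny': 0.5, 'cloudy': 0.2}
weather_trans = {'rainy': {'rainy': 0.6, 'sunny': 0.2, 'cloudy': 0.2},
                 'sunny': {'rainy': 0.1, 'sunny': 0.7, 'cloudy': 0.2},
                 'cloudy': {'rainy': 0.4, 'sunny': 0.3, 'cloudy': 0.3}}
mood_emit = {'rainy': {'happy': 0.2, 'sad': 0.8},
             'sunny': {'happy': 0.9, 'sad': 0.1},
             'cloudy': {'happy': 0.5, 'sad': 0.5}}

def pick(dist):
    # Sample one key from a {outcome: probability} dictionary.
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate_moods(days):
    # Sample the weather day by day, then emit a mood for each day.
    weather = pick(weather_init)
    moods = []
    for _ in range(days):
        moods.append(pick(mood_emit[weather]))
        weather = pick(weather_trans[weather])
    return moods

print(generate_moods(7))  # a list of 7 moods, e.g. ['happy', 'sad', ...]
```

Only the mood list is "visible" to an outside observer; the sampled weather sequence stays hidden, which is precisely the point of the model's name.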

Let’s consider another simple Markov chain to understand this better. In a Markov chain, the states are directly observable. For example, consider a simple weather forecasting model where the states are ‘Rain’ and ‘Dry’. The transition probabilities between these states are given, and we can directly observe the weather states from the data.

# Example of a Markov Chain
States = ['Rain', 'Dry']
Transition_probabilities = {
    'Rain': {'Rain': 0.7, 'Dry': 0.3},
    'Dry': {'Rain': 0.2, 'Dry': 0.8}
}
Initial_probabilities = {'Rain': 0.4, 'Dry': 0.6}
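Because the states of a plain Markov chain are directly observable, the probability of any observed sequence follows immediately from the tables above. The snippet restates the same numbers so that it runs standalone:

```python
states = ['Rain', 'Dry']
transition_probabilities = {'Rain': {'Rain': 0.7, 'Dry': 0.3},
                            'Dry': {'Rain': 0.2, 'Dry': 0.8}}
initial_probabilities = {'Rain': 0.4, 'Dry': 0.6}

# Probability of directly observing Dry -> Dry -> Rain:
seq = ['Dry', 'Dry', 'Rain']
p = initial_probabilities[seq[0]]
for prev, curr in zip(seq, seq[1:]):
    p *= transition_probabilities[prev][curr]
print(p)  # 0.6 * 0.8 * 0.2 ≈ 0.096
```

No inference is needed here; that only becomes necessary once the states are hidden, as in the HMM below.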

In contrast, an HMM has additional states that are not directly observable. These states are often referred to as “hidden” states. For example, consider a weather forecasting model where the hidden states are ‘Low’ and ‘High’ atmospheric pressure, and the observations are ‘Rain’ and ‘Dry’. The transition probabilities are defined between the hidden states, and the emission probabilities connect the hidden states to the observations. In this example, the transition probabilities might represent the likelihood of moving from ‘Low Pressure’ to ‘High Pressure’ or vice versa, and the emission probabilities the likelihood of observing ‘Rain’ or ‘Dry’ given a ‘Low Pressure’ or ‘High Pressure’ state.

# Example of a Hidden Markov Model
States = ['Low', 'High']
Observations = ['Rain', 'Dry']
Transition_probabilities = {
    'Low': {'Low': 0.7, 'High': 0.3},
    'High': {'Low': 0.2, 'High': 0.8}
}

The transition probabilities represent the chances of moving from one atmospheric pressure state to another. For example:

P(Low -> Low) = 0.7, which means there is a 70% chance that the atmospheric pressure remains low given that it was low the previous day.

P(Low -> High) = 0.3, which means there is a 30% chance that the atmospheric pressure becomes high given that it was low the previous day.

Observation_probabilities = {
    'Low': {'Rain': 0.6, 'Dry': 0.4},
    'High': {'Rain': 0.4, 'Dry': 0.6}
}

The observation probabilities represent the likelihood of observing ‘Rain’ or ‘Dry’ given a ‘Low Pressure’ or ‘High Pressure’ state. Note that the probabilities for each state must sum to 1. For example:

P(Rain | Low) = 0.6, which means there is a 60% chance of observing rain given that the atmospheric pressure is low.

P(Dry | Low) = 0.4, which means there is a 40% chance of observing dry weather given that the atmospheric pressure is low.

P(Rain | High) = 0.4, which means there is a 40% chance of observing rain given that the atmospheric pressure is high.

P(Dry | High) = 0.6, which means there is a 60% chance of observing dry weather given that the atmospheric pressure is high.

Initial_probabilities = {‘Low’: 0.4, ‘High’: 0.6}

In the HMM, the ‘Low’ and ‘High’ states are not directly observable, but they are inferred from the ‘Rain’ and ‘Dry’ observations. This is why the HMM is called “hidden”. The hidden states are used to explain the observed data, but they are not directly observable themselves.
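To close the loop with the Forward algorithm mentioned earlier, the sketch below computes the total probability of an observation sequence under this pressure/weather HMM by summing over all hidden-state paths. Note that we take P(Dry | High) = 0.6 so that the emission probabilities for each hidden state sum to 1:

```python
def forward(obs, states, start_p, trans_p, emit_p):
    # alpha[s] = probability of the observations so far AND being in state s now.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    # Total probability of the observation sequence, summed over final states.
    return sum(alpha.values())

start_p = {'Low': 0.4, 'High': 0.6}
trans_p = {'Low': {'Low': 0.7, 'High': 0.3},
           'High': {'Low': 0.2, 'High': 0.8}}
emit_p = {'Low': {'Rain': 0.6, 'Dry': 0.4},
          'High': {'Rain': 0.4, 'Dry': 0.6}}

print(forward(['Rain', 'Dry'], ['Low', 'High'], start_p, trans_p, emit_p))  # ≈ 0.2448
```

This is exactly the quantity a plain Markov chain cannot give us: the likelihood of what we see, marginalized over the pressure states we never observe.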

Ending note- As we turn the final page of Part 4, I want to express my heartfelt gratitude for your active participation. But, the story doesn’t end here. We’re merely at the end of a chapter, not the book. Our shared journey of discovery continues as we look forward to Part 5: Diving into Hidden Markov Models: Steps and Fundamental Problems. I’m thrilled to share more insights and continue our exploration of the subject matter.

Before we part, if you’ve found this part insightful, I kindly ask you to clap for this post. If you’ve found the content enlightening, it would mean a lot if you could share it with others. Once again, thank you for your time and interest. Until then, continue to explore, learn, and stay inquisitive. See you soon!
