Hidden Markov Models: The Secret Sauce in Natural Language Processing

om pramod
15 min read · Dec 30, 2023


Part 1: Understanding the Basics: Terminologies and Architecture of Hidden Markov Models

In POS tagging, the idea is that the likely part-of-speech tag of the next word in a sentence tends to depend on the part-of-speech tag of the previous word. These are known as part-of-speech dependencies. Because of these dependencies, we can apply a probability model to estimate the POS tag of the next word from that of the previous word. One such probability model is the Markov chain.

One of the widely used methods for Part-of-Speech (PoS) tagging in Natural Language Processing (NLP) is the Hidden Markov Model (HMM). HMMs operate on the principles of probability and are employed to capture the sequential nature of language: they model the underlying probabilistic relationships between observed words and their corresponding PoS tags. A Hidden Markov Model is a statistical model that is also widely used elsewhere in machine learning.

A Markov chain is a stochastic model that represents a sequence of events, where the probability of transitioning from one event to another depends only on the current state of the system. In other words, the future state of the system is determined solely by its present state, not by any previous states. This property is known as the Markov property (also called the Markovian assumption, or memorylessness), and it is the key characteristic of Markov models: the probability of a state occurring depends only on the current state, not on the sequence of events that preceded it.

One common real-world example of a system where the Markov property holds is weather prediction. If we model the weather as a two-state Markov chain (say, 'sunny' and 'rainy'), the probability of it being sunny or rainy tomorrow depends only on whether it is sunny or rainy today; it does not matter what the weather was like yesterday or the day before. More generally, suppose you model weather conditions as a Markov process with three types of weather: sunny, rainy, and foggy. The problem at hand is to predict the next day's weather using the previous day's weather. Let qn be the variable denoting the weather on the nth day. We want the probability of qn given the weather conditions of the previous n-1 days. According to the first-order Markov assumption, the weather on the nth day depends only on the weather of the (n-1)th day, i.e. tomorrow's weather depends only on today's weather. This can be mathematically written as:

P(qn | qn-1, qn-2, …, q1) = P(qn | qn-1)

The probability of an observation Oi depends only on the state qi that produced it, and not on any other states or any other observations. Formally, this property is expressed as:

P(Oi | q1, …, qT, O1, …, Oi-1) = P(Oi | qi)

Let’s take an example of a simple coin toss game. In this game, we have two states: Heads and Tails. The observation is either getting a Head or a Tail.

Let’s denote the states as q1 (Heads) and q2 (Tails).

Suppose we toss a coin once and observe the result. Let’s say the result is a Head (Oi). According to the Markov property, the probability of getting a Head in the next toss, given that we got a Head in the previous toss (q1), should be independent of the outcome of any previous tosses.

This is because the current state (getting a Head in the previous toss) is the only thing that matters to determine the next state (getting a Head in the next toss). The past states and other observations do not influence the next state.

So, the probability of getting a Head in the next toss, given that we got a Head in the previous toss (q1), is P(Oi | q1).

This example illustrates how the Markov property works in practice. The probability of an observation depends solely on the current state, not on any other states or observations.

To emphasize, there are two fundamental assumptions for a first-order HMM:

· The probability of a particular state depends only on the previous state: This is known as the Markov Property.

· The probability of an observation depends on the state that produced the observation, not on any other states or observations: This is the Output Independence Assumption.

At the core of a Markov model is the concept of states and transitions. In a Markov model, a state represents a possible condition or status of the system, and the collection of all possible states is called the state space. For instance, in a weather prediction model, states could represent different weather conditions such as 'Rain', 'Sunny', 'Cloudy', etc. In the context of a weather prediction model, you would represent the state space as follows:

State Space = {'Rain', 'Sunny', 'Cloudy'}

These states are mutually exclusive and collectively exhaustive, meaning that at any given time, the system can be in one and only one of these states. Transitions, on the other hand, are the changes from one state to another. For example, the transition from the ‘Rain’ state to the ‘Sunny’ state would represent a change in weather. These transitions are governed by probabilities, which are determined based on historical data or expert knowledge. These probabilities are often referred to as transition probabilities.

A Markov chain can be depicted as a directed graph. Each state in the system is represented by a node (a Markov chain consists of a set of n states, from q1 all the way to qn), and the transition probabilities between states are represented by directed edges.


The lines with arrows indicate the direction of each transition, hence the name "directed graph".

If you have a good command of the English language, you can quickly recognize the verb in a sentence and intuitively expect the subsequent word to be a noun rather than another verb. In essence, this example suggests that the assignment of a Part-of-Speech (POS) tag to the next word is contingent upon the POS tag of the preceding word. The image below is a good example of how a Markov model works on a very small scale.


To show what a Markov chain looks like, here is an example modelling a coin toss as a Markov chain. The image below represents the toss of a coin. Two states are possible: heads and tails. The transition from heads to heads or heads to tails is equally probable (0.5) and is independent of all preceding coin tosses.

Markov chain for a coin toss.
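To make this concrete, here is a minimal Python sketch (with illustrative names; not from the original article) that simulates the coin-toss chain above, picking each next state from the current state alone with the 0.5/0.5 transition probabilities:

```python
import random

# Two states; from either state, heads and tails are equally likely (0.5),
# independent of everything that happened before (the Markov property).
states = ["Heads", "Tails"]
transition_probs = {
    "Heads": {"Heads": 0.5, "Tails": 0.5},
    "Tails": {"Heads": 0.5, "Tails": 0.5},
}

def simulate(start, n_steps):
    """Walk the chain, choosing each next state using only the current one."""
    sequence = [start]
    for _ in range(n_steps):
        current = sequence[-1]
        next_state = random.choices(
            states, weights=[transition_probs[current][s] for s in states]
        )[0]
        sequence.append(next_state)
    return sequence

print(simulate("Heads", 10))
```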

The transition probabilities define the conditional probability of transitioning from one state to another given the current state. For example, consider a weather prediction model where the states are ‘Rain’, ‘Sunny’, and ‘Cloudy’. The transition probabilities could be defined as follows:

  • P('Rain'|'Rain') = 0.3 (probability of remaining 'Rain' after 'Rain')
  • P('Sunny'|'Rain') = 0.5 (probability of transitioning to 'Sunny' from 'Rain')
  • P('Cloudy'|'Rain') = 0.2 (probability of transitioning to 'Cloudy' from 'Rain')
  • P('Rain'|'Sunny') = 0.2 (probability of transitioning to 'Rain' from 'Sunny')
  • P('Sunny'|'Sunny') = 0.6 (probability of remaining 'Sunny' after 'Sunny')
  • P('Cloudy'|'Sunny') = 0.2 (probability of transitioning to 'Cloudy' from 'Sunny')
  • P('Rain'|'Cloudy') = 0.3 (probability of transitioning to 'Rain' from 'Cloudy')
  • P('Sunny'|'Cloudy') = 0.1 (probability of transitioning to 'Sunny' from 'Cloudy')
  • P('Cloudy'|'Cloudy') = 0.6 (probability of remaining 'Cloudy' after 'Cloudy')

These probabilities could be estimated from historical weather data. For example, if the historical data shows that it is 70% likely to remain 'Rain' after 'Rain', then the transition probability P('Rain'|'Rain') would be 0.7.
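As a rough sketch of how such an estimate might be made, the snippet below counts day-to-day transitions in a short, made-up history of observations and divides by the number of transitions starting in each state, so every row of the resulting table sums to 1 (the data and variable names are purely illustrative):

```python
from collections import Counter, defaultdict

# Hypothetical history of daily weather observations.
history = ["Rain", "Rain", "Sunny", "Sunny", "Sunny", "Cloudy",
           "Rain", "Sunny", "Cloudy", "Cloudy", "Rain"]

pair_counts = Counter(zip(history, history[1:]))  # (today, tomorrow) counts
start_counts = Counter(history[:-1])              # transitions leaving each state

# Convert counts into conditional probabilities P(next | current).
transition_probs = defaultdict(dict)
for (current, nxt), count in pair_counts.items():
    transition_probs[current][nxt] = count / start_counts[current]

for state, row in transition_probs.items():
    print(state, row, "row sum =", round(sum(row.values()), 3))
```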


Importantly, the sum of the transition probabilities out of a given state is 1, because the probabilities of all possible outcomes must sum to 1, as per the fundamental axioms of probability.

In the context of Hidden Markov Models (HMMs), transition probabilities and emission probabilities are two key components that help in the process of PoS tagging.

Transition probabilities represent the likelihood of one PoS tag following another. They are calculated based on the frequency of transitions between different PoS tags in the training corpus. For instance, if we have a sentence “The cat is on the mat”, the transition probability from ‘NOUN’ (cat) to ‘VERB’ (is) would be calculated by dividing the number of times we see a transition from ‘NOUN’ to ‘VERB’ by the total number of transitions starting from ‘NOUN’.
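A sketch of that counting process on a tiny hand-tagged toy sentence (the tag set and counts are illustrative, not from a real training corpus) might look like this:

```python
from collections import Counter

# Toy tagged corpus: (word, tag) pairs for "The cat is on the mat".
tagged = [("The", "DET"), ("cat", "NOUN"), ("is", "VERB"),
          ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")]

tags = [tag for _, tag in tagged]
bigram_counts = Counter(zip(tags, tags[1:]))  # tag-to-tag transition counts
prev_counts = Counter(tags[:-1])              # transitions starting at each tag

# P(VERB | NOUN) = count(NOUN -> VERB) / count(transitions starting at NOUN)
p_verb_given_noun = bigram_counts[("NOUN", "VERB")] / prev_counts["NOUN"]
print(p_verb_given_noun)  # 1/1 = 1.0 in this tiny example
```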

If the system is currently in state i, it transitions to state j with a probability denoted by Pij. The set of all such probabilities forms a transition matrix, which fully describes the dynamics of the system. A transition matrix is a square matrix used to describe the transitions of a Markov chain; each of its entries is a nonnegative real number representing a probability. It is also called a probability matrix, substitution matrix, or Markov matrix. For example, consider a simple Markov chain with two states, 'Rain' and 'Sunny'. Its transition matrix is a 2×2 table whose rows correspond to the current state and whose columns to the next state: the 'Rain' row contains P(Rain|Rain) and P(Sunny|Rain), and the 'Sunny' row contains P(Rain|Sunny) and P(Sunny|Sunny).

Here, P(Rain|Rain) represents the probability of remaining ‘Rain’ given that it is currently ‘Rain’, and P(Sunny|Rain) represents the probability of transitioning to ‘Sunny’ given that it is currently ‘Rain’. The same applies to the other probabilities in the matrix.

The transition probabilities are usually represented as a square matrix whose rows and columns correspond to the hidden states. Each cell in the matrix, denoted aij, represents the probability of transitioning from state i to state j. For a given state, the sum of the transition probabilities to all states should equal 1, so that each row forms a valid probability distribution; in other words, all of the transition probabilities in each row of a transition matrix add up to one. A transition matrix is an (n × n) matrix, with n being the number of states in the graph, and each row holds the transition probabilities from one state to all the states.

Considering the diagram above, a more compact way to store the transition and state probabilities is a table, better known as a "transition matrix". A transition matrix is an equivalent, but more compact, tabular representation of a Markov chain model.

Let’s consider a simple weather prediction system as an example. In this case:

The hidden states could represent the weather conditions: Sunny (S), Cloudy (C), and Rainy (R).

The transition probabilities for this system could be represented as follows:

In this matrix, the cell aij represents the probability of transitioning from state i to state j. For example, the entry in the 'Sunny' row and 'Sunny' column is 0.7, which means there is a 70% chance of remaining in the 'Sunny' state. Similarly, the entry in the 'Sunny' row and 'Cloudy' column is 0.2, which means there is a 20% chance of transitioning from the 'Sunny' state to the 'Cloudy' state.
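As a sketch, this matrix could be stored as a NumPy array. Only the 'Sunny' row values (0.7 and 0.2, with the remaining 0.1 going to 'Rainy') come from the example above; the other two rows are assumed placeholder values:

```python
import numpy as np

states = ["Sunny", "Cloudy", "Rainy"]

# A[i, j] = probability of moving from states[i] to states[j].
A = np.array([
    [0.7, 0.2, 0.1],   # from Sunny (values from the example above)
    [0.3, 0.4, 0.3],   # from Cloudy (assumed)
    [0.2, 0.3, 0.5],   # from Rainy  (assumed)
])

# Every row is a probability distribution over the next state.
assert np.allclose(A.sum(axis=1), 1.0)
print(dict(zip(states, A[0])))  # transition probabilities out of 'Sunny'
```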

Now, considering the above transition matrix, we can notice that this model only tells us the probability of moving from one state to the next when we know the previous state (the previous word, in the POS-tagging case). It does not tell us what to do when there is no previous word at all. To handle this case, we add what is known as the "initial state".

In the context of Markov chains, initial state probabilities refer to the probabilities of the system starting in a particular state. They are often represented as a vector, where each entry corresponds to the initial probability of being in a particular state; this vector is also known as the initial distribution of the Markov chain. For example, if we have a Markov chain with three states labeled 1, 2, and 3, the initial state probabilities might be represented as the vector [0.5, 0.3, 0.2]. This means the system has a 50% chance of starting in state 1, a 30% chance of starting in state 2, and a 20% chance of starting in state 3; the probabilities sum to 1, as per the fundamental axioms of probability. As another example, if you have a Markov chain with states "rainy" and "sunny", the initial state probabilities could be represented as the vector π0 = [0.6, 0.4], meaning there is a 60% chance of the system starting in the "rainy" state and a 40% chance of it starting in the "sunny" state.

This prior vector represents the probabilities of starting in each state without any previous information. A plain transition matrix, by contrast, does not handle the case where there is no previous word, as when beginning a sentence. To handle this, we can include an initial state as an extra row, so the transition matrix then has dimensions (n+1, n).

Expanding on the previous weather example with states S (Sunny), C (Cloudy), and R (Rainy), let's include the initial state probabilities in a combined transition matrix with an initial state row:

This extended transition matrix now includes an initial state row. So, to summarize, we can use a matrix to represent a Markov chain model in NLP; this matrix is called the transition matrix.
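A minimal sketch of this extension, reusing the assumed weather matrix from earlier together with an assumed initial distribution, stacks the initial probabilities as an extra row to get the (n+1, n) matrix and uses that row to draw the very first state:

```python
import numpy as np

states = ["Sunny", "Cloudy", "Rainy"]

pi = np.array([0.5, 0.3, 0.2])   # initial state probabilities (assumed)

A = np.array([                   # ordinary n x n transition matrix (assumed)
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])

# Stack the initial distribution on top: row 0 now plays the role of
# the "initial state", giving an (n+1) x n combined matrix.
A_ext = np.vstack([pi, A])
print(A_ext.shape)  # (4, 3)

# The first state of a sequence is sampled from the initial row;
# every later state is sampled from the row of the current state.
rng = np.random.default_rng(0)
first = rng.choice(len(states), p=A_ext[0])
print("first state:", states[first])
```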


Notes about the transition matrix:

  • The rows in the matrix represent the current states.
  • The columns represent the next states.
  • The values represent the transition probabilities of going from the current state to the next state. The states for this use case are the parts-of-speech tags.

To calculate a transition probability, we use the following formula:

Transition probability = Number of (previous tag, current tag) combinations / Number of combinations that start with the previous tag

For example, to calculate the transition probability for a blue tag followed by a purple tag, we count 2 occurrences of (previous blue, current purple) and 3 combinations that start with the blue tag.

Using the example provided:

  • Number of (previous blue, current purple) tag combinations = 2
  • Number of combinations starting with the blue tag = 3

Therefore, the transition probability of the blue tag followed by the purple tag is calculated as: Transition probability = 2 / 3 ≈ 0.67 or 67%


This means that in the given dataset or corpus, there is a 67% likelihood of transitioning from a blue tag to a purple tag in the sequence.

Now, let’s consider another scenario where we’re trying to model the behavior of a pet cat using a Hidden Markov Model (HMM). The cat has two states: ‘Asleep’ and ‘Awake’, but we can’t observe these states directly. Instead, we observe whether the cat is ‘Eating’, ‘Playing’, or ‘Ignoring’.

We want to calculate the transition probabilities, which are the probabilities of transitioning from one state to another. These probabilities can be estimated from data if we have a sequence of states that the cat was in over a period of time.

Let’s say we’ve recorded the cat’s behavior over 10 days and found the following sequence of states: ‘Awake’, ‘Asleep’, ‘Awake’, ‘Awake’, ‘Asleep’, ‘Asleep’, ‘Awake’, ‘Asleep’, ‘Awake’, ‘Asleep’.

We can calculate the transition probabilities as follows. The sequence contains 9 transitions. Of the 5 transitions starting in 'Awake', 1 goes to 'Awake' and 4 go to 'Asleep'; of the 4 transitions starting in 'Asleep', 3 go to 'Awake' and 1 stays in 'Asleep'.

This gives us the transition probability matrix: P('Awake'|'Awake') = 1/5 = 0.2, P('Asleep'|'Awake') = 4/5 = 0.8, P('Awake'|'Asleep') = 3/4 = 0.75, and P('Asleep'|'Asleep') = 1/4 = 0.25.
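Here is a small sketch that reproduces these numbers from the recorded sequence (pure counting, using only the standard collections module):

```python
from collections import Counter

days = ["Awake", "Asleep", "Awake", "Awake", "Asleep",
        "Asleep", "Awake", "Asleep", "Awake", "Asleep"]

pair_counts = Counter(zip(days, days[1:]))  # e.g. ('Awake', 'Asleep') -> 4
start_counts = Counter(days[:-1])           # transitions leaving each state

for (current, nxt), count in sorted(pair_counts.items()):
    prob = count / start_counts[current]
    print(f"P({nxt} | {current}) = {count}/{start_counts[current]} = {prob:.2f}")
```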

In practice, these transition probabilities are often learned from data using techniques like the Baum-Welch algorithm, which is a type of Expectation-Maximization (EM) algorithm.

For a better understanding, let's consider another example of visualizing state transitions, with a Markov chain diagram and its transition matrix in action:

Four-state transition matrix for the one-year performance of a loan portfolio

Note: in the transition table, if a value is shown as "-", the probability of transitioning from that state to the other is zero, which means there is no edge in the Finite State Automaton (FSA) diagram connecting those two states. In the context of HMMs, a zero transition probability means there is no direct transition from state i to state j (or that such a transition is extremely unlikely, given the model parameters). This can arise in two ways. First, from domain knowledge: in a model predicting the health states of a piece of equipment, for instance, it might not make sense to allow a transition from a "healthy" state to an "unhealthy" state, so that probability is set to zero, enforcing a constraint on the structure of the transition matrix. Second, from the training data: the model learns the transition probabilities from the data, so if a particular transition never occurs in the data, the model will estimate its probability as zero.
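The snippet below is a hedged sketch of the first case, domain knowledge: a hypothetical equipment-health matrix in which the 'Healthy' to 'Unhealthy' transition is forced to zero and the row is renormalized so it still sums to 1 (all state names and numbers are made up):

```python
import numpy as np

states = ["Healthy", "Degraded", "Unhealthy"]

# Estimated transition matrix: rows = current state, columns = next state.
A = np.array([
    [0.85, 0.10, 0.05],   # from Healthy
    [0.20, 0.60, 0.20],   # from Degraded
    [0.05, 0.25, 0.70],   # from Unhealthy
])

# Structural constraint: forbid a direct Healthy -> Unhealthy jump by
# zeroing that entry, then renormalize the row so it stays a valid
# probability distribution (a zero entry means "no edge" in the diagram).
i, j = states.index("Healthy"), states.index("Unhealthy")
A[i, j] = 0.0
A[i] = A[i] / A[i].sum()
print(np.round(A[i], 3))  # [0.895 0.105 0.   ]
```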

Note: in the context of Hidden Markov Models (HMMs), states can be classified as either "transient" or "recurrent", based on the likelihood of returning to a particular state after leaving it. A transient state is one the system may never return to once it has left: its return probability m_i is strictly less than 1. A recurrent state, on the other hand, is one the system is certain to return to: if the system ever enters a recurrent state, it will almost surely return to it again at some point in the future, so its return probability m_i equals 1.

In the diagram above, state 0 is transient because there are arrows leading from state 0 to states 1 and 2, but no arrows lead back to state 0. Once the process moves from state 0 to either state 1 or state 2, it cannot return to state 0; hence, state 0 is a transient state.

States 1 and 2 are recurrent because they form a closed loop where one can cycle between them indefinitely. There are bidirectional arrows between states 1 and 2, indicating that the process can move back and forth between these two states. This means that these states can be revisited multiple times, hence they are recurrent states.
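For a finite chain like this one, a state is recurrent exactly when every state it can reach can also reach it back, and transient otherwise. Below is a short sketch applying that check to the three-state diagram just described (the edge set is assumed from the description: 0 points to 1 and 2, while 1 and 2 point to each other):

```python
# Edges of the diagram described above: state 0 leads to 1 and 2,
# but nothing leads back to 0; states 1 and 2 point to each other.
edges = {0: {1, 2}, 1: {2}, 2: {1}}

def reachable(start):
    """All states reachable from `start` by following directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# A state is recurrent iff every state it can reach can reach it back.
for state in edges:
    recurrent = all(state in reachable(other) for other in reachable(state))
    print(state, "recurrent" if recurrent else "transient")
# -> 0 transient, 1 recurrent, 2 recurrent
```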

Ending note- As we sign off on Part 1, I want to extend my deepest gratitude for your active participation and unwavering engagement. While this marks the end of Part 1, our exploration is far from complete. We’re just about to set sail on Part 2: Building Blocks of Hidden Markov Models: Advanced Terminologies, where we’ll delve even deeper into our subject matter. I’m excited to share even more insights and continue this enriching journey with you.

If you found this post insightful, I encourage you to give it a clap or a like; this simple act of appreciation fuels my motivation. And if you believe this content could be useful to others, please consider sharing it within your network. Sharing is a small but powerful way to spread knowledge and benefit more people. Continue to push boundaries, embrace challenges, and nurture your growth. You're doing great! Keep going, and see you next time!
