Hidden Markov Models: The Secret Sauce in Natural Language Processing

om pramod
6 min read · Dec 30, 2023


Part 2: Building Blocks of Hidden Markov Models: Advanced Terminologies

Now let’s understand Markov chains using some examples.


Now, let’s rewrite our two examples by showing the values in the transition matrix and the prior vector (the prior vector represents the initial probabilities of starting with each coin). Let’s illustrate both of these parameters via a finite state machine (FSM) representation.


For each observed data point, we have an associated hidden truth. In formal HMM discussions, each ‘hidden truth’ is called a ‘state variable’, and each ‘observed data point’ is called an ‘observed variable’.


The orange boxes above are ‘linked’ with one another to form a Markov chain. Each state in the chain is ‘linked’ to an observed variable.

In addition to the transition probabilities of a Markov chain model, a hidden Markov model has an additional set of probabilities known as emission probabilities.

Emission probabilities (also known as observation probabilities) gauge the likelihood of a word being emitted from a given PoS tag. They are calculated based on the frequency of words associated with each PoS tag in the training corpus. For example, if we see the word ‘cat’ often associated with the ‘NOUN’ tag, then the emission probability of ‘cat’ given ‘NOUN’ would be high. This is calculated by dividing the number of times we see a particular word (e.g., ‘cat’) with a particular PoS tag (e.g., ‘NOUN’) by the total number of words associated with that PoS tag. Both the transition and emission probabilities are estimated directly from a provided corpus: we analyze large volumes of text to find how likely a particular word is to follow another, and to estimate the likelihood of a word being tagged with a specific PoS tag.

The emission probability matrix has one row per hidden state and one column per observable symbol (so it is N by V rather than square in general). Each cell in the matrix, denoted as bi(vm), represents the probability of observing a particular symbol vm given that the system is in state i. The sum of the probabilities over all observable symbols for a given hidden state should equal 1, ensuring that each row forms a valid probability distribution.
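As a concrete sketch, the counting described above can be written in a few lines of Python. The tiny tagged corpus below is a made-up illustration, not drawn from any real dataset.

```python
from collections import Counter, defaultdict

# A tiny hypothetical tagged corpus of (word, PoS tag) pairs.
tagged_corpus = [
    ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"),
    ("a", "DET"), ("cat", "NOUN"), ("purrs", "VERB"),
    ("the", "DET"), ("dog", "NOUN"), ("sleeps", "VERB"),
]

word_tag_counts = Counter(tagged_corpus)               # count(word, tag)
tag_counts = Counter(tag for _, tag in tagged_corpus)  # count(tag)

# Emission probability: P(word | tag) = count(word, tag) / count(tag).
emission = defaultdict(dict)
for (word, tag), count in word_tag_counts.items():
    emission[tag][word] = count / tag_counts[tag]

print(emission["NOUN"])  # {'cat': 0.666..., 'dog': 0.333...}

# Each row (one tag) sums to 1, i.e. it is a valid probability distribution.
for tag, row in emission.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9
```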

Let’s consider a simple example. Suppose we have an imaginary friend named Lea who does one of four activities each day: painting, cleaning, shopping, or biking. These are our observable events. We don’t know whether each day is sunny or rainy, but we believe the weather influences Lea’s choice of activity. The weather conditions are our hidden states. Here are the components of our HMM:

  1. Hidden States (S): Sunny and Rainy.
  2. Observables (O): Painting, Cleaning, Shopping, and Biking.

We might have an emission probability matrix like this:
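Here is one illustrative possibility as a small Python dictionary; the specific numbers are assumptions, chosen only so that each row sums to 1.

```python
# Hypothetical emission probabilities P(activity | weather) for Lea's HMM.
# Every number below is an illustrative assumption.
emission = {
    "Sunny": {"Painting": 0.4, "Cleaning": 0.1, "Shopping": 0.2, "Biking": 0.3},
    "Rainy": {"Painting": 0.3, "Cleaning": 0.45, "Shopping": 0.2, "Biking": 0.05},
}

# Each hidden state's row is a valid probability distribution over activities.
for state, row in emission.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9
```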

Let’s consider another simple weather prediction system as an example. In this case:

  • The hidden states could represent the weather conditions: Sunny (S), Cloudy (C), and Rainy (R).
  • The observable symbols could represent whether it’s raining (Yes) or not (No).

The emission probability matrix for this system could be represented as follows:

In this matrix, the cell bi(vm) represents the probability of observing symbol vm given that the system is in state i. For example, S(Yes) is 0.1, which means there is a 10% chance of observing ‘Yes’ (rainy) when the system is in the ‘Sunny’ state. Similarly, C(No) is 0.5, which means there is a 50% chance of observing ‘No’ (not rainy) when the system is in the ‘Cloudy’ state. To summarize, the emission probability matrix is a fundamental component of the HMM: it defines the probability of emitting each observation from a given hidden state.
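Written out as a small Python sketch: S(Yes) = 0.1 and C(No) = 0.5 come from the description above, their complements follow from each row summing to 1, and the entire Rainy row is an illustrative assumption.

```python
# Emission matrix b[state][symbol] = P(symbol | state) for the weather example.
emission = {
    "Sunny":  {"Yes": 0.1, "No": 0.9},  # S(Yes) given above; S(No) = 1 - 0.1
    "Cloudy": {"Yes": 0.5, "No": 0.5},  # C(No) given above; C(Yes) = 1 - 0.5
    "Rainy":  {"Yes": 0.8, "No": 0.2},  # assumed values for illustration
}

print(emission["Sunny"]["Yes"])  # 0.1, a 10% chance of observing 'Yes' in the Sunny state

# Every row must sum to 1 to be a valid probability distribution.
for state, row in emission.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9
```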

Emission probabilities describe the likelihood of a hidden part-of-speech state emitting each of the words present in your corpus. Each row in the emission matrix corresponds to one of the hidden states, while each column corresponds to an observable word in the corpus.


Note: p(a|b) = probability of state a occurring given that the previous state is b. This is called the transition probability.

p(o|b) = probability of state b giving out o as the output. This is called the emission probability.

Note — Refer to the visual aid below, detailing the Transition and Emission Probabilities, to gain deeper insights into the workings of the Hidden Markov Model.


In the realm of sports betting, the performance of a horse is intricately linked to its physical fitness, categorized as either Healthy or Recovering. The first table outlines the probabilities governing the transitions between these fitness statuses across two consecutive races. For instance, if a horse enters a race in a Healthy state, there is a 70% chance it will remain Healthy for the next race and a 30% chance it will transition to the Recovering status. Conversely, if the horse starts a race in the Recovering state, it has an equal likelihood of remaining in Recovering or transitioning to a Healthy state after that race. The second table describes the relationship between fitness status and the choice of jockey, which depends on the horse’s condition: if the horse is Healthy, there’s an 80% chance Jockey 1 is chosen, while Jockey 2 and Jockey 3 each have only a 10% chance; if the horse is Recovering, the probabilities of Jockey 1, Jockey 2, and Jockey 3 being selected are 20%, 30%, and 50%, respectively.
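To tie the two tables together, here is a short Python sketch that encodes them and, purely as an illustration, scores one hypothetical fitness/jockey sequence by multiplying the corresponding transition and emission probabilities (starting the horse as Healthy with probability 1 is an added assumption, not something stated in the tables).

```python
# Transition probabilities P(next fitness | current fitness) from the first table.
transition = {
    "Healthy":    {"Healthy": 0.7, "Recovering": 0.3},
    "Recovering": {"Healthy": 0.5, "Recovering": 0.5},
}

# Emission probabilities P(jockey chosen | fitness) from the second table.
emission = {
    "Healthy":    {"Jockey 1": 0.8, "Jockey 2": 0.1, "Jockey 3": 0.1},
    "Recovering": {"Jockey 1": 0.2, "Jockey 2": 0.3, "Jockey 3": 0.5},
}

# One hypothetical hidden path and the jockeys observed along it.
states  = ["Healthy", "Recovering", "Recovering"]
jockeys = ["Jockey 1", "Jockey 3", "Jockey 2"]

# Joint probability, assuming the horse starts Healthy with probability 1.
prob = emission[states[0]][jockeys[0]]
for prev, curr, jockey in zip(states, states[1:], jockeys[1:]):
    prob *= transition[prev][curr] * emission[curr][jockey]

print(prob)  # 0.8 * (0.3 * 0.5) * (0.5 * 0.3) = 0.018
```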

Note — Below is an illustrative depiction showcasing the general architecture of a Hidden Markov Model (HMM) for enhanced clarity.

General Architecture of a Hidden Markov Model, where X: hidden state, Y: observables

Previously, in our unfair coin toss example, we could observe whether the coin tossed was fair or biased. Now, consider an extension of that setting where, instead of showing the flipped coin, only the result of the flip is shown to us. Therefore, the coin (biased or fair) is hidden, while the result of the flip (heads or tails) is observable. A typical HMM has two kinds of nodes: a set of hidden states z and a set of observations x. Each “observation” is generated, or emitted, from the “hidden” component. Now, we explore more examples to understand this “hidden” state.
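Here is a minimal generative sketch of this hidden-coin setting in Python; the specific numbers (the bias of the unfair coin, how often the coins are swapped, and which coin we start with) are illustrative assumptions.

```python
import random

random.seed(0)

# Hidden states: which coin is in play. Observations: the flip result.
# All probabilities below are illustrative assumptions.
transition = {"Fair": {"Fair": 0.9, "Biased": 0.1},
              "Biased": {"Fair": 0.2, "Biased": 0.8}}
emission = {"Fair": {"H": 0.5, "T": 0.5},
            "Biased": {"H": 0.8, "T": 0.2}}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    return random.choices(list(dist.keys()), weights=list(dist.values()))[0]

state = "Fair"  # assumed starting coin
hidden, observed = [], []
for _ in range(10):
    hidden.append(state)                      # the coin actually used (hidden from us)
    observed.append(sample(emission[state]))  # the flip result (what we see)
    state = sample(transition[state])

print(observed)  # the heads/tails sequence we are shown
print(hidden)    # the underlying coin sequence the HMM keeps hidden
```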


Let’s now redraw our two HMM examples and show their parameters via the FSM representation.


At a high level, hidden Markov models consist of a set of N states. There are two important matrices:

  • The transition matrix A has dimension N by N
  • The emission matrix B has dimension N by V

Here, N stands for the number of tags and V stands for the number of words in the vocabulary.
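As a quick sanity check on these shapes, here is a small NumPy sketch; the concrete sizes (N = 3 tags, V = 5 words) are illustrative assumptions.

```python
import numpy as np

N, V = 3, 5  # illustrative: 3 tags and a 5-word vocabulary
rng = np.random.default_rng(0)

# Transition matrix A: N x N, each row a distribution over the next tag.
A = rng.random((N, N))
A /= A.sum(axis=1, keepdims=True)

# Emission matrix B: N x V, each row a distribution over words for one tag.
B = rng.random((N, V))
B /= B.sum(axis=1, keepdims=True)

assert A.shape == (N, N) and B.shape == (N, V)
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```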

Ending note: As we come to the conclusion of Part 2, I am deeply touched by your consistent engagement. But don’t pack your bags just yet! We’re just warming up. Our exploration continues in Part 3: Predicting the Next Move: A Glimpse into Markov Chains and Related Probability Terminologies, where we will delve deeper into the subject matter, unraveling more layers and expanding our understanding. I can’t wait to share more with you.

Before we part ways for now, if you found this post helpful, I encourage you to give it a clap or a like. In addition, if you believe this content could be useful to others, please consider sharing it within your network. It assures me that my efforts are making an impact.

Your curiosity is infectious. Keep diving deep, reaching high, and asking questions. Can’t wait to see you again!!!
