Hidden Markov Models: The Secret Sauce in Natural Language Processing

om pramod
9 min read · Dec 30, 2023


Part 3: Predicting the Next Move: A Glimpse into Markov Chains and Related Probability Terminologies.

Predicting Next States in Markov Chains:

The state transition table (rows = today's state, columns = tomorrow's state):

          Cloudy   Rain   Sunny
Cloudy     0.1     0.5    0.4
Rain       0.3     0.6    0.1
Sunny      0.4     0.1    0.5

To generate a simple prediction of events, suppose today is Cloudy. From the table above, the most likely next state is Rain, with probability 50%. Moving to the Rain state, the following day will most likely still be Rain, with probability 60%.

Using a transition matrix alongside a state vector representing our current conditions allows us to compute the probabilities of the next states. For example, assuming a current state of 100% Cloudy, i.e. the vector [1 0 0], we get the next-state probabilities by multiplying the current state vector with the transition matrix.

The same table written as a transition matrix:

Matrix = [0.1  0.5  0.4]
         [0.3  0.6  0.1]
         [0.4  0.1  0.5]

State2 = Vector × Matrix
       = [(1×0.1 + 0×0.3 + 0×0.4), (1×0.5 + 0×0.6 + 0×0.1), (1×0.4 + 0×0.1 + 0×0.5)]
       = [C, R, S] = [0.1, 0.5, 0.4]

Most likely rain! Now we can calculate the probabilities of the state after that by multiplying the resulting vector by the same transition matrix.

State3 = State2 × Matrix
       = [(0.1×0.1 + 0.5×0.3 + 0.4×0.4), (0.1×0.5 + 0.5×0.6 + 0.4×0.1), (0.1×0.4 + 0.5×0.1 + 0.4×0.5)]
       = [C, R, S] = [0.32, 0.39, 0.29]
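If you would like to verify these hand calculations, here is a minimal NumPy sketch; the matrix values are taken directly from the table above:

```python
import numpy as np

# Transition matrix from the table above:
# rows = current state, columns = next state, ordered (Cloudy, Rain, Sunny).
M = np.array([
    [0.1, 0.5, 0.4],   # Cloudy -> Cloudy / Rain / Sunny
    [0.3, 0.6, 0.1],   # Rain   -> ...
    [0.4, 0.1, 0.5],   # Sunny  -> ...
])

state = np.array([1.0, 0.0, 0.0])  # 100% Cloudy today

for step in (2, 3):
    state = state @ M  # one step: row vector times matrix
    print(f"State{step}: Cloudy={state[0]:.2f}, Rain={state[1]:.2f}, Sunny={state[2]:.2f}")
# State2: Cloudy=0.10, Rain=0.50, Sunny=0.40
# State3: Cloudy=0.32, Rain=0.39, Sunny=0.29
```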

Rain is most likely again, and we can compute the state after that by multiplying by the transition matrix once more. This is the method of the Markov chain of probabilities. The chains we have been working with so far are called 1st-order Markov chains: they use only the current state to predict the next one.

In the example above, notice what happens if we always pick the most likely next state: once the chain transitions from Cloudy to Rain, it keeps predicting Rain and never leaves that state (refer to the table above), because Rain's most likely successor is Rain itself. This happens because the transition table only holds information about the last state; we don't know whether it was sunny or raining before it was cloudy. A 2nd-order Markov chain addresses this by using the last two states to compute the probability of the next one. All that is required is to group the last two states into one compound state, as in the example table below:

[Table: example second-order transition table, with each row indexed by a pair of consecutive states]

In short, higher-order Markov chains, such as second- or third-order chains, offer enhanced modeling capabilities compared to their first-order counterparts. For instance, a second-order Markov chain conditions its transitions on the previous two states, allowing it to capture more complex patterns and dependencies that a first-order chain would overlook. This added context often leads to improved accuracy and a richer representation of real-world systems or sequences.
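To make the grouping idea concrete, here is a minimal sketch that estimates second-order transition probabilities from a toy weather sequence by treating each pair of consecutive states as one compound state (the sequence itself is made up purely for illustration):

```python
from collections import defaultdict

# A toy observation sequence (made up for illustration).
seq = ["Sunny", "Cloudy", "Rain", "Rain", "Cloudy", "Rain", "Sunny", "Cloudy", "Rain"]

# Count transitions from each (previous, current) pair to the next state.
counts = defaultdict(lambda: defaultdict(int))
for prev, cur, nxt in zip(seq, seq[1:], seq[2:]):
    counts[(prev, cur)][nxt] += 1

# Normalize the counts into second-order transition probabilities.
for pair, nexts in sorted(counts.items()):
    total = sum(nexts.values())
    probs = {state: round(n / total, 2) for state, n in nexts.items()}
    print(pair, "->", probs)
# ('Cloudy', 'Rain') -> {'Rain': 0.5, 'Sunny': 0.5}  etc.
```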

Let’s consider another scenario now. In the realm of finance, Markov chains serve as a powerful tool for analyzing stock market trends. In this case, we assume that the stock market's movement can only take on three states each week: a bull market, a bear market, or a stagnant market. Through technical analysis of historical data, certain patterns can be found, along with their estimated probabilities. After a week characterized by a bull-market trend, there is a 90% chance that another bullish week will follow, a 7.5% chance that the next week will be bearish, and a 2.5% chance that it will be stagnant. After a bearish week, there is an 80% chance that the upcoming week will also be bearish, and so on, as represented in the state transition diagram below:

[Image: the state transition diagram for the bull, bear, and stagnant states]

By compiling these probabilities into a table, we get the following state transition matrix M:

The transition probability matrix M (rows and columns ordered bull, bear, stagnant; row = current week, column = next week):

M = [0.90   0.075  0.025]
    [0.15   0.80   0.05 ]
    [0.25   0.25   0.50 ]

We then create a 1×3 vector C that encodes which of the three states the current week is in, where column 1 represents a bull week, column 2 a bear week, and column 3 a stagnant week. In this example we set the current week as bearish, giving the vector C = [0 1 0]. Given the state of the current week, we can then calculate the probabilities of a bull, bear, or stagnant week for any number of weeks n into the future by multiplying C by the n-th power of the matrix, C·Mⁿ.

For example, one week ahead we get C·M = [0.15 0.80 0.05], so a bearish week is most likely to be followed by another bearish week.
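Here is a short NumPy sketch of this projection, using the matrix M reconstructed above; note how the probabilities stop changing as n grows:

```python
import numpy as np

# Transition matrix M, rows/columns ordered (bull, bear, stagnant);
# row = this week's state, column = next week's state.
M = np.array([
    [0.90, 0.075, 0.025],
    [0.15, 0.80,  0.05 ],
    [0.25, 0.25,  0.50 ],
])

C = np.array([0.0, 1.0, 0.0])  # the current week is bearish

# Probabilities n weeks ahead: C times the n-th power of M.
for n in (1, 2, 5, 30):
    probs = C @ np.linalg.matrix_power(M, n)
    print(f"n={n:2d}: bull={probs[0]:.3f}, bear={probs[1]:.3f}, stagnant={probs[2]:.3f}")
# As n grows, the vector settles near [0.625, 0.3125, 0.0625].
```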

A characteristic of what is called a regular Markov chain is that, over a large enough number of iterations, the state probabilities converge to fixed values and remain unchanged thereafter. This means that, after a sufficient number of iterations, the likelihood of ending up in any given state of the chain is the same regardless of where you started. When the chain reaches this point, we say the transition probabilities have reached a steady state. This is what Markov ultimately wanted to prove!

From this we can conclude that as n → ∞, the probabilities converge to a steady state, meaning that about 63% of all weeks will be bullish, 31% bearish, and 6% stagnant. What we also see is that the steady-state probabilities of this Markov chain do not depend upon the initial state.
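The steady state can also be computed directly, without iterating: it is the left eigenvector of M associated with eigenvalue 1, i.e. the solution of π = πM. A minimal sketch, again assuming the matrix M above:

```python
import numpy as np

M = np.array([
    [0.90, 0.075, 0.025],
    [0.15, 0.80,  0.05 ],
    [0.25, 0.25,  0.50 ],
])

# A left eigenvector of M is a right eigenvector of M transposed.
eigvals, eigvecs = np.linalg.eig(M.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])  # eigenvalue 1 is the largest
pi /= pi.sum()  # normalize so the probabilities sum to 1
print(pi)  # approximately [0.625, 0.3125, 0.0625]
```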

In conclusion, the concept of a “steady-state” or “stationary distribution” in a Markov chain is a key part of understanding the long-term behavior of the chain. It represents the point at which the transition probabilities between states have converged to a constant value, and the likelihood of transitioning from one state to another remains constant over time.

Note — One of the fundamental assumptions of the Markov model is the constancy of transition probabilities over time, a property known as time-homogeneity: the probability of transitioning from one state to another is the same at every time step, regardless of the sequence of previous states. The same assumption carries over to HMMs and is a direct consequence of the Markov property. Time-homogeneity greatly simplifies the analysis of a Markov chain and is what allows us to make predictions about the future states of the system.

Note — Conditional probability is a measure of the probability of an event occurring, given that another event has already occurred. If the event of interest is A and the event B is known or assumed to have occurred, "the conditional probability of A given B", or "the probability of A under the condition B", is usually written as P(A|B). This can also be expressed as:

P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0

For example, we know that the possible outcomes of tossing a (fair) die are {1, 2, 3, 4, 5, 6}. The probability of rolling a '2' is 1/6: P(2) = 1/6. This is an unconditional probability: the probability of an event without any other event being taken into account. If we roll the die again, the probability of getting a '2' given that the number showing is even is 1/3, since there are three even numbers: P(2|even) = 1/3. This is a conditional probability.

It should be clear that P(A|B) typically differs from P(B|A), and falsely equating the two can lead to various errors of reasoning. For example, if a person has dengue fever, they might have a 90% chance of testing positive for the disease. Here, what is being measured is that if event B (having dengue) has occurred, the probability of A (testing positive) given B is 90%: P(A|B) = 90%. Alternatively, if a person tests positive for dengue fever, they may have only a 15% chance of actually having this rare disease, due to high false-positive rates. In that case, the probability of event B (having dengue) given that event A (testing positive) has occurred is only 15%: P(B|A) = 15%.
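A tiny sketch that reproduces these die-roll numbers by simple counting:

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]  # a fair die

# Unconditional probability of rolling a 2.
p_two = Fraction(1, len(outcomes))  # 1/6

# Conditional probability P(2 | even): restrict the sample space to even outcomes.
evens = [o for o in outcomes if o % 2 == 0]
p_two_given_even = Fraction(sum(o == 2 for o in evens), len(evens))  # 1/3

print(p_two, p_two_given_even)  # 1/6 1/3
```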

A conditional probability table (CPT) is a useful way to display conditional probabilities, and it illustrates the relationship between events. A conditional probability table can be put into matrix form. As an example, suppose two binary variables X and Y have the joint probability distribution given in the table below, where each of the four central cells shows the probability of a particular combination of X and Y values.

Joint probability distribution of X and Y:

         X = 0   X = 1
Y = 0     4/9     1/9
Y = 1     2/9     2/9
Sum       6/9     3/9

Note — Joint probability is the probability that event A and event B occur simultaneously; it is the probability of the intersection of two or more events, written P(A ∩ B). The first column sum is the probability that X = 0 and Y takes any of its possible values; that is, the column sum 6/9 is the marginal probability that X = 0. Marginal probability is the probability of an event A occurring on its own, expressed as P(A); it can be considered an unconditional probability, since its occurrence is not conditioned on any other event. If we want the probability that Y = 0 given that X = 0, we compute the fraction of the probability in the X = 0 column that has the value Y = 0, which is 4/9 ÷ 6/9 = 4/6. Likewise, in the same column, the probability that Y = 1 given that X = 0 is 2/9 ÷ 6/9 = 2/6. In the same way, we can find the conditional probabilities for Y equalling 0 or 1 given that X = 1. Combining these pieces of information gives us the table of conditional probabilities for Y.
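Here is a minimal NumPy sketch that derives the conditional probabilities from the joint table above, by dividing each column by its marginal:

```python
import numpy as np

# Joint distribution from the table above: rows = Y (0, 1), columns = X (0, 1).
joint = np.array([
    [4/9, 1/9],
    [2/9, 2/9],
])

marginal_x = joint.sum(axis=0)       # P(X) = [6/9, 3/9]
cond_y_given_x = joint / marginal_x  # P(Y|X), column-wise division
print(cond_y_given_x)
# Column X=0: [4/6, 2/6]; column X=1: [1/3, 2/3]
```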

Bayes’ theorem is mathematically expressed as:

P(A|B) = P(B|A) × P(A) / P(B)

The following example helps us understand Bayes’ theorem. Suppose a particular test for whether someone has been using cannabis is 90% sensitive, meaning the true positive rate (TPR) = 0.90: it gives 90% true positive results for cannabis users. The test is also 80% specific, meaning the true negative rate (TNR) = 0.80: it correctly identifies 80% of non-users, but also generates 20% false positives, a false-positive rate (FPR) = 0.20, for non-users. Assuming a prevalence of 0.05, meaning 5% of people use cannabis, what is the probability that a random person who tests positive is really a cannabis user?

The positive predictive value (PPV) of a test is the proportion of people who are actually positive out of all those testing positive, and can be calculated from a sample as: PPV = true positives / all who tested positive. If sensitivity, specificity, and prevalence are known, PPV can be calculated using Bayes’ theorem. Let P(User|Positive) mean "the probability that someone is a cannabis user given that they test positive", which is what is meant by PPV. We can write:

P(User|Positive) = P(Positive|User) × P(User) / P(Positive)
                 = (0.90 × 0.05) / (0.90 × 0.05 + 0.20 × 0.95)
                 = 0.045 / 0.235 ≈ 0.19

Even if someone tests positive, the probability that they are a cannabis user is only about 19%, because in this group only 5% of people are users, and most positives are false positives coming from the remaining 95%.
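As a quick check, here is a small helper that packages this calculation; the function name and signature are my own choice for illustration:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value P(User|Positive) via Bayes' theorem."""
    true_pos = sensitivity * prevalence               # P(positive and user)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(positive and non-user)
    return true_pos / (true_pos + false_pos)

print(ppv(sensitivity=0.90, specificity=0.80, prevalence=0.05))  # ~0.19
```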

Ending note: As we draw Part 3 to a close, I want to convey my sincere appreciation for your ongoing involvement. But remember, this is not the end, merely a pause in our journey. We’re just gearing up for Part 4: Calculating Sequence Probabilities in Hidden Markov Models and Understanding the Hidden Nature of HMMs, where we’ll dive deeper and uncover more intriguing aspects of our subject matter. I’m eager to continue this voyage of discovery with you.

At this juncture, I’d like to invite you to give this post a clap if you found it enlightening. It motivates me to continue delivering quality content. And if you think this post could benefit others, please consider sharing it. Spreading knowledge is a small but significant way of making a difference in the world. I eagerly look forward to continuing this journey in the next part. Until then, keep the spirit of exploration alive and stay curious!!!

