Hidden Markov Model — Part 1 of the HMM series

Raymond Kwok
Published in Analytics Vidhya
7 min read · Jul 21, 2019

How is it structured and what to do with it?

Note: In this series of articles I will cover what I have learnt about HMM over the past few days, and I have tried to make the articles very simple to understand. This is not the systematic treatment a textbook would offer, but I hope these articles give you a hand through some bottlenecks in your journey. Good luck!

Part 1: Architecture of the Hidden Markov Model
Part 2: Algorithm to train an HMM: the Baum-Welch algorithm
Part 3: Algorithm to predict with a trained HMM: the Viterbi algorithm

Introduction to Hidden Markov Model

In very simple terms, the HMM is a probabilistic model for inferring unobserved information from observed data. Take a mobile phone’s on-screen keyboard as an example: you may sometimes mistype a character next to the one you wanted. Here, the character you mistype is the observed data, and the one you intended to type is the unobserved one. As another example, your GPS readings (observed) could jump around your actual location (unobserved) due to some sources of random noise.

Interesting? In many cases what is observed is not the truth, and the Hidden Markov Model is one way to recover the hidden truth. However, it is not magic for everything: to use it, your problem has to satisfy the assumptions on which the HMM is founded. My approach here is to discuss the HMM in words, supplemented with some maths that can be skipped if it is not your cup of tea this time. Let’s start!

Sequences of states and observables

In either of my examples above, we are actually dealing with data sequences. For example, we may mistype ‘miscellaneous’ as ‘miscellameous’, which are both sequences of characters. And in the GPS example, if we stand still at the location (22.3525°N, 113.8475°E) and record the reading once every second for five seconds, this location should be repeated five times. However, as the reading jumps around, we could end up seeing (22.3535°N, 113.8455°E), (22.3585°N, 113.8325°E), (22.3123°N, 113.8600°E), (22.3212°N, 113.8397°E), (22.3421°N, 113.7989°E).
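The GPS example can be mimicked in a few lines of Python. The noise model below (Gaussian jitter with a 0.02° spread) is an invented assumption for illustration, not the article’s model:

```python
import random

random.seed(0)

# Hidden state: where we actually stand, repeated for five seconds.
true_location = (22.3525, 113.8475)

# Observed data: each reading is the true location plus random noise
# (assumed Gaussian with standard deviation 0.02 degrees).
readings = [
    (true_location[0] + random.gauss(0, 0.02),
     true_location[1] + random.gauss(0, 0.02))
    for _ in range(5)
]

for lat, lon in readings:
    print(f"{lat:.4f}°N {lon:.4f}°E")
```

Even though the hidden state never changes, the five observed readings all differ, which is exactly the gap the HMM is meant to bridge.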

Here, each observed data point has an associated hidden truth. In formal HMM discussions, each ‘hidden truth’ is called a ‘state variable’, and each piece of ‘observed data’ an ‘observed variable’.

HMM. The above orange boxes are ‘linked’ with one another to form a Markov chain. Each state in the chain is ‘linked’ to an observed variable.

States and observables in the HMM structure

Let’s look at the figure above. In this HMM architecture, we have a sequence of states (hidden truths) ‘linked’ with each other. An arrow here means dependence: for example, the state at time = 1 depends on the state at time = 0. This is the very important assumption that makes the Markov model so simple: any state depends ONLY on some of its previous state(s). I write state(s) because the number is a choice. We call it a simple (first-order) Markov chain if a state depends only on the immediately previous state, whereas it is an n-th order Markov chain if a state depends on its previous n states.
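The first-order assumption is easy to see in code: sampling the next state consults only the current one, never the earlier history. The two weather states and their probabilities below are made up for illustration:

```python
import random

random.seed(1)

# Made-up 2-state chain. A[i][j] is the probability of moving
# from state i to state j.
A = {"rain": {"rain": 0.6, "sun": 0.4},
     "sun":  {"rain": 0.2, "sun": 0.8}}

def step(state):
    # Nothing but `state` is consulted here -- that is the Markov assumption.
    r = random.random()
    cumulative = 0.0
    for nxt, p in A[state].items():
        cumulative += p
        if r < cumulative:
            return nxt
    return nxt  # guard against floating-point rounding

chain = ["sun"]
for _ in range(5):
    chain.append(step(chain[-1]))
print(chain)
```

An n-th order chain would instead pass the last n states into `step`, which is why higher orders are more expressive but also much more expensive.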

Considering dependence between states is natural, because we expect some relation between them. For instance, the combination of characters in a word is not random; they have to be in the right order to be correctly understood. The simplest dependence is that a state depends only on one previous state, which is what the discussion will focus on.

State transition, emission, and initial state

HMM. Orange boxes for states, and green boxes for observables

Now we have put the states, the observed variables, and their dependence relations into the HMM architecture. We will go one step further and quantify the dependence, beginning with the state transition, which is the transition from the state at time k−1 to the state at time k, or the change from one orange box to the next.

Assume there are M possible states to choose from, namely, the state could be any one of {1, 2, 3, …, M}. We quantify the transition probability from state i to state j as aᵢⱼ. Since the transition could happen from any one of the M possible states to any one of the M possible states, there are in total M × M possibilities, and we can arrange them in the following matrix representation, called the state transition matrix A.

state transition matrix A. aᵢⱼ represents the probability of transitioning from state i to state j.

Because all aᵢⱼ are probabilities, each is bounded between 0 and 1, and each row must sum to 1. In general we could have one such matrix A for each time step t; however, we will consider a stationary Markov chain, meaning the matrix A is constant over time.
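As a quick sketch, here is what such a matrix looks like for an invented 3-state example, with the two properties above checked explicitly:

```python
# Made-up 3-state transition matrix. A[i][j] is the probability
# of moving from state i to state j.
A = [
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.5, 0.3],
]

for row in A:
    # Every entry is a probability...
    assert all(0.0 <= p <= 1.0 for p in row)
    # ...and every row sums to 1, since the chain must go *somewhere*.
    assert abs(sum(row) - 1.0) < 1e-9
```

Note there is no such constraint on the columns: nothing forces every state to be equally reachable.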

aᵢⱼ is the conditional probability P(Xₖ = j | Xₖ₋₁ = i) of state j given the previous state i. If we write the probabilities over the M states at time k−1 as a row vector πₖ₋₁, then one step of the chain is a matrix multiplication: πₖ = πₖ₋₁A.

Similarly, we can construct the emission matrix B, storing the probabilities of state i producing observed value j. The observable is assumed to have K possible values, meaning it can take any one of {1, 2, …, K}. Stationarity is also assumed here, so we have one constant emission matrix.

emission matrix B. bᵢⱼ represents the probability that state i emits observable j.
bᵢⱼ is the conditional probability P(Yₖ = j | Xₖ = i) of observing j given state i. Multiplying the state distribution by B, i.e. πₖB, gives the probability distribution over the K observable values at time k.

We are close to the full picture: we now have the architecture, the state transition matrix A, and the emission matrix B. The last thing we need is the probability distribution π₀ of the initial state X₀. It starts the first state, so the upcoming states in the Markov chain can be estimated through the state transition matrix.

initial state distribution π₀.
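To see how π₀, A, and B fit together numerically, here is a tiny two-state sketch with invented numbers: π₀ is pushed one step forward by A, and mapped to a distribution over observables by B.

```python
M, K = 2, 2                 # 2 states, 2 observable values (made-up example)

pi0 = [0.6, 0.4]            # initial state distribution
A = [[0.9, 0.1],
     [0.5, 0.5]]            # state transition matrix
B = [[0.8, 0.2],
     [0.1, 0.9]]            # emission matrix

# State distribution at time 1: pi1[j] = sum_i pi0[i] * A[i][j]
pi1 = [sum(pi0[i] * A[i][j] for i in range(M)) for j in range(M)]

# Distribution over what we observe at time 0: obs0[k] = sum_i pi0[i] * B[i][k]
obs0 = [sum(pi0[i] * B[i][k] for i in range(M)) for k in range(K)]

print(pi1)   # [0.74, 0.26]
print(obs0)  # [0.52, 0.48]
```

Both results are still valid probability distributions, which is guaranteed by the row-sum constraints on A and B.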

To give you some actual numbers, let us look at the following example about words. I took the 3000-word list from the Oxford Advanced Learner’s Dictionary and tabulated the probability of a character being the starting character, and the probability of one character coming after another.

Only a partial table is shown here, and all values are percentages. The first row is the initial state distribution, i.e. the probability of a character (state) being the starting character; for example, there is a 10.1% chance of seeing a word start with ‘c’. All remaining rows form the state transition matrix. For example, there is a 39.2% chance that, within a word, ‘d’ is followed by ‘e’.
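The counting behind such a table can be sketched as follows; a handful of stand-in words replaces the actual Oxford 3000 list:

```python
from collections import Counter

# Tiny stand-in word list, just to show the mechanics of the counting.
words = ["cat", "car", "dog", "do", "date"]

# Initial state distribution: how often each character starts a word.
starts = Counter(w[0] for w in words)
pi0 = {c: n / len(words) for c, n in starts.items()}

# Transition counts: how often character i is followed by character j.
pairs = Counter((a, b) for w in words for a, b in zip(w, w[1:]))

# Normalize each row so the probabilities P(j | i) sum to 1.
row_totals = Counter()
for (a, _b), n in pairs.items():
    row_totals[a] += n
trans = {(a, b): n / row_totals[a] for (a, b), n in pairs.items()}

print(pi0["c"])           # 0.4 -- 2 of the 5 words start with 'c'
print(trans[("d", "o")])  # 'd' is followed by 'o' in 2 of its 3 transitions
```

Running the same counting over the full 3000-word list produces the percentages in the table above.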

The first row of the table gives us the initial state distribution π₀, and the rest form the state transition matrix A. For the emission matrix B, let’s argue that 70% of the time a user types a character correctly, and otherwise a neighboring character is typed. For example, if we intend to type ‘G’ (state), there is a 70% chance we correctly type ‘G’ (observed), and a 30% chance of getting any one of {‘T’, ‘Y’, ‘F’, ‘H’, ‘C’, ‘V’, ‘B’}, or ~4.3% each. By this argument, we can construct the emission matrix.

a keyboard example
Only a partial table is shown here, and all values are percentages. This table forms the emission matrix. For example, there is a 5% chance that intentionally typing ‘d’ (state) will result in ‘c’ (observed).
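The 70%/30% argument translates directly into one row of B per intended character. This sketch builds the row for ‘g’ with an assumed set of keyboard neighbors:

```python
def emission_row(intended, neighbors, alphabet):
    """One row of B: P(observed char | intended char), assuming the
    70%-correct argument and an equal split among the neighbors."""
    row = {c: 0.0 for c in alphabet}
    row[intended] = 0.7
    for c in neighbors:
        row[c] = 0.3 / len(neighbors)
    return row

alphabet = "abcdefghijklmnopqrstuvwxyz"
# Assumed QWERTY neighbors of 'g'.
row_g = emission_row("g", ["t", "y", "f", "h", "c", "v", "b"], alphabet)

print(row_g["g"])           # 0.7
print(round(row_g["t"], 3)) # 0.043 -- the ~4.3% per neighbor
```

As with A, every row of B built this way sums to 1, so it is a valid conditional distribution over the observed characters.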

Summary

Let’s take a break here. By now we have seen that an HMM is constructed from state variables and their associated observed variables, together with a choice of how many previous states each state depends on. The model is then parameterized by the state transition matrix A, the emission matrix B, and the initial state distribution π₀. Please note that the HMM we have been discussing is a stationary, simple (first-order) Hidden Markov Model with discrete state variables, discrete observed variables, and discrete time, so it is one special case in the HMM family.

Lastly, although I used the word list for counting and presented the matrices with some actual numbers, in practice we don’t do it that way, nor can we always construct the emission matrix by such an argument. In the coming articles, we will talk about the Baum-Welch algorithm for training (i.e. tuning the parameter matrices), as well as the Viterbi algorithm for inferring the whole sequence of hidden states from observed data.
