POS Tagging using Hidden Markov Models (HMM) & Viterbi algorithm in NLP mathematics explained

Mehul Gupta
Data Science in your pocket
10 min read · Apr 9, 2020


My last post dealt with the very first preprocessing step of text data, tokenization. This time, I will take a step further and write about how POS (Part Of Speech) Tagging is done.

My debut book “LangChain in your Pocket” is out now !!

BUT WAIT!!
What the heck is Part Of Speech?

There are thousands of words but they don’t all have the same job. For example:

  • Some words express action
  • Other words express things
  • Other words join one word to another word

We can divide all words into categories depending upon their job in the sentence. These categories are called Parts Of Speech. Verb, Noun, Adjective, etc. are some common POS tags we have all heard of at some point in school.

But why do we need POS Tagging?

There are a lot of ways in which POS Tagging can be useful:

  • These tags reveal a lot about a word and its neighbors. Knowing whether a word is a noun or a verb tells us about likely neighboring words (nouns are preceded by determiners and adjectives, verbs by nouns).
  • Gives an idea about syntactic structure (nouns are generally part of noun phrases), hence helping in text parsing(the process of determining the syntactic structure of a text by analyzing its constituent words based on an underlying grammar).
  • Parts of speech are useful features for labeling named entities like people or organizations (Mumbai in ‘Mumbai is the city of dreams’) in information extraction, or for coreference resolution (relating he to Ram in ‘Ram is a boy. He is handsome’).
  • A word’s part of speech can even play a role in speech recognition or synthesis, e.g., the word content is pronounced CONtent when it is a noun and conTENT when it is an adjective.

Now that we are clear on the motive, let’s bring on the mathematics.

Let’s first understand what Hidden Markov Models are…

Before going to HMMs, we will go through Markov Chain models:

A Markov chain is a model that tells us something about the probabilities of sequences of random states/variables. A Markov chain makes a very strong assumption that if we want to predict the future in the sequence, all that matters is the current state. All the states before the current state have no impact on the future except via the current state.

A Markov Chain model of the weather might have Hot, Cool, and Rainy as its states; to predict tomorrow’s weather you would examine today’s weather, but yesterday’s weather adds nothing to the prediction.

The examples below will give a better idea:

In the first chain, we have HOT, COLD & WARM as states, & the decimal numbers represent the state transition (State1 → State2) probabilities, i.e. there is a 0.1 probability of it being COLD tomorrow if today it is HOT. Below are all the components of a Markov Chain:
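To make the transition probabilities concrete, here is a minimal Python sketch of such a weather chain. Only the 0.1 probability of going from HOT to COLD comes from the example above; every other number is a made-up placeholder, chosen only so that each row sums to 1.

```python
# States of the toy weather Markov chain described above.
states = ["HOT", "COLD", "WARM"]

# Transition probabilities: transition[today][tomorrow].
# Only P(COLD | HOT) = 0.1 comes from the example; the rest are placeholders.
transition = {
    "HOT":  {"HOT": 0.6, "COLD": 0.1, "WARM": 0.3},
    "COLD": {"HOT": 0.2, "COLD": 0.5, "WARM": 0.3},
    "WARM": {"HOT": 0.4, "COLD": 0.2, "WARM": 0.4},
}

def tomorrow_distribution(today):
    """Markov assumption: tomorrow's weather depends only on today's state."""
    return transition[today]

print(tomorrow_distribution("HOT"))   # {'HOT': 0.6, 'COLD': 0.1, 'WARM': 0.3}
```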

Now, moving towards HMMs…..

Sometimes, what we want to predict is a sequence of states that isn’t directly observable in the environment. Instead, we are given another sequence of states that is observable, and these hidden states have some dependence on the observable ones.

In the above HMM, we are given Walk, Shop & Clean as observable states. But we are more interested in tracing the sequence of the hidden states that will be followed which are Rainy & Sunny.

Why do we need HMM for POS Tagger?

If you notice closely, we can treat the words in a sentence as Observable States (given to us in the data) and their POS Tags as Hidden States, and hence we use an HMM for estimating POS tags. It must be noted that we call Observable states ‘Observations’ & Hidden states simply ‘States’.

A Hidden Markov Model has the following components:

Q: Set of possible Tags

A: The A matrix contains the tag transition probabilities P(ti|ti−1) which represent the probability of a tag occurring given the previous tag. Example: Calculating A[Verb][Noun]:

P(Noun | Verb) = Count(Verb followed by Noun) / Count(Verb)

O: Sequence of observations (words in the sentence)

B: The B emission probabilities, P(wi|ti), represent the probability, given a tag (say Verb), that it will be associated with a given word (say Playing). The emission probability B[Verb][Playing] is calculated using:

P(Playing | Verb) = Count(Playing tagged as Verb) / Count(Verb)

π: Initial probability distribution over tags, i.e. the probability of each tag starting a sentence (this shows up later as the <s> row of the transition matrix).

It must be noted that all these Count() values come from the corpus used for training.
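As a rough illustration of these count-based formulas, here is a minimal Python sketch that estimates the ‘A’ and ‘B’ entries from a tiny hand-tagged toy corpus. The two sentences are invented for the example (the article itself uses the WSJ corpus), and no smoothing is applied.

```python
from collections import defaultdict

# A toy hand-tagged corpus (invented for illustration; the article uses the WSJ corpus).
# Each sentence is a list of (word, tag) pairs; "<s>" stands for the start-of-sentence state.
corpus = [
    [("Ram", "NNP"), ("is", "VB"), ("playing", "VB"), ("cricket", "NN")],
    [("Janet", "NNP"), ("will", "MD"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
]

tag_count = defaultdict(int)          # Count(tag), including "<s>"
transition_count = defaultdict(int)   # Count(prev_tag followed by tag)
emission_count = defaultdict(int)     # Count(word tagged with tag)

for sentence in corpus:
    prev = "<s>"
    tag_count["<s>"] += 1
    for word, tag in sentence:
        transition_count[(prev, tag)] += 1
        emission_count[(word, tag)] += 1
        tag_count[tag] += 1
        prev = tag

def A(prev_tag, tag):
    """Transition probability P(tag | prev_tag) = Count(prev_tag -> tag) / Count(prev_tag)."""
    return transition_count[(prev_tag, tag)] / tag_count[prev_tag]

def B(word, tag):
    """Emission probability P(word | tag) = Count(word tagged as tag) / Count(tag)."""
    return emission_count[(word, tag)] / tag_count[tag]

print(A("<s>", "NNP"))   # both toy sentences start with NNP -> 1.0
print(B("will", "MD"))   # 'will' is the only MD word in the toy corpus -> 1.0
```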

A sample HMM with both ‘A’ & ‘B’ matrices will look like this:

Here, the black, straight arrows represent values of the Transition matrix ‘A’ while the dotted black arrows represent values of the Emission matrix ‘B’ for a system with Q: {MD, VB, NN}.

Taking a step further,

DECODING using HMMs:

Given an HMM (Transition Matrix, Emission Matrix) and a sequence of observations O = o1, o2, …, oT (words in the sentences of a corpus), find the most probable sequence of states Q = q1, q2, …, qT (POS Tags in our case).

The 2 major assumptions followed while decoding a tag sequence using HMMs:

  • The probability of a word appearing depends only on its own tag and is independent of neighboring words and tags.
  • The probability of a tag depends only on the previous tag (bigram HMM) rather than the entire previous tag sequence, i.e. it shows the Markov property (though we can be flexible with this).

Together, these assumptions mean P(w_1…w_n, t_1…t_n) ≈ the product over i of P(w_i | t_i) * P(t_i | t_i-1), which is exactly the quantity the decoder maximizes over tag sequences.

VITERBI ALGORITHM:

The decoding algorithm used for HMMs is called the Viterbi algorithm, named after Andrew Viterbi, a co-founder of Qualcomm, an American MNC we all would have heard of.

Without wasting time, let’s dive deeper

First of all, we need to set up a probability matrix called the lattice, where the columns are our observables (the words of a sentence, in the same order as in the sentence) & the rows are the hidden states (all known POS Tags). The sentence ‘Janet will back the bill’ has the below lattice:

Kindly ignore the different shades of blue used for POS Tags for now!!

Here you can observe the columns (Janet, will, back, the, bill) & the rows as all known POS Tags.

Interpretation of this lattice is important:

Each cell of the lattice is denoted V_t(j) (‘t’ represents the column & j the row) and is called the Viterbi path probability. It represents the probability that the HMM is in state j (the present POS Tag) after seeing the first t observations (the words for which lattice values have already been calculated) and passing through the most probable state sequence (of previous POS Tags) q_1…q_t−1.

That means if I am at ‘back’, I have passed through ‘Janet’ & ‘will’ in the most probable states.

But how is this V_t(j) computed?

V_t(j) = max over i: V_t-1(i) * a(i,j) * b_j(O_t)

where ‘a’ (the transition matrix) & ‘b’ (the emission matrix) come from the HMM calculations discussed above, i ranges over the possible previous states (tags), and O_t is the t-th observation (word).

Feeling Alien !!

Examples always ease things out.

Do remember we are considering a bigram HMM where the present POS Tag depends only on the previous tag.

I am picking up the same sentence ‘Janet will back the bill’.

Our aim is to get something like this:

Result: Janet/NNP will/MD back/VB the/DT bill/NN

where NNP, MD, VB, DT, and NN are all POS Tags (can’t explain each of them here!!)

Before beginning, let’s get our required matrices calculated from the WSJ corpus using the HMM mathematics discussed above.

A: Transition probability matrix (only a part of it is extracted here, as the full matrix is very big)

The 1st row of the matrix, <s>, represents the initial_probability_distribution, denoted by π in the explanations above.

B: Emission probability matrix

Now, we shall begin. Do have a look at the below image.

According to our example, we have 5 columns (representing 5 words in the same sequence). We shall start by filling in the values for ‘Janet’. For this, I will use v_1(j) = P(POS Tag j | start) * P(‘Janet’ | j), where P(POS Tag | start) comes from the very first row (initial_probabilities) of the transition matrix ‘A’.

  • We will calculate the value v_1(1) (lowermost row, 1st value in column ‘Janet’). Here we get 0.28 (P(NNP | Start), from ‘A’) * 0.000032 (P(‘Janet’ | NNP), from ‘B’), which is approximately 0.000009
  • In the same way we get v_1(2) as 0.0006 (P(MD | Start)) * 0 (P(Janet | MD)) = 0

From the next word onwards we will be using the below-mentioned formula for assigning values:

For the word ‘will’, we will use the formula

V_t(j) = max over i: V_t-1(i) * a(i,j) * b_j(O_t)

But we know that b_j(O_t) is the same for every term inside the max for that cell. Hence, while calculating max over i: V_t-1(i) * a(i,j) * b_j(O_t), we can first find max over i: V_t-1(i) * a(i,j) and then multiply by b_j(O_t); it makes no difference.

Here i: previous tag, j: present tag

Hence we need to calculate Max over i of (V_t-1(i) * a(i,j)), where j represents the current row (POS Tag) in the column ‘will’.

It must be noted that V_t(j) can be interpreted as V[j,t] in the Viterbi matrix to avoid confusion.
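A tiny Python check of this factoring trick; all the numbers here are arbitrary placeholders, used only to demonstrate that multiplying by b_j(O_t) before or after taking the max gives the same result.

```python
# Placeholder values for one lattice cell: previous-column Viterbi values,
# the transition probabilities into the current tag j, and the emission probability.
prev_column = [9e-08, 0.0, 0.0]       # V_{t-1}(i) values (placeholders)
a_into_j    = [0.01, 0.3, 0.5]        # a(i, j) values (placeholders)
b_j = 0.308                           # b_j(O_t) (placeholder)

direct   = max(v * a * b_j for v, a in zip(prev_column, a_into_j))
factored = max(v * a for v, a in zip(prev_column, a_into_j)) * b_j
print(direct == factored)   # True: factoring b_j out of the max changes nothing
```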

LOADS OF CONFUSION

I will sort this out:

Consider j = 2, i.e. POS Tag: MD. I will be calculating V_2(2).

  • The cell V_2(2) will get 7 values from the previous column (all 7 possible states will be sending values) & we need to pick up the max value.
  • Consider V_1(1), i.e. the NNP POS Tag. We calculated V_1(1) = 0.000009
  • If you observe closely, V_1(2) = 0, V_1(3) = 0 … V_1(7) = 0, as P(Janet | any POS Tag other than NNP) = 0 in the emission probability matrix.
  • Now, we need to take these 7 values & multiply each by the transition probability into the POS Tag denoted by ‘j’, i.e. MD for j = 2
  • V_1(1) * P(MD | NNP) = 0.000009 * 0.01 = 0.00000009

V_1(2) * P(MD | MD) = 0 * 0.0002 = 0, and so on…

  • In the same way, since all the other V_1(n), n = 2 → 7, are 0 for ‘Janet’, we conclude that V_1(1) * P(MD | NNP) has the max value amongst the 7 values coming from the previous column.
  • Now we multiply this by b_j(O_t), i.e. the emission probability

P(‘will’ | MD) = 0.308

  • Hence V_2(2) = Max(V_1(i) * a(i,j)) * P(will | MD) = 0.00000009 * 0.308 = 2.772e-8

We will calculate one more value, V_2(5), i.e. for POS Tag NN for the word ‘will’.

Again, we will have V_1(NNP) * P(NN | NNP) as the highest because all the other values in V_1 are 0.

Hence V_2(5) = Max(V_1(i) * a(i,j)) * P(‘will’ | NN) = 0.000000009 * 0.0002 = 0.0000000000018
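As a quick sanity check of these hand calculations, here is a tiny Python snippet that reproduces v_1(1) and V_2(2) using only the rounded probabilities quoted above (the exact values sit in the WSJ matrices shown earlier):

```python
# Reproducing the first two lattice cells using only the probabilities quoted in the text.
p_NNP_given_start = 0.28        # A[<s>][NNP]
p_janet_given_NNP = 0.000032    # B[NNP][Janet]
p_MD_given_NNP = 0.01           # A[NNP][MD]
p_will_given_MD = 0.308         # B[MD][will]

# v_1(NNP): probability of starting in NNP and emitting 'Janet'
v1_NNP = p_NNP_given_start * p_janet_given_NNP
print(v1_NNP)                   # 8.96e-06, which the text rounds to ~0.000009

# V_2(MD): the best previous cell is v_1(NNP), since every other v_1 value is 0
v2_MD = v1_NNP * p_MD_given_NNP * p_will_given_MD
print(v2_MD)                    # ~2.76e-08, matching the ~2.772e-8 above up to rounding
```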

I guess you can now fill in the remaining values on your own for the later columns. Once we fill in the column for the last word, we pick the max-value cell in that column and trace back through the most probable previous cells, choosing the corresponding Tag for each column (word) along the way. For example, NNP will be chosen as the POS Tag for ‘Janet’.

Let’s go through the algorithm as well:

Explanation:

  • Create a Viterbi lattice of size N (states/POS Tags) x T (observables/words); also initialize a backpointer matrix (N x T) which will help us trace back the most probable POS Tag sequence
  • Initialize the first column of the Viterbi matrix with the initial_probability_distribution values (1st row in ‘A’) * the emission probabilities of the 1st word given each tag (as we did for ‘Janet’)
  • Set the backpointers’ first column to 0 (representing no previous tag for the 1st word)
  • Now, using a nested loop with the outer loop over all words & inner loop over all states:

Calculate V(s(state), t(observable)) = max over s’: V_t-1(s’) * a(s’,s) * b_s(O_t)

here s’ refers to the previous state.

  • Set backpointer[s,t] = the previous tag from which we moved to this tag. For example, in calculating V_2(2) we took NNP as the previous tag, hence we store backpointer[MD, will] = NNP, & likewise.
  • Once the nested loop is over, calculate best_path_probability by taking the max value from the last column of the Viterbi matrix (i.e. the max probability over all tags for the last word). Also, select the tag with the highest probability for the last word as the best_path_pointer variable.
  • Now, using this best_path_pointer, trace back (using the backpointer matrix) to the previous state, i.e. backpointer[best_tag_chosen, ‘bill’], & recursively go back to the initial word ‘Janet’, where the backpointer we set to 0 is used as the terminating state. A small Python sketch of these steps follows below.
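Here is a minimal Python sketch of the steps just described. It only mirrors the list above: pi, A, and B are whatever initial/transition/emission estimates you have (for example, the count-based ones sketched earlier), and unseen words simply get probability 0 since no smoothing is applied.

```python
def viterbi(words, tags, pi, A, B):
    """
    words: the sentence as a list of observations
    tags:  all possible hidden states (POS tags)
    pi:    pi[tag] = P(tag | <s>), the first row of the transition matrix
    A:     A[prev_tag][tag] = P(tag | prev_tag)
    B:     B[tag][word] = P(word | tag); missing words get probability 0
    Returns the most probable tag sequence for the sentence.
    """
    T = len(words)
    # lattice[t][tag] = Viterbi path probability; backpointer[t][tag] = best previous tag
    lattice = [{} for _ in range(T)]
    backpointer = [{} for _ in range(T)]

    # Initialization: first column = pi * emission, backpointer points to the start marker
    for tag in tags:
        lattice[0][tag] = pi.get(tag, 0.0) * B.get(tag, {}).get(words[0], 0.0)
        backpointer[0][tag] = "<s>"

    # Recursion: every cell keeps the max over all possible previous tags
    for t in range(1, T):
        for tag in tags:
            best_prev, best_score = tags[0], -1.0
            for prev in tags:
                score = lattice[t - 1][prev] * A.get(prev, {}).get(tag, 0.0)
                if score > best_score:
                    best_prev, best_score = prev, score
            lattice[t][tag] = best_score * B.get(tag, {}).get(words[t], 0.0)
            backpointer[t][tag] = best_prev

    # Termination: best tag for the last word, then follow the backpointers to the start
    best_last = max(tags, key=lambda tag: lattice[T - 1][tag])
    path = [best_last]
    for t in range(T - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    return list(reversed(path))


# Hypothetical usage, assuming pi, A, B were built from a tagged corpus:
# tags = ["NNP", "MD", "VB", "JJ", "NN", "RB", "DT"]
# print(viterbi("Janet will back the bill".split(), tags, pi, A, B))
```

The returned list lines up with the input words, so zipping the two gives exactly the Janet/NNP will/MD … style output we were aiming for.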

A big thanks to Dr. Jurafsky for an incredible book: https://web.stanford.edu/~jurafsky/slp3/8.pdf
