Probability, parts-of-speech tagging, hidden Markov states and the Viterbi algorithm

Can we “teach” a computer how to recognize the parts of speech of each word in a sentence?

Hiren Nandigama
The Quantastic Journal
8 min read · Jul 14, 2024



Give a computer a sentence and ask it to come up with the parts of speech tag for each word. It cannot do that unless we explicitly tell it how. Can we “teach” a computer how to recognize the parts of speech of each word in a sentence? We can! And all it takes is some understanding of counting, probability, hidden Markov states and the Viterbi algorithm.

Probability

Probability is chance: it tells us how likely a random event is to occur. If we want the chance of a particular outcome, we can simply count the favorable outcomes and divide by the count of all possible outcomes.

P(E) = C(occurrence of favorable outcome) / C(all outcomes)

We say a random event here because we want our chance to be unaffected by other factors.

Example: The probability of your phone landing on its screen when dropped is 1/2, since C(E) = 1 and C(all outcomes) = 2. (Assume it’s a flat phone with equal weight distribution.)
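As a quick sketch of this counting view of probability, here is a tiny Python snippet; the list of recorded drop outcomes is made up purely for illustration.

```python
# Hypothetical record of phone drops: each entry is the side the phone landed on.
outcomes = ["screen", "back", "screen", "back", "back", "screen", "screen", "back"]

favorable = sum(1 for o in outcomes if o == "screen")  # C(occurrence of favorable outcome)
total = len(outcomes)                                  # C(all outcomes)

print(f"P(lands on screen) = {favorable / total}")     # 0.5 for this made-up data
```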

Conditional Probability

Conditional probability is the probability of an event occurring given that another event has already occurred. We calculate it as follows:

P(E1 | E2) = P(E1 ∩ E2) / P(E2)

read as the probability of E1 given that E2 has occurred.

When all outcomes are equally likely, both probabilities can be written as counts over the same sample space, and the formula simplifies to:

P(E1 | E2) = C(E1 ∩ E2) / C(E2)

In other words, conditioning on E2 restricts our attention to the outcomes in which E2 happened, and within those we count how often E1 also happened.

Example: The probability of your phone landing on its screen given that it slipped from your hand is 1/2. (Assume it’s a flat phone with equal weight distribution.)

Here C(E1 ∩ E2) = 1 (it slipped and landed on its screen) and C(E2) = 2 (it slipped and landed on either side), so P(E1 | E2) = 1/2.
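The same counting idea carries over directly to code. A minimal sketch, again with a made-up log of events, where each entry records how the phone fell and which side it landed on:

```python
# Hypothetical log: (how the phone fell, side it landed on)
events = [
    ("slipped", "screen"),
    ("slipped", "back"),
    ("knocked off table", "screen"),
    ("slipped", "screen"),
    ("slipped", "back"),
]

# P(E1 | E2) = C(E1 ∩ E2) / C(E2): restrict attention to events where E2 (slipped) occurred.
slipped = [side for cause, side in events if cause == "slipped"]
p_screen_given_slipped = slipped.count("screen") / len(slipped)

print(f"P(lands on screen | slipped) = {p_screen_given_slipped}")  # 2/4 = 0.5 here
```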

How does this tie together with parts of speech tagging?

Problem Statement: Given a sentence, we need to tag parts of speech for each word in it.

Example: Janet will back the bill

Data: A corpus that contains sentences along with the POS tag for each word.

Different POS tags and their abbreviations

Assumption: We will base our solution on the assumption that the parts of speech are the underlying structure and that the words are generated from them. How do we know which POS would come first, which would come second given the first, and so on? If we look carefully, this resembles conditional probability. Our notation would look something like this:

P(POS1 | <s>) = C(<s> POS1) / C(<s>)

Here <s> denotes the start token, so this is the probability of seeing POS1 at the start of a sentence.

P(POS2 | POS1) = C(POS1 POS2) / C(POS1)

How do we calculate this? We take the ratio of the number of times we observe POS2 after POS1 to the number of times we observe POS1.
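As a sketch of how these counts could be gathered from a tagged corpus, here is a small Python snippet; the two tagged sentences are made up for illustration, while a real corpus (for example the treebanks used by Jurafsky & Martin) would be far larger.

```python
from collections import Counter

# Hypothetical tagged corpus: each sentence is a list of (word, POS) pairs.
corpus = [
    [("Janet", "NNP"), ("will", "MD"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
    [("the", "DT"), ("bill", "NN"), ("will", "MD"), ("pass", "VB")],
]

tag_count = Counter()     # C(POS1)
bigram_count = Counter()  # C(POS1 POS2): POS2 observed right after POS1

for sentence in corpus:
    tags = ["<s>"] + [pos for _, pos in sentence]
    for prev, curr in zip(tags, tags[1:]):
        tag_count[prev] += 1
        bigram_count[(prev, curr)] += 1

# P(POS2 | POS1) = C(POS1 POS2) / C(POS1)
print(bigram_count[("MD", "VB")] / tag_count["MD"])      # 2/2 = 1.0 in this toy corpus
print(bigram_count[("<s>", "NNP")] / tag_count["<s>"])   # 1/2 = 0.5
```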

How can we tie this to the words in the above example? Let us consider the POS with the highest probability, i.e., MAX( P(POS | <s>) ), to be the starting POS. Can we use this information to estimate the likelihood of seeing a word given the POS? We can, and it is given by:

P(W | POS) = C(W POS) / C(POS)

How do we calculate this? We take the ratio of the number of times we observe the word W tagged as POS to the number of times we observe POS.

This conditional probability sits well with our assumption that POS tags form the basis and that, given them, we generate the words.
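The emission counts can be gathered in the same way as the transition counts; a sketch over the same kind of hypothetical tagged corpus as in the previous snippet:

```python
from collections import Counter

# Same hypothetical tagged corpus as before: (word, POS) pairs.
corpus = [
    [("Janet", "NNP"), ("will", "MD"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
    [("the", "DT"), ("bill", "NN"), ("will", "MD"), ("pass", "VB")],
]

pos_count = Counter()       # C(POS)
word_pos_count = Counter()  # C(W POS): word W observed with tag POS

for sentence in corpus:
    for word, pos in sentence:
        pos_count[pos] += 1
        word_pos_count[(word, pos)] += 1

# P(W | POS) = C(W POS) / C(POS)
print(word_pos_count[("will", "MD")] / pos_count["MD"])  # 2/2 = 1.0
print(word_pos_count[("back", "VB")] / pos_count["VB"])  # 1/2 = 0.5
```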

We can observe here that there are two kinds of probabilities involved: one governs the hidden states (the POS tags) that underlie what we observe, and the other is the likelihood of the observable states (the words) given those hidden states. What we just defined informally is the hidden Markov model. The probabilities between hidden states are called transition probabilities, and the probabilities of the observable states given the hidden ones are called emission probabilities.

Now all we need to do is find the most probable sequence of POS given the observable words, which is our problem statement.

Formally, we want the tag sequence that maximizes the product of the emission and transition probabilities over the whole sentence:

best POS1 … POSn = argmax over (POS1 … POSn) of ∏ P(Wi | POSi) * P(POSi | POSi-1)

This formula will make sense in the coming parts of this article!

Please refer to 3Blue1Brown’s video (https://www.youtube.com/watch?v=HZGCoVF3YvM) for an intuition for Bayes’ theorem, and to page 11 of (https://web.stanford.edu/~jurafsky/slp3/8.pdf) for the derivation of the above formula. Covering these here would lengthen this article, and their explanations are far more approachable!

Note: Wouldn’t P(POS10 | POS1, POS2 … POS9) give us a better estimate than P(POS10 | POS9)? It would, but such a long prior sequence is expensive to compute, and we might not even find that exact sequence in our training data. We can simplify this using the Markov assumption. Simply put, the Markov assumption states that the probability of a future event depends only on the previous state and not on its entire history. We can interpret this as the previous state already encoding, in a sense, all the information that came before it. That is why we only consider pairs such as P(POS2 | POS1). This is called a bigram.

Practicals

Let’s assume that there are only 7 POS tags in our corpus. The table below contains P(POSn | POSn-1), also called the transition probabilities.

Transition probabilities calculated using corpus of data

Now let’s find the emission probabilities. The table below contains P(W | POS).

Emission probabilities calculated using corpus of data
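The tables above hold the actual numbers computed from the corpus. As a sketch of how such tables might be stored in code, here is one possible representation; the probabilities below are placeholder values for illustration, not the numbers from the tables.

```python
# Illustrative placeholder values, NOT the numbers from the corpus tables above.
tags = ["NNP", "MD", "VB", "JJ", "NN", "RB", "DT"]

# transition[(prev_tag, tag)] = P(tag | prev_tag); "<s>" is the start token.
transition = {
    ("<s>", "NNP"): 0.29, ("<s>", "DT"): 0.20, ("<s>", "NN"): 0.10,
    ("NNP", "MD"): 0.40,  ("MD", "VB"): 0.31,  ("VB", "DT"): 0.26,
    ("DT", "NN"): 0.38,
    # ... remaining (prev_tag, tag) pairs
}

# emission[(word, tag)] = P(word | tag)
emission = {
    ("Janet", "NNP"): 0.9, ("will", "MD"): 0.8, ("back", "VB"): 0.7,
    ("the", "DT"): 0.9, ("bill", "NN"): 0.8,
    # ... remaining (word, tag) pairs
}
```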

Now we need to find the most probable sequence of POS tags given the observed words. If we look at the transitions below, there are p^w possible tag sequences, where p is the number of POS tags and w is the number of words. That is 7^5 = 16,807 different paths, any one of which could turn out to be the most probable. Is there a way we can shorten this? There is an algorithm that simplifies this, called the Viterbi algorithm. Let’s see how it achieves the same goal in far fewer steps.

“p” number of parts of speech and “w” number of words
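To see where 16,807 comes from, a brute-force search would have to score every possible tag assignment. A quick sketch of that combinatorics:

```python
from itertools import product

tags = ["NNP", "MD", "VB", "JJ", "NN", "RB", "DT"]
words = "Janet will back the bill".split()

# Every way of assigning one of the 7 tags to each of the 5 words.
all_sequences = list(product(tags, repeat=len(words)))
print(len(all_sequences))  # 7 ** 5 = 16807
```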

Let’s first take the first two words, “Janet will”. At time step t1 we can start at each of the 7 different POS tags and compute each POS tag’s sequence probability, given by:

=> ∀ POS tags compute P(POS | <s>) * P(Janet | POS) (Let’s call this x)

Note: The symbol ∀ means “for all”

From there we can go to any of the 7 POS tags at time step t2, in 7*7 = 49 ways. For example, DT can be preceded by DT, RB, NN, JJ, VB, MD or NNP, and the same holds for the other POS tags. If DT were the POS tag at t2, then exactly 1 of those 7 predecessors yields the highest probability of reaching it. Why is it only the transition probability between the preceding POS tag and the current one that affects the sequence probability? The formula for the sequence probability up to time step 2 is given by:

=> argmax over POS t1 [P(POS t1 | <s>) * P(Janet | POS t1) * P(POS t2 | POS t1) * P(will | POS t2)] (Let’s call this m)

=> P(POS t1 | <s>) * P(Janet | POS t1) is already computed at time step t1 and is different for different POS t1

=> P(POS t2 | POS t1) is different for different POS t1 (Let’s call this y)

=> P(will | POS t2) is constant as POS t2 in our case is currently DT (Let’s call this z)

To find the preceding POS t1 that yields the maximum probability coming into POS t2, we can take argmax over POS t1 of [P(POS t1 | <s>) * P(Janet | POS t1) * P(POS t2 | POS t1)] and discard all other paths. We can repeat the same with the other POS tags at time step t2. In the end, we are left with the most probable path into each POS tag at step t2.

Highest sequence probability visualization until time step t2
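A sketch of that single step in code: for each candidate tag at t2, we keep only the best predecessor from t1 and remember it. The probability values here are made-up placeholders, purely to show the shape of the computation.

```python
# viterbi_t1[tag] = P(tag | <s>) * P("Janet" | tag), already computed at time step t1.
# transition[(prev, curr)] = P(curr | prev); emission[(word, tag)] = P(word | tag).
# All numbers are made-up placeholders.
viterbi_t1 = {"NNP": 8.8e-6, "MD": 0.0, "VB": 0.0, "DT": 0.0, "NN": 1.0e-6, "RB": 0.0, "JJ": 0.0}
transition = {("NNP", "MD"): 0.38, ("NN", "MD"): 0.08}
emission = {("will", "MD"): 0.30}

viterbi_t2, backpointer_t2 = {}, {}
for curr in viterbi_t1:                      # candidate tag at time step t2
    best_prev, best_score = None, 0.0
    for prev in viterbi_t1:                  # candidate predecessor at time step t1
        score = viterbi_t1[prev] * transition.get((prev, curr), 0.0)
        if score > best_score:
            best_prev, best_score = prev, score
    viterbi_t2[curr] = best_score * emission.get(("will", curr), 0.0)
    backpointer_t2[curr] = best_prev         # the winning t1 tag; used later for back tracing

print(viterbi_t2["MD"], backpointer_t2["MD"])
```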

Now let’s take the third word. Up to time step t2, at each POS tag we have computed the most probable path into it. We can repeat the same steps as above and find the most probable path from each POS tag at t2 to each POS tag at t3, and then do the same for the POS tags at steps 4 and 5. At the 5th time step, we will have the most probable sequence ending at each final POS tag. Since at each time step we compute only the most probable path into each POS tag and discard the rest, we evaluate on the order of 7*7*5 = 245 paths instead of 16,807. Taking the highest sequence probability among the 7 POS tags at t5 gives us the end POS tag of the most probable sequence.

Highest sequence probability visualization until time step t3

Now that we know the end tag, how do we trace back the POS tags at steps 4, 3, 2 and 1 that yielded the highest probability? At each time step, while computing the highest sequence probability coming into each POS tag, we can store the preceding tag that yielded it. This way we keep track of which POS tag at time step t(n-1) achieved the maximum probability coming into each POS tag at time step t(n). At time step t5 we can unfold this tracker and backtrack to find the most probable sequence of tags given the observed words.

Back tracing visualization
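Putting all the pieces together, here is a compact sketch of a Viterbi decoder with back pointers. The transition and emission tables are passed in as dictionaries shaped like the hypothetical ones sketched earlier; missing entries are treated as probability 0, and the numbers in the usage example are made up, not the corpus values.

```python
def viterbi(words, tags, transition, emission):
    """Return the most probable tag sequence for `words`.

    transition[(prev_tag, tag)] = P(tag | prev_tag), with "<s>" as the start token.
    emission[(word, tag)]       = P(word | tag).
    Missing entries are treated as probability 0.
    """
    trellis = [{}]      # trellis[t][tag] = best sequence probability ending in `tag` at step t
    backpointer = [{}]  # backpointer[t][tag] = previous tag on that best path

    # Initialization: P(tag | <s>) * P(word_0 | tag)
    for tag in tags:
        trellis[0][tag] = transition.get(("<s>", tag), 0.0) * emission.get((words[0], tag), 0.0)
        backpointer[0][tag] = "<s>"

    # Recursion: keep only the best predecessor for each tag at each time step.
    for t in range(1, len(words)):
        trellis.append({})
        backpointer.append({})
        for tag in tags:
            best_prev = max(tags, key=lambda prev: trellis[t - 1][prev] * transition.get((prev, tag), 0.0))
            best_score = trellis[t - 1][best_prev] * transition.get((best_prev, tag), 0.0)
            trellis[t][tag] = best_score * emission.get((words[t], tag), 0.0)
            backpointer[t][tag] = best_prev

    # Termination: pick the best final tag, then follow the back pointers to recover the path.
    last_tag = max(tags, key=lambda tag: trellis[-1][tag])
    path = [last_tag]
    for t in range(len(words) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    return list(reversed(path))


# Usage with made-up tables (the real numbers would come from the corpus counts above):
tags = ["NNP", "MD", "VB", "DT", "NN"]
transition = {("<s>", "NNP"): 0.29, ("NNP", "MD"): 0.40, ("MD", "VB"): 0.31,
              ("VB", "DT"): 0.26, ("DT", "NN"): 0.38}
emission = {("Janet", "NNP"): 0.9, ("will", "MD"): 0.8, ("back", "VB"): 0.7,
            ("the", "DT"): 0.9, ("bill", "NN"): 0.8}
print(viterbi("Janet will back the bill".split(), tags, transition, emission))
# ['NNP', 'MD', 'VB', 'DT', 'NN'] with these placeholder numbers
```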

References:

  1. Speech and Language Processing (3rd ed. draft), Dan Jurafsky and James H. Martin: https://web.stanford.edu/~jurafsky/slp3/
  2. Bayes theorem, the geometry of changing beliefs, 3Blue1Brown: https://www.youtube.com/watch?v=HZGCoVF3YvM
  3. The Viterbi Algorithm: Natural Language Processing, ritvikmath: https://www.youtube.com/watch?v=IqXdjdOgXPM
  4. Hidden Markov Model: Data Science Concepts, ritvikmath: https://www.youtube.com/watch?v=fX5bYmnHqqE&t=539s
