NLP: Text Segmentation Using Hidden Markov Model

In Naive Bayes, we use the joint probability to calculate the probability of a label y, assuming the input values are conditionally independent. This assumption does not hold well in the text segmentation problem, because sequences of characters (or series of words) are dependent on each other.

Hidden Markov Model (HMM)

Previously, from the Naive Bayes model, the joint probability factorizes as:

P(X, Y) = ∏_k P(Y_k) P(X_k|Y_k)

HMM adds the state transition P(Y_k|Y_k-1), so we have:

P(X, Y) = ∏_k P(Y_k|Y_k-1) P(X_k|Y_k)

So in HMM, we change from P(Y_k) to P(Y_k|Y_k-1).

Similar to Naive Bayes, this model is a generative approach: it models the inputs by modeling the joint probability P(X, Y) and then uses Bayes' theorem to get P(Y|X). Since P(X) is the same for every candidate tag sequence, maximizing P(Y|X) is equivalent to maximizing the joint P(X, Y).

The P(X_k|Y_k) term is the emission matrix we have seen earlier. The emission matrix gives the probability of a character for a given tag, which is the same quantity used in Naive Bayes. Here is an example matrix of the probability of each input character given a tag:

Emission Matrix

The P(Y_k|Y_k-1) portion is the probability of transitioning from one tag to the next. For example, the probability of the current tag (Y_k) being 'B' given that the previous tag (Y_k-1) is 'S' would be 0.8 in the chart below. This is called the transition matrix.

Transition Matrix
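As a concrete illustration, both matrices can be estimated by simple counting over a tagged corpus. Below is a minimal sketch in Python; the toy corpus, the <START> pseudo-tag, and the dictionary layout are assumptions for illustration, not the exact code behind the matrices above.

from collections import defaultdict

# Toy tagged corpus: each sentence is a list of (character, tag) pairs, where
# B/M/E mark the beginning/middle/end of a word and S marks a single-character word.
corpus = [
    [("t", "B"), ("h", "M"), ("i", "M"), ("s", "E"),
     ("i", "B"), ("s", "E"),
     ("a", "S")],
]

transition_counts = defaultdict(lambda: defaultdict(int))  # counts for P(Y_k|Y_k-1)
emission_counts = defaultdict(lambda: defaultdict(int))    # counts for P(X_k|Y_k)

for sentence in corpus:
    prev_tag = "<START>"
    for char, tag in sentence:
        transition_counts[prev_tag][tag] += 1
        emission_counts[tag][char] += 1
        prev_tag = tag

def normalize(counts):
    # Turn raw counts into conditional probability tables.
    return {
        given: {k: v / sum(row.values()) for k, v in row.items()}
        for given, row in counts.items()
    }

transition = normalize(transition_counts)  # transition matrix P(Y_k|Y_k-1)
emission = normalize(emission_counts)      # emission matrix P(X_k|Y_k)

print(transition)
print(emission)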

To illustrate in graph form, we can think of Naive Bayes as modeling a joint probability between each label and its input, with each label-input pair independent of the others. It can be shown as:

[B]  [M]  [M]  [E]  [B]  [E]  [S] ...
 |    |    |    |    |    |    |
[t]  [h]  [i]  [s]  [i]  [s]  [a] ...

For HMM, the graph shows the dependencies between states:

[Start]=>[B]=>[M]=>[M]=>[E]=>[B]=>[E]=>[S] ...
          |    |    |    |    |    |    |
         [t]  [h]  [i]  [s]  [i]  [s]  [a] ...
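To connect the picture back to the formula, the joint probability of a character sequence and a tag sequence is one transition term and one emission term per position, multiplied together. A minimal sketch, reusing the transition and emission dictionaries from the counting example above (the <START> pseudo-tag and dictionary layout are assumptions):

def hmm_joint_probability(chars, tags, transition, emission):
    # P(X, Y) = product over k of P(Y_k|Y_k-1) * P(X_k|Y_k)
    prob = 1.0
    prev_tag = "<START>"
    for char, tag in zip(chars, tags):
        prob *= transition.get(prev_tag, {}).get(tag, 0.0)
        prob *= emission.get(tag, {}).get(char, 0.0)
        prev_tag = tag
    return prob

# The sequence from the diagram above: "thisisa" tagged as B M M E B E S.
print(hmm_joint_probability("thisisa", "BMMEBES", transition, emission))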

Here is another general illustration of Naive Bayes and HMM.

Next, we need to find the most probable tag sequence.

Viterbi Algorithm

We can visualize this in the trellis below, where each node is a distinct state (tag) at a given position in the sequence, and each arrow is a possible transition to a state at the next position. The idea is to find the path that gives the maximum probability from the beginning of the sequence to the end by filling out the trellis of all possible values.

Trellis — example of sequence text ‘test’

In the original algorithm, the calculation takes the product of the probabilities, and the result gets very small as the sequence gets longer (larger k). At some point the value becomes too small for floating-point precision and ends up as 0, giving an incorrect result. This is called "underflow". The fix is to sum log probabilities instead of multiplying raw probabilities; since the log function is monotonically increasing, the best path stays the same.
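Here is a minimal sketch of the Viterbi recursion in log space, following the underflow fix just described. The dictionary-based tables, the <START> pseudo-tag, and the function names are assumptions carried over from the earlier sketches, not the exact implementation used in the experiment.

import math

def viterbi(chars, tags, transition, emission, start_tag="<START>"):
    # Work in log space so the path score does not underflow to 0 on long sequences.
    def log_prob(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[k][tag]: best log score of any path ending in `tag` at position k
    # back[k][tag]: the previous tag on that best path (for backtracking)
    best = [{}]
    back = [{}]
    for tag in tags:
        best[0][tag] = (log_prob(transition.get(start_tag, {}).get(tag, 0.0))
                        + log_prob(emission.get(tag, {}).get(chars[0], 0.0)))
        back[0][tag] = None

    for k in range(1, len(chars)):
        best.append({})
        back.append({})
        for tag in tags:
            emit = log_prob(emission.get(tag, {}).get(chars[k], 0.0))
            score, prev = max(
                (best[k - 1][p] + log_prob(transition.get(p, {}).get(tag, 0.0)), p)
                for p in tags
            )
            best[k][tag] = score + emit
            back[k][tag] = prev

    # Backtrack from the best final state to recover the full tag sequence.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for k in range(len(chars) - 1, 0, -1):
        path.append(back[k][path[-1]])
    return list(reversed(path))

# Example: decode the text 'test' over the B/M/E/S tag set.
# print(viterbi("test", ["B", "M", "E", "S"], transition, emission))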

Experiment

Performance was evaluated on a dataset of 100 articles with a 20% test split:

  • Training set: 799 sentences, 28,287 words
  • Test set: 182 sentences, 7,072 words

Detailed metrics:

              precision    recall  f1-score   support

           0       0.95      0.76      0.84     25107
           1       0.50      0.84      0.63      7072

    accuracy                           0.78     32179
   macro avg       0.72      0.80      0.73     32179
weighted avg       0.85      0.78      0.79     32179
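The layout of the table above matches scikit-learn's classification_report. As a hedged sketch, a report like this can be produced from flat per-character gold and predicted labels (the 0/1 labels below are placeholders, not the article's data):

from sklearn.metrics import classification_report

# Flat per-character labels: y_true from the annotated test set, y_pred from the model.
y_true = [0, 1, 0, 0, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 0, 0]

print(classification_report(y_true, y_pred))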

We can also use higher-order HMMs, analogous to bigram and trigram language models. The model described here is a first-order HMM, which is similar to a bigram model; a second-order HMM conditions each state on the two previous states, which is similar to a trigram model, as sketched below.
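As a sketch of what changes in a second-order model, the transition table is keyed on the previous two tags instead of one (reusing the corpus and the normalize helper from the counting example above; purely illustrative):

from collections import defaultdict

transition2_counts = defaultdict(lambda: defaultdict(int))

for sentence in corpus:
    prev2, prev1 = "<START>", "<START>"
    for char, tag in sentence:
        # Second-order transition counts for P(Y_k | Y_k-2, Y_k-1).
        transition2_counts[(prev2, prev1)][tag] += 1
        prev2, prev1 = prev1, tag

transition2 = normalize(transition2_counts)  # keyed on the pair (Y_k-2, Y_k-1)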

Other work on Chinese segmentation [5] reports performance of around 83% to 89% on different datasets.

Puthick Hok [1] reported HMM performance on Khmer documents of 95% accuracy on documents with a low number of unknown or mistyped words.

Shortcomings of HMMs

There is also a mismatch between the learning objective and the prediction task: the HMM objective learns the joint distribution of states and observations, P(Y, X), but for prediction we need the conditional distribution P(Y|X).

Conclusion

References
