NLP: Text Segmentation Using Hidden Markov Model

--

In Naive Bayes, we use the joint probability to calculate the probability of a label y, assuming the input values are conditionally independent. This assumption does not hold well in the text segmentation problem, because sequences of characters, or series of words, are dependent.

Hidden Markov Model (HMM)

As an extension of Naive Bayes to sequential data, the Hidden Markov Model provides a joint distribution over the letters/tags that models the dependencies between adjacent tags: each tag y depends on the previous tag, and each observation x depends on its tag.

Previously, from the Naive Bayes model:

P(X, Y) = Π_k P(Y_k) · P(X_k|Y_k)

HMM adds the state transition P(Y_k|Y_k-1), so we have:

P(X, Y) = Π_k P(Y_k|Y_k-1) · P(X_k|Y_k)

So in HMM, we change the prior P(Y_k) to the transition probability P(Y_k|Y_k-1).
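To make the change concrete, here is a minimal sketch contrasting the two factorizations; the probability tables and the `"<s>"` start-of-sequence state are hypothetical, chosen only for illustration:

```python
# Naive Bayes: P(X, Y) = prod_k P(Y_k) * P(X_k | Y_k)
# HMM:         P(X, Y) = prod_k P(Y_k | Y_k-1) * P(X_k | Y_k)

prior = {"B": 0.3, "E": 0.3}                       # P(Y_k), used by Naive Bayes
transition = {("<s>", "B"): 0.5, ("B", "E"): 0.7}  # P(Y_k | Y_k-1), used by HMM
emission = {("B", "a"): 0.1, ("E", "b"): 0.2}      # P(X_k | Y_k), shared by both

def naive_bayes_joint(chars, tags):
    p = 1.0
    for c, t in zip(chars, tags):
        p *= prior[t] * emission[(t, c)]
    return p

def hmm_joint(chars, tags):
    p = 1.0
    prev = "<s>"
    for c, t in zip(chars, tags):
        p *= transition[(prev, t)] * emission[(t, c)]
        prev = t  # the next factor conditions on this tag
    return p

naive_bayes_joint("ab", ["B", "E"])  # 0.3*0.1 * 0.3*0.2 = 0.0018
hmm_joint("ab", ["B", "E"])          # 0.5*0.1 * 0.7*0.2 = 0.007
```

The only difference between the two functions is the first factor in the product: a per-position prior versus a transition from the previous tag.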

Similar to Naive Bayes, this model is a generative approach: it models the whole probability of the inputs by modeling the joint probability P(X, Y) and then uses Bayes' theorem to get P(Y|X).

P(X_k|Y_k) is the emission matrix we have seen earlier: the probability of a character given a tag, the same quantity used in Naive Bayes. Here is an example matrix of the joint probability of tag and input character:

The P(Y_k|Y_k-1) portion is the probability of each tag transitioning to an adjacent tag. For example, the probability of the current tag (Y_k) being ‘B’ given that the previous tag (Y_k-1) is ‘S’ is 0.8 in the chart below. This is called the transition matrix.
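As a sketch, the transition matrix can be stored as a nested table. Only P(B|S) = 0.8 comes from the chart; the remaining entries are hypothetical placeholders:

```python
# Transition matrix P(Y_k | Y_k-1) for the BMES tag set.
# Rows are the previous tag, columns the current tag; each row sums to 1.
# The zeros encode impossible BMES transitions (e.g. B cannot follow B,
# because a word beginning must be followed by M or E).
transition = {
    "B": {"B": 0.0, "M": 0.3, "E": 0.7, "S": 0.0},
    "M": {"B": 0.0, "M": 0.4, "E": 0.6, "S": 0.0},
    "E": {"B": 0.5, "M": 0.0, "E": 0.0, "S": 0.5},
    "S": {"B": 0.8, "M": 0.0, "E": 0.0, "S": 0.2},
}

# P(Y_k = 'B' | Y_k-1 = 'S'): look up row 'S', column 'B'
print(transition["S"]["B"])  # 0.8
```

In a trained model these values come from counting tag bigrams in the labeled corpus.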

To illustrate in graph form, Naive Bayes models the joint probability between each label and its input, with independence between the pairs. It can be shown as:

```
[B]  [M]  [M]  [E]  [B]  [E]  [S] ...
 |    |    |    |    |    |    |
[t]  [h]  [i]  [s]  [i]  [s]  [a] ...
```

For HMM, the graph shows the dependencies between states:

```
[Start]=>[B]=>[M]=>[M]=>[E]=>[B]=>[E]=>[S]...
          |    |    |    |    |    |    |
         [t]  [h]  [i]  [s]  [i]  [s]  [a]...
```

Here is another general illustration of Naive Bayes and HMM.

Next, we need to find the best sequence.

Viterbi Algorithm

To find the best-scoring sequence among all possible sequences, we use the Viterbi algorithm, which provides an efficient way of finding the most likely state sequence, i.e. the one with maximum probability.

We can visualize this in the trellis below, where each node is a distinct state at a given position in the sequence, and each arrow is a possible transition between states at adjacent positions. The idea is to fill out the trellis from the beginning of the sequence to the end, finding the path that gives us the maximum probability.

In the original algorithm, the calculation takes the product of the probabilities, and the result gets very small as the sequence gets longer (bigger k). At some point, the value becomes too small for floating-point precision and ends up as 0, giving an imprecise result. This is called “underflow”. The fix is to sum log probabilities instead of multiplying raw probabilities: since log is a monotonically increasing function, the most likely path is unchanged.
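A minimal log-space Viterbi sketch for the four BMES tags; the probability tables here are hypothetical placeholders, not the trained model from the experiment:

```python
import math

states = ["B", "M", "E", "S"]

def log(p):
    # log(0) is treated as -inf so impossible paths are never selected
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for obs, summing log
    probabilities instead of multiplying, which avoids underflow."""
    # trellis[k][s] = best log-probability of a path ending in state s at position k
    trellis = [{s: log(start_p.get(s, 0)) + log(emit_p[s].get(obs[0], 0))
                for s in states}]
    backptr = [{}]
    for k in range(1, len(obs)):
        trellis.append({})
        backptr.append({})
        for s in states:
            prev = max(states,
                       key=lambda p: trellis[k - 1][p] + log(trans_p[p].get(s, 0)))
            trellis[k][s] = (trellis[k - 1][prev]
                             + log(trans_p[prev].get(s, 0))
                             + log(emit_p[s].get(obs[k], 0)))
            backptr[k][s] = prev
    # trace the best final state back to the start
    best = max(states, key=lambda s: trellis[-1][s])
    path = [best]
    for k in range(len(obs) - 1, 0, -1):
        path.append(backptr[k][path[-1]])
    return path[::-1]

# Tiny hypothetical model in which "this" is tagged as one word: B M M E.
start_p = {"B": 0.6, "S": 0.4}
trans_p = {"B": {"M": 0.3, "E": 0.7},
           "M": {"M": 0.4, "E": 0.6},
           "E": {"B": 0.5, "S": 0.5},
           "S": {"B": 0.8, "S": 0.2}}
emit_p = {"B": {"t": 0.6, "i": 0.2, "a": 0.2},
          "M": {"h": 0.5, "i": 0.5},
          "E": {"s": 0.8, "i": 0.2},
          "S": {"a": 1.0}}

print(viterbi("this", start_p, trans_p, emit_p))  # ['B', 'M', 'M', 'E']
```

The trellis is filled left to right, and each cell keeps only the best incoming path, which is what makes the search efficient compared to enumerating all tag sequences.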

Experiment

We used an implementation of Chinese word segmentation[4] on our dataset and got 78% accuracy on 100 articles, as a baseline for the CRF comparison in a later article. Unlike the previous Naive Bayes implementation, this approach does not use the same features as CRF. In addition, we use the four states shown above.

Performance was measured on 100 articles with a 20% test split.

• Training set: 799 sentences, 28,287 words
• Test set: 182 sentences, 7,072 words

Detailed metrics:

```
              precision    recall  f1-score   support

           0       0.95      0.76      0.84     25107
           1       0.50      0.84      0.63      7072

    accuracy                           0.78     32179
   macro avg       0.72      0.80      0.73     32179
weighted avg       0.85      0.78      0.79     32179
```

We can also use higher-order HMMs, analogous to bigram and trigram language models. The model described here is a first-order HMM, which is similar to a bigram model; a second-order HMM, which conditions on the two previous states, is similar to a trigram model.
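As a sketch, the only structural change in a second-order HMM is that the transition table is keyed by the two previous tags instead of one (all values here are hypothetical):

```python
# First-order (bigram-like): P(Y_k | Y_k-1), keyed by one previous tag.
first_order = {("S", "B"): 0.8, ("S", "S"): 0.2}

# Second-order (trigram-like): P(Y_k | Y_k-2, Y_k-1), keyed by the two
# previous tags, so the model can distinguish e.g. an E,S history
# from an S,S history when predicting the next tag.
second_order = {("E", "S", "B"): 0.9, ("E", "S", "S"): 0.1}

# Both tables are conditional distributions, so the entries sharing
# the same history must sum to 1.
print(sum(second_order.values()))  # 1.0
```

The cost is a much larger table (and sparser counts), which is why first-order models are the common default.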

Other Chinese segmentation work[5] reports performance of around 83% to 89% on different datasets.

Puthick Hok[1] reported 95% HMM accuracy on Khmer documents with a low number of unknown or mistyped words.

Shortcoming of HMMs

HMM is a joint distribution with an independence assumption on previous tokens: we are not saying the events are independent of each other, but that they are independent given the label. HMM captures the dependency between each state and only its corresponding observation. But each segmental state may depend not just on a single character/word, but on all the adjacent segmental states.

There is also a mismatch between the learning objective and the prediction task: HMM’s objective function learns a joint distribution of states and observations P(Y, X), but in the prediction task we need P(Y|X).

Conclusion

To overcome these shortcomings, we will introduce the next approach, the Maximum Entropy Markov Model.

References
