NLP: Text Segmentation with Ngram

You have seen dictionary-based approaches to word segmentation from previous articles. These approaches required a good dictionary list of words. Unknown words such as names tend to be the typical issues resulted from these approaches. Besides that when there are multiple possible segmentations with the same number of words, these algorithms will not able to decide one over another.

So this next step is to use probabilistic approaches.

Probabilistic Approach


This approach is called unigram. To calculate each option, you take the product of the probability of each word.

We can use the unigram data from Google, trillion-word data set ( with a total number of token of about 1 trillion (1,024,908,267,229). Each of the word or token are counted the number of times they appear in the corpus.

As an example, “meneat” can be segmented as: “me neat”, “men eat”, or “meme at”.

me  566,617,666
neat 4,643,885
men 174,058,407
eat 29,237,400
mene 77555
at 2272272772
t=1024908267229.0 #total number of words in the corpus
p_me_neat = (566617666/t * 4643885/t) = 2.5e-09
p_men_eat = (174058407/t * 29237400/t)= 4.8e-09
p_mene_at = (77555/t * 2272272772/t) = 1.7e-10

With these three possibilities, the option “men eat” has the highest score. So the algorithm would select the highest score option.


<B> The
The men
men dine
dine here
here <E>

<B> is a token to mean a begin of a sentence and <E> means the end of a sentence.

To generate the bigram, we would add the count after each pair and increment it as we add more training data.

In this case, we update the equation to:

As an example: “thenonprofit” can be segmented as “the non profit” or “then on profit”. From the Google trillion word bigram list we get:

<s> the 258483382
the non 739031
non profit 218392
<s> then 11926082
then on 1263045
on profit 105801
P(the non profit) = P(the|<s>) * P(non|the) * P(profit|non)
= P(<s> the) * P(the non) * P(non profit)
= 258483382/t* 739031/t * 218392/t = 3.88E-17
P(then on profit) = P(then|<s>) * P(on|then) * P(profit|on)
= P(<s> then) * P(then on) * P(on profit)
= 11926082/t* 1263045/t * 105801/t = 1.48E-18

This results in choosing “the non profit” since it has a higher probability which seems to be more probable.

This approach includes unigram which is a single word much like the dictionary but with frequency count. This can solve cases where the pair was not seen in the training text.


Khmer word bigram result is 91.8% F-measure (91.5 Precision, 92.1 Recall) by Chea Sok Huor et al [1]. This paper also shows the differences between word bigram and Khmer Character Cluster or Orthographic Syllable bigram.





Software Engineer and ML Enthusiast

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store