# NLP: Text Segmentation with Ngram

Previous articles covered dictionary-based approaches to word segmentation. These approaches require a good dictionary of words, and unknown words such as names are a typical source of errors. In addition, when there are multiple possible segmentations with the same number of words, these algorithms cannot decide between them.

The next step is to use probabilistic approaches.

# Probabilistic Approach

## Unigram

The first approach is called the unigram model. To score each candidate segmentation, you take the product of the probabilities of its words.

We can use the unigram data from Google's trillion-word data set (http://norvig.com/ngrams), which contains about 1 trillion tokens (1,024,908,267,229) in total. Each word (token) is counted by the number of times it appears in the corpus.

As an example, “meneat” can be segmented as: “me neat”, “men eat”, or “meme at”.

```
me      566,617,666
neat      4,643,885
men     174,058,407
eat      29,237,400
mene         77,555
at    2,272,272,772

t = 1024908267229.0  # total number of words in the corpus
p_me_neat = (566617666/t * 4643885/t)  = 2.5e-09
p_men_eat = (174058407/t * 29237400/t) = 4.8e-09
p_mene_at = (77555/t * 2272272772/t)   = 1.7e-10
```

Of these three possibilities, "men eat" has the highest score, so the algorithm selects it.
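The unigram scoring above can be sketched as a small recursive segmenter. This is a minimal illustration, not the author's implementation: the counts are the six entries from the example, and unseen words simply get probability zero.

```python
from functools import lru_cache

# Illustrative counts from the example above (Google trillion-word corpus).
unigram_counts = {
    "me": 566_617_666, "neat": 4_643_885,
    "men": 174_058_407, "eat": 29_237_400,
    "mene": 77_555, "at": 2_272_272_772,
}
TOTAL = 1_024_908_267_229.0  # total tokens in the corpus

def p_word(word):
    """Unigram probability; unseen words get probability 0 here."""
    return unigram_counts.get(word, 0) / TOTAL

@lru_cache(maxsize=None)
def segment(text):
    """Return (probability, words) for the best segmentation of text."""
    if not text:
        return (1.0, [])
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        p_rest, words = segment(rest)
        candidates.append((p_word(first) * p_rest, [first] + words))
    return max(candidates, key=lambda c: c[0])

prob, words = segment("meneat")  # picks "men eat"
```

A real system would also smooth unseen words instead of assigning zero, since otherwise any segmentation containing one unknown word scores zero.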

## Bigram

The bigram model looks at pairs of adjacent words. The sentence "The men dine here" produces these pairs:

```
<B> The
The men
men dine
dine here
here <E>
```

`<B>` is a token marking the beginning of a sentence and `<E>` marks the end.

To generate the bigram counts, we record a count for each pair and increment it as we process more training data.
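Counting pairs this way can be sketched with a `Counter`. This is an illustrative helper, not part of the original article; the input sentences are assumed to be pre-tokenized by whitespace.

```python
from collections import Counter

def count_bigrams(sentences):
    """Count adjacent word pairs, padding each sentence with <B> and <E>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<B>"] + sentence.split() + ["<E>"]
        # zip pairs each token with its right-hand neighbor.
        for left, right in zip(tokens, tokens[1:]):
            counts[(left, right)] += 1
    return counts

bigrams = count_bigrams(["The men dine here", "The men eat here"])
```

Adding more training sentences simply increments the existing pair counts, so the table can be built incrementally.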

In this case, we update the equation to take the product of conditional pair probabilities: P(w1 … wn) ≈ P(w1|w0) × P(w2|w1) × … × P(wn|wn−1), where w0 is the sentence-start token.

As an example: “thenonprofit” can be segmented as “the non profit” or “then on profit”. From the Google trillion word bigram list we get:

```
<s> the     258,483,382
the non         739,031
non profit      218,392
<s> then     11,926,082
then on       1,263,045
on profit       105,801

P(the non profit) = P(the|<s>) * P(non|the) * P(profit|non)
                  = P(<s> the) * P(the non) * P(non profit)
                  = 258483382/t * 739031/t * 218392/t  = 3.88E-17
P(then on profit) = P(then|<s>) * P(on|then) * P(profit|on)
                  = P(<s> then) * P(then on) * P(on profit)
                  = 11926082/t * 1263045/t * 105801/t  = 1.48E-18
```

This results in choosing "the non profit", since it has the higher probability.
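The bigram comparison can be sketched in a few lines of Python. The counts are the six pairs from the example; note that, as in the calculation above, each pair is scored simply as count/t rather than as a true conditional probability, which is the simplification the example uses.

```python
# Bigram counts from the example above (Google trillion-word corpus).
bigram_counts = {
    ("<s>", "the"): 258_483_382, ("the", "non"): 739_031,
    ("non", "profit"): 218_392,  ("<s>", "then"): 11_926_082,
    ("then", "on"): 1_263_045,   ("on", "profit"): 105_801,
}
t = 1_024_908_267_229.0  # total tokens in the corpus

def score(words):
    """Product of pair scores over the <s>-padded word sequence."""
    padded = ["<s>"] + words
    p = 1.0
    for pair in zip(padded, padded[1:]):
        p *= bigram_counts.get(pair, 0) / t
    return p

p1 = score(["the", "non", "profit"])  # ~3.88e-17
p2 = score(["then", "on", "profit"])  # ~1.48e-18
```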

This approach also includes the unigram counts, which act much like the dictionary but with frequencies. Backing off to a word's unigram count handles cases where the pair was not seen in the training text.
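The backoff idea can be sketched as follows. The counts here are purely illustrative placeholder numbers, and this simple fallback is only one of several possible backoff schemes (real systems usually discount or interpolate).

```python
t = 1_024_908_267_229.0
bigram_counts = {("the", "non"): 739_031}       # one pair seen in training
unigram_counts = {"non": 1_000_000, "profit": 500_000}  # illustrative only

def pair_score(left, right):
    """Score a word pair; back off to the unigram count of `right`
    when the pair never occurred in the training data."""
    if (left, right) in bigram_counts:
        return bigram_counts[(left, right)] / t
    return unigram_counts.get(right, 0) / t
```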

## N-Gram

For Khmer, a word-bigram approach achieves an F-measure of 91.8% (91.5% precision, 92.1% recall) in the work of Chea Sok Huor et al. That paper also compares word bigrams against bigrams over Khmer Character Clusters (Orthographic Syllables).

