NLP: Text Segmentation with N-grams
You have seen dictionary-based approaches to word segmentation in previous articles. These approaches require a good dictionary of words. Unknown words, such as names, are a typical problem for these approaches. Besides that, when there are multiple possible segmentations with the same number of words, these algorithms are not able to prefer one over another.
So this next step is to use probabilistic approaches.
One of the issues with these dictionary-based algorithms is that they do not take into account how frequently a word is actually used. We can use a probabilistic approach to determine the probability of each word.
So instead of just a list of words in our dictionary, we add a count of how many times we see each word in the training data. We need a training dataset where we can count all of the words. Then, based on the probability (word count / total number of words), we can predict the likelihood of each word.
This approach is called a unigram model. To score each candidate segmentation, you take the product of the probabilities of its words.
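As a minimal sketch of this scoring, assuming a small hypothetical `counts` dictionary (the numbers here are made up for illustration, not real corpus counts):

```python
# Hypothetical unigram counts gathered from some training data.
counts = {"me": 4, "neat": 1, "men": 3, "eat": 5}
total = sum(counts.values())

def unigram_prob(word):
    """Probability of a single word: count / total number of tokens."""
    return counts.get(word, 0) / total

def segmentation_score(words):
    """Score a candidate segmentation as the product of its word probabilities."""
    score = 1.0
    for w in words:
        score *= unigram_prob(w)
    return score

# "men eat" wins over "me neat" because its words are more frequent here.
print(segmentation_score(["men", "eat"]) > segmentation_score(["me", "neat"]))  # True
```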
We can use the unigram data from Google's trillion-word data set (http://norvig.com/ngrams), with a total token count of about 1 trillion (1,024,908,267,229). Each word or token is counted by the number of times it appears in the corpus.
As an example, “meneat” can be segmented as: “me neat”, “men eat”, or “mene at”.
t = 1024908267229.0  # total number of tokens in the corpus

p_me_neat = (566617666/t) * (4643885/t)     # = 2.5e-09
p_men_eat = (174058407/t) * (29237400/t)    # = 4.8e-09
p_mene_at = (77555/t) * (2272272772/t)      # = 1.7e-10
Of these three possibilities, the option “men eat” has the highest score, so the algorithm selects it.
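The whole unigram segmentation can be sketched as a recursive search over split points. Here only the six counts involved in the “meneat” example are included; a real implementation would load the full unigram table:

```python
import functools

# Unigram counts for the "meneat" example, taken from the
# Google trillion-word data set figures quoted above.
t = 1024908267229.0
counts = {"me": 566617666, "neat": 4643885,
          "men": 174058407, "eat": 29237400,
          "mene": 77555, "at": 2272272772}

def prob(word):
    return counts.get(word, 0) / t

@functools.lru_cache(maxsize=None)
def segment(text):
    """Return (probability, words) for the best segmentation of `text`."""
    if not text:
        return (1.0, [])
    candidates = []
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        p_tail, words = segment(tail)
        candidates.append((prob(head) * p_tail, [head] + words))
    return max(candidates, key=lambda c: c[0])

print(segment("meneat")[1])  # ['men', 'eat']
```

The `lru_cache` memoizes sub-segmentations so the search stays linear in practice rather than exploring every split from scratch.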
The unigram model treats each word independently; it does not take any context between words into account. In a bigram model, we match pairs of adjacent words that the algorithm has seen before. This involves scanning the training documents two words at a time: instead of single dictionary words, we store pairs of words and their frequencies. Generating bigrams from the example sentence “The men dine here” gives this list of 5 bigrams:

<B> The
The men
men dine
dine here
here <E>
<B> is a token marking the beginning of a sentence and <E> marks the end of a sentence.
To build the bigram table, we keep a count with each pair and increment it as we add more training data.
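A minimal sketch of building such a bigram table from training sentences, using the `<B>`/`<E>` convention above:

```python
from collections import Counter

def bigram_counts(sentences):
    """Count adjacent word pairs, padding each sentence with <B> and <E>."""
    counts = Counter()
    for sentence in sentences:
        words = ["<B>"] + sentence.split() + ["<E>"]
        for pair in zip(words, words[1:]):
            counts[pair] += 1
    return counts

counts = bigram_counts(["The men dine here"])
# Five bigrams: (<B>, The), (The, men), (men, dine), (dine, here), (here, <E>)
print(len(counts))  # 5
```

Feeding more sentences through `bigram_counts` simply increments the counts for pairs already seen, which is all the training step involves.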
In this case, we update the equation to a product of conditional probabilities over adjacent pairs:

P(w1 w2 … wn) = P(w1|<s>) * P(w2|w1) * … * P(wn|wn-1)
As an example, “thenonprofit” can be segmented as “the non profit” or “then on profit”. From the Google trillion-word bigram list we get:
<s> the      258483382
the non      739031
non profit   218392
<s> then     11926082
then on      1263045
on profit    105801

P(the non profit) = P(the|<s>) * P(non|the) * P(profit|non)
                  = P(<s> the) * P(the non) * P(non profit)
                  = 258483382/t * 739031/t * 218392/t = 3.88E-17

P(then on profit) = P(then|<s>) * P(on|then) * P(profit|on)
                  = P(<s> then) * P(then on) * P(on profit)
                  = 11926082/t * 1263045/t * 105801/t = 1.48E-18
This results in choosing “the non profit”, which has the higher probability and is indeed the more natural reading.
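The calculation above can be sketched in a few lines, using the Google bigram counts quoted earlier (each count is divided by t, as in the article's working):

```python
t = 1024908267229.0
# Bigram counts from the Google trillion-word data set quoted above.
bigrams = {("<s>", "the"): 258483382, ("the", "non"): 739031,
           ("non", "profit"): 218392, ("<s>", "then"): 11926082,
           ("then", "on"): 1263045, ("on", "profit"): 105801}

def bigram_score(words):
    """Product of bigram frequencies (each divided by t), pair by pair."""
    score = 1.0
    padded = ["<s>"] + words
    for pair in zip(padded, padded[1:]):
        score *= bigrams.get(pair, 0) / t
    return score

print(bigram_score(["the", "non", "profit"]))  # ~3.88e-17
print(bigram_score(["then", "on", "profit"]))  # ~1.48e-18
```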
In practice, this approach also keeps the unigram model — single words with frequency counts, much like the dictionary — to fall back on. This handles cases where a pair was not seen in the training text.
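One simple way to combine the two is a backoff sketch like the following (the counts here are hypothetical, and this is not the exact scheme used by any particular paper):

```python
t = 1024908267229.0
# Hypothetical counts for illustration only.
unigrams = {"profit": 59752027}
bigrams = {("non", "profit"): 218392}

def pair_prob(prev, word):
    """Use the bigram frequency when the pair was seen in training;
    otherwise back off to the unigram frequency of the word alone."""
    if (prev, word) in bigrams:
        return bigrams[(prev, word)] / t
    return unigrams.get(word, 0) / t
```

So a seen pair like ("non", "profit") is scored from the bigram table, while an unseen pair like ("then", "profit") still gets a nonzero score from the unigram count of "profit" alone.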
This ‘N-gram’ algorithm can be extended to trigrams or larger N. Larger N gives more context, but it also generates a very large list of lookup terms. Generally, bigrams and trigrams work well in many cases.
For Khmer, word-bigram segmentation achieves a 91.8% F-measure (91.5% precision, 92.1% recall) according to Chea Sok Huor et al. The paper also compares word bigrams against Khmer Character Cluster (Orthographic Syllable) bigrams.
In this article, we discussed N-grams and how this approach requires building dictionaries of words, or sequences of words, together with their probabilities. We went through examples of unigram and bigram segmentation with sample calculations. Next, we will look at how Naive Bayes can be used to segment words.
- Next: NLP: Text Segmentation Using Naive Bayes
- Back to Series: Segmentation of Khmer Text Using Conditional Random Fields
Chea Sok Huor, Top Rithy, Ros Hemy, Vann Navy, Chin Chanthirith, Chhoeun Tola (2004). Word Bigram Vs Orthographic Syllable Bigram in Khmer Word Segmentation.