NLP: Text Segmentation Using Naive Bayes

Phylypo Tum
5 min readDec 18, 2019

In the previous article on Ngram, we store the probability of N sequence of words into dictionaries to figure out which are the most likely words to choose. In this article, we look into the probability from a character level using Naive Bayes.

Naive Bayes for Time Series Data

An alternate approach is to look at the frequency of each letter instead and try to infer the pattern from them. This probabilistic algorithm makes use of labeling each character into different categorical labels. We can label each of the characters in word as:

S: Single letter word
B: Beginning letter
M: Middle letter
E: Ending letter

As an example of a string “thisisatest” which should correspond to “this is a test”, we would tag it as:

Feature:  t h i s i s a t e s t
Tag: B M M E B E S B M M E

We are looking for the probability of each letter given its label. For example, calculate the probability of letter ‘t’ given that it has a label ‘B’ — a beginning letter. We can use Naive Bayes to calculate its probability.

Starts with Bayes Theorem, we calculate a conditional probability of y label given the feature x which is the input sequence of letters. We would have:

Bayes Theorem

To rewrite this in term of x features and y label, we have:

Vector x is a list of features but in our case is just one feature which is the character. The denominator P(x) is considered as a normalized factor and thus can be ignored. Naive Bayes assumes that the input values are conditional independence which means that a pair of label y and input x is independent of other pairs. Based on this simplification we get:

Naive Bayes [1]

We have x as a set feature consists of letters ‘a’ to ‘z’. We have only one feature so m =1. Then y is the label in set S, B, M, E. The P(x|y) is a conditional probability of x given y. This algorithm is a generative approach since it models the joint probability of input x and label y.

The model tries to predict the tag for a given character. For example, ‘t’ would most likely tag as B (Beginning letter) instead of E (Ending letter) given our example data with the word ‘this’ and ‘test’. Naive Bayes treats each letter and tag pair independently. It assumes that the input values (characters) are conditionally independent. That is why it is called a naive model.


So let’s go back to the example “thisisatest” above. We need to calculate the label joint probability which has a total of 11 labels. We have one single character word ‘S’, 3 beginning letters ‘B’, 4 middle letters ‘M’, and 3 ending letters ‘E’. So the probability of each label is as follow:

P(S) = 1/11 
P(B) = 3/11
P(M) = 4/11
P(E) = 3/11

The conditional probability for the letter ‘t’ given its tag S is zero, given its tag B is two (‘this’, ‘test’), given its tag M is zero, and given its tag as E is one (‘test’).

P(t|S) = 0
P(t|B) = 2/11 -- 'this', 'test'
P(t|M) = 0
P(t|E) = 1/11 -- 'test'

To calculate the tag for the letter ‘t’ from our training sentence, we have.

P(y,x) = P(x|y)P(y) -- x='t' (only 1 feature)
p(S,t) = P(t|S)P(S) = 0 * 1/11 = 0
P(B,t) = P(t|B)P(B) = 2/11 * 3/11 = 0.05
p(M,t) = P(t|M)P(M) = 0 * 4/11 = 0
p(E,t) = P(t|E)P(E) = 1/11 * 3/11 = 0.02

For the given input letter ‘t’, the algorithm will predict the tag as B (beginning letter) since it has the highest probability.

This approach only looks at a character at a time and assume the sequence is independent. This does not do well in our case as we can see that the input/tag pair are not independent. This does not look like it fits our use case. But one of the main use cases for Naive Bayes is in text classification like predicting spam and not spam email. See one of our use cases of Naive Bayes for text classification of the Khmer document in this article. In that use case, we have many more features (ie. list of words in the text) for the algorithm to evaluate rather than just one in our example here.

As a comparison to our final CRF which is the final algorithm in this series, we attempt to implement Naive Bayes using the same features used on CRF. Our test shows 89% accuracy on Khmer documents. As a caveat, we use a slightly different approach than mention here with only 2 tags (1=beginning of a word, 0=otherwise instead of ‘S’, ’B’, ’M’, ’E’). We use 14 features and implemented using a one-hot encoding which gives us over 11,000 features. Without one-hot encoding, the algorithm accuracy is only 62%. In addition, we implemented with Khmer Character cluster approach. We will explain this in more detail in the series on the CRF implementation section. See the notebook for the implementation detail.


In this article, we explore Bayes theorem and show the detail and example of Naive Bayes. We see that Naive Bayes make an assumption of the conditional independence of labels and feature. This does not do well with time series data like sequences of characters in word segmentation problem.

Next, we can take a look at an algorithm that takes into account the sequence of characters and does not make an assumption that input values are conditionally independent. This algorithm is called the Hidden Markov Model.