Text Classification with scikit-learn on Khmer Documents

Phylypo Tum
Feb 16, 2019



Text classification is one of the use cases of machine learning (ML) in natural language processing (NLP). One example is identifying whether an email is spam or not. Another is sorting a set of documents into a list of categories. In our case, we want to create a Khmer news portal, and we have crawled documents from different Khmer news sites. Many of the articles are about traffic accidents. This is important news, but we want to group these articles so that users can view them separately without cluttering the main page. So we want to classify each article as traffic accident-related or not. The source sites do not tag articles appropriately, so we need an ML algorithm to do this. The final result is on our site: https://domnung.com. The same process can be applied to multiple categories, as you can see here.

In this post, we analyze the performance of different features with different algorithms from scikit-learn, then choose the best-performing classifier to use in production. We outline the whole approach, from training to saving the model and deploying it to production.


The approach has four steps:

  1. Data loading and tagging: parse data using segmentation on Khmer text
  2. Extract features from the documents using TFIDF
  3. Run a few different ML algorithms and compare the results
  4. Save the chosen model and load it to run in production

1. Data Loading and Tagging

We already have a process that crawls different Khmer-language sites to get the title and content of each article. I manually labeled 104 documents (56 traffic accident-related and 48 not). This is a small dataset, but we can add more as we run through a few cycles.

In the database, I have an article table with id, title, body, and category columns. The category column is a string in which I manually entered “accident” or “non_accident” based on the content. So I just query the DB for the data with:

select id, title, body, category from dbo.article 

I created a getArticles method that returns those fields as lists, and I call it as follows:

(docIds, doc_titles, doc_contents, categories) = getArticles()
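The implementation of getArticles is not shown in this post. As a minimal sketch only, assuming an SQLite connection passed in as a parameter and a table simply named article (both are assumptions; the original queries dbo.article on a different database), it could look like this:

```python
import sqlite3

def getArticles(conn):
    """Return ids, titles, bodies and categories as four parallel lists."""
    rows = conn.execute(
        "select id, title, body, category from article order by id"
    ).fetchall()
    if not rows:
        return [], [], [], []
    # transpose the list of rows into one list per column
    ids, titles, bodies, categories = (list(col) for col in zip(*rows))
    return ids, titles, bodies, categories

# Hypothetical in-memory data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("create table article (id integer, title text, body text, category text)")
conn.executemany(
    "insert into article values (?, ?, ?, ?)",
    [(1, "title one", "body one", "accident"),
     (2, "title two", "body two", "non_accident")],
)
(docIds, doc_titles, doc_contents, categories) = getArticles(conn)
print(categories)
```

Returning parallel lists keeps the later steps simple, since each list can go straight into a data frame column.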

Now we have the text and its expected category. We can start to analyze the data and turn the text into features.

Word Segmentation

We concatenate the title and content into a single “text” field. Since Khmer text does not use spaces to separate words, we built our own Khmer word segmentation process. It separates the words with spaces so that we can extract features in the next step. See more detail about the segmentation process in this post.

As an example, take the following phrase in Khmer. It appears as one long string, which I segment into words separated by spaces:

របស់យើងប្រសើរជាងមុន។ => របស់ យើង ប្រសើរ ជាង មុន។

The result of this step is the content of each article as space-separated words.
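The segmenter itself is described in a separate post. Purely as an illustration of the idea, and not the method used on the site, here is a toy greedy longest-match segmenter. The five-word dictionary and the rule of attaching unknown characters (such as punctuation) to the previous word are assumptions made for this sketch.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation; returns space-separated words."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # try the longest candidate first, then shorter ones
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # no dictionary word starts here (e.g. punctuation):
            # attach the character to the previous token
            if tokens:
                tokens[-1] += text[i]
            else:
                tokens.append(text[i])
            i += 1
    return " ".join(tokens)

words = ["របស់", "យើង", "ប្រសើរ", "ជាង", "មុន"]  # hypothetical mini dictionary
sentence = "".join(words) + "។"
print(segment(sentence, set(words)))
```

A real segmenter needs a far larger dictionary and a way to resolve ambiguous splits, which this sketch does not attempt.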

2. Extracting Features

This step takes the segmented words and weighs how often each word occurs in a document against how many documents contain it. We already had our own TFIDF process, but it had no support for extracting bigrams (two consecutive terms), which we wanted to compare against single words. So we use TfidfVectorizer from the scikit-learn library.

from sklearn.feature_extraction.text import TfidfVectorizer
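To make the weighting concrete, here is a hand-rolled sketch of what TfidfVectorizer computes, assuming its default settings (smoothed idf, raw term counts, L2 normalisation). The three toy documents are placeholders, not part of our data.

```python
import math
from collections import Counter

def tfidf_by_hand(docs):
    """TF-IDF with smoothed idf and L2 normalisation for space-separated docs."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

docs = ["car accident road", "car market", "car rice"]  # toy, pre-segmented
for vector in tfidf_by_hand(docs):
    print({t: round(w, 3) for t, w in vector.items()})
```

In the first document, "car" occurs in every document and so gets a lower weight than "accident" or "road", which occur only once in the corpus.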

The default tokenization of TfidfVectorizer does not handle Khmer Unicode properly: it drops the Khmer subscript characters. Since our text is already segmented, we use a custom tokenizer (tokenizersplit) that simply splits on whitespace.

def tokenizersplit(text):
    return text.split()

tfidf = TfidfVectorizer(tokenizer=tokenizersplit, encoding='utf-8', min_df=2, ngram_range=(1, 2), max_features=25000)
tfidf.fit(df['text'])

Here are some details on the TfidfVectorizer options:

  • encoding: set to ‘utf-8’ (which is also the default) to handle Khmer Unicode characters
  • min_df: ignore terms that appear in fewer documents than the given value. A value of 2 means a term must appear in at least 2 documents to be counted.
  • ngram_range: the range of n-gram sizes to extract; (1, 2) means unigrams and bigrams (more detail below)
  • max_features: the maximum number of features to keep, chosen by term frequency across the corpus
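As a small sketch of how these options behave, the following fits a vectorizer on four toy documents. The English tokens are placeholders standing in for segmented Khmer words, and token_pattern=None is an addition that only silences the warning about the unused default pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenizersplit(text):
    return text.split()

# Toy pre-segmented documents (English placeholders standing in for Khmer).
docs = [
    "car accident road",
    "car accident injured",
    "market price rice",
    "rice price market road",
]

# token_pattern=None avoids the warning about the unused default pattern.
vect = TfidfVectorizer(tokenizer=tokenizersplit, token_pattern=None,
                       encoding='utf-8', min_df=2, ngram_range=(1, 1))
features = vect.fit_transform(docs)

print(sorted(vect.vocabulary_))  # "injured" occurs in one document only
print(features.shape)            # (number of documents, number of kept terms)
```

With min_df=2, the term "injured" is left out of the vocabulary because it appears in a single document.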

Unigram vs Bigram

The tfidf process produces a vocabulary of distinct terms. Each term gets a value in each document that indicates how relevant the term is to that document; a high value indicates a high degree of relevancy. These terms become the features for the ML algorithm. The process can also treat two adjacent words as one term, which captures more context. For example, “car accident” as a single term is different from the two terms “car” and “accident”, which can occur anywhere in the document. The two separate words could come from “a kid has an accident in a car”, which is unrelated to a “car accident”. So two consecutive terms, or a bigram, can be a powerful feature to explore.

So we want to compare the performance of two ngram_range options: unigrams only versus unigrams plus bigrams. Bigrams add many more features, and we have to see whether they improve performance enough to be worth it.
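The difference between the two kinds of terms can be sketched in a few lines of plain Python. The helper name ngrams and the two English sentences are illustrative only.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc_a = "a car accident on the road".split()
doc_b = "a kid has an accident in a car".split()

for doc in (doc_a, doc_b):
    unigrams = ngrams(doc, 1)
    bigrams = ngrams(doc, 2)
    # both documents share the unigrams, only one has the bigram
    print("car" in unigrams and "accident" in unigrams, "car accident" in bigrams)
```

Both documents contain the unigrams "car" and "accident", but only the first one contains the bigram "car accident".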

Here is the result for unigrams on 100 articles, which produced 3,879 unigram terms. Each algorithm was run twice on a randomized train/validation split with a 30% validation set.

Naive Bayes accuracy:         0.63, 0.63
Logistic Regression accuracy: 0.90, 0.96
SVM accuracy:                 0.50, 0.50
Random Forest accuracy:       0.83, 0.96

For unigrams plus bigrams on the same 100 articles, the process produced 14,692 terms. The result is:

Naive Bayes accuracy:         0.60, 0.60
Logistic Regression accuracy: 0.77, 0.80
SVM accuracy:                 0.50, 0.50
Random Forest accuracy:       0.83, 0.93

The results show that Logistic Regression performs well on unigrams (up to 0.96) but poorly on unigrams plus bigrams (up to 0.80). Random Forest is good on unigrams (up to 0.96) and only slightly worse on unigrams plus bigrams (up to 0.93). Overall, unigram accuracy is better. I was expecting bigrams to perform better, but on this small dataset adding bigrams seems to cause overfitting or add too much noise. So we use only unigrams in our approach.

3. Evaluate Performance with Different Classifiers

To evaluate the effect of the amount of training data, I tested several document counts: 50, 75, and 100 documents. Here are the results for the different algorithms:

[Figure: accuracy of the different algorithms at three training set sizes]

With 50 or fewer documents, the performance is worse than a random guess, so it is not usable. But with just 100 documents we see decent performance.
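A comparison across training sizes can be sketched as a simple loop. Everything in this example is an assumption made to keep it runnable: the documents are synthetic English placeholders, only Logistic Regression is evaluated, and the vectorizer is fitted on the training part of each split. The numbers it prints say nothing about the real Khmer data.

```python
import random
from sklearn import linear_model, metrics, model_selection
from sklearn.feature_extraction.text import TfidfVectorizer

random.seed(0)
ACCIDENT = ["car", "crash", "injured", "road", "truck"]
OTHER = ["market", "price", "rice", "school", "festival"]
COMMON = ["today", "people", "city", "news"]

def make_doc(topic_words):
    # mix topic words with common words so the classes share some vocabulary
    return " ".join(random.choices(topic_words, k=4) + random.choices(COMMON, k=6))

texts = [make_doc(ACCIDENT) for _ in range(50)] + [make_doc(OTHER) for _ in range(50)]
labels = ["accident"] * 50 + ["non_accident"] * 50

results = {}
for size in (50, 75, 100):
    # take a random subset of the requested size
    idx = random.sample(range(len(texts)), size)
    sub_texts = [texts[i] for i in idx]
    sub_labels = [labels[i] for i in idx]
    train_x, valid_x, train_y, valid_y = model_selection.train_test_split(
        sub_texts, sub_labels, test_size=0.30, random_state=0)
    vect = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
    model = linear_model.LogisticRegression()
    model.fit(vect.fit_transform(train_x), train_y)
    predictions = model.predict(vect.transform(valid_x))
    results[size] = metrics.accuracy_score(valid_y, predictions)

print(results)
```

Repeating the loop with several random seeds gives a better feel for how much the accuracy varies at each size.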

Performance analysis

As expected, the more documents we have, the better the performance, so there is room to improve by adding training documents. We choose the top-performing algorithm on the 100-document training set, which is Logistic Regression. Next we want to see how the size of the training/validation split affects the result.

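from sklearn import model_selection, linear_model, metrics

# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['text'], df['cat'], test_size=0.35, random_state=0)

# turn the text into TFIDF features with the vectorizer fitted earlier
xtrain_tfidf = tfidf.transform(train_x)
xvalid_tfidf = tfidf.transform(valid_x)

# m is the classifier under test, here Logistic Regression
m = linear_model.LogisticRegression()
m.fit(xtrain_tfidf, train_y)
y_pred = m.predict(xvalid_tfidf)
print(metrics.classification_report(valid_y, y_pred))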
## 85 docs for training, 15 docs for validation set
              precision    recall  f1-score   support

 no_accident       0.93      1.00      0.96        13
    accident       1.00      0.92      0.96        13

 avg / total       0.96      0.96      0.96        26

## 65 docs on training, 35 docs for validation set
              precision    recall  f1-score   support

 no_accident       0.80      1.00      0.89        20
    accident       1.00      0.71      0.83        17

 avg / total       0.89      0.86      0.86        37

Increasing the training set by shrinking the validation set improves performance a little, from 0.89 to 0.96 on precision and from 0.86 to 0.96 on recall. But we don’t want the validation set to be too small, or we cannot be certain about the performance. I think a validation set of around 30% is good.
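To see where the numbers in such a report come from, the 37-document report can be reproduced with a little arithmetic. The confusion counts below are not given in the report; they are inferred from its precision, recall and support columns and should be read as an assumption.

```python
# Confusion counts inferred from the 37-document report (an assumption):
# all 20 no_accident docs were predicted correctly, while 5 of the 17
# accident docs were predicted as no_accident.
tp_no, fn_no = 20, 0      # no_accident docs predicted as no_accident / as accident
tp_acc, fn_acc = 12, 5    # accident docs predicted as accident / as no_accident

precision_no = tp_no / (tp_no + fn_acc)      # 20 / 25
recall_no = tp_no / (tp_no + fn_no)          # 20 / 20
precision_acc = tp_acc / (tp_acc + fn_no)    # 12 / 12
recall_acc = tp_acc / (tp_acc + fn_acc)      # 12 / 17

def f1(p, r):
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

accuracy = (tp_no + tp_acc) / (tp_no + fn_no + tp_acc + fn_acc)
print(round(precision_no, 2), round(recall_no, 2), round(f1(precision_no, recall_no), 2))
print(round(precision_acc, 2), round(recall_acc, 2), round(f1(precision_acc, recall_acc), 2))
print(round(accuracy, 2))
```

Read this way, every error in that run was an accident article that the model missed, and no article was wrongly flagged as an accident.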

4. Save the Model and Load It to Run in Production

Once we are satisfied with a model, we train it on all documents, without splitting into train and validation sets, so that we make use of all the labeled data. Then we save the model so we can run it on new articles.

You also need to save the fitted TFIDF vectorizer so that new data is processed with the same vocabulary and the same number of features. To save it, we use the pickle library to dump the object as follows:

import pickle

tfidf = TfidfVectorizer(tokenizer=tokenizersplit, encoding='utf-8')
tfidf.fit(df['text'])
# save the fitted vectorizer
pickle.dump(tfidf, open('feature_100.pkl', 'wb'))

Similarly, to save the model we use:

...
model = linear_model.LogisticRegression()
model.fit(features, labels)
import pickle
pickle.dump(model, open("model.pkl", 'wb'))

To load, we just use the load function from pickle. To load the vectorizer, the custom tokenizer function (tokenizersplit) must be defined first:

import pickle

# needed to load pickle tfidf
def tokenizersplit(text):
    return text.split()

tfidf = pickle.load(open('feature_100.pkl', 'rb'))
loaded_model = pickle.load(open('model.pkl', 'rb'))
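A quick way to gain confidence in the saved files is a round trip: save, load, and check that the loaded objects behave like the originals. The sketch below assumes toy English placeholder documents, a temporary directory, and file names chosen for the example only.

```python
import os
import pickle
import tempfile

from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenizersplit(text):
    return text.split()

# Toy pre-segmented training data (placeholders, not the real corpus).
texts = ["car crash road", "truck crash injured",
         "rice price market", "school festival market"]
labels = [0, 0, 1, 1]

tfidf = TfidfVectorizer(tokenizer=tokenizersplit, token_pattern=None)
model = linear_model.LogisticRegression().fit(tfidf.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    # save both objects, then load them back from disk
    for name, obj in (("feature.pkl", tfidf), ("model.pkl", model)):
        with open(os.path.join(tmp, name), "wb") as f:
            pickle.dump(obj, f)
    with open(os.path.join(tmp, "feature.pkl"), "rb") as f:
        loaded_tfidf = pickle.load(f)
    with open(os.path.join(tmp, "model.pkl"), "rb") as f:
        loaded_model = pickle.load(f)

new_docs = ["car crash market", "unseen words only"]
before = model.predict(tfidf.transform(new_docs))
after = loaded_model.predict(loaded_tfidf.transform(new_docs))
print(loaded_tfidf.transform(new_docs).shape[1] == len(tfidf.vocabulary_))
print((before == after).all())
```

Words that were not in the training vocabulary are simply ignored at transform time, which is why the feature count stays the same for new documents.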

Step-by-Step Detail of the Training Process

To put the whole training process together, here are the details of the training and saving steps.

1. Load training documents

(docIds, docTitles, docBodies, categories) = getArticles()

2. Run segmentation and create pandas data frame

(token_bodies, token_titles) = tokenizeDocs(docBodies, docTitles)
# concatenate title, body with space into tokenText
tokenText = [token_titles[i] + " " + token_bodies[i] for i in range(len(token_titles))]
import pandas as pd
df = pd.DataFrame({'id': docIds})
df['text'] = tokenText
df['cat'] = categories

3. Run tfidf fit

tfidf = TfidfVectorizer(tokenizer=tokenizersplit, encoding='utf-8')
tfidf.fit(df['text'])

4. Save the fitted TFIDF vectorizer (so it can be reused on new docs and the feature size matches the model)

import pickle
pickle.dump(tfidf, open('feature_100.pkl', 'wb'))

5. Run tfidf transform

features = tfidf.transform(df.text)

6. Split data into train and validation set

from sklearn import model_selection, preprocessing
# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['text'], df['cat'], test_size=0.30, random_state=1)

# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
# reuse the mapping learned on the training labels
valid_y = encoder.transform(valid_y)
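As a small sketch of what the label encoder does, the following fits it on toy training labels and reuses the learned mapping on validation labels. The label lists are made up for the example.

```python
from sklearn import preprocessing

train_y = ["non_accident", "accident", "accident", "non_accident"]
valid_y = ["accident", "accident"]  # toy split where one class is missing

encoder = preprocessing.LabelEncoder()
train_ids = encoder.fit_transform(train_y)   # learns accident -> 0, non_accident -> 1
valid_ids = encoder.transform(valid_y)       # reuses that mapping

print(list(encoder.classes_), list(train_ids), list(valid_ids))
print(list(encoder.inverse_transform(valid_ids)))
```

The classes are sorted alphabetically when fitted, so the ids stay stable as long as the encoder is fitted once and then only applied.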

7. Run tfidf transform on train and validation set

xtrain_tfidf = tfidf.transform(train_x)
xvalid_tfidf = tfidf.transform(valid_x)

8. Run model fit and predict on training and validation set

from sklearn import metrics, linear_model, naive_bayes, svm, ensemble
import xgboost

def train_model(classifier, trains, t_labels, valids, v_labels):
    # fit the training dataset on the classifier
    classifier.fit(trains, t_labels)

    # predict the labels on validation dataset
    predictions = classifier.predict(valids)

    return metrics.accuracy_score(v_labels, predictions)

# Naive Bayes
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
print("NB accuracy: ", accuracy)  # 94%, 65%, 60%, 60%

# Logistic Regression
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
print("LR accuracy: ", accuracy)  # 96%, 84%, 94%, 100%, 97%

# SVM
accuracy = train_model(svm.SVC(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
print("SVM accuracy: ", accuracy)  # 54%, 48%, 48%

# Random Forest
accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
print("RF accuracy: ", accuracy)  # 94%, 97%, 94%, 85%

# Extreme Gradient Boosting (not from scikit-learn)
accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf.tocsc(), train_y, xvalid_tfidf.tocsc(), valid_y)
print("Xgb accuracy: ", accuracy)  # 82%, 91%, 92%

9. Choose the top model, retrain it on all data (without splitting into training and validation sets), and save it

# convert cat ("accident"/"non_accident") into category_id 0, 1
df['category_id'] = df['cat'].factorize(sort=True)[0]
labels = df.category_id
features = tfidf.transform(df.text)
model = linear_model.LogisticRegression()
model.fit(features, labels)
import pickle
pickle.dump(model, open("model.pkl", 'wb'))

Step-by-Step Detail to Run in Production

1. Get new documents

(all_documents, doctitles, docIds) = getNewArticles()

2. Run segmentation and format into a pandas data frame, as in the training step

(tokenized_documents, tokenized_document_title) = tokenizeDocs(all_documents, doctitles)
# concatenate title, body with space into tokenText
tokenText = [tokenized_document_title[i] + " " + tokenized_documents[i] for i in range(len(doctitles))]
import pandas as pd
df = pd.DataFrame({'id': docIds})
df['text'] = tokenText

3. Load the TFIDF pickle file (to match the vocabulary)

import pickle

# needed to load pickle feature_100.pkl
def tokenizersplit(text):
    return text.split()

# load the fitted tfidf
tfidf = pickle.load(open('feature_100.pkl', 'rb'))

4. Run TFIDF on new documents

features = tfidf.transform(df.text)

5. Load the saved model and run predict on new documents

import pickle
loaded_model = pickle.load(open('model.pkl', 'rb'))
y_pred = loaded_model.predict(features)

6. Display or save the result

df['tag'] = y_pred
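The model predicts numeric ids, so for display it helps to map them back to category names. A minimal sketch, assuming the ids were produced with factorize(sort=True) on the training categories as in the training steps; the article ids and predictions below are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training categories and for the model output.
train_cat = pd.Series(["non_accident", "accident", "accident"])
# with sort=True: names[0] is "accident", names[1] is "non_accident"
codes, names = train_cat.factorize(sort=True)

y_pred = np.array([1, 0, 0, 1])                 # hypothetical predictions
df = pd.DataFrame({"id": [101, 102, 103, 104]})
df["tag"] = names[y_pred]                       # map ids back to names

# keep only the accident-related articles, e.g. for a dedicated page
print(df[df["tag"] == "accident"]["id"].tolist())
```

Storing the name instead of the id also keeps the saved tags readable if the categories are extended later.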


This post shows how to use scikit-learn to classify documents into two categories. We use tfidf values, a typical approach in NLP, as features for the classifiers. We go over how to train different algorithms and compare their performance, then deploy the chosen model to production.

With this limited labeled data, we got a fairly decent accuracy of around 93%. As we process more documents in production, we can manually verify the results and add more labels to the data. Then we can train with a bigger dataset and re-evaluate the performance. We can also start to look into deep learning approaches next to see if they improve the performance. Until then, if you can read Khmer text, you can see the result here: https://domnung.com/cambodia/accident.


After we increased the number of training documents to around 500, the accuracy increased. With the XGBoost classifier, we were able to achieve 98% accuracy, so we now use this algorithm instead. Here are the results for several runs with different training/validation ratios on unigram TFIDF.

# 20% validation set
NB accuracy: 0.960784313725
LR accuracy: 0.941176470588
SVM accuracy: 0.666666666667
RF accuracy: 0.911764705882
Xgb accuracy: 0.980392156863
# 25% validation set
LR accuracy: 0.934210526316
SVM accuracy: 0.684210526316
RF accuracy: 0.960526315789
Xgb accuracy: 0.986842105263
# 30% validation set
NB accuracy: 0.947368421053
LR accuracy: 0.940789473684
SVM accuracy: 0.644736842105
RF accuracy: 0.953947368421
Xgb accuracy: 0.960526315789

See the next article on multi-class classification.



Phylypo Tum

Software Engineer and ML Enthusiast