Word Segmentation of Khmer Text Using Conditional Random Fields

Phylypo Tum
12 min read · Dec 18, 2019



Khmer (Cambodian) text does not use spaces to separate words. Characters are written from left to right consecutively, without spaces between words. Native speakers can separate words effortlessly, but teaching a computer to do so is a different challenge.

This article will discuss different approaches to segmenting Khmer text. We will start with a dictionary-based approach and then move to statistical approaches such as n-grams. Then we will cover Naive Bayes, the Hidden Markov Model, and the Maximum Entropy Markov Model. Lastly, we will cover Conditional Random Fields (CRF) and their implementation for Khmer documents.

The previous state of the art for word segmentation of Khmer documents using CRF was 98.5% accuracy, by Chea et al. [2] We hope to accomplish similar performance using different features and approaches. You can see the full code and data on GitHub, and you can run it using the free Google Colab or other Python notebooks. By making the dataset available, I hope that others can help improve the performance. See the instructions on how to open and run the code here:

https://medium.com/@phylypo/open-python-notebook-from-github-9177ab819b53

Data and Python Notebook: https://github.com/phylypo/segmentation-crf-khmer

As an introduction, I will start by looking into the details of each of the approaches below. Navigate to each one to see the details.

  1. Dictionary-based approach
  2. Probabilistic Approach using Ngram
  3. Naive Bayes
  4. Hidden Markov Model
  5. Maximum Entropy Markov Model
  6. Conditional Random Fields

To recap from the above links, we start with an approach that uses a dictionary to look up terms and match words. This resulted in algorithms like maximum matching, a greedy algorithm that repeatedly finds the longest matching word, and then maximal matching, which considers all possible combinations and chooses the segmentation with the fewest words. These algorithms can achieve acceptable performance on Khmer documents, but there is room for improvement by giving the algorithm more context, such as the nearby words. A minimal sketch of the greedy approach is shown below.
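Here is a minimal sketch of greedy maximum matching, assuming a hypothetical word list called dictionary; it illustrates the idea and is not the actual code from the earlier article.

def maximum_matching(text, dictionary, max_word_len=10):
    # greedily take the longest dictionary word starting at position i,
    # falling back to a single character when nothing matches
    words = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Example with hypothetical dictionary entries:
# maximum_matching("ពិភពលោក", {"ពិភព", "លោក"}) -> ["ពិភព", "លោក"]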

Then we saw the probabilistic approach using n-grams. Similar to the dictionary-based approach, we also add the frequency of how many times we see each word in our training data, as in the case of the unigram (a single word). Based on these frequencies, we try to find the segmentation with the highest probability. For a bigram, we use pairs of words that appear together, which provides better context. We can also use a trigram, which is a list of 3 consecutive terms. As we use more consecutive terms, we get more context from the surrounding terms, but at the same time the number of terms to look up also increases. Here you see that n-grams make use of nearby words and their frequencies. By having more context, the algorithm starts to improve. A small sketch of bigram probability estimation is shown after this paragraph.
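To make the bigram idea concrete, here is a small sketch of estimating bigram probabilities from already segmented sentences; the toy counts and variable names are illustrative, not taken from the actual training data.

from collections import Counter

unigram_counts = Counter()
bigram_counts = Counter()
# toy "training data": two segmented sentences
for sentence in [["ខ្ញុំ", "ទៅ", "ផ្សារ"], ["ខ្ញុំ", "ទៅ", "សាលា"]]:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

def bigram_prob(w1, w2):
    # P(w2 | w1) = count(w1, w2) / count(w1), with no smoothing
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

# bigram_prob("ខ្ញុំ", "ទៅ") -> 1.0, bigram_prob("ទៅ", "ផ្សារ") -> 0.5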

In the Naive Bayes model, we looked at Bayes' theorem applied to segmentation at the character level instead of the word level. Based on the probability of a character and its label (beginning, middle, or end-of-word letter), Bayes' theorem predicts the probability of the label. However, Naive Bayes makes a strong assumption that the labels are independent. For sequence data like the characters in a word, this approach is not well suited.

As an extension to Naive Bayes, a Hidden Markov Model (HMM) takes the sequence into account and finds the most probable path. HMM uses an emission matrix, as in Naive Bayes, which tells us the probability P(x|y), and a transition matrix, which gives us the probabilities of moving between states. We covered the Viterbi algorithm, which is an efficient way of finding the best path. HMM takes the sequence data into account, maximizes the joint probability P(x, y), and uses Bayes' theorem to get P(y|x). A minimal Viterbi sketch is shown below.
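For reference, here is a minimal Viterbi sketch over character labels, assuming hypothetical start_p, trans_p, and emit_p probability tables; it illustrates the dynamic-programming idea rather than the exact model from the earlier article.

def viterbi(obs, states, start_p, trans_p, emit_p):
    # best[t][s] = probability of the best label path ending in state s at step t
    best = [{s: start_p[s] * emit_p[s].get(obs[0], 1e-12) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * emit_p[s].get(obs[t], 1e-12), p)
                for p in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # trace back the most probable path from the last step
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

# Example: label each character as '1' (beginning of a word) or '0' (otherwise)
# viterbi("ពិភពលោក", ["1", "0"], start_p, trans_p, emit_p)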

The Maximum Entropy Markov Model (MEMM) extends HMM as a model that uses the conditional probability P(y|x). It is a discriminative model, unlike HMM, which is a generative model. A discriminative model uses conditional probability rather than a joint distribution. MEMM uses the maximum entropy algorithm (a general form of logistic regression), which is well suited for classification, combined with a Markov model to extend it to sequential data. But MEMM suffers from the label bias problem, where one path tends to be chosen over others due to the local normalizer.

The final algorithm introduced here is the Conditional Random Field (CRF). It is an undirected model that uses a global normalizer to overcome the label bias problem. It models the dependencies between each state and the entire observation sequence. This means the algorithm can now take into account the whole surrounding sequence of characters in the sentence.

This is the final algorithm we will focus on: how we use a Linear Chain CRF to segment Khmer text.

Implementing CRF on Khmer Documents

In a Linear Chain CRF, the main thing we implement is the feature functions. This is the most flexible approach in terms of feature engineering compared to the other algorithms. A feature function outputs 0 or 1 based on the input character, and we can throw almost anything into it. Before we go there, remember that one of the most important aspects of machine learning is data. Let's look at how we can get the training data.

Data Collection Approach

One of the big challenges in machine learning (ML) is getting and cleaning data. Some say this is 80% of all ML work. For this specific task, we have an advantage: we were able to crawl many Khmer news sites for text. But the challenge is to segment this text so it can be used for training. Many of the articles already use a zero-width space to separate words. This character (\u200B in Unicode) is not visible in a web browser, but we can use it as a separator (see the small example below). Unfortunately, it is not always used correctly, and some sites do not use it at all. So to make sure our data is clean, we need to correct it, which can be very time consuming.
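As a minimal illustration, here is how text containing zero-width spaces could be split into words; the sample string is hypothetical.

# words in this toy string are separated by invisible zero-width spaces (\u200b)
text = "ខ្ញុំ\u200bទៅ\u200bផ្សារ"
words = [w for w in text.split("\u200b") if w]
print(words)  # ['ខ្ញុំ', 'ទៅ', 'ផ្សារ']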

Fortunately, Chea et al. [2] already put a lot of effort into hand-segmenting the data for training their CRF. They produced a state-of-the-art CRF model and made it available for the public to use. We were not able to get their training documents, but we can use their tool to generate the segmented text.

We still need to hand-correct any mistakes before using this data as our training set. But the effort is a lot less, and we only attempt to do this on a subset of the data and apply the corrections to the rest of the data set.

Training Data

We crawled over a few dozen Khmer news sites. Then we ran the CRF tool from [2] to produce the segmented text. We saved both the original and segmented versions for this study; we only need the segmented version for training.

We have combined the downloadable files into different batch sizes (100, 500, 1000, 5000, and 10000 articles). Each batch is compressed to save space on GitHub here: https://github.com/phylypo/segmentation-crf-khmer/tree/master/data.

The files named *_200b.zip are a newer batch that uses the Unicode zero-width space (\u200b) instead of a space. The other files use a space. The challenge with using a space to separate Khmer text is that it gets confused with existing spaces between English text or spaces intended for other uses.

Filename Format

The file name starts with a unique 6-digit number as an id, followed by '_orig' or '_seg' to indicate the original text or the segmented text. Some newer sets of files have '_200b' in the file name to indicate that we use the \u200b Unicode character as a separator rather than a space. This new approach distinguishes the existing spaces in non-Khmer text, which we don't need to predict.

Here is an example:

  • 451807_seg_200b.txt : segmented text using 200b character
  • 451807_orig.txt : original text

Old-format files are named *_seg.txt and contain the segmented text using a space to separate words. Note that the 10000-file batch does not include the original text. A small example of listing and reading these files is shown below.
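As a small illustration, here is one way the segmented files could be listed and read after extracting a batch; the directory path is hypothetical.

import glob

# list the newer zero-width-space segmented files (assumes a batch was unzipped to data/)
seg_files = sorted(glob.glob("data/*_seg_200b.txt"))
for path in seg_files[:3]:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    words = [w for w in text.split("\u200b") if w]
    print(path, len(words), "words")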

Data Distribution

We categorize the articles into different topics based on the URL of the site. Here is the distribution of the different news topics:

This is the breakdown of the different news sites we crawled in the 100-article batch.

Here is the distribution of the number of characters per word. The main cluster is around 2.5 to 7 characters per word. Note that each subscript is counted as 2 characters due to how the Unicode encoding works.

These figures are based on the 100-article batch.

Feature Engineering Used in [2]

As part of the machine learning process, we need to identify the features that the algorithm can use to make predictions. The feature list will feed our feature functions in CRF. In [2], Vichet Chea et al. use 10 features to identify the current character:

They use 5 different tags to label the training text:

Below is an example of the boundary notation from [2]. Each word is inside curly brackets '{}'. Within a word, a suffix is denoted by ^ and a prefix by ~.

Khmer Character Cluster (KCC)

Before I detail our approach to feature engineering, let us look at a concept generally used in Khmer document segmentation called the Khmer Character Cluster, or KCC. A KCC is an inseparable sequence of Khmer characters; this unit must stay together. For example, a Khmer vowel cannot stand by itself. It must come after a consonant, so a KCC in this case is a consonant followed by a vowel. Another case is that a subscript must be preceded by a base character such as a consonant. The subscript cannot stand by itself and must always come after a base character.

Here is a complete rule:

<C|I> + [<Robat | Regshift>] +
{COEUNG + <C + [Regshift] | I + [Regshift]>} +
[[<ZWJ|ZWNJ>] + V] + {S} + [ZWJ + COEUNG + <C | I>]

Source: Detection and Correction of Homophonous Error Word for Khmer Language

It's easier to see with an example:

Different types of segmentation (Tum's Thesis, 2007)

KCC vs. Character Level

We want to test features at the character level versus the KCC level. Since we never separate characters within a KCC, it makes sense to use KCC, as it also provides more information.

To our surprise, the accuracy results show that the character-level approach has better accuracy than KCC, with 97.5% versus 97.4% on 100 documents with a 20% test set. Upon looking deeper, we realized that this accuracy measures the ratio of the number of correct labels over the total number of labels. Since KCC combines several characters into one, it has fewer labels. Thus, if the algorithm using KCC makes the same number of mistakes as the character-level one, it gets a lower accuracy.

A better metric is word accuracy: how many words were correctly segmented. With this, we see that KCC has a better accuracy of 96.9% versus 95.4% for the character level. So we will use KCC in our final implementation. A sketch of how such a word-level metric could be computed is shown below.
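Here is one possible sketch of a word-level metric, counting how many of the reference words appear with exactly the same boundaries in the prediction; the function is illustrative and not the exact evaluation code used here.

def word_accuracy(gold_words, pred_words):
    # compare words by their character offsets in the unsegmented text
    def spans(words):
        out, pos = set(), 0
        for w in words:
            out.add((pos, pos + len(w)))
            pos += len(w)
        return out
    gold, pred = spans(gold_words), spans(pred_words)
    return len(gold & pred) / len(gold)

# word_accuracy(["ពិភព", "លោក"], ["ពិភពលោក"]) -> 0.0
# word_accuracy(["ពិភព", "លោក"], ["ពិភព", "លោក"]) -> 1.0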

Our implementation of KCC does not reflect the full KCC rule. We made it a lot simpler in order to tolerate any mistyping. In addition, we added a feature to cluster numbers and non-Khmer text as one cluster, since we don't need the algorithm to segment those. A simplified sketch of such a clustering rule follows.
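A simplified clustering rule along these lines could look like the following sketch; the regular expression and function name are illustrative and do not reproduce the repository's actual seg_kcc implementation.

import re

# base character (consonant or independent vowel), optional subscripts (coeng + base),
# then any dependent vowels and signs; numbers and non-Khmer runs form one cluster
BASE = "[\u1780-\u17B3]"
COENG = "\u17D2"
SIGN = "[\u17B6-\u17D1\u17D3\u17DD]"
KCC_RE = re.compile(f"{BASE}(?:{COENG}{BASE})*{SIGN}*|[^\u1780-\u17FF]+|.")

def seg_kcc_simple(text):
    return KCC_RE.findall(text)

# seg_kcc_simple("ក្រុមលេខ 123") -> ['ក្រុ', 'ម', 'លេ', 'ខ', ' 123']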

Features

We have 3 main sets of features with the following feature names:

  1. ‘kcc’ : a string of the current Khmer character cluster
  2. ‘t’: type/size of characters, if kcc then output size of the KCC
  3. ‘ns’: indicates if the text is of type ‘no space’

In addition, we also look at the 3 clusters before the current one and the 3 clusters after. This is equivalent to a 7-gram.

Here is a sample feature output: create_kcc_features(seg_kcc("កខ"))

[
{'kcc': 'ក', 't': 'C', 'ns': False, 'BOS': True, 'kcc[+1]': 'ខ', 'kcc[+1]t': 'C', 'kcc[+1:0]': 'កខ', 'ns+1': False},
{'kcc': 'ខ', 't': 'C', 'ns': False, 'kcc[-1]': 'ក', 'kcc[-1]t': 'C', 'kcc[-1:0]': 'កខ', 'ns-1': False, 'EOS': True}
]

Notice that kcc[+1] is the next cluster and kcc[-1] is the previous one. The value for 'ns' is either True or False. 'BOS' marks the beginning of a sentence, while 'EOS' marks the end of the sentence.
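To show how such dictionaries could be produced, here is an illustrative reconstruction with hypothetical helper names (kcc_type, is_no_space, create_kcc_features_simple); it mirrors the two-cluster sample above and only looks at the immediate neighbors rather than the full window of 3 clusters, and it is not the repository's exact create_kcc_features code.

def kcc_type(kcc):
    # hypothetical type tag: 'C' if the cluster starts with a Khmer consonant, 'N' otherwise
    return "C" if "\u1780" <= kcc[0] <= "\u17A2" else "N"

def is_no_space(kcc):
    # hypothetical 'no space' flag: True for clusters (numbers, non-Khmer text) we never segment
    return not ("\u1780" <= kcc[0] <= "\u17FF")

def create_kcc_features_simple(kccs):
    features = []
    for i, kcc in enumerate(kccs):
        f = {"kcc": kcc, "t": kcc_type(kcc), "ns": is_no_space(kcc)}
        if i == 0:
            f["BOS"] = True
        else:
            f.update({"kcc[-1]": kccs[i-1], "kcc[-1]t": kcc_type(kccs[i-1]),
                      "kcc[-1:0]": kccs[i-1] + kcc, "ns-1": is_no_space(kccs[i-1])})
        if i == len(kccs) - 1:
            f["EOS"] = True
        else:
            f.update({"kcc[+1]": kccs[i+1], "kcc[+1]t": kcc_type(kccs[i+1]),
                      "kcc[+1:0]": kcc + kccs[i+1], "ns+1": is_no_space(kccs[i+1])})
        features.append(f)
    return features

# create_kcc_features_simple(["ក", "ខ"]) produces dictionaries similar to the sample above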

Tools for CRF

We are not going to implement the Linear Chain CRF algorithm from scratch but rather use existing tools or libraries. There are several popular CRF toolkits:

Several Python libraries provide bindings to CRFsuite, including python-crfsuite and sklearn-crfsuite. We tried both and they have similar performance, but we prefer sklearn-crfsuite since it follows the familiar scikit-learn usage format. A minimal training sketch with sklearn-crfsuite is shown below.
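Here is a minimal sklearn-crfsuite training sketch; it assumes X_train/y_train and X_test/y_test (lists of per-sentence feature dictionaries and labels, as above) are already prepared, and the hyperparameter values shown are illustrative rather than the exact settings used in the notebook.

import sklearn_crfsuite
from sklearn_crfsuite import metrics

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",             # L-BFGS optimization
    c1=0.1, c2=0.1,                # L1 and L2 regularization
    max_iterations=100,            # number of training iterations
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)

y_pred = crf.predict(X_test)
print(metrics.flat_accuracy_score(y_test, y_pred))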

CRF++ is a command-line tool where features are defined in template files and the training data and output are written to files. That was the toolkit used in the NIPTICT Khmer segmentation paper [2].

The tool we use to segment our training data is the NIPTICT Khmer Word Segmentation tool using CRF: https://niptict.edu.kh/khmer-word-segmentation-tool/.

Performance and Analysis

Our final results are based on 10K documents with a 20% test split:

  • The training set has 58,324 sentences, with an accuracy of 99.98%.
  • The test set has 14,581 sentences, with an accuracy of 99.74%.

So we take that as the final result: 99.7% accuracy, or a 0.3% error rate.

CRF can be trained with a different number of iterations. The more iterations, the better the performance, up to a certain point. Here is the performance for different numbers of CRF training iterations on 9K documents.

With different numbers of iterations, we also looked at the size of the model (the size of the pickle file that saves the state of the model). The higher the number of iterations, the smaller the file size. This indicates that many of the parameters may have been dropped as the number of iterations increased.

We also look at the model size based on the number of documents we train on. The larger the training set, the bigger the model. This indicates that more parameters are needed to fit a larger number of documents.

Comparison to [2]

Our implementation here has an error rate of 0.3%, while the performance from [2] is 98.5% accuracy, or a 1.5% error rate. Our implementation is significantly better in terms of performance, but it only considers 2 tags (1: the beginning of a word, 0: otherwise), while Vichet Chea et al. [2] use 5 different tags to distinguish prefixes, suffixes, and compound words. So there are more options for the algorithm in [2] to learn and predict, which makes its task much harder.

However, [2] shows that the experimental results for 2 tags vs. 5 tags have very similar F-scores, implying that the two setups do not differ much in performance.

Conclusion

We have covered a full suite of word segmentation algorithms. The performance of Linear Chain CRF is incredibly high with an error rate of 0.3%.

With the dataset and source code available on GitHub, I'm looking forward to seeing further improvements in word segmentation for Khmer documents. We only have a dataset from one domain, but I hope others can share their data for this kind of research.

Github project: https://github.com/phylypo/segmentation-crf-khmer

See the instructions on how to open and run the code in Google Colab here:

https://medium.com/@phylypo/open-python-notebook-from-github-9177ab819b53

References:

[1] Chea Sok Huor, Top Rithy, Ros Pich Hemy, Vann Navy, “Word Bigram Vs Orthographic Syllable Bigram in Khmer Word Segmentation”, Jun 2015.

[2] Vichet Chea, Ye Kyaw Thu, Chenchen Ding, Masao Utiyama, Andrew Finch, Eiichiro Sumita. Khmer Word Segmentation Using Conditional Random Fields. Research and Development Center, NIPTICT, Phnom Penh, Cambodia. Dec 2015.
