Creating a Khmer Language Model using BERT


Computer scientists have long been trying to create artificial intelligence (AI) programs that can understand human language. We saw progress in 2011, when a machine challenged humans on the game show Jeopardy. Watson, a program made by IBM, won against two human champions. This showed that a machine can learn the nuances of the English language well enough to play the game. The creators of this system took many years to train and refine the algorithms specifically to play this game.


One recent advance in this area is a generalized approach in which an algorithm learns a language without being explicitly programmed with language-specific rules. A notable example is BERT, created by Google in 2018, which can perform question-answering tasks better than humans on a standard benchmark. BERT, in this case, is not a Sesame Street character but a machine learning (ML) algorithm whose name stands for “Bidirectional Encoder Representations from Transformers”; it is built on an underlying architecture called the “transformer”. This approach can train on a very large amount of text from Wikipedia and books. We just feed all of this text into the algorithm. From the data, the machine is able to learn the lexical (vocabulary), syntactic (structure of the text), and semantic (content or meaning of the text) aspects of the language without us telling it what they are.


BERT was shown to perform question-answering tasks on a given text better than humans. The benchmark used is the Stanford Question Answering Dataset (SQuAD), on which BERT scored 87% on exact-match answers versus 82% for humans.

SQuAD 1.1 Leaderboard. Retrieved January 20, 2019, from
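
To make the "exact match" score concrete, here is a minimal sketch of how such a metric can be computed. It is modeled loosely on the usual SQuAD-style normalization (lowercasing, dropping punctuation and English articles), but the function names and the toy answers are my own assumptions for illustration, not the official evaluation script.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the prediction equals any reference answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def exact_match_score(predictions: List[str], references: List[List[str]]) -> float:
    """Percentage of questions whose prediction is an exact match."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)


# Toy, made-up predictions and reference answers (not real SQuAD data).
toy_predictions = ["IBM", "in 2018", "The transformer"]
toy_references = [["IBM"], ["2011"], ["transformer", "a transformer"]]
toy_score = exact_match_score(toy_predictions, toy_references)
```

Each question can have several acceptable reference answers, and a prediction counts only if it matches one of them exactly after normalization. That strictness is why exact match is a harder number to score well on than a partial-overlap measure.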

BERT on the Khmer Language

Now we will train this algorithm on the Khmer language. We will evaluate whether the computer understands the text using a task called document classification: given a document, the computer has to identify which class or category the document belongs to. A popular example is predicting whether an email is spam or not.

Step 1: Pre-train the Language Model

In step 1, we feed a large corpus of Khmer text into the algorithm. The text must be in a form that the algorithm can split into words. Since Khmer does not use spaces to separate words, we need an extra processing step that helps the computer determine which sequences of letters form a word. This is called “word segmentation”. We will use an algorithm called “CRF” (conditional random field), which has been shown to perform well for the Khmer language. See my previous article for details.
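
A CRF treats word segmentation as a labeling problem: each character gets a label saying whether it starts a new word. The sketch below shows only that data framing, not a trained model. The label scheme (1 for a word start, 0 for a continuation), the `char_features` helper, and the toy sentence are assumptions for illustration; the actual features and units used in my previous article may differ (for example, a real system might label Khmer character clusters rather than single code points).

```python
from typing import Dict, List, Tuple


def words_to_labels(words: List[str]) -> Tuple[List[str], List[int]]:
    """Turn a hand-segmented sentence into (characters, labels).

    Label 1 marks the first character of a word, 0 marks a continuation.
    """
    chars: List[str] = []
    labels: List[int] = []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append(1 if i == 0 else 0)
    return chars, labels


def labels_to_words(chars: List[str], labels: List[int]) -> List[str]:
    """Rebuild words from characters and (predicted) word-start labels."""
    words: List[str] = []
    for ch, label in zip(chars, labels):
        if label == 1 or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


def char_features(chars: List[str], i: int) -> Dict[str, str]:
    """Hypothetical per-character features that a CRF could consume."""
    return {
        "char": chars[i],
        "prev": chars[i - 1] if i > 0 else "<s>",
        "next": chars[i + 1] if i < len(chars) - 1 else "</s>",
    }


# Toy sentence, roughly "I love the Khmer language", segmented by hand.
segmented = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]
chars, labels = words_to_labels(segmented)
unsegmented = "".join(chars)  # raw Khmer text: no spaces between words
features = [char_features(chars, i) for i in range(len(chars))]
restored = labels_to_words(chars, labels)
```

In training, the CRF sees `features` paired with `labels` from hand-segmented text. At prediction time it receives only the features of `unsegmented` text and outputs labels, which `labels_to_words` turns back into words that can be joined with spaces before pre-training.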

Step 2: Finetune BERT for Classification Task

In step 2, we use our labeled data: 820 news articles that we labeled as traffic accident-related or not. Label 1 means the article is traffic accident-related and label 0 means it is not. We feed the model each article's text together with its label. We extend the model with extra architecture so that it can perform the classification task.
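
As a rough illustration of what this extra architecture can look like, the sketch below shows a small classification head in plain Python: a linear layer that turns a document vector into one score per label, a softmax that turns the scores into probabilities, and the cross-entropy loss that finetuning would minimize. The vector size, the weights, and all names here are hypothetical stand-ins; this is not the exact head or training code used for the Khmer model.

```python
import math
from typing import List


def linear(vector: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One logit per class: dot(weights[c], vector) + bias[c]."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities (shifted by the max for stability)."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], label: int) -> float:
    """Negative log-probability assigned to the true label."""
    return -math.log(probs[label])


# Hypothetical 4-dimensional document vector standing in for the encoder
# output (for comparison, BERT-base produces 768-dimensional vectors).
doc_vector = [0.5, -1.0, 0.25, 2.0]
# Hypothetical head parameters: row 0 scores label 0, row 1 scores label 1.
head_weights = [[0.1, 0.2, 0.0, -0.3], [-0.1, -0.2, 0.0, 0.3]]
head_bias = [0.0, 0.0]

logits = linear(doc_vector, head_weights, head_bias)
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
loss = cross_entropy(probs, label=1)  # label 1 = traffic accident-related
```

During finetuning, the loss is computed for each labeled article and used to adjust both the head and the pre-trained model, so the language knowledge from step 1 is reused rather than learned from scratch on only 820 examples.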


We have seen that in 2018 the BERT algorithm surpassed human performance on the question-answering task of the SQuAD 1.1 dataset. We have also seen that we can pre-train BERT on Khmer text so that it learns the structure of the language. We then finetune the model on news articles that we labeled as traffic accident-related or not. Our results show that the model can predict whether an article is traffic accident-related with very high accuracy.

Other Notes

It would be a better illustration to have the algorithm generate Khmer text to demonstrate its understanding of the language. Unfortunately, BERT is not meant for that. Another algorithm, ULMFiT, which I trained on a similar dataset, is able to generate text. Given some words, it predicts the next few words, which make sense and are even grammatically correct. See the site that I made with this other algorithm here:



Phylypo Tum

Software Engineer and ML Enthusiast