Creating a Khmer Language Model using BERT

Apr 11, 2020 · 11 min read

An Introduction to NLP for Non-Technical Readers

I have been spending some time experimenting with machine learning for the Khmer language and started to play with one of the latest advances in artificial intelligence, called BERT. In this article, I give a brief history of and introduction to natural language processing (NLP) to give you an idea of what the algorithm is, but more importantly I describe in detail the process it goes through in learning the language. You can read more detailed technical approaches about the different algorithms here.

This article is also available in Khmer.

Introduction

Computer scientists have long been trying to create an artificial intelligence (AI) program that can understand human language. We saw progress in 2011, when a machine challenged humans in the game of Jeopardy. Watson, a program made by IBM, won against two human champions. This showed that a machine can learn the nuances of the English language well enough to play the game. The creators of this system took many years to train and refine the algorithms specifically to play this game.

BERT

One recent advance in this area is a generalized approach in which an algorithm learns a language without being explicitly programmed with language-specific rules. One such advance is BERT, created by Google in 2018, which can perform question answering tasks on a benchmark dataset better than humans. BERT, in this case, is not a Sesame Street character but a machine learning (ML) algorithm whose name stands for “Bidirectional Encoder Representations from Transformers” and which uses an underlying algorithm called the “transformer”. This approach can train on a very large amount of text from Wikipedia and books. We just feed all this text into the algorithm. With the data, the machine is able to learn the lexical (vocabulary), syntactic (structure of the text), and semantic (content or meaning of the text) aspects of the language without us telling it what they are.

We won’t go into detail about how the algorithm works, but we will mention its structure. This type of machine learning approach is called “deep learning”, which refers to an architecture that mimics how we think our brain works at the neuron level. A set of neurons in our brain takes in some information and triggers other neurons downstream in multiple layers. Geoffrey Hinton, a professor at the University of Toronto who is often called the godfather of deep learning, helped popularize this type of machine learning in computer vision in 2012 with the AlexNet algorithm, which identified objects in images more accurately than earlier approaches in the ImageNet competition. This competition was held regularly to highlight improvements in new algorithms.

A popular approach in computer vision is to first train the model on a large set of images that are already labeled, such as ImageNet, which has 14 million images in about 20,000 different classes (different objects in the images). You then take this partly trained model and continue training it on the images that you want to analyze. This gives better accuracy than training only on the final set of images. This approach is called transfer learning.

BERT uses the same transfer learning approach by pre-training the model on a large amount of text. Text is easy to get from the Internet, for example from Wikipedia, and it requires no labeling work, unlike the computer vision approach, where a human has to identify the object in each image.

With this large amount of text (800 million words from books and 2.5 billion words from English Wikipedia), BERT creates its own task: it feeds each sentence into the algorithm with some of the words removed. These missing words are called “masked” words. It then tries to predict the missing words. By masking some input words and learning how to predict them, the algorithm learns the structure of the sentence and some form of each word’s meaning.

This advance is due in part to faster computers that can process larger amounts of text. In addition, the underlying “transformer” algorithm can be parallelized better than earlier algorithms. With this combination, along with a deep architecture that uses many layers and over 300 million parameters, the algorithm became the state of the art of its time.

This BERT architecture can be applied to many languages besides English as long as you have enough text to train on. In fact, Google has released pre-trained models covering about 100 languages, namely the 100 languages with the largest Wikipedias. Khmer is not one of them, since it ranked 155th among all languages by number of Wikipedia articles.

SQuAD

BERT was shown to perform a question answering task on a given text better than humans. The dataset used is called the Stanford Question Answering Dataset (SQuAD). BERT scored 87% accuracy, versus 82% for humans, on exact-match answers.

SQuAD is a dataset created by Stanford University for the AI community to test how well a machine can understand text. Several questions are asked about a given text. To answer correctly, the human or program has to understand the structure of the text.

For example: (Text passage on left, questions and the correct answers are on the right).

In the second question of this example, the answer is not in the same sentence as the question, so the computer has to figure out the answer from the previous sentence.

BERT on the Khmer Language

Now we will train this algorithm on the Khmer language. We will evaluate whether the computer understands the text using a task called document classification: given a document, the computer has to identify which class or category the document belongs to. A popular example is predicting whether an email is spam or not.

Earlier, we created a Khmer news portal website that takes headlines from many Khmer sites and shows them on one page. In the process, we realized that there were so many traffic accident-related articles that they overwhelmed other important headlines. As a result, we created an ML classification task to identify which Khmer news articles are about traffic accidents and then group them into their own section so that they do not overwhelm the front page. We will use the same task to see whether the Khmer BERT algorithm can classify those articles well.

We chose 820 articles and labeled each of them as ‘1’ if it is about a traffic accident or ‘0’ if it is not. This will be our training dataset.

Example text and its label as ‘accident’ or not.

This process has two steps. In step one, we feed a lot of Khmer text into the BERT algorithm so that it can learn the structure of the Khmer language. This is called pre-training the language model. In step two, we feed in the news articles that we chose, along with their labels, so that the algorithm learns to distinguish between class 0 and class 1. This is called fine-tuning for a classification task.

Step 1: Pre-train the Language Model

In step 1, we feed a large corpus of Khmer text into the algorithm. The algorithm must be able to split the text into words. Since Khmer does not use spaces to separate words, we need an extra processing step that helps the computer determine which sequences of letters form a word. This is called “word segmentation”. We will use an algorithm called “CRF” (conditional random field), which has been shown to perform well for the Khmer language. See my previous article for details.

At this point, all of our text has spaces between words. The algorithm then generates a list of all the distinct words seen in the articles. Each word is assigned a unique number that the algorithm will use.
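To make the idea of word segmentation concrete, here is a minimal sketch of the last stage only. It assumes a segmentation model has already decided, for each unit of text, whether that unit starts a new word; the CRF model itself is not shown, and the function name `apply_boundaries` and the Latin placeholder text are hypothetical, chosen only for readability.

```python
def apply_boundaries(units, starts_word):
    """Join text units into space-separated words.

    units:       list of text units (single letters in this toy example)
    starts_word: list of 0/1 flags, 1 if the unit begins a new word
                 (assumed to come from a trained segmentation model)
    """
    words = []
    for unit, is_start in zip(units, starts_word):
        if is_start or not words:
            words.append(unit)   # open a new word
        else:
            words[-1] += unit    # extend the current word
    return " ".join(words)


# Placeholder text written without spaces, like unsegmented Khmer.
segmented = apply_boundaries(list("thecatsat"), [1, 0, 0, 1, 0, 0, 1, 0, 0])
print(segmented)  # the cat sat
```

The hard part in practice is predicting the flags, which is the job of the segmentation model.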

For example:

Input Text: “ឃាត់ ជនសង្ស័យ ម្នាក់ បន្ទាប់ពី”
Tokenized: [‘ឃាត់’, ‘ជនសង្ស័យ’, ‘ម្នាក់’, ‘បន្ទាប់ពី’]
Token IDs: [1585, 1438, 1062, 1482]

In the example above, our segmentation process first splits the text into 4 words. The text then turns into a series of numbers, each one the number assigned to that word in the dictionary we generated previously. This series of numbers becomes the input to the algorithm.
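The dictionary step can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper names `build_vocab` and `encode` and the list of special tokens are hypothetical, and the IDs it produces will not match the ones shown above, which come from a vocabulary built over the full corpus.

```python
def build_vocab(segmented_texts,
                specials=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """Assign each distinct word a unique integer ID, in order of first appearance."""
    vocab = {token: i for i, token in enumerate(specials)}
    for text in segmented_texts:
        for word in text.split():      # text is already segmented with spaces
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Turn a segmented sentence into a list of IDs; unknown words map to [UNK]."""
    unk = vocab["[UNK]"]
    return [vocab.get(word, unk) for word in text.split()]


corpus = ["ឃាត់ ជនសង្ស័យ ម្នាក់ បន្ទាប់ពី"]
vocab = build_vocab(corpus)
ids = encode(corpus[0], vocab)
print(ids)  # [5, 6, 7, 8] with this tiny one-sentence vocabulary
```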

In this pre-training phase, the algorithm first chooses 15% of the words and marks each of them as a “mask”. Then it learns to predict the masked words. As it compares each prediction to the actual word, it goes back and adjusts its weights or parameters (the values on its connections) so that the prediction moves closer to the correct word. It then takes the next input and readjusts its weights again. It does this again and again until all the text has been processed.

This is the part that takes a long time to run when you have a lot of data. Google’s pre-trained BERT model for English ran over more than 3 billion words, which took about 54 hours on Google hardware (TPUs) to complete. In our case, we feed in only a small set of data by comparison. It took us only a few minutes for 334 thousand words from 1,000 Khmer news articles.

This training process allows the system to learn the contextual structure of sentences and how the words relate to each other.
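The masking step can be sketched as follows. This is a simplified illustration, not the exact procedure in the BERT code: the real algorithm replaces only most of the chosen words with the mask token and leaves or randomizes the rest, while this sketch always masks. The value of `MASK_ID` and the function name `mask_tokens` are assumptions for the example.

```python
import random

MASK_ID = 4  # assumed ID of the [MASK] token in the vocabulary


def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """Hide about mask_prob of the tokens and remember what was hidden.

    Returns (masked_ids, targets), where targets maps each masked
    position to the original token ID the model should predict.
    """
    rng = random.Random(seed)
    n_mask = min(len(token_ids), max(1, round(len(token_ids) * mask_prob)))
    positions = rng.sample(range(len(token_ids)), n_mask)
    masked = list(token_ids)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]   # the "answer" for this position
        masked[pos] = MASK_ID        # what the model actually sees
    return masked, targets


sentence = list(range(100, 120))     # 20 made-up token IDs
masked, targets = mask_tokens(sentence, seed=0)
print(masked)
print(targets)
```

With 20 tokens and a 15% rate, 3 tokens are hidden. During training, the model's guesses at those 3 positions are compared against `targets`.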

Step 2: Finetune BERT for Classification Task

In step 2, we use our labeled data: the 820 news articles that we labeled as traffic accidents or not. Label 1 means the article is traffic accident-related and label 0 means it is not. We feed the algorithm each article’s text and its label. We also extend the model with extra architecture so that it can do the classification task.

We do not program anything specific about traffic accidents. The machine does not know anything about what label 1 or label 0 means. It has to infer the differences between articles labeled 0 and those labeled 1.

Here is one possible way the computer might try to solve this task. It may learn that articles in category 1 contain the words “car” or “motorcycle”. But we know that some articles, such as those about a “car show” or “motorcycles on sale”, are not labeled 1. So it needs to learn that such articles, even though they contain the words “car” or “motorcycle”, are not labeled 1. It might find that articles with label 1 mention death or injury, but we know that articles about robbery and murder are labeled 0, so it has to learn to give those label 0. The algorithm will try to learn features of the sentences, not just the individual words identified above, in order to distinguish between the categories.

In fact, the approach just mentioned, looking only at words, is an older approach called “bag of words”, where the program uses the counts of the different words in each article to make the classification. This approach does not perform well since it does not take into account word order or sentence structure, which can be important in document classification.
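As a rough picture of what the “extra architecture” could look like, here is a minimal sketch that assumes it is a single linear layer followed by a softmax, applied to one vector that summarizes the article. That is a common way to build a classifier on top of a language model, but it is my assumption here, and the numbers and the name `classify` are made up for illustration.

```python
import math


def classify(sentence_vector, weights, biases):
    """Map a summary vector to probabilities for class 0 and class 1."""
    # One score (logit) per class: a weighted sum of the vector plus a bias.
    logits = [
        sum(w * x for w, x in zip(row, sentence_vector)) + b
        for row, b in zip(weights, biases)
    ]
    # Softmax turns the scores into probabilities that add up to 1.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# A made-up 3-number summary of an article, and made-up weights.
vector = [0.2, -0.1, 0.4]
weights = [[0.0, 0.0, 0.0],   # weights for class 0 (not an accident)
           [1.0, 0.0, 1.0]]   # weights for class 1 (accident)
probs = classify(vector, weights, [0.0, 0.0])
print(probs)
```

Fine-tuning adjusts these weights, together with the weights of the language model underneath, so that the higher probability lands on the correct label.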
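A tiny example shows why counting words loses information. The sketch below uses English placeholder words for readability; the function name `bag_of_words` is just for illustration.

```python
from collections import Counter


def bag_of_words(segmented_text):
    """Count how many times each word appears, ignoring word order."""
    return Counter(segmented_text.split())


a = bag_of_words("car hits motorcycle")
b = bag_of_words("motorcycle hits car")
print(a)
print(a == b)  # True: the two sentences look identical to a bag of words
```

The two sentences describe different events, but once the text is reduced to word counts there is no way to tell them apart.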

To continue with our training process, the first step is to separate our articles into a training set and a validation set. The goal of the validation set is to set aside some articles so that the algorithm does not use them during training. They can then be used at the end to verify how well the algorithm performs. We set aside 10% of the 820 articles as the validation set and use the remaining 90% for training. That gives 738 articles for training and 82 for validation.

Training set: 738
Validation set: 82
Total Documents: 820
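The split itself is simple arithmetic plus a shuffle. Here is a minimal sketch; the function name, the fixed seed, and the use of plain numbers in place of real articles are assumptions for the example, not the code from the repository.

```python
import random


def train_validation_split(items, validation_fraction=0.1, seed=42):
    """Shuffle the items and set aside a fraction of them for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


articles = list(range(820))          # stand-ins for the 820 labeled articles
train, val = train_validation_split(articles)
print(len(train), len(val))          # 738 82
```

Shuffling before splitting matters because articles collected in order might otherwise put all of one kind into the validation set.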

In the first step, the algorithm takes a set of input data and tries to adjust its weights to make its predictions better. After it goes through all 738 training articles, the amount of error (loss) is about 0.7, as shown by the blue line below. One pass through the whole training set is known as one epoch. At this point, we test how well the model predicts the validation set. The accuracy is about 50%, as indicated by the red line in the chart below. This is not good: it is the same accuracy we would get by guessing the answer at random.

Now we let it go through the same data again and readjust its weights. It gets about the same loss over the next 5 epochs and performs about the same on the validation set, at around 50% accuracy.

Then the sixth epoch shows a somewhat better prediction: the validation set shows about 70% accuracy. This is a lot better than before but not great yet. In the seventh epoch, the accuracy goes up to over 90%, and by the ninth epoch it is close to 99%. In the 17th epoch it reaches 100%, then stays at 99% for the rest of the epochs. This accuracy beats our previous models.
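A loss of about 0.7 fits with the 50% accuracy. Assuming the loss is the usual cross-entropy for a two-class problem (the article does not name the loss function, so this is my assumption), a model that always says “50/50” gets a loss of about 0.69 on every article:

```python
import math


def binary_cross_entropy(label, predicted_prob_of_1):
    """Loss for one example: the negative log of the probability given to the true label."""
    p_true = predicted_prob_of_1 if label == 1 else 1.0 - predicted_prob_of_1
    return -math.log(p_true)


loss = binary_cross_entropy(1, 0.5)   # the model is completely unsure
print(round(loss, 3))                 # 0.693
```

So a loss near 0.7 means the model has not yet learned anything useful for telling the two classes apart. The loss drops toward 0 as the model gives higher probability to the correct label.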
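Accuracy here is just the share of validation articles that the model labels correctly. The sketch below uses made-up labels and predictions (not our real data) to show what a single mistake looks like on a validation set of 82 articles.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical validation labels: 82 articles, half of them accidents.
labels = [1] * 41 + [0] * 41
predictions = list(labels)
predictions[0] = 0                    # pretend the model gets one article wrong

acc = accuracy(predictions, labels)
print(f"{acc:.1%}")                   # 98.8%
```

With only 82 validation articles, each mistake moves the accuracy by a little over one percentage point, which is why the chart jumps between 99% and 100%.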

Conclusion

We have seen that in 2018 the BERT algorithm surpassed human capability in question answering on the SQuAD 1.1 dataset. We have seen that we can pre-train BERT with Khmer text so that it can learn the structure of the language. We then fine-tuned the algorithm with news articles that we labeled as traffic accident-related or not. Our results show that the model can predict whether an article is about a traffic accident with very high accuracy.

We may not know the exact approach that the algorithm used to distinguish whether an article is about a traffic accident or not. But our results show that it was able to do so with a high accuracy of 99%. It misclassified only about 1 in every 100 articles.

Does this mean that the algorithm understands the Khmer text? That depends on how you look at it. All we feed into the algorithm are numbers that reference the Khmer words in our dictionary list. You can say it does not know anything about any of the Khmer words. But as we feed in a lot of these numbers, which signify Khmer words and their order, the algorithm learns some of the structure of the numbers that correspond to our Khmer words. It knows enough about that structure to be able to tell whether an article is traffic accident-related or not.

You can see the full code, which includes how we train a Khmer language model from scratch:

https://github.com/phylypo/khmer-text-data/tree/master/bert-pretrain-from-scratch

You can find other datasets that you can use here: https://github.com/phylypo/khmer-text-data.

Other Notes

It would be a better illustration to have the algorithm generate Khmer text to show its understanding of the language. Unfortunately, BERT is not meant for that. Another algorithm that I trained, ULMFiT, uses a similar dataset and is able to generate text. Given some words, it predicts the next few words in a way that makes sense and is even grammatically correct. See this site that I made with that algorithm: http://ml.tovnah.com/khmer-ulmfit/.