Previously, we implemented a machine learning (ML) approach to classify Khmer news articles about traffic accidents for our website. Using TF-IDF with the XGBoost algorithm, we achieved 98% accuracy. Now we will try a Khmer language model to see whether it can achieve a similar or better result.
Before we explain the details of the approach, we will introduce several natural language processing (NLP) algorithms, from the early days to more recent models such as BERT. Then we will go into the details of building our first Khmer language model. The source code and a web utility that demonstrates the model are included.
This article will cover:
- Context-independent: Bag of Words, TF-IDF, Word2Vec, GloVe
- Context-aware: ELMo, BERT, ULMFiT
- Khmer language model using ULMFiT
For other algorithms that recently came out, you can see my survey here:
A Survey of the State-of-the-Art Language Models up to Early 2020
Our main goal is to create an algorithm that identifies whether an article is about a traffic accident, much like a typical machine learning classifier for spam email.
A simple approach is to search each article for specific string patterns. For example, we might look for the word ‘car’ or ‘motorcycle’ together with ‘accident’. This approach requires us to program all of the rules, and making them work across many different articles can be very challenging. It is not considered a machine learning (ML) approach, since we program all the rules explicitly.
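As a minimal sketch of such a hand-written rule (the function name and the sample sentences are made up for illustration, and the English keywords stand in for their Khmer equivalents), the check might look like this:

```python
def looks_like_traffic_accident(text: str) -> bool:
    """Hand-written rule: the text must mention a vehicle word and 'accident'."""
    words = text.lower().split()
    has_vehicle = "car" in words or "motorcycle" in words
    return has_vehicle and "accident" in words


# Hypothetical sample sentences, not taken from the real dataset.
samples = [
    "A car was involved in an accident near the market",
    "The car dealership opened a new showroom",
    "Two trucks collided on the highway",
]
for sample in samples:
    print(looks_like_traffic_accident(sample), "-", sample)
```

The third sentence describes a collision but matches none of the keywords, which shows how quickly hand-written rules fall short.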
Bag of Words (BoW)
An ML approach is to train the computer to create rules from the data we provide. The data is the list of articles along with labels saying whether each article is about a traffic accident or not. A simple ML approach is to use a bag of words (BoW). First, we tokenize all the documents into lists of words. Then we create a dictionary of all the words that we see in the articles. This word list does not account for word order, which is why we call it a bag of words. We keep track of the count of each word for each article, so we have something like the table below.
ID\Word     apple  buy  car  crash  motorcycle  …  zoo  Label
article_1:  0      2    0    2      1           …  0    accident
article_2:  3      0    1    1      1           …  2    not_accident
article_3:  11     13   44   1      30          …  43   not_accident
These word counts become the features that an algorithm can use to build the rules. As an example rule, if an article has at least two counts of the word ‘car’ or one count of the word ‘motorcycle’, and also two counts of the word ‘crash’, then it is a traffic accident article. So article_1 fits those criteria but article_2 does not, which is correct so far. Quite a few algorithms can help us create these rules, such as logistic regression or random forest. We won’t go into detail on the algorithms, but you can see the list in this previous post.
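To make the example rule concrete, here is a small sketch that applies it to the word counts from the table. The function names are illustrative only, and in a real ML approach the rule would be learned by the algorithm rather than written by hand:

```python
from collections import Counter


def bag_of_words(tokens):
    """Count how often each word occurs; word order is discarded."""
    return Counter(tokens)


def is_accident(counts):
    """Example rule from the text: (car >= 2 or motorcycle >= 1) and crash >= 2."""
    has_vehicle = counts["car"] >= 2 or counts["motorcycle"] >= 1
    return has_vehicle and counts["crash"] >= 2


# Word counts copied from the table (zero counts omitted; Counter returns 0 for them).
articles = {
    "article_1": Counter({"buy": 2, "crash": 2, "motorcycle": 1}),
    "article_2": Counter({"apple": 3, "car": 1, "crash": 1, "motorcycle": 1, "zoo": 2}),
    "article_3": Counter({"apple": 11, "buy": 13, "car": 44, "crash": 1,
                          "motorcycle": 30, "zoo": 43}),
}
for name, counts in articles.items():
    print(name, "accident" if is_accident(counts) else "not_accident")

# Building the counts from raw tokens works the same way.
print(bag_of_words("car crash car".split()))
```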
One issue with raw counts is that a long document has higher word counts than a short one, which can hurt the accuracy of the rules produced by the algorithm. To make the values comparable between long and short articles, we can use an information retrieval approach called TF-IDF (Term Frequency-Inverse Document Frequency). Term Frequency (TF) measures how frequently a word appears in an article relative to the other words. Inverse Document Frequency (IDF) measures how important the term is relative to all the articles or documents. TF-IDF is defined as follows:
TF_IDF = TF * IDF
Where:
TF = count_of_a_term_in_a_doc / total_count_of_all_terms_in_a_doc
IDF = count_of_all_doc / count_of_doc_has_that_term
In practice, a logarithm is usually applied to the IDF ratio.
This approach normalizes the counts and surfaces the keywords that are important to a document.
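The definition can be sketched in a few lines of Python. This follows the simple ratio form of the formula as written here (libraries typically use a logarithmic IDF with smoothing, so their numbers will differ), and the three tiny documents are made up for illustration:

```python
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Number of documents that contain each term at least once.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total_terms = len(tokens)
        scores.append({
            term: (count / total_terms) * (n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical mini-corpus.
docs = [
    "car crash on the road".split(),
    "the zoo opened on sunday".split(),
    "motorcycle crash on the bridge".split(),
]
scores = tf_idf(docs)
for term in ("car", "crash", "the"):
    print(term, round(scores[0][term], 3))
```

In the first document, ‘car’ scores highest because it appears in no other document, while ‘the’ scores lowest because every document contains it.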
Classifying the articles with machine learning algorithms on TF-IDF data gives fairly good results. In an earlier post, we used the XGBoost algorithm this way and got pretty high accuracy.
These approaches do not take into account any relationship between words. For example, the words ‘car’ and ‘motorcycle’ have no relation to each other; they are just different indices in the word list. To give the terms better context, we will look at word embeddings next.
A word embedding is a list of values that represents a word. Each number in the list is a measurement along a different dimension or topic. As an example with 3-dimensional values, the words ‘king’ and ‘queen’ might share a similar value in one dimension, say royalty, while ‘king’ and ‘man’ might be related in the gender dimension. ‘Apple’ might be completely different from those terms but related to a word like ‘orange’ in the fruit dimension.
As an illustration, I made up some example values and dimensions for the list of terms mentioned:
Word   king  queen  man   woman  apple  orange
Royal  0.96  0.97   0.03  0.12   0.11   0.07
Male   0.92  0.12   0.98  0.06   0.05   0.1
Fruit  0.04  0.06   0.02  0.07   0.94   0.98
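One common way to compare such vectors is cosine similarity. The sketch below uses the made-up values from the table (ordered as Royal, Male, Fruit); the helper name is illustrative:

```python
import math

# Made-up 3-dimensional embeddings from the table: [Royal, Male, Fruit].
EMBEDDINGS = {
    "king": [0.96, 0.92, 0.04],
    "queen": [0.97, 0.12, 0.06],
    "man": [0.03, 0.98, 0.02],
    "woman": [0.12, 0.06, 0.07],
    "apple": [0.11, 0.05, 0.94],
    "orange": [0.07, 0.10, 0.98],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


for a, b in [("apple", "orange"), ("king", "queen"), ("king", "apple")]:
    print(a, b, round(cosine_similarity(EMBEDDINGS[a], EMBEDDINGS[b]), 3))
```

With these values, ‘apple’ and ‘orange’ come out as very similar, while ‘king’ and ‘apple’ are far apart.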
To generate these word embeddings, we can use text data as input to an algorithm. Word2vec (word to vector) is one approach: it takes a word from a sentence, finds nearby words in the same sentence, and runs them through a neural network to produce the values for each dimension.
For example, suppose we want to feed this sentence to the training algorithm: “the king and queen are coming to visit.”
First, we generate the list of input and target word pairs as follows:
- Choose a word in the sentence (input word)
- Then pair it with another word (target word) within a window of size n. Let n=5, so we choose any of the 2 words to the left and any of the 2 words to the right.
- For ‘king’, then the pairs are: (king, the), (king, and), (king, queen)
- For ‘and’: (and, the), (and, king), (and, queen), (and, are)
- For ‘queen’: (queen, king), (queen, and), (queen, are), (queen, coming)
In the pairs of words, the first word is the input word and the second word is the target word.
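The pair generation above can be sketched as follows. The function name and the `half_window` parameter are illustrative; a window size of n=5 corresponds to 2 words on each side:

```python
def skipgram_pairs(tokens, half_window=2):
    """Return (input word, target word) pairs for every word in the sentence."""
    pairs = []
    for i, word in enumerate(tokens):
        start = max(0, i - half_window)
        stop = min(len(tokens), i + half_window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not paired with itself
                pairs.append((word, tokens[j]))
    return pairs


tokens = "the king and queen are coming to visit".split()
pairs = skipgram_pairs(tokens)
for input_word in ("king", "and", "queen"):
    print(input_word, [p for p in pairs if p[0] == input_word])
```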
Second, in the training process, we pass those word pairs to a neural network that learns to predict the target word from the input word. We won’t discuss the details of the neural network here. As a result, the algorithm learns the structure and relationships of the words. Here is an example of word pair relationships from a model trained on 783 million words with 300 dimensions, from the Mikolov et al. paper.
This approach, called skip-gram, was introduced by Mikolov et al. in the 2013 paper “Efficient Estimation of Word Representations in Vector Space”.
These word vectors (hundreds of numbers per word) carry information about the words and their syntactic and semantic relationships. Now our classification algorithm has more information to generate better rules.
The paper shows better accuracy on a sentence completion challenge when skip-gram is combined with recurrent neural network language models (RNNLMs).
In 2014, Pennington et al. from Stanford University introduced another approach for generating word embeddings called GloVe (Global Vectors for Word Representation). It aggregates a global word-word co-occurrence matrix from a corpus to produce the embeddings. GloVe counts how many times a word i (context word) appears near a different word j (target word), where two words co-occur if they appear within N words of each other (the window size). The embedding vectors are trained to encode the ratios of the co-occurrence probabilities of words, which is why GloVe is known as a count-based method.
The charts above show higher accuracy for GloVe in comparison with the Word2Vec approaches (skip-gram and CBOW).
These embeddings usually have several hundred dimensions. In the charts above, the GloVe authors use 300-dimensional vectors trained on a 6-billion-token corpus from Wikipedia text.
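The co-occurrence counting that GloVe starts from can be sketched as below. This is only the counting step with plain counts, under the assumption of a symmetric window; it does not include the weighting or the training of the vectors themselves. The function names and the two sentences are made up for illustration:

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` positions of each other."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                # Count the pair in both directions so the matrix is symmetric.
                counts[(word, tokens[j])] += 1.0
                counts[(tokens[j], word)] += 1.0
    return counts


def cooccurrence_probability(counts, word, context):
    """P(context | word): the share of word's co-occurrences that involve context."""
    total = sum(c for (w, _), c in counts.items() if w == word)
    return counts.get((word, context), 0.0) / total if total else 0.0


sentences = [
    "the king and queen are coming to visit".split(),
    "the king and the queen rule".split(),
]
counts = cooccurrence_counts(sentences)
print(counts[("king", "queen")], counts[("king", "the")])
print(cooccurrence_probability(counts, "king", "queen"))
```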
With these embeddings, the classification algorithms have better context to learn the features. But these embeddings do not take into account the order of words or the context of each sentence. A word like ‘book’ can have different meanings depending on the sentence, such as ‘I book a hotel’ versus ‘I read a book’. So these word embeddings are known as context-independent representations. The next approach introduces context-aware representations.
ELMo (Embeddings from Language Models) is a contextual embedding that takes into account the surrounding words. It models the characteristics of word usage such as syntax and how it is used in various contexts. ELMo uses a bidirectional LSTM (bi-LSTM) algorithm. Bidirectional implies the algorithm takes into account the words before and the words after it in both directions. LSTM is Long Short-Term Memory, a type of neural network that has a ‘memory cell’ that can maintain information in memory for long periods of time allowing it to learn longer-term dependencies.
Below is a comparison of the embeddings from GloVe and the ELMo biLM (bidirectional language model): GloVe returns a single set of words related to ‘play’, while ELMo distinguishes the different meanings of ‘play’ based on the full sentence context.
In 2018, ELMo outperformed all of the algorithms described above, with state-of-the-art results on several major NLP benchmarks including question answering, sentiment analysis, and named entity recognition (NER).
One of the big headlines of October 2018 was BERT (Bidirectional Encoder Representations from Transformers). BERT shattered previous NLP records on multiple datasets, including the Stanford Question Answering Dataset (SQuAD). SQuAD is a reading comprehension dataset of 100,000 questions on Wikipedia articles. The answer to each question is a segment of text from the corresponding reading passage.
BERT captures co-occurrence information by combining a masked language model with a next-sentence prediction task. BERT generates embeddings for subwords, which shrinks the vocabulary compared with models like GloVe or ELMo, from millions of words to about 30,000 subwords.
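To illustrate how a small subword vocabulary can cover many words, here is a simplified sketch of greedy longest-match subword splitting in the style of BERT's WordPiece tokenizer, where ‘##’ marks a piece that continues a word. The toy vocabulary is made up, and the real tokenizer has additional steps that are left out here:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching pieces from `vocab`, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real BERT vocabulary has about 30,000 entries.
vocab = {"play", "un", "##play", "##ing", "##ed", "##able"}
for word in ("playing", "played", "unplayable", "xyz"):
    print(word, wordpiece_tokenize(word, vocab))
```

Six vocabulary entries are enough to represent ‘playing’, ‘played’, and ‘unplayable’, which is the effect that keeps the subword vocabulary small.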
Following the success of BERT, many improvements built on top of it came out, such as ALBERT, RoBERTa, TinyBERT, DistilBERT, and SpanBERT. These are tweaks to the algorithm that achieve even better results or meet different use cases, such as ALBERT for smaller model sizes.
BERT also has a multilingual pre-trained model, which covers the languages with the largest Wikipedias. Unfortunately, Khmer ranks 155th on that list as of this writing. Aragonese, the 100th language and the last to make it into the multilingual pre-training set, has 36,000 articles and is spoken by about 25,000 people in the Pyrenees of northern Spain. Khmer, in contrast, has about 8,000 articles for a language spoken by 16 million Cambodians. So Khmer is not part of the pre-trained multilingual BERT that we could use, because of the small number of articles in Khmer Wikipedia.
Pre-training a BERT model from scratch on Khmer text can require significant resources. Pre-training the BERT-Base model from scratch on a TPUv2 takes about 54 hours according to Antyukhov, who mentioned that the cost is negligible with the free credit given to new users. But according to the BERT GitHub repo, it takes about 2 weeks at a cost of about $500 USD on a single preemptible Cloud TPU v2. We did not try to pre-train BERT at this time.
Earlier, in May 2018, we saw ULMFiT, “Universal Language Model Fine-tuning for Text Classification” by Jeremy Howard and Sebastian Ruder. While ELMo uses character convolutions and can handle out-of-vocabulary words, ULMFiT uses AWD-LSTM. We will use this algorithm to generate a Khmer language model and then use it to perform document classification.
First, we will introduce the concept of transfer learning. Transfer learning is the idea of learning from one task and then applying that knowledge to other tasks. We can use a large corpus for unsupervised learning of word embeddings, then use the result to fine-tune models for classification or other tasks.
In computer vision, an algorithm can train on a large image dataset like ImageNet, where it learns many different features. The model is then fine-tuned on images from the specific domain. This approach gives better accuracy than training on the domain-specific images alone, which may be few in number. In NLP, a similar approach is used, as seen in the pre-training process.
Transfer learning had been used before in the pre-training stage, but ULMFiT introduced a process to effectively fine-tune the language model for various tasks. ULMFiT learns from a large amount of raw text (like Wikipedia) to predict the next word. Then we use transfer learning to update the model by training on a specific dataset that may be quite small. With the pre-trained model in hand, we can train the classifier to perform a specific task with better accuracy.
The LM pre-training stage trains on a general-domain corpus to capture the general features of the language. Then LM fine-tuning trains on a domain-specific corpus, which may be small, to adapt the model to the language of the domain. The last stage is classifier fine-tuning, which preserves the low-level representations and updates the high-level weights using gradual unfreezing of the layers.
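Gradual unfreezing can be pictured as a schedule: training starts with only the last layer group trainable, and one more group is unfrozen before each following epoch. The sketch below only prints that schedule; the group names are hypothetical, and actual training would go through a deep learning library:

```python
def gradual_unfreezing_schedule(layer_groups, epochs):
    """For each epoch, list the layer groups that are trainable.

    Training starts with only the last group unfrozen, and one more group
    (moving toward the input) is unfrozen before each following epoch.
    """
    schedule = []
    for epoch in range(epochs):
        n_trainable = min(epoch + 1, len(layer_groups))
        schedule.append(layer_groups[-n_trainable:])
    return schedule


# Hypothetical layer groups, ordered from input to output.
groups = ["embedding", "lstm_1", "lstm_2", "lstm_3", "classifier_head"]
for epoch, trainable in enumerate(gradual_unfreezing_schedule(groups, epochs=4), start=1):
    print(f"epoch {epoch}: train {trainable}")
```

The layers closest to the input, which hold the most general language features, stay frozen the longest.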
ULMFiT showed better error rates on text classification with the IMDb dataset, where the task is to determine whether a review is positive or negative (2 classes, similar to our case of accident or not accident).
ULMFiT for Khmer Language Model
We decided to use ULMFiT to pre-train the Khmer language model since the model is smaller and faster to train.
Continuing our challenge of classifying news articles, we want to build a Khmer language model (LM) to perform this task, and we chose ULMFiT. The detailed code is available in the Python notebook on GitHub.
To summarize the approach:
1. Create a pre-trained model on Khmer Wikipedia (7,600 articles)
- Download Khmer Wikipedia: 7,600 articles, of which 2,487 are usable
- Set up CRF for word segmentation, using a pre-trained model
- Format the articles, then segment them using CRF
- Train the language model (unfreeze a few layers, train a few epochs, unfreeze and train 10 epochs)
2. Fine-tune the LM with traffic accident data
- Load the accident data into pandas from a database (the pandas data is saved on GitHub)
- Segment using CRF and tokenize
- Fine-tune the LM with the additional data (fit a cycle, unfreeze and fit another 10 epochs)
3. Create a classifier and train it on the traffic accident data
- Train on the labeled data, holding out 20% as a test set: 97.5% accuracy
- Review the most wrong predictions, realize some labels are bad, fix them: 100% accuracy
4. Ensemble with a backward LM with 1K article data: 100%
ULMFiT uses the Averaged Stochastic Gradient Descent (ASGD) Weight-Dropped LSTM (AWD-LSTM). In our case, we have a vocabulary size of 53,656, and each word has a 400-dimensional embedding.
AWD_LSTM: Encoder: Embedding(53656, 400)
- LSTM(400, 1152)
- LSTM(1152, 1152)
- LSTM(1152, 400)
LinearDecoder: Linear(in_features=400, out_features=53656)
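To get a rough feel for the size of this architecture, the sketch below counts the encoder parameters from the layer shapes listed above. It assumes the PyTorch LSTM layout (four gates, each with input weights, recurrent weights, and two bias vectors) and leaves out the decoder; the helper name is illustrative:

```python
def lstm_param_count(input_size, hidden_size):
    """Parameters of one LSTM layer, assuming the PyTorch nn.LSTM layout:
    4 gates, each with input weights, recurrent weights, and two bias vectors."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)


vocab_size, emb_size = 53656, 400
embedding_params = vocab_size * emb_size
lstm_shapes = [(400, 1152), (1152, 1152), (1152, 400)]
lstm_params = [lstm_param_count(i, h) for i, h in lstm_shapes]

print("embedding:", embedding_params)
print("LSTM layers:", lstm_params)
print("encoder total:", embedding_params + sum(lstm_params))
```

Under these assumptions, the embedding table alone accounts for about half of the encoder parameters, which is typical for a word-level model with a large vocabulary.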
Initially, we added 1K news articles to supplement the small number of Khmer Wikipedia articles for pre-training the language model. When we tested the traffic accident classifier, we got an accuracy of 0.9750 and an F1 of 0.9677. This is comparable to our best algorithm with the TF-IDF approach, but we expected more. Then we tried adding more data to the pre-training process to see if it would help. We used 5,000 news articles, but we got a lower accuracy of 0.9375 and an F1 of 0.9206 instead. We expected the language model to improve, and we were puzzled as to why the result got worse.
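For readers unfamiliar with the two metrics, here is how accuracy and F1 are computed from the counts of correct and incorrect predictions. The counts below are hypothetical, chosen only so that the output lands close to the first pair of figures reported above; they are not taken from the notebook:

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and F1 for the positive ('accident') class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical confusion counts for an 80-article test set (an assumption).
accuracy, f1 = accuracy_and_f1(tp=30, fp=1, fn=1, tn=48)
print(round(accuracy, 4), round(f1, 4))
```

Accuracy counts every correct prediction, while F1 focuses on the accident class, so the two can differ when the classes are unbalanced.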
We also tried a backward model, which trains the language model on reversed text. It can be combined with the forward model as an ensemble. This didn’t help. The best result was 0.98 accuracy, using a backward model with 1K news articles in the LM and a 10% test set.
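One simple way to combine a forward and a backward classifier is to average their predicted probabilities. The sketch below assumes each model outputs a probability that an article is an accident; the probabilities are invented, and the notebook may combine the models differently:

```python
def ensemble_predict(forward_probs, backward_probs, threshold=0.5):
    """Average the two models' P(accident) per article, then apply a threshold."""
    labels = []
    for p_forward, p_backward in zip(forward_probs, backward_probs):
        average = (p_forward + p_backward) / 2
        labels.append(1 if average >= threshold else 0)
    return labels


# Hypothetical probabilities for three articles (1 = accident, 0 = not an accident).
forward_probs = [0.9, 0.4, 0.2]
backward_probs = [0.8, 0.7, 0.1]
print(ensemble_predict(forward_probs, backward_probs))
```

For the second article the two models disagree, and the average decides the label.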
Then we dug deeper into which articles were incorrectly classified. An article about a traffic accident in Banteay Meanchey with no casualties was predicted as 1 (Prediction = 1), where 1 = “accident” and 0 = “not an accident”. But the actual label had been set incorrectly to not an accident (Actual = 0). So the model guessed correctly.
We went back to fix the bad labels. We also added more training data. We realized that there are gray areas, such as an article about a donation to a family affected by a traffic accident, or one where the authorities found drugs after an accident. Are these traffic accident-related? Our final result is 98.69% accuracy (or a 1.31% error rate). This is using 785 articles with:
Train: LabelList (707 items)
Valid: LabelList (78 items)
Here are the top losses:
We can conclude that this model does pretty well at identifying traffic accident articles, with about 99% accuracy, beating our previous result using TF-IDF with XGBoost. This model is a powerful tool even with the limited pre-training data of about 3,000 articles that we used.
To apply some of the latest advances in NLP to the Khmer language, we built a language model from scratch using ULMFiT. Unlike big models such as BERT or GPT-2, ULMFiT does not require as many resources to train from scratch.
In the model that we built, we see that the model picks up language structure and shows some coherence in generating sentences. This is quite amazing, given that we did not provide any grammar or word definitions at all. Given just raw text, the algorithm is able to pick up the language structure.
While playing with the pre-trained Khmer language model, we can guess the underlying training data from the sentences that it predicts. We noticed the news writing format, with the title and location of the report. The Bible text also plays some role in the model when we input words related to religion. We didn’t have blogs, conversational text, or many stories and books to make the model more diverse. A good future endeavor would be to crawl and collect more blog pages and other conversational text and make this data available for future study. We also noticed the importance of having more articles in Khmer Wikipedia, so that the latest state-of-the-art multilingual models might include Khmer in a pre-trained model that we can use.
To test the accuracy of the Khmer language model, we performed text classification on news articles to determine whether an article is about a traffic accident. We had implemented an earlier approach using TF-IDF with the XGBoost algorithm with very high accuracy, and we got similar performance using our language model. In fact, after fixing the mislabeled articles, we achieved perfect accuracy on our small dataset (500 articles with a 20% test set).
You can explore the Python notebook that contains all the code and associated data in the GitHub repository. We also share the pre-trained models using Google Drive, since the model files are too big for the GitHub repository. With these model files, you can load the pre-trained model and use it without spending time training it.
You can open the notebook in Google Colab and run it there directly by clicking on “Open in Colab” on the page below.
In addition, we created a web interface to test the model’s ability to predict the next words. See http://ml.tovnah.com/khmer-ulmfit/.
The code in the Python notebook is based on this Fastai GitHub, specifically:
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781, 2013.
Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep Contextualized Word Representations. arXiv:1802.05365, 2018.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2018.
Jeremy Howard and Sebastian Ruder. Universal Language Model Fine-tuning for Text Classification. arXiv:1801.06146, 2018.