Chatbot Language Identification for the South African Languages (Part 2)

Bernardt Duvenhage
Published in Feersum Engine
Dec 14, 2017

Introduction

In Part 1 of this post, I wrote about a simple naive Bayesian classifier for text language identification (LID). This baseline was used in a recent paper published in the proceedings of the 2017 PRASA-RobMech International Conference. The conference incorporates the Annual Symposium of the Pattern Recognition Association of South Africa and the Robotics and Mechatronics Conference of South Africa. This year, it was held at the Central University of Technology, Free State, in Bloemfontein, South Africa.

In this second part of the post, I continue by describing the improved LID classifier developed in the paper. The improved classifier is in fact a combination of two classifiers: the first is the naive Bayesian classifier discussed in Part 1 of this post, and the second is a lexicon-based LID classifier, discussed next.

Lexicon-Based LID

A sentence may naturally be labelled by language based on the lexicon from which its words are taken. If more of its words come from one language's lexicon than from any other, one may assume that the sentence is in that language.

Lexicons for the 11 South African languages were built from the text corpus data of the National Centre for Human Language Technology hosted by the Language Resource Management Agency. This is the same data that was used to train the naive Bayesian LID classifier.
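To make the idea concrete, here is a minimal sketch of such a lexicon-based classifier in Python. The lexicons dictionary and its toy word sets are illustrative stand-ins for the full NCHLT-derived lexicons; this is not the paper's actual code.

```python
from collections import Counter

# Toy word sets standing in for the full NCHLT-derived lexicons.
lexicons = {
    "zul": {"ngiyabonga", "kakhulu", "umuntu"},  # isiZulu examples
    "xho": {"enkosi", "kakhulu", "umntu"},       # isiXhosa examples
    "eng": {"the", "people", "thanks"},          # English examples
}

def classify(text: str) -> str:
    """Label text with the language whose lexicon covers most of its words."""
    words = text.lower().split()
    votes = Counter()
    for lang, lexicon in lexicons.items():
        votes[lang] = sum(1 for w in words if w in lexicon)
    # The language with the most in-lexicon words wins.
    return votes.most_common(1)[0][0]

print(classify("ngiyabonga kakhulu"))  # -> 'zul'
```

Note how "kakhulu" appears in both the isiZulu and isiXhosa toy lexicons; shared vocabulary within a language family is exactly what causes the within-family confusion discussed below.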

The figure below shows the confusion matrix for classifying pieces of text of 15 characters (about three to four words) in length using the lexicon-based LID classifier. If you recall, the rows of a confusion matrix represent the true language labels of the pieces of text and the columns, the predicted language labels resulting from the LID classifier. The dashed blocks drawn on the matrix indicate the borders of the language families.

Looking at the isiZulu (zul) row, for example, one can see that isiZulu sentences are correctly classified 75.8% of the time and incorrectly classified as isiXhosa (xho) 8.8% of the time. Like the naive Bayesian classifier, the lexicon-based classifier distinguishes well between families but shows much confusion within language families. The average accuracy of the lexicon classifier is 89.6%.
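As an aside, a row-normalised confusion matrix of this kind can be computed with scikit-learn, as in the sketch below. The labels and predictions are toy values for illustration; the paper's actual evaluation code is in the repository linked in the conclusion.

```python
from sklearn.metrics import confusion_matrix

labels = ["zul", "xho", "eng"]  # toy subset of the 11 languages
y_true = ["zul", "zul", "zul", "xho", "eng"]
y_pred = ["zul", "zul", "xho", "xho", "eng"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
# Normalise each row so that entry (i, j) is the fraction of
# true-language-i samples predicted as language j.
cm = cm / cm.sum(axis=1, keepdims=True)

# The 'zul' row: fraction of isiZulu text classified as each language.
print(dict(zip(labels, cm[labels.index("zul")])))
# ≈ {'zul': 0.67, 'xho': 0.33, 'eng': 0.0}
```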

The figure below shows a confusion matrix with the results combined to show only the language families. Note how the languages from the Nguni family, for example, have been combined into a single row and column. The matrix therefore shows only the confusion between language families.

The lexicon-based LID classifier can distinguish between language families with a relatively high average accuracy of 99.1%. However, from this matrix one can see that there is still notable confusion between the Sotho-Tswana and Germanic languages as well as between the Tswa-Ronga and Nguni languages. Looking at the Sotho-Tswana row for example, one can see that Sotho-Tswana sentences are incorrectly classified as being from the Germanic languages 2% of the time.
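A sketch of how a per-language confusion matrix might be collapsed into a per-family one is shown below. The family assignments follow the groupings used in the paper, but the collapse function itself is illustrative, not the paper's code.

```python
import numpy as np

langs = ["zul", "xho", "nbl", "ssw", "nso", "sot", "tsn",
         "afr", "eng", "tso", "ven"]
family_of = {
    "zul": "Nguni", "xho": "Nguni", "nbl": "Nguni", "ssw": "Nguni",
    "nso": "Sotho-Tswana", "sot": "Sotho-Tswana", "tsn": "Sotho-Tswana",
    "afr": "Germanic", "eng": "Germanic",
    "tso": "Tswa-Ronga", "ven": "Venda",
}
families = ["Nguni", "Sotho-Tswana", "Germanic", "Tswa-Ronga", "Venda"]

def collapse(cm: np.ndarray) -> np.ndarray:
    """Sum the rows and columns of a count-based confusion matrix by family."""
    fam_cm = np.zeros((len(families), len(families)))
    for i, li in enumerate(langs):
        for j, lj in enumerate(langs):
            fam_cm[families.index(family_of[li]),
                   families.index(family_of[lj])] += cm[i, j]
    return fam_cm / fam_cm.sum(axis=1, keepdims=True)  # row-normalise
```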

A Cascade of Two Classifiers

The lexicon-based LID classifier is less accurate than the naive Bayesian classifier, which achieved an average LID accuracy of 93.0%. However, it is well known that a combination of classifiers (called an ensemble) can often achieve higher performance than any of its constituent classifiers on their own.

The PRASA-RobMech paper describes a cascade of the naive Bayesian and lexicon-based LID classifiers. A cascade is a type of ensemble classifier in which multiple classifiers are applied one after the other.

The cascade works as follows: the naive Bayesian classifier first classifies the text into a language family. If the lexicon-based classifier then shows one language within that family to be more likely than any other in the family, that language is chosen as the final label. Otherwise, the naive Bayesian result is taken as the language label.

The intuition behind the order of the classifiers in the cascade is, firstly, that the naive Bayesian classifier is relatively good at classifying a piece of text by language family and, secondly, that the lexicon should be useful for distinguishing languages within a family, since such languages typically have largely mutually exclusive sets of frequently used words.
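The decision logic can be sketched as follows, assuming two hypothetical callables: nb_classify, which returns the naive Bayesian classifier's language label, and lexicon_votes, which returns per-language counts of in-lexicon words. Both names are illustrative, not the paper's code.

```python
def cascade_classify(text, nb_classify, lexicon_votes, family_of):
    """Classify by family with naive Bayes, then refine within the family."""
    nb_lang = nb_classify(text)
    family = family_of[nb_lang]
    # Only languages in the predicted family take part in the lexicon vote.
    votes = {lang: n for lang, n in lexicon_votes(text).items()
             if family_of[lang] == family}
    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    # If one language in the family clearly wins the lexicon vote, use it;
    # on a tie (or an empty vote) fall back on the naive Bayesian label.
    if len(ranked) > 1 and ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    return nb_lang
```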

The figure below shows the confusion matrix of the results of the cascade classifier. The average accuracy of the cascade is 95.2%, which is higher than both the naive Bayesian classifier and the lexicon-based classifier.

As mentioned in Part 1 of this post, the better the LID classifier’s performance, the more diagonal the confusion matrix will become. The matrix above is more diagonal than the matrices of the individual classifiers, but some confusion within language families is still present.

Conclusion

The lexicon-based LID classifier on its own achieves an average accuracy of only 89.6%. However, when used in a cascade with the baseline naive Bayesian LID classifier, it results in an overall reduction in error. The resulting short-sentence LID accuracy is 95.2%, a 31% reduction in LID error over the best-performing single classifier (the error rate drops from 7.0% for the naive Bayesian classifier to 4.8%, and (7.0 − 4.8)/7.0 ≈ 31%). The code, as well as the training and testing data used to generate the above results, is hosted at https://github.com/praekelt/feersum-lid-shared-task.

Lower LID accuracies may be expected for short text because fewer text features are available during classification. In future work, it could be interesting to estimate the performance ceiling of LID and to find other classifiers that can improve on the current cascade. Since much of the current training data is from relatively old government sources, another avenue for further work would be to train the classifiers on more diverse and more modern text corpora.

Bernardt Duvenhage
Feersum Engine NLP & Machine Learning Lead at Praekelt Consulting, Toronto Area, Canada.