Chatbot Language Identification for the South African Languages

Bernardt Duvenhage
Feersum Engine
Nov 2, 2017

Introduction

Virtual assistants and text chatbots are gaining popularity. However, for chatbots to be accessible to South Africans, these software agents need to understand our local languages.

South Africa has 11 official languages belonging to several different language families. Afrikaans (afr) and English (eng) are Germanic languages. isiNdebele (nbl), isiXhosa (xho), isiZulu (zul) and siSwati (ssw) belong to the Nguni family of languages. Sepedi (nso), Sesotho (sot) and Setswana (tsn) belong to the Sotho-Tswana family of languages. Finally, Xitsonga (tso) belongs to the Tswa-Ronga family and Tshivenda (ven) belongs to the Venda family. Many of these languages are under-resourced and further work is required to build chatbots that are fluent in the country’s rich vernaculars.

Text language identification (LID) is an important early step in many multilingual natural language processing (NLP) pipelines. Given the short-message nature of text-based chat interactions and the possibility of code switching, the language identification system might only have 15 or 20 characters from which to make a prediction. LID accuracy can be relatively low for such short pieces of text, and any errors that occur early in an NLP pipeline may be compounded by later processing steps.

PRASA-RobMech 2017

At the end of November, I’ll be presenting an LID algorithm at the 2017 PRASA-RobMech International Conference. The conference incorporates the 28th Annual Symposium of the Pattern Recognition Association of South Africa and the 10th Robotics and Mechatronics Conference of South Africa. This year it is held at the Central University of Technology, Free State, in Bloemfontein, South Africa.

The algorithm I’ll be presenting combines a baseline naive Bayes text classifier with a lexicon-based classifier. The naive Bayes classifier classifies the text into a language family, and the lexicon is then used to improve the accuracy of classification into a specific language within that family.

In this post I’ll describe the baseline naive Bayes text classifier which can already achieve quite high accuracies for text strings of 50 characters and longer. I’ll wait until after the conference to write a second post on the results of the improved LID algorithm that is more accurate for short pieces of text.

Naive Bayes Text Language Identification

There has been previous work on the topic of LID in the South African context, most notably work by Botha and Barnard on factors that affect the accuracy of text-based language identification, as well as work by Giwa and Davel on language identification of individual words with joint sequence models. Previous work has shown that the naive Bayes text classifier with character n-gram features works comparatively very well for language identification.

A naive Bayes classifier is a popular classifier based on Bayes’ theorem for the probability of events. The naive version of the classifier assumes that the features (characters and words in the case of text) are independent. This is a reasonable assumption for free-form text because the presence of a word in a sentence is not strongly dictated by the presence of other words.
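To make the independence assumption concrete, here is a minimal hand-rolled sketch of the naive Bayes decision rule over character n-grams. The training sentences and the 3-gram size are illustrative only (the study itself uses larger n-grams and far more data), and equal class priors are assumed so the prior term is dropped:

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams from a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Toy training data: a handful of sentences per language (hypothetical).
training = {
    "eng": ["the quick brown fox", "hello how are you"],
    "afr": ["die vinnige bruin jakkals", "hallo hoe gaan dit"],
}

# Count n-gram occurrences per language.
counts = {lang: Counter() for lang in training}
for lang, sentences in training.items():
    for s in sentences:
        counts[lang].update(char_ngrams(s))

def score(text, lang, alpha=1.0):
    """Log P(text | lang) under the naive independence assumption:
    the log-probabilities of the individual n-grams simply sum.
    Add-alpha smoothing handles n-grams unseen in training."""
    c = counts[lang]
    total = sum(c.values())
    vocab = len(set().union(*[set(cnt) for cnt in counts.values()]))
    return sum(
        math.log((c[g] + alpha) / (total + alpha * vocab))
        for g in char_ngrams(text)
    )

def classify(text):
    """Pick the language with the highest score (equal priors assumed)."""
    return max(counts, key=lambda lang: score(text, lang))
```

The product of per-feature probabilities becomes a sum of log-probabilities, which is both numerically stable and fast to compute.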

The classifier is trained by providing it with many examples of sentences in the various languages one wants to be able to identify. These examples are called the ‘training data’. Once trained, a second set of example sentences, called the ‘testing data’, is used to measure the performance of the classifier.

Previous work showed that one can easily achieve 100% accuracy for classification of South African languages given long pieces of text. However, for pieces of text of 100 characters or less the language identification performance can quickly drop to below 90%.

Results

The code as well as the training and testing data used to generate the results below are hosted at https://github.com/praekelt/feersum-lid-shared-task. The data is derived from the text corpus data of the National Centre for Human Language Technology hosted by the Language Resource Management Agency. All figures used here are from the paper to be published at the 2017 PRASA-RobMech conference.

The figure below shows the performance of a naive Bayes language classifier using character 5-grams for text lengths from 240 down to 15 characters. The figure shows the performance curves for 1000, 2000, 3000 or 4000 training sentences per language. In general, with this type of classifier more training data leads to improved performance, but with diminishing returns. The F-score is a performance measure similar to accuracy that balances precision and recall.
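As a small illustration of the metric, here is a sketch of computing a macro-averaged F-score with scikit-learn on hypothetical predictions (the labels below are made up, not taken from the paper's results):

```python
from sklearn.metrics import f1_score

# Hypothetical true and predicted labels for a small test batch.
y_true = ["eng", "afr", "zul", "xho", "zul", "afr"]
y_pred = ["eng", "afr", "zul", "zul", "zul", "eng"]

# Macro-averaged F-score: the harmonic mean of precision and recall is
# computed per language and then averaged over the languages, so every
# language counts equally regardless of how many test sentences it has.
score = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(round(score, 3))  # → 0.533
```

Macro averaging is a common choice for LID because it does not let a well-resourced language dominate the reported number.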

LID performance against text length and number of training sentences per language.

As seen from the results above, the performance is quite high for classifying pieces of text of 100 characters and longer, but for shorter pieces of text the performance drops quickly. The figure below shows the confusion matrix for classifying pieces of text of 15 characters in length.

Confusion matrix of the baseline classifier and a test set with strings of length 15 characters.

The confusion matrix is a view of the results that shows, for example, how a 15-character piece of text with the true label (on the left) of ‘xho’ is classified correctly as ‘xho’ 89.4% of the time. Find the row and column for ‘xho’ to see this result. However, the matrix also shows how often ‘xho’ texts are incorrectly classified as other languages. For example, short pieces of ‘xho’ text are classified incorrectly as ‘zul’ 5.7% of the time. Find the ‘xho’ row and the ‘zul’ column to see this.

The less ‘confusion’ there is between the languages, the more diagonal the confusion matrix becomes. A perfectly diagonal (row-normalised) matrix has 1.0 on the diagonal and 0.0 everywhere else. The confusion matrix, and how close to diagonal it is, is a powerful tool for making sense of a classifier’s performance.
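Such a row-normalised confusion matrix is easy to compute with recent scikit-learn versions. The labels below are hypothetical, just to show the mechanics; each row then sums to 1.0 and gives the fraction of texts with that true label assigned to each predicted label:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true = ["xho", "xho", "xho", "zul", "zul", "eng"]
y_pred = ["xho", "xho", "zul", "zul", "zul", "eng"]

labels = ["eng", "xho", "zul"]
# normalize="true" divides each row by the number of true examples of
# that label, so entry (i, j) is P(predicted j | true label i).
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(cm.round(2))
```

Reading the matrix: the ‘xho’ row here shows 2/3 of ‘xho’ texts classified correctly and 1/3 confused with ‘zul’.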

It is interesting to note from the figure above that the confusion (the widening of the diagonal) is much greater between languages of the same family. This is evident from the greater confusion within the dashed family blocks drawn on the figure. The figure below shows a confusion matrix with the results combined to present only the language families.

Confusion matrix for classifying 15 character text strings into their language families.

The confusion matrix is now very close to diagonal and the overall accuracy of classifying 15 character text strings into their language families is 99.2%. The naive Bayes classifier is therefore quite good at classifying even short pieces of text into their language families.

The confusion matrix therefore suggests that the LID performance could be improved by augmenting the naive Bayes classifier with other, more targeted classifiers. Such a combination of classifiers is known as an ensemble, and the additional classifiers need, for example, only be good at distinguishing languages within a specific language family.
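A minimal sketch of such a two-stage ensemble is shown below. The classifiers here are stubs (the lambdas are purely illustrative); in the approach described in this post the family-level classifier would be the naive Bayes model and the within-family refiner would be lexicon-based:

```python
# ISO-style language codes mapped to their families, as listed in the
# introduction of this post.
FAMILIES = {
    "afr": "germanic", "eng": "germanic",
    "nbl": "nguni", "ssw": "nguni", "xho": "nguni", "zul": "nguni",
    "nso": "sotho-tswana", "sot": "sotho-tswana", "tsn": "sotho-tswana",
    "tso": "tswa-ronga", "ven": "venda",
}

def classify_two_stage(text, family_classifier, within_family_classifiers):
    """First pick a language family, then refine the prediction with the
    classifier specialised for that family (if one exists)."""
    family = family_classifier(text)
    refine = within_family_classifiers.get(family)
    return refine(text) if refine else family

# Usage with stub classifiers (hypothetical):
pred = classify_two_stage(
    "molo unjani",
    family_classifier=lambda t: "nguni",
    within_family_classifiers={"nguni": lambda t: "xho"},
)
print(pred)  # → "xho"
```

The design exploits the near-diagonal family-level confusion matrix: since the first stage is right 99.2% of the time, the second stage only has to separate a handful of closely related languages.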

Conclusion

The baseline LID model presented here has less than 1% error for pieces of text of 50 characters and longer. The classifier also trains and tests in just 90 minutes on a single core of a 3.30 GHz i5 CPU, uses less than 2 GB of RAM during training, and the trained model is only 50 MB in size. Long-sentence language detection therefore seems to be a solved problem for at least the 11 official South African languages. Note that this statement really only holds if the training data is from a similar domain as one’s production environment.

After the PRASA-RobMech conference I’ll do a post on our paper there and on how to improve the performance for shorter pieces of text. If you are working on an LID algorithm, please do apply it to our testing and training data available at https://github.com/praekelt/feersum-lid-shared-task. It would be great to get a shared LID task going.

Feersum Engine NLP & Machine Learning Lead at Praekelt Consulting, Toronto Area, Canada.