Language Identification for very short texts: a review

Jade Moillic
Besedo Engineering Blog
16 min read · May 25, 2022

Co-authored by Hassan Ismail Fawaz


Introduction

The Language Identification (LangID) task, also called Language Detection, consists in determining the natural language that a text is written in (Lui & Baldwin, 2012). During the past decades, due to the extensive usage of text messages and social media platforms, the amount of plain text data has significantly increased (Toftrup et al. 2021). LangID has therefore become an important topic: in social media texts, for example, the language is rarely specified explicitly, yet knowing it is essential for any further analysis of the data (Mitja, 2015).

In politics and socioeconomics, LangID is used for marketing purposes (Toftrup et al., 2021). In the field of Natural Language Processing (NLP), LangID can be the first step in any project. Indeed, if you do not know the language of a text, you will not be able to tell which language-specific model should be applied (Apple, 2019). Similarly, at Besedo, we have different models and filters for each language and use LangID to determine which language-specific path an incoming text should follow.

Today, a significant amount of user-generated text content contains a small number of characters, mostly due to the increasing use of social media platforms that limit the number of characters you can publish, such as Twitter or TikTok. For Twitter, the limit is 280 characters (it was 140 until 2017) (Twitter, n.d.), and the limit for a TikTok comment is 150 characters (What Are Character Count Limits on Social Media & Email?, 2021). Even if a given method achieves good results on large texts, it does not necessarily generalize to short texts, which are known to be much more challenging for existing LangID methods (Jauhiainen et al., 2018).

Although LangID has been extensively studied in the literature (Comparison of Language Identification Models, 2021; Jauhiainen et al., 2018; Lui & Baldwin, 2012; Toftrup et al., 2021), benchmarks were often limited to a single dataset and to the comparison of only two methods. Moreover, academic papers usually do not mention the inference time of these methods, which is a very important consideration for practitioners constrained to real-time usage (Comparison of Language Identification Models, 2021). Finally, existing surveys showed a lack of interest in short texts, which are considered a much more challenging problem for state-of-the-art LangID techniques (Jauhiainen et al., 2018).

In this blog post, we benchmarked different LangID methods on various datasets in order to determine the best model for short-text language identification. Our review includes methods based on both classical Machine Learning approaches and state-of-the-art Deep Networks, evaluated on different datasets containing short texts originating from social media or reviews. To compare the methods, we used two criteria: accuracy and inference time.

Our results showed that two models come out on top in terms of accuracy. However, given the production requirement of low latency, one model appeared to have the clear upper hand.

Background

Before going into the benchmark, we define a few of the key concepts we will use throughout this blog post.

Text: in linguistics, a text is the original words of something written, printed, or spoken, as opposed to a summary or paraphrase (Nordquist, 2019). For computer scientists, a text is a sequence of characters.

Short text: for this study, a text is considered short if its length is under 100 characters.

Label: a tag attributed to a text (e.g. positive/negative label for sentiment analysis).

Dataset: a set of texts where each text is associated with a ground truth label.

Text classification: given a dataset, the task of designing an algorithm (or performing a manual annotation) that predicts the correct label for an input text.

Language Identification (LangID): a special case of text classification where the label is actually the natural language in which the input text was written.

Language: a method of communication that can be spoken and/or signed.

Script: a set of written symbols (e.g. an alphabet) used to write down a language.

Datasets

We found seven LangID datasets for this task: SEPLN-TweetLID14, Open-subtitles-v2018–100k-per-lang, WiLI-2018, Language Detection (Kalvelage), Language Detection (Saji), Tatoeba-sentences-2021–06–05 and papluca/language-identification.

When considering a dataset for our review, an important requirement was that it include enough examples whose length is less than 100 characters. Language Detection (Kalvelage) and WiLI-2018 did not have enough texts under 100 characters, so we had to put them aside. Thus, after filtering based on this requirement, we ended up with five datasets to work with. In the following sections, we describe each of the five datasets in detail.

The following table summarizes the different characteristics of our chosen datasets for benchmarking.

Comparison table of the five datasets used in our benchmark

TweetLID14

This dataset was created for the TweetLID 14 shared task, which consisted of identifying the language of a tweet. The dataset contains 13k tweets written in languages of the Iberian Peninsula: Basque, Catalan, Galician, Spanish and Portuguese, to which the organizers also added English. The tweets were collected during March 2014, and the authors only extracted tweets geolocated in the Iberian Peninsula.

OpenSubtitles

We used this dataset created for a language identification survey, Comparison of Language Identification Models (2021). It is a subset of the 2018 OpenSubtitles dataset containing 100k texts per language for 45 languages. OpenSubtitles contains texts from movie and TV subtitles, which means that the texts are very short (28 characters on average).

Language Detection

Language Detection is a dataset hosted on Kaggle that Saji (2021) collected from Wikipedia. It is a small dataset of only 10k texts for 17 different languages. The average length of texts is 124 characters.

Tatoeba

This dataset is a dump of Tatoeba Sentences made on the 5th of June 2021 by the authors of the Comparison of Language Identification Models (2021) blog post. Tatoeba Sentences was created to collect sentences and their translations in different languages. Due to its purpose, the dataset contains only short texts, with an average of 35 characters per text, written in 398 different languages. In this dump, the most represented languages are English, Russian, Italian and Turkish.

Papluca language identification

This dataset was created during The Hugging Face Course Community Event of 2021 to collect enough texts to train a language detection model. To this end, the authors used data from three sources: the Multilingual Amazon Reviews Corpus, XNLI, and STSb Multi MT, and kept only texts written in 20 languages.

State-of-the-art Approaches

In this section, we present the most recent and used techniques for LangID.

langid.py: An Off-the-shelf Language Identification Tool

Originally published in Lui, M. & Baldwin T. (2012), langid.py is considered one of the first approaches to democratize the task in question. The authors mainly focused on three aspects of the proposed technique:

  1. Fast response time given an input text, which motivated the use of a multinomial naive Bayes classifier.
  2. An off-the-shelf tool, which is why the module is shipped as a single Python file that any end-user can fairly easily use.
  3. Robustness to special characters such as HTML symbols, which are ignored and filtered out when preprocessing the input text.

Detecting 97 languages, the method relies on a strong set of predefined features selected using Information Gain computed over different sets of n-grams. These features are then fed to a Naive Bayes classifier trained on a corpus of texts originating from different microblogs. The results back then were quite impressive given the level of competition in 2012. However, as you will see in the next subsection, a strong competitor, FastText, was published by Joulin A. et al. (2017).
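
As a quick illustration, here is a minimal sketch of how langid.py can be called from Python. It assumes the langid package is installed via pip; the example text and language subset are ours, purely for illustration.

```python
# Minimal usage sketch of langid.py (assumes `pip install langid`).
import langid

# Optionally restrict the candidate languages to the ones relevant to your use case.
langid.set_languages(["en", "es", "fr"])

# classify() returns the predicted language code and a confidence score.
lang, score = langid.classify("Ceci est un texte très court")
print(lang, score)  # expected: 'fr' with some score
```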

FastText: Bag of Tricks for Efficient Text Classification

FastText was originally published by Joulin A. et al. (2017) as a library for general text classification. Later on, Edouard Grave published a blog post presenting a new open-source model for language identification able to recognize more than 170 languages almost instantly.

The model was trained on three main datasets: Tatoeba, Wikipedia and SETimes. Apart from using a huge amount of data, this model makes use of subword features to achieve highly competitive accuracy. For example, if the given word is skiing, the corresponding subwords or n-grams are generated and used as input features for the model: skiing, ski, kii, iin, ing. A key advantage of generating subwords is that misspelt words can still be handled to some extent, since not all of their subwords will be affected by the misspelling.
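
To make the subword idea concrete, here is a small illustrative snippet we wrote (a simplification, not FastText's internal code, which also adds word boundary markers):

```python
# Illustrative character n-gram extraction, mirroring the "skiing" example above.
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Return the word itself plus its overlapping character n-grams."""
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    return [word] + grams

print(char_ngrams("skiing"))  # ['skiing', 'ski', 'kii', 'iin', 'ing']
```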

In addition to its high accuracy, models like FastText employ two main techniques of model compression to decrease the inference time: (1) weight quantization where we map weights from large floats to smaller ranges thus taking up less space in memory; (2) feature selection where we basically remove input features that are irrelevant for the classification task.

Since 2017, FastText has been one of the most widely used approaches for LangID, as evidenced by its robust performance and reliability across various types of datasets.
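
For reference, here is a minimal sketch of running the pre-trained FastText language identifier. It assumes the fasttext Python package is installed and that the compressed model file lid.176.ftz has been downloaded from the FastText website; the example text is ours.

```python
# Minimal usage sketch of FastText's pre-trained language identification model.
import fasttext

model = fasttext.load_model("lid.176.ftz")  # compressed, quantized model

# predict() returns the top-k labels and their probabilities.
labels, probs = model.predict("Det her er en meget kort tekst", k=3)
print(labels, probs)  # expected top label: '__label__da'
```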

CLDv3: Compact Language Detector v3

Designed by Google to run in Google Chrome, CLDv3 is on its third version as of August 2020. The model, which covers 107 languages (slightly fewer than its counterpart FastText), uses a neural network as its backbone.

First, it extracts different n-grams from the input text, which are fed to an embedding layer that projects each unique n-gram to a fixed dense vector. Then it averages the embeddings of these n-grams, weighting each one by its frequency of appearance in the original input text. Finally, the averaged embeddings are concatenated to form the input of a classical multi-layer perceptron, which outputs a probability distribution over the 107 classes (languages). The following figure depicts the general architecture of Google’s CLDv3 model.

CLDv3 architecture by Google (source: GitHub — google/cld3)
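
In practice, CLDv3 can be called through its Python bindings. Below is a hedged sketch using the gcld3 package (the exact constructor arguments may differ across versions, and the example text is ours):

```python
# Minimal usage sketch of Google's CLDv3 via the gcld3 Python bindings.
import gcld3

detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
result = detector.FindLanguage(text="Este texto está escrito en español")
print(result.language, result.probability, result.is_reliable)
```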

XLM-Roberta-Base Language Detection

In 2022, we cannot talk about an NLP task without mentioning the famous transformer-based architectures. This is why Lucas Papariello took it upon himself to investigate the use of a recent multilingual transformer-based model, XLM-Roberta, published by Conneau A. et al. (2020). The underlying architecture is considered one of the state-of-the-art approaches for various NLP tasks beyond text classification, such as question answering and named entity recognition. The fine-tuned language detection model was trained on the Language Identification dataset hosted on Hugging Face and supports only 20 languages.
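
As an illustration, here is a minimal sketch of querying this model through the Hugging Face pipeline API. It assumes the transformers package is installed and will download the model weights on first use; the example sentence is ours.

```python
# Minimal usage sketch of the XLM-Roberta-Base Language Detection model.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
)
print(classifier("Brevity is the soul of wit"))  # e.g. [{'label': 'en', 'score': ...}]
```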

LSTM LID: Long Short Term Memory for Language Identification

Although most of the previous approaches achieved tremendous results for LangID, they lacked a focus on short text classification. Given that it is much more challenging to detect the language of shorter sentences, Apple recently published a blog post describing its internal solution for LangID. By focusing on shorter texts, Apple is pioneering the research on language detection for very short strings with its proposed LSTM-LID solution. However, Apple did not publish the corresponding code, which motivated Toftrup M. et al. (2021) to reproduce the results of LSTM-LID while open-sourcing their code.

The method first applies Unicode-based script identification and then forwards the text to the corresponding LSTM-based network to classify the language. For example, if the written script is Latin, the Latin-script network is used, and so on.

For each script, an independently trained network predicts the language of the input text. Specifically, each character passes through the LSTM network, which produces a language prediction at every time step. Finally, to aggregate the results across all characters, Apple proposes a max-pooling style majority vote to decide the dominant language of the input string. However, Toftrup M. et al. (2021) did not manage to reproduce this aggregation scheme as described. Instead, they propose summing the linear layer’s output values over all time steps (i.e. characters) and applying a softmax to obtain a final prediction (i.e. a language) for the whole input string.
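
To make the aggregation step more tangible, here is a simplified PyTorch sketch we wrote. It is not Apple's or Toftrup et al.'s actual code; among other simplifications, the real models are bi-directional and use their own vocabularies and dimensions.

```python
# Simplified sketch of an LSTM-LID style model: per-character logits are summed over
# time steps and a softmax yields one distribution over languages for the whole string.
import torch
import torch.nn as nn

class TinyLSTMLID(nn.Module):
    def __init__(self, vocab_size: int, num_languages: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 32)
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.linear = nn.Linear(hidden, num_languages)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, seq_len) tensor of character indices
        states, _ = self.lstm(self.embed(char_ids))   # (batch, seq_len, hidden)
        logits_per_char = self.linear(states)         # (batch, seq_len, num_languages)
        summed = logits_per_char.sum(dim=1)           # aggregate over all characters
        return torch.softmax(summed, dim=-1)          # one distribution per string
```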

The following figure depicts the architecture of Apple’s LSTM-LID model.

LSTM-LID architecture (source: Language Identification from Very Short Strings)

Evaluation

For our benchmark to be as complete as possible, we will focus our evaluation on two aspects: accuracy and inference time. The accuracy measures how often the model gives the right prediction for an input text. Accuracy is calculated as follows:

Accuracy = (number of correct predictions) / (total number of predictions)

Inference time is a major subject for any practitioner planning to deploy a real-time model in production as it indicates how long it takes for a trained model to make predictions.
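
As a simple illustration, inference time can be estimated with a helper like the one below; the function name and the `predict` callable are ours, purely for illustration.

```python
# Illustrative helper for measuring total inference time over a list of texts.
import time

def measure_inference_time(predict, texts):
    """Return the wall-clock time needed to predict the language of every text."""
    start = time.perf_counter()
    for text in texts:
        predict(text)
    return time.perf_counter() - start
```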

Given that not every model predicts the same set of languages, we had to come up with a solution to compare them fairly. To that end, we evaluated the predictions on the 20 languages supported by the LSTM model, as it supports the smallest number of languages. The languages compared in this benchmark are the following: Catalan, Czech, Danish, German, English, Spanish, Estonian, Finnish, French, Croatian, Hungarian, Italian, Lithuanian, Dutch (Flemish), Norwegian, Polish, Portuguese, Romanian (Moldavian), Swedish, Turkish.

Since we are interested in benchmarking how LangID models behave on really short text, we wanted to calculate the accuracy (y-axis) of the models’ predictions for each length of text (x-axis). The following graphs depict these results for each dataset.
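
For concreteness, here is a hedged pandas sketch of how such a per-length accuracy curve could be computed; the column names and toy examples are illustrative, not the exact ones from our pipeline.

```python
# Illustrative computation of accuracy as a function of text length.
import pandas as pd

df = pd.DataFrame({
    "text": ["hola", "hello there", "bonjour à tous"],
    "true_lang": ["es", "en", "fr"],
    "pred_lang": ["es", "en", "it"],
})

df["length"] = df["text"].str.len()
df["correct"] = df["true_lang"] == df["pred_lang"]
accuracy_by_length = df.groupby("length")["correct"].mean()
print(accuracy_by_length)
```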

TweetLID14

The models’ performance on SEPLN-TweetLID14

As we can see in the preceding graph, there are three distinct groups of models: the singleton CLDv3, the group langid.py/XLM-Roberta-Base Language Detection, and the group LSTM/FastText, whose members have similar accuracy. CLDv3 has the lowest accuracy without a doubt. The LSTM/FastText group has the highest accuracy for texts under 70 characters, and we can observe that LSTM, FastText, XLM-Roberta-Base Language Detection and langid.py all have similar accuracies for texts over 70 characters.

OpenSubtitles

The models’ performance on Open-subtitles-v2018–100k-per-lang

The curves for this dataset are smooth because it contains a significant number of examples under 100 characters (1.7 million), which makes the trends easier to read.

Clearly, LSTM works best on this dataset. XLM-Roberta-Base Language Detection also obtains good results for texts under 50 characters and matches LSTM for texts over 50 characters. This can be explained by the fact that the dataset on which XLM-Roberta-Base Language Detection was trained contains many news headlines and image captions, which can be similar to subtitles.

FastText also achieves good results, but the langid.py/CLDv3 cluster does not perform well on very short texts, even though it reaches approximately the same accuracy as FastText for texts over 50 characters.

Even though LSTM performs best on this dataset, the evaluation is biased because the LSTM model was trained on it.

Language Detection

The models’ performance on Language Detection — B. Saji

For this dataset, the results are harder to compare as it includes only 3.7k texts under 100 characters. Nevertheless, the evaluation shows that LSTM, FastText and XLM-Roberta-Base Language Detection perform quite similarly and are superior to langid.py and CLDv3. We can also observe that both CLDv3 and langid.py are quite noisy for texts above 30 characters, while the other three models are smoother, with an almost stable accuracy close to 100%.

Tatoeba

The models’ performance on Tatoeba-sentences-2021–06–05

Tatoeba is an interesting dataset for our benchmark as it contains 6 million texts under 100 characters. LSTM and FastText both clearly have the best results on this dataset, followed by XLM-Roberta-Base Language Detection and then langid.py. The model with the lowest scores for this dataset is CLDv3. However, some snapshots of this dataset were used to train FastText, which gives it an advantage. We also see that the five models reach similar accuracy for texts over 60 characters.

Papluca Language Identification

The models’ performance on papluca/language-identification

Because the XLM-Roberta-Base Language Detection model was trained on this dataset, we only considered the test set for our benchmark to limit biases; however, one cannot negate the fact that XLM-Roberta still has an advantage over the other models on this dataset. By observing the figure above, we can spot a high accuracy variance between models, especially for extremely short texts, which makes it harder to determine the best model in these scenarios. CLDv3 and langid.py both have the lowest accuracy for texts under 60 characters and are both quite noisy for texts over 60 characters. XLM-Roberta-Base Language Detection definitely has the best performance, but even though LSTM and FastText were not trained on this dataset, they present strong competition to XLM-Roberta, which was.

The comparative evaluation above showed that FastText and LSTM perform better on short texts than CLDv3, langid.py and XLM-Roberta-Base Language Detection, even though the latter performed well on most datasets, and confirmed that CLDv3 and langid.py do not handle short texts well. Nevertheless, even where the models differed in accuracy, the difference decreased as the number of characters in the text increased, until it became non-existent, which highlights that the difficulty of LangID lies specifically in very short texts.

Finally, the inference time of a model is a decisive variable when choosing the final model to use in production. The following table represents the inference time of the different models for 10k texts chosen randomly from the Tatoeba dataset:

Inference time of the different models for 10k texts under 100 characters from Tatoeba

This table allows us to observe three different groups of models. XLM-Roberta takes the longest to make predictions, which was predictable as transformer-based models are known to be slow at inference. Meanwhile, langid.py and LSTM are one order of magnitude faster than XLM-Roberta and two orders of magnitude slower than CLDv3 and FastText, which are the fastest.

Inference time can also vary with the length of the text, so we measured each model’s inference time for every text length and plotted it in the following graph:

Inference time according to the texts’ length

The inference time of XLM-Roberta, CLDv3 and FastText does not increase with text length, while langid.py and LSTM take longer to make predictions on longer texts. For LSTM this increase is clear: it takes 0.0006 seconds to predict the language of a 2-character text and 0.004 seconds for a 100-character text (roughly seven times longer). Concerning XLM-Roberta, the length of the input text does not change the inference time, which was expected since all texts are mapped to the same length before being fed to the model.

Conclusion

Language Identification is an extensively studied problem, especially in NLP, where it can be the first step of many projects to determine which language-specific model to choose. However, LangID on short texts is more challenging, and existing surveys on the subject often lack short-text datasets while limiting their comparison to a couple of techniques on one or two datasets. Therefore, in this blog post, we fill this gap by presenting one of the first comprehensive benchmarks of five LangID models on five different datasets focusing on short text examples.

We observed that the CLDv3 and langid.py models do not generalize well to short texts. The transformer-based model (XLM-Roberta) performs well for the task at hand on some datasets while having lower accuracy on others, which highlights the importance of using multiple datasets when comparing LangID methods. LSTM and FastText both achieved the best performance in the overall benchmark while exhibiting similar accuracy. However, when evaluating their inference time, FastText was two orders of magnitude faster than LSTM. In addition, FastText covers eight times more languages than LSTM, allowing us to safely conclude that FastText is the best solution to predict the language of short texts.

To go deeper into the work presented here, it would be interesting to compute the accuracy of the different models on cumulative text lengths instead of calculating it for each individual length. For future work, we propose further investigating transformer-based models trained on larger datasets with more languages, since XLM-Roberta showed good results despite being trained on much smaller datasets than FastText. It would also be interesting to calculate the accuracy of each model not only on its first prediction but, for example, on its top 5 predictions. In addition, we hope the line of research pioneered by Apple will push researchers to design machine learning models that specifically cater for short-text LangID. Finally, instead of choosing a single technique out of the pool of language identifiers surveyed here, we propose exploring the ensembling of different models (e.g. via majority voting) under the real-time constraint.

Bibliography

Apple. (2019). Language Identification from Very Short Strings. Apple Machine Learning Research.

Besedo. (2022). Implio API Documentation. Besedo.

Comparison of Language Identification Models. (2021). Model Predict.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., … & Stoyanov, V. (2020). Unsupervised Cross-lingual Representation Learning at Scale.

Karikari, P. (2021). What is Data: A beginner’s guide to understanding what Data means. Medium.

Lui, M., & Baldwin, T. (2012). langid.py: An Off-the-shelf Language Identification Tool. Proceedings of the ACL 2012 System Demonstrations, 25–30.

Jauhiainen, T., Lui, M., Zampieri, M., Baldwin, T., & Lindén, K. (2018). Automatic Language Identification in Texts: A Survey. arXiv.org.

Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). Bag of Tricks for Efficient Text Classification.

Mitja, T. (2015). Evaluating language identification performance. Twitter Engineering.

Nordquist, R. (2019). What Is a Text in Linguistics? ThoughtCo.

Papariello, L. (2021). XLM-Roberta-Base Language Detection. Hugging Face: papluca/xlm-roberta-base-language-detection.

Pranoto, J. A. (2019). Step-by-Step Text Classification — Tokopedia Data. Medium.

Toftrup, M., Asger Sørensen, S., Ciosici, M. R., & Assent, I. (2021). A reproduction of Apple’s bi-directional LSTM models for language identification in short strings. arXiv.org.

Twitter. (n.d.). Counting characters. Twitter Developer Platform Docs.

What Are Character Count Limits on Social Media & Email? (2021). Capitalize My Title.
