Wat lang iz dis? Language Identification of User-Generated Content

Devanshi Bhatt
Spectrum Labs
Aug 11, 2022

Language identification may sound like an easy task to automate, but it has proven challenging along several dimensions. For multilingual text, errors in language identification can rapidly propagate down the NLP pipeline: for example, when it is used to match text to the languages a downstream model was trained on, or to route text to labelers who understand the identified language. It is therefore important to understand the challenges and limitations of currently available tools in order to realistically plan workflows that include language ID.

User-generated content presents particular challenges in language identification. There are billions of people across the world using the internet — playing online games, posting messages to social media, and chatting on dating apps. The content on such platforms is created by diverse, international individuals who speak different languages and use different writing styles.

A given piece of text may include multiple languages, or be valid text in more than one language. It then becomes important to frame the language detection problem as “what is the most prevalent language” or “what are the top-n languages” rather than as a naive classification task of “what language is this”. In this article, we outline several distinct challenges in detecting the language of user-generated content and in evaluating the performance of language detectors.
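As a minimal sketch of this framing, langdetect already exposes a top-n interface through detect_langs, which returns ranked candidates with probabilities (the example text is ours and the printed values are illustrative):

```python
# pip install langdetect
from langdetect import detect_langs, DetectorFactory

# langdetect is non-deterministic by default; fix the seed for stable output
DetectorFactory.seed = 0

# detect_langs returns a ranked list of candidates with probabilities,
# e.g. [fr:0.57, en:0.43], instead of forcing a single label.
for candidate in detect_langs("merci beaucoup my friend"):
    print(candidate.lang, round(candidate.prob, 2))
```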

Detection challenges:

Writing style:
Articles on Medium or those in a newspaper are more formal than the conversational styles of comments on a social media platform or chats on a dating platform. Consider the difference between “There are many different types of automobiles — steam, electric, and gasoline — as well as countless styles.” and “U shud cum home ths weekend fr dinner wid my parents!” These examples illustrate how conversational text can often be out-of-distribution for a language detection model that was trained on more formally written data.

Length of text:
Longer text (typically more than 7 words) lends itself to a more precise estimation of the language identified, irrespective of which method is used to detect it. This is because more words provide more context, and therefore a higher likelihood of the text being matched to a particular language. Individual words are particularly challenging — for example, “blond” can be French or English, and there is no way to tell without additional context.
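To illustrate, lingua-py can report per-language confidence values. The sketch below, which assumes the API of recent lingua-language-detector releases, compares the ambiguous single word with a longer sentence (the French sentence and the printed values are illustrative):

```python
# pip install lingua-language-detector
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.FRENCH
).build()

# A single ambiguous word: confidence is split across languages.
for confidence in detector.compute_language_confidence_values("blond"):
    print(confidence.language.name, round(confidence.value, 2))

# More words give more context, sharpening the estimate.
for confidence in detector.compute_language_confidence_values(
    "elle a les cheveux blonds et les yeux bleus"
):
    print(confidence.language.name, round(confidence.value, 2))
```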

Multilingual text:
Messages on a dating or gaming platform may not necessarily be in one language, particularly in cross-cultural communities. This makes language identification a complex classification problem. For example, “por favor come soon” (“por favor” is Spanish and Portuguese for “please”) contains both Spanish and English words. Which language label would you give to this text? langdetect and fastText classify it as Portuguese, whereas Google opts for Spanish and lingua-py for English.
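For mixed-language text like this, asking the detector for its top-k candidates is more informative than a single label. A sketch with fastText's pre-trained language identification model (lid.176.bin must be downloaded separately from the fastText website; the printed scores are illustrative):

```python
# pip install fasttext
import fasttext

# Pre-trained language ID model, downloaded separately (lid.176.bin)
model = fasttext.load_model("lid.176.bin")

# k=3 surfaces the top three candidate languages and their scores,
# exposing the ambiguity instead of hiding it behind one label.
labels, scores = model.predict("por favor come soon", k=3)
for label, score in zip(labels, scores):
    print(label.replace("__label__", ""), round(float(score), 2))
```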

Chat-speak vocabulary:
Conversational text abundantly uses chat-speak vocabulary. However, abbreviations with origins in English, like lol, haha, brb, and omg, are not limited to English text. When such internet slang is used in other languages, it adds noise to the language detection input. An example we recently came across is “lol sei divertente.” (Italian for “lol, you’re funny”), a combination of Italian text and “lol” chat-speak. For this example, both fastText and lingua-py correctly predicted Italian.
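One possible mitigation is to strip known chat-speak tokens before detection. The sketch below is only an illustration; the CHAT_SPEAK list and the normalize_chat_speak helper are hypothetical, and a real system would need a far broader lexicon:

```python
import re

# Hypothetical, deliberately tiny chat-speak lexicon for illustration.
CHAT_SPEAK = {"lol", "haha", "brb", "omg", "lmao"}

def normalize_chat_speak(text: str) -> str:
    """Drop chat-speak tokens so they don't bias language detection."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    kept = [token for token in tokens if token not in CHAT_SPEAK]
    return " ".join(kept)

print(normalize_chat_speak("lol sei divertente."))  # -> "sei divertente"
```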

Code Mixing:
Another challenge for language detection is code switching or code mixing, when the script of one language is used to write a message or comment in another language. For example: “muje bahar jana hai” (“I want to go outside”). The message is in Hindi, but it is written in the Latin alphabet rather than Devanagari. Which language label would you give to this text? fastText predicts Italian, whereas lingua-py predicts Polish.

Shared vocabulary:
Many words are shared across languages, a challenge that is often compounded by the challenge of short text outlined earlier. Accurate identification is important for downstream tasks, since words may have different meanings in different languages. For example: the word “come” in English means “to approach or move”, while in Italian it means “how”. Language pairs with high lexical similarity can be particularly challenging to distinguish. For example, Portuguese and Spanish have a lexical similarity of 89% [1].

The above challenges validate the importance of defining the language detection task beyond naive multi-class classification, framing it instead as a question of prevalence or of the top-n predicted languages. They also suggest several areas where preprocessing or triage may help improve the performance of current tools.

Performance evaluation challenges:

The challenges of language identification for user-generated content are not limited to the detection stage. To evaluate the performance of your language detector, you will need a labeled test dataset against which to calculate metrics like accuracy or precision. Labeling data for language can be costly for a sizable dataset reflecting the range of languages relevant to your system. In addition, it can be hard to find qualified labelers for a wide range of languages, particularly low-resource ones. And since the language of user-generated text is often unknown in advance, even routing a text sample to an appropriate native speaker for labeling is a challenge.
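Once a labeled sample exists, computing the metrics themselves is straightforward. A minimal sketch with scikit-learn, assuming parallel lists of gold and predicted ISO 639-1 codes (the labels below are made up for illustration):

```python
# pip install scikit-learn
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical gold labels and detector predictions.
y_true = ["en", "es", "it", "hi", "en", "pt"]
y_pred = ["en", "pt", "it", "pl", "en", "pt"]

print("accuracy:", accuracy_score(y_true, y_pred))
# Macro-averaging weights every language equally, which matters when
# low-resource languages are rare in the evaluation set.
print("macro precision:", precision_score(
    y_true, y_pred, average="macro", zero_division=0
))
```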

Due to these challenges, there are very few publicly available benchmarks for language ID, nor are there many datasets for this task in general. To survey publicly available datasets, we reference the two most common repositories for NLP datasets: Papers with Code and Hugging Face. Papers with Code identifies 3 benchmark datasets and 14 datasets overall. Of the 3 datasets with significant usage in the research community, OpenSubtitles is the largest, with 2.6 billion sentences across 60 languages. Universal Dependencies contains 104 languages, with dozens to thousands of sentences per language. Common Voice includes 60 languages, but is primarily a voice dataset, with transcriptions available but no sentence counts listed.

Beyond limitations in the number of languages or the quantity of data, these datasets are also out of domain for user-generated text, and a model evaluated on them will run into many of the challenges outlined above. The Hugging Face dataset repository does not have a separate category for language identification datasets, and the handful of language ID models with a Dataset card were also out of domain for user-generated text.

Limitations of off-the-shelf language detectors

Many off-the-shelf language identification tools, like Google Detector & Translator, Facebook's fastText, lingua-py, and langdetect, are commonly used in research and academia. When it comes to user-generated text, even these popular techniques sometimes fail to detect the correct language. Here are a few examples which represent several of the detection challenges described above.

[Table: language identification results of the off-the-shelf techniques on example phrases]

The first example here is “ty bro,” which is a short version of “thank you brother.” This phrase gets mis-detected by 3 out of 4 language detection techniques.

In the second example, all the off-the-shelf techniques mis-detect the text as French. This is likely due to the shared vocabulary problem, where “force” is a valid word in both English and French, further compounded by the short length of the sentence.

These examples and the corresponding mis-detections by the off-the-shelf techniques validate the challenges of detecting the language of user-generated content described in this post.
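Comparisons like these can be reproduced locally with a small harness. The sketch below uses two of the pip-installable detectors (Google's detector requires API access and fastText a separate model download, so both are omitted; outputs may vary across library versions):

```python
# pip install langdetect lingua-language-detector
from langdetect import DetectorFactory, detect
from lingua import LanguageDetectorBuilder

DetectorFactory.seed = 0  # make langdetect deterministic
lingua_detector = LanguageDetectorBuilder.from_all_languages().build()

for text in ["ty bro", "por favor come soon", "muje bahar jana hai"]:
    lingua_language = lingua_detector.detect_language_of(text)
    print(text)
    print("  langdetect:", detect(text))
    print("  lingua-py: ", lingua_language.name if lingua_language else "unknown")
```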

Conclusion:

We have outlined several challenges inherent to language detection prediction and evaluation on user-generated content, as well as limitations of several popular tools. Our first recommendation is to treat language detection as a task with multiple predictions rather than attempting to predict a single language. Further, taking a page from the data-centric AI movement, we propose training and/or evaluating models on labeled user-generated text rather than formal text, if that is the type of data you will be working with in your ML system. Finally, we recommend running inference on longer blocks of text when feasible.

If you are interested in chatting about language ID or related NLP challenges with someone at Spectrum, reach out on our contact page!

References

[1] Ethnologue: Languages of the World. United States, 2002. Web Archive. Retrieved from the Library of Congress, www.loc.gov/item/lcwaN0021868/.
