Language Detection in Short, Messy Texts

Jon Purnell
Spectrum Labs
Feb 11, 2022 · 5 min read

Language detection doesn’t receive a lot of attention in natural language processing (NLP) circles, and that’s a shame because it plays a vital role in Trust & Safety. When it is deployed, language detection AI is typically trained on longer forms of text — things a writer will naturally check for clarity and accuracy before posting or sharing with anyone. And yet all sorts of toxicity can be expressed in messy texts or posts that are three to five words, max.

For instance, a Trust & Safety professional told one of our colleagues that while she was employed at a major social media company, women in India were reporting “take a bus” as threatening. It turned out that a heinous crime committed against a woman on a public bus had shocked the nation. Lacking that much-needed context, the platform’s model couldn’t detect the problem.

Short texts are a gaping hole in Trust & Safety today, and they’re a tough challenge to solve because they’re far more ambiguous than longer-form text. This content is written impulsively, full of noise, and genuinely difficult to classify.

Unique Challenges to Short, Messy Texts

Many platforms have a multinational user base, and in those cases Trust & Safety tools need to analyze texts and posts in many languages. But short texts mean fewer signals, which in turn means less opportunity to disambiguate words that show up in multiple languages.

Take the word “die” as an example. In English it means to end life, but in German it’s the common article “the.” It’s easy to see how a short phrase containing “die” and a bit of slang can sound innocuous in English but threatening in German, and vice versa. Similarly, in English the word “goddess” denotes a female god or an extraordinary woman, but in other languages it refers to a sex toy. So the first challenge with short-text language detection is finding a way to interpret and categorize such phrases.
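
To get a feel for how unstable off-the-shelf language identification becomes on very short inputs, here’s a minimal sketch using the open-source langdetect library (one library choice among many; the sample texts are only illustrative):

```python
# pip install langdetect
from langdetect import DetectorFactory, detect_langs

# langdetect is probabilistic; fixing the seed makes runs repeatable.
DetectorFactory.seed = 0

samples = [
    "die",                 # English verb or German article?
    "die now",             # a little more signal, still ambiguous
    "die Katze schläft",   # a longer German phrase is much easier to call
]

for text in samples:
    # detect_langs returns candidate languages with probabilities; on very
    # short strings the top guess and its confidence tend to swing widely.
    print(f"{text!r:25} -> {detect_langs(text)}")
```

The point isn’t this particular library; it’s that with only a handful of characters, any detector has very little to disambiguate on.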

Complicating matters further, some users intentionally add ambiguity through purposeful misspellings or leet speak, which can easily slip past word filters and keyword/RegEx tools. Emojis can also contribute to the overall intent of a text in ways that word filters can’t detect, much less parse into an appropriate representation.
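
As a rough illustration of why these slip through, consider a naive keyword/RegEx filter. The blocklist below is hypothetical, but the failure pattern is the same no matter which words it contains:

```python
import re

# A hypothetical blocklist that a simple word filter might use.
BLOCKED_PATTERNS = [r"\bidiot\b", r"\bkill\b"]

def trips_filter(message: str) -> bool:
    """Return True if any blocked pattern matches the raw message."""
    return any(re.search(p, message, flags=re.IGNORECASE) for p in BLOCKED_PATTERNS)

messages = [
    "you are an idiot",           # canonical spelling: caught
    "u r such an 1d10t",          # leet-speak spelling: slips past
    "k i l l yourself",           # spaced-out spelling: slips past
    "🔪🔪 see you after school",   # the emojis carry the intent: nothing to match
]

for msg in messages:
    print(f"caught={trips_filter(msg)!s:<5} {msg}")
```

Every variant expresses the same intent, but only the canonical spelling matches the patterns.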

And then there’s the fact that toxicity can occur without a single bad word or explicit threat being written. Take the phrase “I’m going to take you out.” It’s completely innocuous if shared between two people who are dating, but threatening if shared between estranged spouses.

Another challenge is homonyms, something we experienced here at Spectrum Labs. One of our Chinese-speaking data labelers had labeled a specific word as hate speech, but when we reviewed it, we couldn’t understand why. We then asked one of our team members who is fluent in Chinese, and she confirmed that the word in question was a homonym, and not explicitly toxic.

With short texts it’s significantly harder to label data correctly, infer behavior and accurately detect language, and that causes a lot of trouble for Trust & Safety teams. The real task at hand with moderation isn’t to find a particular sequence of characters; it’s to find the particular intent expressed by a person in a post or text. That’s where things get really tricky.

In order to understand intent, we must first be able to detect language, which in short-form text is composed of made-up words, misspellings, poor grammar and emojis. So not only do you need to interpret those individual emojis and words, you also need to contextualize them in relation to each other.
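
One common pre-processing step (shown here as a sketch, not Spectrum Labs’ actual pipeline) is to convert emojis into named text tokens so a downstream model can weigh them in context with the surrounding words, for example using the open-source emoji package:

```python
# pip install emoji
import emoji

def to_tokens(message: str) -> list[str]:
    """Turn emojis into named text tokens so they can be modeled
    alongside ordinary words rather than dropped as noise."""
    text = emoji.demojize(message, delimiters=(" :", ": "))
    return text.lower().split()

# Individually, none of these tokens is an explicit threat;
# it's the combination that a contextual model has to pick up on.
print(to_tokens("I'm going to take you out 🔪"))
```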

These challenges make the problem very hard to solve with a straight machine-learning approach alone.

Off the Shelf Solutions Don’t Work

Because short, messy text doesn’t follow the vernacular or rules of formal writing, off-the-shelf tools don’t work on it. Google Translate is an excellent tool and does a fine job translating a page whose writer took great care to conform to the basic rules of grammar. But it isn’t very accurate with short text, which results in a lot of misclassifications.

Companies like Google and Amazon offer classifiers (message-level AI) that address toxicity, but many social media and online community companies prefer not to use them. Each platform has its own definition of what is and is not allowed. A platform meant for children will have a very different definition than an online game that caters to young men.

A test of Google’s AI demonstrates the challenge. Last year, Google’s Perspective API ranked a drag queen’s tweets as more toxic than David Duke’s, mostly because the former used the terms “fag,” “sissy” and “bitch” multiple times. Within that community, these words are not offensive and are often used to express affection.

And if your platform is used by speakers of a language that isn’t native to your Trust & Safety team, there’s very little chance the team’s word lists will be adequate to the task at hand.

Solving the Challenge: What’s Needed

We see four areas that will help Trust & Safety monitor short messy texts more accurately:

  • More Context to Strengthen the Signal. In scenarios where Trust & Safety has fewer words and fewer characters — combined with novel forms of expression — understanding intent is monumentally harder. By bringing in more context, however, you can improve your accuracy and understanding of these texts.
    One of the contexts we look at is the platform itself. If a text was posted to Reddit, we look at the subreddit in which it appeared. Other contexts to look at include language used in recent messages, both in the conversation and by the speaker, self-identified attributes in profiles and geo-location.
  • Resolve Language Coding / Leet Speak / Misspellings. It’s helpful to pre-process messages so they’re converted into text that’s actually usable (a minimal normalization sketch follows this list). Many platforms attract specific types of users who have their own words and expressions you’ll need to incorporate into your model. Additionally, your moderation team may be able to supply you with a list of code words.
    The blog What is Leet Speech? offers tips for understanding the language, and even links to a Leet speak translator.
  • Train and Build Out Custom Models. The data in an off-the-shelf dataset or global classifier probably won’t be representative of your users, since it is trained on highly curated language such as Wikipedia articles. On the whole, it’s better to get training data from your own platform and use it to build out your models. This will also let you identify new words and phrases as they grow in popularity.
  • Native Speakers. Hire people who are native speakers of the language of the content they’ll be asked to evaluate. A speaker who isn’t linguistically and culturally fluent may be unaware of the particular nuances, euphemisms or cultural references of a given language, or of how the context of a word or phrase can change its meaning.
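
To make the second bullet concrete, here’s a minimal normalization sketch. The substitution map and code-word list are placeholders; in practice they would come from your moderation team and from patterns observed on your own platform:

```python
# Minimal normalization sketch; the mappings below are illustrative placeholders.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

# Hypothetical platform-specific code words supplied by the moderation team.
CODE_WORDS = {
    "unalive": "kill",
}

def normalize(message: str) -> str:
    """Map leet-speak characters and known code words back to plain text
    before the message reaches language detection or classification."""
    text = message.lower().translate(LEET_MAP)
    return " ".join(CODE_WORDS.get(token, token) for token in text.split())

print(normalize("u r such an 1d10t"))   # -> "u r such an idiot"
print(normalize("go unalive ur$elf"))   # -> "go kill urself"
```

A normalization pass like this runs before language detection and classification, so the models see text that’s closer to what they were trained on.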

Trust & Safety tools are rapidly advancing, but too many overlook a major source of toxicity: short texts and posts. And yet the negativity and toxicity within these short, messy texts can have a significant impact on the user experience. Users join platforms with specific goals in mind, but if they’re confronted with snide, mean or toxic messages, they’ll leave for another platform.

To learn more about Spectrum Labs’ solution for multi-language moderation, visit our website.
