Why did we choose fastText to identify the language of text at scale?

Akshith Pejavar
affinityanswers-tech
3 min read · Feb 21, 2021
Photo by Heather McKeen from Pexels

Recently we worked on a project where we had to detect the language of upwards of 230 MM user comments. The goal was to extract Spanish, Portuguese, French, and Arabic comments for bespoke analysis.

As you might know, social media users comment in whichever language is native to them. Twitter, for instance, keeps adding new languages to its already large support pool, and non-English tweets now make up about 50% of the total. Determining the language of a text is no longer complex, but figuring out which solution best applies to your scenario is difficult given the plethora of options available. In our case, we were looking for a solution that could scale and could be extended to other languages.

Start with cleaning the data

As in any data processing exercise, the first step is cleaning. Let us take a set of tweets as an example.

'''
tweet_text
0 @ZackSnyder @DavidAyerMovies I always loved Mr...
1 @AOC Let’s go screw the left. https://t.co/1me...
2 @jameelajamil @TeaRose1536 @gragnolla You don’...
3 @ATEEZofficial @yunhxone Ireneee miralo que bo...
4 @ZackSnyder What Zack is referring to vs...
'''

We can see that hashtags, @mentions, and URLs (http://…) are present in the Twitter replies. These can mislead the language detection process, so we removed them first.

import re

def filter_tweets(msg):
    # Strip @mentions, URLs, and anything that is not alphanumeric or whitespace
    regex = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)"
    return ' '.join(re.sub(regex, " ", msg).split())

df["filtered_text"] = df["tweet_text"].apply(filter_tweets)
print(df["filtered_text"].head())
'''
0 I always loved Mr JOKER I never saw Mr Joaquin...
1 Let s go screw the left
2 You don t have to be an expert to recognize an...
3 Ireneee miralo que bonicooo
4 What Zack is referring to vs Wonder Woman 765
'''

Now that we have cleaned the data let's move on to what we initially set out to do: language detection. We explored three different options for language detection.

Langdetect

Langdetect is a language detection library ported from Google’s language-detection project.

from langdetect import detect

def langdetect_detect(msg):
    try:
        ln = detect(msg)
    except Exception:
        ln = None
    return ln

df["language"] = df["filtered_text"].apply(langdetect_detect)
print(df[["filtered_text", "language"]].head())

LangID

langid.py is a standalone Language Identification (LangID) tool.

from langid.langid import LanguageIdentifier, model

lang_identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

def langid_detect(msg):
    try:
        # classify() returns a (language, probability) tuple
        ln = lang_identifier.classify(msg)[0]
    except Exception:
        ln = None
    return ln

df["language"] = df["filtered_text"].apply(langid_detect)
print(df[["filtered_text", "language"]].head())

fastText

fastText is a library for efficient learning of word representations and sentence classification. We used fastText for language identification, inspired by this post. One can train fastText to identify languages using labeled data; however, we did not have labeled data. fastText can also load pre-trained models, which works out best for our case, as the language classification we need is something scores of people have already figured out. fastText distributes two pre-trained models that can recognize 176 languages, so we downloaded one of them and used it for our needs.
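For reference, fastText publishes both models on its website: lid.176.bin (larger and slightly more accurate) and lid.176.ftz (a compressed version). A sketch of fetching the compressed one, using the download URL as published by fastText at the time of writing:

import urllib.request

# Download the compressed pre-trained language identification model
url = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz"
urllib.request.urlretrieve(url, "lid.176.ftz")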

import fasttext

model = fasttext.load_model("lid.176.ftz")

def fast_detect(msg):
    try:
        # Labels look like "__label__en"; keep only the language code
        ln = model.predict(msg)[0][0].split("__")[2]
    except Exception:
        ln = None
    return ln

df["language"] = df["filtered_text"].apply(fast_detect)

Performance Metrics

Total social media comments processed: 10,000

Average characters in each comment: 180 characters

Machine configuration: 8 CPU, 32 GiB Memory

# ------------ LangDetect ------------
%%timeit -n 1
df["language"] = df["filtered_text"].apply(langdetect_detect)
# 1min 35s ± 43.3 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# ------------ LangID ------------
%%timeit -n 1
df["language"] = df["filtered_text"].apply(langid_detect)
# 14.2 s ± 1.21 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# ------------ fastText ------------
%%timeit -n 1
df["language"] = df["filtered_text"].apply(fast_detect)
# 190 ms ± 1.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Why did we choose fastText?

Scalability

If you want to detect the language of a large data set, use fastText: the performance gain is significant, as the metrics above show. We adopted fastText for exactly that reason.

Extensibility

Another advantage of fastText is the ability to train a model for a new language; say, for example, we need to add support for a less widely spoken (or written) language (see the sketch below).
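For illustration, training a custom identifier with fastText's supervised mode only needs a text file where each line pairs a label with a sentence. The file name and hyperparameters below are placeholders, not something we actually ran:

import fasttext

# Each line of train.txt looks like: "__label__xx <sentence in that language>"
custom_model = fasttext.train_supervised(input="train.txt", epoch=25, lr=0.5)
custom_model.save_model("custom_lid.bin")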

Ease of use

That said, for a one-off project where scale is not a criterion, you are better off using LangDetect or LangID, which are simpler to set up.
