How you can improve language detection for free
At TheFork, we detect the language of reviews left by users.
Because of that, we can implement the appropriate moderation of their content, enable filtering, and build additional features that enhance our product.
Recently, we achieved a significant improvement in recognition.
And we did it for free :)
“Where should I start?”
I know. Nowadays there are ML models everywhere, and it’s easy to feel overwhelmed by the plethora of providers.
Chances are that you have already started a cost analysis comparing services like Google Cloud Translation API, Amazon Comprehend, or Azure AI Language. And I am pretty sure you quickly realized that their price is far from negligible.
So this leads to a simple question: is it possible to do it for free?
The answer is Yes.
How
Companies like Google, Meta, Amazon, and Microsoft are well known for their products, but few people know that they also maintain open source projects.
In particular, in 2015 Facebook’s AI Research lab released fastText, a library focused on the implementation of text representation and classification.
The good news is that some pre-trained models are provided for free and, in this article, I am going to show you how you can use them to do language detection!
Quick benchmark
We tried fastText with this dataset and compared the results with our previous detector:
- 💪🏻 98.7% accuracy (vs 79%)
- 🤩 176 supported languages (vs 83)
What I am going to show you
Our goal was to implement a detectLanguage endpoint that accepts:
- text, the string to be analyzed
- n, the number of most likely languages to return
For example, with the review
“We had an amazing Pizza in this Restaurant!”
and n=5, we want an output like this:
[
  {
    "language": "en",
    "probability": 0.99023
  },
  {
    "language": "nl",
    "probability": 0.00182
  },
  {
    "language": "de",
    "probability": 0.00167
  },
  {
    "language": "fr",
    "probability": 0.00123
  },
  {
    "language": "pt",
    "probability": 0.00034
  }
]
so that each prediction comes with its confidence value.
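To make the contract above concrete, here is a minimal sketch of the request and response shapes, plus a small parameter check. The type names, the validateRequest helper, and the upper bound on n are assumptions made for illustration; they are not part of fastText or of our actual service.

```typescript
// Hypothetical shapes for the detectLanguage endpoint described above.
interface DetectLanguageRequest {
  text: string; // the string to be analyzed
  n: number;    // how many candidate languages to return
}

interface LanguagePrediction {
  language: string;    // language code, e.g. "en"
  probability: number; // confidence value between 0 and 1
}

// ASSUMPTION: we cap n at the number of languages the model supports.
const MAX_LANGUAGES = 176;

// Returns a list of human-readable problems; an empty list means the request is valid.
export const validateRequest = ({ text, n }: DetectLanguageRequest): string[] => {
  const errors: string[] = [];
  if (text.trim().length === 0) {
    errors.push('text must not be empty');
  }
  if (!Number.isInteger(n) || n < 1 || n > MAX_LANGUAGES) {
    errors.push(`n must be an integer between 1 and ${MAX_LANGUAGES}`);
  }
  return errors;
};

// Example of a well-formed response item.
const example: LanguagePrediction = { language: 'en', probability: 0.99023 };
```

Rejecting bad input before it reaches the model keeps the prediction code free of edge cases such as n=0 or an empty review.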
Implementation in Node.js
In the Language Identification section of the fastText documentation, there’s a downloadable model called lid.176, so named because it can recognize 176 languages, which it returns as ISO codes.
This is an offline model: it is a static file that runs locally, so there is no external service to call or keep running.
You can add it to your repo, and then use node-fasttext to load it in memory when your service starts:
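// service.ts
import * as path from 'path';
import { Classifier } from 'fasttext';

// Absolute path to the pre-trained model stored in the repo
const model = path.join(__dirname, '/language-detection-model/lid.176.bin');
const classifier = new Classifier();
// Load the model once, from your service's async startup code
await classifier.loadModel(model);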
From now on, you can use that instance to make predictions.
As an example, this is what your endpoint could look like:
// detectLanguage.ts
// `classifier` is the instance loaded at startup in service.ts
export const detectLanguage = async (
  text: string,
  n: number,
) => {
  // Ask the model for the n most likely languages
  const results = await classifier.predict(text, n);
  /* results
  [
    { "label": "__label__en", "value": 0.99023 },
    { "label": "__label__nl", "value": 0.00182 },
    { "label": "__label__de", "value": 0.00167 },
    { "label": "__label__fr", "value": 0.00123 },
    { "label": "__label__pt", "value": 0.00034 }
  ]
  */
  // Strip the fastText label prefix to keep only the language code
  return results.map(({ label, value }) => ({
    language: label.replace(/^__label__/, ''),
    probability: value,
  }));
};
How we use this endpoint
Now you might be asking which prediction you should pick to do the detection.
Is it enough to take the most probable? In our opinion, no.
Each time a new review is created, an asynchronous process is triggered: a worker calls this endpoint and assigns a language to the review, but only when the prediction is reliable.
What is a reliable prediction?
What follows is our strategy to interpret the result of the detectLanguage endpoint:
1. take the two predictions p1 and p2 with the highest probabilities
2. if probability of p1 ≥ 0.7, then p1 is reliable 👍🏻
3. else, compute r
4. if probability of p1 ≥ 0.5 AND r ≥ 3, then p1 is reliable 👍🏻
5. else, p1 is unreliable 👎🏻
If we end up with an unreliable prediction, we write NULL in our database.
However, if the unreliable prediction matches the user’s language, we use p1 anyway.
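The strategy above could be sketched as follows. This is an illustration under stated assumptions, not our production code: the formula for r is not spelled out in the steps, so the sketch assumes r is the ratio between the probabilities of p1 and p2. The pickLanguage name and the userLanguage parameter are hypothetical as well.

```typescript
// Shape of each item returned by the detectLanguage endpoint.
interface Prediction {
  language: string;
  probability: number;
}

// Thresholds taken from the strategy described above.
const HIGH_CONFIDENCE = 0.7;
const MIN_CONFIDENCE = 0.5;
const MIN_RATIO = 3;

// ASSUMPTION: r is the ratio between the two highest probabilities.
// With no runner-up (or a zero probability), p1 has no competitor at all.
const ratio = (p1: Prediction, p2?: Prediction): number =>
  p2 && p2.probability > 0 ? p1.probability / p2.probability : Infinity;

// Returns the language to store, or null when the prediction is unreliable.
export const pickLanguage = (
  predictions: Prediction[],
  userLanguage?: string, // hypothetical: the language known for the user
): string | null => {
  // Sort a copy so the caller's array is left untouched.
  const [p1, p2] = [...predictions].sort((a, b) => b.probability - a.probability);
  if (!p1) return null;

  if (p1.probability >= HIGH_CONFIDENCE) return p1.language;
  if (p1.probability >= MIN_CONFIDENCE && ratio(p1, p2) >= MIN_RATIO) {
    return p1.language;
  }
  // Unreliable, but still accepted when it matches the user's language.
  return p1.language === userLanguage ? p1.language : null;
};
```

A null return value maps directly to the NULL written in the database, so the worker can store the result as-is.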
Results
With this approach, we have been able to scan over 22M records, and we brought the percentage of reviews with a NULL language from 20% to 1% 🔥.
Unfortunately, we can’t measure accuracy on these reviews, because we don’t have a source of truth that tells us what the real language of each review is.
However, after implementing these changes, we have seen our moderation improve, our filters become more effective, and we have unlocked the possibility of working on new features 🚀.