At the beginning of July, Meta AI published an outstanding work: No Language Left Behind (NLLB).
Update (v2): The v2 of the NLLB paper is out on arXiv. Meta AI’s reaction to my review was very quick and cordial (despite the salt and errors in my review…). The spBLEU-BLEU and chrF-chrF++ comparisons have been replaced. Now, for most tables in the automatic evaluation section, NLLB compares its own scores with scores copied from previous work. I believe that they did their best to make these scores as comparable as possible. They follow the machine translation evaluation standard… and thus I still strongly disagree that all these scores are comparable, but I won’t argue more about this particular paper. I will write another, more general, article on why we should stop comparing copied results. Unfortunately, I have many compelling examples from the scientific literature…
The following review was written based on v1 of the NLLB paper. Some of my comments no longer apply to subsequent versions of the paper, but I think it is still worth reading if you are interested in better understanding, or discovering, very common pitfalls in machine translation evaluation.
NLLB presents a new translation model and datasets for 200 languages. This is a wonderful initiative that will definitely benefit many on the planet.
It is also a scientifically dubious work. In this article, I demonstrate that many of Meta AI’s claims in NLLB are unfounded, misleading, and the result of a deeply flawed evaluation. I will also show that, following Meta AI’s evaluation methodology, it is very easy to obtain even higher numbers than those they reported.
This article doesn’t target a specific audience but rather anyone interested in understanding how AI researchers can make exceptional claims based on truly meaningless numbers. Hopefully, this won’t be too technical. To keep it short and simple, I won’t go in-depth into all the problems in Meta AI’s work.
Meta AI released a scientific paper to fully explain and evaluate NLLB. In the paper’s abstract, they claim the following: “Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.” In other words, NLLB would be better than previous work. I’ll explain BLEU below, but to give you some context, a…