MT Metrics Explained: Feeling BLEU?
BLEU (BiLingual Evaluation Understudy) is by far the best-known automatic metric for machine translation evaluation. Despite its many shortcomings, it remains the de facto standard both in MT research and in the translation industry. In this article, I will try to highlight several issues that often cause confusion and hopefully dispel some of the common misconceptions surrounding this metric.
What Value of BLEU is “Good”?
Since BLEU is used as an automatic approximation of MT quality, a reasonable question might be “at what value can we say that the MT is high quality”? Unfortunately, the issue is a bit more complex:
First of all, individual BLEU scores do not tell us very much — there is no magical threshold score after which we can consider the MT “good”. In fact, a single BLEU value without additional context is essentially useless. This is because BLEU will vary widely depending on, among other things:
- The language pair.
- Content type/domain.
- The exact evaluation set.
- The engine type: is it generic or custom-trained?
This means that a BLEU score of 60 may actually be bad, while a BLEU score of 20 could be considered pretty good. It all depends on the context.
For example, in 2018 the Charles University English-Czech MT system achieved a BLEU score of “only” 26. However, despite this seemingly low score, the system was considered to be on par with human translators: it was capable of producing outputs that were either indistinguishable from, or preferable to, human-made reference translations. (With a few caveats, but that’s a topic for another time.)
So when working with BLEU, you should always look at the score relative to the score of another MT system working with the same content. BLEU can help you decide which generic engine to use, whether your custom engine outperforms some baseline, or whether the quality of MT pretranslations in your data has improved over time. Just don’t read too much into the actual values. (Unless you get a BLEU of 5 or 100; then you’re probably in trouble.)
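To make the relative comparison concrete, here is a minimal sketch using the sacrebleu Python package. The two engine outputs and the tiny two-segment test set are invented purely for illustration; in practice you would score hundreds or thousands of held-out segments from your own content.

```python
from sacrebleu.metrics import BLEU

# Hypothetical test set: one reference stream covering two segments.
# (A real evaluation set should be much larger and drawn from your own domain.)
references = [[
    "Open the settings menu and select your language.",
    "The file could not be saved because the disk is full.",
]]

# Invented outputs from two engines on the same segments.
generic_engine = [
    "Open the menu of the settings and choose your language.",
    "The file could not be stored because the disk is full.",
]
custom_engine = [
    "Open the settings menu and select your language.",
    "The file could not be saved because the disk was full.",
]

bleu = BLEU()
print("generic:", bleu.corpus_score(generic_engine, references).score)
print("custom: ", bleu.corpus_score(custom_engine, references).score)
# The absolute numbers mean very little on their own; what matters is that
# both systems were scored on identical references under identical settings.
```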
Finally, it goes without saying that measuring BLEU (or any other metric for that matter) cannot fully replace human evaluation. For instance, manual annotation can tell you whether BLEU is low because your MT system produces garbage, or because your texts are complicated and the MT is actually doing well. And it will help you identify critical problems, even in cases when BLEU may be deceptively high.
Should You Use BLEU?
BLEU has been a popular target for criticism for many years.
We first thought that BLEU did not work well in the low-quality range; now we think it does not work well in the high-quality range.
We also know that BLEU is terrible for evaluating individual sentences (while it may be “okay” for document-level evaluation). It also prefers fluency over accuracy, treats small morphological errors the same as completely mistranslated words, and largely ignores subtle phenomena such as coreference or negation.
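As a quick illustration of that last point, here is a small sketch using sacrebleu’s sentence-level BLEU with made-up sentences. One hypothesis contains a minor morphological slip, the other a genuine mistranslation, yet because each differs from the reference by a single token, BLEU penalizes them identically.

```python
from sacrebleu.metrics import BLEU

# effective_order=True is the recommended setting for sentence-level BLEU.
bleu = BLEU(effective_order=True)

reference = ["the cats were sleeping on the red sofa"]

# Invented hypotheses, each exactly one token away from the reference:
morphological_slip = "the cat were sleeping on the red sofa"   # wrong number
mistranslation     = "the dogs were sleeping on the red sofa"  # wrong animal

print(bleu.sentence_score(morphological_slip, reference).score)
print(bleu.sentence_score(mistranslation, reference).score)
# Both hypotheses lose exactly the same n-gram matches, so their BLEU scores
# come out identical, even though the second error is far more serious.
```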
Besides simple inertia, one of the reasons BLEU is still used may be its simplicity and usefulness for many real-world cases. If I train a custom MT system and its BLEU is 20 points below a generic baseline, I’m not going to point the finger at BLEU’s bad correlation with humans. Instead, I will go and fix the error in my training data that caused the quality to drop. In many cases, despite all its problems, BLEU is simply good enough.
There are many alternatives to BLEU, and in a future post we may take a closer look at the whole “zoo” of MT metrics with their various similarities and differences. Recently developed metrics based on large-scale multilingual neural networks are an especially interesting option. There is one simple alternative metric that deserves a mention here, though:
At Phrase, we typically avoid BLEU. Instead, we prefer to use chrF3 for these practical reasons:
- The metric works with character n-grams instead of words, so it does not depend on tokenization and copes better with morphologically rich languages, and even with languages such as Chinese, Japanese, and Korean.
- It typically correlates with human ranking a lot better than BLEU on the sentence level, and slightly better on the document level.
- It’s just as simple as BLEU, and open-source implementations exist (we recommend sacrebleu). There is no training or fine-tuning, and no complicated language-specific resources to deal with.
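If you want to try it yourself, here is a minimal sketch of computing both metrics with sacrebleu’s Python API. The sentences are invented, and beta=3 is passed explicitly because the default chrF variant may differ between sacrebleu versions.

```python
from sacrebleu.metrics import BLEU, CHRF

# Invented system outputs and one reference stream.
hypotheses = ["Open the settings menu and select your language.",
              "The file could not be saved because the disk was full."]
references = [["Open the settings menu and select your language.",
               "The file could not be saved because the disk is full."]]

# beta=3 weights recall three times more than precision, i.e. the chrF3 variant.
chrf3 = CHRF(beta=3)
bleu = BLEU()

print("chrF3:", chrf3.corpus_score(hypotheses, references).score)
print("BLEU: ", bleu.corpus_score(hypotheses, references).score)
```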
Simply put: if you could only use one metric, then we would recommend chrF3.