MT Metrics Explained: A Visit to the MT Metric Zoo

Aleš Tamchyna
3 min read · Dec 2, 2021


In the previous article we took a look at BLEU, the best-known metric for automated evaluation of machine translation (MT) systems. Today I will explore the zoo of MT metrics a bit, looking at a few notable species: chrF3, TER, Meteor and BEER. Like BLEU, these metrics mostly just compare the MT output string with one or more reference (human) translations.

I deliberately avoid the recent metrics based on pre-trained Transformers, such as BERTScore, PRISM or COMET. These deserve an article of their own.

chrF3

chrF, or character n-gram F-score, is a simple but very effective metric proposed by Maja Popović in 2015. Informally, it measures the overlap of short sequences of characters (n-grams) between the MT output and the reference; the 3 in chrF3 refers to the β parameter, which weights recall three times as much as precision. It addresses several shortcomings of BLEU:

  • Since it’s character-based, it’s not very sensitive to how sentences are tokenized (split into individual words/tokens). This also makes it easier to apply to CJK languages.
  • Words with morphological errors may still get partial credit for a correct stem or root, making chrF more robust for morphologically rich target languages.
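
To make the character n-gram overlap concrete, here is a toy, self-contained sketch of a sentence-level chrF score in Python. It averages n-gram precision and recall over orders 1–6 and combines them with β = 3; the real sacrebleu implementation differs in details such as whitespace handling and corpus-level aggregation, so treat this as an illustration only.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of one order (whitespace dropped for simplicity)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def toy_chrf(hypothesis: str, reference: str, max_order: int = 6, beta: float = 3.0) -> float:
    """Toy sentence-level chrF: average n-gram precision/recall over orders, then F_beta."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        matches = sum((hyp & ref).values())  # clipped n-gram overlap
        precisions.append(matches / max(sum(hyp.values()), 1))
        recalls.append(matches / max(sum(ref.values()), 1))
    p, r = sum(precisions) / max_order, sum(recalls) / max_order
    return 0.0 if p + r == 0 else (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(toy_chrf("the cat sat on a mat", "the cat sat on the mat"))
```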

chrF has repeatedly been shown to correlate better than BLEU with human judgments of MT quality, especially at the sentence level.

At Phrase, chrF remains our go-to metric for most use cases, and it was recently recommended by MT researchers at Microsoft as a solid complement to modern Transformer-based metrics.

TER

The translation error rate (or translation edit rate), published in 2006, is different from other metrics on this list in that it explicitly tries to estimate the amount of work required to turn the MT output into the reference translation.

Specifically, it quantifies the number of edit operations (insert, delete, substitute, shift) required to change the MT output into the reference translation. In principle, one could carry out these edit operations manually with a keyboard and a mouse, so the metric is sometimes interpreted as the required post-editing effort. Many MT quality estimation solutions also target TER for this reason.
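
As a rough illustration of the idea, the sketch below ignores the shift operation entirely and just divides the word-level edit distance (insertions, deletions, substitutions) by the reference length. Real TER additionally searches for block shifts of word sequences, which is exactly the part that makes it hard to compute.

```python
def simplified_ter(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by reference length (shifts are ignored)."""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over words.
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i
    for j in range(len(ref) + 1):
        dist[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[len(hyp)][len(ref)] / max(len(ref), 1)

# One substitution against a six-word reference -> TER of about 0.17.
print(simplified_ter("the cat sat on a mat", "the cat sat on the mat"))
```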

Conceptually, the definition of TER is quite simple, but in practice, computing the true minimum number of edits is not feasible (largely because of the shift operation), so implementations rely on heuristics and many edge cases must be handled. Phrase contributed an implementation of TER to the MT evaluation framework sacrebleu, and most of the work went into making the outputs match the original Java implementation.
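
For everyday use there is no need to implement any of this yourself; a minimal usage sketch of the sacrebleu Python API (assuming sacrebleu 2.x) looks roughly like this:

```python
# pip install sacrebleu
from sacrebleu.metrics import CHRF, TER

hypotheses = ["the cat sat on a mat"]
references = [["the cat sat on the mat"]]  # one inner list per set of references

print(CHRF().corpus_score(hypotheses, references))
print(TER().corpus_score(hypotheses, references))
```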

Meteor

Meteor is a complex metric that correlates well with human judgments. Originally released in 2005, development continued and brought significant changes in subsequent versions. Compared to the other metrics on our list, it is notable for one particular reason: it does not just compare the MT output and reference translation as strings. Instead, it uses stemming and even synonyms and paraphrase tables to try to capture the meaning of the translations. Meteor penalizes the MT output less if the mismatched portions are synonyms (or paraphrases) of the reference translation, as opposed to being completely unrelated.
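
Meteor's full matching and alignment pipeline is too involved to reproduce here, but the core idea of staged matching (exact, then stem, then synonym) can be sketched in a few lines. The stemmer and the synonym table below are toy placeholders, not the resources Meteor actually ships with.

```python
# Purely illustrative staged matching (exact -> stem -> synonym); not the real Meteor algorithm.
SYNONYMS = {"couch": {"sofa"}, "sofa": {"couch"}}  # toy synonym table

def toy_stem(word: str) -> str:
    """Crude suffix stripping as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def matched_fraction(hypothesis: str, reference: str) -> float:
    """Fraction of hypothesis words matched exactly, by stem, or by synonym."""
    hyp_words = hypothesis.lower().split()
    ref_words = reference.lower().split()
    ref_stems = {toy_stem(w) for w in ref_words}
    matched = sum(
        1 for w in hyp_words
        if w in ref_words or toy_stem(w) in ref_stems or SYNONYMS.get(w, set()) & set(ref_words)
    )
    return matched / len(hyp_words)

# "walks" matches "walked" via the stem, "couch" matches "sofa" via the synonym table.
print(matched_fraction("she walks to the couch", "she walked to the sofa"))
```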

While synonym- and paraphrase-aware matching was a very interesting property at the time, today metrics based on deep neural networks achieve the same effect in a more robust way.

BEER

Besides having an appealing name, BEER is a good example of a trainable metric. The other metrics in this article are largely unsupervised: they don't rely on labeled data (sets of MT outputs, reference translations and manually obtained quality scores) to adjust their various parameters.

BEER, on the other hand, uses machine learning and needs to be trained: it describes both the MT output and the reference translation(s) with a set of features, and it learns weights for these features from labeled data so as to maximize agreement with existing human rankings.
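
To give a flavour of what training on human rankings means, here is a toy pairwise-ranking sketch. The three hand-picked features and the perceptron-style update are my own illustration and have nothing to do with BEER's actual feature set or learner; the point is only that the weights are fitted so that the translation humans preferred ends up with the higher score.

```python
# Toy pairwise-ranking learner; the features and update rule are illustrative only.

def features(hypothesis: str, reference: str) -> list[float]:
    """Tiny feature vector: word precision, word recall, length ratio."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = len(set(hyp) & set(ref))
    return [overlap / max(len(hyp), 1),
            overlap / max(len(ref), 1),
            min(len(hyp), len(ref)) / max(len(hyp), len(ref), 1)]

def train(ranked_pairs, epochs: int = 20, lr: float = 0.1) -> list[float]:
    """ranked_pairs holds (better_hypothesis, worse_hypothesis, reference) triples."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for better, worse, ref in ranked_pairs:
            diff = [a - b for a, b in zip(features(better, ref), features(worse, ref))]
            # Nudge the weights whenever the worse hypothesis does not score strictly lower.
            if sum(w * d for w, d in zip(weights, diff)) <= 0:
                weights = [w + lr * d for w, d in zip(weights, diff)]
    return weights

pairs = [("the cat sat on the mat", "a cat sat on mat", "the cat sat on the mat")]
w = train(pairs)

def score(hypothesis: str, reference: str) -> float:
    return sum(wi * fi for wi, fi in zip(w, features(hypothesis, reference)))

print(score("the cat sat on the mat", "the cat sat on the mat"))  # preferred translation
print(score("a cat sat on mat", "the cat sat on the mat"))        # should score lower
```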

While BEER has shown very good performance in the WMT metrics task, a potential concern is that it may not generalize well to new settings. Indeed, as far as I know, BEER is not widely used commercially.

Conclusion

We could go into much more detail on each of these metrics, but the goal of this piece is to provide a high-level understanding of the best-known MT metrics. There are also many other notable metrics that we did not explore at all. Next time, I will describe some of the Transformer-based metrics which are all the rage now.
