MT Metrics Explained: Deeply Transformative

Aleš Tamchyna
7 min read · Jun 16, 2022


When we last visited the zoo of machine translation (MT) metrics, we promised to look at a different section, one filled with more complicated beasts that have evolved only recently: metrics based on deep neural networks, more specifically on the Transformer architecture.

Why even develop new metrics, though? The main reason is the rising overall quality of MT. Previous generations of MT struggled even with basic sentence structure and simple lexical ambiguity, so to some extent, just measuring the string overlap with a reference translation was “good enough”.

But as today’s MT quality is getting closer to human translation, there is a concern that relying on these simple metrics may mislead MT research. Especially when systems are high-quality and close in performance, relying on BLEU may in fact lead to selecting the worse system (see Section 6.1 in the paper). That’s not ideal when you’re an MT researcher trying to find out whether your new trick improves translation quality.

The metrics we will cover today are much more sophisticated and are able to capture meaning to some extent. Learning about them will challenge some of the assumptions we took for granted in the previous articles:

  1. Most metrics just measure string similarity in some way.
  2. Metrics require a reference translation to compare against.

The first is simply not true anymore. Deep neural networks don’t work directly with raw text. In the very first layer of the network, individual words are mapped into long vectors of real numbers (these are typically referred to as embeddings) and all subsequent computation uses this representation.

The second assumption is a bit more subtle. While all metrics can utilize reference translations, some of them can also operate in a “source-only” mode, where they evaluate how faithfully the MT output reflects the source sentence, and its fluency.

Spotlight: Multilingual embeddings

To help us better understand some of the metrics discussed today, it’s worth spending some time looking at the idea of semantic similarity and embeddings.

Coming back to the notion of embeddings, we can even go as far as to assign such a representation to whole sentences (or even paragraphs). Simply put, each sentence can be transformed into a point in a high-dimensional space (so it too can now be described as a long list of real numbers). The notion that a single vector (=embedding) could accurately capture all the subtleties of meaning famously bothered some researchers, but in practice it has proved to be an extremely successful approach.

Multilingual embeddings in action: the German and the English sentences are mapped close to each other. Dissimilar English sentences are far apart.

On a very high level, we like to say that the relative positions of sentence embeddings correspond to some notion of semantic “closeness” — sentences that say something similar will end up nearby.

And there’s more: if we train our neural network in a clever enough way, we can ensure that this mapping is multilingual. That is, similar sentences in different languages should be mapped to similar positions in the space. We get multilingual embeddings. (There are some doubts about how well this holds in practice though.)
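
If you'd like to see this in action, here is a minimal sketch using the sentence-transformers library; LaBSE is just one example of a multilingual encoder, not the only possible choice.

```python
# A minimal sketch of multilingual sentence embeddings, assuming the
# sentence-transformers package; LaBSE is just one example of a
# multilingual encoder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

sentences = [
    "Das Wetter ist heute schön.",       # German
    "The weather is nice today.",        # English, similar meaning
    "The invoice is due next Friday.",   # English, unrelated meaning
]

# Each sentence is mapped to a single vector (a point in a high-dimensional space).
embeddings = model.encode(sentences, convert_to_tensor=True)

# Similar sentences in different languages end up close to each other...
print(util.cos_sim(embeddings[0], embeddings[1]))
# ...while unrelated sentences end up far apart (lower cosine similarity).
print(util.cos_sim(embeddings[0], embeddings[2]))
```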

BERTScore

BERTScore is probably the oldest metric on our list, dating “all the way back” to 2019. The metric builds on BERT, a famous pre-trained Transformer model for language understanding. Essentially, BERTScore takes BERT embeddings of individual words from the MT output and the reference translation, aligns them, and calculates their overall similarity (using cosine similarity, a common measure of closeness between embeddings).

This metric only supports reference-based evaluation and does not require any training aside from the pre-training already done for BERT. (While the metric is called BERTScore, the authors use other pre-trained models as well, such as RoBERTa and XLNet.)
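
For illustration, this is roughly what scoring a sentence pair looks like with the authors' bert-score package; the example sentences are made up.

```python
# A minimal BERTScore sketch, assuming the bert-score package is installed.
from bert_score import score

candidates = ["The cat sat on the mat."]              # MT output
references = ["There is a cat sitting on the mat."]   # reference translation

# Returns precision, recall and F1 (one value per sentence pair).
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())
```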

BLEURT

With a name that’s an amalgam of BLEU and BERT, a throwback to the most notorious metric in the field, BLEURT also uses the pre-trained BERT model under the hood. Unlike BERTScore, the score is not simply the similarity between the embeddings. Instead, there is an additional fully-connected layer in the neural network that computes the final score from the embeddings, and this layer is trained to directly mimic existing human ratings of MT quality.

Thanks to this training, it shows a higher correlation with human judgments. However, there is a potential risk when using the metric on texts that are very different from its training data. Originally, the metric only supported English as the target language, but language support was later extended by switching to a multilingual model. The metric requires a reference translation.
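
As a rough illustration, here is how the reference implementation from Google Research can be used; the checkpoint name below is only an example and has to be downloaded separately.

```python
# A minimal BLEURT sketch, assuming the bleurt package
# (github.com/google-research/bleurt) and a downloaded checkpoint;
# "BLEURT-20" here stands for the local path to that checkpoint.
from bleurt import score

scorer = score.BleurtScorer("BLEURT-20")
scores = scorer.score(
    references=["There is a cat sitting on the mat."],
    candidates=["The cat sat on the mat."],
)
print(scores)  # one score per sentence pair; higher is better
```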

COMET

While the previous two metrics are interesting and innovative, they haven’t really been adopted by the MT research community. The story is very different for COMET, though: Microsoft Research announced switching to COMET as their primary MT metric after extensive evaluation of its performance.

In some ways, COMET is similar to the previous two metrics: it uses a pre-trained multilingual model and is trained to mimic human ratings of translations. Aside from various differences in the network architecture, its main distinctive quality is that it uses not just the reference translation but also the source sentence, and its authors demonstrate how this benefits the metric’s accuracy.

But possibly the biggest strength of COMET is its flexibility: several variants of COMET have been released, targeting different methods of human evaluation (direct assessment, HTER, MQM). In addition, COMET can be used in “source-only” mode, essentially turning into an MT quality estimation system with very good results.
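
To give a feel for what this looks like in practice, here is a minimal sketch using the Unbabel comet package; "wmt20-comet-da" is one of the released checkpoints and serves only as an example (the "-qe-" variants work without a reference).

```python
# A minimal COMET sketch, assuming the unbabel-comet package is installed.
from comet import download_model, load_from_checkpoint

model_path = download_model("wmt20-comet-da")  # example checkpoint name
model = load_from_checkpoint(model_path)

data = [{
    "src": "Das Wetter ist heute schön.",      # source sentence
    "mt":  "The weather is nice today.",       # MT output
    "ref": "Today the weather is beautiful.",  # reference translation
}]

# Returns segment-level scores plus a system-level average.
output = model.predict(data, batch_size=8, gpus=0)
print(output)
```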

COMET is still very new, and as a trained, black-box system it has yet to win the trust of the research community. Researchers are poking it from various angles and have already managed to find a few weak spots. Also, in general, while the landscape of MT metrics has always been diverse and dynamic, MT researchers typically adopt new metrics only very reluctantly.

That being said, COMET has a lot of momentum now and it may well become one of the standard MT metrics going forward.

PRISM

PRISM is a bit different from the other metrics in this article. While it is based on the same deep neural network architecture, the Transformer, it does not use a pre-trained model such as BERT.

Instead, PRISM is actually a multilingual machine translation system under the hood. As such, the system is (in theory) able to translate any of the supported source languages into any supported target language.

But how does that help us with MT evaluation? The authors propose a clever trick: we can view the MT output as a paraphrase of the reference translation. And “paraphrasing is really the same task as translation”, except that the language stays the same.

So when we have an MT system, we can ask it something like: “Here is a source sentence [=reference] in language X and this is a target sentence [=MT output], also in language X. If you were to produce this translation [or paraphrase], how confident would you be about it?”

The technical term for “if you were to produce this translation, how confident would you be about it” is forced decoding, and it’s just a way to extract the probability of a translation from an existing MT system. In fact, PRISM stands for “probability is the metric”.

And since we can provide inputs in any language, we can also replace the reference translation with the input sentence (in the source language); that way, we get a quality estimation system — no reference is needed, similarly to COMET-source.
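
The forced-decoding trick itself is easy to sketch with any sequence-to-sequence model. The example below does not use the actual PRISM model; it scores the MT output against the source sentence with a generic Hugging Face translation model (Helsinki-NLP/opus-mt-de-en, chosen purely for illustration).

```python
# A sketch of "probability is the metric" via forced decoding, assuming the
# transformers and torch packages. This is NOT the PRISM model itself; any
# seq2seq MT model can illustrate the idea.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-de-en"  # example model, not PRISM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

src = "Das Wetter ist heute schön."   # source sentence
mt = "The weather is nice today."     # MT output to be scored

inputs = tokenizer(src, return_tensors="pt")
labels = tokenizer(text_target=mt, return_tensors="pt").input_ids

with torch.no_grad():
    # Forced decoding: nothing is generated, we only ask how probable the
    # given target sequence is. The loss is the average negative
    # log-likelihood per target token.
    loss = model(**inputs, labels=labels).loss

print(-loss.item())  # average log-probability per token; closer to 0 = more confident
```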

A note on quality estimation

Reference-free metrics such as COMET-source and PRISM-source are strong quality estimation systems, showing great performance in the WMT Quality Estimation shared task.

But when we evaluated them in Phrase, we found that they don’t really work for our customers’ data. This plot shows it quite clearly:

Phrase MTQE compared to COMET-source and PRISM-source on customer data.

Each column is the Spearman correlation between the QE prediction (coming from either Phrase MTQE, COMET, or PRISM) and the final chrF3 score of the translation. This evaluation was done on the document level and uses an anonymized sample of Phrase customer data. Both COMET and PRISM show much worse performance than our MTQE in all the tested language pairs.
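
For the curious, the comparison behind the plot boils down to something like the sketch below (assuming the sacrebleu and scipy packages); the actual customer data and the document-level grouping are of course not shown.

```python
# A rough sketch of the evaluation behind the plot, assuming sacrebleu and
# scipy. The data here are made up; the real evaluation uses anonymized
# customer documents.
from sacrebleu.metrics import CHRF
from scipy.stats import spearmanr

chrf3 = CHRF(beta=3)  # chrF with beta=3, i.e. chrF3

mt_outputs = [
    "The weather is nice today.",
    "The invoice is due Friday.",
    "Please, restart computer now.",
]
post_edits = [
    "Today the weather is nice.",
    "The invoice is due on Friday.",
    "Please restart your computer now.",
]
qe_predictions = [0.82, 0.45, 0.61]  # whatever the QE system predicted

# chrF3 between each MT output and the final (post-edited) translation.
chrf_scores = [
    chrf3.sentence_score(mt, [pe]).score
    for mt, pe in zip(mt_outputs, post_edits)
]

# Spearman correlation between the QE predictions and the chrF3 scores.
print(spearmanr(qe_predictions, chrf_scores).correlation)
```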

Our working hypothesis is that the data domain is simply too different from what these metrics were trained on. Similarly to COMET, our in-house MTQE is built on pre-trained multilingual models. However, it is explicitly trained on our customers’ data to directly predict the amount of post-editing (measured as chrF3 between the MT output and the final translation). As such, it doesn’t suffer from the domain mismatch (and task mismatch) and so can produce much more accurate predictions.

Leaving the zoo

By now, you have learned about all the well-known animals of the MT metric zoo and it’s about time to go home. And with that, our take-home message could be something like this:

Traditional metrics are still important. Knowing their shortcomings, you should still use BLEU and chrF, and maybe (H)TER. Take their outputs with a grain of salt, but always look at them.

Novel metrics based on deep neural networks are probably here to stay. If we were to make a bet, COMET seems like the most promising new standard for MT evaluation at the moment. It has its downsides, but its flexibility and its ability to accurately evaluate high-quality MT systems are very promising qualities.

Finally, when dealing with trained metrics, don’t underestimate the importance of in-domain training data. If you’re a Phrase user, your best bet for keeping MT quality in check, especially in scenarios without post-editing, is our in-house MTQE.
