Metrics for evaluation of translation accuracy

Guzal Bulatova
Trusted Data Science @ Haleon
7 min read · Apr 23, 2024

In this article we give an overview of translation accuracy evaluation and compare the performance of three metrics, BLEU, ROUGE-L and METEOR, against human evaluation. Our findings indicate that METEOR aligns most closely with human assessment, while the widely adopted BLEU metric falls short in comparison.

Background and use case

At Haleon, the demand for accurate translations is very high. As a multinational company, we operate around the globe and require translations to be done quickly and at scale—for packaging information, website content, quality and regulatory documents, consumer queries and more.

Most of the information we translate is about our products, which means we work with a very specific type of text (e.g., a package leaflet) and a specific choice of words. In most cases we require our translations to be accurate and not deviate much from the original: you would not want to paraphrase Theraflu instructions too liberally. We also need the translations to capture the nuances of the context, because the target audience ranges from healthcare professionals, who use medical terms like dentinal tubules, to consumers, who do not need that kind of terminology and would prefer a more accessible term like dentin instead.

This is why we are developing an internal automated language translation tool. We’re basing it on a large language model (LLM), because LLMs satisfy our requirements for accuracy and flexibility better than machine translation (MT) models.

As for any other machine learning (ML) task, the choice of metrics is crucial. For translation, an accuracy metric is needed at multiple stages of project development:

  • Model selection;
  • Evaluation;
  • Model tracking in production.

We have assessed some of the most popular translation accuracy evaluation metrics and chosen the one that suits our purposes best. Below we summarise the “bake-off” we did for this article.

Accuracy for translation task

The way we measure accuracy depends on the ML task. For example, in binary classification accuracy is the proportion of correct predictions (both true positives and true negatives) among the total number of cases examined.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy in classification

To calculate that, we need to know what counts as a correct prediction. And here lies the main reason why we cannot simply lift and shift accuracy estimation approaches that work well for other tasks: there are typically multiple ways to communicate the same meaning. Moreover, there might be no single “right” translation at all, for example for very culturally specific phenomena, where a translator typically has to simply explain the source-language text.

Let’s examine the following examples, which put us in a pickle over how to evaluate accuracy and how to penalise excessive or insufficient wording.

Examples of cases where the difference in one word affects the accuracy in different ways: preserving the meaning and changing it completely.
Examples of caveats in translation accuracy evaluation

Typically, for automatic translation evaluation we have the input text and a human reference text: a human-validated translation.

In both examples above the reference (expected) translation is on the right and the model (actual) translations are on the left. The difference is a single word, but while “like” and “love” are synonyms that communicate the same meaning with different intensity, replacing “often” with “not” in the second example changes the meaning of the sentence entirely. So in the second case we would like to penalise the one-word inaccuracy more heavily than in the first.

Units and metrics

To compare word sequences we use n-grams: an n-gram is a sequence of n adjacent words (or tokens) in a particular order. For example, one word is a unigram, two words form a bigram, and so on:

Picture of a sentence with highlighted parts corresponding to unigram (one word), bigram (two words) and n-gram (3 words and more).
Example of n-grams in a sentence
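
As a minimal sketch, this is how n-grams can be extracted from a tokenised sentence in plain Python (the ngrams helper below is our own, not a library function):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "It is warm outside".lower().split()
print(ngrams(tokens, 1))  # unigrams: ('it',), ('is',), ('warm',), ('outside',)
print(ngrams(tokens, 2))  # bigrams: ('it', 'is'), ('is', 'warm'), ('warm', 'outside')
```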

We can use them to calculate the overlap of n-grams between the actual and the reference translation and obtain a score: ROUGE-1, where 1 stands for the size of the n-gram we used as the unit of comparison.

Let’s take the following example:

Reference (human): It is warm outside.
Output translation: It is very warm outside.

After breaking the sentences down into unigrams
4 unigrams in the reference: it, is, warm, outside;
5 unigrams in the output: it, is, very, warm, outside.

we can calculate precision, recall and F1-score (the ROUGE-1):

With 4 overlapping unigrams (it, is, warm, outside):
Recall = 4 / 4 = 1.0
Precision = 4 / 5 = 0.8
F1 = 2 × (1.0 × 0.8) / (1.0 + 0.8) ≈ 0.89
Recall, Precision and F1-score for the example sentences
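
To make the calculation concrete, here is a minimal sketch of a ROUGE-n style overlap score in plain Python (it redefines the ngrams helper from above so the snippet is self-contained):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_scores(reference, candidate, n=1):
    """Recall, precision and F1 over overlapping n-grams (ROUGE-n style)."""
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    # Clipped overlap: an n-gram counts at most as often as it appears on both sides.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / sum(ref_counts.values())
    precision = overlap / sum(cand_counts.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

print(ngram_scores("It is warm outside", "It is very warm outside", n=1))
# (1.0, 0.8, 0.888...)  -> the 0.89 ROUGE-1 score above
```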

As you can already guess, this metric doesn’t consider word order and doesn’t account for meaning. For example, we’d get the exact same score, 0.89, for the following pair:

Reference (human): It is warm outside.
Output translation: It is not warm outside.

To account for word order we can use larger n-grams, for example bigrams, with which we can calculate ROUGE-2 for our second example:

3 bigrams in the reference: it is, is warm, warm outside;
4 bigrams in the output: it is, is not, not warm, warm outside.

With 2 overlapping bigrams (it is, warm outside):
Recall = 2 / 3 ≈ 0.67
Precision = 2 / 4 = 0.5
F1 = 2 × (0.67 × 0.5) / (0.67 + 0.5) ≈ 0.57
Recall, Precision and F1-score for the example sentences with bigrams
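
In practice we would not compute these by hand. As an assumption-labelled sketch, the open-source rouge-score package (pip install rouge-score) covers ROUGE-1, ROUGE-2 and ROUGE-L, and a call like the following should reproduce the numbers above:

```python
# Sketch assuming the rouge-score package is installed (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
scores = scorer.score(
    target="It is warm outside",          # human reference
    prediction="It is not warm outside",  # model output
)
for name, s in scores.items():
    print(name, round(s.recall, 2), round(s.precision, 2), round(s.fmeasure, 2))
```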

With this metric we’re still not covering the meaning of the words. To account for that, metrics like METEOR use stemming and synonym matching alongside standard exact word matching. Metrics like BLEU can instead use a set of good-quality reference translations, rather than a single reference, to cover different phrasings and synonyms.
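
For completeness, here is a rough sketch of computing both metrics on a single sentence pair with NLTK (assuming NLTK is installed and its WordNet data, which METEOR needs for synonym matching, has been downloaded):

```python
# Sketch assuming nltk is installed; METEOR needs nltk.download("wordnet").
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

reference = "It is warm outside".lower().split()
candidate = "It is very warm outside".lower().split()

# BLEU accepts a list of references; smoothing avoids zero scores on short
# sentences where higher-order n-grams have no matches.
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)

# METEOR also takes a list of tokenised references and a tokenised candidate.
meteor = meteor_score([reference], candidate)

print(f"BLEU: {bleu:.2f}, METEOR: {meteor:.2f}")
```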

Data and approach

For our metrics comparison we used a set of internal data: texts about Haleon products (e.g., toothpaste, nasal spray) or packaging information. Each sample consists of:

  • a sentence-long source text,
  • a human reference,
  • a model translation,
  • a native speaker score (1 to 5).

The dataset consists of 306 such samples. To avoid bias, we selected samples across 7 language pairs, and the native speaker evaluation was completed by 8 different people.
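
For illustration only, a single evaluation sample could be represented roughly like this (field names and values are hypothetical, not our actual schema):

```python
# Hypothetical representation of one evaluation sample.
sample = {
    "source_text": "Do not exceed the stated dose.",                     # sentence-long source
    "human_reference": "Die angegebene Dosis nicht überschreiten.",      # human-validated translation
    "model_translation": "Überschreiten Sie nicht die angegebene Dosis.",# LLM output being evaluated
    "native_score": 5,                                                   # native speaker rating, 1 to 5
    "language_pair": "en-de",                                            # hypothetical metadata field
}
```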

Metrics comparison

The metrics we chose for this exercise are

  1. BLEU (BiLingual Evaluation Understudy),
  2. ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation — Longest Common Subsequence) and
  3. METEOR (Metric for Evaluation of Translation with Explicit ORdering).

We are comparing them against a native speaker score. We believe the closer our metric is to a native speaker evaluation, the more accurate it is.

On a correlation heatmap we can see that the metrics correlate more strongly with each other than with the native speaker score, and at this level all three of them look quite similar:

Correlation with the native speaker score: BLEU 0.38, ROUGE-L 0.43, METEOR 0.43
Correlation heatmap for the three metrics vs the native speaker score
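
A sketch of how such a heatmap can be produced with pandas and seaborn, assuming a DataFrame with one row per sample and hypothetical column names:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file with one row per sample: a native rating plus per-metric scores.
df = pd.read_csv("evaluation_results.csv")  # columns: native_score, bleu, rouge_l, meteor

# Pairwise (Pearson) correlation between the native score and the three metrics.
corr = df[["native_score", "bleu", "rouge_l", "meteor"]].corr()

sns.heatmap(corr, annot=True, vmin=0, vmax=1, cmap="Blues")
plt.title("Correlation between automatic metrics and native speaker score")
plt.tight_layout()
plt.show()
```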

However, if we look at the error distribution, we can see how the native speaker score is shifted towards the right and how the METEOR score distribution is much closer to the native speaker score distribution.

Visualisation of error distribution for the four scores. METEOR and ROUGE resemble Native score with right shift (more values around 0.8), while BLEU seems to have heavy left shift (more values are around 0.2).
Violin plots of the error distribution for the three metrics and the native speaker score

Here we can clearly observe that the metrics ranking would look like this:

  1. METEOR
  2. ROUGE-L (close second)
  3. BLEU

The BLEU metric would perform better if we could provide a set of human reference translations. This is unfeasible in our setup, and in general obtaining just one human reference per source sample is not an easy feat, so we lean towards the metrics that work well with one “correct” translation.
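
For context, multi-reference BLEU with NLTK looks roughly like this (a sketch with made-up alternative references, not data from our evaluation set):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical alternative phrasings of the same reference sentence.
references = [
    "It is warm outside".lower().split(),
    "It is quite warm outdoors".lower().split(),
]
candidate = "It is very warm outside".lower().split()

# Each n-gram in the candidate can be matched against any of the references,
# so rephrasings and synonyms covered by at least one reference are not penalised.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 2))
```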

In the examples below we see that both BLEU and METEOR capture the accuracy of the first output. However, because BLEU focuses on precision, it scores the second case significantly higher than the third. In the second translation the model drops some words from the output, and in the third BLEU penalises the model’s word choice. Yet the words that differ are synonyms (“very” and “quite”, “say” and “tell”), so the meaning stays close to the reference. In general both variants are quite close in meaning to the reference, which is better reflected in the METEOR score.

Model translated (candidate) sentences and their corresponding scores with highlighted difference between the candidates and the human reference.
Example of candidate sentences and their corresponding METEOR and BLEU scores.

Conclusion

A significant limitation is that all the metrics above require a human reference to compare the model output against, and such references are not readily available, especially for model monitoring purposes. One solution is to keep a set of samples with human references and use them as “sanity checks” with every model update, to ensure that quality hasn’t degraded. Another solution is to prompt an LLM to evaluate the translation directly, without a reference.
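
A minimal sketch of what such a reference-free prompt could look like (the wording and fields are illustrative, not our production prompt):

```python
# Illustrative prompt template for LLM-based, reference-free translation evaluation.
EVALUATION_PROMPT = """You are a professional {source_lang}-to-{target_lang} translator.
Rate the following translation on a scale of 1 (unusable) to 5 (perfect),
considering accuracy, terminology and fluency.
Reply with the score and a one-sentence justification.

Source text: {source_text}
Translation: {translation}
"""

prompt = EVALUATION_PROMPT.format(
    source_lang="English",
    target_lang="German",
    source_text="Do not exceed the stated dose.",
    translation="Die angegebene Dosis nicht überschreiten.",
)
```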

For our use case, we selected the METEOR score due to its closeness to the native speaker evaluation. We are using it for model selection (out-of-the-box model evaluation) and validation of prompt engineering results. METEOR is also our go-to in automatic quality assurance tests to ensure that changes in the code base don’t have a negative effect on the accuracy of translations.

To monitor model performance live in production, we’ve implemented a rating mechanism that allows users to rate and provide feedback on the generated translations. Although we found the METEOR score to align most closely with human assessment, the native speaker score remains the most accurate measure of translation quality, and that’s what we stick to where possible.
