Evaluating Machine Translation Models with T5 and Marian

Harshada Jivane
8 min read · Aug 26, 2023


Machine translation is the task of automatically translating text from one language to another. With recent advancements in natural language processing (NLP), machine translation models have become increasingly accurate and efficient. In this blog post, we will explore how to evaluate the quality of machine translation models using two popular models: T5 and Marian.

T5

T5 (Text-to-Text Transfer Transformer) is a transformer-based model that can be fine-tuned for a variety of NLP tasks, including machine translation, text summarization, and question answering. T5 is unique in that it can be trained on a wide range of tasks using a single architecture and a unified text-to-text format. This makes it a versatile model that can be used for a variety of NLP tasks.
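To make the text-to-text format concrete, here is a small sketch of how different tasks are all expressed as plain input strings. The translation and summarization prefixes are the ones used by the pre-trained T5 checkpoints; treat the question-answering format as an assumption if you are using a different checkpoint:

# Every T5 task is "text in, text out"; a task prefix tells the model what to do.
t5_task_inputs = [
    "translate English to French: I like to eat pizza",    # machine translation
    "summarize: Machine translation is the task of ...",   # summarization
    "question: What is machine translation? context: ...", # question answering (format assumed)
]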

T5 is recommended when you need a model that can be fine-tuned for multiple NLP tasks. It is also useful when you have limited training data, since T5 is pre-trained on large amounts of data and can then be fine-tuned on smaller, task-specific datasets.

Image from: Jay Alammar’s blog

Marian

Marian is a sequence-to-sequence model that is specifically designed for machine translation. It uses an encoder-decoder architecture with attention mechanisms to generate translations from one language to another. Marian is unique in that it can be trained on multiple languages simultaneously, making it a versatile model for multilingual translation tasks.

Marian is recommended when you need a model that is purpose-built for machine translation. It is also useful when you need to translate between multiple languages, as Marian can be trained on multiple languages simultaneously.
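For multilingual translation, the Helsinki-NLP collection on the Hugging Face Hub includes multilingual Marian checkpoints. A minimal sketch, assuming the Helsinki-NLP/opus-mt-en-ROMANCE checkpoint, which selects the target language with a >>xx<< token at the start of the input:

from transformers import MarianTokenizer, MarianMTModel

# One model covers several Romance target languages; the >>fr<< token picks French.
tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-ROMANCE')
model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-en-ROMANCE')
inputs = tokenizer(">>fr<< I like to eat pizza", return_tensors='pt')
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))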

In short: T5 is recommended for multi-task learning and settings with limited training data, while Marian is recommended for dedicated machine translation and multilingual translation tasks. (For comparison, BERT, another popular transformer model, is typically recommended for text classification tasks with large amounts of training data.)

Setting up the Environment

Before we can evaluate machine translation models, we need to set up our environment. We will be using the transformers library from Hugging Face, which provides pre-trained models for a variety of NLP tasks.

T5Tokenizer requires the SentencePiece library because it uses the SentencePiece algorithm for subword tokenization. Subword tokenization is a technique used in NLP to break words down into smaller units called subwords. This is useful for handling out-of-vocabulary words and for reducing the vocabulary size, which can improve the efficiency of NLP models. The SentencePiece library provides an implementation of this algorithm; T5Tokenizer uses it to tokenize input text into subwords, which are then used as input to the T5 model.

We will also be using the nltk library for calculating evaluation metrics such as BLEU and METEOR.

To get started, we need to install the required libraries:

!pip install transformers sentencepiece nltk 

Next, we need to import the necessary modules:

from transformers import T5Tokenizer, T5ForConditionalGeneration, MarianTokenizer, MarianMTModel
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate import meteor_score
import time
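Before moving on, here is a quick illustration of the subword behavior described above: an out-of-vocabulary word is split into known pieces rather than failing. The exact split depends on the vocabulary, so treat the output shown in the comment as illustrative:

# An invented word is not in the vocabulary, so it is broken into subword pieces.
demo_tokenizer = T5Tokenizer.from_pretrained('t5-small')
print(demo_tokenizer.tokenize("pizzalicious"))
# e.g. ['▁pizza', 'lic', 'ious'] -- illustrative; the real split depends on the vocabulary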

Evaluating Machine Translation Models

We will evaluate two machine translation models: T5, the general-purpose text-to-text transformer described above, and Marian, the sequence-to-sequence model designed specifically for machine translation.

T5 Model

Let’s start by evaluating the T5 model. We will use the T5Tokenizer and T5ForConditionalGeneration classes from the transformers library to load the pre-trained T5 model:

t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-small')

Next, we need to define the input text that we want to translate:

input_text = "translate English to French: I like to eat pizza"

We can then encode the input text using the T5 tokenizer:

input_ids = t5_tokenizer.encode(input_text, return_tensors='pt')

This call performs three steps (a more explicit equivalent is sketched after the list):

  1. The input text is tokenized into a sequence of subwords using the SentencePiece algorithm, breaking it into smaller units that the T5 model can process efficiently.
  2. The subwords are converted into token IDs using the tokenizer's vocabulary; each subword maps to a unique token ID in the T5 model's vocabulary.
  3. The resulting sequence of token IDs is returned as a PyTorch tensor, which can be used as input to the T5 model.
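For clarity, here is a sketch of roughly what encode() does under the hood, written out step by step (the exact IDs depend on the vocabulary):

import torch

# Step 1: split the text into subword pieces.
tokens = t5_tokenizer.tokenize(input_text)
# Step 2: look each piece up in the vocabulary.
ids = t5_tokenizer.convert_tokens_to_ids(tokens)
# Step 3: append the end-of-sequence token and wrap everything in a batch-of-one
# tensor, which is roughly what encode(..., return_tensors='pt') returns.
manual_input_ids = torch.tensor([ids + [t5_tokenizer.eos_token_id]])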

The exact subwords generated by the T5 tokenizer depend on the specific tokenizer used and the vocabulary it was trained on. As an illustration, the subwords for the phrase "I like to eat pizza" produced by the default t5-small tokenizer might look something like this:

['▁I', '▁like', '▁to', '▁eat', '▁p', 'iz', 'za', '</s>']

Here the common words each map to a single subword, while the rarer word "pizza" is split into the pieces "▁p", "iz", and "za". The "</s>" token marks the end of the sequence. (Padding tokens such as "<pad>" are only added when padding is explicitly requested, for example when batching sequences of different lengths.)

Note that some of the subwords are prefixed with "▁", which indicates that they are the first subword of a word. This is a convention used by the SentencePiece algorithm to distinguish word-initial subwords from word-internal ones.

If the input text were simply "I like to eat pizza", the PyTorch tensor returned by

t5_tokenizer.encode(input_text, return_tensors='pt')

might look like this:

tensor([[ 216,  851,   19,  329, 1247, 1230,  358,    1]])

In this example, the tensor has a shape of (1, 8), indicating that there is one input text with a sequence length of 8. Each integer in the tensor is the token ID of one subword: the first ID (216) would correspond to "▁I", the second (851) to "▁like", and so on, ending with the end-of-sequence token "</s>" (whose actual ID in the T5 vocabulary is 1). The other IDs shown here are illustrative; the real values depend on the tokenizer's vocabulary. Note that the T5 tokenizer appends an end-of-sequence token but does not prepend a start-of-sequence token.
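If you want to check which subword each ID maps to in your own environment, the tokenizer can map the IDs back:

# Map each ID in the encoded tensor back to its subword piece.
print(t5_tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))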

T5 is a transformer-based model that is implemented in PyTorch. PyTorch is a popular deep learning framework that provides efficient tensor operations and automatic differentiation, making it well-suited for training and inference of deep learning models.

When we pass the encoded sequence as a PyTorch tensor to a T5 model, we can take advantage of PyTorch’s efficient tensor operations to perform inference on the model. This can be faster and more memory-efficient than using a Python list or other data structure.
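One practical benefit is batching: several inputs can be padded into a single rectangular tensor and translated in one forward pass. A minimal sketch (the second sentence is just an illustrative addition):

# Calling the tokenizer on a list with padding=True returns one rectangular
# tensor covering the whole batch, padding the shorter sequences.
batch = t5_tokenizer(
    ["translate English to French: I like to eat pizza",
     "translate English to French: The weather is nice today"],
    padding=True, return_tensors='pt')
print(batch.input_ids.shape)  # (2, length_of_the_longest_sequence_in_the_batch)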

A flowchart that summarizes the process of encoding text using a tokenizer and passing it through a machine translation model

To generate the output text, we can use the generate() method of the T5 model:

start_time = time.time()
outputs = t5_model.generate(input_ids)  # generate the translated token IDs
output_text = t5_tokenizer.decode(outputs[0], skip_special_tokens=True)  # IDs -> text
end_time = time.time()
model_time = end_time - start_time  # wall-clock time for generation and decoding

Finally, we can calculate the METEOR score for the translation using the meteor_score() function from the nltk library:

reference_text = "J'aime manger des pizzas"
# Recent versions of nltk expect pre-tokenized input (lists of tokens), and METEOR
# needs the WordNet data the first time it runs: nltk.download('wordnet')
meteor_score_t5 = meteor_score.meteor_score([reference_text.split()], output_text.split())

The METEOR score ranges from 0 to 1, with higher scores indicating better translation quality.

Marian Model

Next, let’s evaluate the Marian model. We will use the MarianTokenizer and MarianMTModel classes from the transformers library to load the pre-trained Marian model:

marian_tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-fr')
marian_model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-en-fr')

We can then define the input text and encode it using the Marian tokenizer:

input_text = "I like to eat pizza"
input_ids = marian_tokenizer(input_text, return_tensors='pt').input_ids

To generate the output text, we can use the generate() method of the Marian model:

start_time = time.time()
outputs = marian_model.generate(input_ids)  # input_ids is already a tensor, so pass it directly
output_text = marian_tokenizer.decode(outputs[0], skip_special_tokens=True)
end_time = time.time()
model_time = end_time - start_time

The generated token IDs are then decoded into human-readable text using the decode() method. Finally, we can calculate the METEOR score for the translation using the meteor_score() function from the nltk library:

reference_text = "J'aime manger des pizzas"
meteor_score_marian = meteor_score.meteor_score([reference_text.split()], output_text.split())

As before, the METEOR score ranges from 0 to 1, with higher scores indicating better translation quality. Putting everything together, here is the complete evaluation script:

from transformers import T5Tokenizer, T5ForConditionalGeneration, MarianTokenizer, MarianMTModel
from nltk.translate.meteor_score import meteor_score
import nltk
import time


nltk.download('wordnet')  # METEOR needs WordNet; some nltk versions also need nltk.download('omw-1.4')
# Load the T5 and Marian models and tokenizers
t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-small')
marian_tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-fr')
marian_model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-en-fr')

# Define the input text
input_text_t5 = "translate English to French: I like to eat pizza"
input_text_marian = "I like to eat pizza"

# Encode the input text using both tokenizers
t5_input_ids = t5_tokenizer.encode(input_text_t5, return_tensors='pt')

marian_input_ids = marian_tokenizer(input_text_marian, return_tensors='pt').input_ids

# Generate the output text using both models
start_time = time.time()
t5_outputs = t5_model.generate(t5_input_ids)
t5_output_text = t5_tokenizer.decode(t5_outputs[0], skip_special_tokens=True)
end_time = time.time()
t5_model_time = end_time - start_time

start_time = time.time()
marian_outputs = marian_model.generate(input_ids=marian_input_ids)
marian_output_text = marian_tokenizer.decode(marian_outputs[0], skip_special_tokens=True)
end_time = time.time()
marian_model_time = end_time - start_time

# Calculate the METEOR score for each translation (nltk expects pre-tokenized input)
reference_text = "J'aime manger des pizzas"
t5_meteor_score = meteor_score([reference_text.split()], t5_output_text.split())
marian_meteor_score = meteor_score([reference_text.split()], marian_output_text.split())

# Print the output texts and METEOR scores
print("T5 output: ", t5_output_text)
print("Marian output: ", marian_output_text)
print("T5 METEOR score: {:.2f}".format(t5_meteor_score))
print("Marian METEOR score: {:.2f}".format(marian_meteor_score))
print("T5 model time: {:.2f} seconds".format(t5_model_time))
print("Marian model time: {:.2f} seconds".format(marian_model_time))

Results

The reference_text was taken from Google Translate, since its translations are reviewed by contributors.

After evaluating both models on our translation task, we obtained the following results:

T5 output:  Je veux manger la pizza
Marian output: J'aime manger de la pizza.
T5 METEOR score: 0.24
Marian METEOR score: 0.72
T5 model time: 0.12 seconds
Marian model time: 0.23 seconds

As we can see, both models generated a reasonable translation of the input text. Marian achieved a substantially higher METEOR score than T5, indicating that it produced the more accurate translation, although it took roughly twice as long as T5 to generate its output.

Conclusion

In this post, we explored how to evaluate machine translation models using two popular models: T5 and Marian.

BLEU (Bilingual Evaluation Understudy) and METEOR (Metric for Evaluation of Translation with Explicit ORdering) are both metrics used to evaluate the quality of machine translation output.

BLEU measures the similarity between the machine-generated translation and one or more reference translations. It does this by comparing the n-gram overlap between the machine-generated translation and the reference translations. The BLEU score ranges from 0 to 1, with a higher score indicating a better translation.
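As an illustration, here is how a sentence-level BLEU score could be computed for the Marian output above, using the sentence_bleu function we imported earlier. Smoothing is applied because a single short sentence often has no higher-order n-gram overlap at all:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "J'aime manger des pizzas".split()
hypothesis = "J'aime manger de la pizza.".split()
# Without smoothing, a missing n-gram order would drive the score to zero.
bleu = sentence_bleu([reference], hypothesis,
                     smoothing_function=SmoothingFunction().method1)
print("BLEU: {:.2f}".format(bleu))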

METEOR, on the other hand, is a metric that takes into account not only n-gram overlap, but also other factors such as synonymy, paraphrasing, and word order. It uses a combination of precision, recall, and alignment-based measures to compute a score that ranges from 0 to 1, with a higher score indicating a better translation.
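A small sketch of what this buys you: METEOR can credit a synonym through WordNet where a pure n-gram metric would not. (The example is in English, since WordNet covers English.)

from nltk.translate.meteor_score import meteor_score
import nltk

nltk.download('wordnet')  # METEOR's synonym matching relies on WordNet

reference = "the cat is big".split()
hypothesis = "the cat is large".split()
# "large" is matched to "big" through WordNet synonyms, so the score stays high.
print("METEOR: {:.2f}".format(meteor_score([reference], hypothesis)))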

In general, BLEU is a simpler metric than METEOR and is often used as a quick and easy way to evaluate machine translation output. However, it has some limitations, such as being sensitive to word order and not taking into account synonyms or paraphrases.

In this walkthrough, we used the METEOR score to measure the quality of the translations generated by both models.

Evaluating machine translation models is an important step in building high-quality NLP applications. By using pre-trained models and evaluation metrics, we can quickly and easily assess the performance of different models and choose the best one for our specific use case.
