Machine Translation Project by EMN

Elsa
Dec 21, 2023

GDSC Primorska — ML/AI Workshop
Elsa Morina, Nikola Murgovski, Miha Rupar

Introduction

In this workshop, after an introduction to Kaggle and Hugging Face and some hands-on exploration of Large Language Models, we chose Machine Translation as our topic.

Machine Translation is the automated translation of text from one language to another by a computer. Its goal is to help people communicate across languages and break down language barriers.

Dataset

We decided to use a dataset of English text paired with French translations. We found this dataset on Kaggle and imported it into our notebook. After installing the following packages,

!pip install pandas
!pip install transformers
!pip install sacremoses

we imported the pandas library and used it to load the dataset into our environment.

import pandas as pd
df = pd.read_csv("/kaggle/input/en-fr-translation-dataset/en-fr.csv")

Here is an image of the dataset for reference.

As you can see, the dataset has 22,520,376 rows. We wanted to work with all of them, but we were limited to the first 1050 rows. We also moved the model to the GPU for faster translation.

df = pd.read_csv('/kaggle/input/en-fr-translation-dataset/en-fr.csv', nrows=1050)
from transformers import MarianMTModel, MarianTokenizer
import torch
import time

# Load the English-French translation model
model_fr = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
tokenizer_fr = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

# Move the model to GPU if available
if torch.cuda.is_available():
    model_fr = model_fr.to('cuda')

# Assuming df is your English-French DataFrame
# Add a new column for translations
df['French_Translation'] = ''

# Function to batch translate English sentences to French
def batch_translate_to_french(sentences):
    tokens_fr = tokenizer_fr(sentences.tolist(), return_tensors="pt", padding=True, truncation=True)

    # Move tokens to GPU if available
    if torch.cuda.is_available():
        tokens_fr = {k: v.to('cuda') for k, v in tokens_fr.items()}

    translated_fr = model_fr.generate(**tokens_fr)
    return tokenizer_fr.batch_decode(translated_fr, skip_special_tokens=True)

# Measure the execution time
start_time = time.time()

# Batch size for translation
batch_size = 32

# Iterate through the DataFrame in batches and translate the 'en' column
for i in range(0, len(df), batch_size):
    # .loc slices by label and includes the end label, hence the -1
    batch_df = df.loc[i:i+batch_size-1]
    english_sentences = batch_df['en']

    # Translate the batch and update the DataFrame
    translated_fr_batch = batch_translate_to_french(english_sentences)
    df.loc[i:i+batch_size-1, 'French_Translation'] = translated_fr_batch

# Calculate and print the execution time
end_time = time.time()
execution_time = end_time - start_time
print(f"Batch Size: {batch_size}, Execution Time: {execution_time} seconds")

# Display the updated DataFrame
df

The code above adds a new column named “French_Translation”, which contains the translations produced by the “Helsinki-NLP/opus-mt-en-fr” model.
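To make the batching step easier to picture, here is a minimal sketch of how 1050 rows split into batches of 32. It uses a plain Python list instead of a DataFrame, and the `make_batches` helper and the placeholder sentences are hypothetical names introduced only for this illustration.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Hypothetical stand-in for the 1050 English sentences.
sentences = [f"sentence {n}" for n in range(1050)]
batches = make_batches(sentences, batch_size=32)

print(len(batches))      # number of batches
print(len(batches[-1]))  # size of the final, partial batch
```

With these numbers there are 32 full batches plus one final batch of 26 rows. Note that list slicing excludes the end index, while `df.loc` includes the end label, which is why the notebook code subtracts 1.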

Accuracy

On the modified data frame, we wanted to measure how well the model translated the provided sentences. To do that, we used SacreBLEU, BERTScore, and Levenshtein distance, comparing the model output against the reference translations in the “fr” column.

from sacrebleu import corpus_bleu

# Reference translations (ground truth)
reference_translations = df['fr'].tolist() # 'fr' is the column with reference translations

# Candidate translations (model output)
candidate_translations = df['French_Translation'].tolist() # 'French_Translation' is the column with model translations

# Calculate BLEU score
bleu_score = corpus_bleu(candidate_translations, [reference_translations])
print(f"SacreBLEU Score: {bleu_score.score}")

This gave a SacreBLEU score of 40.985200167840425 (about 41.0 on a scale from 0 to 100).
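For intuition about what a BLEU score measures, here is a simplified sketch of the idea: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. This is not SacreBLEU's implementation. It assumes whitespace tokenization, one reference per sentence, and no smoothing, and the function names and example sentences are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU: whitespace tokens, one reference each, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than their references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)

# Hypothetical example: the hypothesis drops the last word of the reference.
toy_hyp = ["le chat est sur le tapis"]
toy_ref = ["le chat est sur le tapis rouge"]
print(simple_corpus_bleu(toy_hyp, toy_ref))
print(simple_corpus_bleu(toy_ref, toy_ref))
```

In the first call every n-gram of the hypothesis appears in the reference, so the score falls below 100 only because of the brevity penalty. SacreBLEU applies its own tokenization and other details, so its numbers will generally differ from this sketch.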

from bert_score import score

# Reference translations (ground truth)
references = df['fr'].tolist() # column fr with reference translations

# Candidate translations (model output)
hypotheses = df['French_Translation'].tolist() # 'French_Translation' : column with model translations

# Calculate BERTScore
P, R, F1 = score(hypotheses, references, lang='fr', verbose=True)

print(f"BERTScore Precision: {P.mean().item():.4f}")
print(f"BERTScore Recall: {R.mean().item():.4f}")
print(f"BERTScore F1: {F1.mean().item():.4f}")

Using BERTScore, we got the following results:

done in 6.17 seconds, 162.02 sentences/sec
BERTScore Precision: 0.8966
BERTScore Recall: 0.8943
BERTScore F1: 0.8952
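The precision, recall, and F1 values above come from matching token embeddings between each candidate and its reference. The sketch below illustrates only that greedy matching step with cosine similarity. The 2-D vectors are hypothetical toy values; the real metric uses contextual embeddings from a BERT-style model and offers options such as idf weighting and baseline rescaling that are left out here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall and F1 from greedy cosine matching of token vectors."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical toy embeddings: the reference has one token the candidate lacks.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f1 = greedy_match_scores(candidate, reference)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Here precision is perfect because every candidate token has an exact match, while recall is lower because the third reference token is only partly covered. This mirrors how a translation that omits content tends to lose recall.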

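Finally, we computed the average Levenshtein (edit) distance between each reference translation and the corresponding model translation.

import numpy as np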

from Levenshtein import distance as lev

levenshtein_scores = [lev(ref, hyp) for ref, hyp in zip(references, hypotheses)]
avg_levenshtein_score = np.mean(levenshtein_scores)
print(f"Average Levenshtein Score: {avg_levenshtein_score:.4f}")

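The average Levenshtein distance is 48.1990, meaning that on average about 48 single-character edits separate a model translation from its reference. Lower is better.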
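As an illustration of what this distance counts, here is a minimal pure-Python sketch of the standard dynamic-programming algorithm for Levenshtein distance. It is not the implementation inside the `Levenshtein` package, and the `normalized_levenshtein` helper is a hypothetical addition showing one way to make scores comparable across sentences of different lengths.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

def normalized_levenshtein(a, b):
    """Distance divided by the longer length, giving a value between 0 and 1."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

print(levenshtein("bonjour", "bonsoir"))
print(normalized_levenshtein("bonjour", "bonsoir"))
```

Because the raw distance grows with sentence length, an average of about 48 is hard to interpret on its own. Dividing by sentence length, as in the hypothetical helper above, would be one way to put the number in context.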

Upcoming Upgrades

Based on these results, the model did reasonably well on our 1050-row sample: a SacreBLEU score of about 41 and a BERTScore F1 of about 0.90. Using more of the data would be one way to improve on this, but it would take a lot of time, so additional GPUs and better-optimized code would also be required. Alternatively, we could try a different translation model.
