Breaking Language Barriers with M2M100: A New Benchmark in Multilingual Translation

Abdullah Mubeen
Published in spark-nlp · 5 min read · May 29, 2024

The M2M100 model sets a new benchmark for multilingual translation, supporting direct translation across 9,900 language pairs from 100 languages. This feature represents a significant leap in breaking down language barriers in global communication. Unlike traditional English-centric models, M2M100 facilitates direct translations between any pair of languages, making it a revolutionary tool in the field of machine translation.

Figure: Summary of the Many-to-Many dataset and multilingual model (from the M2M100 paper).

Understanding Multilingual Translation Models

English-Centric Multilingual Models

Traditional multilingual translation models are often English-centric, as depicted in Figure (a). These models translate from various languages to English and vice versa. While effective, this approach does not cater to the global need for direct translations between non-English languages.

Many-to-Many Multilingual Models

The M2M100 model, shown in Figure (b), is designed to translate directly between any pair of 100 languages, thus eliminating the dependency on English as an intermediate language. This many-to-many approach allows for more natural and accurate translations, reflecting real-world translation needs.

Building the M2M100 Model

The M2M100 model was developed and open-sourced by a team of researchers from Facebook AI, as detailed in their paper “Beyond English-Centric Multilingual Machine Translation.” The model is built upon a large-scale dataset covering thousands of language directions with supervised data.

Introduction to Multilingual Machine Translation

Multilingual Machine Translation (MMT) aims to build a single model that can translate between any pair of languages. Neural network models have been successful for bilingual machine translation, and neural MMT models have recently shown promising results. MMT models share information between similar languages, which benefits low-resource directions and enables zero-shot translation. However, they have historically underperformed bilingual models trained on the same language pairs because model capacity must be split among many languages. This gap can be narrowed by increasing model capacity, which in turn requires larger multilingual training datasets that are laborious to create. Most prior work has focused on English-centric datasets, translating to and from English but not between non-English languages.

Large-Scale Many-to-Many Dataset

Figure: Dictionary coverage per language (from the M2M100 paper).

To address the limitations of English-centric models, researchers created a Many-to-Many dataset for 100 languages using a novel data mining strategy that exploits language similarity to reduce complexity. They also used backtranslation to improve the model’s quality on zero-shot and low-resource pairs. The result is a dataset with 7.5 billion training sentences for 100 languages, providing direct training data for thousands of translation directions.

Figure: Performance of many-to-English multilingual translation compared to bilingual baselines trained on mined data, and bilingual + back-translation (from the M2M100 paper).
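The back-translation idea itself is easy to sketch with the same Spark NLP M2M100Transformer annotator used in the practical example later in this article: monolingual sentences in one language are translated into another, and the resulting synthetic pairs can augment training data for a low-resource direction. This is only a minimal illustration; the language pair, example sentence, and column names below are invented, and the researchers' actual mining and back-translation pipeline operated at a vastly larger scale.

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import M2M100Transformer
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Monolingual sentences in the target language (Portuguese here, purely for illustration):
# "The model translates directly between a hundred languages."
mono_pt = spark.createDataFrame([["O modelo traduz diretamente entre cem idiomas."]]).toDF("text")

doc = DocumentAssembler().setInputCol("text").setOutputCol("documents")

# Translate the target-language text back into the source language (English) so that each
# real Portuguese sentence gets a synthetic English counterpart for English -> Portuguese training
back_translator = M2M100Transformer.pretrained() \
    .setInputCols(["documents"]) \
    .setOutputCol("back_translation") \
    .setSrcLang("pt") \
    .setTgtLang("en") \
    .setMaxOutputLength(50)

synthetic = Pipeline(stages=[doc, back_translator]).fit(mono_pt).transform(mono_pt)
synthetic.select("text", "back_translation.result").show(truncate=False)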

Model Architecture and Training

The amount of data in a Many-to-Many dataset grows quadratically with the number of languages, so standard-capacity neural networks underfit it. The researchers therefore used model parallelism to train models over 50 times larger than typical bilingual models. They also implemented scaling strategies, including a deterministic mixture-of-experts strategy that splits model parameters into non-overlapping groups of languages, trained with a novel re-routing strategy. This approach reduced the need to densely update parameters and improved parallelization across machines. They scaled the model to 15.4 billion parameters and trained it efficiently on hundreds of GPUs. The resulting model translates directly between 100 languages without pivoting through English, achieving performance competitive with bilingual models on benchmarks such as WMT.

Segmentation and Multilingual Dictionary

The model uses SentencePiece for tokenization, producing subword units based on their frequency in the training dataset. To ensure sufficient representation of low-resource languages and less frequent scripts, researchers used temperature sampling to balance the distribution of subword units across languages.
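The effect of temperature sampling is easy to see in a few lines of plain Python. The corpus sizes and the temperature value below are made up for illustration only; the point is that raising each language's share of sentences to the power 1/T flattens the distribution, so low-resource languages contribute more subword units to SentencePiece training than their raw counts would allow.

# Illustrative sentence counts per language; the numbers are invented
corpus_sizes = {"en": 500_000_000, "fr": 80_000_000, "sw": 2_000_000, "ga": 500_000}

def sampling_probabilities(sizes, temperature=5.0):
    # Temperature sampling: p_i is proportional to (n_i / N) ** (1 / T)
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# With T = 1 the distribution follows the raw counts; with a higher T it is much flatter,
# so a rare language such as Irish (ga) is far better represented in the sampled data
for lang, p in sampling_probabilities(corpus_sizes, temperature=5.0).items():
    print(f"{lang}: {p:.3f}")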

Transformer Architecture

The M2M100 model is based on the Transformer sequence-to-sequence architecture, which includes an encoder and a decoder. The encoder transforms the source token sequence into embeddings, and the decoder sequentially produces the target sentence. Both the encoder and decoder consist of multiple Transformer layers, each comprising a self-attention layer and a feed-forward layer.
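As a rough, framework-level sketch of that encoder-decoder structure (not the actual M2M100 implementation; the dimensions are toy values, and positional encodings and attention masks are omitted for brevity), PyTorch's built-in Transformer modules can be wired together like this:

import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64  # toy sizes, for illustration only

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True), num_layers=2)
project = nn.Linear(d_model, vocab_size)  # maps decoder states back to vocabulary logits

src = torch.randint(0, vocab_size, (1, 10))  # source token ids
tgt = torch.randint(0, vocab_size, (1, 7))   # target tokens generated so far

memory = encoder(embed(src))                   # the encoder produces contextual source embeddings
logits = project(decoder(embed(tgt), memory))  # the decoder attends to them and predicts the next token
print(logits.shape)                            # (1, 7, vocab_size)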

Training and Generation

The training data was split into shards to manage memory consumption, with higher-resource languages having more shards and lower-resource languages fewer. The model was trained using the Adam optimizer with regularization techniques to stabilize training. For generation, the model used beam search with a beam size of 5.
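Outside of Spark NLP, the generation step can also be sketched with the publicly released Hugging Face checkpoint of M2M100, where beam search is exposed through the num_beams argument of generate. The checkpoint name below is the commonly used 418M-parameter release, and the beam size of 5 mirrors the setting described above; treat the snippet as an illustration rather than a reproduction of the original setup.

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "zh"  # tell the tokenizer the source language
encoded = tokenizer("每周六,在家附近的一个咖啡馆里,我坐在阳光最好的位置上", return_tensors="pt")

# Force the first generated token to the English language id and decode with beam search
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("en"),
    num_beams=5,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))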

For a more in-depth understanding of the technical implementation and architecture of the M2M100 model, refer to the original paper:

M2M100: Beyond English-Centric Multilingual Machine Translation.

Practical Application

Here’s a practical example of using the M2M100 model in Spark NLP to translate from Chinese to English:

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import M2M100Transformer
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

# Create a DataFrame with the text to be translated
# ("Every Saturday, in a cafe near my home, I sit in the seat with the best sunlight")
data = spark.createDataFrame([["每周六,在家附近的一个咖啡馆里,我坐在阳光最好的位置上"]]).toDF("text")

# Set up the Document Assembler to turn raw text into document annotations
doc_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

# Set up the M2M100 Transformer for Chinese-to-English translation
m2m100 = M2M100Transformer.pretrained() \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(50) \
    .setOutputCol("generation") \
    .setSrcLang("zh") \
    .setTgtLang("en")

# Build the pipeline
pipeline = Pipeline(stages=[doc_assembler, m2m100])

# Run the pipeline and show the translation
model = pipeline.fit(data)
result = model.transform(data)
result.select("generation.result").show(truncate=False)

Performance and Evaluation

The M2M100 model outperforms traditional models, achieving gains of more than 10 BLEU points when translating directly between non-English languages. It performs competitively with the best single systems of WMT, making it a highly effective tool for multilingual translation.

Figure: Comparison of various evaluation settings from previous work (from the M2M100 paper).
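BLEU scores like the ones reported above can be computed for your own outputs with the sacrebleu library; the hypothesis and reference sentences below are invented purely to show the call, whereas the paper's numbers come from full benchmark test sets such as WMT.

import sacrebleu

# Toy hypothesis/reference pair; real evaluation uses complete benchmark test sets
hypotheses = ["Every Saturday I sit in the sunniest spot of a cafe near my home."]
references = [["Every Saturday, in a cafe near my home, I sit in the seat with the best sunlight."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU, the metric used to compare translation systems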

Conclusion

The M2M100 model represents a significant advancement in the field of multilingual translation. By supporting direct translation across 9,900 language pairs among 100 languages, it breaks down traditional language barriers and enables more natural and accurate communication worldwide.

For further reading and resources, see the original paper linked above and the Spark NLP documentation and Models Hub.
