MTet: Multi-domain Translation for English and Vietnamese

Collecting high-quality data and training a state-of-the-art neural machine translation model for Vietnamese.

Nguyen Ba Ngoc
Google Developer Experts
4 min read · Mar 17, 2022


Introduction

Machine translation (MT), the task of automatically mapping content from one language to another by computers, is unarguably one of the most important applications of natural language processing (NLP). Despite great progress in MT using AI, specifically deep learning and neural networks, English-Vietnamese translation quality still lags behind, mostly due to the lack of high-quality datasets at scale. Recently, VietAI released a high-quality English-Vietnamese translation corpus of 3.3M examples to the research community to spur further research progress.

Continuing this effort, we are excited to introduce the second release of VietAI’s MTet project, which stands for Multi-domain Translation for English and VieTnamese. With this release, we further improved the first-ever large-scale multi-domain English-Vietnamese translation dataset, expanding it to 4.2M examples across 11 domains. In addition, we demonstrated state-of-the-art results on IWSLT’15 (+3.5 BLEU for English-Vietnamese). We hope that our efforts will inspire further contributions to an ever-growing repository of high-quality datasets for the Vietnamese NLP community.

VietAI is a non-profit organization with the mission of building a community of world-class AI experts in Vietnam. VietAI has nurtured and trained thousands of students and experts in AI, 3 of whom are the first Google Developer Experts in Machine Learning in Vietnam.

Dataset

In this release, we cleaned and deduplicated the first version of our dataset (datav1) while adding 1.2M training text pairs from various sources. This grows our dataset from 3.0M to nearly 4.2M training text pairs in datav2.
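For concreteness, here is a minimal sketch of what such exact-duplicate removal can look like: Unicode-normalize and lowercase both sides of a pair, hash the result, and keep only the first occurrence. The normalization rules and the in-memory pair list are illustrative assumptions, not the exact cleaning recipe used for datav2.

import hashlib
import unicodedata

def normalize(text: str) -> str:
    # Unicode-normalize, lowercase, and collapse whitespace so that
    # near-identical pairs hash to the same key.
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())

def deduplicate(pairs):
    """Keep only the first occurrence of each (English, Vietnamese) pair."""
    seen, unique = set(), []
    for en, vi in pairs:
        key = hashlib.md5(f"{normalize(en)}\t{normalize(vi)}".encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((en, vi))
    return unique

# Toy usage: the repeated pair is dropped.
pairs = [("Hello.", "Xin chào."), ("Hello.", "Xin chào."), ("Thank you.", "Cảm ơn.")]
print(len(deduplicate(pairs)))  # -> 2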

The additional data comes from two sources. First, we used modelv1 to score, filter, and pair high-quality data from existing large but noisy sources (OpenSubtitles, MultiCCAligned, and Wikilingua) that had yet to be incorporated into datav1. Second, we performed a mix of automated and manual scraping from a list of 30 public websites spanning domains such as medical publications, religious texts, engineering articles, literature, news, and poems. We also recalibrated our previous test set to balance the different translation domains.
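As an illustration of this kind of model-based filtering, the sketch below combines a standard length-ratio heuristic with a score threshold. Here score_pair stands in for scoring with modelv1 (for example, a length-normalized log-probability of the Vietnamese side given the English side); both that function and the threshold are assumptions for illustration, not the published recipe.

def length_ratio_ok(en: str, vi: str, max_ratio: float = 2.0) -> bool:
    # Cheap heuristic commonly used for noisy web-mined corpora:
    # drop pairs whose token counts differ by more than max_ratio.
    n_en, n_vi = len(en.split()), len(vi.split())
    if n_en == 0 or n_vi == 0:
        return False
    return max(n_en, n_vi) / min(n_en, n_vi) <= max_ratio

def filter_noisy_pairs(pairs, score_pair, threshold=-2.0):
    # Keep pairs that pass the length heuristic and score well under the
    # translation model; score_pair and threshold are assumptions.
    return [
        (en, vi)
        for en, vi in pairs
        if length_ratio_ok(en, vi) and score_pair(en, vi) >= threshold
    ]

# Toy usage with a stub scorer (the real scoring used modelv1 on TPUs).
toy_pairs = [("Hello.", "Xin chào."), ("Hello.", "some long unrelated crawl noise here")]
stub_score = lambda en, vi: 0.0
print(filter_noisy_pairs(toy_pairs, stub_score))  # only the first pair survives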

Utilizing Google Cloud Platform, TPUs, and Tensorflow

By using TPU v3-8 and TPU v3-32, we were able to train larger Transformer models and achieve state-of-the-art results on multiple test sets. Utilizing Google Cloud Storage with its flexibility in region, we were also able to distribute our data pipelines for training each specific Transformer model on its TPUs (v2-8, v3-8, and v3-32). The improved training speed of TPU v3 (compared to TPU v2-8) also allowed us to train large models faster and for longer.
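A minimal TensorFlow sketch of this setup is shown below: connect to a Cloud TPU, build a TPUStrategy, and stream TFRecord shards directly from a Cloud Storage bucket with tf.data. The bucket path, shard pattern, and batch size are placeholders, and tokenization/parsing is omitted; this is not the actual MTet pipeline.

import tensorflow as tf

# Connect to the Cloud TPU worker (the TPU name/address can also be
# passed explicitly) and build a distribution strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Stream training shards straight from Cloud Storage; the bucket path and
# record layout are placeholder assumptions.
files = tf.data.Dataset.list_files("gs://my-bucket/mtet/train-*.tfrecord")
dataset = (
    files.interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(256, drop_remainder=True)  # TPUs need static batch shapes
    .prefetch(tf.data.AUTOTUNE)
)

with strategy.scope():
    # model = build_transformer(...)  # model construction omitted
    pass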

Thanks to GCP credits generously provided by Google Cloud, we were able to run 10 TPUv2 devices in parallel to score and filter high-quality training data from very large but noisy translation datasets such as MultiCC (20M sentences) and the OpenSubtitles corpus (3.5M sentences); in the end, this filtering contributed close to half of the new data.

Model and Results

In addition to improving dataset size and diversity, we employed the vanilla Transformer architecture in the Transformer-tall18 setting, which is double the size of Transformer-tall9 and gives better translation quality on the IWSLT2015 benchmark.

Some example translations from the new model:

English: Without arguments, 'print' displays the entire partition table. However with the following arguments it performs various other actions.
Vietnamese: Khi không có đối số, " print " hiển thị toàn bộ bảng phân vùng. Nếu đưa ra các đối số theo sau, thì nó làm một số hành vi khác.

English: We report a seven-year-old female presenting with fever, dry cough, and abdominal pain after that.
Vietnamese: Chúng tôi báo cáo một trường hợp bệnh nhi nữ, 7 tuổi vào viện với triệu chứng ho khan, sốt và đau bụng dữ dội sau đó.

With the help of Google’s Cloud TPUv3 and Google Cloud Platform infrastructure, we were able to train large models faster and for longer, ultimately achieving state-of-the-art translation quality with our v2 Transformer models.

Hyperparameter settings
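The exact values are not listed in this post, so the snippet below is only a rough sketch of what a Transformer-tall18 configuration implies: the 18 encoder/decoder layers follow from the name (double the 9 layers of Transformer-tall9), and every other number is an assumed placeholder rather than the released configuration.

# Illustrative only: all values except the layer counts are assumptions.
transformer_tall18 = {
    "num_encoder_layers": 18,
    "num_decoder_layers": 18,
    "hidden_size": 1024,       # assumption
    "filter_size": 4096,       # assumption
    "num_heads": 16,           # assumption
    "dropout": 0.1,            # assumption
    "optimizer": "adafactor",  # assumption
}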

In Figure 1, we report the results of MTet models on the IWSLT2015 translation corpus. The MTet Transformer-tall18 models achieve state-of-the-art translation quality in both directions. There is also a significant improvement from training on the larger MTet dataset, which alone contributes gains of 0.7% (En-Vi) and 1.6% (Vi-En).

Our Transformer-tall18 model, trained on our newly released high-quality VietAI Translation dataset, outperforms the existing M2M100, a larger model trained on a much larger but noisier dataset (Figure 2). We achieve state-of-the-art results on both En-Vi and Vi-En tasks (5.8% and 12.4% higher than M2M100, respectively) while using a much smaller model.
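For reference, BLEU comparisons like those above are typically computed with sacrebleu. The toy snippet below shows the call on a pair of hypothesis and reference lists; the tokenization setting is an assumption and may differ from the one behind the reported IWSLT’15 numbers.

import sacrebleu

# Model outputs and reference translations, aligned by index.
hyps = ["Xin chào thế giới .", "Cảm ơn bạn ."]
refs = ["Xin chào thế giới .", "Cám ơn bạn ."]

# corpus_bleu expects a list of reference streams (one stream here).
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="13a")
print(f"BLEU = {bleu.score:.1f}")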

This work was conducted by the VietAI research team (Chinh Ngo, Hieu Tran, Long Phan, Trieu H. Trinh, Hieu Nguyen, Minh Nguyen, Minh-Thang Luong).

To see more details, please check out the VietAI Translation Project.

Figure 1. Improvements made from our first release to this second release on the IWSLT15 test set.
Figure 2. Comparison between pre-trained translation models and our purely supervised MTet.
