Embarrassingly small English-to-German Model (258M parameters) with State-of-the-art results

Vinces
Jun 3, 2024


Introduction

Over the past few years, scaling has been the primary approach to improving results in Neural Machine Translation. While this strategy has had some success, some researchers have started to apply recipes from the LLM field (DPO, CPO, …) to improve NMT model quality. Recent research has also shifted focus to neural metrics (such as Comet: https://arxiv.org/pdf/2209.06243) over lexical overlap metrics like BLEU or ChrF, highlighting that NMT quality measurement remains an open issue.

The Comet paper is relatively straightforward: a pretrained model (usually a deep encoder) is combined with an additional layer called the “estimator”, which is fine-tuned on human judgments of outputs from various MT systems (collected at the WMT conferences).

We will explore how to leverage this concept in our use case.

At the time of writing, a model with 258 million parameters might seem small. However, a model of this size corresponded to the “big” configuration in the “Attention Is All You Need” paper (2017). Its small size ensures both training and inference are very fast on recent hardware, such as a consumer-grade RTX 4090 GPU. Training takes less than 72 hours on a single card, and inference is 500–1000 times faster than a 13B LLM on the same hardware. Additionally, we can use C++ optimized inference code (e.g., CTranslate2) to achieve sufficient speed on CPUs.
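
For example, a converted model can be served with CTranslate2 in just a few lines. This is only a sketch: the file names below are placeholders, and the tokenization step assumes a SentencePiece model.

```python
import ctranslate2
import sentencepiece as spm

# Placeholder paths: a SentencePiece model and a CTranslate2-converted NMT model.
sp = spm.SentencePieceProcessor(model_file="ende.spm.model")
translator = ctranslate2.Translator("ende-ct2", device="cpu", inter_threads=4)

def translate(sentence: str) -> str:
    tokens = sp.encode(sentence, out_type=str)
    result = translator.translate_batch([tokens], beam_size=5)
    return sp.decode_pieces(result[0].hypotheses[0])

print(translate("The weather is nice today."))
```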

Referenceless Comet:

Comet (https://github.com/Unbabel/COMET) can score a hypothesis in the target language given the source segment, with or without a reference translation in the target language. When used without a reference, it is called CometKiwi, and it provides a quality estimation of the hypothesis. Unbabel achieved excellent results in the WMT Quality Estimation tasks with CometKiwi.
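
For context, this is roughly what scoring with a released CometKiwi checkpoint looks like through the comet Python package (the checkpoint name is one of Unbabel’s published reference-free models; exact names and access conditions may differ):

```python
from comet import download_model, load_from_checkpoint

# Reference-free (CometKiwi) quality estimation: only "src" and "mt" are needed.
model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

data = [
    {"src": "The weather is nice today.", "mt": "Das Wetter ist heute schön."},
]
output = model.predict(data, batch_size=8, gpus=1)
print(output.scores)  # one quality-estimation score per segment
```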

The strength of CometKiwi lies in two main points:

  1. The pretrained encoder used for Comet is very strong. The smaller model is an XLM-Roberta-Large (560M parameters). More recently, Unbabel released two larger models based on XLM-Roberta-XL (3.5B parameters) and XLM-Roberta-XXL (10B parameters).
  2. The human feedback comes from various WMT conferences (between 2017 and 2020; for some reason, data from 2021–2022 was discarded).

Training CometKiwi involves feeding the pretrained model with the source and hypothesis and training the extra feed-forward layer using scores from the human feedback data. This process is quite simple and requires very few resources.
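
To make the recipe concrete, here is a minimal sketch of such a training step in PyTorch. The encoder name, CLS-style pooling, and hyperparameters are our own simplifications; the real CometKiwi recipe has more moving parts (layer-wise attention, learning-rate schedules, etc.).

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Simplified setup: an XLM-R encoder plus a small feed-forward "estimator" head.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
encoder = AutoModel.from_pretrained("xlm-roberta-large")
estimator = nn.Sequential(nn.Linear(1024, 2048), nn.Tanh(), nn.Linear(2048, 1))
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(estimator.parameters()), lr=1e-5
)

def train_step(sources, hypotheses, human_scores):
    # Source and hypothesis are fed jointly to the encoder (reference-free setup).
    batch = tokenizer(sources, hypotheses, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state      # (B, T, 1024)
    pooled = hidden[:, 0]                            # CLS-style pooling, simplified
    pred = estimator(pooled).squeeze(-1)             # predicted quality score
    loss = nn.functional.mse_loss(pred, torch.tensor(human_scores, dtype=torch.float))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```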

The main issue with the larger models (XL and XXL) is speed. Our primary idea was to distill these models into a very small one so that we could use it during the training of our NMT model. However, the distillation process did not work well. (For reference, distillation is explained here: https://arxiv.org/abs/1606.07947.)

Quality-Aware Training:

Our inspiration also comes from a recent paper: Quality-Aware Translation Models: Efficient Generation and Quality Estimation in a Single Model. The concept is quite simple: they score the training dataset with a Comet-like metric, assign a bucket (based on the score) to each segment, and add this bucket as a tag to either the source segment or the target segment. They demonstrate that the model learns quality from the bucket tags and can use them either for quality prompting or for quality scoring.
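
As an illustration of that tagging step, here is a small sketch; the bucket boundaries and tag format are made up for the example, not taken from the paper:

```python
# Assign each training pair a quality bucket from its Comet-like score and
# prepend the bucket as a tag on the source side (tag format is illustrative).
def quality_tag(score: float) -> str:
    if score >= 0.90:
        return "<q5>"
    elif score >= 0.85:
        return "<q4>"
    elif score >= 0.80:
        return "<q3>"
    elif score >= 0.70:
        return "<q2>"
    return "<q1>"

def tag_pair(src: str, tgt: str, score: float) -> tuple[str, str]:
    return f"{quality_tag(score)} {src}", tgt

print(tag_pair("The weather is nice.", "Das Wetter ist schön.", 0.87))
```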

We extend this concept by integrating distillation, NMT training, and Comet training.

We start the process by scoring our datasets with CometKiwi XL (using XXL was too slow, though it might yield better results). Following the Comet architecture, we add an extra feed-forward layer on top of our NMT model. This extra layer acts as our Estimator. Then, we “co-train” our NMT model and the Estimator in the same training loop, using the source, target, and comet score as inputs.
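
A rough sketch of one co-training step is shown below. The model interface, pooling, and loss weighting are assumptions on our side; only the overall idea of adding an estimator loss on top of the usual cross-entropy comes from the setup described above.

```python
import torch
import torch.nn as nn

# Sketch of one co-training step: the usual NMT cross-entropy loss plus an
# MSE loss that trains the estimator head to predict the CometKiwi-XL score.
def co_training_step(model, estimator, batch, alpha=1.0):
    # model(...) is assumed to return decoder logits and decoder hidden states.
    logits, dec_hidden = model(batch["src_tokens"], batch["tgt_tokens"])

    nmt_loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch["tgt_labels"].reshape(-1),
        ignore_index=batch["pad_id"],
    )

    # Average the decoder's last hidden states over non-padding positions
    # and feed the result to the estimator head.
    mask = (batch["tgt_tokens"] != batch["pad_id"]).unsqueeze(-1).float()
    pooled = (dec_hidden * mask).sum(1) / mask.sum(1)
    pred_score = estimator(pooled).squeeze(-1)
    comet_loss = nn.functional.mse_loss(pred_score, batch["comet_scores"])

    return nmt_loss + alpha * comet_loss
```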

At the end of the training, our model functions both as an NMT model and a Comet model, enabling us to translate and measure the quality estimation of our output. If we use a beam search of 10 with 10 “nbest” outputs, we can rerank and select the best hypothesis based on the “in-model comet score.”
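
In code, the reranking step amounts to something like the following sketch (the beam-search helper and hypothesis fields are hypothetical names, not an actual API):

```python
# Hypothetical reranking helper: translate with beam_size=10 / nbest=10,
# score every hypothesis with the in-model estimator, keep the best one.
def translate_and_rerank(model, estimator, src_tokens, beam_size=10, nbest=10):
    hypotheses = model.beam_search(src_tokens, beam_size=beam_size, nbest=nbest)
    scored = []
    for hyp in hypotheses:
        pooled = hyp.decoder_states.mean(dim=0)   # average last hidden states
        score = estimator(pooled).item()          # in-model comet score
        scored.append((score, hyp.tokens))
    return max(scored, key=lambda x: x[0])[1]
```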

The main difference between our architecture and Comet is that we do not use layer-wise attention. Instead, we use the output of our decoder (since the NMT model is an encoder-decoder model) and average the last hidden states. Our solution is much faster than using an external Comet model as a reranker because we use a lightweight model, and the source is already encoded.

Results on our En-De model:

Our training procedure is quite standard. We use an encoder-decoder model with 6 layers for each, an embedding size of 1024, and a hidden feed-forward network (FFN) size of 4096. Our estimator has three layers, similar to the CometKiwi one. We used datasets such as CCMatrix, ParaCrawl, News Commentary, and Europarl, along with an equivalent amount of back-translation data.

We trained for 200,000 steps with a batch size of 12,144 tokens, accumulating 6 batches per step. We then fine-tuned for an additional 5,000 steps using only the Comet-related loss.
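
For convenience, the reported hyperparameters can be summarized as follows (the key names are ours and not tied to any particular toolkit):

```python
# Illustrative summary of the reported configuration; key names are ours.
config = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "embedding_size": 1024,
    "ffn_size": 4096,
    "estimator_layers": 3,
    "train_steps": 200_000,
    "batch_size_tokens": 12_144,
    "accum_count": 6,
    "comet_finetune_steps": 5_000,
}
```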

We compared our results to those from the WMT and some recent papers using large language models (LLMs).

WMT23 Comparison:

From this paper: https://www2.statmt.org/wmt23/pdf/2023.wmt-1.56.pdf, we obtained the following automated scores:

The ranking based on human evaluation yields slightly different results: ONLINE-B (which I suspect is Google Translate) is ranked number 1, ahead of GPT-4 and ONLINE-W (which appears to be DeepL). The reported score is “WMT22-comet-da,” a reference-based metric (i.e., it compares against the supposed ground-truth reference).

Our system scores at 83.0, which places it in the same cluster as the best online systems.

Comparison with TowerInstruct 7B/13B:

According to the Unbabel TowerInstruct paper, the same score for the 13B model is 83.98, but that model has 13 billion parameters (50 times more than ours) and covers 10 languages. Additionally, it is not really usable on a desktop CPU or GPU given its throughput. It is interesting to note that our model outperforms the 7B TowerInstruct while being roughly 500 times faster.

Further Work:

While our model shows promising results, there is still room for improvement when compared to Google Translate, DeepL, and GPT4. It’s evident that our training dataset is not as extensive as those used by larger companies, and we have yet to scale our model.

This new approach has demonstrated excellent results, and we can extend it to many languages, whether they are already covered by Comet or wherever we have human feedback on a sizable dataset.
