Optimizing Adapters for Neural Machine Translation

Vasudev Gupta
OffNote Labs
Apr 13, 2021


Neural machine translation (NMT) is an active field of research. For NMT, we use a seq2seq model consisting of an encoder and a decoder: the encoder transforms source-language tokens into hidden representations, and the decoder transforms these representations into the target language. Transformer-based architectures are now the most popular choice for NMT; both the encoder and the decoder consist of stacked Transformer blocks.

One of the major limitations of working with Transformers is that we need a very large model to achieve good results. Training such large models from scratch for every language pair requires a lot of compute and memory. Further, we need to store that large set of weights for every language pair, which can make deploying these models for real-world problems very hard.

Possible solutions to the above problems are: (a) train a single model for multiple language pairs, and (b) reuse a pre-trained encoder-decoder model to avoid training from scratch. While it is possible to train a single model on the joint vocabulary of all languages, low-resource languages suffer due to sampling biases and fixed model capacity. Also, we need to retrain the model to add every new language, i.e., learning isn't incremental.

If low-resource languages are sampled too often, the model may overfit; if high-resource languages are not trained on enough, the model will underfit. — mT5

One way to enable both (a) and (b) effectively is to use adapters. Adapters introduce a few task-specific (here, language-specific) learnable weights into the original model. This lets us keep the pre-trained weights untouched and hence share them across several tasks.

Adapters

Figure from the AdapterHub paper

Adapters are small feed-forward-like networks which adapt a pre-trained model to the required task without changing the weights of the original pre-trained layers. Ideally, a layer with its adapter should produce the output we would have obtained by fine-tuning that pre-trained layer on the given task.
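To make this concrete, here is a minimal sketch of a bottleneck adapter in PyTorch. The hidden size of 1024 matches mBART-large, while the bottleneck size of 64 and the LayerNorm placement are illustrative choices rather than the exact configuration we used.

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Bottleneck adapter: LayerNorm -> down-project -> non-linearity -> up-project,
        added back to the input through a residual connection."""

        def __init__(self, hidden_size: int = 1024, bottleneck_size: int = 64):
            super().__init__()
            self.layer_norm = nn.LayerNorm(hidden_size)
            self.down = nn.Linear(hidden_size, bottleneck_size)
            self.act = nn.ReLU()
            self.up = nn.Linear(bottleneck_size, hidden_size)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # The residual connection keeps the pre-trained representation intact;
            # the adapter only learns a small task-specific correction on top of it.
            return hidden_states + self.up(self.act(self.down(self.layer_norm(hidden_states))))

With a hidden size of 1024 and a bottleneck of 64, each adapter adds roughly 130K parameters, a tiny fraction of a full Transformer layer.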

Benefits: with adapters, the language-pair-specific weights are limited to the parameters introduced by these small feed-forward networks at a handful of locations.

For more details about adapters, please refer to the AdapterHub paper.

In this article, we will discuss our effort to efficiently train Indic-English translation models (e.g., Hindi -> English) using adapters. Most existing research adds adapters to all the layers of the seq2seq model. In this article, we investigate:

  • Are all the adapters equally important, i.e., do they contribute equally to performance?
  • Can we get an equivalent performance using only a subset of adapters?

Need for Pre-trained Models

Training from scratch is not a good strategy: Transformer-based models are generally very large, and we need a very large amount of data to get a model that generalizes well. But we don't have large amounts of parallel data for many Indic language pairs.

This raises the need for a pre-trained model that has already learned some representation of each language during pre-training and can focus just on translation during fine-tuning. Indian languages are low-resource, so we need a model that already understands language structure well before we train it for translation; during the translation task, it can then focus on how to translate rather than on learning each language's structure. This way we can work with small datasets and benefit from the relations between our target languages and the other languages seen during pre-training.

Selecting a Pre-trained model

The pre-trained model must support the languages we want to translate, i.e., it must have seen those languages during pre-training. Also, the pre-trained model should be a seq2seq model (encoder-decoder architecture).

With so many pre-trained models available, it is generally hard to decide which one to use. Also, many pre-trained models are English-centric, so it is hard to find one that has seen our required languages during pre-training.

Among pre-trained models for Indian languages, we had various options such as IndicBERT (an ALBERT-like model), MuRIL, mBART and many more.

IndicBERT: an encoder-only model pre-trained on multiple Indian languages simultaneously using a BERT-style objective. The pre-trained checkpoint for this model can be found here. Since our task involves text generation, we need an encoder-decoder architecture that can generate sequences of different lengths. One way to proceed is to simply stack a pre-trained decoder on top of the IndicBERT encoder and fine-tune the pair for our task. We didn't take this approach; you can refer to other papers (such as this) if you want to explore it.

mBART: an encoder-decoder model pre-trained to de-noise multiple languages simultaneously; the model captures plenty of information about multiple languages and is good at tasks involving text generation. We introduced adapters into this model at various locations and fine-tuned only the adapters to adapt the model (initialized from this checkpoint) for translating a particular language pair, thus preserving the information it gained during pre-training.
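As a sketch of this setup (the checkpoint name facebook/mbart-large-cc25 is used here only for illustration), we load the pre-trained mBART and freeze all of its weights; only the adapter modules attached later get trained:

    from transformers import MBartForConditionalGeneration, MBartTokenizer

    # Pre-trained multilingual denoising mBART; all of its weights stay frozen.
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")

    for param in model.parameters():
        param.requires_grad = False  # only adapter weights (added later) are updated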

Again, there are several options on how to reuse the pretrained mBART model. We discuss the options below.

Simple fine-tuning

Before going deeper into adding adapters, let's understand what happens with the simplest approach, i.e., fine-tuning the full model on each language pair separately. This approach is inefficient in terms of memory, since we have to save the weights of every language-pair model.

For example, say we want translation models for 5 language pairs: hin-eng, guj-eng, tamil-eng, bengali-eng and marathi-eng. mBART fine-tuned on a single language pair has around 2.5 GB of weights, so for these 5 language pairs we end up with 2.5 x 5 = 12.5 GB of weights. Imagine how much storage our translator would need if we wanted to cover more than 100 languages! We need a way to increase memory efficiency without compromising performance on each dataset. We discuss solutions in the coming sections.

One more thing to note is that a lot of information is shared among multiple Indian language pairs; e.g., the structures of Hindi and Gujarati are more similar to each other than Hindi and English are. This property can be exploited when planning the training strategy.

We report the results for simple fine-tuning in a table in a later section and compare them with different ways of adding adapters.

Extending the previous example to adapter-based tuning:

  • Adapter weights per language pair: ~100 MB
  • Adapter weights for all 5 language pairs: 5 x 100 MB = 500 MB
  • Shared pre-trained weights: 2.5 GB
  • Total: 0.5 GB + 2.5 GB = 3 GB
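Written out as a quick back-of-the-envelope check (the 2.5 GB and 100 MB figures are the approximate sizes from this example):

    n_pairs = 5
    full_model_gb = 2.5   # one fully fine-tuned mBART copy per language pair
    adapter_gb = 0.1      # adapter weights per language pair (~100 MB)

    separate_finetuning_gb = n_pairs * full_model_gb                # 12.5 GB
    shared_with_adapters_gb = full_model_gb + n_pairs * adapter_gb  # 3.0 GB
    saving = 1 - shared_with_adapters_gb / separate_finetuning_gb   # 0.76

    print(f"{separate_finetuning_gb:.1f} GB vs {shared_with_adapters_gb:.1f} GB "
          f"({saving:.0%} less storage)")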

Previously the total was 12.5 GB; now it is only 3 GB, a 76% reduction in storage. Now consider the scenario where we have hundreds of such language pairs. Most importantly, we achieve this without compromising task accuracy; we compare results in the next section.

Where to add adapters?

Where to place adapters depends on the pre-training strategy and on how closely our task is related to it. By adding fewer adapters, we further reduce the per-language-pair model size.

Another question is where to add adapters relative to a particular layer. Most papers on adapters suggest adding them after the feed-forward layer in both the encoder and the decoder; some papers also include adapters after the self-attention layer.

Since an adapter's main role is to adapt each pre-trained layer towards what it would have become under complete fine-tuning, we want to figure out which layers really change when the pre-trained model is fine-tuned on our task.

Our Experiments

We picked the Hindi-English pair from the Bhasha dataset for further experimentation. The complete dataset contains around 260K samples, but we considered 3 subsets: we randomly sampled 20K, 50K and 100K samples from the complete dataset to understand the effect of increasing dataset size when fine-tuning only the adapters. The goal is to find which adapters are more important, i.e., contribute more towards the BLEU score.

To estimate which layers need adapters, we experimented by freezing some of the layers and observing how much the results vary with and without freezing that particular layer. If we freeze a particular layer and the loss curve and final BLEU score do not change much (compared to complete fine-tuning), then that layer possibly doesn't need an adapter. We restricted our attention to the self-attention / cross-attention, feed-forward and embedding layers, and conducted these experiments for each of them. We found that all of these layers affect the BLEU score significantly.
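Here is a sketch of how such a probing run can be set up with HuggingFace's mBART implementation (the parameter-name patterns fc, self_attn, encoder_attn and shared follow that implementation; the rest of the fine-tuning loop stays unchanged):

    from transformers import MBartForConditionalGeneration

    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

    def freeze_matching(model, *patterns):
        """Freeze every parameter whose name contains all of the given substrings."""
        for name, param in model.named_parameters():
            if all(p in name for p in patterns):
                param.requires_grad = False

    # Example: freeze the encoder feed-forward sub-layers (fc1/fc2 in HuggingFace's
    # mBART), fine-tune as usual, and compare loss / BLEU against full fine-tuning.
    freeze_matching(model, "encoder.layers", "fc")

    # Other probes, run one at a time:
    # freeze_matching(model, "decoder.layers", "self_attn")     # decoder self-attention
    # freeze_matching(model, "decoder.layers", "encoder_attn")  # cross-attention
    # freeze_matching(model, "shared")                          # token embeddings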

Because we are working with a seq2seq model, we have several instances of these layers on both the encoder and decoder sides: enc-attn, enc-ffn, dec-attn, dec-ffn and the cross-attn layers, and adapters may be attached after each of them.
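One lightweight way to attach adapters at these locations without modifying the model code is through forward hooks. The sketch below reuses the Adapter class and the frozen mBART model from the earlier snippets; the module names follow HuggingFace's mBART implementation, and the exact placement relative to residual connections and LayerNorm differs between papers.

    def attach_adapter(module, adapter):
        """Apply `adapter` to the module's output via a forward hook.
        Attention modules return tuples, so only the hidden states are adapted."""
        def hook(mod, inputs, output):
            if isinstance(output, tuple):
                return (adapter(output[0]),) + output[1:]
            return adapter(output)
        module.register_forward_hook(hook)

    d_model = model.config.d_model
    adapters = {}

    for i, layer in enumerate(model.model.encoder.layers):
        adapters[f"enc-attn-{i}"] = Adapter(d_model)
        attach_adapter(layer.self_attn, adapters[f"enc-attn-{i}"])

    for i, layer in enumerate(model.model.decoder.layers):
        adapters[f"dec-ffn-{i}"] = Adapter(d_model)
        attach_adapter(layer.fc2, adapters[f"dec-ffn-{i}"])

    # enc-ffn (layer.fc2 in the encoder), dec-attn (layer.self_attn in the decoder)
    # and cross-attn (layer.encoder_attn) adapters can be attached the same way.

    # Adapter after the shared token embedding used by the encoder and decoder.
    adapters["embed"] = Adapter(d_model)
    attach_adapter(model.get_input_embeddings(), adapters["embed"])

Since these adapter modules live outside the model object, they have to be moved to the same device as the model and their parameters passed to the optimizer explicitly.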

Now, to understand the relative importance of these layers, we incrementally add adapters after them. The tables below summarize some of our experiments.

Table-1
Table-2
Table-3 (embed-adapter -> adapter after the embedding layer in both encoder & decoder; enc-attn-adapter -> adapter after the encoder self-attention; dec-attn-adapter -> adapter after the decoder self-attention; enc-ffn-adapter -> adapter after the encoder feed-forward network; dec-ffn-adapter -> adapter after the decoder feed-forward network. Note: all experiments are run for the same number of steps on a Colab T4.)

From above tables, we make the following observations.

  • When fine-tuning mBART for the translation task, the best adapter configuration is to add adapters after the encoder self-attention, the decoder feed-forward layer and the embedding layer.
  • Note that other adapter-based NMT approaches add adapters to all layers. In contrast, our best configuration omits the following adapters: encoder feed-forward, decoder self-attention and cross-attention.
  • The best adapter configuration (i.e. enc-self-attn-adapter, dec-ffn-adapter, embed-adapter) performs similarly to complete fine-tuning, while there is a significant difference in model size and training time.

We further observed (see Table-4) that adding an adapter after the embedding layer is crucial and can improve the model's performance by a significant amount. Training only the embedding adapters already fetched us a BLEU score of 15; the other adapters had a milder effect on BLEU. We haven't seen any paper highlight this.

Table-4 (complete tuning -> training all the layers initialized from pre-trained weights (no adapters))

Further, we performed complete fine-tuning and trained adapters (for guj-eng and hin-eng) on the complete dataset and compared BLEU scores (see Table-5).

Table-5 (Note: the Bhasha guj-eng dataset has only 59K samples, while the hin-eng dataset has 260K samples; the BLEU score is calculated on around 8000 samples)

We observe that with the smaller set of adapters (the best configuration), we obtain BLEU scores similar to those from complete fine-tuning. The models are trained only on the mid-size Bhasha dataset; an interesting question to explore is whether the gap between adapters and complete fine-tuning shrinks further if we increase the training data size by including other datasets.

Our code builds upon the HuggingFace transformers library. All our code changes for adding adapters and running these experiments, along with checkpoints and pre-trained models, can be found here.

Finally, we look at a few more details about training with adapters. These observations are based on our best adapter configuration.

Deciding hyper-parameters

Unlike full Transformer fine-tuning, which is usually done at a low learning rate (on the order of 1e-5), we observed that adapters take a long time to converge at learning rates of that order. After some experiments, we found that for training adapters the learning rate should be on the order of 1e-3 to converge in a reasonable time.
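For reference, here is a sketch of the corresponding optimizer setup, collecting only the adapter parameters from the dictionary built in the earlier sketch (the choice of AdamW is our assumption):

    import itertools
    import torch

    # Only adapter weights are optimized; the frozen mBART weights need no
    # gradients or optimizer state. A learning rate around 1e-3 converged much
    # faster for us than the ~1e-5 typically used for full fine-tuning.
    adapter_params = list(itertools.chain.from_iterable(
        adapter.parameters() for adapter in adapters.values()))
    optimizer = torch.optim.AdamW(adapter_params, lr=1e-3)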

Note that with adapters, we avoid gradient computation for a large part of the model, which saves us a lot of memory during training. Hence, we can train with larger batch sizes and train faster on bigger datasets.

Should we train both the Adapters & Pre-trained model together?

We observed that the model performs very badly when adapters and pre-trained weights are trained simultaneously. Our hypothesis is that, since adapters are introduced after every layer of the Transformer and are randomly initialized, the pre-trained weights of every successive layer get corrupted during the initial rounds of training.

An important question that arises here is how to initialize the adapters. One possible initialization strategy is to make each adapter an identity function at the start of training.
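With the bottleneck adapter sketched earlier, a simple way to get an identity function at initialization is to zero out the up-projection, so that the adapter branch contributes nothing and the residual connection passes the input through unchanged until training starts updating it:

    import torch.nn as nn

    def init_as_identity(adapter: Adapter) -> None:
        """Zero the up-projection: adapter(x) == x at initialization, because
        only the residual path carries any signal."""
        nn.init.zeros_(adapter.up.weight)
        nn.init.zeros_(adapter.up.bias)

    for adapter in adapters.values():
        init_as_identity(adapter)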

End Notes

  • Thanks to Dr. Nishant Sinha for guiding me throughout the project and helping me to grow in the world of transformers.
  • Thanks to Hugging Face team for building such an awesome library for easy & quick experimentation with transformers.
  • This work was done as part of the OffNote Labs AI Research program. My experience under this program was amazing. I got a chance to read and discuss several papers on Transformers, which helped me dive deeper into the field. This project helped me build a foundation to work on several more complex projects involving Transformers. You can find my other projects here. (twitter, linkedin)

References

  • Simple, Scalable Adaptation for Neural Machine Translation [Paper]
  • Pivot-based Transfer Learning for Neural Machine Translation between Non-English Languages [Paper]
  • Parameter-efficient Transfer Learning for NLP [Paper]
  • Domain-Adaptation of Pre-trained Language Models [Paper]
  • MAD-X: An Adapter-based Framework for Multi-task Cross-lingual Transfer [Paper]
  • AdapterHub: A Framework for Adapting Transformers [Paper] [Code]
  • Investigate Multilingual NMT Representations [Paper]
  • Massively Multilingual Neural Machine Translation [Paper]
  • A study of attention-based Neural Machine Translation models on Indian Languages [Paper]
  • Bhasha dataset [Link]
  • IITB hin-eng parallel dataset corpus [Link]
