Decoding the Impact of Weight Decay on MBart-large-50 for English-Spanish Translation

Drishti Sushma
2 min read · Sep 11, 2023


Introduction

The MBart-large-50 model supports translation across fifty languages. Like any other neural network-based model, however, its training exposes hyperparameters that can be tuned to optimize performance. One such hyperparameter is weight decay, which plays a pivotal role in controlling overfitting and improving generalization. In this study, we aim to understand how varying the weight decay value influences MBart-large-50's performance on English-to-Spanish translation.
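
Concretely, weight decay shrinks every parameter toward zero at each optimizer step, which discourages large weights. Below is a minimal sketch of the decoupled form used by AdamW (the default optimizer in the Hugging Face Trainer), with Adam's moment estimates omitted for clarity; all names are illustrative, not library identifiers.

```python
# Minimal sketch of a decoupled weight decay step (AdamW-style),
# ignoring Adam's moment estimates for clarity.
def step_with_decoupled_weight_decay(param, grad, lr, weight_decay):
    param = param - lr * weight_decay * param  # shrink the weight toward zero
    param = param - lr * grad                  # usual gradient step
    return param
```

With weight_decay = 0 the second line alone applies, which is why sweeping the value from 0 upward isolates the regularization effect.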

Methodology

The MBart-large-50 model was trained for 4 epochs on the "paralel-translation-corpus-in-22-languages" dataset from Kaggle, with the weight decay varied over the set [0, 1e-1, 1e-2, 1e-3, 1e-4]. The main metrics tracked in this experiment were the training loss, validation loss, BLEU score, and ROUGE score. Training time and evaluation time were also recorded to capture any computational differences.
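
As a rough illustration of how the BLEU and ROUGE scores can be computed during evaluation, here is a sketch using the Hugging Face `evaluate` and `transformers` libraries; the function and variable names are illustrative, and the exact metric setup used in the experiment may differ.

```python
import numpy as np
import evaluate
from transformers import MBart50TokenizerFast

# mBART-50 tokenizer configured for English -> Spanish.
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX", tgt_lang="es_XX"
)
bleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    # Labels use -100 for padding; swap it back before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    pred_texts = tokenizer.batch_decode(preds, skip_special_tokens=True)
    label_texts = tokenizer.batch_decode(labels, skip_special_tokens=True)
    bleu_score = bleu.compute(
        predictions=pred_texts, references=[[t] for t in label_texts]
    )["score"]
    rouge_scores = rouge.compute(predictions=pred_texts, references=label_texts)
    return {"bleu": bleu_score, "rougeL": rouge_scores["rougeL"]}
```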

Experimental Setup

  1. All experiments were conducted on an A100 GPU.
  2. Dataset: "paralel-translation-corpus-in-22-languages"
  3. Model: mbart-large-50
  4. BATCH_SIZE = 8
  5. LEARNING_RATE = 1e-5
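
A minimal sketch of how this configuration and the weight decay sweep might be expressed with the `transformers` Seq2SeqTrainer API is shown below. Dataset loading and preprocessing are omitted; `train_dataset`, `eval_dataset`, `tokenizer`, and `compute_metrics` are assumed to come from the preprocessing step and the metric sketch above, and the output directory names are illustrative.

```python
from transformers import (
    MBartForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

# Sweep the weight decay values studied in this experiment.
for weight_decay in [0.0, 1e-1, 1e-2, 1e-3, 1e-4]:
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
    args = Seq2SeqTrainingArguments(
        output_dir=f"mbart50-en-es-wd-{weight_decay}",
        num_train_epochs=4,
        per_device_train_batch_size=8,
        learning_rate=1e-5,
        weight_decay=weight_decay,    # the hyperparameter under study
        evaluation_strategy="epoch",
        predict_with_generate=True,   # generate translations for BLEU/ROUGE
        fp16=True,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,  # assumed: tokenized training split
        eval_dataset=eval_dataset,    # assumed: tokenized validation split
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )
    trainer.train()
```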

Observations and Insights

Results obtained after varying the weight decay across the set [0, 1e-1, 1e-2, 1e-3, 1e-4]:
  1. Varying the weight decay does not significantly affect the metrics (BLEU, ROUGE) across the epochs; the results are close to each other for all values tested.
  2. Training and evaluation times vary somewhat across weight decay values, but the differences are within a few minutes, so they are not a significant factor for choosing one weight decay over another.
  3. All configurations show a slight increase in the metrics from Epoch 1 to Epoch 4.
  4. The BLEU and ROUGE scores plateau or increase slightly by the 4th epoch in all configurations.

Future Work

  1. Exploration of Weight Decay Interplay: Investigating how weight decay interacts with other hyperparameters of the model to understand any synergistic effects.
  2. Different Language Pairs: Examining the impact of weight decay on translation tasks across various language pairs, to see whether its effects remain consistent or vary across languages.
  3. Ensemble Approaches: Exploring whether combining models trained with different weight decay values leads to any improvement in translation quality.

Conclusion

Hyperparameter tuning remains pivotal in optimizing neural models. Weight decay, a mechanism that counteracts overfitting by penalizing large weights, shows only a muted influence on MBart-large-50's English-Spanish translation performance in these experiments.
