Encoder Decoder models in HuggingFace from (almost) scratch

Utkarsh Desai
4 min read · Oct 2, 2020


Transformers have completely changed the way we approach sequence modeling problems in many domains. Variants of and improvements over the original transformer model [1] have been published at a steady rate, among the most famous being BERT [2]. Several Transformer and BERT variants appear every year, and it becomes tedious for researchers to quickly try out new variants on their problems. Even someone who is not interested in the latest variant or a new dataset still needs to train an existing model on common datasets before using it in their research, which can be computationally expensive.

Thankfully, the team and community at Huggingface have developed the transformers library, which provides implementations of several transformer models behind simple-to-use and extensible APIs. They also provide popular models pretrained on commonly used datasets, eliminating a major part of the startup time involved in using transformers. Huggingface has also released a Trainer API to make it easier to train and use their models if none of the pretrained ones work for you.

The Huggingface documentation does provide some examples of how to use their pretrained models in an Encoder-Decoder architecture. However, if you are working with a different dataset or application, you may have to write more than just a few lines. This post walks through the process of training an Encoder-Decoder translation model with Huggingface from scratch, primarily using just the model APIs. The intention is to reduce the reader’s time spent browsing the documentation and to provide a template for training customized Encoder-Decoder models. Familiarity with Encoder-Decoder models and PyTorch is assumed. Dataloaders, the loss criterion, optimizers, etc. are not discussed in detail.

Configuration

To ease experimentation and reproducibility, it is recommended to separate hyperparameters and other constants from the actual code. Everyone has their own way of doing this; I recently adopted the style of storing all configuration as a JSON file and loading it during training. Different experiments can have different JSON files, keeping the code separate from the hyperparameter configurations. This also avoids lengthy command-line arguments to the training job. I use this method throughout this post. The default config JSON file is here. The config can then be loaded easily using:
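A minimal sketch of loading such a config, assuming a hypothetical config.json with hypothetical keys:

```python
import json

# Load hyperparameters and paths from a JSON file (hypothetical path and keys).
with open("config.json", "r") as f:
    config = json.load(f)

# Values are then accessed by key, e.g. config["batch_size"] or config["max_len"].
```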

Dataset

We will be using the Stanford NLP Group’s English-German translation dataset, available here [3]. This is a processed dataset, large enough for transformers to learn something useful but not so large that it requires excessive compute. The dataset consists of over 15M English texts and the corresponding translated German texts. Of course, you can use your own datasets instead. We will be writing our own data loaders using PyTorch, but first we need to train the tokenizers.

Tokenizers

Once we have the config sorted out, we begin by training tokenizers to process English and German texts using Huggingface. The details are well explained in the official documentation and I recommend browsing all available options. For this task, we will train a BertWordPieceTokenizer. The Encoder has its own tokenizer associated with the English texts and the Decoder has a separate tokenizer associated with the German texts. All the parameters can be read from the config file.
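A sketch of the tokenizer training, assuming the file paths and vocabulary size come from the config JSON (the keys below are hypothetical):

```python
from tokenizers import BertWordPieceTokenizer

# English (encoder-side) tokenizer; repeat the same steps on the German files
# for the decoder-side tokenizer.
en_tokenizer = BertWordPieceTokenizer(lowercase=True)
en_tokenizer.train(
    files=[config["en_train_file"]],  # hypothetical config key
    vocab_size=config["en_vocab_size"],
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
en_tokenizer.save_model(config["tokenizer_dir"])
```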

DataLoaders

We can now define the datasets for the training and validation sets by extending the PyTorch Dataset class. Note: we do not have to do this for all Huggingface models if we use the Trainer API. The custom dataset class takes as input the respective EN and DE files, the corresponding tokenizers, and the maximum lengths. Once we have tokenized the texts, we first define __getitem__() to return the corresponding pair of EN and DE texts.
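A sketch of such a dataset class; the class name, signature, and field names here are my own, not necessarily those in the repo:

```python
from torch.utils.data import Dataset

class TranslationDataset(Dataset):
    """Pairs of tokenized English (encoder) and German (decoder) sentences."""

    def __init__(self, en_file, de_file, en_tokenizer, de_tokenizer, en_max_len, de_max_len):
        with open(en_file, encoding="utf-8") as f:
            en_lines = [line.strip() for line in f]
        with open(de_file, encoding="utf-8") as f:
            de_lines = [line.strip() for line in f]

        # Tokenize once up front and truncate to the maximum lengths.
        self.en_ids = [en_tokenizer.encode(t).ids[:en_max_len] for t in en_lines]
        self.de_ids = [de_tokenizer.encode(t).ids[:de_max_len] for t in de_lines]

    def __len__(self):
        return len(self.en_ids)

    def __getitem__(self, idx):
        # Return the corresponding pair of tokenized EN and DE texts.
        return self.en_ids[idx], self.de_ids[idx]
```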

We then define the collate function to perform padding and generate masks for each batch.
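A sketch of a collate function along those lines, assuming token id 0 is the [PAD] token for both tokenizers; the function is passed to torch.utils.data.DataLoader through its collate_fn argument:

```python
import torch

def collate_fn(batch, pad_id=0):
    """Pad every sequence in the batch to the batch maximum and build attention masks."""
    en_seqs, de_seqs = zip(*batch)

    def pad(seqs):
        max_len = max(len(s) for s in seqs)
        ids = torch.full((len(seqs), max_len), pad_id, dtype=torch.long)
        mask = torch.zeros(len(seqs), max_len, dtype=torch.long)
        for i, s in enumerate(seqs):
            ids[i, : len(s)] = torch.tensor(s, dtype=torch.long)
            mask[i, : len(s)] = 1  # 1 for real tokens, 0 for padding
        return ids, mask

    en_ids, en_mask = pad(en_seqs)
    de_ids, de_mask = pad(de_seqs)
    return en_ids, en_mask, de_ids, de_mask
```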

Training

We can now write the training loop. As usual, we create the dataloaders and initialize the BertModel configs. Then we create an instance of the EncoderDecoderModel. Setting is_decoder=True in the decoder config is important; it is what makes the model behave as a decoder. Note that the decoder is a BertForMaskedLM model, since we need the LM head to generate the output sequence. The encoder can be a plain BertModel. You can pretrain the encoder separately as a BertForMaskedLM and then load it as a BertModel, which simply drops the LM head.
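A minimal sketch of the model construction under those assumptions; hyperparameters beyond the vocabulary sizes are left at the BertConfig defaults, and add_cross_attention is only needed on newer transformers versions:

```python
from transformers import BertConfig, BertModel, BertForMaskedLM, EncoderDecoderModel

enc_config = BertConfig(vocab_size=en_tokenizer.get_vocab_size())
dec_config = BertConfig(
    vocab_size=de_tokenizer.get_vocab_size(),
    is_decoder=True,           # run BERT as a decoder (causal self-attention)
    add_cross_attention=True,  # needed on newer transformers versions to attend over the encoder
)

encoder = BertModel(enc_config)
decoder = BertForMaskedLM(dec_config)  # LM head is needed to produce output tokens
model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
```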

The rest of the training process is standard PyTorch code: iterate over batches of data to obtain the model predictions, compute the loss, and update the model parameters.
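A sketch of one training epoch, assuming train_loader is a DataLoader built from the dataset and collate function above; the device placement and validation loop are omitted, and exactly how labels are shifted and masked can vary with the transformers version:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])  # hypothetical config key

model.train()
for en_ids, en_mask, de_ids, de_mask in train_loader:
    outputs = model(
        input_ids=en_ids,
        attention_mask=en_mask,
        decoder_input_ids=de_ids,
        decoder_attention_mask=de_mask,
        labels=de_ids,  # in practice you may want to set padding positions to -100
    )
    loss = outputs[0]  # first element is the LM loss when labels are provided

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```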

The entire working example is available at the GitHub repo.

PS: Very recently, the Trainer API added support for translation tasks. This might make some of the contents of this post obsolete, but if you want finer-grained control over training or can’t upgrade to the latest version of Huggingface, this post might still be helpful.

References:

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[3] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
