Oliver Atanaszov
Jul 29 · 5 min read

This post shows how to fine-tune a powerful Transformer for Sentiment Analysis. Also, it serves as a preliminary to my follow-up post on how to productionalize research code using Docker.

Update: published 🐳 From Research to Production: Containerized Training Jobs

Figure 1: Pretrained language models. Source: http://jalammar.github.io/illustrated-bert/

What are Transformers and why are they useful?

The advent of pretrained language models in 2018 has been a major game changer for the Natural Language Processing (NLP) community, often dubbed NLP’s ImageNet moment. Since Google Brain’s incredible success applying Transformers to Neural Machine Translation, these simple yet immensely powerful neural network architectures have been dominating the field. Papers keep reporting state-of-the-art results, bigger and better models are being published continuously (GPT-2, BERT and XLNet, just to name a few), and clever applications (like Write With Transformer) and useful tools (pytorch-transformers) are being released every week. Despite the hype, Transformers prove to be useful for practical applications as well.

Figure 2: Transformer Fine-Tuning on a downstream task. Here, the fine-tuning task is sentiment analysis of movie reviews. Only the yellow part’s parameters are trained from scratch (0.001 % of total).

Without going into details, the key to their success is scale. A practically unlimited amount of unlabelled text data can be scraped from the web with very little effort. Moreover, the Transformer’s purely feed-forward architecture allows highly parallelized, efficient training on huge datasets, with the objective of simply predicting words based on their context. This unsupervised (or, as it is often called, self-supervised) fitting of massive amounts of data leads to huge models with a previously unseen number of parameters (often billions!). Although pretraining in this massive data regime is very expensive and is obviously not feasible on consumer hardware (nor on typical servers), the good news is that most of these models are published by their authors. Even better, it has been shown empirically that practical downstream tasks (such as text classification, question answering or recognising textual entailment) can be improved significantly by fine-tuning these models on smaller, supervised datasets. This technique is called transfer learning and is exactly what we are going to do: rather than training a model from scratch, we will simply take some pretrained parameters (i.e. millions of floats structured in some way) and use them to initialise our classifier.

Figure 3: Transfer learning. Source: http://ruder.io/transfer-learning/

Usually, transfer learning helps to fit our target task dataset in fewer epochs and achieve competitive results with a smaller number of training samples. We are going to exploit this idea and train our sentiment analyser by fine-tuning a pretrained language model on only 5000 training samples. For details on transfer learning, see this great tutorial from NAACL 2019.


Typically, a Data Science project starts with exploratory data analysis and model prototyping, in other words, cleaning and understanding our dataset, experimenting with different models, architectures and parameters, etc. This highly iterative, tedious work is out of the scope of this post. Rather, we’ll focus on assembling an end-to-end pipeline that accomplishes the task at hand. We start by pip installing some libraries:

pip install pandas tqdm
pip install torch==1.1.0 pytorch-transformers pytorch-ignite


We will use a subset of the IMDB Movie Reviews dataset, containing movie reviews and their sentiments, i.e. whether each review is positive or negative. Let’s download imdb5k.tar.gz and unpack it to DATA_DIR.

Simply read the unzipped .csv files using Pandas and remove HTML tags and extra whitespace.
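The cleaning step can be sketched with the standard library alone. This is a minimal sketch and not necessarily the exact cleaning used for the post: the function name clean_text and the regular expressions are assumptions made for illustration.

```python
import html
import re

# Hypothetical helpers: the patterns below are illustrative assumptions.
TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like an HTML tag
SPACE_RE = re.compile(r"\s+")     # runs of whitespace, including newlines


def clean_text(text: str) -> str:
    """Strip HTML tags, decode HTML entities and collapse whitespace."""
    text = TAG_RE.sub(" ", text)  # replace tags with a space so words don't fuse
    text = html.unescape(text)    # e.g. "&amp;" becomes "&"
    return SPACE_RE.sub(" ", text).strip()


raw = "A great film!<br /><br />Loved   it &amp; would watch again.\n"
cleaned = clean_text(raw)
print(cleaned)
```

With Pandas, the same function could be applied to the text column of each DataFrame, e.g. df["text"].apply(clean_text).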

Now, dataset is a dict of Pandas DataFrames, containing {label, text} pairs:

Figure 4: Training data.

Text processing

Next, we have to encode the texts as sequences of integers so they can be processed by our neural network model. To do this, we have to use the same tokenizer that was used for processing the data at the pretraining stage. We’ll use the amazing pytorch-transformers’ BertTokenizer.

The class TextProcessor wraps this tokenizer and implements the method process_example. It is responsible for encoding each text as a sequence of integers, followed by padding or truncating so that all sequences have the same length (NUM_MAX_POSITIONS). This number is determined by the Transformer we are going to fine-tune and is a hard limit in this case. A special classification token [CLS] is appended to the end of each sequence. This token’s hidden state will be fed to our classification head at fine-tuning. Labels are converted to integers using the label2int mapping.
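To make the truncate, append [CLS], then pad logic concrete, here is a minimal sketch. It uses a toy whitespace tokenizer instead of BertTokenizer so that it runs without downloading anything. The vocabulary, the token ids, the label names and the tiny NUM_MAX_POSITIONS value are all assumptions for illustration, as is the choice to place [CLS] right after the last text token and before any padding.

```python
from typing import Dict, List, Tuple

NUM_MAX_POSITIONS = 8   # toy value; the real limit comes from the pretrained model
PAD_ID, CLS_ID = 0, 1   # hypothetical ids for the padding and [CLS] tokens
UNK_ID = 6              # hypothetical id for out-of-vocabulary words
TOY_VOCAB: Dict[str, int] = {"great": 2, "movie": 3, "boring": 4, "plot": 5}
label2int = {"neg": 0, "pos": 1}   # label names are assumed


def encode(text: str) -> List[int]:
    """Toy whitespace tokenizer standing in for the pretrained tokenizer."""
    return [TOY_VOCAB.get(tok, UNK_ID) for tok in text.lower().split()]


def process_example(label: str, text: str) -> Tuple[List[int], int]:
    """Return a fixed-length id sequence and the integer label."""
    ids = encode(text)[: NUM_MAX_POSITIONS - 1]        # truncate, leaving room for [CLS]
    ids.append(CLS_ID)                                  # classification token after the text
    ids += [PAD_ID] * (NUM_MAX_POSITIONS - len(ids))    # pad up to the fixed length
    return ids, label2int[label]


short_ids, short_label = process_example("pos", "great movie")
long_ids, long_label = process_example("neg", "boring plot " * 10)
print(short_ids, short_label)
print(long_ids, long_label)
```

Every output sequence has exactly NUM_MAX_POSITIONS entries and contains exactly one [CLS] id, which is what lets the classification head later pick out a single hidden state per example.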

Fine-tuning config

Let’s define our fine-tuning config: the number of classes, the dropout probability, the relative size of the validation set, etc.
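Such a config could be sketched as a small frozen dataclass. Only the field meanings (number of classes, dropout probability, relative validation size) come from the text above; the class layout, the extra fields and every default value except num_classes are placeholders assumed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuningConfig:
    """Illustrative fine-tuning config; values are placeholders, not the post's."""
    num_classes: int = 2     # positive / negative
    dropout: float = 0.1     # placeholder dropout probability
    valid_pct: float = 0.1   # placeholder: fraction of training data held out
    batch_size: int = 32     # placeholder
    n_epochs: int = 3        # the post reports results after three epochs

    def __post_init__(self):
        # Fail early on values that cannot be valid probabilities/fractions.
        if not 0.0 < self.valid_pct < 1.0:
            raise ValueError("valid_pct must be in (0, 1)")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")


config = FineTuningConfig()
print(config)
```

Freezing the dataclass keeps the hyperparameters of a run from being mutated by accident halfway through training.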

Data iterators

The function create_dataloaders simply converts all rows in the input DataFrame using our TextProcessor; finally, data iterators are created using PyTorch’s DataLoader.
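A rough sketch of the iterator-building part is shown below. It assumes the texts have already been encoded to fixed-length id lists, so unlike the create_dataloaders described above it takes plain lists rather than a DataFrame. The function name, its arguments and the toy data are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split


def create_dataloaders_sketch(ids, labels, valid_pct=0.1, batch_size=4):
    """Split encoded examples into train/validation sets and wrap them in DataLoaders."""
    dataset = TensorDataset(
        torch.tensor(ids, dtype=torch.long),
        torch.tensor(labels, dtype=torch.long),
    )
    n_valid = int(len(dataset) * valid_pct)
    train_set, valid_set = random_split(dataset, [len(dataset) - n_valid, n_valid])
    train_dl = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    valid_dl = DataLoader(valid_set, batch_size=batch_size, shuffle=False)
    return train_dl, valid_dl


# Toy data: 20 identical, already padded sequences with alternating labels.
ids = [[2, 3, 1, 0]] * 20
labels = [0, 1] * 10
train_dl, valid_dl = create_dataloaders_sketch(ids, labels, valid_pct=0.2)
batch_ids, batch_labels = next(iter(train_dl))
print(batch_ids.shape, batch_labels.shape)
```

Only the training loader is shuffled; keeping the validation order fixed makes evaluation runs easier to compare.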


We adopt the architecture and pretrained weights from NAACL 2019’s tutorial Transfer Learning in NLP. The base model will be an OpenAI GPT-style Transformer, which is extended with a single linear layer with num_classes (here 2) output neurons, forming TransformerWithClfHead:
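The idea of a classification head on top of a pretrained body can be sketched as follows. This is not the tutorial’s TransformerWithClfHead: an nn.Embedding layer stands in for the Transformer body, the tensors are laid out batch-first, and the class name, argument names and sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn


class ClfHeadSketch(nn.Module):
    """A body producing per-token hidden states, plus a single linear layer on top."""

    def __init__(self, body: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.body = body                              # stand-in for the pretrained Transformer
        self.head = nn.Linear(embed_dim, num_classes)  # the only newly initialised part

    def forward(self, ids: torch.Tensor, clf_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.body(ids)                        # (batch, seq_len, embed_dim)
        # Keep only the hidden state at the [CLS] position of each sequence.
        clf_state = (hidden * clf_mask.unsqueeze(-1).float()).sum(dim=1)
        return self.head(clf_state)                    # (batch, num_classes)


body = nn.Embedding(100, 16)            # toy body: 100-token vocab, 16-dim states
model = ClfHeadSketch(body, embed_dim=16, num_classes=2)
ids = torch.randint(0, 100, (4, 8))     # 4 sequences of length 8
clf_mask = torch.zeros(4, 8, dtype=torch.bool)
clf_mask[:, 5] = True                   # pretend [CLS] sits at position 5
logits = model(ids, clf_mask)
print(logits.shape)
```

Masking and summing selects exactly one hidden state per example, so the head sees a (batch, embed_dim) tensor regardless of where [CLS] falls in each sequence.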

The next step downloads the pretrained weights (a state_dict obtained by fitting a Transformer with a language modelling head on wikitext-103) and the pretraining config, instantiates our TransformerWithClfHead model and loads the weights into it.

The resulting model has approx. 50 million parameters, which in the Transformer era is considered a rather moderate model size. Of these, only embed_dim * num_classes (820) will be trained from scratch.
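The numbers above can be checked with a few lines of arithmetic. The embedding size is not stated explicitly in the text; it is inferred here from 820 = embed_dim * num_classes, and the total is the approximate 50 million quoted above.

```python
num_classes = 2
head_params = 820                        # newly initialised weights, as quoted above
embed_dim = head_params // num_classes   # inferred, not stated in the post
total_params = 50_000_000                # approximate figure from the post

fraction_pct = 100 * head_params / total_params
print(f"embed_dim = {embed_dim}")
print(f"share trained from scratch = {fraction_pct:.5f} %")
```

The resulting share is on the order of a thousandth of a percent, in line with the caption of Figure 2. Note that the count covers the weight matrix only; if the linear layer also has a bias term, that adds num_classes more parameters.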

Prepare fine-tuning and evaluation loops

For compact yet full-featured training and evaluation, pytorch-ignite is extremely convenient. Put simply, our trainer and evaluator Engines let us call our update functions on each batch of an iterator. We attach accuracy as our main metric and evaluate on our validation set after each epoch. A custom learning rate scheduler, a progress bar and model checkpointing are also added to the trainer.
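The text does not spell out what the custom learning rate scheduler looks like. A common choice when fine-tuning Transformers is a linear warm-up followed by a linear decay, and the sketch below shows that shape in plain Python. The function name, the milestone steps and the peak learning rate are all assumptions, not values from the post.

```python
from bisect import bisect_right
from typing import List, Tuple


def piecewise_linear(step: int, milestones: List[Tuple[int, float]]) -> float:
    """Linearly interpolate a value between (step, value) milestones."""
    steps = [s for s, _ in milestones]
    if step <= steps[0]:
        return milestones[0][1]          # before the first milestone
    if step >= steps[-1]:
        return milestones[-1][1]         # after the last milestone
    i = bisect_right(steps, step)        # index of the first milestone past `step`
    (s0, v0), (s1, v1) = milestones[i - 1], milestones[i]
    return v0 + (v1 - v0) * (step - s0) / (s1 - s0)


# Hypothetical schedule: warm up to 1e-3 over 10 steps, then decay to 0 by step 100.
schedule = [(0, 0.0), (10, 1e-3), (100, 0.0)]
lrs = [piecewise_linear(s, schedule) for s in (0, 5, 10, 55, 100)]
print(lrs)
```

pytorch-ignite ships a PiecewiseLinear parameter scheduler built on the same idea, which can be attached to the trainer as an event handler.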

Finally, let’s fine-tune and evaluate our model:

Figure 5: Fine-tuning on 5000 training samples.

Our model reaches ~90% test accuracy after only three epochs in a few minutes on a single GPU, which is not bad given that we only used 5000 training examples.

Let’s suppose that this performance level meets our expectations and we are happy with this model. Check out how to encapsulate this procedure in a self-contained Docker container in my follow-up post From Research to Production: Containerized Training Jobs!

🐳 Follow-up post: From Research to Production: Containerized Training Jobs
📓 End-to-end fine-tuning code can be found in this Jupyter Notebook

The Startup

Medium's largest active publication, followed by +489K people. Follow to join our community.

Written by Oliver Atanaszov

Studying Computational Neuroscience & Data Science at SAP https://ben0it8.github.io/
