Transformer Fine-Tuning for Sentiment Analysis
This post shows how to fine-tune a powerful Transformer for sentiment analysis. It also serves as a prelude to my follow-up post on how to productionize research code using Docker.
Update: published 🐳 From Research to Production: Containerized Training Jobs
What are Transformers and why are they useful?
The advent of pretrained language models in 2018 was a major game changer for the Natural Language Processing (NLP) community, often dubbed NLP's ImageNet moment. Since Google Brain's incredible success in applying Transformers to Neural Machine Translation, these simple yet immensely powerful neural network architectures have been dominating the field. Papers reporting state-of-the-art results with bigger and better models (GPT-2, BERT and XLNet, just to name a few) are being published continuously, while clever applications (like Write With Transformer) and useful tools (pytorch-transformers) are released every week. Despite the hype, Transformers prove to be useful for practical applications as well.
Without going into details, the key to their success is scale. A practically unlimited amount of unlabelled text can be scraped from the web with very little effort. Moreover, the Transformer's non-recurrent architecture allows highly parallelized, efficient training on huge datasets, with the objective of simply predicting words based on their context. This unsupervised, or as it is often called, self-supervised, training on massive amounts of data leads to huge models with a previously unseen number of parameters (often billions!). Although pretraining in this massive data regime is very expensive and obviously not feasible on consumer hardware (nor on typical servers), the good news is that most of these models are published by their authors. Even better, it has been shown empirically that practical downstream tasks (such as text classification, question answering or recognising textual entailment) can be improved significantly by fine-tuning these models on smaller but supervised datasets. This technique is called transfer learning, and it is exactly what we are going to do: rather than training a model from scratch, we simply take the pretrained parameters (i.e. millions of floats structured in a particular way) and use them to initialise our classifier.
Usually, transfer learning helps us fit the target-task dataset in fewer epochs and achieve competitive results with a smaller number of training samples. We are going to exploit this idea and train our sentiment analyser by fine-tuning a pretrained language model on only 5000 training samples. For details on transfer learning, see this great tutorial from NAACL 2019.
Modeling
Typically, a Data Science project starts with exploratory data analysis and model prototyping: cleaning and understanding the dataset, experimenting with different models, architectures and parameters, and so on. This highly iterative, tedious work is out of the scope of this post. Instead, we'll focus on assembling an end-to-end pipeline that accomplishes the task at hand. We start by pip-installing some libraries:
pip install pandas tqdm
pip install torch==1.1.0 pytorch-transformers pytorch-ignite
Data
We will use a subset of the IMDB Movie Reviews dataset, which contains movie reviews labelled with their sentiment, i.e. whether the review is positive or negative. Let's download imdb5k.tar.gz and unpack it to DATA_DIR.
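The original download link is not reproduced in this excerpt, so the sketch below uses a placeholder DATASET_URL; apart from that it is a minimal download-and-extract step using only the standard library:

import os
import tarfile
import urllib.request

DATA_DIR = "data"  # directory the archive will be extracted into
DATASET_URL = "https://example.com/imdb5k.tar.gz"  # placeholder, substitute the actual download link

os.makedirs(DATA_DIR, exist_ok=True)
archive_path = os.path.join(DATA_DIR, "imdb5k.tar.gz")

# download the archive once, then unpack it into DATA_DIR
if not os.path.exists(archive_path):
    urllib.request.urlretrieve(DATASET_URL, archive_path)
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(DATA_DIR)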
We simply read the unzipped .csv files with Pandas and strip HTML tags and extra whitespace. The resulting dataset is a dict of Pandas DataFrames containing {label, text} pairs.
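A minimal version of this step could look like the sketch below; the exact file names and column layout of the archive (train.csv and test.csv, each with label and text columns) are assumptions:

import os
import re
import pandas as pd

def clean_text(text):
    """Strip HTML tags (e.g. <br />) and collapse repeated whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

DATA_DIR = "data"  # same directory the archive was extracted into

# assumed layout: one CSV per split, each with 'label' and 'text' columns
dataset = {
    split: pd.read_csv(os.path.join(DATA_DIR, f"{split}.csv"))
    for split in ["train", "test"]
}
for df in dataset.values():
    df["text"] = df["text"].apply(clean_text)

print(dataset["train"].head())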
Text processing
Next, we have to encode the texts as sequences of integers so they can be processed by our neural network. To do this, we have to use the same tokenizer that was used at the pretraining stage. We'll use the BertTokenizer from the amazing pytorch-transformers library.
The class TextProcessor wraps this tokenizer and implements the method process_example, which is responsible for encoding a text as a sequence of integers and padding or truncating it so that all examples have the same length (NUM_MAX_POSITIONS). This number is determined by the Transformer we are going to fine-tune and is a hard limit in this case. A special classification token [CLS] is appended to the end of each sequence; this token's hidden state will be fed to our classification head during fine-tuning. Labels are converted to integers using the label2int mapping.
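A sketch of such a processor is shown below. The tokenizer calls (BertTokenizer.from_pretrained, tokenize, convert_tokens_to_ids) are the real pytorch-transformers API, while the value of NUM_MAX_POSITIONS, the label2int mapping and the internal details of the class are assumptions that may differ from the author's implementation:

from pytorch_transformers import BertTokenizer

NUM_MAX_POSITIONS = 256              # hard limit of the pretrained Transformer (value assumed)
label2int = {"neg": 0, "pos": 1}

class TextProcessor:
    def __init__(self, tokenizer, label2int, num_max_positions=NUM_MAX_POSITIONS):
        self.tokenizer = tokenizer
        self.label2int = label2int
        self.num_max_positions = num_max_positions
        # ids of the special classification and padding tokens
        self.clf_token_id = tokenizer.vocab["[CLS]"]
        self.pad_token_id = tokenizer.vocab["[PAD]"]

    def process_example(self, example):
        """Convert a (label, text) pair into (fixed-length token ids, label id)."""
        label, text = example
        tokens = self.tokenizer.tokenize(text)
        # truncate, keeping room for the classification token at the end
        ids = self.tokenizer.convert_tokens_to_ids(tokens[: self.num_max_positions - 1])
        ids = ids + [self.clf_token_id]
        # pad to a fixed length so that examples can be batched
        ids = ids + [self.pad_token_id] * (self.num_max_positions - len(ids))
        return ids, self.label2int[label]

tokenizer = BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=False)
processor = TextProcessor(tokenizer, label2int)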
Fine-tuning config
Let's define our fine-tuning config: the number of classes, the dropout probability, the relative size of the validation set, etc.
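For illustration, the config could be a simple named tuple like the one below; the field names and the concrete hyperparameter values are assumptions rather than the author's exact settings:

import torch
from collections import namedtuple

FineTuningConfig = namedtuple("FineTuningConfig",
                              ["num_classes", "dropout", "init_range", "batch_size", "lr",
                               "max_norm", "n_epochs", "valid_pct", "device", "log_dir"])

finetuning_config = FineTuningConfig(
    num_classes=2, dropout=0.1, init_range=0.02, batch_size=32, lr=6.5e-5,
    max_norm=1.0, n_epochs=3, valid_pct=0.1,
    device="cuda" if torch.cuda.is_available() else "cpu",
    log_dir="./logs")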
Data iterators
The helper create_dataloaders simply converts all rows of the input DataFrame using our TextProcessor and finally creates data iterators with PyTorch's DataLoader.
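A possible implementation, assuming the processor and finetuning_config defined above, might look like this (the author's actual helper may be organised differently):

import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

def create_dataloaders(df, processor, batch_size, valid_pct=None, shuffle=True):
    """Encode every row of `df` with `processor` and wrap the tensors in DataLoaders."""
    encoded = [processor.process_example((label, text))
               for label, text in zip(df["label"], df["text"])]
    all_ids = torch.tensor([ids for ids, _ in encoded], dtype=torch.long)
    all_labels = torch.tensor([lbl for _, lbl in encoded], dtype=torch.long)
    tensor_ds = TensorDataset(all_ids, all_labels)

    if valid_pct is not None:
        # hold out a validation set
        valid_size = int(valid_pct * len(tensor_ds))
        train_ds, valid_ds = random_split(tensor_ds, [len(tensor_ds) - valid_size, valid_size])
        return (DataLoader(train_ds, batch_size=batch_size, shuffle=shuffle),
                DataLoader(valid_ds, batch_size=batch_size, shuffle=False))
    return DataLoader(tensor_ds, batch_size=batch_size, shuffle=shuffle)

train_dl, valid_dl = create_dataloaders(dataset["train"], processor,
                                        finetuning_config.batch_size,
                                        valid_pct=finetuning_config.valid_pct)
test_dl = create_dataloaders(dataset["test"], processor,
                             finetuning_config.batch_size, shuffle=False)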
Model
We adopt the architecture and pretrained weights from the NAACL 2019 tutorial Transfer Learning in NLP. The base model is an OpenAI GPT-style Transformer, which we extend with a single linear layer with num_classes (here 2) output neurons, forming TransformerWithClfHead.
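A simplified sketch of the classification model is given below. The base Transformer class comes from the tutorial's code (imported here under an assumed module path), and both its constructor arguments and the exact way the [CLS] hidden state is selected may differ in the original implementation:

import torch.nn as nn
from models import Transformer  # GPT-style base model from the tutorial repo (assumed module path)

class TransformerWithClfHead(nn.Module):
    """Pretrained GPT-style Transformer body plus a single linear classification head."""

    def __init__(self, config, fine_tuning_config):
        super().__init__()
        self.transformer = Transformer(config)            # pretrained body (sketched call)
        self.dropout = nn.Dropout(fine_tuning_config.dropout)
        # the only freshly initialised parameters: embed_dim * num_classes weights
        self.classification_head = nn.Linear(config.embed_dim,
                                             fine_tuning_config.num_classes, bias=False)

    def forward(self, x, clf_tokens_mask, labels=None):
        hidden = self.transformer(x)                      # assumed shape: (seq_len, batch, embed_dim)
        # keep only the hidden state at the [CLS] position of each sequence
        clf_hidden = (hidden * clf_tokens_mask.unsqueeze(-1).float()).sum(dim=0)
        logits = self.classification_head(self.dropout(clf_hidden))
        if labels is not None:
            loss = nn.CrossEntropyLoss()(logits, labels)
            return logits, loss
        return logits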
Next, we download the pretrained weights state_dict (obtained by fitting a Transformer with a language-modelling head on wikitext-103) together with the pretraining config, instantiate our TransformerWithClfHead model and load the weights into it.
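A sketch of this step is shown below with placeholder URLs (the real checkpoint and config links are given in the tutorial); loading with strict=False is used because the checkpoint contains a language-modelling head we do not need, while our classification head is new:

import torch
from urllib.request import urlretrieve

# placeholder URLs: substitute the actual links from the tutorial
PRETRAINED_MODEL_URL = "https://example.com/model_checkpoint.pth"
PRETRAINED_CONFIG_URL = "https://example.com/model_training_args.bin"

urlretrieve(PRETRAINED_MODEL_URL, "model_checkpoint.pth")
urlretrieve(PRETRAINED_CONFIG_URL, "model_training_args.bin")

state_dict = torch.load("model_checkpoint.pth", map_location="cpu")
pretraining_config = torch.load("model_training_args.bin")

model = TransformerWithClfHead(pretraining_config, finetuning_config)

# strict=False: ignore the checkpoint's LM head and our missing classification head;
# depending on how the checkpoint was saved, its keys may need a prefix adjustment first
model.load_state_dict(state_dict, strict=False)
model.to(torch.device(finetuning_config.device))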
The resulting model has approx. 50 million parameters, of which only embed_dim * num_classes (820) will be trained from scratch (in the Transformer era, this is considered a rather moderate model size).
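These numbers can be checked quickly once the model is instantiated (the exact total depends on the pretraining config):

total_params = sum(p.numel() for p in model.parameters())
new_params = sum(p.numel() for p in model.classification_head.parameters())
print(f"total parameters: {total_params:,}, trained from scratch: {new_params}")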
Prepare fine-tuning and evaluation loops
For compact yet full-featured training and evaluation, pytorch-ignite is extremely convenient. Our trainer and evaluator Engines simply call our update and inference functions on each batch of an iterator. We attach accuracy as our main metric and evaluate on the validation set after each epoch. A custom learning rate scheduler, a progress bar and model checkpointing are also added to the trainer.
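The sketch below wires these pieces together with standard pytorch-ignite building blocks (Engine, Accuracy, PiecewiseLinear, ProgressBar, ModelCheckpoint); the optimiser choice, the learning-rate schedule and other details are assumptions rather than the author's exact setup:

import torch
from ignite.engine import Engine, Events
from ignite.metrics import Accuracy
from ignite.contrib.handlers import PiecewiseLinear, ProgressBar
from ignite.handlers import ModelCheckpoint

optimizer = torch.optim.Adam(model.parameters(), lr=finetuning_config.lr)
device = torch.device(finetuning_config.device)

def update(engine, batch):
    """One optimisation step on a batch of (token ids, labels)."""
    model.train()
    inputs, labels = (t.to(device) for t in batch)
    inputs = inputs.transpose(0, 1).contiguous()      # the sketched model expects (seq_len, batch)
    _, loss = model(inputs,
                    clf_tokens_mask=(inputs == processor.clf_token_id),
                    labels=labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), finetuning_config.max_norm)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

def inference(engine, batch):
    """Forward pass only; returns (logits, labels) for the Accuracy metric."""
    model.eval()
    with torch.no_grad():
        inputs, labels = (t.to(device) for t in batch)
        inputs = inputs.transpose(0, 1).contiguous()
        logits = model(inputs, clf_tokens_mask=(inputs == processor.clf_token_id))
    return logits, labels

trainer = Engine(update)
evaluator = Engine(inference)
Accuracy().attach(evaluator, "accuracy")

# evaluate on the validation set at the end of every epoch
@trainer.on(Events.EPOCH_COMPLETED)
def validate(engine):
    evaluator.run(valid_dl)
    print(f"validation accuracy: {evaluator.state.metrics['accuracy']:.4f}")

# custom schedule: linear warm-up followed by linear decay to zero
scheduler = PiecewiseLinear(optimizer, "lr",
                            [(0, 0.0), (len(train_dl), finetuning_config.lr),
                             (finetuning_config.n_epochs * len(train_dl), 0.0)])
trainer.add_event_handler(Events.ITERATION_STARTED, scheduler)

ProgressBar().attach(trainer, output_transform=lambda loss: {"loss": loss})

checkpoint_handler = ModelCheckpoint("./checkpoints", "finetune",
                                     save_interval=1, n_saved=1, require_empty=False)
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {"model": model})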
Finally, let's fine-tune and evaluate our model.
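With the Engines defined above, this boils down to:

# fine-tune for n_epochs, then measure accuracy on the held-out test set
trainer.run(train_dl, max_epochs=finetuning_config.n_epochs)

evaluator.run(test_dl)
print(f"test accuracy: {evaluator.state.metrics['accuracy']:.4f}")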
Our model reaches ~90% test accuracy after only three epochs in a few minutes on a single GPU, which is not bad given that we only used 5000 training examples.
Let's suppose that this performance level meets our expectations and we are happy with the model. Check out how to encapsulate this procedure in a self-contained Docker container in my follow-up post From Research to Production: Containerized Training Jobs!
🐳 Follow-up post: From Research to Production: Containerized Training Jobs
End-to-end fine-tuning code can be found in this Jupyter Notebook