NLP Deep Learning Training on Downstream tasks using Pytorch Lightning — Intro — Part 1 of 7

Narayana Swamy · Published in CodeX · 4 min read · Jul 23, 2021

Large Transformer based Language models like Bert, GPT, Marian, T5 etc. are developed and trained to have a statistical understanding of the language/text corpus they have been trained on. They are trained in a self-supervised fashion (without human labeling of data) using techniques like masked token prediction, next sentence prediction etc. These models are not very useful for specific practical NLP tasks until they go through a process called transfer learning. During transfer learning, these models are fine tuned in a supervised way on a given task by adding a Head (consisting of a few neural layers such as linear, dropout, ReLU etc.) to the particular pre-trained Language Model (it is not clear why this is called the Head rather than the Tail, since the supervised layers are added to the bottom of the pre-trained Model). This series is about using the Pytorch Lightning framework to fine tune Language models for the different NLP-specific tasks.
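To make the idea of a Head concrete, here is a minimal sketch using the HuggingFace transformers library with DistilBert as the backbone. The class name and layer choices are illustrative only, not the exact code used in the notebooks later in the series:

import torch.nn as nn
from transformers import AutoModel

class TransformerWithHead(nn.Module):
    def __init__(self, model_name="distilbert-base-uncased", num_labels=2, dropout=0.1):
        super().__init__()
        # Pre-trained Language Model (the "body")
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden_size = self.backbone.config.hidden_size
        # Task-specific "Head": a few supervised layers trained during fine tuning
        self.head = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0]  # embedding of the first ([CLS]) token
        return self.head(cls_embedding)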

All the major maintainers of pre-trained NLP models like HuggingFace, FastAI, SparkNLP etc. have Trainer APIs to fine tune the Language models using published datasets or your own labeled dataset. An example Trainer API call from HuggingFace for fine-tuning Bert on the IMDB dataset looks like this:

python run_glue.py \
--model_name_or_path bert-base-cased \
--dataset_name imdb \
--do_train \
--do_predict \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--output_dir /tmp/imdb/

Though the API may seem to make it easy to fine tune the Language models for specific tasks by providing a one-liner command with config details, you give up some control over fine tuning: changing the architecture of the Head of the Language model (for example, adding another Dropout layer or another Dense layer), using a different Learning rate scheduler, using a different loss function (perhaps a weighted loss function for unbalanced data, as sketched below), or using a different metric to measure model performance. You could in theory modify the python script behind the API to accommodate your needs, but that is not easy, as most of these scripts are poorly commented. The motivation for writing this series is to showcase a better way to fine tune the Language models using the Pytorch Lightning framework in an organized fashion. This series is geared towards more advanced practitioners, researchers and students who want better control of the NLP Training process.
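As one example of the kind of control this series is after, a weighted loss for unbalanced data becomes a one-line swap once you own the training code. The weights below are made-up numbers for illustration:

import torch
import torch.nn as nn

# Hypothetical example: the negative class is 4x more frequent than the positive class,
# so the positive class gets 4x the weight in the loss
class_weights = torch.tensor([1.0, 4.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

# Inside a training step: logits has shape (batch_size, num_labels), labels has shape (batch_size,)
# loss = loss_fn(logits, labels)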

The series will showcase the fine tuning of the following downstream tasks:

(Table: Downstream Tasks and the pre-trained Model used for each)

All the Pytorch Lightning Colab Notebooks are organized into the Sections shown below; a condensed code skeleton of this layout follows the list. The post for each Part of the Series will make comments on each Section that are relevant to the particular Task being trained. Certain comments will be repeated in each Part of the Series so that each Part stands on its own in terms of completeness.

  1. Download and Import the Libraries
  2. Download the Data
  3. Define the Pre-Trained Model
  4. Define the Pre-Process function or Dataset Class
  5. Define the DataModule Class
  6. Define the Model Class
  7. Define the Pytorch Lightning Module Class
  8. Define the Trainer Parameters
  9. Train the Model
  10. Evaluate Model Performance
  11. Run Inference on the Trained Model
  12. Open TensorBoard Logs
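
As a preview, here is a condensed skeleton of how Sections 5 through 9 map onto Pytorch Lightning classes. The class and variable names are illustrative only, not the exact code from the notebooks:

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader

class TaskDataModule(pl.LightningDataModule):               # Section 5
    def __init__(self, train_ds, val_ds, batch_size=32):
        super().__init__()
        self.train_ds, self.val_ds, self.batch_size = train_ds, val_ds, batch_size

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_ds, batch_size=self.batch_size)

class TaskLightningModule(pl.LightningModule):              # Sections 6 and 7
    def __init__(self, model, lr=2e-5):
        super().__init__()
        self.model, self.lr = model, lr
        self.loss_fn = torch.nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        logits = self.model(batch["input_ids"], batch["attention_mask"])
        loss = self.loss_fn(logits, batch["labels"])
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# Sections 8 and 9: define the Trainer parameters and train
# trainer = pl.Trainer(max_epochs=3, gpus=1)
# trainer.fit(TaskLightningModule(model), datamodule=TaskDataModule(train_ds, val_ds))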

The Pytorch Lightning framework provides a more manageable way to organize the code around Training or Fine Tuning pre-trained Models. It abstracts away a lot of the grunt work around setting up Training and Validation loops, moving the model/data to CUDA etc. and lets the ML data scientist focus on the important aspects of Training or Fine Tuning a model. The organization of the code into separate classes/sections makes it more readable and understandable. Pytorch Lightning has also made a lot of updates in the past 12 months that make it more flexible. For example, to use a scheduler that changes the learning rate at every training step, you previously had to place scheduler-specific code in the training_step and training_epoch_end functions, but that is no longer necessary (see the sketch below).
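A per-step scheduler can now be declared entirely inside configure_optimizers by setting its interval to "step". This is a hedged sketch: get_linear_schedule_with_warmup comes from the transformers library, and the number of training steps is a placeholder you would compute from your data size, batch size and epochs:

import torch
from transformers import get_linear_schedule_with_warmup

# Method of a pl.LightningModule
def configure_optimizers(self):
    optimizer = torch.optim.AdamW(self.parameters(), lr=2e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=10000,  # placeholder: total steps = (train size / batch size) * epochs
    )
    # "interval": "step" tells Lightning to call scheduler.step() after every training batch
    return {
        "optimizer": optimizer,
        "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
    }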

Go to the 2nd Part of this Series, where we fine tune a DistilBert model for binary classification of IMDB movie review data. The different Parts can be accessed here:

  1. Part 2 — IMDB movies review binary classification
  2. Part 3 — Named Entity or Token Recognition on CoNLL 2003 data
  3. Part 4 — Multiple Choice Answering on Swag data
  4. Part 5 — Question Answering on SqUAD 1.1 data
  5. Part 6 — Summarization on XSum data
  6. Part 7 — Translation on WMT16 English to Romanian data

One thing to note is that all the Code Examples in this Series are in Colab Notebooks, since the GPU there is free. That is fine for demonstration purposes, but it is not practical to run Research experiments or Production training on Colab notebooks, as the NLP tasks will certainly require multiple GPUs to use the larger Transformer models. I will be converting the code soon to Python scripts that can be run on AWS Sagemaker or Azure ML and will update this Intro once that is done.

Narayana Swamy, writer for CodeX. Over 17 years of diverse global experience in Data Science, Finance and Operations. Passionate about using data to unlock Business value.