NLP Deep Learning Training on Downstream tasks using Pytorch Lightning — IMDB Classification — Part 2 of 7

Narayana Swamy · Published in CodeX · Jul 23, 2021 · 4 min read

This is Part 2, a continuation of the series. Please see the Intro article here, which explains the motivation for the series. As mentioned in the Intro, we will walk through the sections of the IMDB Classifier Colab notebook and comment on each one in turn.

  1. Download and Import the Libraries — Nothing unusual here; just install and import the standard PyTorch and PyTorch Lightning libraries (a minimal setup and data-download sketch for steps 1 and 2 follows this list).
  2. Download the Data — The IMDB dataset is available, already pre-processed, from the Hugging Face Datasets library. For demonstration purposes, however, the notebook downloads the data from the public repository and prepares it in the format needed for training. The data has 25,000 training and 25,000 test samples.
  3. Define the Pre-Trained Model — The pre-trained model used here is the DistilBERT base model, a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT’s performance. Once training succeeds with this model, other pre-trained models can be tried by changing the model_checkpoint variable.
  4. Define the Pre-Process Function or Dataset Class — Here we define IMDBDataset, a class that inherits from the PyTorch Dataset class and produces the train, validation, and test data in the format the DataLoader needs; PyTorch uses the DataLoader class to assemble the data into mini-batches. The text is tokenized in this class using the pre-trained tokenizer (steps 3 and 4 are sketched together after this list).
  5. Define the DataModule Class — This is a PyTorch Lightning class that contains all the code needed to prepare the mini-batches of data using the DataLoaders. At the start of training, the Trainer class calls the prepare_data and setup functions first. The prepare_data function is where the target is defined, relying on the fact that the reviews are arranged so that the first 12,500 are positive and the remaining 12,500 are negative. The training data is split 75/25 into train and validation sets. There is also a collate function that pads each mini-batch: BERT-like models require all inputs in a mini-batch to have the same length, and instead of padding every example to the longest sequence in the entire dataset, the collate function pads only to the longest sequence within that mini-batch. This gives faster training and lower memory usage (a DataModule sketch with such a collate function appears after this list).
  6. Define the Model Class — The forward function of the deep learning model is defined here. The output of the last hidden layer is taken from the BERT model, and its first element, the CLS token, is sent through Linear, ReLU, and Dropout layers before a final Linear layer with 2 outputs for binary classification. The output of the CLS token is considered to represent the meaning of the entire sentence. A get_outputs function is also defined in case we want the token embeddings of the last hidden layer from the fine-tuned model for other downstream applications (a model-class sketch appears after this list).
  7. Define the PyTorch Lightning Module Class — This is where the training, validation, and test step functions are defined, and where the model loss and accuracy are calculated. The optimizers and schedulers are defined here as well; more than one of each can be defined and then used or switched within the step functions based on some criterion. For this simple case, we use a single Adam optimizer and a OneCycleLR scheduler (a LightningModule sketch appears after this list).
  8. Define the Trainer Parameters — All the required Trainer parameters and Trainer callbacks are defined here. We use three callbacks: EarlyStopping, LearningRateMonitor, and ModelCheckpoint. Instead of defining the parameters with argparse, recent PyTorch Lightning releases allow them to be defined in a .yaml file that can be passed as an argument to a .py script in a CLI run, keeping the Trainer parameters separate from the training code. Since we are using a Colab notebook for demo purposes, we stick with the argparse approach.
  9. Train the Model — This is done with the Trainer.fit() method. A profiler can be set in the Trainer parameters to report more detail on where the training run spends its time (steps 8 and 9 are sketched together after this list).
  10. Evaluate Model Performance — Binary or multi-class classification is one of the easiest downstream tasks to train a language model for. We get better than 92.6% accuracy on the IMDB test dataset after just one epoch, and it improves further when the BERT base model is used instead of DistilBERT. The SOTA on IMDB classification is 97.4%, achieved in 2019, but that result used additional training data; DistilBERT reaches 92.8% with additional training data, and BERT-Large reaches 95.49% with additional training data.
  11. Run Inference on the Trained Model — Send a sample batch of text to the model using the predict method to get predictions from the trained model. This can be used to build an ML inference pipeline (an inference sketch appears after this list).
  12. TensorBoard Logs Data — This opens TensorBoard within the Colab notebook and lets you browse the various training logs. PyTorch Lightning logs to TensorBoard by default; this can be changed by passing a different logger to the Trainer (a short TensorBoard sketch appears after this list).
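
The sketches below are not the notebook’s exact code; they are minimal, hedged reconstructions of the steps described above, and names such as IMDBDataset, IMDBDataModule, and IMDBClassifier are assumed to match the notebook only loosely. First, a setup and data-download sketch for steps 1 and 2; the aclImdb URL and archive layout are those of the standard public release.

```python
# Steps 1-2 (sketch): install the libraries, then fetch and extract the raw IMDB data.
# !pip install torch pytorch-lightning transformers

import os
import tarfile
import urllib.request

# Public Stanford release of the IMDB reviews: 25,000 train / 25,000 test samples.
IMDB_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

def download_imdb(root: str = "data") -> str:
    """Download and extract the aclImdb archive if it is not already present."""
    os.makedirs(root, exist_ok=True)
    archive = os.path.join(root, "aclImdb_v1.tar.gz")
    if not os.path.isdir(os.path.join(root, "aclImdb")):
        urllib.request.urlretrieve(IMDB_URL, archive)
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(root)
    return os.path.join(root, "aclImdb")

data_dir = download_imdb()
```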
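
Next, a sketch for steps 3 and 4: the model_checkpoint variable and a Dataset class that tokenizes the raw reviews. The max_length value and the exact field names are assumptions, not the notebook’s settings.

```python
from torch.utils.data import Dataset
from transformers import AutoTokenizer

# Step 3 (sketch): swap in any other Hugging Face checkpoint by changing this variable.
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Step 4 (sketch): a PyTorch Dataset that tokenizes review text with the pre-trained tokenizer.
class IMDBDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # No padding here: the DataModule's collate function pads each mini-batch
        # only to the longest sequence within that mini-batch.
        encoding = self.tokenizer(self.texts[idx], truncation=True, max_length=self.max_length)
        item = dict(encoding)
        item["labels"] = self.labels[idx]
        return item
```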
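
A sketch of the step 5 DataModule. The notebook derives the targets from the ordering of the reviews inside prepare_data; this self-contained sketch instead reads labels from the pos/neg folder layout of the extracted archive, so treat the reading logic as an assumption. The 75/25 train/validation split and the per-batch padding in the collate function follow the description above.

```python
import os
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split

class IMDBDataModule(pl.LightningDataModule):
    def __init__(self, data_dir, tokenizer, batch_size=32):
        super().__init__()
        self.data_dir = data_dir
        self.tokenizer = tokenizer
        self.batch_size = batch_size

    @staticmethod
    def _read_split(split_dir):
        # Assumption: labels come from the pos/neg folders of the public archive.
        texts, labels = [], []
        for folder, label in (("pos", 1), ("neg", 0)):
            d = os.path.join(split_dir, folder)
            for fname in sorted(os.listdir(d)):
                with open(os.path.join(d, fname), encoding="utf-8") as f:
                    texts.append(f.read())
                labels.append(label)
        return texts, labels

    def setup(self, stage=None):
        train_texts, train_labels = self._read_split(os.path.join(self.data_dir, "train"))
        test_texts, test_labels = self._read_split(os.path.join(self.data_dir, "test"))
        full_train = IMDBDataset(train_texts, train_labels, self.tokenizer)
        n_train = int(0.75 * len(full_train))  # 75/25 train/validation split
        self.train_ds, self.val_ds = random_split(full_train, [n_train, len(full_train) - n_train])
        self.test_ds = IMDBDataset(test_texts, test_labels, self.tokenizer)

    def collate(self, batch):
        # Pad only to the longest sequence within this mini-batch, not the whole dataset.
        labels = torch.tensor([item.pop("labels") for item in batch])
        padded = self.tokenizer.pad(batch, return_tensors="pt")
        padded["labels"] = labels
        return padded

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=self.batch_size, shuffle=True, collate_fn=self.collate)

    def val_dataloader(self):
        return DataLoader(self.val_ds, batch_size=self.batch_size, collate_fn=self.collate)

    def test_dataloader(self):
        return DataLoader(self.test_ds, batch_size=self.batch_size, collate_fn=self.collate)
```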
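
A sketch of the step 6 model class. The head width and dropout probability are assumptions; the structure (last hidden layer, CLS token, then Linear, ReLU, Dropout, and a final Linear with 2 outputs) follows the description, and get_outputs exposes the token embeddings of the last hidden layer.

```python
import torch.nn as nn
from transformers import AutoModel

class IMDBClassifier(nn.Module):
    """Step 6 (sketch): a small classification head on top of the CLS token embedding."""

    def __init__(self, model_checkpoint="distilbert-base-uncased", hidden_size=768, num_classes=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_checkpoint)
        self.head = nn.Sequential(
            nn.Linear(hidden_size, 256),   # assumed head width
            nn.ReLU(),
            nn.Dropout(0.1),               # assumed dropout probability
            nn.Linear(256, num_classes),   # 2 outputs for binary classification
        )

    def get_outputs(self, input_ids, attention_mask):
        # Token embeddings of the last hidden layer, for other downstream applications.
        return self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

    def forward(self, input_ids, attention_mask):
        last_hidden = self.get_outputs(input_ids, attention_mask)
        cls_embedding = last_hidden[:, 0]  # first token, treated as the whole-sentence representation
        return self.head(cls_embedding)
```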
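
A sketch of the step 7 LightningModule. The learning rate and the total_steps passed to OneCycleLR are placeholder assumptions, not the notebook’s values.

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class IMDBLightningModule(pl.LightningModule):
    """Step 7 (sketch): train/val/test steps with loss and accuracy, plus Adam + OneCycleLR."""

    def __init__(self, model, lr=2e-5, total_steps=1000):
        super().__init__()
        self.model = model
        self.lr = lr
        self.total_steps = total_steps

    def _shared_step(self, batch, stage):
        logits = self.model(batch["input_ids"], batch["attention_mask"])
        loss = F.cross_entropy(logits, batch["labels"])
        acc = (logits.argmax(dim=-1) == batch["labels"]).float().mean()
        self.log(f"{stage}_loss", loss, prog_bar=True)
        self.log(f"{stage}_acc", acc, prog_bar=True)
        return loss

    def training_step(self, batch, batch_idx):
        return self._shared_step(batch, "train")

    def validation_step(self, batch, batch_idx):
        self._shared_step(batch, "val")

    def test_step(self, batch, batch_idx):
        self._shared_step(batch, "test")

    def predict_step(self, batch, batch_idx, dataloader_idx=0):
        logits = self.model(batch["input_ids"], batch["attention_mask"])
        return logits.argmax(dim=-1)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=self.lr, total_steps=self.total_steps
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
        }
```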
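
A sketch of steps 8 and 9: the three callbacks, the Trainer, and the fit() call. The monitored metric names match the LightningModule sketch above, and accelerator="gpu" assumes a recent Lightning release on a Colab GPU runtime (older releases used gpus=1 instead).

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor, ModelCheckpoint

# Steps 8-9 (sketch): callbacks, Trainer arguments, then a single fit() call.
callbacks = [
    EarlyStopping(monitor="val_loss", patience=2, mode="min"),
    LearningRateMonitor(logging_interval="step"),
    ModelCheckpoint(monitor="val_acc", mode="max", save_top_k=1),
]

dm = IMDBDataModule(data_dir, tokenizer, batch_size=32)
lit_model = IMDBLightningModule(IMDBClassifier(model_checkpoint))

trainer = pl.Trainer(
    max_epochs=1,
    accelerator="gpu",   # assumes a Colab GPU runtime; use accelerator="cpu" otherwise
    devices=1,
    callbacks=callbacks,
    profiler="simple",   # optional: reports where the training run spends its time
)
trainer.fit(lit_model, datamodule=dm)
trainer.test(lit_model, datamodule=dm)   # evaluates accuracy on the 25,000 test reviews
```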
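
A sketch of step 11 inference, reusing the tokenizer and the trained lit_model from the earlier sketches; the sample reviews are made up.

```python
import torch

# Step 11 (sketch): tokenize a small batch of raw text and get class predictions.
sample_texts = [
    "A wonderful, moving film with terrific performances.",
    "Two hours of my life I will never get back.",
]
batch = tokenizer(sample_texts, padding=True, truncation=True, return_tensors="pt")

lit_model = lit_model.cpu()   # keep model and inputs on the same device for this small example
lit_model.eval()
with torch.no_grad():
    logits = lit_model.model(batch["input_ids"], batch["attention_mask"])
    preds = logits.argmax(dim=-1)  # 1 = positive, 0 = negative in this sketch's label scheme

print(preds.tolist())
```

The same predictions could also be produced through trainer.predict with a DataLoader over these samples, since the LightningModule sketch defines a predict_step.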
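
Finally, a step 12 sketch for viewing the logs in Colab; lightning_logs/ is the default directory of Lightning’s TensorBoard logger.

```python
# Step 12 (sketch): view the TensorBoard logs inside the Colab notebook.
# %load_ext tensorboard
# %tensorboard --logdir lightning_logs/

# To log somewhere other than TensorBoard, pass a different logger to the Trainer, e.g.:
# from pytorch_lightning.loggers import CSVLogger
# trainer = pl.Trainer(logger=CSVLogger("logs/"))
```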

Next, in Part 3 of this series, we will look at training on the token classification task, also known as Named Entity Recognition (NER).
