A nostalgic trip down RNN Lane — Rudimentary Text Classification using Multi GPUs

Kaustav Mandal
Published in exemplifyML.ai
6 min read · Aug 18, 2022

Back in the day, RNNs and their offshoots were considered the de facto models for NLP tasks.

A recurrent neural network processes a sequence of characters or words using two items: a hidden state tensor hₜ-₁ and an input tensor xₜ. The tensor hₜ-₁ carries a representation of all the inputs processed so far, and xₜ is the current input (i.e. the word tensor) to be processed.

Figure 1: Basic RNN architecture (Image by Author)

RNNs can remember sequences because the hidden state produced at each step is fed back in when computing the next step.

One of the defining attributes of an RNN is weight sharing: the same weights are used to compute the activations across every step of the network. This ties into the idea of remembering prior states, since the same weights are applied over time.

For every word (x) in the sentence, positioned at time step (t), we compute its output (y) and hidden state (h) using the previous hidden state at (t - 1).

Figure 2: Basic expanded RNN architecture (Image by Author)

The activation function (A) uses the same set of weights at every step of the RNN: one weight matrix for the previous state information (Wᵖʳᵉᵛ) and one for the current step's input (Wᶜᵘʳ).
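
Written out with those symbols, the hidden-state update takes the standard form (bias term omitted here for brevity):

hₜ = A(Wᵖʳᵉᵛ · hₜ-₁ + Wᶜᵘʳ · xₜ)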

Types of RNN networks:

There are many types of RNN architectures, as illustrated below; however, for conciseness, we will stick to the structure of a basic RNN used for classification.

Figure 3: RNN Types, Image Reference — http://karpathy.github.io/2015/05/21/rnn-effectiveness

That’s the basic idea of an RNN; let’s work through a small example in Pytorch.

RNN in Pytorch:

Pytorch has an inbuilt RNN module which we can leverage for building our own layered RNNs.
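
As a quick illustration of the module's interface (a minimal sketch; the dimensions below are the ones used later in this post), nn.RNN consumes a batch of embedded sequences and returns the per-step outputs plus the final hidden state:

import torch
import torch.nn as nn

embedding_dim, hidden_dim = 96, 64   # dimensions used later in this post
rnn = nn.RNN(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)

batch = torch.randn(32, 120, embedding_dim)   # (batch, seq_len, embedding_dim)
outputs, h_n = rnn(batch)
print(outputs.shape)   # (32, 120, 64): one output per time step
print(h_n.shape)       # (1, 32, 64): final hidden state per sequence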

For this tutorial, we will use the IMDB reviews dataset from Kaggle, for classifying reviews as positive or negative sentiment.

Steps:

  • Generate a vocabulary for character-level or word-level embeddings of the English language.

    Note: Character-level embeddings are generally used for language identification tasks.
  • Preprocess the IMDB reviews dataset, removing non-Unicode characters, stopwords, etc.
  • Split the raw data into train / test segments, with samples for each class weighted proportionally.
  • Convert strings to embeddings for training / inference (a short sketch follows this list).
    For character-level embeddings, we can simply create a tensor for each IMDB review by substituting the vocabulary index of each character present in the review. In this way, we create a vector of numbers representing the review.
    For word-level embeddings, one can tokenize using NLTK or any other library and store the tokens in a vocabulary list, whose indices are then used to turn the words of a review into a vector.
  • Load the Pytorch dataloaders, splitting the dataset (for example, a 90:10 split) into train / validation sets, with each class balanced proportionally (see the split sketch after this list).

    Note: I did not find any stratified distributed random sampler I could use out of the box in Pytorch. I was able to leverage the following example from Kaggle for building out a stratified distributed sampler outlined here, with only minor tweaks.
  • For non multi-GPU use cases, there is a good post on the Pytorch forums about building a stratified sampler leveraging sklearn's StratifiedKFold, referenced here.
  • Add gradient clipping, as there are some long sequences in the IMDB reviews dataset which can lead to exploding gradients during backpropagation through time.
    A good starting point for a clip value is the average of the gradient norms from a trial training run of the RNN.
torch.nn.utils.clip_grad_norm_(model.parameters(), norm_type=2, max_norm=3)
  • Train / validate on various RNN and LSTM structures: a 1 layer RNN/LSTM and a 3 layer bi-directional RNN/LSTM.
  • After training, run the model in evaluation mode for a sample hypothetical review as illustrated below.
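
As a sketch of the word-level conversion step referenced above (build_vocab, encode_review, and the special <pad>/<unk> tokens are illustrative choices, not from the original post):

import torch

PAD, UNK = 0, 1   # reserved indices for padding and out-of-vocabulary tokens

def build_vocab(reviews):
    # assign an index to every unique whitespace-separated token
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for review in reviews:
        for token in review.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode_review(review, vocab):
    # map each token to its vocabulary index; unknown tokens fall back to <unk>
    return torch.tensor([vocab.get(token, UNK) for token in review.lower().split()],
                        dtype=torch.long)

vocab = build_vocab(["the movie is great", "the movie is boring"])
print(encode_review("the movie is very interesting", vocab))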
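
For the 90:10 stratified split itself, one simple alternative to the Kaggle sampler approach is sklearn's train_test_split with the stratify argument; the toy data below is only a placeholder:

from sklearn.model_selection import train_test_split

# toy stand-ins for the index-encoded reviews and their sentiment labels
encoded_reviews = [[i, i + 1] for i in range(20)]
labels = [i % 2 for i in range(20)]

train_x, val_x, train_y, val_y = train_test_split(
    encoded_reviews, labels,
    test_size=0.1,      # 90:10 split
    stratify=labels,    # keep the class proportions identical in both sets
    random_state=42,
)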

Sample definition for a 1 layer RNN built from Pytorch's Embedding, RNN, and Linear modules:

Note: For an LSTM, simply swap out the RNN module for the Pytorch LSTM module.
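
A minimal sketch of such a module, assuming word-index inputs; the class name and defaults (embedding dims 96, hidden dims 64, matching the hyperparameters below) are illustrative:

import torch
import torch.nn as nn

class RNNSentimentClassifier(nn.Module):
    """Sketch: Embedding -> 1 layer RNN -> Linear over the final hidden state."""
    def __init__(self, vocab_size, embedding_dim=96, hidden_dim=64, num_classes=2, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) of vocabulary indices
        embedded = self.embedding(token_ids)   # (batch, seq_len, embedding_dim)
        _, h_n = self.rnn(embedded)            # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])                # logits: (batch, num_classes)

model = RNNSentimentClassifier(vocab_size=len(vocab))   # vocab from the encoding sketch above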

Test Results (1 layer RNN) using word level embeddings:

Training for 12 epochs on 25k samples per sentiment class, with a 90:10 split, we ended up with a decent degree of confidence in the predicted sentiment.
Batch size — 32, Embedding dims (D) — 96, RNN hidden state dims — 64

Figure 4a: Prediction by 1 layer RNN using stratified dataset with word level embedding (Image by Author)
Figure 4b. 1 layer RNN using word level embeddings — train vs test accuracy over epochs (Image by Author)

Results (1 layer LSTM) using word level embeddings:

Training for 12 epochs on 25k samples per sentiment class, with a 90:10 split, we ended up with a rather poor degree of confidence in the predicted sentiment.
Batch size — 32, Embedding dims (D) — 96, LSTM hidden state dims — 64

Figure 4c: Prediction by 1 layer LSTM using stratified dataset with word level embedding (Image by Author)
Figure 4d. 1 layer LSTM using word level embeddings — train vs test accuracy over epochs (Image by Author)

Sample definition for a 3 layer bi-directional RNN (a sketch follows the packing notes below):

Note: For an LSTM, simply swap out the RNN module for the Pytorch LSTM module.

When using multiple layers, we use batching to make the computation faster. However, as sentence lengths differ from one review to the next, we need to pad them to a common length and then pack them so the tensor operations can run efficiently on the GPU.

Pytorch provides packing/unpacking utils out of the box. For additional details, please review PackedSequence and torch.nn.utils.rnn.pack_padded_sequence.

We also need balance not only across the train/test datasets, but also within each batch: samples in each batch should be weighted proportionally to the items per class in the dataset. (Please review the steps section for details.)
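
Putting the padding/packing and bi-directional pieces together, here is a minimal sketch of the 3 layer bi-directional variant (again, the class name and defaults are illustrative rather than the original gist):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class BiRNNSentimentClassifier(nn.Module):
    """Sketch: Embedding -> 3 layer bi-directional RNN over packed sequences -> Linear."""
    def __init__(self, vocab_size, embedding_dim=96, hidden_dim=64,
                 num_layers=3, num_classes=2, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, num_layers=num_layers,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)   # x2 for the two directions

    def forward(self, padded_ids, lengths):
        # padded_ids: (batch, max_seq_len); lengths: tensor of original lengths before padding
        embedded = self.embedding(padded_ids)
        packed = pack_padded_sequence(embedded, lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, h_n = self.rnn(packed)                      # h_n: (num_layers * 2, batch, hidden_dim)
        final = torch.cat((h_n[-2], h_n[-1]), dim=1)   # last layer's forward + backward states
        return self.fc(final)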

Test Results (3 layer bi-directional RNN) using word level embeddings:

Training for 12 epochs on 25k samples per sentiment class, with a 90:10 split, we ended up with a high degree of confidence in the predicted sentiment.

Figure 5a: Prediction by 3 layer bi-directional RNN using stratified dataset with word level embedding (Image by Author)
Figure 5b. 3 layer bi-directional RNN using word level embeddings — train vs test accuracy over epochs (Image by Author)

Test Results (3 layer bi-directional LSTM) using word level embeddings:

Training for 12 epochs on 25k samples per sentiment class, with a 90:10 split, we ended up with a very high degree of confidence in the predicted sentiment.

Figure 5c: Prediction by 3 layer bi-directional LSTM using stratified dataset with word level embedding (Image by Author)
Figure 5d. 3 layer bi-directional LSTM using word level embeddings — train vs test accuracy over epochs (Image by Author)
Figure 6a: Accuracy rates between 1 layer RNN and 3 layer bi-directional RNN (Image by Author)
Figure 6b: Accuracy rates between 1 layer LSTM and 3 layer bi-directional LSTM (Image by Author)

The review used for prediction was:

The movie is visually stunning and the movie is very interesting.
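
Running a trained model in evaluation mode on that review could look like the following sketch, reusing the illustrative encode_review helper, vocab, and 1 layer classifier from the sketches above:

model.eval()
with torch.no_grad():
    review = "The movie is visually stunning and the movie is very interesting."
    token_ids = encode_review(review, vocab).unsqueeze(0)   # add a batch dimension: (1, seq_len)
    probs = torch.softmax(model(token_ids), dim=1)          # class probabilities
    prediction = probs.argmax(dim=1)                        # index of the predicted sentiment class
    print(probs, prediction)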

References:

Special thanks to Edward Raff, author of the book Inside Deep Learning, whose concise explanations went a long way toward providing the intuition behind RNNs. Many of the code examples were adapted from the illustrated examples in that book.
