NLP: Zero To Hero [Part 2: Vanilla RNN, LSTM, GRU & Bi-Directional LSTM]
Link to Part 1 of this article:
NLP: Zero To Hero [Part 1: Introduction, BOW, TF-IDF & Word2Vec]
Link to Part 3 of this article:
NLP: Zero To Hero [Part 3: Transformer-Based Models & Conclusion]
Link to the Colab File:
https://github.com/PrateekCoder/NLP_Zero_To_Hero
This article is the continuation of NLP: Zero To Hero Part 1. In the previous article, we covered text pre-processing and feature extraction, and built sentiment analysis models using SVM with different vectorizers like BOW, TF-IDF, and Word2Vec. In this article, we will use Recurrent Neural Network-based models like Vanilla RNN, LSTM, GRU, and Bi-Directional LSTM. We will be building and training our models from scratch.
This is an amazing article for understanding how NLP works with neural networks:
https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
Step 05: Building Models For Sentiment Analysis
Sentiment Analysis Model using Vanilla RNN
Vanilla RNN (Recurrent Neural Network) is a type of neural network that is used for processing sequential data. It is the simplest type of RNN, where the hidden state at the current time step is determined by the input at the current time step and the hidden state from the previous time step.
In the context of sentiment analysis, a vanilla RNN can be used to predict the sentiment of a given sentence by processing the sentence word by word, updating the hidden state at each time step based on the current input word and the previous hidden state. The final hidden state of the RNN is then fed into a fully connected layer that predicts the sentiment of the sentence.
One of the main issues with vanilla RNN is the vanishing gradient problem, where gradients propagated back through the network become extremely small, making it difficult to learn long-term dependencies in the input sequence. As a result, more advanced RNN architectures like LSTM and GRU were developed to address this issue.
Here are the steps I followed to build a Vanilla RNN sentiment analysis model (a minimal code sketch follows the list):
- Label Encoding: The target variable is first label-encoded using Scikit-learn's `LabelEncoder` so that the categorical labels can be represented numerically.
- Splitting the Data: The data is split into training, testing, and validation sets using Scikit-learn's `train_test_split` function.
- Tokenization: The text data is then tokenized using Keras' `Tokenizer` class, which converts the text into a sequence of integers.
- Padding: The sequences are then padded to ensure that all of them have the same length. This is done using Keras' `pad_sequences` function.
- Defining the RNN Model: A sequential model is defined using Keras' `Sequential` class. An embedding layer is added to the model, followed by a SimpleRNN layer, a dropout layer to prevent overfitting, and a dense layer with softmax activation.
- Compiling the Model: The model is then compiled using an optimizer, a loss function, and a metric to measure performance.
- Early Stopping: Early stopping is set up using Keras' `EarlyStopping` callback to prevent overfitting and improve training efficiency.
- Training the Model: The model is trained on the training set using Keras' `fit` function.
- Evaluating the Model: Finally, the model is evaluated on the test set using Keras' `evaluate` function to measure its performance.
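To make those steps concrete, here is a minimal sketch of the pipeline in Keras. The DataFrame and column names (`df['review']`, `df['sentiment']`), the vocabulary size, the sequence length, and the layer sizes are my own illustrative assumptions, not the exact values from the Colab notebook.

```python
# A minimal sketch of the Vanilla RNN pipeline described above (not the notebook's exact code).
# Assumptions: a DataFrame `df` with text in df['review'] and labels in df['sentiment'],
# a 10,000-word vocabulary, sequences padded to length 100, and two sentiment classes.
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dropout, Dense
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import to_categorical

max_words, max_len, num_classes = 10000, 100, 2

# 1. Label encoding: categorical labels -> integers -> one-hot vectors
labels = to_categorical(LabelEncoder().fit_transform(df['sentiment']), num_classes)

# 2. Train / validation / test split
X_train, X_test, y_train, y_test = train_test_split(
    df['review'], labels, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42)

# 3. Tokenization: fit the vocabulary on the training text only
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

# 4. Padding so every sequence has the same length
def to_padded(texts):
    return pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=max_len)

X_train_pad, X_val_pad, X_test_pad = to_padded(X_train), to_padded(X_val), to_padded(X_test)

# 5. Define the model: Embedding -> SimpleRNN -> Dropout -> Dense(softmax)
model = Sequential([
    Embedding(max_words, 64),
    SimpleRNN(64),
    Dropout(0.5),
    Dense(num_classes, activation='softmax'),
])

# 6. Compile with an optimizer, a loss function, and a metric
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# 7. Early stopping on validation loss
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# 8. Train
model.fit(X_train_pad, y_train, validation_data=(X_val_pad, y_val),
          epochs=10, batch_size=64, callbacks=[early_stop])

# 9. Evaluate on the held-out test set
loss, acc = model.evaluate(X_test_pad, y_test)
print(f'Test accuracy: {acc:.4f}')
```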
Accuracy of Vanilla RNN Model with pre-processed data: 82.34%
Accuracy of Vanilla RNN Model with raw text and early stopping: 85.64%
You can perform hyperparameter tuning to get better accuracy. Just by using raw text instead of pre-processed text and applying early stopping, I was able to improve the accuracy by about 3%.
Sentiment Analysis Model using LSTM RNN
LSTMs are designed to address the vanishing gradient problem that can occur in traditional RNNs. They have a unique memory cell that is responsible for storing information over long periods of time, and three gating mechanisms that control the flow of information into and out of the memory cell. The gates are called the input gate, forget gate, and output gate, and they help the model selectively remember or forget information based on the input and past context.
LSTMs also have a series of layers that are connected through time, allowing the model to analyze sequences of inputs over time. These layers consist of a cell state that runs through all the time steps of the LSTM, and hidden states that act as short-term memory to store information about the recent past.
During training, the LSTM learns to update the cell state and hidden state based on the input and past context, and the final hidden state is fed into a dense layer that predicts the output.
Overall, LSTMs are effective in handling sequential data and have been shown to perform well in NLP tasks such as sentiment analysis.
Here is the best resource you can find to learn about LSTM in detail:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Building an LSTM model is very similar to building a simple RNN model; the steps are essentially the same (a short model sketch follows the list):
- Splitting the data into train, test, and validation sets using the `train_test_split` function from `sklearn.model_selection`.
- Tokenizing the text data using the `Tokenizer` class from `tensorflow.keras.preprocessing.text`.
- Converting the tokenized text data to sequences using the `texts_to_sequences` method of the `Tokenizer` class.
- Padding the sequences to a maximum length using the `pad_sequences` function from `tensorflow.keras.preprocessing.sequence`.
- Creating the LSTM model using the `Sequential` class from `tensorflow.keras.models` and adding layers using the `add` method. The layers added are `Embedding`, `LSTM`, `Dropout`, and `Dense`.
- Compiling the model using the `compile` method with the optimizer set to 'adam', the loss set to 'categorical_crossentropy', and the metric set to 'accuracy'.
- Setting up early stopping using the `EarlyStopping` callback from `tensorflow.keras.callbacks`.
- Training the model using the `fit` method with the training data, the validation data, and callbacks set to `early_stop`.
- Evaluating the model using the `evaluate` method with the test data and printing the test accuracy.
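A minimal sketch of the model portion is below; the split, tokenization, and padding are the same as in the Vanilla RNN sketch above (which is also where `X_train_pad`, `y_train`, etc. come from), and the layer sizes are again illustrative assumptions rather than the notebook's exact values.

```python
# A minimal LSTM model sketch; the padded data and one-hot labels are assumed
# to come from the same preprocessing as the Vanilla RNN sketch above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.callbacks import EarlyStopping

max_words, num_classes = 10000, 2

model = Sequential([
    Embedding(max_words, 64),                  # integer ids -> dense word vectors
    LSTM(64),                                  # gated recurrent layer (input/forget/output gates)
    Dropout(0.5),                              # regularization against overfitting
    Dense(num_classes, activation='softmax'),  # class probabilities
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
model.fit(X_train_pad, y_train, validation_data=(X_val_pad, y_val),
          epochs=10, batch_size=64, callbacks=[early_stop])

loss, acc = model.evaluate(X_test_pad, y_test)
print(f'Test accuracy: {acc:.4f}')
```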
Accuracy of LSTM RNN Model with pre-processed data: 85.44%
Accuracy of LSTM RNN Model with raw text and early stopping: 86.40%
I was able to improve the model performance by 1% when I used raw text data, applied early stopping, and added a dense layer to the model.
Sentiment Analysis Model using GRU RNN
The basic idea behind a GRU (Gated Recurrent Unit) is similar to that of an LSTM, in that it uses gates to control the flow of information within the network and avoid the vanishing gradient problem.
In a GRU, there are two gates — a reset gate and an update gate — that control how much of the previous hidden state is passed to the next time step, and how much of the new input is used to update the hidden state. The reset gate determines which parts of the previous hidden state are no longer relevant and should be ignored, while the update gate determines which parts of the new input should be incorporated into the hidden state.
Compared to an LSTM, a GRU has fewer parameters and is, therefore, faster to train, but may not perform as well on more complex NLP tasks. To use a GRU for sentiment analysis, you would follow similar steps to those used for an LSTM or other RNNs, such as tokenizing the text, padding the sequences, and building a model using the Keras API.
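As a rough sketch, only the recurrent layer changes compared to the LSTM model; the sizes below are illustrative assumptions, and the preprocessing is assumed to be the same as in the earlier sketches.

```python
# Swapping the LSTM layer for a GRU layer; everything else stays the same.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dropout, Dense

max_words, num_classes = 10000, 2

model = Sequential([
    Embedding(max_words, 64),
    GRU(64),                   # reset + update gates; fewer parameters than an LSTM
    Dropout(0.5),
    Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```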
Accuracy of GRU RNN Model with pre-processed data: 86.06%
Accuracy of GRU RNN Model with raw text and early stopping: 85.60%
The GRU model performed almost the same with pre-processed data and with the raw text and early stopping.
Sentiment Analysis Model using Bi-Directional LSTM RNN
A bidirectional LSTM is a type of neural network architecture that is commonly used for natural language processing tasks, such as sentiment analysis. It is an extension of the traditional LSTM architecture that includes two LSTMs working in opposite directions. One LSTM processes the sequence from the beginning to the end (forward direction), while the other processes the sequence from the end to the beginning (backward direction).
The main advantage of using a bidirectional LSTM is that it can capture both the past and future context of a word in a sentence, which is particularly useful in understanding the sentiment of the sentence. This is achieved by concatenating the outputs of the two LSTMs at each time step, creating a combined representation of the sequence.
During training, the model learns the best weights for each connection between the neurons in the network using the backpropagation algorithm. During evaluation, the trained model predicts the sentiment of new sentences based on the patterns learned from the training data.
For the Bi-Directional LSTM, we follow similar steps to those used for the Vanilla RNN and LSTM models to train our model.
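Here is a minimal sketch of the bidirectional variant (illustrative sizes, same assumed preprocessing as before): wrapping the LSTM in Keras' Bidirectional layer runs one LSTM forward and one backward over the sequence and concatenates their outputs.

```python
# Bi-Directional LSTM sketch: the Bidirectional wrapper handles the
# forward and backward passes and concatenates the two hidden states.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

max_words, num_classes = 10000, 2

model = Sequential([
    Embedding(max_words, 64),
    Bidirectional(LSTM(64)),   # forward + backward pass, outputs concatenated
    Dropout(0.5),
    Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```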
Accuracy of Bi-Directional LSTM RNN Model with pre-processed data: 85.18%
Accuracy of Bi-Directional LSTM RNN Model with raw text and early stopping: 86.74%
If you are wondering how the RNN model performs the vectorization, this explanation should help:
The text data is first tokenized using the Tokenizer class from Keras. The Tokenizer converts the text data into sequences of integer values, where each integer represents a specific word in the vocabulary.
After tokenization, the text data is then padded to ensure that all sequences have the same length using the pad_sequences function from Keras. This is necessary because the RNN model expects inputs of the same length.
Once the text data has been tokenized and padded, it is passed to the RNN model, which uses an Embedding layer to convert each integer value in the input sequences to a dense vector representation. The Embedding layer learns a low-dimensional representation of each word in the vocabulary during training, through the same backpropagation that updates the rest of the model.
The output of the Embedding layer is then passed to the SimpleRNN layer, which applies a simple recurrent neural network to the sequence of input vectors. The SimpleRNN layer maintains a hidden state that is updated at each time step based on the current input and the previous hidden state. The output of the SimpleRNN layer is a single vector representation of the input sequence, which can be used for classification or other downstream tasks.
Finally, the output of the SimpleRNN layer is passed to a Dense layer with a softmax activation function, which outputs a probability distribution over the two possible classes. The model is trained using categorical cross-entropy loss and optimized using the Adam optimizer.
Overall, the RNN model vectorizes the text data by first converting each word to a dense vector representation using an Embedding layer and then applying a SimpleRNN layer to the sequence of input vectors. The output of the SimpleRNN layer is a single vector representation of the input sequence that can be used for classification.
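Here is a tiny, self-contained illustration (with toy sentences, not the article's dataset) of what the Tokenizer and pad_sequences produce before the Embedding layer sees the data.

```python
# Toy example of the text -> integer ids -> padded matrix pipeline.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ['the movie was great', 'the movie was terribly boring']

tok = Tokenizer(num_words=100)
tok.fit_on_texts(texts)

seqs = tok.texts_to_sequences(texts)    # lists of word ids, different lengths
padded = pad_sequences(seqs, maxlen=6)  # zero-padded on the left to length 6

print(tok.word_index)  # word -> integer id mapping
print(padded)          # shape (2, 6), ready for the Embedding layer
```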
You can definitely combine the RNN model with other vectorization methods, such as Word2Vec, and see how that performs.
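One possible way to do this, sketched below as an assumption rather than code from the notebook, is to train a gensim Word2Vec model, build an embedding matrix, and use it to initialize the Embedding layer. Here `sentences` is assumed to hold the tokenized training texts, and `tokenizer` and `max_words` come from the earlier RNN sketch.

```python
# Sketch: initialize the Keras Embedding layer with Word2Vec vectors.
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

w2v_dim = 100
w2v = Word2Vec(sentences=sentences, vector_size=w2v_dim, window=5, min_count=1)

# Map each word id in the Keras tokenizer to its Word2Vec vector
embedding_matrix = np.zeros((max_words, w2v_dim))
for word, idx in tokenizer.word_index.items():
    if idx < max_words and word in w2v.wv:
        embedding_matrix[idx] = w2v.wv[word]

# Use the pre-trained vectors instead of learning the embedding from scratch
embedding_layer = Embedding(max_words, w2v_dim,
                            embeddings_initializer=Constant(embedding_matrix),
                            trainable=False)
```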
So we have now covered the main RNN-based models; Transformer-based models are still to come. Surprisingly, none of the custom-built RNN models performed better than SVM with BOW or TF-IDF. I am fairly sure we can build a better RNN model, but it would require more training and hyperparameter optimization.
In the next article, I will use Transformer-based models for our sentiment analysis task and compare them with all the models we have built so far.
Here is the link to NLP: Zero To Hero [Part 3: Transformer-Based Models & Conclusion].