Evolution of NLP — Part 2 — Recurrent Neural Networks

An Introduction to Deep Learning for NLP Sentiment Classification

Kanishk Jain · Published in Analytics Vidhya · Jul 25, 2020 · 10 min read

In the first part, we explored Bag of Words and TF-IDF alongside ensemble decision-tree methods for text classification. Feel free to check that out to see our baseline score. This time we try a new approach: Recurrent Neural Networks (RNNs).

To access the first article in this series, click here — Evolution of NLP — Part 1 — Bag of Words, TF-IDF

To access the complete code for this tutorial, feel free to check out this Kaggle Notebook

Recurrent Neural Networks are a special kind of neural network that unfolds over time. Standard models, and even plain feed-forward neural networks, do not take into account the context in which each word is used; every word is analyzed independently, so the model is always guessing at the context, which limits the accuracy it can achieve. This is where RNNs come in.

RNNs take input words sequentially, one at a time, and pass the output generated for each word back to the model alongside the next word. This helps the model understand the context in which each word was used and predict accordingly.

A simple RNN-based classifier has the structure shown below: the words of an input statement are passed in one by one, and the output at the final step is the prediction.

Here X0, X1, X2…Xt are the words of a sentence, fed to the network one by one. At each step the RNN cell produces an output in the form of a hidden state h0, h1, h2…ht, which is also passed ahead!

Image is taken from “Understanding LSTM Networks” — Colah’s Blog
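To make the unrolling concrete, here is a minimal NumPy sketch of a vanilla RNN step applied over a short sequence; the tanh update is the standard formulation, while the shapes and names are purely illustrative:

import numpy as np

hidden_size, embed_size = 8, 4

# Illustrative, randomly initialized weights (in practice these are learned)
W_xh = np.random.randn(hidden_size, embed_size) * 0.1   # input -> hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden -> hidden
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # One time step: combine the current word vector with the previous hidden state
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

sentence = [np.random.randn(embed_size) for _ in range(5)]  # a toy 5-word "sentence"
h = np.zeros(hidden_size)   # the hidden state starts empty before the first word
for x_t in sentence:
    h = rnn_step(x_t, h)    # h_t depends on x_t and on everything seen before it

print(h)  # the final hidden state summarizes the whole sequence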

This idea of processing a sentence by taking each input word sequentially is quite intuitive and resonates well with how we, as humans, understand a sentence!

This is the reason RNNs have been applied across a variety of applications, ranging from sentiment analysis to neural translation, image captioning, etc.

There are many other sources that shed light on what can be done with these networks; some of the interesting ones are Andrej Karpathy’s Blog and Understanding LSTM Networks, Colah’s Blog. Next, let’s talk about how we can train these models.

Training RNNs and Issues with Long-Term Dependency

Their unique structure also makes how they learn their weights particularly interesting. RNNs use something called Backpropagation Through Time (BPTT), where the error is propagated back through the unrolled sequence. For a sentence with n words, the RNN unfolds n times, i.e. there are n different steps at which an input is passed into the layer, each of which contributes to the overall error. As we update the weights based on gradients, we find that these gradients end up being multiplied n times over (courtesy of the Chain Rule for Differentiation) by the time we reach the first word. And herein lies the problem!

For a complete understanding of the maths behind the weight updates and back-propagation, refer to these excellent sources — Recurrent Neural Network, Brilliant and Denny Britz’s series

But if you are simply plugging the model in and running the scripts, you don’t need to worry about this. What you do need to worry about is how to handle the various kinds of errors that pop up while training your model. Let’s talk about the most serious one next!

Vanishing Gradient

While training the model, each of these per-step updates (read: gradients) is typically either smaller than 1 or larger than 1. Raising such numbers to higher and higher powers quickly leads to two critical issues with RNNs — Vanishing Gradients and Exploding Gradients, respectively. Let’s see how this happens.

Imagine you have a deep neural network, and the last layer produces a gradient of 0.6. You do need to know the maths to follow the next part fully (the tutorials above would help), but if you are short on time, get this — the gradient updates for the initial layers are products of the gradients of the later layers. So the gradient for those initial layers starts looking roughly like 0.6 × 0.6 × 0.6 … n times, and this number quickly becomes very small. With n = 10, we end up with about 0.006.

This is what we call Vanishing Gradients.

As the names suggest, in the first case the gradients become so small that they barely change the model weights, while in the second case they become so large that they cause egregious shifts in the weights; in both cases the model fails to learn properly.
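A quick back-of-the-envelope check makes both failure modes obvious; the 0.6 and 1.5 below are just illustrative per-step gradient factors:

# Repeatedly multiplying a per-step gradient factor over n time steps
for factor in (0.6, 1.5):
    for n in (10, 50, 100):
        print(f"factor={factor}, n={n:3d} -> {factor ** n:.3g}")

# factor 0.6 collapses towards 0 (vanishing gradient),
# factor 1.5 blows up (exploding gradient).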

In most cases we use activation functions (tanh, sigmoid) whose derivatives are smaller than 1, so the Vanishing Gradient is the more prominent concern.

The intuition behind Vanishing Gradient & Long-Term Dependency

The problem of Vanishing Gradient can be intuitively understood if we think of Long-Term Dependency.

In most cases, looking at only the last few words is enough to understand the context of a sentence and make a prediction. The best examples are sentences where the words that need to be connected are relatively close, like “fish live in ___” — we can almost immediately make the connection and say the answer is “water”. These cases are easy pickings for RNNs: since the gap is small, the context is preserved while making the prediction!

But problems with RNNs start to crop up when, instead of short sentences, we process complicated or longer paragraphs — “The war devastated most of France including his small town….but despite that he preserved his link to his birthplace by always speaking in ____” — here the relevant words are separated by a long gap and the passage is much more complicated, so predicting “French” for the blank is much harder.

It’s important to note that as we work with longer and longer sequences, these problems become too difficult for normal RNNs to address.

Enter LSTMs!

To address this issue, LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) were introduced. Even though LSTMs have more weights to tune, both share the same idea behind their structure — the goal is to preserve the context/information present in the initial part of the statement by preventing the Vanishing Gradient issue.

In this post, I’ll walk you through implementing LSTMs. For a detailed and intuitive explanation, refer to the Understanding LSTM Networks, Colah’s Blog post.

One of the most important elements of an LSTM is the Cell State. Intuitively, you can think of it as the overall context of the sentence. It functions like a conveyor belt that flows through all the time steps, receiving updates as and when new data is added.

The changes to the cell state are regulated using another crucial element of an LSTM cell — gates!

LSTMs have 3 gates, in addition to the layers in a normal RNN.
1. The “forget gate” controls which part of the previous information is retained in the cell state and which is not — essentially a sigmoid over the previous time step’s hidden state and the current time step’s input data,

2. The “input gate” controls what new information is added to the cell state. This consists of 2 components — a sigmoid over the new input and the previous hidden state, and a tanh over the same values to prepare candidate data for the Cell State, and finally

3. The “output gate”, which decides what to output from the newly updated Cell State. This generates the output, i.e. our hidden state. The cycle then repeats in the next cell, which takes this hidden state, the new input data and the updated cell state as inputs.

Intuitively, these gates ensure that the relevant context/information necessary to make the right prediction is retained; essentially, they prevent the context from getting lost over long sequence lengths.

And mathematically, the idea is to make sure the gradients don’t approach 0 for words present at the beginning of the sentence as we perform back-propagation through this chain. When calculating gradients, it is no longer a long product of individual partial derivatives that are all much smaller than 1; instead, the learned gate activations (sigmoids) enter the product. If the context at a position is important, the model can learn to push the corresponding gate towards 1, ensuring that the gradient flowing to that position stays significant and its weights keep learning rather than saturating. This simple mathematical trick, using sigmoids to regulate the flow of information, greatly improved the effectiveness of these recurrent networks on NLP tasks.
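To tie the three gates and the cell state together, here is a minimal NumPy sketch of a single LSTM step, following the standard formulation (sigmoid gates, tanh candidate). The variable names and sizes are illustrative and simplified relative to what Keras actually implements:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H, D = 8, 4  # hidden size, input size (illustrative)
rng = np.random.default_rng(0)
# One weight matrix / bias per gate plus the candidate state (randomly initialized here)
W = {g: rng.standard_normal((H, D)) * 0.1 for g in 'fioc'}
U = {g: rng.standard_normal((H, H)) * 0.1 for g in 'fioc'}
b = {g: np.zeros(H) for g in 'fioc'}

def lstm_step(x_t, h_prev, c_prev):
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    c_hat = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate cell state

    c_t = f * c_prev + i * c_hat   # additive update: old context is kept, not overwritten
    h_t = o * np.tanh(c_t)         # hidden state is a filtered view of the cell state
    return h_t, c_t

h, c = np.zeros(H), np.zeros(H)
for x_t in [rng.standard_normal(D) for _ in range(5)]:   # a toy 5-word "sentence"
    h, c = lstm_step(x_t, h, c)

The additive update to the cell state is the key design choice: the old context is scaled by a learned gate rather than squashed through yet another multiplication, which is what lets gradients survive over long sequences.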

To get a better in-depth mathematical understanding of how the gradients for these two are different, refer to this excellent tutorial.

Enough talk, let’s see how to implement a simple LSTM network in Keras.

To set up a bit of context, the data we are using is from the JantaHack NLP Hackathon competition. The dataset consists of Steam user reviews for different kinds of games, collected during 2015–2019. The goal is to predict, based on the user review, whether the user recommends the game or not. So our task is essentially sentiment classification.

Step 1. Pre-Processing

All the preprocessing steps remain the same here! Please refer to the first blog where I share the details of this step.

Step 2. Tokenization & Padding

In the Tokenization step, the sentences are converted into individual words, or tokens, just as in the previous models. In the Padding step, the lengths of the inputs to the LSTM model are homogenized: since the sentences all have different lengths, sentences shorter than the maximum length get padded with 0s, ensuring that all the inputs are of the same length.

# Imports assumed from tf.keras; adjust the path if you use standalone Keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_words = 15000   # vocabulary size
max_len = 400       # maximum sentence length (in tokens)

tokenizer = Tokenizer(num_words=max_words, split=' ')
tokenizer.fit_on_texts(dataset3['user_review'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Convert reviews to integer sequences and pad them to a fixed length
X = tokenizer.texts_to_sequences(dataset3['user_review'].values)
X = pad_sequences(X, maxlen=max_len)

I’m limiting my vocabulary size to 15,000 using max_words, and the maximum sentence length to 400. By default, the pad_sequences method adds the padding before the sentence; this can be changed to post using the padding argument.
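For example (import path assumed to be tf.keras, as in the rest of this notebook):

from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[5, 12, 7], [3, 9]]
print(pad_sequences(seqs, maxlen=5))                  # default 'pre': [[0 0 5 12 7], [0 0 0 3 9]]
print(pad_sequences(seqs, maxlen=5, padding='post'))  # 'post':        [[5 12 7 0 0], [3 9 0 0 0]]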

Step 3. Modeling

import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential  # import path assumed (tf.keras)
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

Y = pd.get_dummies(dataset3['user_suggestion']).values  # one-hot encoded targets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

embed_dim = 100   # embedding dimension
lstm_out = 128    # number of LSTM units
model = Sequential()

Let’s start off by splitting our dataset into train and test sets. I’ve also defined the dimension of my embeddings and the number of LSTM cell neurons. Let’s first talk about embeddings.

Embeddings

To start off, we convert our tokens (which you can think of as one-hot encoded indices into the vocabulary) into embeddings.

A word embedding is a class of approaches for representing words and documents using a dense vector representation.

It is an improvement over the traditional bag-of-word model encoding schemes where large sparse vectors were used to represent each word or to score each word within a vector to represent an entire vocabulary. These representations were sparse because the vocabularies were vast and a given word or document would be represented by a large vector comprised mostly of zero values.

Instead, in an embedding, words are represented by dense vectors where a vector represents the projection of the word into a continuous vector space.

Intuitively, embeddings are essentially multidimensional representations of words that take their meanings into account as well! For example, the vector for “king” would be closer to “man” than to “woman”. The gaps between related words are also consistent, so that in embedding space king − man ≈ queen − woman. These are usually trained on a huge corpus of data using neural networks, with help from linguistic regularities. Some of the popular embeddings are GloVe, word2vec, etc.
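As a small illustration of that “consistent gap” idea, here is a hedged sketch using gensim’s downloader to fetch pre-trained GloVe vectors; gensim is not used in the notebook itself, and the model name below is just one of the pre-packaged options:

import gensim.downloader as api

# Downloads 100-dimensional GloVe vectors on first use
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman should land near "queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))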

For the first part of this exercise, I’m not choosing any pre-trained embeddings; instead, I’ll simply let my model learn them for this specific case.

# Learn the embeddings from scratch as part of the model
model.add(Embedding(max_words, embed_dim, input_length=max_len))

Dropout

The dropout layer drops neurons randomly during training, with a probability defined by us. The idea is to make the model robust to fluctuations/noise in the data by forcing it to learn a more general pattern. This essentially helps prevent overfitting and, along with BatchNorm, it is the most commonly used regularization method in Deep Learning.

There are multiple kinds of dropout that can be used, varying from SpatialDropout to recurrent_dropout within the LSTM cell, as sketched below. For this model, however, I’m using the normal Dropout layer with a 0.5 chance of dropping any neuron.
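A quick reference for where each variant sits (these lines are illustrative, not part of the model built below):

from tensorflow.keras.layers import Dropout, SpatialDropout1D, LSTM

Dropout(0.5)                  # drops individual activations at random
SpatialDropout1D(0.4)         # drops whole feature channels; useful right after an Embedding layer
LSTM(128, dropout=0.2,        # dropout applied to the LSTM inputs
     recurrent_dropout=0.2)   # dropout applied to the recurrent (hidden-to-hidden) connections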

model.add(LSTM(lstm_out))                  # 128 LSTM units
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))  # 2 classes: recommend / don't recommend

Now, we’ve added all the layers we need. Let’s compile this model and train it.

model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics = ['accuracy'])
print(model.summary())

batch_size = 128
model.fit(X_train, Y_train, epochs = 7, batch_size=batch_size, verbose = 1, validation_split = 0.2)

With this model, we manage to reach around 83% accuracy by Epoch 3, after which the model starts to over-fit. To improve this further, we use pre-trained word embeddings from GloVe and a Bidirectional LSTM. These cells look at the forward as well as the backward context of a word while generating the output.

import os
import numpy as np

# Load the pre-trained GloVe vectors into a dictionary: word -> 200-d vector
glove_dir = '../input/glove-global-vectors-for-word-representation/'
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.200d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' % len(embeddings_index))

# Build the embedding matrix for our own vocabulary
embedding_dim = 200
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

From this, we get weights for our tokens, and we pass these weights to our new model.

from tensorflow.keras.layers import Bidirectional, SpatialDropout1D  # additional layer imports assumed

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=max_len, weights=[embedding_matrix]))
model.add(SpatialDropout1D(0.4))
model.add(Bidirectional(LSTM(64, return_sequences=True)))  # pass the full sequence to the next LSTM
model.add(Bidirectional(LSTM(32)))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Accuracy — 87%

This is an improvement over our previous approach! Further fine-tuning the model with better dropout, and more epochs can actually improve this result!

Just with a little effort, we were able to step up from an 83% score in the first post to 87% now. Feel free to play around with the Kaggle Notebook on your own to see if you can raise the score. Some tips to try would be: adding a BatchNormalization layer instead of Dropout, adding more LSTM layers, changing the optimizer, and making changes to the learning rate. A sketch of a couple of these tweaks follows below.
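If you want to experiment with those tweaks, here is a hedged sketch of what a couple of them could look like, reusing the layers and embedding matrix defined above; model2, the layer sizes and the learning rate are arbitrary illustrations, not tuned values:

from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.optimizers import Adam  # import path assumed (tf.keras 2.x)

model2 = Sequential()
model2.add(Embedding(max_words, embedding_dim, input_length=max_len, weights=[embedding_matrix]))
model2.add(Bidirectional(LSTM(64, return_sequences=True)))
model2.add(Bidirectional(LSTM(32)))
model2.add(Dense(64, activation='relu'))
model2.add(BatchNormalization())   # BatchNormalization in place of Dropout
model2.add(Dense(2, activation='softmax'))

# An explicit, smaller learning rate instead of the default 'adam'
model2.compile(loss='categorical_crossentropy',
               optimizer=Adam(learning_rate=1e-4),
               metrics=['accuracy'])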

If you want to learn more, definitely check out the Tensorflow in Practice Specialization on Coursera. It’s an excellent resource to learn using Tensorflow hands-on for Deep Learning.

In the next blog of this series, let’s see how we can actually use Transfer Learning for NLP!

Get ready, we are entering the big leagues now!
