Exploring different types of LSTMs
Introduction
Recently, we worked on an interesting project in which we explored sentiment analysis on the Movie Reviews dataset from Kaggle using different types of LSTMs, reaching accuracies of 79% to 81%. Let us start with the LSTM concept used in the project:
LSTM
Long Short-Term Memory (LSTM) networks are a special kind of RNN, capable of learning long-term dependencies.
In a standard RNN, the repeating module has a simple structure: a single tanh layer, as shown below:
Unlike standard RNNs, the LSTM's repeating module has four layers interacting in a special way: a forget gate, an input gate with its accompanying tanh candidate layer (the two input-side layers), and an output gate, as shown in the image below:
Project Details:
We worked on sentiment analysis for the Movie Reviews dataset. The image below shows what the dataset looks like:
As we can see, the data has four columns. The Phrase column holds the review text provided by users, and the Sentiment column labels each phrase with one of five categories, from ‘0’ (very bad) to ‘4’ (very good), as shown below:
Now that we have an idea of the dataset, we can move on to preprocessing it.
Preprocessing
Taking the Phrase column, we applied the following NLP text-processing steps to the dataset:
Changing Text to Lowercase
train['Phrase']=train.Phrase.apply(lambda x: x.lower())
Identifying and Removing Punctuation
import string

def remove_punc(text):
    # Replace every punctuation character with a space
    for i in string.punctuation:
        text = text.replace(i, ' ')
    return text

train['Phrase'] = train.Phrase.apply(remove_punc)
Removing Stopwords
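from nltk.corpus import stopwords

# Keep the negations 'no' and 'not' in the text, since they carry sentiment
stopword_list = stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')
train['Phrase'] = train.Phrase.apply(lambda x: " ".join(w for w in x.split() if w not in stopword_list))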
Word Tokenization
from nltk.tokenize import word_tokenize
train['Phrase'] = train.Phrase.apply(word_tokenize)
Removing Numbers
import re

def remove_numbers(words):
    # Strip digits from each token and drop tokens that become empty
    new_words = []
    for word in words:
        new_word = re.sub(r"\d+", "", word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

train['Phrase'] = train.Phrase.apply(remove_numbers)
Lemmatizing Verbs
from nltk.stem import WordNetLemmatizer

def lemmatize_verbs(words):
    # Reduce each token to its verb lemma, e.g. 'running' -> 'run'
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

train['Phrase'] = train.Phrase.apply(lemmatize_verbs)
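To make the effect of these steps concrete, the sketch below runs one made-up review through lowercasing, punctuation removal, stopword removal, tokenization and number removal. It is only a standard-library approximation of the pipeline above: `STOPWORDS` is a small hypothetical stand-in for NLTK's English list (with 'no' and 'not' left out of it, as above), `str.split` stands in for `word_tokenize`, and lemmatization is omitted because it needs the WordNet data.

```python
import re
import string

# Hypothetical, tiny stand-in for NLTK's English stopword list.
# 'no' and 'not' are deliberately absent so negations survive.
STOPWORDS = {"the", "a", "an", "is", "was", "it", "of", "and", "in", "at", "all"}

def preprocess(text):
    """Apply the preprocessing steps (minus lemmatization) to one review."""
    text = text.lower()                                 # lowercase
    for ch in string.punctuation:                       # punctuation -> space
        text = text.replace(ch, " ")
    tokens = [w for w in text.split() if w not in STOPWORDS]  # stopwords
    tokens = [re.sub(r"\d+", "", w) for w in tokens]    # strip digits
    return [w for w in tokens if w]                     # drop empty tokens

sample = "The movie was NOT good at all, 2 stars!"
print(preprocess(sample))  # ['movie', 'not', 'good', 'stars']
```

Note that the negation "not" is kept, which is the reason for removing it from the stopword list before filtering.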
Creating a Wordcloud for the Data
Once preprocessing was complete, we split the data into train and test sets. We then applied different LSTM models and checked their accuracy as follows:
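The models below take `X_train`, `max_len` and `tokenize.word_index` as given, but the steps that produce them are not shown here. Presumably a Keras `Tokenizer` and `pad_sequences` were used; the plain-Python sketch below only imitates that idea (frequency-ranked word indices starting at 1, index 0 reserved for padding, zeros added on the left) and is not the project's actual code. The helper names `build_word_index`, `encode` and `pad` are hypothetical.

```python
from collections import Counter

def build_word_index(docs):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for doc in docs for word in doc)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=1)}

def encode(doc, word_index):
    """Replace known words by their index; unknown words are skipped."""
    return [word_index[w] for w in doc if w in word_index]

def pad(seq, max_len):
    """Keep at most the last max_len items, then left-pad with 0."""
    seq = seq[-max_len:]
    return [0] * (max_len - len(seq)) + seq

docs = [["movie", "not", "good"], ["good", "story", "good", "cast"]]
word_index = build_word_index(docs)
max_len = max(len(d) for d in docs)
X = [pad(encode(d, word_index), max_len) for d in docs]
vocab_size = len(word_index) + 1  # +1 because index 0 is the padding value

print(word_index)  # {'good': 1, 'cast': 2, 'movie': 3, 'not': 4, 'story': 5}
print(X)           # [[0, 3, 4, 1], [1, 5, 1, 2]]
print(vocab_size)  # 6
```

This is why the model code computes the Embedding input size as `len(tokenize.word_index)+1`: the extra slot accounts for the padding index.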
Different LSTM Models
Classic LSTM
This architecture consists of four interacting layers that act on the cell state: the forget gate, the input gate together with a tanh candidate layer (the two input-side layers), and the output gate. The input gate and the candidate layer work together to choose what new information to add to the cell state. The forget gate decides how much of the old cell state to discard, based on the previous hidden state and the current input. The output gate decides which parts of the cell state are sent out as the output.
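The gate arithmetic described above can be written out for a single LSTM unit. The sketch below is a scalar, one-unit illustration in plain Python, not how Keras implements the layer: the function name `lstm_step`, the parameter dictionary `params` and the example weights are all made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM with a scalar input.

    params maps 'f', 'i', 'g', 'o' to (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("f"))    # forget gate: how much old state to keep
    i = sigmoid(pre_activation("i"))    # input gate: how much new content to admit
    g = math.tanh(pre_activation("g"))  # candidate content to add
    o = sigmoid(pre_activation("o"))    # output gate: how much state to expose

    c = f * c_prev + i * g              # new cell state
    h = o * math.tanh(c)                # new hidden state (the layer's output)
    return h, c

# Arbitrary example weights, chosen only for illustration.
params = {
    "f": (0.5, 0.1, 1.0),
    "i": (0.8, -0.2, 0.0),
    "g": (1.2, 0.3, 0.0),
    "o": (0.4, 0.6, 0.0),
}

h, c = 0.0, 0.0
for x in [0.5, -1.0, 0.25]:
    h, c = lstm_step(x, h, c, params)
    print(round(h, 4), round(c, 4))
```

A real layer such as `LSTM(196)` does the same thing with weight matrices, so that all 196 units are updated at once.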
We applied a classic LSTM (Long Short-Term Memory) model to the training data and fit the model.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

EMBEDDING_DIM = 128
lstm_out = 196
a = len(tokenize.word_index)+1  # vocabulary size (+1 for the padding index 0)
model = Sequential()
model.add(Embedding(a, EMBEDDING_DIM, input_length=max_len))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(5, activation='softmax'))  # one output per sentiment class
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.fit(X_train, y_train, batch_size=128, epochs=24, verbose=1)
Here we used a single LSTM layer with the Adam optimizer and achieved an accuracy of 80% after about 24 epochs, which is a good result.
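To see what the `Dense(5, activation='softmax')` output and the `sparse_categorical_crossentropy` loss in the model above compute for one review, here is a minimal plain-Python sketch. The logits are made-up numbers, and the two functions are illustrative re-implementations, not the Keras ones.

```python
import math

def softmax(logits):
    """Turn five raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, label):
    """Loss for one example whose true class is the integer `label`."""
    return -math.log(probs[label])

# Made-up scores for the five sentiment classes 0..4.
logits = [0.1, 0.2, 2.0, 0.3, 0.1]
probs = softmax(logits)
predicted = probs.index(max(probs))  # class with the highest probability
loss = sparse_categorical_crossentropy(probs, label=2)

print(predicted)  # 2
print(loss)
```

The "sparse" variant is used because the Sentiment column holds integer labels from 0 to 4 rather than one-hot vectors.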
Encouraged by this, we explored the same dataset with other types of LSTMs and related RNNs.
Stacked LSTM
A Stacked LSTM is an LSTM model with multiple LSTM layers, where each LSTM layer passes its full output sequence to the next LSTM layer.
We applied a Stacked LSTM by adding a second LSTM layer and fit the model.
EMBEDDING_DIM = 128
lstm_out = 196
a = len(tokenize.word_index)+1
model2 = Sequential()
model2.add(Embedding(a, EMBEDDING_DIM, input_length=max_len))
model2.add(LSTM(lstm_out, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
model2.add(LSTM(lstm_out))
model2.add(Dense(5, activation='softmax'))
model2.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.summary()
model2.fit(X_train, y_train, batch_size=128, epochs=24, verbose=1)
Here we used two LSTM layers with the Adam optimizer and achieved an accuracy of 80%.
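The `return_sequences=True` flag on the first layer above is what makes stacking possible: the first layer must emit one output per time step so that the second layer still receives a sequence. The toy sketch below illustrates this with a hypothetical single-unit recurrent layer (`toy_recurrent_layer`, a plain tanh recurrence rather than a real LSTM) and made-up weights.

```python
import math

def toy_recurrent_layer(sequence, weight, return_sequences=False):
    """Hypothetical single-unit recurrent layer: h_t = tanh(weight * x_t + h_{t-1})."""
    h = 0.0
    outputs = []
    for x in sequence:
        h = math.tanh(weight * x + h)
        outputs.append(h)
    # Either every hidden state (one per time step) or only the last one.
    return outputs if return_sequences else outputs[-1]

sequence = [0.5, -1.0, 0.25, 0.8]

# First layer keeps one output per time step so the next layer has a sequence to read.
hidden_sequence = toy_recurrent_layer(sequence, weight=0.7, return_sequences=True)
# Last layer returns only its final state, which would feed the Dense layer.
final_state = toy_recurrent_layer(hidden_sequence, weight=1.3)

print(len(hidden_sequence))  # 4, same length as the input sequence
print(final_state)
```

In Keras terms, the first layer's output has shape (batch, timesteps, units), while the last recurrent layer's output has shape (batch, units).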
Bidirectional LSTM
A Bidirectional LSTM trains two LSTMs on the input sequence instead of one: the first reads the sequence as it is, and the second reads a reversed copy of it. This gives the model context from both directions and can help it learn faster.
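The idea can be sketched in a few lines of plain Python: run one recurrence over the sequence, run a second one over the reversed sequence, and concatenate the two results. The recurrence `run_direction` below is a hypothetical single-unit stand-in for an LSTM, and the weights are made up.

```python
import math

def run_direction(sequence, weight):
    """Hypothetical single-unit recurrence; returns the final hidden state."""
    h = 0.0
    for x in sequence:
        h = math.tanh(weight * x + 0.5 * h)
    return h

def bidirectional(sequence, forward_weight, backward_weight):
    """Read the sequence in both directions and concatenate the final states."""
    forward = run_direction(sequence, forward_weight)
    backward = run_direction(sequence[::-1], backward_weight)  # reversed copy
    return [forward, backward]

sequence = [0.2, -0.4, 0.9]
features = bidirectional(sequence, forward_weight=1.0, backward_weight=1.0)
print(features)  # two values: one per direction
```

With Keras' default concatenation, `Bidirectional(LSTM(64))` likewise produces 128 features per review: 64 from each direction.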
We have applied Bidirectional LSTM and fit the model.
from keras.layers import Bidirectional
EMBEDDING_DIM = 128
a = len(tokenize.word_index)+1
model3 = Sequential()
model3.add(Embedding(a, EMBEDDING_DIM, input_length=max_len))
model3.add(Bidirectional(LSTM(64, return_sequences=True)))
model3.add(Bidirectional(LSTM(64)))
model3.add(Dense(5, activation='softmax'))
model3.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model3.summary()
model3.fit(X_train, y_train, batch_size=128, epochs=24, verbose=1)
Here we not only used a Bidirectional LSTM but also stacked two such layers, again with the Adam optimizer. We achieved an accuracy of 81% after 24 epochs, and training the model further may improve the accuracy.
GRU(Gated Recurrent Unit)
A Gated Recurrent Unit (GRU) network has two gates: a reset gate and an update gate. The reset gate helps capture short-term dependencies in sequences, and the update gate helps capture long-term dependencies. Together, the two gates control how much each hidden unit remembers or forgets while working through the sequence.
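As a rough illustration of how the two gates interact, here is one time step of a single-unit GRU in plain Python. It is a scalar sketch with made-up weights, and `gru_step` and `params` are hypothetical names. It follows the convention in which an update gate close to 1 keeps the old state; some texts swap the roles of z and 1 - z.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, params):
    """One time step of a single-unit GRU with a scalar input.

    params maps 'z', 'r', 'h' to (input weight, recurrent weight, bias).
    """
    wz_x, wz_h, bz = params["z"]
    wr_x, wr_h, br = params["r"]
    wh_x, wh_h, bh = params["h"]

    z = sigmoid(wz_x * x + wz_h * h_prev + bz)  # update gate
    r = sigmoid(wr_x * x + wr_h * h_prev + br)  # reset gate
    # The reset gate scales how much of the old state enters the candidate.
    h_candidate = math.tanh(wh_x * x + wh_h * (r * h_prev) + bh)
    # The update gate blends the old state with the candidate.
    return z * h_prev + (1.0 - z) * h_candidate

# Arbitrary example weights, chosen only for illustration.
params = {"z": (0.6, 0.2, 0.0), "r": (0.9, -0.3, 0.0), "h": (1.1, 0.7, 0.0)}

h = 0.0
for x in [0.5, -1.0, 0.25]:
    h = gru_step(x, h, params)
    print(round(h, 4))
```

Unlike an LSTM, there is no separate cell state here: the hidden state `h` is the only memory carried from step to step.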
We applied a GRU model and achieved an accuracy of 81%. The code is below:
from keras.layers import GRU
model4 = Sequential()
model4.add(Embedding(a, EMBEDDING_DIM, input_length=max_len))
model4.add(GRU(64))
model4.add(Dense(5, activation='softmax'))
model4.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model4.summary()
model4.fit(X_train, y_train, batch_size=128, epochs=24, verbose=1)
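One practical consequence of a GRU having fewer gates than an LSTM is a smaller layer for the same number of units. The sketch below uses the standard parameter-count formulas for these layers; the helper names are hypothetical, and the GRU figure assumes the `reset_after=True` variant (the default in recent TensorFlow Keras, which adds a second bias vector), so `model4.summary()` may report a slightly different number on other Keras versions.

```python
def lstm_param_count(input_dim, units):
    """Four weight sets (forget, input, candidate, output), one bias each."""
    return 4 * (units * (input_dim + units) + units)

def gru_param_count(input_dim, units, reset_after=True):
    """Three weight sets (update, reset, candidate).

    Assumption: reset_after=True uses two bias vectors per weight set.
    """
    biases = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + biases)

EMBEDDING_DIM = 128  # input size of the recurrent layer, as in the models above

print(lstm_param_count(EMBEDDING_DIM, 196))  # 254800 (the LSTM(196) layer)
print(lstm_param_count(EMBEDDING_DIM, 64))   # 49408
print(gru_param_count(EMBEDDING_DIM, 64))    # 37248 (the GRU(64) layer)
```

For the same 64 units, the GRU needs roughly three quarters of the LSTM's parameters.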
BGRU(Bidirectional GRU)
Like the Bidirectional LSTM, the Bidirectional GRU (BGRU) is a bidirectional RNN: it is simply a GRU run in both directions.
We applied a BGRU with the Adam optimizer and achieved an accuracy of 79%; training the model for more epochs may improve this.
from keras.layers import SpatialDropout1D, Dropout
model5 = Sequential()
model5.add(Embedding(a, EMBEDDING_DIM, input_length=max_len))
model5.add(SpatialDropout1D(0.2))
model5.add(Bidirectional(GRU(64)))
model5.add(Dropout(0.2))
model5.add(Dense(5, activation='softmax'))
model5.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model5.summary()
model5.fit(X_train, y_train, batch_size=128, epochs=24, verbose=1)
Conclusion
To summarize, all of the models above are types of RNN. A plain RNN is difficult to train on certain problems because of the vanishing gradient problem. To overcome that, we use an LSTM, which adds a special memory unit alongside the standard units, with gates that control what to remember, when to forget and when to produce output. A GRU is a simplified relative of the LSTM: it does not use a separate memory cell and uses fewer gates to control the flow of information.
In this project, we completed an end-to-end NLP workflow using a classic LSTM and achieved a good accuracy of about 80%. We then went further, learning about different types of LSTMs and applying them to the same dataset. We achieved accuracies of about 81% with both the Bidirectional LSTM and the GRU; training the models for a few more epochs may improve the accuracy further.
Overall, the key takeaways from this project are a basic understanding of the different types of LSTMs and how to implement them for a given dataset.