What does it mean by Bidirectional LSTM?

Published in

Analytics Vidhya

7 min read · Feb 9, 2021

A bidirectional LSTM changes the usual approach by reading the input from both directions, which lets it use context from both ends of a long sequence.

In my previous article we discussed RNN, LSTM and GRU. Some limitations still persist with LSTM, because it is not able to remember context over a very long period of time.

Sequential processing in LSTM, from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

You can see in this LSTM architecture that information still has to pass along a long sequential path. LSTM and GRU were introduced to overcome the vanishing gradient problem and to give the network memory over sequential data, but both architectures still process the sequence one step after another. Thus, the vanishing gradient problem is reduced but still persists. Also, LSTM and GRU can remember sequences of 10s and 100s of steps, but not 1000s or more.
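To get a feel for why very long sequential paths are a problem, here is a toy calculation. It assumes every time step scales the gradient by the same constant factor (0.9 is an arbitrary value picked for illustration; in a real LSTM the factor depends on the gates and changes at every step), so treat it as a rough sketch and not as a model of LSTM training.

```python
# Toy illustration: assume every time step scales the gradient by the same
# factor (0.9 is an arbitrary, hypothetical value).
per_step_factor = 0.9

def gradient_after(steps, factor=per_step_factor):
    """Return the surviving fraction of a gradient after `steps` time steps."""
    return factor ** steps

for steps in (10, 100, 1000):
    print(f"{steps:>4} steps -> {gradient_after(steps):.3e}")
```

Under this assumption the signal is still usable after 10 steps, very small after 100, and effectively zero after 1000.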

Bidirectional Network

Now, when we are dealing with long sequences of data and the model is required to learn the relationship between future and past words as well, we need to feed the data in that manner. To solve this problem, the bidirectional network was introduced. We can use a bidirectional network with LSTM as well as RNN but due to limitations of

In a bidirectional LSTM we give the input in both directions: from left to right and from right to left. Note that this is not backpropagation; it is only the input that is fed from both sides. So the question is how the data is combined in the output if we have two inputs.

In a normal LSTM network we take the output directly, as shown in the first figure. In a bidirectional LSTM network, the outputs of the forward and backward layers at each step are combined (Keras's Bidirectional wrapper concatenates them by default) and given to an activation layer, which is a neural network layer, and the output of this layer is what we use. This output contains information about both past and future words.
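The idea of combining the two directions can be sketched in a few lines of NumPy. This is a minimal illustration, not the Keras implementation: it uses a plain tanh RNN cell instead of an LSTM cell to keep the code short, and the sizes and random weights are made up for the example.

```python
import numpy as np

def simple_rnn(inputs, w_x, w_h, b):
    """Run a plain tanh RNN over `inputs` (time, features); return all hidden states."""
    hidden = np.zeros(w_h.shape[0])
    states = []
    for x_t in inputs:
        hidden = np.tanh(x_t @ w_x + hidden @ w_h + b)
        states.append(hidden)
    return np.stack(states)

def bidirectional_rnn(inputs, fwd_params, bwd_params):
    """Concatenate a left-to-right pass with a right-to-left pass, step by step."""
    forward = simple_rnn(inputs, *fwd_params)
    # Feed the reversed sequence, then flip the states back so that
    # backward[t] lines up with forward[t].
    backward = simple_rnn(inputs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=-1)

rng = np.random.default_rng(0)
time_steps, features, units = 5, 3, 4  # hypothetical sizes

def init_params():
    """Random input weights, recurrent weights and a zero bias for one direction."""
    return (rng.normal(size=(features, units)),
            rng.normal(size=(units, units)),
            np.zeros(units))

fwd, bwd = init_params(), init_params()
sequence = rng.normal(size=(time_steps, features))
outputs = bidirectional_rnn(sequence, fwd, bwd)
print(outputs.shape)  # (5, 8)
```

Each time step now carries `units` values that have seen the past and `units` values that have seen the future, which is why the output width doubles.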

Let’s take an example. Assume we have a sentence like

Here we cannot predict the next word with a normal RNN, but this can be solved with a bidirectional RNN. Also, the RNN can be an LSTM or a GRU.

Implementation of Bidirectional RNN in TensorFlow (Keras)

TensorFlow implementation

# Import Necessary Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score
import tensorflow
from tensorflow.keras.layers import Embedding,LSTM,Dense,Bidirectional
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.models import Sequential
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Load the dataset. It can be found on Kaggle (Fake news classifier data):
# https://www.kaggle.com/c/fake-news/data
data = pd.read_csv('train.csv')
# Check how many values are null; we will have to drop them.
data.isnull().sum(axis=0)

As a result we get the null counts below, so we need to drop those rows.

id           0
title      558
author    1957
text        39
label        0
dtype: int64

df = data.dropna()

We have dropped all rows with null values so that they cannot affect the accuracy of the model. Now we will define X and Y as the independent and dependent variables.

x = df.drop('label',axis=1)
y = df['label']

Now, the key part of NLP is text preprocessing, which we perform on the independent variable using the NLTK library. We use the re library to remove punctuation, then filter the words against a stop-word list, and then apply stemming.

sentences = x.copy()
sentences.reset_index(inplace=True)
nltk.download('stopwords')

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
corpus = []
for i in range(0, len(sentences)):
    # Keep letters only, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', sentences['title'][i])
    review = review.lower()
    review = review.split()
    # Drop stop words and stem what remains
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

# Here we can see that corpus contains the words after preprocessing.
corpus[:10]

['hous dem aid even see comey letter jason chaffetz tweet',
 'flynn hillari clinton big woman campu breitbart',
 'truth might get fire',
 'civilian kill singl us airstrik identifi',
 'iranian woman jail fiction unpublish stori woman stone death adulteri',
 'jacki mason hollywood would love trump bomb north korea lack tran bathroom exclus video breitbart',
 'beno hamon win french socialist parti presidenti nomin new york time',
 'back channel plan ukrain russia courtesi trump associ new york time',
 'obama organ action partner soro link indivis disrupt trump agenda',
 'bbc comedi sketch real housew isi caus outrag']

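Now we encode the data with one_hot. Since we have a list of words, each word gets an integer index below vocab_size. Despite its name, Keras's one_hot returns integer word indices computed by hashing, not one-hot vectors.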

vocab_size = 5000
onehot = [one_hot(words,vocab_size) for words in corpus]
onehot[:10]

[[1076, 2004, 3855, 3964, 1251, 1177, 2432, 4548, 4821, 3157],
 [3090, 3921, 277, 3803, 561, 2494, 2349],
 [792, 2085, 3099, 206],
 [1083, 2836, 2939, 3433, 2700, 2344],
 [4308, 561, 750, 666, 2017, 2368, 561, 415, 869, 208],
 [4623,
  3529,
  4621,
  2659,
  3924,
  4115,
  2845,
  2475,
  4603,
  4988,
  1575,
  959,
  92,
  1630,
  2349],
 [4348, 1735, 189, 352, 3582, 757, 60, 4393, 373, 1561, 684],
 [1554, 61, 1640, 1548, 2048, 1673, 4115, 2910, 373, 1561, 684],
 [2484, 2659, 3720, 3690, 509, 4227, 4554, 310, 4115, 1168],
 [653, 1048, 646, 2146, 2026, 1062, 3558, 4097]]
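To make the encoding step less of a black box, here is a rough sketch of the hashing idea behind it. The helper names `hashed_index` and `encode` are made up for this example, and MD5 is my own choice to keep the result repeatable between runs; the hash used inside Keras may be different, so the numbers will not match the output above.

```python
import hashlib

def hashed_index(word, vocab_size):
    """Map a word to an integer in [1, vocab_size - 1]; 0 is left free for padding."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (vocab_size - 1) + 1

def encode(sentence, vocab_size=5000):
    """Encode a whitespace-separated sentence as a list of hashed word indices."""
    return [hashed_index(word, vocab_size) for word in sentence.split()]

encoded = encode("truth might get fire")
print(encoded)
```

The same word always maps to the same index, but two different words can land on the same index. A larger vocab_size makes such collisions less likely.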

The next step is padding. The sentences differ in length, so we pad them to make them equal in length. We can use pre or post padding.

length = 30
embedding = pad_sequences(onehot,maxlen=length,padding='pre')
embedding[:10]

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0, 1076, 2004,
        3855, 3964, 1251, 1177, 2432, 4548, 4821, 3157],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0, 3090, 3921,  277, 3803,  561, 2494, 2349],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,  792, 2085, 3099,  206],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0, 1083, 2836, 2939, 3433, 2700, 2344],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0, 4308,  561,
         750,  666, 2017, 2368,  561,  415,  869,  208],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0, 4623, 3529, 4621, 2659, 3924, 4115, 2845,
        2475, 4603, 4988, 1575,  959,   92, 1630, 2349],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0, 4348, 1735,  189,
         352, 3582,  757,   60, 4393,  373, 1561,  684],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0, 1554,   61, 1640,
        1548, 2048, 1673, 4115, 2910,  373, 1561,  684],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0, 2484, 2659,
        3720, 3690,  509, 4227, 4554,  310, 4115, 1168],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
         653, 1048,  646, 2146, 2026, 1062, 3558, 4097]])
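The difference between pre and post padding is easy to see on a single short title. The function below is a plain-Python sketch written for this article (the name `pad_sequence` is hypothetical), not the Keras implementation. It assumes that sequences longer than `maxlen` keep their last `maxlen` items.

```python
def pad_sequence(sequence, maxlen, padding="pre", value=0):
    """Pad (or cut) one list of word indices to exactly `maxlen` items."""
    # Assumption of this sketch: over-long sequences keep their last `maxlen` items.
    sequence = list(sequence)[-maxlen:]
    fill = [value] * (maxlen - len(sequence))
    return fill + sequence if padding == "pre" else sequence + fill

title = [792, 2085, 3099, 206]  # third title from the encoded output above
print(pad_sequence(title, 8, padding="pre"))   # [0, 0, 0, 0, 792, 2085, 3099, 206]
print(pad_sequence(title, 8, padding="post"))  # [792, 2085, 3099, 206, 0, 0, 0, 0]
```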

From the outputs above we can see how our sentences are preprocessed for the LSTM input. Now we can build a model and train it on our data.

embedding_vector_features = 40
model = Sequential()
model.add(Embedding(vocab_size,embedding_vector_features,input_length=length))
model.add(Bidirectional(LSTM(100)))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

# Model Summary
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 30, 40)            200000    
_________________________________________________________________
bidirectional (Bidirectional (None, 200)               112800    
_________________________________________________________________
dense (Dense)                (None, 1)                 201       
=================================================================
Total params: 313,001
Trainable params: 313,001
Non-trainable params: 0
_________________________________________________________________
None

# Split the data into training and testing datasets
X = np.array(embedding)
Y = np.array(y)
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.33,random_state=42)
model.fit(X_train,y_train,batch_size=64,epochs=20,validation_data=(X_test,y_test))
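The parameter counts in the model summary above can be checked by hand. The short calculation below uses the standard LSTM parameter formula (four weight blocks: three gates plus the candidate cell state) with the sizes from this article. The variable names are mine.

```python
vocab_size, embedding_dim, lstm_units = 5000, 40, 100

# Embedding: one vector of size embedding_dim per vocabulary index
embedding_params = vocab_size * embedding_dim

# One LSTM has 4 weight blocks, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Bidirectional wraps two independent LSTMs (forward and backward)
bidirectional_params = 2 * lstm_params

# Dense: the concatenated 2 * lstm_units output feeds one sigmoid unit, plus a bias
dense_params = 2 * lstm_units * 1 + 1

total = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total)
# 200000 112800 201 313001
```

The bidirectional layer has exactly twice the parameters of a single LSTM(100), and its output width of 200 is the two 100-unit outputs joined together.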

After training for 20 epochs, I got the result below.

Epoch 1/20
192/192 [==============================] - 9s 45ms/step - loss: 0.3227 - accuracy: 0.8448 - val_loss: 0.2034 - val_accuracy: 0.9152
Epoch 2/20
192/192 [==============================] - 8s 39ms/step - loss: 0.1433 - accuracy: 0.9411 - val_loss: 0.1821 - val_accuracy: 0.9238
Epoch 3/20
192/192 [==============================] - 8s 44ms/step - loss: 0.0897 - accuracy: 0.9668 - val_loss: 0.2037 - val_accuracy: 0.9218
Epoch 4/20
192/192 [==============================] - 10s 54ms/step - loss: 0.0573 - accuracy: 0.9803 - val_loss: 0.2556 - val_accuracy: 0.9193
Epoch 5/20
192/192 [==============================] - 10s 52ms/step - loss: 0.0307 - accuracy: 0.9903 - val_loss: 0.3273 - val_accuracy: 0.9158
Epoch 6/20
192/192 [==============================] - 9s 49ms/step - loss: 0.0170 - accuracy: 0.9956 - val_loss: 0.3483 - val_accuracy: 0.9168
Epoch 7/20
192/192 [==============================] - 9s 47ms/step - loss: 0.0138 - accuracy: 0.9950 - val_loss: 0.4654 - val_accuracy: 0.9099
Epoch 8/20
192/192 [==============================] - 10s 55ms/step - loss: 0.0066 - accuracy: 0.9980 - val_loss: 0.5041 - val_accuracy: 0.9117
Epoch 9/20
192/192 [==============================] - 11s 58ms/step - loss: 0.0063 - accuracy: 0.9987 - val_loss: 0.5213 - val_accuracy: 0.9100
Epoch 10/20
192/192 [==============================] - 14s 72ms/step - loss: 0.0042 - accuracy: 0.9989 - val_loss: 0.5411 - val_accuracy: 0.9079
Epoch 11/20
192/192 [==============================] - 14s 73ms/step - loss: 0.0053 - accuracy: 0.9984 - val_loss: 0.5063 - val_accuracy: 0.9122
Epoch 12/20
192/192 [==============================] - 13s 68ms/step - loss: 0.0056 - accuracy: 0.9980 - val_loss: 0.5529 - val_accuracy: 0.9087
Epoch 13/20
192/192 [==============================] - 13s 65ms/step - loss: 0.0054 - accuracy: 0.9982 - val_loss: 0.5330 - val_accuracy: 0.9095
Epoch 14/20
192/192 [==============================] - 13s 69ms/step - loss: 0.0021 - accuracy: 0.9995 - val_loss: 0.5599 - val_accuracy: 0.9118
Epoch 15/20
192/192 [==============================] - 11s 55ms/step - loss: 2.7169e-04 - accuracy: 1.0000 - val_loss: 0.6287 - val_accuracy: 0.9099
Epoch 16/20
192/192 [==============================] - 10s 51ms/step - loss: 1.1285e-04 - accuracy: 1.0000 - val_loss: 0.6487 - val_accuracy: 0.9097
Epoch 17/20
192/192 [==============================] - 11s 58ms/step - loss: 8.3286e-05 - accuracy: 1.0000 - val_loss: 0.6669 - val_accuracy: 0.9097
Epoch 18/20
192/192 [==============================] - 9s 48ms/step - loss: 6.5950e-05 - accuracy: 1.0000 - val_loss: 0.6818 - val_accuracy: 0.9092
Epoch 19/20
192/192 [==============================] - 9s 49ms/step - loss: 5.4030e-05 - accuracy: 1.0000 - val_loss: 0.6961 - val_accuracy: 0.9090
Epoch 20/20
192/192 [==============================] - 11s 59ms/step - loss: 4.4982e-05 - accuracy: 1.0000 - val_loss: 0.7104 - val_accuracy: 0.9094

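Finally, we can test our model on the test data and check the confusion matrix.

# predict_classes was removed in newer TensorFlow releases; threshold the sigmoid output instead
y_pred = (model.predict(X_test) > 0.5).astype('int32').ravel()
CM = confusion_matrix(y_test,y_pred)
score = accuracy_score(y_test,y_pred)
print(CM)
print(score)

[[3122  297]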
 [ 250 2366]]
0.9093620546810274
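The accuracy score can be recomputed directly from the confusion matrix, and the same four numbers also give precision and recall. This is a small worked example on the values printed above. It relies on the scikit-learn convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix printed above; scikit-learn puts true labels on rows
# and predicted labels on columns.
cm = [[3122, 297],
      [250, 2366]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision = tp / (tp + fp)  # of the titles predicted as class 1, how many were right
recall = tp / (tp + fn)     # of the true class-1 titles, how many were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```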

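We got an accuracy of about 90%. Note from the training log that validation loss is lowest at epoch 2 (0.1821) and rises to 0.7104 by epoch 20 while training accuracy reaches 1.0000, so the model starts overfitting well before the last epoch. We can improve the result by tuning parameters like vocab_size, sentence length, LSTM layer size, and number of epochs.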
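One simple way to choose the number of epochs is a patience rule on the validation loss. The sketch below replays the val_loss values from the training log above through such a rule in plain Python. The function name `early_stop` and the patience of 3 are my own choices for illustration. In Keras, the `EarlyStopping` callback with `monitor='val_loss'` and a `patience` value follows the same idea during training.

```python
# Validation loss per epoch, copied from the training log above
val_loss = [0.2034, 0.1821, 0.2037, 0.2556, 0.3273, 0.3483, 0.4654, 0.5041,
            0.5213, 0.5411, 0.5063, 0.5529, 0.5330, 0.5599, 0.6287, 0.6487,
            0.6669, 0.6818, 0.6961, 0.7104]

def early_stop(losses, patience=3):
    """Return (best_epoch, stop_epoch), both 1-based, for a patience rule."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, len(losses)

best_epoch, stop_epoch = early_stop(val_loss)
print(best_epoch, stop_epoch)  # 2 5
```

With this rule, training on the run above would have stopped after 5 epochs instead of 20, keeping the weights from epoch 2 as the best candidate.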

References

[1] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory (1997), Neural Computation

[2] http://colah.github.io/posts/2015-08-Understanding-LSTMs/

[3] https://www.youtube.com/watch?v=MXPh_lMRwAI&list=PLZoTAELRMXVMdJ5sqbCK2LiM0HhQVWNzm&index=22

[4] https://medium.com/@raghavaggarwal0089/bi-lstm-bc3d68da8bd0

Written by Jaimin Mungalpara