Text LSTM (Beginner Guide)

LSTM in Text Classification (Word Embedding)

Raqueeb Shaikh
Analytics Vidhya
5 min read · Oct 11, 2019


I am new to Machine Learning. I couldn't find a proper visual representation of LSTM for text, or of how the Keras LSTM units are arranged in an LSTM network, so this is my easiest possible explanation of it. Before we get into the article, I would highly recommend you check out the articles below, from which I learned about LSTM.

LSTM (Long Short-Term Memory) networks are advanced versions of RNNs (Recurrent Neural Networks). The major problem with RNNs is that they cannot remember long-term information, as in text or time-series data, where relevant information from the past must be stored for future use. This is caused by the vanishing gradient problem: during backpropagation the gradients shrink as they flow back through the network, so the weights of the initial layers are updated very little or not at all.

An example of the vanishing gradient problem would be if your weights are updated by 0.0000000001; those layers won't learn at all, or will learn very slowly, because of the vanishing gradient.
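As a toy illustration (my own sketch with made-up numbers, not real training code): if each step of backpropagation through time scales the gradient by a small factor, the gradient reaching the early layers becomes vanishingly small.

# Toy illustration of vanishing gradients (assumed values, not real training):
# each backprop step through time multiplies the gradient by a small derivative.
grad = 1.0
per_step_scale = 0.25          # assume each step scales the gradient by 0.25
for _ in range(20):            # 20 time steps back
    grad *= per_step_scale
print(grad)                    # ~9.1e-13: far too small to update early weights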

The above problem is solved in the LSTM, because it has a memory unit (the cell state) which stores the previous information that is relevant to the network.

LSTM Cell

LSTM Cell (Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

Below is a representation of the internals of a single LSTM cell: how the information flows from one gate to another, and also from one cell to the next.

Forget Gate

The purpose of the forget gate is to decide what information to forget from the cell state. The forget gate uses the sigmoid activation function, which outputs values in the range (0, 1): 0 means forget the information completely, and 1 means keep it for the next time step. An example of the forget gate would be as follows:

I live in Pune, but right now I stay in Hyderabad. So you can drop the order at my present location.

In the above sentence we can see that the location Pune should be dropped at the next time step, since the location has changed to Hyderabad. This is a small example of how the forget gate works.
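To make this concrete, here is a minimal NumPy sketch of the standard forget-gate equation, f_t = sigmoid(W_f · [h_prev, x_t] + b_f), as described in the Colah post linked above. The shapes and random weights are my own assumptions, chosen to match the 5-dimensional embeddings and 3 hidden units used later in this article.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_f = rng.normal(size=(3, 8))      # 3 hidden units; [h_prev, x_t] has length 3 + 5
b_f = np.zeros(3)

x_t = rng.normal(size=5)           # current word's 5-dim embedding
h_prev = np.zeros(3)               # previous hidden state

# f_t is in (0, 1) per unit: 0 = forget that part of the cell state, 1 = keep it
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)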

Input Gate

The input gate decides which values we will update in the cell state. It passes the values through the sigmoid activation, which gives the amount of new information to be passed through. Looking at the previous example again:

Alongside the input gate, a tanh layer takes the same information and applies the tanh activation function to it. It proposes the new candidate values that could be added to the next cell state.
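Continuing the same sketch (same assumed shapes and random weights as the forget-gate snippet above), the input gate and the tanh layer compute i_t = sigmoid(W_i · [h_prev, x_t] + b_i) and c_tilde = tanh(W_c · [h_prev, x_t] + b_c):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W_i, b_i = rng.normal(size=(3, 8)), np.zeros(3)   # input gate weights (assumed shapes)
W_c, b_c = rng.normal(size=(3, 8)), np.zeros(3)   # candidate (tanh) layer weights

x_t, h_prev = rng.normal(size=5), np.zeros(3)
v = np.concatenate([h_prev, x_t])

i_t = sigmoid(W_i @ v + b_i)       # in (0, 1): how much of the candidate to let in
c_tilde = np.tanh(W_c @ v + b_c)   # in (-1, 1): proposed new cell-state values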

Output Gate

The output gate applies the sigmoid activation function and tells us which parts of the cell state to output in the final hidden state:
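The output gate follows the same pattern, o_t = sigmoid(W_o · [h_prev, x_t] + b_o); only the weights differ (again, shapes and values assumed as in the sketches above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
W_o, b_o = rng.normal(size=(3, 8)), np.zeros(3)   # output gate weights (assumed shapes)
x_t, h_prev = rng.normal(size=5), np.zeros(3)

# o_t decides, per unit, how much of the (squashed) cell state to expose
o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)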

Updated/Next Cell State

This is the final updated cell state, which is passed to the next cell or taken as output:
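With toy values (made up for illustration, like the example values in the diagrams below), the update C_t = f_t * C_prev + i_t * c_tilde looks like this:

import numpy as np

f_t = np.array([0.1, 0.9, 0.5])        # forget gate output (toy values)
i_t = np.array([0.8, 0.2, 0.6])        # input gate output
c_tilde = np.array([0.7, -0.3, 0.4])   # candidate values from the tanh layer
C_prev = np.array([1.0, -0.5, 0.2])    # previous cell state

# Forget part of the old state, then add the gated candidate
C_t = f_t * C_prev + i_t * c_tilde
print(C_t)                             # [0.66, -0.51, 0.34]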

Updated/Next Hidden State

This is the final hidden state, which is given to the next cell or taken as the output:
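Continuing the toy numbers above, the hidden state is h_t = o_t * tanh(C_t):

import numpy as np

o_t = np.array([0.9, 0.3, 0.7])        # output gate (toy values)
C_t = np.array([0.66, -0.51, 0.34])    # updated cell state from the previous snippet

# Squash the cell state into (-1, 1), then let the output gate filter it
h_t = o_t * np.tanh(C_t)
print(h_t)                             # approximately [0.52, -0.14, 0.23]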

LSTM Representation for 3 Hidden Units

Sentence — Good Day
Embedding Dimension for each word = 5
LSTM Hidden Unit = 3

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Flatten, Dense

inp = Input(shape=(2,))                  # 2 time steps: 'Good' and 'Day'
x = Embedding(50000, 5)(inp)             # vocabulary of 50,000 words, 5-dim embeddings
x = LSTM(3, return_sequences=True)(x)    # 3 hidden units, output at every time step
x = Flatten()(x)
x = Dense(1, activation="sigmoid")(x)    # binary classification head
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_13 (InputLayer)        (None, 2)                 0
_________________________________________________________________
embedding_13 (Embedding)     (None, 2, 5)              250000
_________________________________________________________________
lstm_16 (LSTM)               (None, 2, 3)              108
_________________________________________________________________
flatten_5 (Flatten)          (None, 6)                 0
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 7
=================================================================
Total params: 250,115
Trainable params: 250,115
Non-trainable params: 0
_________________________________________________________________
None
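As a quick sanity check on the summary above, the LSTM layer's 108 parameters come from the four weight sets (forget, input, candidate, and output): each sees the 5-dim embedding plus the 3-dim previous hidden state, and each has a bias.

# 4 weight sets, each of shape (emb_dim + units) x units, plus a bias of length units
units, emb_dim = 3, 5
params = 4 * ((emb_dim + units) * units + units)
print(params)   # 108, matching lstm_16 in the summary above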

The diagram below shows how the single-cell calculation is done in an LSTM with 3 hidden units.

I haven't calculated the actual values in the LSTM; these are just example values.

The sentence that we have is ‘Good Day’, therefore two LSTM cells are required to pass the information to the next cell. The diagram below represents how the information is passed from one cell to another.

In the above diagram you can see how ‘return_sequences’ and ‘return_state’ output their values.

Return Sequences

return_sequences = True

It outputs the hidden state at every time step, as shown in the diagram above.

return_sequences = False

It outputs the hidden state of the last time step only, as shown in the diagram above.

Return State

return_state = True

It outputs the hidden state of the last time step together with the last cell state, as shown in the diagram above.
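Here is a small sketch of the three modes in Keras (layer sizes as in the model above; with return_state=True, the LSTM returns the last output together with the last hidden and cell states):

from keras.layers import Input, Embedding, LSTM

inp = Input(shape=(2,))                        # 2 time steps, as in 'Good Day'
emb = Embedding(50000, 5)(inp)                 # (None, 2, 5)

seq = LSTM(3, return_sequences=True)(emb)      # (None, 2, 3): hidden state at every step
last = LSTM(3)(emb)                            # (None, 3): hidden state of last step only
out, h, c = LSTM(3, return_state=True)(emb)    # out == h: (None, 3); c is (None, 3)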

If you find any information in this article that I might have gotten wrong, please let me know at raqueebilahi@gmail.com or in the comment section.

I am also looking for a job as a Data Scientist in Pune, India. If you have a vacancy, I would love to get in touch with you. Please let me know at raqueebilahi@gmail.com.
