LSTM — Introduction in simple words

Amit Singh Rathore · Nerd For Tech · Sep 19, 2020


LSTM (Long Short-Term Memory) is an improvement over the Recurrent Neural Network (RNN) that addresses the RNN's failure to learn when more than 5–10 discrete time steps separate the relevant input events from the target signals (the vanishing/exploding gradient problem). LSTM does so by introducing a memory unit called the "cell state". Let's look at the diagram below to understand LSTM's basic building blocks.

LSTM Simplified

In the above diagram, the central tanh activation function together with the hidden state and the input constitutes a basic RNN cell. LSTM adds further layers on top of this as an improvement. In the diagram below, the cell state is the horizontal line that runs along the top. This cell state is what puts the "long" in LSTM: it carries information or context across many discrete steps (up to hundreds).

Long and short term memory in LSTM

The long-term memory "cell state" behaves much like the conveyor belt in an automated sorting machine, where parcels are added and removed along the way.

Conveyor belt in an automated sorting machine (image source: Misumi blog)

Note that only two updates happen to the "cell state": an element-wise multiplication (forgetting) and an element-wise addition (writing new information). This keeps the computation along the cell state path minimal, which stabilizes training and reduces the chance of vanishing or exploding gradients.
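
To make those two updates concrete, the cell state update can be written out with the conventional symbols (f_t for the forget gate, i_t for the input gate, \tilde{c}_t for the candidate values, and \odot for element-wise multiplication):

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

One multiplication and one addition are all that happen on this path.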

In the diagram below we can see how an LSTM cell can be broken into layers for better understanding. There are three layers.

different layers in LSTM

Forget layer: This layer filters or removes information/memory from the previous cell state based on the current input and the previous hidden state. This is done via a sigmoid activation function, which squashes its input to values between 0 and 1. When the previous cell state is multiplied element-wise by this output, values close to 0 drop the corresponding memory while values close to 1 let it pass through unchanged.

element-wise multiplication which results in filtering
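
As a minimal sketch of this filtering step (the NumPy arrays, the shapes and the weight names W_f, b_f below are illustrative choices, not taken from the article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
h_prev = rng.standard_normal(4)      # previous hidden state (size 4, arbitrary)
x_t    = rng.standard_normal(3)      # current input (size 3, arbitrary)
c_prev = rng.standard_normal(4)      # previous cell state

W_f = rng.standard_normal((4, 7))    # forget-gate weights (hypothetical values)
b_f = np.zeros(4)                    # forget-gate bias

# Sigmoid squashes to (0, 1); multiplying the cell state by it keeps or drops memory
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
c_filtered = f_t * c_prev            # ~0 drops the entry, ~1 passes it through
```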

Input layer: This again has gating logic (a sigmoid) that filters out unwanted information from the current input. It also has a modulator that keeps the candidate values between -1 and 1, which is achieved using a tanh activation function. The gated candidate is then added to the cell state.

Activation functions tanh and Sigmoid
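
Continuing the same sketch, the input layer combines a sigmoid gate with the tanh "modulator" described above (again, the shapes and the names W_i, W_c are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hx = np.concatenate([rng.standard_normal(4), rng.standard_normal(3)])  # [h_prev, x_t]

W_i, b_i = rng.standard_normal((4, 7)), np.zeros(4)   # input-gate weights (hypothetical)
W_c, b_c = rng.standard_normal((4, 7)), np.zeros(4)   # candidate/modulator weights

i_t     = sigmoid(W_i @ hx + b_i)   # gate in (0, 1): how much of each candidate to keep
c_tilde = np.tanh(W_c @ hx + b_c)   # candidate values scaled to (-1, 1)
update  = i_t * c_tilde             # this gated candidate is added to the cell state
```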

Output layer: This layer takes the current input, the previous hidden state and the updated cell state, and produces the new hidden state (the cell's output). Again a tanh is applied to the cell state to keep its values in the range -1 to 1 before it is gated by a sigmoid.
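
And a sketch of the output step, assuming c_t is the cell state already updated by the forget and input layers (names and shapes are again illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hx  = rng.standard_normal(7)         # concatenated [h_prev, x_t], as before
c_t = rng.standard_normal(4)         # updated cell state

W_o, b_o = rng.standard_normal((4, 7)), np.zeros(4)   # output-gate weights (hypothetical)

o_t = sigmoid(W_o @ hx + b_o)        # decide which parts of the cell state to expose
h_t = o_t * np.tanh(c_t)             # new hidden state: gated, rescaled cell state
```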

Note: Each of these layers (gates) also introduces its own bias term.

Mathematically, each layer can be summarized by the equations shown in the diagram below.

equations for each layer
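
Since that diagram is not reproduced here, the standard form of these equations is given below (x_t is the current input, h_{t-1} the previous hidden state, c_{t-1} the previous cell state, \sigma the sigmoid function and \odot element-wise multiplication):

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)            (forget layer)
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)            (input gate)
\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)     (candidate / modulator)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t         (cell state update)
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)            (output gate)
h_t = o_t \odot \tanh(c_t)                              (hidden state / output)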

If you are interested in implementing an LSTM, you may want to go through this blog.

Happy learning!
