Stuti Sehgal · Published in DataX Journal · 4 min read · Nov 10, 2020

Juggling with Long Short-Term Memory

Sometimes, we only need to look at recent information to perform the present task. LSTMs find a wide spectrum of applications in Natural Language Processing and in working with time-series data.
Example: Consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “The Sun ☀ rises in the East.”, it is pretty obvious that it is going to be East ➡, isn't it?
In such cases, where the gap between the relevant information and the place where it is needed is small, Recurrent Neural Networks can easily learn how to use past information.

Hold up! Then why are we learning about LSTMs❓

Now, consider trying to predict the last word in the text “I grew up in France… I still have a home in Paris.”
Here, the recent information (a home in Paris) needs the context of France from much further back. It is entirely possible for the gap between the relevant information and the point where it is needed to become very large.
RNNs fall short here: as that gap grows, they become unable to connect the information, so we use LSTMs.

LSTMs are a special kind of RNN, capable of learning long-term dependencies.

Structure of LSTMs -

LSTMs, like RNNs, also have a chain-like structure, but the repeating module is built differently: instead of a single tanh layer, there are four interacting layers inside a single cell, i.e. the memory unit.

🔑 The key to LSTMs is the cell state: the horizontal line running along the top of the cell, which runs straight down the entire chain of repeating modules, making it easy for information to flow along it unchanged.

Cell State

An LSTM has the ability to remove or add information to the cell state, regulated by structures called gates. A gate is composed of a sigmoid neural network layer and a pointwise multiplication operation that scales how much information gets through.
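To make the gating idea concrete, here is a minimal NumPy sketch of a single gate (the names, shapes and random values are purely illustrative, not taken from any particular framework): the sigmoid layer outputs numbers between 0 and 1, and the pointwise multiplication uses them to decide how much of each component gets through.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
W = rng.standard_normal((hidden_size, hidden_size + input_size))  # gate weights
b = np.zeros(hidden_size)                                         # gate bias

h_prev = rng.standard_normal(hidden_size)   # h(t-1): previous hidden state
x_t = rng.standard_normal(input_size)       # x(t): current input

gate = sigmoid(W @ np.concatenate([h_prev, x_t]) + b)  # each entry is in (0, 1)
information = rng.standard_normal(hidden_size)
scaled = gate * information   # pointwise multiply: 0 blocks a component, 1 lets it through

Each of the three gates below is exactly this pattern, just with its own weights.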

An LSTM has three such gates to protect and control the cell state.
🎯 Forget Gate layer-
* Decides what information gets thrown away from the cell state, based on the hidden state from the previous block, i.e. h(t-1), and the current input, i.e. x(t).
Example- Vignesh is a year older than Maria. Maria is French.
Here the subject changes from Vignesh to Maria, so the forget gate can decide to forget the information about Vignesh.

Forget Gate
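In the notation of Colah's post (linked in the references below), which this article loosely follows, the forget gate is a sigmoid layer applied to [h(t-1), x(t)]:

f(t) = sigmoid( W_f · [h(t-1), x(t)] + b_f )

Each entry of f(t) lies between 0 and 1: a 1 means “keep this part of the cell state completely”, a 0 means “forget it completely”.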

🎯 Input Gate layer-
* Decides what new information is going to get stored in the cell state.
* A sigmoid neural network layer decides which values are to be updated.
* A tanh layer creates a vector of new candidate values, C̃(t), which lie between -1 and 1 and could be added to the state.
To help with the vanishing gradient problem posed by the sigmoid function, we want an activation whose gradient stays usable over a longer range before decaying to 0, so that the LSTM retains the updated information and values. Hence, we use a tanh layer for the candidates.
Example- We would want to add the gender of Maria to the cell state to replace Vignesh’s gender.

Input Gate
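In the same notation, the input gate and the candidate values are:

i(t) = sigmoid( W_i · [h(t-1), x(t)] + b_i )
C̃(t) = tanh( W_C · [h(t-1), x(t)] + b_C )

The sigmoid layer i(t) picks which entries to update, and the tanh layer proposes the candidate values C̃(t) in the range (-1, 1).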

🎯 Cell State Update-
* Update the old cell state C(t-1) to the new cell state C(t).
* A pointwise multiplication of the old state C(t-1) by f(t), which lies between 0 and 1, forgets the things we decided to forget in the forget gate layer.
* Then we add i(t)*C̃(t), the new candidate values scaled by how much we decided to update each state value.
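Written out in the same notation, the update is:

C(t) = f(t) * C(t-1) + i(t) * C̃(t)

where * denotes pointwise multiplication.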

🎯 Output Gate layer-
* The final output is based on our cell state, but it is a filtered version of it.
* First, a sigmoid layer decides what parts of the cell state we are going to output. Then, we put the cell state through tanh and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

Output Gate
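In the same notation, the output gate and the new hidden state are:

o(t) = sigmoid( W_o · [h(t-1), x(t)] + b_o )
h(t) = o(t) * tanh( C(t) )

Putting all the pieces together, here is a minimal NumPy sketch of one forward step of an LSTM cell. It simply mirrors the equations above; the names, shapes and random initialisation are illustrative and not taken from any particular framework.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One forward step of an LSTM cell, following the gate equations above."""
    z = np.concatenate([h_prev, x_t])      # [h(t-1), x(t)]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate values, in (-1, 1)
    c_t = f_t * c_prev + i_t * c_tilde     # updated cell state
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # filtered cell state becomes the new hidden state
    return h_t, c_t

# Illustrative sizes: input of size 3, hidden state of size 4.
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
weights = [0.1 * rng.standard_normal((n_h, n_h + n_in)) for _ in range(4)]
biases = [np.zeros(n_h) for _ in range(4)]
h, c = np.zeros(n_h), np.zeros(n_h)
x = rng.standard_normal(n_in)

h, c = lstm_cell_step(x, h, c,
                      weights[0], biases[0],   # forget gate
                      weights[1], biases[1],   # input gate
                      weights[2], biases[2],   # candidate layer
                      weights[3], biases[3])   # output gate

Running this step over a sequence, feeding each new (h, c) back in, is what lets the cell carry context such as “France” forward until it is needed to predict “Paris”.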

References:

https://colah.github.io/posts/2015-08-Understanding-LSTMs/
🌈 https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

Thank you!
