What is LSTM, peephole LSTM and GRU?

Jaimin Mungalpara
Published in Nerd For Tech
5 min read · Feb 5, 2021


Long Short-Term Memory (LSTM) was introduced by Hochreiter & Schmidhuber (1997) and has since been refined by many researchers. An LSTM is a special kind of RNN that can remember long-term dependencies. LSTMs are specifically designed to avoid the problems faced by a plain RNN. You can learn about RNNs in my previous article, Understanding RNN. It is the architecture itself that makes an LSTM good at remembering long-term dependencies. In a simple RNN, the repeating module is a simple neural network, such as a single tanh or ReLU layer, as shown in the figure below.

The repeating module in a standard RNN contains a single layer. Image taken from https://colah.github.io/
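The single-layer repeating module described above can be sketched in a few lines of plain Python. This is only an illustration: the weight values and the sizes (2 hidden units, 1 input feature) are made up, and the function name `rnn_step` is hypothetical.

```python
import math

def rnn_step(x_t, h_prev, W, b):
    """One step of a plain RNN: h_t = tanh(W . [h_prev, x_t] + b)."""
    concat = h_prev + x_t  # list concatenation plays the role of [h_{t-1}, x_t]
    return [math.tanh(sum(w * v for w, v in zip(row, concat)) + b_i)
            for row, b_i in zip(W, b)]

# Hypothetical toy parameters: 2 hidden units, 1 input feature.
W = [[0.5, -0.3, 0.8],
     [0.1, 0.4, -0.6]]
b = [0.0, 0.1]

h = [0.0, 0.0]
for x in ([1.0], [0.5], [-1.0]):  # a short made-up input sequence
    h = rnn_step(x, h, W, b)      # the same module is reused at every time step
print(h)
```

The same function is applied at every time step, which is what the "repeating module" in the figure refers to.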

An LSTM has the same kind of chain structure, but the repeating module is not a single simple layer. Instead, each repeating module contains four neural network layers that interact with each other.

LSTM Gates

The notation in the figure above can be read as follows. Yellow boxes represent neural network layers. Pink circles are pointwise operations, such as vector multiplication or addition. Lines merging denote concatenation, and a line splitting means the same content is copied and sent to different locations.

Image taken from https://colah.github.io/

There are three gates in the LSTM architecture:

  1. Forget Gate
  2. Input Gate
  3. Output Gate

The Core Idea Behind LSTM

The top of the diagram contains the cell state. It is a path along which information can flow easily, with only some minor linear operations. Gates can add information to the cell state or remove information from it.

Each gate is composed of a sigmoid neural network layer and a pointwise multiplication operation. The sigmoid outputs a value between 0 and 1: a value of 0 means nothing is passed through, and a value of 1 means everything is passed through.
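As a small illustration of this gating idea, the sketch below scales a vector by sigmoid outputs. The helper name `apply_gate` and the numbers are made up for the example; in a real LSTM the gate inputs come from a learned layer rather than being written by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def apply_gate(gate_logits, values):
    """Scale each value by a sigmoid gate, which always lies strictly between 0 and 1."""
    return [sigmoid(g) * v for g, v in zip(gate_logits, values)]

values = [2.0, -3.0, 5.0]
# A large positive logit opens the gate (close to 1), a large negative one
# closes it (close to 0), and a logit of 0 lets exactly half through.
gated = apply_gate([10.0, -10.0, 0.0], values)
print(gated)
```

The first value passes almost unchanged, the second is almost fully blocked, and the third is halved.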

Forget Gate

The first step in an LSTM is to decide which information will be kept in the cell state and which will be thrown away. This decision is made by the forget gate.

It takes ht-1 and xt as input and outputs a number between 0 and 1 for each value in the cell state Ct-1. This output, ft, is then pointwise multiplied with Ct-1, which decides which information passes through. If we are working with context-based data and the context changes, the forget gate discards the information that is no longer relevant to the new context.
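The forget gate can be sketched as follows, assuming the common formulation ft = sigmoid(Wf · [ht-1, xt] + bf). The weights, the sizes (2 hidden units, 1 input feature) and the helper names are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, v):
    """Affine map W . v + b on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def forget_gate(h_prev, x_t, W_f, b_f):
    """f_t = sigmoid(W_f . [h_prev, x_t] + b_f), one value per cell-state entry."""
    return [sigmoid(z) for z in dense(W_f, b_f, h_prev + x_t)]

# Hypothetical toy values.
h_prev = [0.2, -0.1]
x_t = [1.0]
C_prev = [0.7, -1.5]
W_f = [[0.5, 0.1, -2.0],
       [0.3, -0.2, 4.0]]
b_f = [0.0, 0.0]

f_t = forget_gate(h_prev, x_t, W_f, b_f)
kept = [f * c for f, c in zip(f_t, C_prev)]  # pointwise multiply with C_{t-1}
print(f_t, kept)
```

With these made-up weights the first cell-state entry is mostly forgotten (its gate is below 0.5) while the second is mostly kept.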

Input Gate

This step decides which new information is going to be stored in the cell state. The operation is done in two steps. First, a sigmoid layer, the input gate, decides which values we will update. Next, a tanh layer creates a vector of new candidate values, C̃t, that could be added to the state.

Now we have to update the old cell state, Ct-1, to the new cell state, Ct.

We multiply the old state Ct-1 by ft, forgetting the things that are no longer required. Then we add it*C̃t, the candidate values scaled by how much we decided to update each one. This adds the new information to the context that is remembered and passed on to the next step.
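The two steps and the cell-state update can be sketched like this, assuming the usual formulation: it = sigmoid(Wi · [ht-1, xt] + bi), a tanh layer for the candidate values, and Ct = ft * Ct-1 + it * candidate. All function names and numbers below are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, v):
    """Affine map W . v + b on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def input_gate_and_candidate(h_prev, x_t, W_i, b_i, W_c, b_c):
    concat = h_prev + x_t
    i_t = [sigmoid(z) for z in dense(W_i, b_i, concat)]        # which values to update
    c_tilde = [math.tanh(z) for z in dense(W_c, b_c, concat)]  # candidate values in (-1, 1)
    return i_t, c_tilde

def update_cell(C_prev, f_t, i_t, c_tilde):
    """C_t = f_t * C_prev + i_t * c_tilde, all pointwise."""
    return [f * c + i * g for f, c, i, g in zip(f_t, C_prev, i_t, c_tilde)]

# Hand-picked gate values make the arithmetic easy to follow:
# entry 0 is fully kept and not updated, entry 1 is fully replaced.
print(update_cell([1.0, 2.0], [1.0, 0.0], [0.0, 1.0], [0.5, -0.5]))  # [1.0, -0.5]

# The same update with gates computed from made-up weights.
i_t, c_tilde = input_gate_and_candidate(
    [0.2, -0.1], [1.0],
    [[0.4, 0.0, 1.0], [0.0, 0.3, -1.0]], [0.0, 0.0],
    [[0.1, 0.2, 0.5], [0.3, -0.1, -0.5]], [0.0, 0.0])
C_t = update_cell([0.7, -1.5], [0.9, 0.1], i_t, c_tilde)
print(C_t)
```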

Output Gate

At this stage, we have to decide what we are going to output. The output is based on our cell state, but it is a filtered version. First, we run a sigmoid layer that decides which parts of the cell state we’re going to output. Then we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we output only the parts we decided to.

Here we can see the whole idea of LSTM in one gif.
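With the output gate in place, one full LSTM step can be sketched end to end. This is a minimal plain-Python illustration under the standard formulation (forget gate, input gate, candidate values, cell update, output gate). The parameter dictionary, its key names and the toy weights are assumptions made for the example, not part of any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, v):
    """Affine map W . v + b on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def lstm_step(x_t, h_prev, C_prev, p):
    concat = h_prev + x_t                                   # [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in dense(p["W_f"], p["b_f"], concat)]      # forget gate
    i_t = [sigmoid(z) for z in dense(p["W_i"], p["b_i"], concat)]      # input gate
    c_tilde = [math.tanh(z) for z in dense(p["W_c"], p["b_c"], concat)]  # candidates
    C_t = [f * c + i * g for f, c, i, g in zip(f_t, C_prev, i_t, c_tilde)]
    o_t = [sigmoid(z) for z in dense(p["W_o"], p["b_o"], concat)]      # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, C_t)]      # filtered cell state
    return h_t, C_t

# Hypothetical toy parameters: 2 hidden units, 1 input feature.
params = {
    "W_f": [[0.5, 0.1, -0.2], [0.3, -0.2, 0.4]], "b_f": [0.0, 0.0],
    "W_i": [[0.4, 0.0, 0.6], [0.0, 0.3, -0.5]], "b_i": [0.0, 0.0],
    "W_c": [[0.1, 0.2, 0.5], [0.3, -0.1, -0.5]], "b_c": [0.0, 0.0],
    "W_o": [[0.2, -0.3, 0.7], [0.6, 0.1, 0.1]], "b_o": [0.0, 0.0],
}

h, C = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0], [0.5], [-1.0]):  # a short made-up input sequence
    h, C = lstm_step(x, h, C, params)
print(h, C)
```

Because each entry of ht is a value in (0, 1) multiplied by a tanh output, the hidden state always stays strictly between −1 and 1, while the cell state itself is not bounded in that way.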

Peephole Architecture

Until now we have seen the standard LSTM network, but this architecture has been modified over time across many research papers. One popular LSTM variant, introduced by Gers & Schmidhuber (2000), adds “peephole connections.” This means that we let the gate layers look at the cell state.

With peephole connections, every gate receives the cell state as an additional input.
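A peephole step can be sketched by feeding the cell state into each gate. The version below assumes the concatenation form used in colah's blog, where the forget and input gates see Ct-1 and the output gate sees the updated Ct; many implementations instead restrict the peephole weights to be elementwise. Names and numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, v):
    """Affine map W . v + b on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def peephole_gate(C, h_prev, x_t, W, b):
    """Sigmoid gate whose input is [C, h_{t-1}, x_t] instead of [h_{t-1}, x_t]."""
    return [sigmoid(z) for z in dense(W, b, C + h_prev + x_t)]

def peephole_lstm_step(x_t, h_prev, C_prev, p):
    # Forget and input gates peek at the previous cell state C_{t-1}.
    f_t = peephole_gate(C_prev, h_prev, x_t, p["W_f"], p["b_f"])
    i_t = peephole_gate(C_prev, h_prev, x_t, p["W_i"], p["b_i"])
    # Candidate values are computed as in the standard LSTM (no peephole).
    c_tilde = [math.tanh(z) for z in dense(p["W_c"], p["b_c"], h_prev + x_t)]
    C_t = [f * c + i * g for f, c, i, g in zip(f_t, C_prev, i_t, c_tilde)]
    # The output gate peeks at the updated cell state C_t.
    o_t = peephole_gate(C_t, h_prev, x_t, p["W_o"], p["b_o"])
    h_t = [o * math.tanh(c) for o, c in zip(o_t, C_t)]
    return h_t, C_t

# Hypothetical toy parameters: 1 hidden unit, 1 input feature.
params = {
    "W_f": [[1.0, 0.2, 0.3]], "b_f": [0.0],    # columns: [C, h, x]
    "W_i": [[-0.5, 0.1, 0.4]], "b_i": [0.0],
    "W_o": [[0.7, -0.2, 0.1]], "b_o": [0.0],
    "W_c": [[0.3, 0.6]], "b_c": [0.0],         # columns: [h, x]
}
h_t, C_t = peephole_lstm_step([1.0], [0.0], [2.0], params)
print(h_t, C_t)
```

The only change from the standard step is the extra cell-state input to each gate, so the gates can react to what is stored in memory even when the hidden state does not expose it.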


Another variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single update gate. It also merges the cell state and hidden state. The resulting model is simpler than the traditional LSTM and has been growing in popularity.

The entire GRU operation can be seen in one figure.
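In code, one GRU step can be sketched as below. The sketch assumes the bias-free formulation from colah's blog and includes the reset gate from Cho, et al., which the text above does not describe. Note that some papers and libraries swap the roles of zt and 1 − zt in the final interpolation. Names and weights are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    concat = h_prev + x_t                                  # [h_{t-1}, x_t]
    z_t = [sigmoid(a) for a in matvec(W_z, concat)]        # update gate
    r_t = [sigmoid(a) for a in matvec(W_r, concat)]        # reset gate
    reset_h = [r * h for r, h in zip(r_t, h_prev)]         # r_t * h_{t-1}
    h_tilde = [math.tanh(a) for a in matvec(W_h, reset_h + x_t)]  # candidate state
    # The update gate does the job of both forget and input gates:
    # (1 - z_t) keeps old state, z_t writes the new candidate.
    return [(1.0 - z) * h + z * g for z, h, g in zip(z_t, h_prev, h_tilde)]

# Hypothetical toy parameters: 2 hidden units, 1 input feature.
W_z = [[0.5, 0.1, -0.2], [0.3, -0.2, 0.4]]
W_r = [[0.4, 0.0, 0.6], [0.0, 0.3, -0.5]]
W_h = [[0.1, 0.2, 0.5], [0.3, -0.1, -0.5]]

h = [0.0, 0.0]
for x in ([1.0], [0.5], [-1.0]):  # a short made-up input sequence
    h = gru_step(x, h, W_z, W_r, W_h)
print(h)
```

There is no separate cell state here: the single vector h carries the memory and is also the output of the step.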

Here, we looked at some LSTM variants, but there are other variants as well. LSTM already addresses many of the drawbacks of a plain RNN; still, researchers have taken another step beyond it, which is called attention.


Reference: Understanding LSTM Networks, colah’s blog (https://colah.github.io/)