Introduction to Long Short-Term Memory (LSTM)

Archit Saxena · Published in Analytics Vidhya · Jan 17, 2023 · 4 min read

In my previous article on Recurrent Neural Networks (RNNs), I discussed RNNs and how they work. Towards the end of the article, the limitations of RNNs were discussed. To refresh our memory, let's quickly touch upon the main limitation of RNNs and understand the need for modifications of vanilla RNNs.

The main limitation of RNNs is that they can’t remember very long sequences and run into the vanishing gradient problem.

What is the vanishing gradient problem?

The gradients of the loss function approach zero as they are propagated back through many layers (or, in an RNN, many time steps) with certain activation functions, which makes the network difficult to train.
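
To see this shrinking effect numerically, here is a tiny NumPy sketch (an illustration added here, not taken from the article). It ignores the weight matrices and simply multiplies the tanh derivatives accumulated over many time steps:

    import numpy as np

    # tanh'(x) = 1 - tanh(x)^2 is at most 1 and usually much smaller,
    # so multiplying many of these factors together shrinks the gradient.
    np.random.seed(0)
    pre_activations = np.random.randn(50)            # pretend these are 50 time steps
    local_grads = 1 - np.tanh(pre_activations) ** 2  # local tanh derivatives

    gradient = 1.0
    for t, g in enumerate(local_grads, start=1):
        gradient *= g
        if t % 10 == 0:
            print(f"after {t:2d} steps, gradient factor ~ {gradient:.2e}")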

Vanishing gradient; Source: Medium

Long Short-Term Memory (LSTM)

LSTMs come to the rescue and solve the vanishing gradient problem. They do so by ignoring (forgetting) useless data/information in the network. The LSTM forgets data that carries no useful information in the context of the other inputs (e.g., earlier words in the sentence). When new information comes in, the network decides which information to overlook and which to remember.

LSTM Architecture

Let’s look into the difference between RNNs and LSTMs.

In RNNs, we have a very simple structure with a single activation function (tanh).

RNN network; Source: colah’s blog
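
In equation form, one vanilla RNN step is just h(t) = tanh(W · [h(t-1), x(t)] + b). A minimal NumPy sketch of that single step (the weights and sizes are made up for illustration):

    import numpy as np

    hidden, inputs = 3, 2
    rng = np.random.default_rng(0)
    W = rng.standard_normal((hidden, hidden + inputs))  # recurrent + input weights
    b = np.zeros(hidden)

    h_prev = np.zeros(hidden)          # h(t-1)
    x_t = rng.standard_normal(inputs)  # x(t)

    # the whole recurrent step is one tanh layer: h(t) = tanh(W · [h(t-1), x(t)] + b)
    h_t = np.tanh(W @ np.concatenate([h_prev, x_t]) + b)
    print(h_t)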

In LSTMs, instead of just a simple network with a single activation function, we have multiple components, giving power to the network to forget and remember information.

LSTM network; Source: colah’s blog
Notations used

LSTMs have 4 different components, namely

  1. Cell state (Memory cell)
  2. Forget gate
  3. Input gate
  4. Output gate
LSTM components; Source: Turing’s blog

Let’s understand these components, one by one.

1. Cell State (Memory cell)

It is the first component of the LSTM and runs through the entire LSTM unit. It can be thought of as a conveyor belt.

LSTM Cell State; Source: colah’s blog

This cell state is responsible for remembering and forgetting, based on the context of the input. This means that some of the previous information should be remembered, some of it should be forgotten, and some new information should be added to the memory. The first operation (X) is a pointwise multiplication of the cell state by a vector of values between 0 and 1 (the output of the forget gate). Information multiplied by a value close to 0 is forgotten by the LSTM. The other operation (+) is responsible for adding new information to the state.
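
Here is a toy NumPy sketch of those two pointwise operations (the numbers are made up purely for illustration):

    import numpy as np

    c_prev = np.array([0.8, -0.3, 1.2])  # previous cell state C(t-1)
    forget = np.array([1.0, 0.0, 0.5])   # forget-gate output, each value between 0 and 1
    add_in = np.array([0.0, 0.7, -0.1])  # new information selected by the input gate

    # (X): pointwise multiplication -- entries scaled by 0 are forgotten
    # (+): add the selected new information to the state
    c_new = c_prev * forget + add_in
    print(c_new)  # [0.8  0.7  0.5]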

2. Forget Gate

The forget LSTM gate, as the name suggests, decides what information should be forgotten. A sigmoid layer is used to make this decision. This sigmoid layer is called the “forget gate layer”.

LSTM Forget Gate; Source: colah’s blog

It looks at h(t-1) and x(t): the two are concatenated, multiplied by a weight matrix, and passed through the sigmoid layer, which outputs a number between 0 and 1 for each number in the cell state C(t-1). A ‘1’ means keep it completely; a ‘0’ means forget it completely.
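
In equation form, this is f(t) = sigmoid(W_f · [h(t-1), x(t)] + b_f). A minimal NumPy sketch with made-up weights and sizes (hidden size 3, input size 2):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    hidden, inputs = 3, 2
    rng = np.random.default_rng(0)
    W_f = rng.standard_normal((hidden, hidden + inputs))  # forget-gate weights
    b_f = np.zeros(hidden)                                # forget-gate bias

    h_prev = rng.standard_normal(hidden)  # h(t-1)
    x_t = rng.standard_normal(inputs)     # x(t)

    # f(t) = sigmoid(W_f · [h(t-1), x(t)] + b_f), one value in (0, 1) per cell-state entry
    f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
    print(f_t)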

3. Input gate

The input gate gives new information to the LSTM and decides if that new information is going to be stored in the cell state.

LSTM Input gate; Source: colah’s blog

This has 3 parts:

  1. A sigmoid layer decides the values to be updated. This layer is called the “input gate layer”.
  2. A tanh activation function layer creates a vector of new candidate values, C̃(t), that could be added to the state.
  3. Then we combine these 2 outputs, i(t) * C̃(t), and update the cell state.

The new cell state C(t) is obtained by combining the outputs of the forget and input gates: the old state C(t-1) is scaled by the forget gate’s output f(t), and i(t) * C̃(t) is added to it, i.e., C(t) = f(t) * C(t-1) + i(t) * C̃(t).

LSTM new Cell state; Source: colah’s blog
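
Putting the input gate and the state update together, here is a rough NumPy sketch. The weights, sizes, and the stand-in forget-gate output are all made up for illustration; the notation follows colah’s blog:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    hidden, inputs = 3, 2
    rng = np.random.default_rng(1)
    W_i, b_i = rng.standard_normal((hidden, hidden + inputs)), np.zeros(hidden)  # input gate
    W_c, b_c = rng.standard_normal((hidden, hidden + inputs)), np.zeros(hidden)  # candidate values

    h_prev = rng.standard_normal(hidden)        # h(t-1)
    x_t = rng.standard_normal(inputs)           # x(t)
    c_prev = rng.standard_normal(hidden)        # C(t-1)
    f_t = sigmoid(rng.standard_normal(hidden))  # stand-in for this step's forget-gate output

    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ z + b_i)      # which values to update
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate values C̃(t)

    # C(t) = f(t) * C(t-1) + i(t) * C̃(t)
    c_t = f_t * c_prev + i_t * c_tilde
    print(c_t)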

4. Output gate

The output of the LSTM unit depends on the new cell state.

LSTM Output gate; Source: colah’s blog

First, a sigmoid layer decides what parts of the cell state we’re going to output. Then, a tanh layer is used on the cell state to squash the values between -1 and 1, which is finally multiplied by the sigmoid gate output.
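
In equation form, o(t) = sigmoid(W_o · [h(t-1), x(t)] + b_o) and h(t) = o(t) * tanh(C(t)). A small NumPy sketch of those same two steps, again with made-up weights and a made-up cell state:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    hidden, inputs = 3, 2
    rng = np.random.default_rng(2)
    W_o = rng.standard_normal((hidden, hidden + inputs))  # output-gate weights
    b_o = np.zeros(hidden)

    h_prev = rng.standard_normal(hidden)  # h(t-1)
    x_t = rng.standard_normal(inputs)     # x(t)
    c_t = rng.standard_normal(hidden)     # the new cell state C(t) from the previous step

    # o(t) = sigmoid(W_o · [h(t-1), x(t)] + b_o) decides what to expose
    o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)

    # h(t) = o(t) * tanh(C(t)): squash the state to (-1, 1), then filter it
    h_t = o_t * np.tanh(c_t)
    print(h_t)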

LSTM in action

Now that we have understood the architecture and the components of LSTM, let’s see it in action.

LSTM in action; Source: Medium article
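
Since the article itself doesn’t include code, here is a hypothetical Keras usage example (assuming TensorFlow is installed) just to show where an LSTM layer typically sits in a model, e.g., for sentiment classification; the layer sizes are arbitrary:

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Embedding(input_dim=10000, output_dim=32),  # token ids -> 32-d vectors
        layers.LSTM(64),                                   # 64 LSTM units
        layers.Dense(1, activation="sigmoid"),             # e.g. a sentiment score
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()

Keras’s LSTM layer implements the cell state and the three gates described above internally, so we only have to choose the number of units.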

Conclusion

As mentioned in the article, LSTMs can hold information longer by selectively forgetting and remembering information. This is achieved by 4 components: a cell state and 3 gates. LSTMs also combat the vanishing gradient problem, which was a limitation of vanilla RNNs, and this gives them an edge. We also walked through the architecture and working of LSTMs.

If you like it, please leave a 👏.

Feedback/suggestions are always welcome.
