Beginner’s Guide to RNN & LSTMs

Dinesh
13 min read · Dec 5, 2019
A recurrent neuron, where the output data is multiplied by a weight and fed back into the input

What is RNN?

A Recurrent Neural Network (RNN) is basically a generalization of a feed-forward neural network that has an internal memory. RNNs are a special kind of neural network designed to deal effectively with sequential data. This kind of data includes time series (a list of values of some parameter over a certain period of time), text documents, which can be seen as sequences of words, or audio, which can be seen as a sequence of sound frequencies over time.
An RNN is recurrent in nature because it performs the same function for every element of the input, while the output for the current input depends on the previous computation. To make a decision, it considers the current input and the output it has learned from previous inputs.

Cells that are a function of inputs from previous time steps are also known as memory cells.

Unlike feed-forward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. In other neural networks, all the inputs are independent of each other. But in an RNN, the inputs are related to each other.
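As a concrete sketch (NumPy, with made-up dimensions, seeds and variable names, not from any specific library), a single recurrent step combines the current input with the previous state:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: the new state depends on the current
    input *and* the previous state (the network's memory)."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W_x = rng.normal(size=(input_dim, hidden_dim)) * 0.1
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)             # state starts empty
x_t = rng.normal(size=input_dim)     # one input vector
h = rnn_step(x_t, h, W_x, W_h, b)    # state now encodes this input
```

Feeding the returned `h` back in as `h_prev` on the next call is exactly the feedback loop described above.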

Why RNN?

The basic challenge with a classic feed-forward neural network is that it has no memory: each training example given to the model is treated independently of the others. To work with sequential data in such models, you need to show them the entire sequence in one go as a single training example. This is problematic because the number of words in a sentence can vary, and, more importantly, this is not how we tend to process a sentence in our heads.

When we read a sentence, we read it word by word, keep the prior words/context in memory, and then update our understanding based on the new words we incrementally read, until we understand the whole sentence. This is the basic idea behind RNNs: they iterate through the elements of an input sequence while maintaining an internal “state”, which encodes everything they have seen so far. The state of the RNN is reset when processing two different, independent sequences.

Recurrent neural networks are a special type of neural network where the outputs from previous time steps are fed as input to the current time step.

Basic Recurrent neural network with three input nodes

The way RNNs do this is by taking the output of each neuron (input nodes are fed into a hidden layer with sigmoid or tanh activations) and feeding it back in as an input. By doing this, the network does not only receive new pieces of information at every time step; it also adds a weighted version of the previous output to this new information. As you can see, the hidden layer outputs are passed through a conceptual delay block to allow the input of hₜ₋₁ into the hidden layer. What is the point of this? Simply that we can now model time- or sequence-dependent data.
This gives these neurons a kind of “memory” of the previous inputs they have had, since those inputs are quantified by the output being fed back into the neuron.

A particularly good example of this is in predicting text sequences. Consider the following text string: “A girl walked into a bar, and she said ‘Can I have a drink please?’. The bartender said ‘Certainly { }”. There are many options for what could fill in the { } symbol in the above string, for instance, “miss”, “ma’am” and so on. However, other words could also fit, such as “sir”, “Mister” etc. In order to get the correct gender of the noun, the neural network needs to “recall” that two previous words designating the likely gender (i.e. “girl” and “she”) were used.
A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. This type of flow of information through time (or sequence) in a recurrent neural network is shown in the diagram below, which unrolls the sequence (loop unrolled):

Unrolled recurrent neural network

This unrolled network shows how we can supply a stream of data (intimately related to sequences, lists and time-series data) to the recurrent neural network. For instance, first we supply the word vector for “A” to the network F — the output of the nodes in F are fed into the “next” network and also act as a stand-alone output ( h₀ ). The next network (though it’s the same network) F at time t=1 takes the next word vector for “girl” and the previous output h₀ into its hidden nodes, producing the next output h₁ and so on.
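The unrolling above can be sketched as a plain loop that reuses the same weights at every step (illustrative NumPy shapes and names, not the actual embeddings for “A” and “girl”):

```python
import numpy as np

def run_rnn(sequence, W_x, W_h, b):
    """Unrolled RNN: the *same* weights are applied at every time
    step; only the hidden state h changes as it carries context."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x_t in sequence:                       # "A", "girl", ... as vectors
        h = np.tanh(x_t @ W_x + h @ W_h + b)   # h_t from x_t and h_{t-1}
        outputs.append(h)                      # h_0, h_1, ...
    return outputs

rng = np.random.default_rng(1)
seq = rng.normal(size=(5, 4))                  # five word embeddings of dim 4
W_x = rng.normal(size=(4, 3)) * 0.1
W_h = rng.normal(size=(3, 3)) * 0.1
hs = run_rnn(seq, W_x, W_h, np.zeros(3))
print(len(hs))                                 # one hidden state per step: 5
```

Note that `W_x`, `W_h` and `b` appear once, outside the loop: that is the weight sharing across time steps.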

NOTE: Although shown this way in the diagram for ease of explanation, the words themselves, i.e. “A”, “girl”, etc., aren’t fed directly into the neural network. Neither are their one-hot vector representations; rather, an embedding word vector (read up on Word2Vec) is used for each word.

One last thing to note: the weights of the connections between time steps are shared, i.e. there isn’t a different set of weights for each time step. They are the same for all time steps BECAUSE we have the same single RNN cell looped back to itself.

Different types of RNNs

The core reason that recurrent nets are more exciting is that they allow us to operate over sequences of vectors: Sequences in the input, the output, or in the most general case both. A few examples may make this more concrete:

RNN Implementation types

The problem with RNNs is that as time passes and they get fed more and more new data, they start to “forget” the previous data they have seen (the vanishing gradient problem), as it gets diluted among the new data, the transformation from the activation function, and the weight multiplication. This means they have a good short-term memory but struggle to remember things that happened a while ago (data they have seen many time steps in the past).
The more time steps we have, the more chance we have of back-propagation gradients either accumulating and exploding or vanishing down to nothing.
Consider the following representation of a recurrent neural network:

Here, hₜ is the new state (current time step), hₜ₋₁ is the previous state (previous time step), and xₜ is the current input.
U and V are the weight matrices connecting the inputs and the recurrent outputs, respectively. We then often perform a softmax over all the outputs hₜ. Notice, however, that if we go back three time steps in our recurrent neural network, we have the following:

From the above you can see that, as we work our way back in time, we are essentially adding deeper and deeper layers to our network. This causes a problem. Consider the gradient of the error with respect to the weight matrix U during back-propagation through time; it looks something like this:

The equation above is only a rough approximation of what is going on during back-propagation through time. Each of these gradients will involve calculating the gradient of the sigmoid function. The problem with the sigmoid function occurs when the input values are such that the output is close to either 0 or 1 — at this point, the gradient is very small (saturating).

For example, say the gradient value decreases like 0.863 → 0.532 → 0.356 → 0.192 → 0.117 → 0.086 → 0.023 → 0.019…
You can see that there is not much change in the last three iterations.

It means that when you multiply many sigmoid gradients together, you are multiplying many values which are potentially much less than one; this leads to the vanishing gradient problem.
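To see the effect numerically, here is a small sketch (the pre-activation values are made up for illustration) chaining ten sigmoid derivatives together, as back-propagation through time would:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)              # peaks at 0.25 when z = 0

# Chain ten such factors together, as back-propagation through time does:
zs = np.linspace(-3, 3, 10)         # made-up pre-activations at ten time steps
product = np.prod(sigmoid_grad(zs))
print(product)                      # far below 1e-6: the gradient has vanished
print(0.25 ** 10)                   # even the best possible case shrinks fast
```

Since the sigmoid derivative never exceeds 0.25, ten steps of BPTT cannot produce a gradient factor larger than 0.25¹⁰ ≈ 10⁻⁶.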

Gradients shrink as it back-propagates through time

The gradient values shrink exponentially as they propagate back through each time step. Because the gradient becomes essentially zero for time steps many positions in the past, the weights won’t adjust to account for those inputs, and therefore the network won’t learn relationships separated by long periods of time. So the vanishing gradient problem results in long-term dependencies being ignored during training.

You can visualize this vanishing gradient problem in real time here.

Hence, the RNN doesn’t learn long-range dependencies across time steps, which greatly limits its usefulness.

We need some sort of Long term memory, which is just what LSTMs provide.

Enhancing our memory — Long Short Term Memory Networks (LSTM)

Long Short-Term Memory networks, or LSTMs, are a variant of RNNs that solve the long-term memory problem of the former.

They have a more complex cell structure than a normal recurrent neuron, which allows them to better regulate what to learn or forget from the different input sources.

The key to LSTMs is the cell state (cell memory), the horizontal line running through the top of the diagram, along which information flows, together with internal mechanisms called gates that regulate the flow of information.
The cell state is kind of like a conveyor belt: it runs straight down the entire chain, with only some minor linear interactions.

The cell state basically encodes, at every step, the relevant information from the inputs that have been observed up to that step.

Representation of an LSTM cell

Cell state is a memory of the LSTM cell and hidden state (cell output) is an output of this cell.

Cells have an internal cell state, often abbreviated as “c”, and the cell output is what is called the “hidden state”, abbreviated as “h”.
Regular RNNs have just the hidden state and no cell state; this is why RNNs have difficulty accessing information from a long time ago.

Note: The hidden state is the output of the LSTM cell, used for prediction. It contains information from previous inputs (via the cell state/memory) along with the current input, selected according to which context is important.

The hidden state (hₜ₋₁) and the cell input (xₜ) are used to control what to do with the memory (cell state) cₜ: forget it, or write new information to it.

We decide what to do with memory knowing about previous cell output (hidden state) and current input and we do this using gates.

Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a point-wise multiplication operation.

LSTM Gate

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through.
A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.

These gates can learn which data in a sequence is important to keep or throw away. By doing that, it can pass relevant information down the long chain of sequences to make predictions.
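As a tiny numerical sketch of what a gate does (the numbers are made up, nothing here is learned), it is just a sigmoid output multiplied element-wise into a candidate vector:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

candidate = np.array([0.9, -0.5, 0.7, 0.2])        # information trying to pass
gate = sigmoid(np.array([6.0, -6.0, 0.0, 2.0]))    # learned gate pre-activations
print(np.round(gate, 2))               # roughly [1.0, 0.0, 0.5, 0.88]
print(np.round(candidate * gate, 2))   # first value passes, second is blocked
```

A large positive pre-activation lets a value through almost unchanged, a large negative one zeroes it out, and values in between let a fraction through.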

An LSTM neuron can do this learning by incorporating a cell state and three different gates: the input gate, the forget gate and the output gate. In each time step, the cell can decide what to do with the state vector: read from it, write to it, or delete it, thanks to an explicit gating mechanism.
With the input gate, the cell can decide whether to update the cell state or not. With the forget gate the cell can erase its memory, and with the output gate the cell can decide whether to make the output information available or not.

LSTMs also mitigate the problems of exploding and vanishing gradients.

To reduce the vanishing (and exploding) gradient problem, and therefore allow deeper networks and recurrent neural networks to perform well in practical settings, there needs to be a way to reduce the multiplication of gradients which are less than one.

The LSTM cell is a specifically designed unit of logic that will help reduce the vanishing gradient problem sufficiently to make recurrent neural networks more useful for long-term memory tasks i.e. text sequence predictions.
The way it does so is by creating an internal memory state which is simply added to the processed input, greatly reducing the multiplicative effect of small gradients. The time dependence and effects of previous inputs are controlled by a concept called the forget gate, which determines which states are remembered or forgotten. Two other gates, the input gate and the output gate, are also featured in LSTM cells.

Here’s a brief summary of the internal workings of the different gates, the cell state, the hidden state and the current input, explained through mathematical formulas referenced from a research paper https://arxiv.org/abs/1603.03827 (LSTM for text classification):

LSTM summarized

Let’s first have a look at LSTM cell more carefully.

LSTM cell another view

The data flow is from left to right in the diagram above, with the current input xₜ and the previous cell output hₜ₋₁ concatenated together and entering the top “data rail”. The long-term memory is usually called the cell state, Cₜ. The looping arrows indicate the recursive nature of the cell, which allows information from previous intervals to be stored within the LSTM cell. Here’s where things get interesting.

Input Gate:

The input gate is also called the save vector.
This gate determines which information should enter the cell state / long-term memory, i.e. which information should be saved to the cell state and which should be discarded.

First, the (combined) input is squashed between -1 and 1 using a tanh activation function.
This squashed input (from tanh) is then multiplied element-wise by the output of the input gate. The input gate is basically a hidden layer of sigmoid-activated nodes, with weighted inputs xₜ and hₜ₋₁, which outputs values between 0 and 1; when multiplied element-wise by the squashed input, it determines which inputs are switched on and off (actually, the values aren’t binary, they are continuous values between 0 and 1). In other words, it is a kind of input filter or gate: it tells the cell what to learn and add to the memory from the current input and its context, and also how much of it to add (the sigmoid gives values between 0 and 1).

Simplistic (possibly imprecise) view: tanh gives a standardized (between -1 and 1) version of the actual unscaled combined input vector, and the sigmoid layer controls what percentage (values between 0 and 1, i.e. 0 to 100%) of those scaled values should be passed on and added to the memory, considering the current and previous context.

But Why tanh activation?

Because the cell state is updated by summation with the previous cell state, a sigmoid function alone would only ever add to the memory and never remove/forget anything: if you can only add a float number in [0, 1], the update can never be negative, so nothing gets turned off or forgotten. This is why the input modulation gate has a tanh activation function: tanh has a range of [-1, 1], which allows the cell state to forget certain memories.
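A quick numerical check of this point (made-up inputs):

```python
import numpy as np

z = np.array([-2.0, 0.0, 2.0])
sig = 1.0 / (1.0 + np.exp(-z))   # always in (0, 1): could only ever add
tan = np.tanh(z)                 # in (-1, 1): can also subtract from memory
print(sig.min() > 0)             # True: sigmoid never gives a negative update
print(tan.min() < 0)             # True: tanh can push the cell state down
```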

Forget Gate:

The forget gate is also called the remember vector. The output of the forget gate tells the cell state which information to forget by multiplying a position in the matrix by 0. If the output of the forget gate is 1, that information is kept in the cell state.

Although it is randomly initialized at first, the forget gate basically LEARNS what exactly to FORGET from the memory (cell state), given the current input and the previous context.

Output Gate:

The output gate is also called the focus vector.
It basically highlights which information, out of all the possible values in the matrix (long-term memory), should move forward to the next hidden state.

Note: The working memory is usually called the hidden state (hₜ).
It is basically hₜ (the LSTM output): the part of the existing memory (Cₜ) that should be fed forward as context for the next round. This is analogous to the hidden state in an RNN or HMM.

Gates Summarized:
The input gate determines the extent to which the current timestamp’s input should be used, the forget gate determines the extent to which the previous timestamp’s state should be used, and the output gate determines the output of the current timestamp.
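To tie the three gates together, here is a minimal NumPy sketch of one forward step of a standard LSTM cell. The shapes, seed and variable names are illustrative (not from the paper referenced above); the equations follow the usual formulation, with one stacked weight matrix for all four internal layers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM forward step. W maps the concatenated [h_prev, x_t]
    to the pre-activations of all four internal layers at once."""
    z = np.concatenate([h_prev, x_t]) @ W + b
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)               # forget gate: what to erase from memory
    i = sigmoid(i)               # input gate: how much new info to save
    o = sigmoid(o)               # output gate: what to expose as output
    g = np.tanh(g)               # candidate values to write (in [-1, 1])
    c_t = f * c_prev + i * g     # additive cell-state update
    h_t = o * np.tanh(c_t)       # hidden state = the cell's output
    return h_t, c_t

rng = np.random.default_rng(2)
input_dim, hidden_dim = 4, 3
W = rng.normal(size=(hidden_dim + input_dim, 4 * hidden_dim)) * 0.1
b = np.zeros(4 * hidden_dim)
h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # a 5-step input sequence
    h, c = lstm_step(x_t, h, c, W, b)
```

The key line is `c_t = f * c_prev + i * g`: the cell state is updated additively, which is what softens the multiplicative gradient shrinkage of a plain RNN.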

Still unclear? Doubts? Then I’d highly recommend watching this amazing video by Brandon Rohrer on RNNs & LSTMs (a must-watch):

Conclusion

The idea is that the learning process is based on context (memory).
You forget, you learn, and you extract a part of it for the next round; then, in the next round, you repeat the same process.

Basically, with the LSTM’s internal gate mechanism we are trying to mimic how the human brain learns things (and this may not necessarily be accurate; we’re just trying different approaches).

GRU — Gated Recurrent Unit

The GRU, introduced in 2014 by Kyunghyun Cho et al., is a simplified version of the LSTM, with just two gates instead of three and far fewer parameters.

GRUs have been shown to perform better on certain tasks and smaller datasets, but LSTMs tend to outperform GRUs in general.
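As a rough sketch of the GRU step in Cho et al.'s formulation (illustrative shapes and names; the update gate z blends the old state with a candidate state):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU step: two gates instead of the LSTM's three, and no
    separate cell state; the hidden state itself is the memory."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(hx @ Wz)                         # update gate
    r = sigmoid(hx @ Wr)                         # reset gate
    h_cand = np.tanh(np.concatenate([r * h_prev, x_t]) @ Wh)
    return (1 - z) * h_prev + z * h_cand         # blend old state and candidate

rng = np.random.default_rng(3)
input_dim, hidden_dim = 4, 3
Wz = rng.normal(size=(hidden_dim + input_dim, hidden_dim)) * 0.1
Wr = rng.normal(size=(hidden_dim + input_dim, hidden_dim)) * 0.1
Wh = rng.normal(size=(hidden_dim + input_dim, hidden_dim)) * 0.1
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, Wz, Wr, Wh)
```

Compared to the LSTM, the GRU merges the forget and input gates into the single update gate z, which is where the parameter savings come from.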

There are also additional variants, like the bidirectional LSTM, which processes text not just left to right but also right to left for an additional performance boost.

WHAT DOES THE CONTEXT CONTAIN:

  • We extract only a part of our memory that is relevant at any point.
  • We combine it with current surroundings.
  • We then forget the parts of this total that are not required or might confuse us.
  • We learn new stuff from this total (current and our past) and update our memory.
