Recurrent Neural Networks and Their Variants

Shujaat Hasan
Published in Analytics Vidhya
7 min read · Jun 26, 2020

RNN and its variants (dProgrammer lopez)

When humans read, they understand each word based on their understanding of the words that came before it. We don't forget all the information and start thinking from scratch. RNNs work the same way: they preserve information from previous time steps and use it to predict the next.

In this post we will start by understanding what an RNN is and how it works. We will then discuss LSTM and GRU, two popular variants of the RNN. Along the way we will look at the problems with the standard RNN, such as exploding and vanishing gradients, and how to overcome them.

Standard Recurrent Neural Network:

Typical RNN block

Let's first understand a single block of an RNN, shown in the figure above. Unlike a feed-forward neural network, an RNN block has two inputs: x(t) and a(t-1). Here x(t) is the current time-step input, which can be a word of a sentence, a character of a word, a sample of an audio signal, etc., while a(t-1) is the activation of the previous RNN block, carrying information from earlier time steps. Each input has its own weight matrix: Wx for x(t) and Wa for a(t-1). The block computes the following two equations:

equation 1: a(t) = tanh(Wx.x(t) + Wa.a(t-1) + ba)
equation 2: y(t) = softmax(Wy.a(t) + by)

As can be seen in equation 1, we sum the product of the weights with the current input, the product of the recurrent weights with the previous activation, and a bias. This linear combination is then passed through the tanh activation function, which squashes the values between -1 and 1, giving a(t). In equation 2, the activation a(t) is multiplied by Wy, the bias by is added, and the result is passed through a softmax function to produce the output y(t).
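As a concrete illustration, here is a minimal NumPy sketch of the two equations above (the weight shapes and variable names below are assumptions chosen for the example, not fixed by the post):

```python
import numpy as np

def rnn_step(x_t, a_prev, Wx, Wa, Wy, ba, by):
    """One RNN block, mirroring equations 1 and 2 above."""
    a_t = np.tanh(Wx @ x_t + Wa @ a_prev + ba)   # equation 1
    scores = Wy @ a_t + by
    exp = np.exp(scores - scores.max())          # numerically stable softmax
    y_t = exp / exp.sum()                        # equation 2
    return a_t, y_t

# Toy dimensions: 3-dimensional input, 4-dimensional activation, 2 output classes
rng = np.random.default_rng(0)
Wx, Wa, Wy = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
ba, by = np.zeros(4), np.zeros(2)
a_t, y_t = rnn_step(rng.normal(size=3), np.zeros(4), Wx, Wa, Wy, ba, by)
```

Note that a_t stays within (-1, 1) because of tanh, and y_t sums to 1 because of the softmax.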

Now that we understand the operation of a single block, chaining these blocks over time gives us the RNN model shown below:

An RNN Model

To calculate the loss of the model, individual block losses are calculated and summed up to get the total loss. The following equation represents the loss of the RNN:

L = Σt L(ŷ(t), y(t))
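Summing the per-block losses can be sketched as follows, using per-step cross-entropy as the block loss (the function name and shapes are assumptions for illustration):

```python
import numpy as np

def rnn_forward_loss(xs, targets, a0, Wx, Wa, Wy, ba, by):
    """Unroll the RNN over time; total loss = sum of per-block losses."""
    a, total_loss = a0, 0.0
    for x_t, target in zip(xs, targets):
        a = np.tanh(Wx @ x_t + Wa @ a + ba)          # equation 1
        scores = Wy @ a + by
        exp = np.exp(scores - scores.max())
        y = exp / exp.sum()                          # equation 2
        total_loss += -np.log(y[target])             # cross-entropy at this step
    return total_loss

rng = np.random.default_rng(1)
Wx, Wa, Wy = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
loss = rnn_forward_loss(rng.normal(size=(5, 3)), [0, 1, 0, 1, 1],
                        np.zeros(4), Wx, Wa, Wy, np.zeros(4), np.zeros(2))
```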

Backpropagation: Backpropagation for an RNN is done at each point in time (backpropagation through time). At time step 'T', the derivative of the loss L with respect to the weight matrix 'W' is the sum of the contributions from every time step up to T:

dL(T)/dW = Σt=1..T dL(T)/dW |(t)

Now that we understand what an RNN is and how it works, let's look at the problems with these models and how to overcome them. One problem with RNNs is exploding/vanishing gradients during the backpropagation step. As the error signal travels backward through time, it is repeatedly multiplied by the recurrent weights and the activation derivatives; these repeated multiplications can shrink the gradients toward zero (vanishing) or blow them up (exploding). In practice this limits standard RNNs to looking back approximately ten time steps.
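A quick numerical sketch of why the gradient vanishes: backpropagating through one tanh step multiplies the gradient by the activation derivative and the recurrent weights, and with modest weight magnitudes that product decays roughly geometrically over time steps (the sizes and weight scale below are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
Wa = 0.1 * rng.normal(size=(8, 8))        # small recurrent weight matrix
a = np.tanh(rng.normal(size=8))           # some activation state
grad = np.ones(8)                         # gradient arriving at the last step
norms = []
for _ in range(20):                       # walk 20 steps back in time
    grad = (1 - a ** 2) * (Wa.T @ grad)   # Jacobian of one tanh step
    a = np.tanh(Wa @ a)
    norms.append(np.linalg.norm(grad))
# After 20 steps the gradient norm has collapsed toward zero
```

With large weights the same product explodes instead, which is the mirror-image problem.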

To solve the vanishing gradient problem, different gates are added to the RNN block, giving us two well-known variants: LSTM and GRU. Let's discuss these two:

LSTM:

To understand the LSTM, let's have a look at the block diagram:

Source

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram above.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged. The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

An LSTM has three different gates: the forget gate, the input gate, and the output gate. Each of them is discussed below:

Forget Gate

A forget gate is responsible for removing information from the cell state. The information that is no longer required for the LSTM to understand things or the information that is of less importance is removed via multiplication of a filter. This is required for optimizing the performance of the LSTM network.

The forget gate takes two inputs: h_t-1 and x_t.

h_t-1 is the hidden state from the previous cell (the output of the previous cell), and x_t is the input at that particular time step. The given inputs are multiplied by the weight matrices and a bias is added. Following this, the sigmoid function is applied to this value. The sigmoid function outputs a vector, with values ranging from 0 to 1, corresponding to each number in the cell state. Basically, the sigmoid function is responsible for deciding which values to keep and which to discard. If a '0' is output for a particular value in the cell state, it means that the forget gate wants the cell state to forget that piece of information completely. Similarly, a '1' means that the forget gate wants to remember that entire piece of information. This vector output from the sigmoid function is multiplied element-wise with the cell state.
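The forget gate described above can be sketched as follows (the weight name Wf, bias bf, and the concatenation layout are assumptions for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, c_prev, Wf, bf):
    """f_t = sigmoid(Wf . [h_{t-1}, x_t] + bf), then apply it to the cell state."""
    f_t = sigmoid(Wf @ np.concatenate([h_prev, x_t]) + bf)  # values in (0, 1)
    return f_t * c_prev                                     # element-wise forgetting

rng = np.random.default_rng(2)
h_prev, x_t, c_prev = rng.normal(size=4), rng.normal(size=3), rng.normal(size=4)
c_forgotten = forget_gate(h_prev, x_t, c_prev, rng.normal(size=(4, 7)), np.zeros(4))
```

Because each entry of f_t lies in (0, 1), each entry of the cell state can only shrink or stay the same in magnitude after this step.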

Input Gate

The input gate is responsible for the addition of information to the cell state. This addition of information is basically a three-step process, as seen in the diagram above.

  1. Regulating what values need to be added to the cell state by involving a sigmoid function. This is basically very similar to the forget gate and acts as a filter for all the information from h_t-1 and x_t.
  2. Creating a vector containing all possible values that can be added (as perceived from h_t-1 and x_t) to the cell state. This is done using the tanh function, which outputs values from -1 to +1.
  3. Multiplying the value of the regulatory filter (the sigmoid output) with the created vector (the tanh output) and then adding this useful information to the cell state via an addition operation.

Once this three-step process is done, we ensure that only information that is important and not redundant is added to the cell state.
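The three steps above can be sketched as follows (weight names Wi, Wc and the shapes are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_update(h_prev, x_t, c_prev, Wi, bi, Wc, bc):
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(Wi @ z + bi)        # step 1: sigmoid filter, values in (0, 1)
    c_tilde = np.tanh(Wc @ z + bc)    # step 2: candidate values in (-1, +1)
    return c_prev + i_t * c_tilde     # step 3: add the gated candidate to the cell state

rng = np.random.default_rng(3)
h_prev, x_t, c_prev = rng.normal(size=4), rng.normal(size=3), np.zeros(4)
c_new = input_gate_update(h_prev, x_t, c_prev,
                          rng.normal(size=(4, 7)), np.zeros(4),
                          rng.normal(size=(4, 7)), np.zeros(4))
```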

Output Gate

The functioning of an output gate can again be broken down to three steps:

  1. Creating a vector after applying tanh function to the cell state, thereby scaling the values to the range -1 to +1.
  2. Making a filter using the values of h_t-1 and x_t, such that it can regulate the values that need to be output from the vector created above. This filter again employs a sigmoid function.
  3. Multiplying the value of this regulatory filter with the vector created in step 1, and sending it out as the output and also to the hidden state of the next cell.
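And the three output-gate steps in the same sketch style (Wo and bo are assumed names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_gate(h_prev, x_t, c_t, Wo, bo):
    scaled = np.tanh(c_t)                                   # step 1: scale cell state to (-1, +1)
    o_t = sigmoid(Wo @ np.concatenate([h_prev, x_t]) + bo)  # step 2: sigmoid filter
    return o_t * scaled                                     # step 3: h_t, the new hidden state

rng = np.random.default_rng(4)
h_t = output_gate(rng.normal(size=4), rng.normal(size=3), rng.normal(size=4),
                  rng.normal(size=(4, 7)), np.zeros(4))
```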

GRU:

GRUs are a variation on the LSTM recurrent neural network.

The GRU network has a reset gate and an update gate that help ensure its memory doesn't get taken over by tracking short-term dependencies. The network learns how to use its gates to protect its memory so that it's able to make longer-term predictions.

Intuitively, the reset gate determines how to combine the new input with the previous memory, and the update gate defines how much of the previous memory to keep around. If we set the reset gate to all 1's and the update gate to all 0's, we again arrive at our plain RNN model. The basic idea of using a gating mechanism to learn long-term dependencies is the same as in an LSTM, but there are a few key differences:

  • A GRU has two gates, an LSTM has three gates.
  • GRUs don't possess an internal memory (ct) that is separate from the exposed hidden state, and they don't have the output gate that is present in LSTMs.
  • The input and forget gates are coupled by an update gate z, and the reset gate r is applied directly to the previous hidden state. Thus, the responsibility of the forget gate in an LSTM is really split up into both z and r.
  • We don’t apply a second non-linearity when computing the output.
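Putting the bullet points together, a full GRU step can be sketched as follows (names and shapes are assumptions; the interpolation convention matches the text above, i.e. z close to 1 keeps the previous memory, and setting r to 1's and z to 0's recovers the plain RNN):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t, Wz, bz, Wr, br, Wh, bh):
    zc = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ zc + bz)          # update gate: how much previous memory to keep
    r = sigmoid(Wr @ zc + br)          # reset gate: applied directly to h_{t-1}
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)  # candidate state
    return z * h_prev + (1 - z) * h_tilde   # no second non-linearity on the output

rng = np.random.default_rng(5)
h_new = gru_step(np.zeros(4), rng.normal(size=3),
                 rng.normal(size=(4, 7)), np.zeros(4),
                 rng.normal(size=(4, 7)), np.zeros(4),
                 rng.normal(size=(4, 7)), np.zeros(4))
```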

Please share and clap if you found this post helpful.

