In-depth tutorial on Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) Networks

Sanket Maheshwari
Published in Analytics Vidhya · Mar 28, 2020

Although research papers are the best way to learn about any cutting-edge technology, they are not easy to understand: they are full of mathematical equations and terminology that take a lot of effort to work through.


Hence, I will try my best to explain RNNs in a structured format, but if you face any difficulty, please let me know in the comment box at the end. So, are you READY? Let's start!


What is RNN???

RNNs are a class of neural networks that are powerful for modeling sequence data such as time series or natural language. The main idea behind this architecture is to use sequential information.

Do you know how Google’s autocomplete function works???

Basically, a large collection of the most frequently occurring consecutive words is fed into an RNN, which analyzes the data by finding sequences of words that occur frequently and builds a model to predict the next word in a sentence.

So, do you see the importance of the RNN in our daily life? Actually, it has made us lazy!

We already have so many problems in life, so why make it more complex by introducing a new network (the RNN) when we already have the feed-forward neural network?

In a feed-forward neural network, information flows only in the forward direction, from the input nodes through the hidden layers to the output nodes. There are no cycles or loops in the network.

Feed Forward Neural Network classifying two outputs.

Issues with the feed-forward neural network : -

  1. It can’t handle sequential data.
  2. It considers only the current input.
  3. It can’t memorize previous inputs (see the sketch after this list).
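To make this limitation concrete, here is a minimal sketch (using NumPy, with made-up layer sizes): each input is mapped straight to an output, and no state is carried from one input to the next.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes for illustration only.
W1 = rng.normal(size=(16, 8))   # input -> hidden weights
W2 = rng.normal(size=(3, 16))   # hidden -> output weights

def feed_forward(x):
    """One forward pass: input -> hidden -> output, no loops, no memory."""
    h = np.tanh(W1 @ x)          # hidden layer
    return W2 @ h                # output layer

# Each item in a sequence is processed independently;
# the network has no way to remember what it saw before.
sequence = [rng.normal(size=8) for _ in range(5)]
outputs = [feed_forward(x) for x in sequence]
```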
Basic Architecture of RNN

So, from the above figure, it is clear that an RNN is a special type of feed-forward neural network. As the diagram shows, in an RNN the output of any layer depends not only on the current input but also on the inputs that came before it. This feature gives it a significant advantage over other neural networks, because it can use earlier inputs to predict outputs at a later stage.

Image showing basic equations of RNN (http://cs231n.stanford.edu)
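As a rough sketch of those equations (sizes and weights are made up for illustration), a vanilla RNN keeps a hidden state h that is updated at every time step from the previous hidden state and the current input:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden, inp = 16, 8                        # hypothetical sizes
W_xh = rng.normal(size=(hidden, inp))      # input-to-hidden weights
W_hh = rng.normal(size=(hidden, hidden))   # hidden-to-hidden (recurrent) weights
W_hy = rng.normal(size=(3, hidden))        # hidden-to-output weights
b_h = np.zeros(hidden)

def rnn_forward(xs):
    """Run the recurrence h(t) = tanh(W_xh x(t) + W_hh h(t-1) + b) over a sequence."""
    h = np.zeros(hidden)
    ys = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # state depends on all past inputs
        ys.append(W_hy @ h)                     # output at this time step
    return ys, h

ys, h_last = rnn_forward([rng.normal(size=inp) for _ in range(5)])
```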

Applications of RNN : —

  1. Image Captioning — It is used to caption an image by analyzing the activities present in it.
Example of image captioning

2. Time Series Prediction

3. Natural Language Processing — Text mining & Sentiment Analysis

4. Machine Translation — By taking input in one language, RNN can be used to translate it to different languages as output.

Taking English as input and converting it to different languages

Types of RNN architectures : —

Basic RNN architectures (Stanford cs231 lecture)
  1. One to One : — It is also known as a vanilla neural network. It is used for basic machine learning problems.
  2. One to Many : — It has a single input and many outputs. Application : image captioning, like the dog catching a ball in the air that we saw earlier.
  3. Many to One : — It has many inputs and a single output. This is typically used in sentiment analysis, where we give a sentence as input and get its sentiment as output (see the sketch after this list).
  4. Many to Many : — It takes a sequence of inputs and generates a sequence of outputs. Application : machine translation.
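As a concrete illustration of the many-to-one case, here is a minimal sentiment-style sketch in PyTorch (the vocabulary size, dimensions, and data are made up for illustration):

```python
import torch
import torch.nn as nn

class ManyToOneRNN(nn.Module):
    """Reads a whole sequence of tokens and emits a single prediction."""
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        x = self.embed(tokens)                 # (batch, seq_len, embed_dim)
        _, h_last = self.rnn(x)                # h_last: (1, batch, hidden_dim)
        return self.head(h_last.squeeze(0))    # one prediction per sequence

model = ManyToOneRNN()
fake_batch = torch.randint(0, 1000, (4, 12))   # 4 "sentences", 12 tokens each
logits = model(fake_batch)                      # shape: (4, 2)
```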

Issues while training an RNN : —

  1. Vanishing Gradient Problem
  2. Exploding Gradient Problem

The problem arises during training of a deep network when the gradients travel back to the initial layers during back-propagation. Because of the chain rule, the gradients go through repeated matrix multiplications. If they have small values (<1) they shrink exponentially until they vanish; this is called the vanishing gradient problem and causes a loss of information through time. If the gradients have large values (>1) they keep growing and eventually blow up and crash the model; this is called the exploding gradient problem.
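A toy calculation shows the effect: multiplying a gradient signal by a factor slightly below or slightly above 1 at every step makes it vanish or explode exponentially with the number of steps.

```python
# Toy illustration of the chain rule over many time steps.
steps = 50
grad_small, grad_large = 1.0, 1.0
for _ in range(steps):
    grad_small *= 0.9   # repeated factors < 1  ->  vanishing
    grad_large *= 1.1   # repeated factors > 1  ->  exploding

print(grad_small)   # ~0.005  (practically zero: early steps stop learning)
print(grad_large)   # ~117    (keeps growing and can blow up training)
```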

Issues due to these problems :

  1. Long training time
  2. Poor Performance
  3. Bad Accuracy

Let’s understand this in a more practical way : —

Consider the following two examples to understand what should be the next word in the sequence :

Image : SimpliLearn

In order to understand what the next word in the sequence should be, the RNN must remember the earlier context, i.e. whether the subject was a singular or plural noun.

Back Propagation

However, it can sometimes be difficult for the error to back-propagate all the way to the beginning of the sequence, which is needed to predict the correct output.

Now, let’s try to understand the above issue in a more MATHEMATICAL way :

Don’t be scared by the heading. I will try my best to explain all the in-depth concepts in a simple manner.

Suppose we have an architecture which takes a sentence in English and gives a French sentence as output. To achieve this, the architecture has to store as much information as possible in its hidden activations.

The sentence might be 20 words long, which means there is a long temporal gap between when the network sees an input and when it uses that input to make a prediction. It is very tough to learn such long-distance relationships: adjusting the input-to-hidden weights based on the first input requires the error signal to travel backwards through the entire path.

Now, the question arises: why doesn’t this gradient problem take place during the forward pass???

Actually, in the forward pass the activations at each step are put through a non-linear activation function, which typically squashes the values and prevents them from blowing up. The backward pass, however, is entirely linear, so there is nothing to prevent the derivatives from blowing up.
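A small NumPy sketch of this asymmetry, with a made-up recurrent weight matrix: the forward activations stay bounded because of tanh, while the product of backward Jacobians has no such bound.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 1.5 * rng.normal(size=(8, 8)) / np.sqrt(8)   # hypothetical recurrent weights

h = rng.normal(size=8)
jacobian_product = np.eye(8)
for _ in range(30):
    pre = W @ h
    h = np.tanh(pre)                              # forward: squashed into (-1, 1)
    # Backward factor through this step: diag(1 - tanh(pre)^2) @ W
    jacobian_product = (np.diag(1 - h**2) @ W) @ jacobian_product

print(np.max(np.abs(h)))                 # always bounded by 1
print(np.linalg.norm(jacobian_product))  # can shrink toward 0 or grow very large
```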

Let’s try to interpret the problem in terms of the function the RNN computes :

As we know, each layer computes a function of the current input and the previous hidden activations, i.e.

h(t) = f(x(t), h(t−1))

By expanding it recursively, we get

h(t) = f(x(t), f(x(t−1), f(x(t−2), … )))

This is an iterated function (a function that is applied over and over to its own output).

Let’s now understand the above concept by taking the example of a simple quadratic function : —

Simple Quadratic equation

If we iterate it multiple times, we get some surprisingly complicated behavior :
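A short sketch, assuming for illustration the quadratic map f(x) = 3.5·x·(1−x) (the exact function in the lecture figure may differ): even this simple function produces complicated behavior when iterated.

```python
def f(x):
    # Hypothetical quadratic example; the original figure may use a different one.
    return 3.5 * x * (1 - x)

x = 0.3
trajectory = [x]
for _ in range(10):
    x = f(x)            # feed the output back in as the next input
    trajectory.append(x)

print(trajectory)       # the iterates bounce around instead of settling smoothly
```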

Let’s simplify further by taking a monotonic example :

monotonic over [0,1]

Now, let’s understand its repeated behavior visually :

Red line denotes the trajectory along which the function iterates

Eventually, the iterates either shoot off to infinity or wind up at a fixed point, i.e. a point where x = f(x), which is where the graph of f intersects the dashed line.

Fixed points are of two types : —

  1. Sources : These repel the iterates (0.82 in the above figure). A source has derivative f’(x) > 1 at the fixed point.
  2. Sinks / Attractors : These attract the iterates (0.17 in the above figure). A sink has derivative f’(x) < 1 at the fixed point.
Phase plot explaining source and sink in a single plot

Source points give rise to the exploding gradient issue, while sink points give rise to the vanishing gradient issue.
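A minimal numeric check of this idea, assuming a made-up monotonic map g with two fixed points: starting near a sink the iterates converge to it, while starting near a source they move away, and the derivative at the fixed point tells us which case we are in.

```python
def g(x):
    # Hypothetical monotonic map for illustration: fixed points at x = 0 and x = 1.
    return x * x

def iterate(x, n=20):
    for _ in range(n):
        x = g(x)
    return x

# Sink at x = 0: g'(0) = 0 < 1, so nearby iterates are attracted to it.
print(iterate(0.3))    # -> essentially 0

# Source at x = 1: g'(1) = 2 > 1, so nearby iterates are repelled from it.
print(iterate(0.95))   # drifts away from 1, down toward the sink at 0
print(iterate(1.05))   # drifts away from 1 and blows up toward infinity
```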

Solutions to the problems of Gradient Explosion and Vanishing Gradient : —

  1. Gradient Clipping :

It helps prevent gradients from blowing up by re-scaling them so that their norm is at most a particular value η, i.e. if ‖g‖ > η, where g is the gradient, we set

g ← η · g / ‖g‖

By doing this we are introducing bias in the training procedure, since the resulting values won’t actually be the gradient of the cost function. The following figure shows an example with a cliff and a narrow valley; if you happen to land on the face of the cliff, you take a huge step which propels you outside the good region. With gradient clipping, you can stay within the valley.

Image showing how clipping can help in preventing explosion of gradients.
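A minimal sketch of clipping by norm (the threshold η = 5 is arbitrary here); deep learning frameworks provide this built in, e.g. torch.nn.utils.clip_grad_norm_ in PyTorch.

```python
import numpy as np

def clip_by_norm(grad, eta=5.0):
    """Rescale the gradient so its norm is at most eta."""
    norm = np.linalg.norm(grad)
    if norm > eta:
        grad = eta * grad / norm   # g <- eta * g / ||g||
    return grad

g = np.array([30.0, -40.0])        # ||g|| = 50, above the threshold
print(clip_by_norm(g))             # [ 3. -4.], norm is now exactly 5
```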

2. Input Reversal :

At the start we learned that it is very difficult for the network to learn long-distance dependencies, which hurts the learning ability of the architecture. This can be mitigated by reversing the order of the words in the input sentence, which can be clearly understood from the image below :

Input Reversal

In the above diagram, we can see that there’s a gap of only one time step between when the first word is read and when it’s needed. This allows the network to learn the relationships between the first words. Once it has learned these, it can go on to learn the more difficult dependencies between words later in the sentences.
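A tiny sketch of the trick, assuming a hypothetical encoder-decoder setup: only the source-side tokens are reversed before being fed in; the target sentence stays in its natural order.

```python
def reverse_source(pair):
    """Reverse the input sentence only; the output sentence is left as-is."""
    source_tokens, target_tokens = pair
    return list(reversed(source_tokens)), target_tokens

example = (["the", "cat", "sat"], ["le", "chat", "s'est", "assis"])
print(reverse_source(example))
# (['sat', 'cat', 'the'], ['le', 'chat', "s'est", 'assis'])
# Now "the" is read last, only one step before the decoder needs to emit "le".
```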

3. Identity Initialization :

The identity function f(x) = x is special: you can iterate it as many times as you like, whereas other iterated functions can show complex and chaotic behavior. Hence, if a network computes something close to the identity function, the gradient computation will be perfectly stable, since the Jacobian is simply the identity matrix.

In the identity RNN architecture all the activation functions are ReLU, and the recurrent weights are initialized to the identity matrix. The ReLU activation clips the activations to be non-negative, but for non-negative activations it is equivalent to the identity function.
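A minimal sketch of that initialization with NumPy (sizes are made up): the hidden-to-hidden weights start as the identity matrix and the activation is ReLU, so at initialization the hidden state is roughly copied forward from step to step.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inp = 16, 8

W_hh = np.eye(hidden)                          # recurrent weights = identity matrix
W_xh = 0.01 * rng.normal(size=(hidden, inp))   # small random input weights

def relu(z):
    return np.maximum(0, z)

h = np.zeros(hidden)
for x in [rng.normal(size=inp) for _ in range(5)]:
    # With W_hh = I and ReLU, the previous (non-negative) state passes through
    # almost unchanged, so gradients neither vanish nor explode at the start.
    h = relu(W_hh @ h + W_xh @ x)
```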

4. Long Short-Term Memory (LSTM):

LSTM is a special kind of RNN which is mainly useful for learning long-term dependencies. The name refers to the idea that the activations of a network correspond to short-term memory, while the weights correspond to long-term memory. If the activations can preserve information over long distances, that makes them long-term short-term memory.

Basic architecture of LSTM

Let’s walk through the operation of an LSTM and understand how it works : —

Basically, LSTM works in three steps :

STEP 1 — Deciding how much of the past it should remember

Example of above step :

Previous output h(t-1) : Alice is good in physics. John on the other hand is good in chemistry.

Current input x(t) : John plays football well. He told me yesterday over phone that he had served as the captain of his football team.

a. The forget gate realizes there might be a change in context after encountering the first full stop.

b. It compares this with the current input sentence at x(t).

c. The next sentence talks about John, so the information about Alice is deleted.

d. The position of the subject is vacated and assigned to John.

STEP 2 — Decide how much this unit should add to the current state.

Example of above step :

Current input x(t) : John plays football well. He told me yesterday over phone that he had served as the captain of his football team.

The input gate analyzes the importance of the above sentence as follows :

[John plays football and he was the captain of his football team] is judged more important than [He told me over the phone], which is less important and is hence forgotten.

STEP 3 — Decide what part of the current cell state makes it to the output.

This example will try to predict the next word in the sentence :

John played tremendously well against the opponent and won for his team. For his contribution to the team, brave ____________ was awarded player of the match.

For a human it is very easy to decide what should go here; however, it is not so easy for the machine. There can be multiple choices for the empty space.

So, in this step the system first checks what kind of word “brave” is ==> an adjective (which describes a noun), so a noun must follow.

Therefore, after going through these steps, it decides that “John” is the best option here.
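Putting the three steps together, here is a rough NumPy sketch of a single LSTM cell (weights, biases omitted, and sizes are made up for illustration): the forget gate decides how much of the old cell state to keep, the input gate decides how much new information to add, and the output gate decides what part of the cell state is exposed as output.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inp = 16, 8

def W():                                  # helper: one weight matrix over [h, x]
    return 0.1 * rng.normal(size=(hidden, hidden + inp))

W_f, W_i, W_c, W_o = W(), W(), W(), W()   # forget, input, candidate, output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)                  # STEP 1: how much of the past to keep
    i = sigmoid(W_i @ z)                  # STEP 2: how much new info to add
    c_tilde = np.tanh(W_c @ z)            #         candidate values to add
    c = f * c_prev + i * c_tilde          # updated cell state (long-term memory)
    o = sigmoid(W_o @ z)                  # STEP 3: what part of the cell to output
    h = o * np.tanh(c)                    # new hidden state (short-term memory)
    return h, c

h = c = np.zeros(hidden)
for x in [rng.normal(size=inp) for _ in range(5)]:
    h, c = lstm_step(x, h, c)
```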

So, we have discussed various things related to RNNs: what they are, why we need them, their applications, the issues faced while training them, and various ways to resolve those issues, including a detailed look at LSTMs. This is a fairly long blog, but it is worth it: before writing it I was determined to cover as much of RNNs as possible in one place. After reading all this, I feel sufficiently equipped to fight a battle, I mean, take a dataset and apply all these learnings to it. I hope you feel the same, so I will write the next blog of this series on the practical application of RNNs. If you like the blog, please let me know by clapping, and comment if you have any doubts.

#machinelearning #deeplearning #neuralnetwork #RNN #LSTM #RecurrentNeuralNetwork #ArtificialIntelligence
