Deep Learning

Understanding Recurrent Neural Networks

An introduction to Recurrent Neural Networks

NVS Yashwanth
Published in Analytics Vidhya · Aug 25, 2020

A Recurrent Neural Network (RNN) in its folded and unfolded forms. Source [1].

What is a Recurrent Neural Network (RNN)?

The first time I came across RNNs, I was completely baffled. How can a network even remember things? Recurrent Neural Networks have proved effective and popular for processing sequential data ever since they first emerged in the late 1980s.

Recurrent Neural Networks are derived from vanilla feed-forward neural networks. They add so-called memory elements that help the network remember previous outputs.


So why the word Recurrent?

They are Recurrent because they repeatedly perform the same task for every element in the sequence, with the output being dependent on the previous computations.

Recurrent Neural Networks (RNNs) are a huge improvement over the vanilla neural network. A typical vanilla neural network computes its output from the current input and the weights alone, and it is limited to a predetermined, fixed input size.

Vanilla Neural Network: Feed Forward Neural Network. Source NNDL [2].

In this article, we will go over the architecture of RNNs, with just enough math, using the Elman network as an example.

Why RNNs?

In typical neural networks, the output is based only on the current input. None of the previous outputs are considered when generating the current output, and there are no memory elements. RNNs are useful precisely in the cases where that history matters.

RNNs are designed to take a series of inputs with no predetermined limit on size.

Why past outputs?

Most applications have temporal dependencies, meaning the generated output depends not only on the current input but also on the previous outputs.

RNNs are useful in speech recognition (Alexa, Google Assistant, etc.), time-series prediction (stock markets, weather forecasting), Natural Language Processing (NLP), and more.

RNNs have the ability to capture temporal dependencies across many time steps.

Deep dive into RNNs

You might be wondering: okay, but how are these networks able to do all this remembering? Let us discuss that now.

RNNs take sequences as inputs in the training phase and have memory elements, which are essentially the outputs of the hidden layers. These so-called memory elements serve as inputs at the next time step.

The Elman network is the most basic three-layer neural network with feedback connections that serve as memory inputs. Don’t get overwhelmed by the notation; we will go over it in a while.

Elman Network. Source Wikipedia [3].

In a FFNN (Feed Forward Neural Network), the output at time t is a function of the current input and the weights. This can be expressed as follows:

Output of FFNN. Source Udacity. [5]

The hidden layer output can be represented with an activation function Φ as follows:

Hidden layer Outputs with activation function in FFNN. Source Udacity. [5]
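
The equation images are not reproduced here; written out in standard notation (the exact symbols are my reconstruction, with W¹ connecting the input to the hidden layer and W² connecting the hidden layer to the output), the two relations look roughly like this:

\bar{h} = \Phi(\bar{x}\,W^{1}), \qquad \bar{y} = \bar{h}\,W^{2}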

When it comes to activation functions, the following are the most commonly used with RNNs:

Activation functions. Source [4]
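
The figure itself is not reproduced here; the activations most commonly paired with RNNs (my assumption about what the figure lists) are the sigmoid, the hyperbolic tangent, and ReLU:

\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad \mathrm{ReLU}(x) = \max(0, x)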

In an RNN (Recurrent Neural Network), however, the output at time t is a function of the current input, the weights, and the previous inputs. This can be expressed as follows:

Output of RNN. Source Udacity. [5]

RNN Folded and Unfolded Model

Let us understand the architecture and the math behind these networks. An RNN has input layers, state layers, and output layers. The state layers are similar to the hidden layers in a FFNN, except that they have the ability to capture temporal dependencies, that is, the previous inputs to the network.

RNN Folded Model. Source Udacity. [5]
RNN Unfolded Model. Source Udacity. [5]

The unfolded model is usually what we use when working with RNNs.

In the pictures above, x̄ (x bar) represents the input vector, ȳ (y bar) represents the output vector, and s̄ (s bar) denotes the state vector.

Wx​ is the weight matrix connecting the inputs to the state layer.

Wy​ is the weight matrix connecting the state layer to the output layer.

Ws​ represents the weight matrix connecting the state from the previous timestep to the state in the current timestep.

The output of the so-called state layer can be given as:

Output of State layer. Source Udacity. [5]
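
In the notation defined above, the state update shown in the image is usually written as follows (my reconstruction of the equation):

\bar{s}_t = \Phi(\bar{x}_t W_x + \bar{s}_{t-1} W_s)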

The output layer (with the softmax function) can be given as:

Output of RNN. Source Udacity. [5]
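
The corresponding output, with the softmax applied to the projection of the state (again my reconstruction of the equation image), is:

\bar{y}_t = \mathrm{softmax}(\bar{s}_t W_y)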

The reason unfolded models are commonly used is that they are much easier to visualize and understand. Let us look at the folded and unfolded Elman network.

Elman Network Folded model at time t. Source Udacity. [5]
Elman Network Unfolded model at time t. Source Udacity. [5]

The folded Elman network at time t, with outputs y1 and y2.

The memory elements are represented by the state layers. The real issue with the folded model is that we cannot visualize more than a single time instant at once.

The unfolded model gives a clear picture of the input, state, and output layers over a span of time, say from time T0 to time Tn. For example, Yt+2 is determined by St+1 and Xt+2 through the corresponding weight matrices Ws, Wx, and Wy.
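
To make the unfolded picture concrete, here is a minimal NumPy sketch of the forward pass of an Elman-style RNN unrolled over a sequence. The function names, dimensions, and initialization are illustrative assumptions, not code from the original article.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax for a single vector.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, Wx, Ws, Wy, s0):
    """Unrolled forward pass: s_t = tanh(x_t Wx + s_{t-1} Ws), y_t = softmax(s_t Wy)."""
    s_prev = s0
    states, outputs = [], []
    for x_t in xs:                                 # one step per element of the sequence
        s_t = np.tanh(x_t @ Wx + s_prev @ Ws)      # state mixes current input and previous state
        y_t = softmax(s_t @ Wy)                    # output depends only on the current state
        states.append(s_t)
        outputs.append(y_t)
        s_prev = s_t
    return states, outputs

# Illustrative sizes (assumed): 4-dimensional inputs, 8 hidden units, 3 output classes.
rng = np.random.default_rng(0)
Wx = rng.normal(scale=0.1, size=(4, 8))
Ws = rng.normal(scale=0.1, size=(8, 8))
Wy = rng.normal(scale=0.1, size=(8, 3))
xs = rng.normal(size=(5, 4))                       # a sequence of 5 input vectors
states, outputs = rnn_forward(xs, Wx, Ws, Wy, s0=np.zeros(8))
```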

Backpropagation Through Time (BPTT)

We can now look at how the network learns. Learning is similar to that in a FFNN, except that we also need to consider previous time steps, because the system has memory. RNNs use Backpropagation Through Time (BPTT).

To simplify things, let us consider a loss function defined as follows:

Loss or error function. Source Udacity. [5]

Et​ represents the output error at time t

dt​ represents the desired output at time t

yt​ represents the calculated output at time t
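
Putting these together, the squared-error loss shown in the image can be written as follows (my reconstruction):

E_t = (d_t - y_t)^2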

In BPTT we calculate the gradients of the error with respect to Wy, Ws, and Wx in order to optimize those weights.

For Wy, the change in weights at time N can be calculated in a single step as follows:

Error w.r.t y, for time N. Source Udacity. [5]
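
Written out (my reconstruction of the equation image), this one-step gradient is simply the chain rule through the output:

\frac{\partial E_N}{\partial W_y} = \frac{\partial E_N}{\partial \bar{y}_N}\,\frac{\partial \bar{y}_N}{\partial W_y}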

For Ws, the gradient is accumulated over time across every state. So, say at time t=3, we consider the gradient contributions from t=1 to t=3 and apply the chain rule through s̄1 (s1 bar), s̄2 (s2 bar), and s̄3 (s3 bar) as follows:

Error w.r.t S, for time t=3. Source Udacity. [5]
Error w.r.t S, for time N. Source Udacity. [5]
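
Reconstructed in symbols, the accumulated gradient at t = 3 and at a general time N reads:

\frac{\partial E_3}{\partial W_s} = \sum_{i=1}^{3} \frac{\partial E_3}{\partial \bar{y}_3}\,\frac{\partial \bar{y}_3}{\partial \bar{s}_3}\,\frac{\partial \bar{s}_3}{\partial \bar{s}_i}\,\frac{\partial \bar{s}_i}{\partial W_s}, \qquad \frac{\partial E_N}{\partial W_s} = \sum_{i=1}^{N} \frac{\partial E_N}{\partial \bar{y}_N}\,\frac{\partial \bar{y}_N}{\partial \bar{s}_N}\,\frac{\partial \bar{s}_N}{\partial \bar{s}_i}\,\frac{\partial \bar{s}_i}{\partial W_s}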

The error with respect to the weights Wx is calculated similarly.

Error w.r.t X, for time t=3. Source Udacity. [5]
Error w.r.t X, for time N. Source Udacity. [5]
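
The corresponding expression for Wx at a general time N (again my reconstruction) is:

\frac{\partial E_N}{\partial W_x} = \sum_{i=1}^{N} \frac{\partial E_N}{\partial \bar{y}_N}\,\frac{\partial \bar{y}_N}{\partial \bar{s}_N}\,\frac{\partial \bar{s}_N}{\partial \bar{s}_i}\,\frac{\partial \bar{s}_i}{\partial W_x}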

Drawbacks of RNN

If we backpropagate through more than roughly 10 time steps, the gradient becomes too small; this phenomenon is known as the vanishing gradient problem. As a result, temporal dependencies that span many time steps are effectively discarded by the network. The root cause is that the gradient is a product of one term per time step, so it can shrink (or grow) exponentially with the number of steps, which makes long-term dependencies hard to capture.

In RNNs we can also have the opposite problem, called the exploding gradient problem, in which the value of the gradient grows uncontrollably. A simple remedy for the exploding gradient problem is gradient clipping: by capping the maximum value (or norm) of the gradient, the phenomenon is kept under control in practice.
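
As a rough illustration of gradient clipping (a generic sketch, not code from the article; the threshold of 5.0 is an arbitrary choice), the gradient is rescaled whenever its norm exceeds a chosen maximum:

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    # Rescale the gradient so its L2 norm never exceeds max_norm.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```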

Conclusions

I hope you got a basic understanding of how RNNs work. RNNs have been further improved by so-called Long Short-Term Memory (LSTM) cells, which address the vanishing gradient problem and help capture temporal dependencies over 10 time steps and even 1,000! The LSTM cell is a bit more complicated and will be covered in another article.

Hey, if you liked this article please show your support by smashing that clap button and sharing this article. Follow me for more articles on Machine Learning, Deep Learning, and Data Science.

Find me around the web

GitHub Profile: This is where I fork

LinkedIn Profile: Connecting and sharing professional updates

Twitter: Sharing tech tweets

Thank you :)
