Introduction to Recurrent Neural Networks

Neesh AI
Jan 21, 2023 · 9 min read


Please forget about Recurrent Neural Networks for now! If I asked you what a Neural Network is, would you be able to answer? Getting into Deep Learning algorithms is a good thing, but what is more important is that you have your basics clear. Please go through the Neural Network tutorial (blog) first, if you have not done so already.

Once you have gone through the mentioned link, let’s jump back to the topic.

What is a Recurrent Neural Network?

Simply put, Recurrent Neural Networks (RNNs) are a class of Artificial Neural Networks.

What differentiates it from a traditional Neural Network?

In a traditional Neural Network, all inputs (and outputs) are assumed to be independent of each other. This is not the case with a Recurrent Neural Network, where the inputs and outputs depend on one another across the sequence.

Why do we need the inputs and outputs to be dependent?

Consider an example where you want to predict the next word in a sentence:

“Tasha lives in India. She speaks fluently ……”

What makes a good prediction possible is knowing that “She” refers to Tasha and that the country she lives in is India. Given this context, the most suitable word seems to be “Hindi”, “English”, or another regional language. If you didn’t know the first sentence (Tasha lives in India), it would be difficult to predict the word “Hindi”, wouldn’t it?

Why are they called Recurrent Networks?

RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. We can think of RNNs as Neural Networks with a “memory” that captures information about what has been calculated so far.

Fig 1 — http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Backpropagation vs. Backpropagation Through Time:

To be honest, I do not see much difference between Backpropagation and Backpropagation Through Time, since both use the same underlying algorithm (i.e., the chain rule applied to the underlying computation graph, that is, the neural architecture, to calculate gradients of a loss function with respect to parts of the graph, especially the parameters).

The name “Backpropagation Through Time” simply signifies that the algorithm is being applied to a temporal neural model (a Recurrent Neural Network, or RNN), and nothing else.

What happens in an RNN is that we unfold it over the time steps (the elements of a sequence), sharing the parameters at each step, to create one very deep (in time) Neural Network. You can think of it this way: we unfold the network over a variable number of time steps, according to the number of elements that come before the target to be predicted. This unfolding procedure is essentially what “Backpropagation Through Time” refers to; backpropagation through time is effectively classical backpropagation of errors applied to the unrolled RNN.
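
To make this concrete, here is a toy sketch (my own illustration, not taken from the posts referenced here) of backpropagation through time on a scalar linear RNN s_t = w * s_{t-1} + u * x_t, where the loss is taken to be just the final state. Unrolling the forward pass and then walking backwards through the unrolled graph is all that BPTT does; note how the shared parameter w collects a gradient contribution from every time step.

```python
# Toy BPTT on a scalar linear RNN: s_t = w * s_{t-1} + u * x_t
# (illustrative sketch; the loss here is simply the final state s_T)
def bptt_scalar(w, u, xs, s0=0.0):
    # forward pass: unroll over the sequence, storing every state
    states = [s0]
    for x in xs:
        states.append(w * states[-1] + u * x)

    # backward pass: ordinary backpropagation on the unrolled graph
    dw, du = 0.0, 0.0
    grad_s = 1.0                       # dL/ds_T, since L = s_T
    for t in range(len(xs), 0, -1):
        dw += grad_s * states[t - 1]   # step t's contribution to dL/dw
        du += grad_s * xs[t - 1]       # step t's contribution to dL/du
        grad_s *= w                    # chain rule back to s_{t-1}
    return dw, du

print(bptt_scalar(w=0.5, u=1.0, xs=[1.0, 2.0, 3.0]))
```

The repeated multiplication of grad_s by the same w in the backward loop is also exactly where the vanishing/exploding gradient problem discussed later comes from.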

Fig 2 — https://dennybritz.com/posts/wildml/recurrent-neural-networks-tutorial-part-3/

Unfolding RNN:

Fig 3 — https://dennybritz.com/posts/wildml/recurrent-neural-networks-tutorial-part-1/

Let’s say the sequence we are talking about is a sentence of 5 words; the network would then be unrolled into a 5-layer Neural Network, one layer for each word. The computation happening in an RNN is governed by a few simple formulas built from the following quantities (a sketch of the formulas follows the list):

  • Xt is the input at time step t.
  • St is the hidden state at time step t. It is the “memory” of the network, calculated from the previous hidden state and the input at the current step.
  • Ot is the output at step t. For example, if we wanted to predict the next word in a sentence, it would be a vector of probabilities across our vocabulary.
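
As a sketch of those formulas, a common formulation (the one used in the WildML tutorial referenced in the figures) is S_t = tanh(U·X_t + W·S_{t-1}) and O_t = softmax(V·S_t). Here is roughly what that looks like in NumPy; the weight names U, W, V and all shapes are assumptions for illustration, not something fixed by this post.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, U, W, V):
    s = np.zeros(W.shape[0])              # initial hidden state (the "memory")
    outputs = []
    for x in xs:                          # the same U, W, V are reused at every step
        s = np.tanh(U @ x + W @ s)        # new state from current input + memory
        outputs.append(softmax(V @ s))    # e.g. probabilities over the vocabulary
    return outputs, s

# toy usage: a 5-step sequence of 8-dimensional inputs, hidden size 16, vocab size 10
rng = np.random.default_rng(0)
xs = [rng.standard_normal(8) for _ in range(5)]
U = rng.standard_normal((16, 8))
W = rng.standard_normal((16, 16))
V = rng.standard_normal((10, 16))
outputs, final_state = rnn_forward(xs, U, W, V)
print(len(outputs), outputs[0].shape)     # 5 (10,)
```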

In short, the main feature of an RNN is its hidden state, which captures information about the sequence and makes it available at later steps.

Vanishing/Exploding gradient problem:

In theory an RNN can handle long sequences very effectively, but unfortunately this does not hold once we start applying it in practice. With multiple layers, the first layer maps a large input region to a smaller output region, which is mapped to an even smaller region by the second layer, and so on. As a result, even a large change in the parameters of the first layer barely changes the output. If a change in a parameter’s value causes only a tiny change in the network’s output, the network simply cannot learn that parameter effectively; this is the vanishing gradient problem.

One can think of it in this way:

As we discussed earlier, a Recurrent Neural Network applies a transformation to its state at each time step. Since the network reuses the same weight matrix, the same transformation is applied at every step. During backpropagation, the gradient of the loss is therefore multiplied by the same matrix again and again, so it is repeatedly scaled up or scaled down. This makes it much more likely for the gradients to vanish or explode.
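
A toy numerical illustration of this (the numbers are arbitrary): if the backward pass multiplies the gradient by the same factor w at every one of T time steps, the gradient reaching the earliest steps scales like w**T.

```python
# Repeatedly applying the same scaling factor over 50 time steps
T = 50
for w in (0.9, 1.0, 1.1):
    print(f"w = {w}: gradient scale after {T} steps ~ {w ** T:.3e}")
# 0.9 -> ~5e-03 (vanishes), 1.0 -> 1.0 (preserved), 1.1 -> ~1.2e+02 (explodes)
```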

One way to deal with this problem is to encourage the transformation applied to the states to roughly preserve their scale. That is why LSTMs (which we will talk about shortly) compute the next cell state by multiplying the previous cell state element-wise with the forget gate, whose values are kept close to 1 for information that should be preserved.

Fig 4 — https://www.researchgate.net/figure/The-vanishing-gradient-problem-for-RNNs-The-picture-is-reproduced-from-5_fig1_325886727

LSTMs and GRUs:

To solve the problem of vanishing gradients we use modified versions of the RNN: the Gated Recurrent Unit (GRU) and the Long Short-Term Memory (LSTM). The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates. The GRU, on the other hand, controls the flow of information like the LSTM unit but without a separate memory cell; it simply exposes the full hidden content without any control.

Fig 5 — https://www.frontiersin.org/articles/10.3389/frai.2020.00040/full

Let’s talk about them in detail.

Understanding LSTMs:

Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through: a value of zero means “let nothing through,” while a value of one means “let everything through!” An LSTM has three of these gates to protect and control the cell state (the path along which information flows): an “input” gate controls the extent to which a new value flows into the memory, a “forget” gate controls the extent to which a value remains in memory, and an “output” gate controls the extent to which the value in memory is used to compute the output activation of the block.
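
Here is a minimal sketch of a single gate under the description above: a sigmoid layer squashes its input to (0, 1), and a pointwise multiplication decides how much of each component of some candidate vector is let through. The weight matrix W, the bias b, and the inputs are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(x, h_prev, W, b, candidate):
    # sigmoid layer: one value in (0, 1) per component of the candidate
    g = sigmoid(W @ np.concatenate([h_prev, x]) + b)
    # pointwise multiplication: 0 means "let nothing through", 1 means "let everything through"
    return g * candidate
```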

Fig 6 — http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Step-by-step walkthrough of the LSTM:

The first step in an LSTM is to decide what information we are going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It outputs a value between 0 and 1 for each element of the cell state, where 1 means “keep this as it is” and 0 means “get rid of this.”

Fig 7 — http://colah.github.io/posts/2015-08-Understanding-LSTMs/
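
For reference, the forget gate in the Colah post these figures come from is written as

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

where [h_{t-1}, x_t] denotes the concatenation of the previous hidden state and the current input.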

Next, we have to decide what new information we are going to store in the cell state. This step has two parts: first, a sigmoid layer called the “input gate layer” decides which values we will update; next, a tanh layer creates a vector of new candidate values that could be added to the state. In the following step, these two are combined to create the update to the state.

Fig 8 — http://colah.github.io/posts/2015-08-Understanding-LSTMs/
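
In the same notation, the input gate and the candidate values are

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)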

It is now time to update the old cell state, Ct−1, into the new cell state Ct. The previous steps have already decided what the update should be; we just need to actually apply it.

Fig 9 — http://colah.github.io/posts/2015-08-Understanding-LSTMs/
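
The update itself, in the same notation, is

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t

where \odot denotes element-wise multiplication: forget a portion of the old state, then add the scaled candidate values.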

Finally, we need to decide what we are going to output, based on the cell state we have just computed.

Fig 10 — http://colah.github.io/posts/2015-08-Understanding-LSTMs/
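
The output step in the same notation is

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(C_t)

so the hidden state h_t is a filtered version of the (squashed) cell state.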

That is it as far as the LSTM is concerned. Today a large number of people use LSTMs instead of the basic RNN, and they work tremendously well on a diverse set of problems. The most remarkable results are achieved with LSTMs rather than plain RNNs, to the point that when someone talks about using an RNN, they often actually mean an LSTM.
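
Before moving on, here is the whole walkthrough recapped as a single NumPy update step. It follows the equations above; the weight matrices Wf, Wi, Wc, Wo and the biases are illustrative assumptions, each of shape (hidden, hidden + input).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z + bf)           # forget gate: what to drop from the cell state
    i = sigmoid(Wi @ z + bi)           # input gate: which values to update
    c_tilde = np.tanh(Wc @ z + bc)     # candidate values for the cell state
    c = f * c_prev + i * c_tilde       # new cell state
    o = sigmoid(Wo @ z + bo)           # output gate
    h = o * np.tanh(c)                 # new hidden state (the output of the block)
    return h, c
```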

Understanding GRUs:

A large number of variations on the LSTM are in use today. One reasonable variation is the Gated Recurrent Unit, or GRU. It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes to the way the output is produced. The resulting model is simpler than standard LSTM models and has been quite well received in the data science community.
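
Here is a minimal sketch of a single GRU step, following the common formulation shown in the Colah post’s GRU figure; the weight names Wz, Wr, Wh are illustrative assumptions. Note how the update gate z plays the combined role of the LSTM’s forget and input gates, and how there is no separate cell state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, Wh):
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx)                                      # update gate
    r = sigmoid(Wr @ hx)                                      # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]))   # candidate state
    return (1 - z) * h_prev + z * h_tilde                     # new hidden state
```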

It has been observed that LSTMs tend to work better on larger datasets while GRUs tend to work better on smaller ones. However, there is no hard and fast rule as such.

Fig 11 — http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Limitations of RNN:

We have already observed that a simple RNN struggles with the vanishing gradient problem, which is why the LSTM was introduced. Now, does the LSTM have limitations of its own?

The answer is a loud yes.

Apart from the fact that it is quite complex to understand at first (not a genuine disadvantage, though), it is slower to train than simpler models. With careful initialization and training, a simple RNN itself can perform on par with an LSTM, at lower computational cost. When recent information matters more than old information, the LSTM is no doubt the better choice, but there are problems where you want to reach far into the past; for such cases a mechanism called “attention” is growing popular, and a slightly modified version of this model is called the “Recurrent Weighted Average network”. We will discuss this in detail some other time.

Future of Recurrent Neural Network:

One more shortcoming of conventional LSTMs is that they can only make use of previous context. A newer variation that is becoming quite popular is the Bidirectional RNN (BRNN). It processes the data in both directions using two separate hidden layers; combining the two gives complete information about the context. BRNNs have been used quite successfully in speech recognition models.
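
As a conceptual sketch (everything here is illustrative, not a production implementation), a bidirectional layer can be built from two vanilla RNN passes, one reading the sequence left to right and one reading it right to left, with the two hidden states for each position concatenated:

```python
import numpy as np

def rnn_pass(xs, U, W):
    h = np.zeros(W.shape[0])
    states = []
    for x in xs:
        h = np.tanh(U @ x + W @ h)
        states.append(h)
    return states

def birnn(xs, U_f, W_f, U_b, W_b):
    forward = rnn_pass(xs, U_f, W_f)                 # summarizes the past at each step
    backward = rnn_pass(xs[::-1], U_b, W_b)[::-1]    # summarizes the future at each step
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

# toy usage: a 4-step sequence of 3-dimensional inputs, hidden size 5 per direction
rng = np.random.default_rng(0)
xs = [rng.standard_normal(3) for _ in range(4)]
U_f, W_f = rng.standard_normal((5, 3)), rng.standard_normal((5, 5))
U_b, W_b = rng.standard_normal((5, 3)), rng.standard_normal((5, 5))
print(birnn(xs, U_f, W_f, U_b, W_b)[0].shape)        # (10,) at each time step
```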

Sequence-to-sequence LSTMs, also called encoder-decoder LSTMs (a combination of two LSTMs), are an application of LSTMs that is receiving a lot of attention given its impressive capability in question-answering models (chatbots).

Time series prediction and anomaly detection are other areas where RNNs (LSTMs) seem quite promising. Given this wide range of problems where RNNs can be applied effectively, the future of the RNN seems quite bright, doesn’t it?

This was a long tutorial, and if you have made it this far, you are probably feeling tired, so I am stopping at this point. Remember that if you want to work in the field of Natural Language Processing, or in deep learning more broadly, learning Recurrent Neural Networks is almost a must.

There are a few equations alongside the pictorial representations that can be ignored for now. We will cover them all in the implementation and future blogs.

Follow NeeshAi to learn more about Data Science, Product Management and more in Tech!

With the promise to come up with next article very soon, Adios!

References:

  1. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  2. https://arxiv.org/pdf/1506.02078.pdf
  3. https://machinelearningmastery.com/gentle-introduction-long-short-term-memory-networks-experts/
