Recurrent Neural Network (RNN)

Dhruv Pamneja
12 min read · May 5, 2024


As we foray into language-based tasks, I would like to draw attention to the examples below, which lay out certain tasks from this domain:

  • Q&A with a chatbot : A chatbot can be built to answer users' questions based on a given knowledge base.
  • Language translation : Whenever we need to convert a sentence or text into another language while retaining the same meaning and information, we rely on tools that can produce the translated text accurately.
  • Text generation : We often see automatic text completion in everyday applications such as Gmail, LinkedIn and MS Teams, where the machine suggests the next words before we type them.

In all these tasks, the sequence of the words or data is of primary importance: it is the core of determining the output. Even minor changes in the placement or interpretation of the sequence can be read very differently by the machine, leading to varied results. Hence, simple machine learning techniques are inefficient for such tasks, and our reliance on deep learning architectures grows, since we now also have to maintain the order of the data.

In comes the Recurrent Neural Network (RNN), a deep learning architecture used to handle sequential data. As the name suggests, its core property lies in a recurring process: after one iteration of input is processed to deliver an output, that same output is fed back into the network along with the next iteration's input, so the previous input is retained while generating the next output. The image below outlines the basic architecture of the RNN:

Now, as we can see above, the RNN works in a recursive or recurrent manner. Once the output is calculated for a given data point, it is both emitted as output and fed back into the same neural network, to be taken into account while calculating the output for the next data point. So in the above example, the output o1 generated from x1 is given back to the network as the processing for input x2 starts, and so on.

In each iteration, the input is multiplied by the weights and a bias is added to yield that step's output, and along those lines we continue to generate outputs, each one emitted and also passed on to the next input data point.
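To make the recurrence concrete, it can be written as a single formula, using the notation adopted later in the forward propagation section (w for the input weight, w' for the recurrent weight, b for the bias and f for the activation function):

o_t = f(w·x_t + w'·o_(t−1) + b)

Each output o_t thus depends on the current input x_t and, through the previous output o_(t−1), on every input that came before it.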

Now, let us also define the ways in which we can map our data in an RNN from input to output.

One to One

Here, as the name implies, we give a single input and get a single output, based on the task defined. To visualise this, see below:

An example of where such a mapping can be used is Image Classification: we give only one image, and the network is supposed to answer a specific question or place the image in a specific category based on the problem statement.

One to Many

Here, as the name implies, we give a single input and get multiple outputs, based on the task defined. To visualise this, see below:

An example of where such a mapping can be used is Image Captioning: the input is a single image, whereas the network is expected to generate multiple outputs, one word of the caption at a time.

Many to Many

This means that the number of inputs and outputs can both be greater than one. There is also a further split here, between the case where the number of inputs and outputs is the same and the case where they differ.

The case where the inputs and outputs are the same in number (synchronised) can be seen below:

An example of the above can be found in tasks such as Named Entity Recognition and Part-of-Speech Tagging, where each word goes in as an input and the network is expected to output exactly which part of speech or class that word belongs to.

The case where the inputs and outputs differ in number (asynchronised) can be seen below:

Such a technique can be used for Machine or Language Translation. A sentence in one language may have a particular length, but as we translate it to another language, the length may increase or decrease depending on the language.

Now, this is a very interesting case, as such tasks laid the foundation for encoder-decoder architectures, Transformers, BERT etc., which today serve as core components of LLMs (Large Language Models) such as ChatGPT, Llama and Gemini.

Many to One

Here, we simply mean that the inputs are multiple while we expect a single output, which can be visualised as below:

An example of the above is the task of Text Classification: based on the given data, we can train our model to output which class the data belongs to.

Forward Propagation

Now, let us take the example of sentiment analysis, where the words of a sentence are given to the RNN as separate data points, and it must yield a final output of positive or negative. To understand this, let us look at the diagram below:

As we can see above, for the first data point, the neuron's output is calculated as

o1 = f(w·x1 + b)

With this, we get the first output, where w is the initialised weight for the input and b is the bias added to this layer. Now, when we proceed to calculate the next input data point, we do so as follows:

o2 = f(w·x2 + w'·o1 + b)

Here, we have taken the previous output into account, adding its product with w' alongside the calculation for the incoming data point.

Finally, once we reach the last data point, we apply an activation function to its output to yield the final prediction. Since this is an example of binary classification between two sentiments, we use the sigmoid function. This overall flow is known as forward propagation.
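To make this concrete, here is a minimal sketch of the forward pass in plain NumPy. The tanh activation inside the recurrence and all the numbers are illustrative assumptions; only the final sigmoid comes from the setup above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(xs, w, w_rec, b):
    """Many-to-one forward pass: o1 = f(w*x1 + b), then
    o_t = f(w*x_t + w'*o_(t-1) + b), finishing with a sigmoid."""
    o = 0.0
    for x in xs:                        # one word (data point) per time step
        o = np.tanh(w * x + w_rec * o + b)
    return sigmoid(o)                   # probability of "positive" sentiment

# toy sentence encoded as three numeric features
print(forward([0.2, -0.5, 0.9], w=0.7, w_rec=0.4, b=0.1))
```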

After this process, we get an output which we can refer to as ŷ, and from this we subtract y to arrive at the final loss. Now, just like in an ANN, we back propagate to update all the weights and minimise the overall loss. We will expand on this in the next section.

Backward Propagation

Now, as we have seen, there also exists the process of back propagation, so as to update the weights and minimise the loss function. When we move backwards, i.e. from right to left, we encounter w' as the first weight to be updated, for which we use an update formula. To understand it better, let us look at the update formula for this weight at the last time stamp, with η as the learning rate:

w'_new = w'_old − η·(∂L/∂w')

In the above formula, we are given the learning rate and the old w'; the question, however, is how to find the partial derivative of the loss function L with respect to w'. For that, we can use the simple chain rule, which states (with o2 as the output at the last time stamp, as in the forward pass above):

∂L/∂w' = (∂L/∂ŷ)·(∂ŷ/∂o2)·(∂o2/∂w')

With the above, we obtain a new w', following which we find the previous weight to be updated, i.e. the w at the last time stamp, which is modified by a similar formula:

w_new = w_old − η·(∂L/∂w)

where the formula for the partial derivative again follows the chain rule:

∂L/∂w = (∂L/∂ŷ)·(∂ŷ/∂o2)·(∂o2/∂w)

As we can see, we are moving in a cyclic manner from right to left, taking into account the derivative of each output so as to update the required w and w'.

Thus, each output we obtain is used twice, to update both the w and the w' for that time stamp, as we recursively move back. This process runs for a certain number of epochs or iterations, which we can cut short with early stopping if the loss remains stagnant for a period.

To put it in simple terms, think of it as a chain of dominoes falling from right to left as we move back: each one topples the domino behind it, and this ripple continues recursively as the updates take effect throughout the RNN.
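To see the whole loop in one place, here is a toy sketch using PyTorch's autograd, which applies this same chain rule through every time step for us. All the numbers, and the tanh inside the recurrence, are illustrative assumptions rather than values from the diagrams:

```python
import torch

# a single-unit RNN unrolled over three time steps (toy values)
w = torch.tensor(0.7, requires_grad=True)      # input weight w
w_rec = torch.tensor(0.4, requires_grad=True)  # recurrent weight w'
b = torch.tensor(0.1, requires_grad=True)      # bias b

o = torch.tensor(0.0)
for x in [0.2, -0.5, 0.9]:                     # words encoded as numbers
    o = torch.tanh(w * x + w_rec * o + b)      # forward propagation
y_hat = torch.sigmoid(o)                       # final sentiment score

loss = (y_hat - 1.0) ** 2                      # squared error against target y = 1
loss.backward()                                # chain rule through every time step

with torch.no_grad():                          # w_new = w_old - lr * dL/dw
    lr = 0.1
    w -= lr * w.grad
    w_rec -= lr * w_rec.grad
    b -= lr * b.grad
```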

Problem with RNN

However, we encounter a typical issue when we back propagate. In the real world, if our RNN receives long textual or time-series data, the unrolled network becomes deeper, and we head into the problem of the vanishing gradient.

As we have seen above, we couple the derivative with the learning rate to update the weights. Take the classic example of an RNN whose activation function is the sigmoid. We know that the value of the sigmoid falls between 0 and 1, and its derivative falls between 0 and 0.25. Now, as we keep moving from right to left, the chain rule multiplies in one such derivative per time step, so the final gradient keeps converging towards 0 and the weights receive no more than a marginal update. Thus, our gradient vanishes over the course of training.
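A couple of lines of arithmetic make this visible: even in the best case, where every sigmoid derivative takes its maximum value of 0.25, the chain-rule product shrinks geometrically with the number of time steps.

```python
# best-case chain-rule product of sigmoid derivatives over n time steps
for n in [1, 5, 10, 20, 50]:
    print(f"{n:>3} steps -> gradient factor {0.25 ** n:.3e}")
```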

To tackle this issue, we use certain variants of the RNN. As we expand further into the world of RNNs, two important variants in particular are used heavily in the world of generative AI, and they are described ahead.

Long Short Term Memory (LSTM)

This variant of the RNN has significant importance in NLP tasks, as it can retain information from previous input data points over long stretches of the sequence. Let us take an example for better understanding. Suppose we have the sentence below:

My name is Dhruv. I am a student of Indian Institute of Technology Madras and have a keen passion for studying artificial intelligence and machine learning, along with a keen passion to read and watch movies. My favourite genre to watch is action and I enjoy reading non fiction. I am a student of the Data Sciences department in my university.

So here, if we were asked in what context the department I mention appears, or which genre is my favourite to watch, we would clearly know the answer by reading the entire paragraph. It is just as important for the machine to understand the context in which an input is referenced, as it may come from a nearby data point or from a much earlier data point in the input stream.

That is the core purpose of an LSTM: it allows the machine to retain the context of the entire conversation.

The core components of an LSTM are as follows:

  • Memory cell (the context line)
  • Forget gate
  • Input gate
  • Output gate

To understand this in brief, let us reference the image shown below:


As we can see above, a line passes along the top of each cell, referenced as C; this is the context line. The first operation it encounters is the X, a pointwise multiplication. The main idea of this operation is to forget any information which is not required and to decide how much the cell needs to remember in the given context. This is the forget gate of the LSTM, as it decides exactly what, and how much, information is retained in the present iteration.

The next operation we encounter is the + operation, which adds new information received from the current data point to the present iteration. This new context information comes from the concatenation of the previous cell's output with the current data point ingested into the LSTM. This can be interpreted as the input gate of the LSTM.

Finally, the LSTM normalises the updated context, passes it along the context line and yields the final output. In essence, it keeps the new information that is to be retained and passes it on to the memory of the next LSTM cell, where it arrives as input context. Hence, this last stage acts as the output gate of the LSTM.
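To pin down the three operations above, here is a minimal NumPy sketch of a single LSTM step. The parameter names (Wf, Uf, bf and so on) are hypothetical labels chosen for this illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p is a dict of weight matrices and biases."""
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])       # forget gate (the X)
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])       # input gate (the +)
    c_cand = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])  # candidate context
    c = f * c_prev + i * c_cand           # update the context line C
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])       # output gate
    h = o * np.tanh(c)                    # new hidden state / output
    return h, c

# toy usage: 2-dimensional input, 3-dimensional hidden state
rng = np.random.default_rng(0)
names = ["Wf","Uf","bf","Wi","Ui","bi","Wc","Uc","bc","Wo","Uo","bo"]
p = {k: rng.normal(size=(3, 2) if k.startswith("W")
                   else (3, 3) if k.startswith("U") else 3) for k in names}
h, c = lstm_step(rng.normal(size=2), np.zeros(3), np.zeros(3), p)
```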

To conclude, we prefer the LSTM over a plain RNN because:

  • it tackles the vanishing gradient problem during weight updates
  • it retains context in large, deep RNN networks and/or on huge datasets

Gated Recurrent Unit (GRU)

Now, the Gated Recurrent Unit (GRU) is a more recent relative of the LSTM, introduced in 2014. The basic difference between the two is that instead of keeping two gates or streams for long-term and short-term memory, the GRU architecture combines them into a single hidden state which retains both kinds of information, as shown below:

Now, let us go ahead and visualise how the inside of a GRU cell looks:


In the above image, we can see primarily two gates, which are as follows:

  • Update Gate
  • Reset Gate

As their names suggest, the function of the update gate is to decide what to retain from the previous cell and how much of it, while the reset gate decides what information the cell is supposed to forget and in what quantity.
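In the same sketch style as the LSTM example above (again with hypothetical parameter names), a single GRU step can be written as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step; p holds hypothetical weight matrices and biases."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])  # update gate: what to retain
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])  # reset gate: what to forget
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate state
    return (1 - z) * h_prev + z * h_cand  # one hidden state mixes old and new
```

Notice there is no separate context line C here: the single hidden state does both jobs, which is exactly the simplification that makes the GRU lighter than the LSTM.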

It would be fair to call the GRU a lightweight version of the LSTM described earlier, as it has fewer gates, which in turn makes it computationally more efficient. This is one of the reasons for its recent popularity.

Recurrent Neural Networks (RNNs) have revolutionised how we handle sequential data, opening doors to remarkable advancements in language processing tasks like Q&A with chatbots, language translation, and text generation. Their ability to retain information and handle sequence-based problems makes them invaluable for tasks where order and context are crucial.

However, RNNs face the challenge of vanishing gradients when dealing with long sequences. To address this, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures have emerged, offering solutions to retain context and tackle vanishing gradients. To learn more about RNNs, LSTMs and GRUs in depth, you can refer to the paper here.

With the use of the above, natural language processing tasks saw real success, as models could now retain information from both the long term and the short term. However, one factor which hindered further scaling was that they processed data sequentially.

That means they took input at successive iterations and built up the representation step by step, which may not be the most suitable approach when dealing with large amounts of data. Tackling this required models which could process data in parallel so as to retain information over huge datasets, which eventually led to the creation and adoption of Transformers, BERT and GPT for such tasks.

Credits

I would like to take the opportunity to thank Krish Naik for his series on deep learning on his YouTube channel, which has allowed me to learn and present the above article. You can check out his YouTube channel here. Thanks for reading!
