RECURRENT NEURAL NETWORK | CELL STATE | LEARNING

Recurrent Neural Network: Part 1

Understanding the basics of RNNs, with an introduction and use cases

Chinmay Bhalerao
Data And Beyond


Photo by Nastya Dulhiier on Unsplash

After writing about LangChain, LLMs, and vision transformers, people ask me, “Why are you going back to RNNs?” They are partially right, but my view is that the RNN was the first model to process sequential data effectively, before any LLM or transformer. So it's worth seeing how far we have advanced in a race whose starting point was the RNN.

There are many applications in the market that use RNNs to process sequential data. Although we have attention networks and transformers now, the RNN used to be the prominent candidate for working with sequential data. It is often assumed that if you are working with LLMs, you should have a strong grasp of RNNs. Let's look at RNNs in detail.

What is RNN?

RNN stands for RECURRENT NEURAL NETWORK. The word RECURRENT has a very fitting meaning: returning or happening time after time. An RNN works on time stamps, and the same network is used again and again [you will see how perfect the meaning is by the end of this blog, so stay tuned!].

An RNN is a type of neural network that can remember things. It does this through connections between its nodes that loop back around to the same node. This allows the network to keep track of what it has seen or heard in the past, which is helpful for tasks like machine translation or text generation.

Then a question arises:

Why are we not using an ANN or CNN?

The answer lies in the type of data that we are trying to process.

Sequential data: Sequential data is a type of data where the order of the data points matters. This means that the value of a data point can depend on the values of the data points that come before it. For example, the temperature at a given time can depend on the temperature at previous times.

Time-dependent [sequential] data [Source: ResearchGate: Feature-based time series analysis]

A sentence has its proper meaning only when its words are in the proper sequence. So sequential data and its order have a lot of importance in applications. Modeling of sequence data is known as sequence modeling.

Then what's wrong with ANNs and CNNs?

Let's see the reasons why we can't use ANNs and CNNs for sequence modeling.

1. Fixed input and output neurons

We know that once we fix the number of input and output neurons, we can't change it across iterations. In problems like machine translation, we can't be sure how many words the translated output will contain.

As you can see in the image above, I asked Google Translate to translate a sentence into Hindi. In English, I wrote only 6 words, but the Hindi output contained 9 words. This shows that in such scenarios the output length is never fixed, so we can't assign an exact number of output neurons.

2. Parameter sharing

Using the convolution operation, a CNN can share parameters across positions, but an ANN (artificial neural network) doesn't allow you to do that across a sequence. And the most important part of all is the sequence itself; an artificial neural network doesn't handle it. What if I slightly change the words of a sentence but the context or meaning of the sentence stays the same? An ANN won't figure out that the output should be similar, because parameters are not shared.

3. Computations

Let's take the example of named-entity recognition. If I want to recognize a person's name, then in the usual approach I have to do one-hot encoding, where I create a column for each word and the person's column becomes 1 while all others become 0.

We have to do that for all entities and for the whole corpus of words, which makes the input vectors very big. That ultimately results in very heavy computation and a lot of sparse matrices.
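
To make the scale issue concrete, here is a minimal one-hot encoding sketch in NumPy. The tiny vocabulary and sentence below are made up purely for illustration; a real corpus would have tens of thousands of columns, almost all of them zero.

```python
import numpy as np

# Hypothetical tiny vocabulary; a real corpus easily has 50,000+ words.
vocab = ["i", "met", "chinmay", "in", "pune", "yesterday"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str, vocab_size: int) -> np.ndarray:
    """Return a vector that is 1 in the word's column and 0 everywhere else."""
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1.0
    return vec

sentence = ["i", "met", "chinmay"]
encoded = np.stack([one_hot(w, len(vocab)) for w in sentence])
print(encoded.shape)  # (3, 6); with a 50,000-word vocabulary it would be (3, 50000)
print(encoded)        # mostly zeros: the sparsity described above
```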

4. Independent of previous outputs

When we work with an ANN, we assume that the prediction for one label/category is independent of the next prediction, because each example is treated as independent. But what if I want to predict the next word, or build a bot that takes previous outputs into consideration? In such scenarios, we can only use an RNN.

Because of these problems, we searched for a method that would be helpful in all of the above scenarios.

Working of RNN

RNNs are typically made up of a series of interconnected nodes. Each node has a number of inputs and outputs, and each input is multiplied by a weight before being summed into the node’s output. The weights on the connections between nodes are learned during the training process. The goal of training is to find a set of weights that minimizes the error between the network’s predictions and the actual data.

This is an image of the RNN network from stanford.edu. It looks a bit hard and confusing, so let's understand it in a simpler form.

Fig: Basic RNN [Image by author]

This is the basic block of an RNN. The input X goes into h, which is known as the CELL STATE, where the activation function is applied. So at the start, the input goes through the activation, and after processing we get the output. Simple, right? Now let's give this network more of the form of an RNN.

The second figure has an arrow that starts from h and ends back at h. What is that?

It is simply a representation of the same network architecture being repeated. We discussed sequential data earlier: each word or each date serves as a separate input, which we call a time stamp in an RNN. So the same network is repeated many times, once for each time stamp.

The unfolding of RNN [Images by author]

If we unfold/unroll the basic unit, it will look something like the above image.

It should be noted that “we are using the same architecture multiple times; the time stamps are different, but the network is the same.”

Image by author

The RNN takes an input sequence and creates a hidden state [h(t)]. The hidden state is then used to predict the next output in the sequence.

What happens in a hidden state?

The hidden state h(t) at time t is a representation of the network’s current state of knowledge. It is calculated as a function of the current input and the previous hidden state, and it is used to predict the next output.

h(t) = f(U x(t) + W h(t−1))

The hidden state can be thought of as the network’s “memory,” as it stores information about the sequence of inputs that have been processed so far. This allows the network to learn long-range dependencies in the data, which is essential for tasks such as natural language processing and machine translation. h(t)​ is calculated based on the current input and the previous time step’s hidden state:

h(t) = f(U x(t) + W h(t−1))
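
As a toy sketch of this update rule (the matrix sizes and random weights below are purely illustrative, not a trained model), the loop applies the same U and W at every time stamp:

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size = 4, 3                    # illustrative sizes
U = rng.normal(size=(hidden_size, input_size))    # input-to-hidden weights
W = rng.normal(size=(hidden_size, hidden_size))   # hidden-to-hidden weights
V = rng.normal(size=(2, hidden_size))             # hidden-to-output weights (2 outputs, assumed)

sequence = rng.normal(size=(5, input_size))       # 5 time stamps of toy input
h = np.zeros(hidden_size)                         # h(0): empty "memory"

for x_t in sequence:
    # The same U and W are reused at every time stamp (the recurrent part).
    h = np.tanh(U @ x_t + W @ h)                  # h(t) = f(U x(t) + W h(t-1))
    y_t = V @ h                                   # output read from the hidden state
    print(y_t)
```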

The predicted output is then fed back into the RNN as input, and the process repeats until the end of the input sequence is reached. The RNN learns to predict the next output in the sequence by adjusting the weights of its connections based on the error between the predicted output and the actual output.

For handling long-term dependencies, the most common activation functions used in RNN modules are described below:

More on Sigmoid, Tanh, and ReLU.
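
For reference, here is a quick NumPy sketch of those three activations; it is framework-agnostic and not tied to any particular RNN library.

```python
import numpy as np

def sigmoid(x):
    # Squashes values into (0, 1); historically common in recurrent units and gates.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes values into (-1, 1); the usual choice for the hidden-state update.
    return np.tanh(x)

def relu(x):
    # Passes positive values through and zeroes out negatives.
    return np.maximum(0.0, x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```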

How does an RNN update its weights?

The weights of an RNN are learned through a process called backpropagation. Backpropagation is an algorithm that calculates the gradient of the loss function with respect to the weights of the network. The gradient is then used to update the weights in a way that minimizes the loss function. Backpropagation is a technique for training neural networks that is based on the chain rule of calculus. The chain rule states that the derivative of a composite function is the product of the derivatives of the individual functions. In the case of a neural network, the composite function is the loss function, and the individual functions are the activation functions and the weights.

To calculate the gradient of the loss function with respect to the weights, we can use the chain rule to break down the loss function into a product of terms. Each term in the product is the derivative of an activation function or a weight. The gradient of the loss function is then the sum of all of these terms. Once we have calculated the gradient of the loss function, we can use it to update the weights. The weights are updated using the gradient descent algorithm. The gradient descent algorithm is an iterative algorithm that updates the weights in a way that minimizes the loss function.

The backpropagation algorithm is repeated for each training example. As the network is trained, the weights are updated so that the loss function is minimized. For an RNN, the gradients flow back through all of the unrolled time stamps, which is why this is often called backpropagation through time.
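
To see these pieces working together, here is a minimal training-loop sketch using PyTorch's built-in nn.RNN. The data, layer sizes, and learning rate are arbitrary placeholders rather than a real use case (a proper use case is planned for the next post).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup: batch of 8 sequences, 10 time stamps, 4 features each,
# predicting a single value from the final hidden state.
rnn = nn.RNN(input_size=4, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)
loss_fn = nn.MSELoss()

x = torch.randn(8, 10, 4)   # placeholder inputs
y = torch.randn(8, 1)       # placeholder targets

for step in range(100):
    optimizer.zero_grad()
    _, h_last = rnn(x)               # h_last: final hidden state, shape (1, 8, 16)
    pred = head(h_last.squeeze(0))   # prediction made from the final hidden state
    loss = loss_fn(pred, y)
    loss.backward()                  # backpropagation (through time) via the chain rule
    optimizer.step()                 # gradient descent update of the weights

print(loss.item())
```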

In the last stage of an RNN network, the hidden state of the final cell is used to make a prediction or decision about the next step in the sequence. The hidden state is a representation of the entire sequence, and it is used to capture long-range dependencies between the different steps in the sequence.

Visualization of RNN [Credits: Simplilearn]

So at each time stamp, the output is passed on to the next step of the network, and we get the overall combined result at the last step.

Now we know why we don't use ANNs or CNNs for sequence modeling, and we have some insight into how an RNN works. In the next blog of this series, we will see why people reduced their use of RNNs and what the problems with them are. We will also explore RNN code with a use case. Until then, happy learning!

If you have found this article insightful

It is a proven fact that “generosity makes you a happier person”; therefore, give claps to the article if you liked it. If you found this article insightful, follow me on LinkedIn and Medium. You can also subscribe to get notified when I publish articles. Let’s create a community! Thanks for your support!

Also, Medium doesn't pay me anything for writing, so if you want to support me, you can click here to buy me a coffee.


Signing off,

Chinmay
