Mastering Sequential Data: A Deep Dive into RNNs, LSTMs, and GRUs

Ebad Sayed
12 min read · Jun 24, 2024


Image source: https://arxiv.org/pdf/1808.10511

Have you ever wondered how voice assistants understand your commands, or how chatbots engage in natural conversations? The answer lies in the world of Recurrent Neural Networks (RNNs), a powerful class of artificial intelligence models. In this article, we explore the fascinating realm of RNNs, diving into their architecture and functionality. We also delve into two popular variants, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), which have revolutionized sequential data processing.

RNNs are highly effective for processing sequential data, making them a cornerstone of Natural Language Processing (NLP). In NLP, methods like Bag of Words (BoW), TF-IDF, and Word2Vec are commonly used to convert text into vectors. However, these methods often fail to capture the sequence information that is crucial for understanding language context.
The architecture of RNNs allows them to maintain a memory of previous inputs, which is crucial for tasks where context matters, such as predicting the next word in a sentence. This information about previous inputs is kept in the network's hidden state (memory state). Unlike traditional neural networks, which treat all inputs and outputs as independent entities, RNNs use the same set of parameters across all inputs and hidden layers, reducing the complexity of the model. This makes RNNs particularly suitable for tasks where sequence information is vital.

How Does an RNN Differ from a Feedforward NN?

Image source: https://www.analyticsvidhya.com/blog/2020/02/cnn-vs-rnn-vs-mlp-analyzing-3-types-of-neural-networks-in-deep-learning/

Artificial neural networks without looping connections are known as feed-forward neural networks. They are also called multi-layer neural networks, and information flows in only one direction: from the input layer to the output layer, passing through any hidden layers. They are suitable for tasks like image classification, where inputs and outputs are independent. However, their inability to remember previous inputs makes them less effective for analyzing sequential data.

Recurrent Neuron

The core processing element in an RNN is known as a Recurrent Unit, often referred to as a “Recurrent Neuron.” This unit uniquely maintains a Hidden State (Memory State), enabling the network to capture sequential dependencies by remembering previous inputs during processing. Enhanced versions like Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) improve the RNN’s capability to manage long-term dependencies effectively.

Image source: https://camrongodbout.medium.com/recurrent-neural-networks-for-beginners-7aca4e933b82

There are four types of RNNs based on the number of inputs and outputs in the network.

Image source: https://www.datacamp.com/tutorial/tutorial-for-recurrent-neural-network

1. One to One :- This type of RNN behaves like a simple neural network and is also known as a Vanilla Neural Network. There is only one input and one output.

2. One to Many :- In this type of RNN, there is one input and many outputs associated with it. One of the most common examples is image captioning, where, given an image, we predict a sentence of multiple words.

3. Many to One :- In this type of network, many inputs are fed to the network at several time steps, and only one output is generated. This type of network is used in problems like sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.

4. Many to Many :- In this type of neural network, there are multiple inputs and multiple outputs. One example of this setup is language translation, where we provide multiple words from one language as input and predict multiple words from the second language as output.

RNN Architecture

RNNs share the same input and output architecture as other deep neural networks. However, they differ in the way information flows from input to output. Unlike deep neural networks, where each dense layer has its own distinct weight matrices, RNNs reuse the same weight matrices across the whole network. Since the inputs are words, we first have to convert them into numbers. To do this, we build a vocabulary that maps every word we are going to use to an index. Suppose the vocabulary has size (n_x, 1), where n_x is the total number of words in the vocabulary. We then convert each word of every sentence into a one-hot vector. For the sentence "The quick brown fox jumps over the lazy dog.", there are 9 words, so 9 one-hot vectors are formed. If the index of the first word "The" is, say, 18 in the vocabulary, then the 18th position of its one-hot vector will be 1 and the remaining (n_x - 1) positions will be 0 (a small sketch of this encoding follows the figure below).

Image by Author
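
As a quick illustration of this one-hot encoding step, here is a minimal NumPy sketch; the vocabulary below is a toy one built from the example sentence itself, whereas in practice it would contain all n_x words of the corpus:

```python
import numpy as np

# Toy vocabulary built from the example sentence (illustrative only;
# a real vocabulary would contain all n_x words of the corpus).
words = "The quick brown fox jumps over the lazy dog".lower().split()
vocab = {word: idx for idx, word in enumerate(sorted(set(words)))}
n_x = len(vocab)

def one_hot(word, vocab):
    """Return an (n_x, 1) column vector with a 1 at the word's index and 0 everywhere else."""
    vec = np.zeros((len(vocab), 1))
    vec[vocab[word.lower()]] = 1.0
    return vec

x1 = one_hot("The", vocab)  # one-hot vector for the first word of the sentence
```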

An RNN cell takes two inputs: one is the activation from the previous time step, and the other is the input X, the one-hot vector of a word. It also produces an output O. So at each time step, the RNN cell produces both an activation and an output.

https://miro.medium.com/v2/resize:fit:1074/1*Vt7tkMDupqcYk0pgl2-uig.png

In the above diagram, X are the inputs (one one-hot vector for each word in the sequence), which are passed into the hidden state with some weights. The hidden state then calculates the corresponding output O and activation, and the activation is passed on to the next neuron.
NOTE :- For many-to-one applications, we do not produce an output at every neuron; only the last neuron produces one.
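
To make the flow concrete, here is a minimal NumPy sketch of a single RNN step and of unrolling it over a sentence, assuming the standard vanilla-RNN formulation aₜ = tanh(W_aa·aₜ₋₁ + W_ax·xₜ + b_a) and oₜ = softmax(W_oa·aₜ + b_o); the weight names are my own labels rather than anything taken from the figures:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def rnn_cell_forward(x_t, a_prev, W_aa, W_ax, W_oa, b_a, b_o):
    """One RNN time step: combine the previous activation with the current input,
    then produce the new activation a_t and the output o_t."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)  # new activation (hidden state)
    o_t = softmax(W_oa @ a_t + b_o)                  # output at this time step
    return a_t, o_t

# Unrolling over a sequence: note that the SAME weight matrices are reused at every
# step, which is the parameter sharing described above. For a many-to-one task,
# only the output of the final step would be used.
def rnn_forward(xs, a0, params):
    W_aa, W_ax, W_oa, b_a, b_o = params
    a, outputs = a0, []
    for x_t in xs:                                   # xs: list of (n_x, 1) one-hot vectors
        a, o = rnn_cell_forward(x_t, a, W_aa, W_ax, W_oa, b_a, b_o)
        outputs.append(o)
    return a, outputs
```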

Backpropagation

Image by Author

Consider a many-to-one RNN with three neurons. Here we use categorical cross-entropy as our loss function. O is the predicted output and y is the actual value, or label. Since a3 depends in turn on a2 and a1, we need to propagate the calculation further back.

Image by Author

By applying the chain rule, we can obtain the derivatives of the loss function with respect to the model parameters.

Image by Author

This is the generalized formula for the gradients of the loss function with respect to the model parameters.
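
Since the derivation itself lives in the figures above, here is a hedged sketch of what this generalized formula typically looks like for a many-to-one RNN whose loss L is computed at the final time step T; the notation is mine and may differ slightly from the figures:

```latex
\frac{\partial L}{\partial W}
  = \sum_{k=1}^{T}
    \frac{\partial L}{\partial o_T}\,
    \frac{\partial o_T}{\partial a_T}
    \left( \prod_{j=k+1}^{T} \frac{\partial a_j}{\partial a_{j-1}} \right)
    \frac{\partial a_k}{\partial W}
```

The long product of ∂aⱼ/∂aⱼ₋₁ terms is exactly what shrinks toward zero or blows up as the sequence gets longer, which leads to the issues discussed next.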

Issues of Standard RNNs

The problem with RNNs is that as the activation is passed from one hidden state to the next, it is updated so many times that the information stored in it is lost. So if we pass a very long sentence, the network loses the information of the earlier words. In other words, the memory it has for storing information is short.

1. Vanishing Gradient :- During backpropagation, when we calculate the gradient from the last time step back to the first, we multiply many derivative terms whose values lie in the range (0, 1); because of this, the final product becomes very, very small. So the contribution of the earlier words becomes negligible.

2. Exploding Gradient :- The exploding gradient problem arises during training when the gradient grows exponentially instead of decaying. It is caused by large error gradients that accumulate, resulting in very large updates to the neural network's model weights (a common mitigation, gradient clipping, is sketched below).
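
A common mitigation for exploding gradients is gradient clipping, which rescales the gradients whenever their overall norm exceeds a threshold. A minimal NumPy sketch (the threshold of 5.0 is an arbitrary illustrative choice):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm does not exceed max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        grads = [g * scale for g in grads]
    return grads
```

The vanishing gradient problem is harder to patch this way, which is the main motivation for the gated architectures discussed below.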

Advantages of RNNs

1. An RNN remembers information through time. This ability to take previous inputs into account is precisely what makes it useful for time-series prediction.
2. RNNs are even used with convolutional layers to extend the effective pixel neighborhood.

Disadvantages of RNNs

1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences when tanh or ReLU is used as the activation function.

Variation Of RNN

To overcome problems like vanishing and exploding gradients, several advanced versions of RNNs have been developed. Some of these are:

Bidirectional Neural Network (BiNN)

A BiNN is a variation of the recurrent neural network in which the input information flows in both directions, and the outputs of the two directions are combined to produce the final output. BiNNs are useful in situations where the context of the input is important, such as NLP tasks and time-series analysis problems.

Long Short-Term Memory (LSTM)

LSTM works on a read-write-forget principle: given the input information, the network reads and writes the most useful information from the data and forgets the information that is not important for predicting the output. To do this, three new gates are introduced into the RNN. In this way, only the selected information is passed through the network.

Gated Recurrent Unit (GRU)

GRUs are a variation of the LSTM architecture, designed to be simpler and more computationally efficient. They combine the forget and input gates of the LSTM into a single “update gate” and merge the cell state and hidden state. GRUs are often used as a more efficient alternative to LSTMs for certain tasks.

Difference between RNN and Simple NN

Image by Author

Long Short-Term Memory (LSTM)

https://upload.wikimedia.org/wikipedia/commons/thumb/9/93/LSTM_Cell.svg/1200px-LSTM_Cell.svg.png

In an LSTM, there are two states instead of the single hidden state of an RNN. The first state (c) is the memory state, also known as the long-term state, and the second state (h) is the hidden state, also known as the short-term state. The memory state retains information for future use, acting as a long-term memory. The hidden state functions similarly to the hidden state in RNNs, maintaining short-term information for processing.

The main reason an LSTM is able to remember previous information is that it uses the concept of gates. Gates control the flow of information in the network. There are three types of gates: the forget gate, the input gate, and the output gate.

Image by Author

The forget gate (fₜ) is a vector of values between zero and one: a value close to one indicates that the corresponding information will be retained, while a value close to zero means it will be discarded. Next, we have the input gate (iₜ), which is multiplied with the candidate value (Cₜ); the candidate is computed much like the activation (aₜ) equation from the RNN. Because the gates come from a sigmoid, an output close to zero indicates that the information should be forgotten, while a value close to one signifies that it should be retained, which is exactly how they act as gates.

How do these gates determine which information to retain and which to forget? This is governed by weight matrices. During training, these weights are adjusted in such a way that the model learns to differentiate between useful and irrelevant information. This understanding is developed by analyzing thousands to millions of data points.

Image by Author

The hidden state is computed by multiplying the output gate (oₜ) with the tanh of the memory state (Cₜ). The hidden state is then provided as the output used to make predictions.
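
Putting the three gates together, here is a minimal NumPy sketch of one LSTM step under the standard formulation; the weight names and the concatenation of hₜ₋₁ with xₜ are my own conventions and may differ from the figures above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step: update the long-term memory state c_t and the short-term hidden state h_t."""
    z = np.vstack((h_prev, x_t))        # concatenate previous hidden state and current input
    f_t = sigmoid(W_f @ z + b_f)        # forget gate: how much of the old memory to keep
    i_t = sigmoid(W_i @ z + b_i)        # input gate: how much of the candidate to write
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate memory content
    c_t = f_t * c_prev + i_t * c_tilde  # new memory (long-term) state
    o_t = sigmoid(W_o @ z + b_o)        # output gate: how much of the memory to expose
    h_t = o_t * np.tanh(c_t)            # new hidden (short-term) state, used for predictions
    return h_t, c_t
```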

Disadvantages of LSTM

Complexity: LSTMs are more complex than traditional RNNs due to the additional gates (forget, input, and output gates) and states (memory state and hidden state). This complexity can make them harder to understand and implement.

Computationally Intensive: The added complexity of LSTMs means that they require more computational resources, which can lead to longer training times and increased memory usage compared to simpler models.

Difficulty in Training: Despite their design to mitigate the vanishing gradient problem, LSTMs can still suffer from this issue, especially when dealing with very long sequences. Additionally, they can be prone to overfitting if not properly regularized.

Parameter Tuning: LSTMs have many hyperparameters (e.g., number of layers, number of units per layer, learning rate) that need to be carefully tuned. This process can be time-consuming and requires expertise.

Scalability: Scaling LSTMs to very large datasets or extremely long sequences can be challenging, both in terms of computational resources and the time required for training.

Gated Recurrent Unit (GRU)

https://www.researchgate.net/publication/334385520/figure/fig1/AS:779310663229447@1562813549841/Structure-of-a-GRU-cell.ppm
Image by Author

GRUs have a much simpler architecture than LSTMs. In a GRU there are only two gates and a single hidden state, so the number of parameters is smaller and the training time is reduced. A GRU can carry both long-term and short-term context within just one state.

The output is produced by combining the hidden state of the previous time step with the candidate state, and this combination is controlled by the update gate. The candidate state, in turn, is produced by combining the hidden state of the previous time step with the input of the current time step, and that combination is controlled by the reset gate.

The hidden state (hₜ) is a vector that stores some context, hence it is called memory, and each element of this vector stores some aspect of the sequence. Suppose the vector has four values [0.1, 0.05, 0.3, 0.8] representing [power, anger, revenge, love]. At every time step, as we pass new sentences, these values are updated. Suppose at time step 2 we pass the sentence "Both the kings had a war"; this changes the vector to [0.1, 0.2, 0.35, 0.7]. These changes are made using the reset gate and the update gate. So if rₜ = [0.8, 0.2, 0.1, 0.9], that means we want to retain 80% of the information about power from the previous hidden state, and similarly 20% for anger, 10% for revenge, and 90% for love (a minimal sketch of one GRU step follows the figure below).

Image by Author
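
A minimal NumPy sketch of one GRU step under the commonly used formulation, with the update gate zₜ and reset gate rₜ described above; the weight names and the convention for blending the old state with the candidate are assumptions of mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU time step: a single hidden state carries both long- and short-term context."""
    concat = np.vstack((h_prev, x_t))                 # previous hidden state and current input
    z_t = sigmoid(W_z @ concat + b_z)                 # update gate: how much to overwrite
    r_t = sigmoid(W_r @ concat + b_r)                 # reset gate: how much past context to reuse
    h_tilde = np.tanh(W_h @ np.vstack((r_t * h_prev, x_t)) + b_h)  # candidate hidden state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde        # blend old state and candidate
    return h_t
```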

Deep RNN

https://d2l.ai/_images/deep-rnn.svg

A Deep RNN is an extension of the basic RNN architecture that involves stacking multiple RNN layers on top of each other. This stacking allows the network to capture more complex patterns and dependencies in the data by increasing its depth, similar to how deep feedforward networks work. Each layer in a Deep RNN passes its hidden state to the next layer at the same time step, enabling the network to learn more abstract features from the sequential data.

Deep RNNs are particularly useful for tasks that require understanding long-term dependencies and intricate patterns in sequential data, such as natural language processing, speech recognition, and time series prediction. However, they are also more challenging to train due to issues like vanishing and exploding gradients.
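
As a concrete sketch, stacking recurrent layers is a one-line change in most frameworks; in PyTorch, for example, the num_layers argument controls the depth (the sizes below are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

# A stacked ("deep") RNN: each layer's hidden states are fed as inputs
# to the layer above at every time step.
deep_rnn = nn.RNN(input_size=50, hidden_size=128, num_layers=3, batch_first=True)

x = torch.randn(8, 20, 50)   # batch of 8 sequences, 20 time steps, 50 features per step
output, h_n = deep_rnn(x)    # output: (8, 20, 128) from the top layer; h_n: (3, 8, 128)
```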

Bidirectional RNN

https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ1EjjcmJ2X9qSqtcJamBBMKc8UV76Tro62mw&s

Bidirectional Recurrent Neural Networks (BiRNN) are a type of RNN that processes data in both forward and backward directions. In a standard RNN, the data is processed only in one direction (forward) from the start to the end of the sequence. However, in many tasks, having context from both past and future data can be highly beneficial.

Standard RNNs only consider past context when making predictions, which might not be sufficient for understanding the current state. BiRNNs, by contrast, process the data in both directions, so they can utilize the context from both past and future inputs, leading to a better understanding of the sequence as a whole.

In tasks like language translation, speech recognition, and sentiment analysis, the context from future words can significantly influence the understanding of the current word. For example, the meaning of a word in a sentence can depend on the words that come after it, not just the ones before. BiRNNs achieve better performance on these tasks by leveraging the additional context, leading to more accurate and robust predictions.

Standard RNNs might struggle with ambiguities that can only be resolved by looking at both preceding and succeeding data points. BiRNNs provide a mechanism to handle such ambiguities by considering the entire sequence.

BiRNN Architecture

Image by Author

A BiRNN is made up of two RNNs: a forward layer that processes the sequence from start to end, and a backward layer that processes the sequence from end to start. At each time step, the outputs of the forward and backward RNNs are concatenated or summed to form the final output. This combined output captures information from both past and future contexts.
The BiRNN processes the input sequence one time step at a time, considering both past and future context at each step.

Example: Consider a sentence: “The quick brown fox jumps over the lazy dog.”
The forward RNN processes the sequence as: “The” → “quick” → “brown” → “fox” → “jumps” → “over” → “the” → “lazy” → “dog.”
The backward RNN processes the sequence as: “dog.” → “lazy” → “the” → “over” → “jumps” → “fox” → “brown” → “quick” → “The”
At each time step, the outputs from both directions are combined to form the final output, capturing the context from both past and future.
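
In practice, a bidirectional wrapper is again usually a single flag; here is a PyTorch sketch with illustrative sizes, where bidirectional=True runs a forward and a backward pass and concatenates their states:

```python
import torch
import torch.nn as nn

# A bidirectional RNN: one pass reads the sequence left-to-right, the other right-to-left,
# and their hidden states are concatenated at every time step.
birnn = nn.RNN(input_size=50, hidden_size=128, batch_first=True, bidirectional=True)

x = torch.randn(8, 20, 50)   # batch of 8 sequences, 20 time steps, 50 features per step
output, h_n = birnn(x)       # output: (8, 20, 256), forward and backward states concatenated
```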

Summary

RNNs are foundational for processing sequential data but struggle with capturing long-term dependencies. BiRNNs address this by processing sequences in both directions, capturing information from both past and future. LSTMs and GRUs are advanced RNN architectures that improve the ability to capture long-term dependencies through sophisticated gating mechanisms. LSTMs are more complex but can capture more nuanced dependencies, while GRUs are simpler and more computationally efficient.

Next Article :- Building RNN, LSTM, GRU from Scratch


Ebad Sayed

I am currently a final year undergraduate at IIT Dhanbad, looking to help out aspiring AI/ML enthusiasts with easy AI/ML guides.