Step 6: Understanding Recurrent Neural Networks

Gourav Didwania
𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨
11 min read · Sep 6, 2023

In the world of Natural Language Processing (NLP), Recurrent Neural Networks (RNNs) have emerged as a powerful tool for processing and understanding sequential data. Whether it’s language translation, sentiment analysis, or speech recognition, RNNs have proven to be highly effective in capturing temporal dependencies.

Introduction

Recurrent Neural Networks (RNNs) are a type of neural network designed to process sequential data, such as time series, text, or speech. One of the key features of RNNs that make them different from other neural networks is that they have “memory” in the form of hidden states, which allow the network to maintain information from previous time steps and use it to influence its predictions or decisions at later time steps.

The architecture of an RNN is structured as a sequence of interconnected neural network “cells” that form a linear chain, where the output of one cell is passed as input to the next cell. Each cell is responsible for taking both the current input data and the hidden state carrying some information learned from the previous time step, producing an output and a new hidden state. This hidden state is represented as a fixed-size vector, and it is the key to the network’s ability to maintain information from previous time steps.

Image by Motaz Alfarraj, showing the difference between a standard feedforward network and an RNN.

Understanding a Recurrent Neural Network

Now, let us delve deeper into the RNN architecture. Take a look at the image below:

An RNN generally takes a 3-dimensional input of shape (batch size, number of time steps, dimensions), where the input at each step can be univariate or multivariate. There can be many recurrent layers, as shown in the image above. Finally, a dense layer produces the prediction.
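To make these shapes concrete, here is a minimal sketch of such a stack in Keras. The framework choice, layer sizes, and shapes are illustrative assumptions, not something fixed by the architecture itself:

```python
import numpy as np
import tensorflow as tf

# Input shape: (batch size, number of time steps, dimensions).
# time steps = None lets the model accept sequences of any length;
# dimensions = 1 means a univariate series.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 1)),
    tf.keras.layers.SimpleRNN(3, return_sequences=True),  # first recurrent layer
    tf.keras.layers.SimpleRNN(3),                         # second recurrent layer
    tf.keras.layers.Dense(1),                             # dense layer for prediction
])

# A batch of 4 sequences, 29 time steps each, 1 feature per step.
x = np.random.rand(4, 29, 1).astype("float32")
print(model(x).shape)  # (4, 1): one prediction per sequence
```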

Unfolding a Recurrent Neural Network

Here we’d try to visualize RNNs in terms of a feedforward network. A recurrent neural network can be thought of as multiple copies of a feedforward network, each passing a message to a successor.

In the above image,

  • The time steps are denoted as X0, X1, …, X29, each representing a different point in time.
  • The blue-colored boxes represent inputs at different time steps.
  • Each “MemCell” represents the hidden layer into which the data is fed; the boxes inside it are the units of the hidden layer (3×3 in the case above).
  • Each instance of the “MemCell” represents the state of this hidden layer at a particular time step. So, Y0 = H0​, Y1 ​= H1, and so on. This means that the hidden layer’s state at each time step is preserved, carrying information over time.
  • To make predictions or produce outputs at a given time step, the hidden layer’s state must be processed further. This is typically achieved by passing the hidden layer’s state through a dense layer. We’ll dive into this in more detail in the upcoming section.

In the above image, we can observe two recurrent layers followed by a dense layer. To simplify matters, let’s focus on a word prediction problem. We have data for the first 29 time steps, i.e., the first 29 words in a sequence. Our objective is to predict the word at the 30th time step (X30).

To accomplish this, we need to pass the hidden layer’s state from the 29th time step (H29​) through a dense layer to generate the output at the 30th time step (X30​).

However, if the goal is to predict the output at every time step within the sequence, we must adapt the architecture as follows, where Yi​ represents the output at each time step:

This modified architecture enables us to generate predictions at each time step (Yi​) throughout the sequence.
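In Keras terms (continuing the illustrative sketch from above), the difference between predicting only at the final step and predicting at every step comes down to the return_sequences flag; all sizes here are illustrative:

```python
import tensorflow as tf

vocab_size = 3  # illustrative vocabulary size

# Sequence-to-one: only the final hidden state (H29) reaches the dense layer.
seq_to_one = tf.keras.Sequential([
    tf.keras.Input(shape=(None, vocab_size)),
    tf.keras.layers.SimpleRNN(3),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

# Sequence-to-sequence: return_sequences=True exposes every hidden state Hi,
# and the dense layer is applied at each time step to produce each Yi.
seq_to_seq = tf.keras.Sequential([
    tf.keras.Input(shape=(None, vocab_size)),
    tf.keras.layers.SimpleRNN(3, return_sequences=True),
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(vocab_size, activation="softmax")),
])
```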

Now, the fundamental question arises, “How does the RNN architecture actually make predictions and learn from data?” We’ll break this down in the next section.

Forward Propagation

Let’s take a closer look at how forward propagation works in an RNN architecture. We’ll examine the architecture and introduce key formulas that help us understand how the RNN processes information. Like other neural networks, it learns through weight updates.

In an RNN, calculating the values for the current hidden layer relies on both the current input and the previous hidden layer. To better grasp this concept, we’ll explore essential formulas that reveal the inner workings of the RNN architecture.

We’ll consider tanh as our activation function here. The current hidden state combines the previous hidden state with the current input:

ht = tanh(Wh ∗ ht−1 + Wx ∗ Xt + bh)

The formula to compute the output depends on the current hidden state. It is given as follows:

yt = Wy ∗ ht + by
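These two formulas translate almost directly into code. Here is a minimal NumPy sketch of a single forward step, with variable names mirroring the formulas above:

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, Wy, bh, by):
    # New hidden state: combines previous hidden state and current input.
    h_t = np.tanh(Wh @ h_prev + Wx @ x_t + bh)
    # Output at this time step (raw scores, before any softmax).
    y_t = Wy @ h_t + by
    return h_t, y_t
```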

Using these formulas, the network undergoes the process of fine-tuning and training. Let’s explore how it gets trained:

Let’s consider an example where we have a vocabulary consisting of the characters ‘h’, ‘e’, and ‘l’, and we want the network to predict the next letter after “he”. To do this, we first need to encode the input using one-hot encoding. Here’s how we can approach this:

First, we initialize the input weight matrix Wx.

Next, compute Wx ∗ Xt, where Xt is the current input (in our case, the one-hot vector for ‘h’).

Then, let’s compute Wh ∗ ht−1 and add a bias bh to it. For simplicity, we’ll assume Wh and the bias are scalars (1×1 matrices). Since there is no hidden state before the first input, Wh ∗ ht−1 is all zeros, so adding the bias gives just bh.

Now, let’s calculate the current hidden state using the hidden state formula we saw above.

In the next step, the ‘e’ of “hell” is supplied to the network. ht now becomes ht−1, while the one-hot encoding of ‘e’ becomes Xt. Let us calculate the current state ht for ‘e’. Wh ∗ ht−1 + bh will be as follows:

Wx ∗ Xt will be as follows:

Now, the hidden state ht of ‘e’ is as follows:

Now, let us try to predict the next letter after ‘e’ using the output formula yt given above. We’re ignoring the bias by for now.

Now, let’s send yt through a softmax layer to get a probability distribution over the next letter.

It’s apparent from the example that the network predicted the next letter to be ‘h’ with a probability of approximately 0.4197, which is incorrect; the true next letter is ‘l’. In cases like this, the network requires more extensive training and fine-tuning of its weights and biases.
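Here is the same walkthrough as a runnable NumPy sketch. The weight values below are random placeholders rather than the exact numbers from the figures above, so the printed probabilities will differ, but the mechanics are identical:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['h', 'e', 'l']                            # our toy vocabulary
one_hot = {c: np.eye(len(vocab))[i] for i, c in enumerate(vocab)}

hidden = 3
Wx = rng.normal(size=(hidden, len(vocab)))         # input-to-hidden weights
Wh = rng.normal(size=(hidden, hidden))             # hidden-to-hidden weights
Wy = rng.normal(size=(len(vocab), hidden))         # hidden-to-output weights
bh = np.zeros(hidden)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(hidden)                               # no hidden state before 'h'
for ch in "he":                                    # feed 'h', then 'e'
    h = np.tanh(Wh @ h + Wx @ one_hot[ch] + bh)

probs = softmax(Wy @ h)                            # ignoring the bias by, as above
print(dict(zip(vocab, probs.round(4))))            # next-letter probabilities
```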

The process of improving the network’s performance involves techniques like backpropagation through time, which allows the network to learn and adjust its parameters more effectively. By fine-tuning the weights and biases during training, we can enhance the network’s predictive capabilities and rectify such errors. This is precisely what we’ll explore in the next phase.

Backpropagation through Time


When the network makes a prediction that doesn’t align with the true label (as seen in the example where ‘h’ was predicted instead of ‘l’), it calculates the loss, which quantifies the discrepancy between the prediction and the actual target.

This loss is then propagated backward through the network, a process known as backpropagation. During backpropagation, the network identifies how each weight and bias contributed to the error, and it adjusts these parameters gradually to minimize the loss. The objective is to fine-tune the weights and biases associated with the hidden layers and input so that the loss function reaches its minimum value.

Through this iterative process, the neural network learns to make better predictions over time. The network adapts its internal representations to capture patterns and dependencies in the data more accurately, ultimately improving its predictive performance.

Let’s explore how a single weight value is fine-tuned in an RNN using a simple example, based on a technique you’ve likely already heard of: gradient descent. Imagine we have a parameter θ1 that influences some arbitrary cost function J(θ1). To better understand the process, let’s visualize J(θ1) as a function of θ1 using a simple one-dimensional plot:

We calculate the slope of the cost function at the current value of θ1​. If the slope is positive, it indicates that decreasing the value of θ1​ will result in a decrease in the cost function value and vice versa. Based on the positive slope, we decrease the value of θ1​ by a small amount to find a new value. This process continues iteratively until we reach a point where the slope approaches zero.

Following the described method allows us to find the parameter values that minimize the cost function. This process helps determine the optimal weights and biases, enabling the network to make accurate predictions with minimal loss.
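Here is a minimal sketch of that loop. The cost function J(θ1) = (θ1 − 3)² and the learning rate are arbitrary choices made purely for illustration:

```python
def J(theta):                # an arbitrary 1-D cost function
    return (theta - 3) ** 2

def dJ(theta):               # its slope (derivative) at theta
    return 2 * (theta - 3)

theta = 10.0                 # arbitrary starting value
lr = 0.1                     # learning rate (step size)
for _ in range(50):
    theta -= lr * dJ(theta)  # step against the slope

print(theta, J(theta))       # theta approaches 3, where the slope is zero
```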

Advantages of RNNs

  • Possibility of Processing Input of Any Length: Unlike traditional feedforward neural networks, RNNs can process sequences of different lengths (see the sketch after this list).
  • Model Size Not Increasing with Size of Input: In RNNs, the model’s size (the number of parameters) remains fixed regardless of the length of the input sequence. This is because the same set of weights is reused at each time step. It allows RNNs to handle long sequences efficiently without a proportional increase in model complexity.
  • Computation Takes Into Account Historical Information: RNNs are designed to capture dependencies and relationships in sequential data by maintaining a hidden state that retains information from previous time steps.
  • Weights Are Shared Across Time: This weight sharing encourages the network to learn and generalize patterns over time. It’s particularly advantageous when dealing with sequences of data, where similar patterns may occur at different time steps.
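The first two advantages are easy to verify in code. In this illustrative Keras sketch, a single model with a fixed parameter count handles batches of very different sequence lengths:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 1)),     # None = sequences of any length
    tf.keras.layers.SimpleRNN(3),
    tf.keras.layers.Dense(1),
])

short = np.random.rand(2, 10, 1).astype("float32")   # 10 time steps
long = np.random.rand(2, 200, 1).astype("float32")   # 200 time steps
print(model(short).shape, model(long).shape)         # same model for both
print(model.count_params())  # parameter count independent of sequence length
```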

Issues of Standard RNNs

1. Vanishing Gradient: When calculating gradients for weight updates through backpropagation, the chain of gradients can become very long. If any of these gradients approaches zero, it affects the entire chain, leading to slow learning.

When gradients vanish, it means that the network doesn’t effectively learn long-term dependencies or adjust its parameters to minimize the error.

By the chain rule, the gradient of the loss with respect to a weight is a product of many factors, for example:

∂L/∂w = (∂L/∂y) · (∂y/∂h) · (∂h/∂w)

  • L — loss function
  • w — a weight value
  • h — hidden layer weights
  • y — output

If any one of these factors approaches zero, the entire product approaches zero, and learning stalls very quickly.
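A quick numeric illustration of the effect: the per-step gradient factors of a tanh unit are at most 1 (and usually well below 1), so multiplying many of them drives the product toward zero. The pre-activation values below are arbitrary:

```python
import numpy as np

# tanh'(x) = 1 - tanh(x)**2 is at most 1 and much smaller away from 0.
pre_acts = np.linspace(0.5, 2.0, 30)   # arbitrary pre-activations, one per time step
factors = 1 - np.tanh(pre_acts) ** 2   # per-time-step gradient factors
print(np.prod(factors))                # product over 30 steps: vanishingly small
```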


2. Exploding Gradient: The exploding gradient problem typically occurs when the gradients in a deep neural network grow exponentially as they are propagated backward through the layers. This can happen when weight values are initialized too large, activation functions are not well-behaved, or the network architecture has very deep or recurrent structures.

When gradients explode, they can become so large that they cause weight updates to be excessively large as well. This leads to unstable training, divergence, and difficulty in finding a good solution to the optimization problem.

Ways to Resolve Above Issues

  1. Weight Initialization: Techniques like Xavier/Glorot initialization or He initialization are designed to set initial weights in a way that helps stabilize training by avoiding extreme weight values that can lead to gradient issues.
  2. Activation Functions: Rectified Linear Units (ReLU) and its variants like Leaky ReLU are popular choices because they are less prone to vanishing gradients compared to sigmoid or tanh activations.
  3. Batch Normalization: Batch normalization normalizes activations within each mini-batch, reducing the risk of exploding gradients during training and also helping with the vanishing gradient problem by stabilizing activations.
  4. Gradient Clipping: Apply gradient clipping to limit the magnitude of gradients during backpropagation. This technique can prevent excessively large gradient updates from destabilizing training (a sketch follows this list).
  5. Learning Rate Scheduling: Use learning rate schedules that adaptively adjust the learning rate during training. Techniques like learning rate decay or cyclical learning rates can help the network converge more efficiently while avoiding gradient issues.
  6. Advanced Architectures: Consider using specialized architectures designed to address these problems, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. These architectures are designed with mechanisms to mitigate both vanishing and exploding gradients, making them suitable for sequential data.
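As an illustration of gradient clipping (point 4), here is a framework-agnostic NumPy sketch of clipping by global norm. The threshold of 1.0 is an arbitrary choice; in practice, libraries provide built-in equivalents such as PyTorch's clip_grad_norm_ or TensorFlow's clip_by_global_norm:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale all gradients together so their joint L2 norm is at most max_norm.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=1.0)
print([g.tolist() for g in clipped])               # rescaled to global norm 1
```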

Summary

Recurrent Neural Networks (RNNs) are a valuable tool in tasks involving sequential data. They excel at maintaining a network state through hidden layers, which is crucial for understanding and predicting sequences.

At each time step, the network combines the hidden layer from the previous step with the current input to compute the hidden layer’s state at the current step. Predictions are made by passing the final hidden layer through a dense layer, often using a Softmax activation.

Challenges in training RNNs include the Exploding Gradient and Vanishing Gradient problems. Gradient Clipping is a solution to manage Exploding Gradients. To enhance RNNs’ ability to capture and remember important information over long sequences, modified versions like LSTM and GRU have been developed, which address both gradient issues and are better equipped to handle long-term dependencies in data. We’ll be covering LSTM and GRU in the coming parts of the series.

We’ll also be having a small project on RNNs so that we can learn the code implementation and see how they perform better on sequential data where other neural networks fall short. Till then, Keep Learning and Keep Growing!


