Recurrent Neural Networks in Machine Learning

Prashant Gupta
10 min read · Dec 26, 2023

Most types of data and information, such as text, speech, and time series data, can be represented in a sequential structure. Given the wide range of use cases and applications, researchers have long been interested in creating AI models that can understand sequential information. Recurrent Neural Networks (RNNs) are a class of neural networks designed to process sequential data by capturing dependencies and patterns over time. Their foundation and history can be traced back to early developments in neural networks and the desire to model and process sequential information effectively.

Natural language is also sequential information, since it is composed of a sequence of words or characters arranged in a particular order to convey meaning and express thoughts, ideas, or instructions. RNNs are therefore widely applicable in the field of Natural Language Processing and paved the way for today’s advanced NLP models. In this article, we’ll explore the internal details of RNNs and set the stage for more advanced neural networks.

Memory in the Neural Network

To understand and process sequential data, a neural network has to be able to maintain memory and capture dependencies over time, which is crucial for tasks such as language modeling, speech recognition, and time series analysis. A traditional neural network contains an input layer, multiple hidden layers, and an output layer. This does not allow it to maintain context over time, since it is only concerned with the current input, passing it through the hidden-layer neurons to generate the desired output for classification or prediction.

RNNs add the ability to maintain context in a neural network by replacing the hidden layer with a recurrent connection layer that maintains an internal state, or memory, of past inputs (also known as the hidden state). The output is then generated from this internal state together with the new input. This allows the RNN to incorporate the memory of past inputs into the new output, and hence to maintain context over time.

A sequential input is often represented as a sequence of objects at different time steps, such that the first object is at time step t1, the second at time step t2, and so on. In the rest of this article, we will often refer to inputs, outputs, and internal states in terms of their time step.

The internal state, often referred to as the hidden state of RNN, is updated at each time step using the input vector and the previous hidden state. This update is performed by applying a set of weights and activation functions to the inputs and the previous hidden state. The hidden state serves as a summary of the information received up to that point and carries context and memory from previous time steps.

A basic RNN is represented as follows

[Figure: Layout of a basic RNN. On the left is the compact RNN architecture; on the right, the same network unrolled across time steps t-1, t, t+1, and so on for a sequential input. x represents the input, h the hidden state, and o the output.]

In the above figure, you can see the three layers of a basic RNN. The main differentiator is the hidden layer, or recurrent connection layer. Let’s take a closer look at what each of these components means.

Weights and Biases

Just like a traditional deep learning network, RNNs have weights and biases that govern the transformations applied to the inputs and hidden states. These weights are learned during the training process using techniques like backpropagation. In the above figure, U, V, and W represent these weights for the input, hidden, and output layers, respectively.

Input Layer

The input layer of an RNN consists of individual neurons or units representing the input features at each time step.

The size of the input layer is determined by the dimensionality or the number of features used to represent the input at each time step. In the context of language modeling, where words are commonly used as inputs, the size of the input layer is often based on word embeddings or one-hot encoded vectors.

One-Hot Encoding: If one-hot encoding is used, each word in the vocabulary is represented as a unique binary vector of size equal to the vocabulary size. Therefore, the size of the input layer would be the same as the vocabulary size.

Word Embeddings: If word embeddings are used, each word is represented as a dense vector of a fixed dimensionality, typically smaller than the vocabulary size. In this case, the size of the input layer would be determined by the dimensionality of the word embeddings.

It’s important to note that the size of the input layer is fixed and remains the same across all time steps in the RNN. Each time step receives an input vector, which is typically a one-hot encoded vector or a word embedding representation, and processes it through the recurrent connections along with the hidden state.
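
As a concrete illustration of these two representations, here is a minimal Python/NumPy sketch. The 5-word vocabulary and the embedding dimension of 4 are made-up values, and a real model would learn the embedding matrix rather than leave it randomly initialized.

import numpy as np

# Hypothetical 5-word vocabulary (one index per word).
vocab = {"I": 0, "love": 1, "cats": 2, "and": 3, "dogs": 4}
vocab_size = len(vocab)

def one_hot(word):
    # One-hot encoding: the input layer size equals the vocabulary size.
    vec = np.zeros(vocab_size)
    vec[vocab[word]] = 1.0
    return vec

# Word embeddings: the input layer size equals the embedding dimension (4 here).
embedding_dim = 4
embedding_matrix = np.random.randn(vocab_size, embedding_dim) * 0.01

def embed(word):
    return embedding_matrix[vocab[word]]

print(one_hot("cats"))       # [0. 0. 1. 0. 0.]
print(embed("cats").shape)   # (4,)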

Recurrent Connection Layer (Internal or Hidden State)

The recurrent connection in an RNN is a fundamental component that enables the network to maintain memory and capture dependencies over sequential data. The recurrent connection is established by connecting the hidden state of the current time step to the hidden state of the previous time step. In other words, the hidden state at time step t-1 serves as an additional input to the network at time step t. This connection forms a loop within the network, creating a feedback mechanism that allows the network to carry information from the past into the present.

Mathematically, the recurrent connection can be represented as follows:

h(t) = f(Wx * x(t) + Wh * h(t-1) + b)

where:

  • h(t) is the hidden state at time step t.
  • x(t) is the input vector at time step t.
  • Wx and Wh are weight matrices that control the transformation of the input and the previous hidden state, respectively.
  • b is the bias vector.
  • f() is the activation function, such as the sigmoid or hyperbolic tangent function, that introduces non-linearity to the network.
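
A minimal sketch of this update in Python/NumPy, assuming toy layer sizes, random weight initialization, and a tanh activation:

import numpy as np

input_size, hidden_size = 5, 3                           # toy sizes for illustration

Wx = np.random.randn(hidden_size, input_size) * 0.01     # input -> hidden weights
Wh = np.random.randn(hidden_size, hidden_size) * 0.01    # hidden -> hidden weights
b = np.zeros(hidden_size)                                # bias vector

def step(x_t, h_prev):
    # h(t) = f(Wx * x(t) + Wh * h(t-1) + b), with f = tanh here
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

h = np.zeros(hidden_size)                    # initial hidden state
x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])      # one-hot input at the first time step
h = step(x, h)                               # hidden state after the first time step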

The size of the hidden state in an RNN is a hyperparameter that is defined during the design of the model. It determines the dimensionality or the number of neurons in the hidden state vector at each time step. The choice of hidden state size is based on factors such as the complexity of the task, the amount of available training data, and the desired capacity of the model.

A larger hidden state size allows the RNN to capture more complex patterns and dependencies in the data but comes at the cost of increased computational resources and potentially higher training requirements. On the other hand, a smaller hidden state size may limit the expressive power of the RNN.

It’s important to strike a balance when choosing the hidden state size, as a size that is too small may lead to underfitting, where the RNN fails to capture important patterns, while a size that is too large may result in overfitting, where the RNN becomes too specialized to the training data and performs poorly on new, unseen examples.

Output Layer

The output size is determined by the design and requirements of the RNN model, and it can vary depending on the specific task at hand.

In the case of language modeling, where the goal is to predict the next word in a sequence given the previous context, the output size is typically equal to the vocabulary size. Each element in the output array represents the probability or likelihood of a word in the vocabulary being the next word in the sequence.
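
A rough sketch of this output step, assuming the hidden state is projected to vocabulary size by an output weight matrix V and passed through a softmax (the sizes below are toy values):

import numpy as np

vocab_size, hidden_size = 5, 3
h = np.random.randn(hidden_size)                       # hidden state at the current time step
V = np.random.randn(vocab_size, hidden_size) * 0.01    # hidden -> output weights
c = np.zeros(vocab_size)                               # output bias

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract the max for numerical stability
    return e / e.sum()

o = softmax(V @ h + c)           # probability of each vocabulary word being next; sums to 1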

A Simple Example in Play

Let’s consider a simple example of training a traditional RNN using a single sentence as input. Suppose we have the sentence “I love cats and dogs.” consisting of 5 words. We’ll assume a basic RNN architecture with a hidden state size of 3 and a vocabulary built from the sentence’s 5 unique words.

1. Preprocessing

  • Tokenization: The sentence is tokenized into individual words: [“I”, “love”, “cats”, “and”, “dogs”].
  • Vocabulary Creation: A vocabulary is created by assigning a unique index to each word: {“I”: 0, “love”: 1, “cats”: 2, “and”: 3, “dogs”: 4}.

2. Input Representation

  • One-Hot Encoding: Each word in the sentence is represented as a one-hot encoded vector. For example, the input vector for the first time step, representing the word “I,” would be [1, 0, 0, 0, 0] since it corresponds to the first word in the vocabulary.

3. Forward Pass

  • Time Step 1:
    - Input: [1, 0, 0, 0, 0] (representing “I”)
    - Previous Hidden State: [0, 0, 0] (initialized)
    - Calculation:
    — Hidden State: h(t) = f(Wx * x(t) + Wh * h(t-1) + b)
    — For the first time step, h(t-1) is initialized to zeros.
    - Resulting Hidden State: [0.2, 0.4, 0.3] (example values)
    - Output: a probability distribution over the 5 vocabulary words computed from this hidden state, e.g. [0.10, 0.30, 0.20, 0.25, 0.15]
  • Time Step 2 (and subsequent steps):
    - Input: One-hot encoded vector for the corresponding word at each time step.
    - Previous Hidden State: Hidden state output from the previous time step.
    - Calculation:
    — Hidden State: h(t) = f(Wx * x(t) + Wh * h(t-1) + b)
    — Output: Generated at each time step based on the hidden state.
  • Output: The RNN can generate an output at each time step based on the hidden state. The specific output generation depends on the task. In the case of language modeling, the output is a probability distribution over the vocabulary, as illustrated in the sketch below.
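
Putting these steps together, here is a minimal NumPy sketch of the forward pass over the example sentence. The random weight initialization, tanh activation, and softmax output are assumptions for illustration, so the printed probabilities are not meaningful until the network is trained.

import numpy as np

np.random.seed(0)
words = ["I", "love", "cats", "and", "dogs"]
vocab = {w: i for i, w in enumerate(words)}             # {"I": 0, "love": 1, ...}
vocab_size, hidden_size = len(vocab), 3

Wx = np.random.randn(hidden_size, vocab_size) * 0.1     # input -> hidden
Wh = np.random.randn(hidden_size, hidden_size) * 0.1    # hidden -> hidden
V  = np.random.randn(vocab_size, hidden_size) * 0.1     # hidden -> output
b, c = np.zeros(hidden_size), np.zeros(vocab_size)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

h = np.zeros(hidden_size)                  # hidden state initialized to zeros
for t, word in enumerate(words):
    x = np.zeros(vocab_size)               # one-hot encoding of the current word
    x[vocab[word]] = 1.0
    h = np.tanh(Wx @ x + Wh @ h + b)       # hidden state update
    o = softmax(V @ h + c)                 # distribution over the next word
    print(t, word, np.round(o, 2))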

Backpropagation Through Time

Backpropagation in a Recurrent Neural Network (RNN) is an extension of the traditional backpropagation algorithm used in feedforward neural networks. It is known as “Backpropagation Through Time” (BPTT) and is designed to handle the recurrent connections in the RNN architecture.

The BPTT algorithm for RNNs involves unfolding the recurrent connections over time to create a computational graph that resembles a feedforward neural network. This unfolded graph represents the RNN as a series of interconnected layers, where each layer corresponds to a time step.

The basic steps of BPTT in RNNs are as follows:

  1. Forward Pass: The input sequence is processed through the RNN using a forward pass, as described earlier. The hidden states and outputs are computed at each time step.
  2. Compute Loss: The computed outputs are compared to the desired outputs (targets) to calculate the loss. The choice of loss function depends on the specific task and the type of output.
  3. Backward Pass: Starting from the last time step, gradients are calculated with respect to the parameters of the network. The gradients capture the influence of each parameter on the final loss. The gradients are calculated using the chain rule, similar to traditional backpropagation.
  4. Gradient Updates: The gradients are used to update the network’s parameters, such as the weight matrices and biases, in the direction that minimizes the loss. This update step is typically performed using an optimization algorithm like gradient descent or its variants.

The key difference between traditional backpropagation and BPTT lies in the handling of the recurrent connections. BPTT unrolls the network over time, allowing the gradients to flow through the unfolded graph. This way, the gradients can be calculated and propagated back through the recurrent connections, capturing the dependencies and memory of the RNN.
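
The sketch below illustrates BPTT for the simple RNN used in this article: a forward pass that stores the activations at every time step, followed by a backward pass that accumulates gradients from the last time step back to the first. The cross-entropy loss, tanh activation, and toy sizes are assumptions for illustration.

import numpy as np

def bptt(inputs, targets, Wx, Wh, V, b, c, h0):
    # inputs/targets are lists of word indices; returns the loss and all gradients.
    vocab_size, hidden_size = V.shape
    xs, hs, ps, loss = {}, {-1: h0}, {}, 0.0
    # Forward pass: store inputs, hidden states, and output probabilities per time step.
    for t, ix in enumerate(inputs):
        xs[t] = np.zeros(vocab_size); xs[t][ix] = 1.0
        hs[t] = np.tanh(Wx @ xs[t] + Wh @ hs[t - 1] + b)
        z = V @ hs[t] + c
        ps[t] = np.exp(z - z.max()); ps[t] /= ps[t].sum()
        loss += -np.log(ps[t][targets[t]])                # cross-entropy loss
    # Backward pass: gradients flow from the last time step back to the first.
    dWx, dWh, dV = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    dh_next = np.zeros(hidden_size)
    for t in reversed(range(len(inputs))):
        do = ps[t].copy(); do[targets[t]] -= 1.0          # gradient of loss w.r.t. logits
        dV += np.outer(do, hs[t]); dc += do
        dh = V.T @ do + dh_next                           # gradient into the hidden state
        dz = (1.0 - hs[t] ** 2) * dh                      # back through the tanh
        dWx += np.outer(dz, xs[t]); dWh += np.outer(dz, hs[t - 1]); db += dz
        dh_next = Wh.T @ dz                               # carried to time step t-1
    return loss, (dWx, dWh, dV, db, dc)

# Toy usage with the sentence "I love cats and dogs": predict the next word at each step.
vocab_size, hidden_size = 5, 3
Wx = np.random.randn(hidden_size, vocab_size) * 0.1
Wh = np.random.randn(hidden_size, hidden_size) * 0.1
V = np.random.randn(vocab_size, hidden_size) * 0.1
b, c, h0 = np.zeros(hidden_size), np.zeros(vocab_size), np.zeros(hidden_size)
loss, grads = bptt([0, 1, 2, 3], [1, 2, 3, 4], Wx, Wh, V, b, c, h0)

The returned gradients would then be used in the gradient update step, for example Wx -= learning_rate * dWx with plain gradient descent.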

Vanishing and Exploding Gradients

The problems of vanishing and exploding gradients are common challenges in training Recurrent Neural Networks (RNNs) that can have a significant impact on their usage in real-world applications.

  1. Vanishing Gradients: In RNNs, vanishing gradients occur when the gradients calculated during backpropagation diminish exponentially as they propagate backward through time. This means that the gradients become extremely small, leading to slow learning or stagnation of the training process. It happens when the recurrent connections in the network repeatedly multiply small gradient values, causing them to shrink exponentially.
  2. Exploding Gradients: Conversely, exploding gradients occur when the gradients grow exponentially as they propagate backward through time. This results in very large gradient values, which can cause numerical instability during the training process and make the model’s parameters update in large and erratic steps.
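
A toy numeric illustration of both effects: if the recurrent connection effectively multiplies the gradient by roughly the same factor at every step, the product shrinks or grows exponentially with the number of unrolled time steps (the factors 0.5 and 1.5 are made-up values).

# Over 50 unrolled time steps, a per-step factor below 1 drives the gradient
# toward zero, while a factor above 1 makes it blow up.
steps = 50
print(0.5 ** steps)   # ~8.9e-16  -> vanishing gradient
print(1.5 ** steps)   # ~6.4e+08  -> exploding gradient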

Both vanishing and exploding gradients hinder the training of RNNs by making it difficult to effectively update the network’s parameters and converge to an optimal solution. This can have several implications for real-world applications:

  1. Long-Term Dependencies: RNNs are designed to capture long-term dependencies in sequential data. However, vanishing gradients can make it challenging for RNNs to effectively model and capture these dependencies over long sequences, limiting their ability to learn and generalize from distant past information.
  2. Training Stability: Exploding gradients can make the training process unstable and unpredictable, leading to difficulty in converging to an optimal solution. This instability can result in erratic parameter updates and make it harder to train RNNs reliably.
  3. Gradient-Based Optimization: The presence of vanishing or exploding gradients can hinder the effectiveness of gradient-based optimization algorithms, such as gradient descent, which rely on stable and well-scaled gradients for proper weight updates. These issues may require additional techniques like gradient clipping, regularization methods, or more advanced RNN architectures (e.g., LSTM or GRU) to alleviate the problem.
  4. Memory and Context: RNNs are particularly useful for tasks that require capturing sequential information and maintaining memory or context. However, the presence of vanishing gradients can limit their ability to retain and propagate relevant information over long sequences, potentially impacting the model’s understanding and contextual reasoning abilities.

Addressing the problems of vanishing and exploding gradients has been an active area of research. Techniques like gradient clipping, weight initialization strategies, and more advanced RNN architectures with gating mechanisms (e.g., LSTM and GRU) have been developed to mitigate these issues and enable more stable and effective training of RNNs in real-world applications.
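
For example, gradient clipping by global norm can be sketched in a few lines of Python/NumPy; the threshold of 5.0 is an arbitrary illustrative choice. Applying this to the gradients computed by BPTT before the update step keeps a single large gradient from destabilizing training.

import numpy as np

def clip_gradients(grads, max_norm=5.0):
    # Rescale all gradients together if their combined (global) norm exceeds max_norm.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads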

Recurrent Neural Networks (RNNs) have made significant contributions to the development of modern AI models such as the Transformer and GPT (Generative Pre-trained Transformer). RNNs introduced sequential modeling, enabling the processing and generation of data with temporal dependencies, and gated variants such as LSTM and GRU improved their ability to capture long-term dependencies and advanced language modeling. These advancements paved the way for the Transformer, which revolutionized natural language processing with self-attention mechanisms that allow parallel processing and capture global dependencies. The Transformer, in turn, served as the foundation for models like GPT, which leverage large-scale pre-training and fine-tuning to achieve state-of-the-art performance in a variety of language-related tasks.

If you liked this article, be sure to clap below to recommend it and if you have any questions, leave a comment and I will do my best to answer.

To stay up to date with the world of machine learning, follow me. It’s the best way to find out when I write more articles like this.

You can also follow me on Twitter or find me on LinkedIn. I’d love to hear from you.

That’s all folks, Have a nice day :)

