Building a Recurrent Neural Network From Scratch

Long Nguyen
14 min read · Jan 28, 2024


Source: Deeplearningbook.org

I. Overview

In this blog post, we will explore Recurrent Neural Networks (RNNs) and the mathematics behind their forward and backward passes. We will get hands-on experience by building an RNN from scratch in Python using two approaches: first with only NumPy and then again with TensorFlow. By the end, you’ll understand the intricacies of RNN architectures and be able to train models with confidence irrespective of the underlying frameworks.

For those new to neural networks, I have a complementary blog post that walks through implementing a basic neural network architecture from the ground up without libraries, which you can view here.

In this post we will specifically focus on the sequential and recursive nature of RNN models for processing time-based patterns.

II. Recurrent Neural Network (RNN)

Recurrent neural networks are a type of neural network architecture well-suited for processing sequential data such as text, audio, time series, and more. The key aspect that makes RNNs unique is the presence of recursive computational loops that span adjacent time steps. This recurrence enables RNN models to effectively maintain a persistent internal state or ‘memory’ of prior inputs that can inform processing of data points later in the sequence.

Visually, a typical RNN structure consists of a single computational unit with a self-connected hidden state, through which information cycles across timesteps:

Image by Cater N Brown

As data passes through the RNN, the activations from previous time steps flow as inputs to influence the network’s computations on current data. This allows the model to dynamically incorporate temporal context and sequence history, crucial for many seq2seq prediction tasks.

Unfolding RNN

Instead of looking purely at the condensed visualisation of the network, unfolding or unrolling it (with the same parameters replicated across each time step) reveals the recurrent nature of RNN computations, turning it into a more traditional directed acyclic computational graph consisting of multiple time steps. We can then visually see how the input sequence is processed across all time steps, which helps with understanding the forward/backward-through-time calculations used to train the parameters later on.

Parameter sharing

Parameter sharing makes it possible to extend and apply the model to examples of different forms/lengths and generalise across them.

- Source: Chapter 10: Sequence Modeling: Recurrent and Recursive Nets -

Consider training a traditional feedforward neural network vs an RNN to predict the next character on English text sequences. The feedforward net would require separate weights for each input-output mapping at every character position. For two similar sentences — “Today, I went to the dentist” and “I went to the dentist today”, it would learn the contextual relationships completely independently. In contrast, the RNN shares weights temporally, so if it learns rules in one part of a sequence, those can be reapplied later even if the sequence order changes slightly.

Hence, for every new sequence, the feedforward network relearns from scratch even if similar semantic patterns were seen before; it cannot develop a unified grammar model. The RNN, in contrast, embeds language rules as reusable transition functions in the shared weights, so they are retained and applied regardless of input sequence length or order. This means that for sequential data, using an RNN can result in simpler models with fewer parameters to train while achieving better performance than a traditional feedforward network.

III. The Maths

To better understand how an RNN works internally, let's take a look at the maths behind it.

This is a more detailed view of the unfolded network with multiple variables to be used in the formula calculation.

1. Variables

  • W: Recurrent weight matrix — weights connecting hidden states
  • U: Input weight matrix — weights connecting input to hidden state
  • V: Output weight matrix — weights connecting hidden state to output
  • s: Weighted sum of previous hidden state and current input
  • a: Hidden state activation value
  • o: Weighted output of activation value
  • y^: (“y hat”) Predicted value (normalised probabilities)
  • y: True target value
  • L: Loss value
  • t: Time step
  • b: Bias matrix for hidden states
  • c: Bias matrix for output
  • x: Input vector

2. Forward pass

At each time step, we calculate the weighted sum of the current input (x[t]) and the previous hidden state's activation value (a[t-1]), then pass it through the non-linear activation function tanh so that the model can learn non-linear relationships. Here, U and W are the weight matrices that transform the current input and the previous state respectively, and b provides a bias offset.

Once we've got the activation value a[t], the output of this time step is the weighted activation (with weight matrix V) plus bias c. For classification tasks, this output commonly represents the logits (unnormalised, raw predictions). The softmax function is then applied to the logits as a post-processing step to create y-hat, the normalised probability distribution of the output.
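Putting the two steps together, the forward pass at each time step t is:

  s[t] = U·x[t] + W·a[t-1] + b
  a[t] = tanh(s[t])
  o[t] = V·a[t] + c
  y_hat[t] = softmax(o[t])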

When you start at the first time step t=0, there is no previous activation, so a[t-1] should be initialised as an all-zero matrix (this also avoids an index-out-of-range error in the code).

This is an example of the data flow: we're trying to predict the next character given a sequence of characters in alphabetical order.

Inputs are discrete characters, which are converted to one-hot encoded vectors. The hidden layer and output layer perform the calculations given in the previous formulas. Based on the output of the softmax function, we can convert the prediction back to a character. Each time step has its own predicted output; the y-hat value at the last time step is the final prediction of the next character of the input sequence.

One-hot encoding

One-hot encoding is a way to express a discrete element (e.g. a word) numerically. A one-hot vector can also be thought of as a probability distribution with all of its mass on a single value. For our inputs, it's common to use one-hot encoding; however, you could choose other representations as well (pre-trained word embeddings, an embedding layer, integer index representation, etc.).

In the coding example, I'm using the English alphabet as the vocabulary (known size 26), and the RNN is supposed to predict the next character given any sequence within the alphabet, so I'll be using one-hot encoding for simplicity: we know the total number of representable characters, and since our prediction is already a probability distribution (because of the softmax function), it makes sense for the input to be in the same format.

For example, for English letters A-Z with vocab size 26:

  • ‘A’ becomes [1,0,0,0,…0]
  • ‘C’ becomes [0,0,1,0,…0]

Activation function

tanh is a popular choice of activation function for RNNs, so this is what we're going to use in this tutorial. One commonly cited reason, which you can read more about elsewhere, is that tanh has a larger derivative than sigmoid, which produces stronger gradient signals and therefore faster convergence.
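Concretely, the derivative of tanh is

  tanh'(x) = 1 - tanh²(x)

which peaks at 1, whereas the derivative of the sigmoid, sigmoid(x)·(1 - sigmoid(x)), peaks at only 0.25. This tanh derivative will reappear in the backward pass below.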

Loss function

For training the parameters of our RNN model, we need a loss function that measures how well the predicted character probabilities match the true sequence. Cross-entropy loss is a common choice when working with categorical distributions: it measures the difference between two probability distributions (the predicted values and the true target values). Since the network spans multiple time steps, we calculate the loss at each time step, and the overall loss is the sum of all the per-step losses.

Using cross entropy loss function
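With y[t] as the one-hot target and y_hat[t] as the softmax output, the per-step loss and the total loss are:

  L[t] = -Σ y[t] · log(y_hat[t])    (sum over the vocabulary)
  L = Σ L[t]                        (sum over all time steps)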

By comparing the predicted and actual character distributions at each sequence position, cross-entropy loss provides a gradient signal to update network parameters towards outputs that better match the desired targets. Computing losses timestep-wise is needed for proper attribution during backpropagation later. Our goal is to minimize this objective function to increase prediction accuracy over character sequences.

3. Backpropagation Through Time (BPTT)

For the unrolled computational graph, just as the sequence flows through the network step by step in the forward pass, we calculate the gradients in the backward pass step by step as well, but from the last time step back towards the initial time step (in reverse order), hence the name "through time".

Let's zoom into time step t as an example to derive the formulas for the gradients, which will be used to update the shared weights and biases. The idea is that if we want to minimise the loss, we need to know in which direction and by how much to "nudge" each parameter at each time step so that the loss decreases.

Because the parameters do not contribute to the loss directly but through multiple nested functions, finding their derivatives requires extensive use of the chain rule and the sum rule, working from the loss function down to the functions containing the weights and biases.

Starting from the loss function: since L[t] is the cross-entropy between the true target y[t] and the prediction y_hat[t], the derivative of the natural log gives us the derivative of the loss w.r.t. the predicted value. Moving down to the derivative of the loss w.r.t. the weighted output o[t], and using the derivative of the softmax, the two combine into a remarkably simple expression.
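In the notation from the variables section:

  dL[t]/dy_hat[t] = -y[t] / y_hat[t]
  dL[t]/do[t] = y_hat[t] - y[t]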

Gradients of V and c:

With the gradient of the weighted output in mind, we can calculate the gradient of its bias c using the chain rule, and likewise the gradient of the weight matrix V.
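Since o[t] = V·a[t] + c, the chain rule gives:

  dL[t]/dc = y_hat[t] - y[t]
  dL[t]/dV = (y_hat[t] - y[t]) · a[t]^T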

Gradients of U, W and b:

Before the gradients of these parameters are calculated, some intermediate values need to be computed.

Since the activation at step t contributes both to the output at step t and to the activation at the next step t+1, we need to add the gradient coming through the current output to the gradient flowing back from the next time step's loss in order to get the total gradient of the current activation a[t].

For the gradient of the weighted sum s[t] of the current time step, we then use the derivative of tanh, 1 - tanh².

With those two intermediate values in hand, we can find the gradients of W, U, and b.
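Putting the pieces together at time step t (the W^T term is taken as zero at the last time step, since there is no step t+1; ⊙ denotes element-wise multiplication):

  dL/da[t] = V^T · (y_hat[t] - y[t]) + W^T · dL/ds[t+1]
  dL/ds[t] = (1 - a[t]²) ⊙ dL/da[t]
  dL/dW[t] = dL/ds[t] · a[t-1]^T
  dL/dU[t] = dL/ds[t] · x[t]^T
  dL/db[t] = dL/ds[t]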

That's all of the gradients we need to update the weights and biases! Before we do that, notice how the same term dL/ds[t] appears in the gradients of W, U, and b; similarly, dL/da[t] is reused when computing dL/ds[t]. In the code, we'll also be calculating one extra quantity at every time step t and storing it in a variable so that it can be reused at the next step of the backward pass, t-1: the gradient of the current step's loss w.r.t. the previous activation.
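That quantity is exactly the W^T term that appears in dL/da[t] above, seen from the perspective of step t-1:

  dL[t]/da[t-1] = W^T · dL/ds[t]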

Finally, we update all parameters based on the gradients computed at each time step t, scaled by the learning rate.
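With learning rate α (alpha in the code), each parameter is nudged in the direction opposite to its gradient:

  U ← U - α · dL/dU
  W ← W - α · dL/dW
  V ← V - α · dL/dV
  b ← b - α · dL/db
  c ← c - α · dL/dc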

4. Gradient vanishing/exploding

Because the gradient calculation is repeated for every time step, with the same weight matrix multiplied in again at each step, there's a high chance that we run into the notorious vanishing/exploding gradient problem with RNNs. Repeated multiplication by either large or small values can cause the gradients to exponentially increase (explode) or decrease (vanish).

With vanishing gradients, the gradient becomes too small by the time it reaches the earlier time steps, which leads to loss of long-range dependency information and slow convergence. With exploding gradients, the gradient becomes too large, which makes optimisation very unstable, with the loss bouncing back and forth away from the minimum.

We can replicate this when running the code, assuming

  • A high learning rate: 0.5
  • A large hidden state size: 128 units

we can observe the symptom of exploding gradients: the loss grows rapidly and eventually overflows to NaN.

epoch=0
Loss: [138.87545009]
epoch=1
Loss: [1615.33636469]
epoch=2
Loss: [2980.75967274]
epoch=3
Loss: [nan]
epoch=4
Loss: [3193.67601878]
epoch=5
Loss: [nan]
epoch=6
Loss: [nan]
epoch=7
Loss: [nan]
epoch=8
Loss: [nan]
epoch=9
Loss: [nan]

There are a couple of ways to deal with this. Reducing the learning rate might help, but at the cost of really slow convergence. Reducing the size of the hidden state can also work, but then the network may not be able to capture long-term dependencies, which could hinder improvement. Gradient clipping is one of the most popular approaches: it tackles the problem directly and takes very little effort to implement.
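As a rough illustration, norm-based clipping takes only a few lines of NumPy; the threshold value and exactly where it is applied are tuning choices, not something prescribed by the implementation in this post.

```python
import numpy as np

def clip_gradient(grad, threshold=1.0):
    """Scale the gradient down if its L2 norm exceeds the threshold
    (clipping element-wise with np.clip is a simpler alternative)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

# Applied to each parameter gradient just before the update step, e.g.:
# delta_W = clip_gradient(delta_W)
# W -= alpha * delta_W
```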

There are other architectures that could further address this issue more efficiently such as LSTM which we will cover in the future.

IV. Implementing a Vanilla RNN with NumPy

Armed with the mathematical foundations, we now have all the pieces needed to implement our own RNN architecture in code for sequence predictions.

As mentioned, we’ll train the network to model English alphabet characters — feeding it subsequences and predicting probable next letters. Our vocabulary is the 26 letters A-Z, so vectors will be length 26 after one-hot encoding each input character into a vector.

This is what the sample data for training looks like

I created a utility function that converts the list of strings into a list of one-hot encoded vectors.

Each input sequence has 26 characters, and each character (e.g. "A", "B") becomes a list of 26 items, with the item matching its index in the alphabet equal to 1 while the rest are 0. So each input sequence has the shape (26, 26, 1).
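A minimal sketch of such a utility, assuming plain NumPy and a fixed A-Z alphabet (the function names here are illustrative, not necessarily the ones in the original code):

```python
import numpy as np

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def one_hot_encode_sequence(sequence):
    """Convert a string such as "ABC..." into a list of (26, 1) one-hot column vectors."""
    vectors = []
    for char in sequence:
        vec = np.zeros((len(ALPHABET), 1))
        vec[ALPHABET.index(char)] = 1
        vectors.append(vec)
    return vectors

def one_hot_encode_inputs(sequences):
    """Convert a list of strings into a list of arrays of shape (seq_len, 26, 1)."""
    return [np.array(one_hot_encode_sequence(seq)) for seq in sequences]
```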

Input Layer

Let’s start with the InputLayer class.

Variables

  • inputs is the sequential data in the form of NumPy arrays.
  • U is the weight matrix connecting input to hidden layer.
  • delta_U is the gradient calculated during BPTT.

Functions

  • get_input : return the one-hot encoded vector of the character at a given time step
  • weighted_sum : return the result of U · x[t], to be used in the hidden layer's weighted sum s[t]
  • calculate_deltas_per_step : calculate the gradient of U at a given time step. The expression delta_weighted_sum @ self.get_input(time_step).T maps to the formula dL/dU[t] = dL/ds[t] · x[t]^T derived earlier
  • update_weights_and_bias : updating the parameters using the gradient
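A minimal sketch of what this layer could look like, following the variable and function names above; the constructor arguments, shapes, and initialisation scheme are assumptions for illustration rather than the original code:

```python
import numpy as np

class InputLayer:
    """Holds the input sequence and the input-to-hidden weight matrix U."""

    def __init__(self, inputs, hidden_size, vocab_size=26):
        self.inputs = inputs                               # sequence of (vocab_size, 1) one-hot vectors
        self.U = np.random.uniform(-0.1, 0.1, (hidden_size, vocab_size))
        self.delta_U = np.zeros_like(self.U)               # accumulated gradient of U during BPTT

    def get_input(self, time_step):
        # One-hot encoded vector of the character at this time step
        return self.inputs[time_step]

    def weighted_sum(self, time_step):
        # U · x[t], fed into the hidden layer's pre-activation s[t]
        return self.U @ self.get_input(time_step)

    def calculate_deltas_per_step(self, time_step, delta_weighted_sum):
        # dL/dU[t] = dL/ds[t] · x[t]^T, accumulated across time steps
        self.delta_U += delta_weighted_sum @ self.get_input(time_step).T

    def update_weights_and_bias(self, learning_rate):
        self.U -= learning_rate * self.delta_U
        self.delta_U = np.zeros_like(self.U)
```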

Hidden Layer

Variables

  • states : Stores activation of all time steps (internal memory of the network)
  • W : recurrent weight matrix
  • delta_w : gradient of W during BPTT
  • bias : b in the math formulas
  • delta_bias : gradient of b
  • next_delta_activation : stores the gradient flowing from the next time step's loss back to the current activation, i.e. the W^T · dL/ds[t+1] term from the formula derived above

Functions

  • get_hidden_state : return the hidden state value at a given time step. If time step is less than 0, default to all zeros matrix
  • set_hidden_state : updating the state at a time step after forward pass calculation.
  • activate : forward pass calculation
  • calculate_deltas_per_step : compute gradients of W and b
  • update_weights_and_bias : updating the parameters using the gradient
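Again as an illustrative sketch (the exact signatures, shapes, and initialisation are my assumptions), the hidden layer could look like this:

```python
import numpy as np

class HiddenLayer:
    """Maintains the hidden states, the recurrent weight matrix W and the bias b."""

    def __init__(self, hidden_size, seq_len):
        self.states = np.zeros((seq_len, hidden_size, 1))         # a[t] for every time step
        self.W = np.random.uniform(-0.1, 0.1, (hidden_size, hidden_size))
        self.delta_w = np.zeros_like(self.W)
        self.bias = np.zeros((hidden_size, 1))                    # b
        self.delta_bias = np.zeros_like(self.bias)
        self.next_delta_activation = np.zeros((hidden_size, 1))   # W^T · dL/ds[t+1]

    def get_hidden_state(self, time_step):
        if time_step < 0:                                         # a[-1] defaults to all zeros
            return np.zeros_like(self.states[0])
        return self.states[time_step]

    def set_hidden_state(self, time_step, activation):
        self.states[time_step] = activation

    def activate(self, time_step, input_weighted_sum):
        # a[t] = tanh(U·x[t] + W·a[t-1] + b), where U·x[t] comes from the input layer
        s = input_weighted_sum + self.W @ self.get_hidden_state(time_step - 1) + self.bias
        return np.tanh(s)

    def calculate_deltas_per_step(self, time_step, delta_from_output):
        # dL/da[t] = V^T·(y_hat[t] - y[t]) + W^T·dL/ds[t+1]
        delta_activation = delta_from_output + self.next_delta_activation
        # dL/ds[t] = (1 - a[t]^2) ⊙ dL/da[t]   (derivative of tanh)
        a = self.get_hidden_state(time_step)
        delta_weighted_sum = (1 - a ** 2) * delta_activation
        # Accumulate the gradients of W and b
        self.delta_w += delta_weighted_sum @ self.get_hidden_state(time_step - 1).T
        self.delta_bias += delta_weighted_sum
        # Store W^T · dL/ds[t] for the previous time step of the backward pass
        self.next_delta_activation = self.W.T @ delta_weighted_sum
        return delta_weighted_sum

    def update_weights_and_bias(self, learning_rate):
        self.W -= learning_rate * self.delta_w
        self.bias -= learning_rate * self.delta_bias
        self.delta_w = np.zeros_like(self.W)
        self.delta_bias = np.zeros_like(self.bias)
        self.next_delta_activation = np.zeros_like(self.next_delta_activation)
```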

Output Layer

Variables

  • states : Stores the predictions of all time steps
  • V : output weight matrix
  • delta_v : gradient of V during BPTT
  • bias : c in the math formulas
  • delta_bias : gradient of c

Functions

  • predict : forward pass to calculate the weighted output and probability distribution with softmax
  • get_state : return the output state (prediction) value at a given time step
  • set_state : updating the output state at a time step after forward pass calculation.
  • calculate_deltas_per_step : compute gradients of V and c
  • update_weights_and_bias : updating the parameters using the gradient
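A corresponding sketch of the output layer, with a small softmax helper; as before, this illustrates the structure described above rather than reproducing the exact original code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))      # shift for numerical stability
    return e / np.sum(e)

class OutputLayer:
    """Maps hidden activations to normalised character probabilities."""

    def __init__(self, hidden_size, seq_len, vocab_size=26):
        self.states = np.zeros((seq_len, vocab_size, 1))      # y_hat[t] for every time step
        self.V = np.random.uniform(-0.1, 0.1, (vocab_size, hidden_size))
        self.delta_v = np.zeros_like(self.V)
        self.bias = np.zeros((vocab_size, 1))                 # c
        self.delta_bias = np.zeros_like(self.bias)

    def predict(self, activation):
        # y_hat[t] = softmax(V·a[t] + c)
        return softmax(self.V @ activation + self.bias)

    def get_state(self, time_step):
        return self.states[time_step]

    def set_state(self, time_step, prediction):
        self.states[time_step] = prediction

    def calculate_deltas_per_step(self, time_step, target, activation):
        # dL/do[t] = y_hat[t] - y[t]
        delta_output = self.get_state(time_step) - target
        self.delta_v += delta_output @ activation.T           # dL/dV = (y_hat - y)·a[t]^T
        self.delta_bias += delta_output                       # dL/dc = y_hat - y
        return delta_output

    def update_weights_and_bias(self, learning_rate):
        self.V -= learning_rate * self.delta_v
        self.bias -= learning_rate * self.delta_bias
        self.delta_v = np.zeros_like(self.V)
        self.delta_bias = np.zeros_like(self.bias)
```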

RNN

Variables

  • hidden_layer : An instance of the hidden layer
  • output_layer : An instance of the output layer
  • input_layer : An instance of the input layer
  • alpha : The learning rate

Functions

  • feed_forward : Forward pass — iterating through the input sequence and coordinate functions from other layers to compute prediction at each time step
  • backpropagation : Backward pass — iterating through the sequence but in reverse order, calculating and updating the gradients for all parameters and eventually updating weights and biases.
  • loss : Cross-entropy loss function
  • train : Run through the given number of epochs; in each epoch, iterate over every input sequence, perform the forward and backward passes, and calculate the loss.
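Pulling the layer sketches above together, a condensed RNN class might be structured roughly like this (hyper-parameters such as hidden_size=64 and the training-loop details are assumptions for illustration):

```python
import numpy as np

class RNN:
    """Coordinates the input, hidden and output layers sketched above."""

    def __init__(self, hidden_size=64, seq_len=26, vocab_size=26, alpha=0.01):
        self.seq_len = seq_len
        self.alpha = alpha                                        # learning rate
        self.input_layer = InputLayer(None, hidden_size, vocab_size)   # inputs set per sequence
        self.hidden_layer = HiddenLayer(hidden_size, seq_len)
        self.output_layer = OutputLayer(hidden_size, seq_len, vocab_size)

    def feed_forward(self, sequence):
        self.input_layer.inputs = sequence
        for t in range(self.seq_len):
            a = self.hidden_layer.activate(t, self.input_layer.weighted_sum(t))
            self.hidden_layer.set_hidden_state(t, a)
            self.output_layer.set_state(t, self.output_layer.predict(a))
        return self.output_layer.get_state(self.seq_len - 1)      # prediction at the last time step

    def loss(self, targets):
        # Cross-entropy summed over all time steps (small epsilon avoids log(0))
        total = 0.0
        for t in range(self.seq_len):
            total += -np.sum(targets[t] * np.log(self.output_layer.get_state(t) + 1e-12))
        return total

    def backpropagation(self, targets):
        for t in reversed(range(self.seq_len)):
            # dL/do[t] = y_hat[t] - y[t], accumulating the gradients of V and c on the way
            delta_output = self.output_layer.calculate_deltas_per_step(
                t, targets[t], self.hidden_layer.get_hidden_state(t))
            # V^T·(y_hat[t] - y[t]) flows into the hidden layer, producing dL/ds[t]
            delta_weighted_sum = self.hidden_layer.calculate_deltas_per_step(
                t, self.output_layer.V.T @ delta_output)
            self.input_layer.calculate_deltas_per_step(t, delta_weighted_sum)
        for layer in (self.input_layer, self.hidden_layer, self.output_layer):
            layer.update_weights_and_bias(self.alpha)

    def train(self, inputs, targets, epochs=100):
        for epoch in range(epochs):
            epoch_loss = 0.0
            for sequence, target in zip(inputs, targets):
                self.feed_forward(sequence)
                epoch_loss += self.loss(target)
                self.backpropagation(target)
            print(f"epoch={epoch}")
            print(f"Loss: [{epoch_loss}]")
```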

Main.py

The main file consists of some mock data for sample training and testing. An RNN instance is created, trained on the one-hot encoded vectors, and then tested with a sample new_inputs . This is what the output in the terminal looks like
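A hypothetical main.py along those lines, using the sketches above (the actual sample data in the original differs, and the wrap-around target for "Z" is my own choice for illustration):

```python
import numpy as np

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Mock training data: each target sequence is the input shifted by one character
inputs = one_hot_encode_inputs([ALPHABET])
targets = one_hot_encode_inputs([ALPHABET[1:] + ALPHABET[0]])

rnn = RNN(hidden_size=64, seq_len=26, vocab_size=26, alpha=0.01)
rnn.train(inputs, targets, epochs=100)

# Test on a new sequence and read off the predicted next character
new_inputs = one_hot_encode_inputs([ALPHABET])
prediction = rnn.feed_forward(new_inputs[0])
print("Predicted next character:", ALPHABET[int(np.argmax(prediction))])
```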

And that concludes our vanilla RNN implementation using only NumPy and basic data structures! While math-heavy, fully understanding the equations governing recurrent networks equips us to architect them from scratch before reaching for libraries.

V. Implementing a Vanilla RNN with TensorFlow

Now as a contrast, let’s showcase an implementation powered completely by TensorFlow with just a few lines of high-level code:
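A sketch of what this could look like with Keras's SimpleRNN layer; the specific choices here (128 units, Adam, categorical cross-entropy) are my assumptions rather than the post's exact configuration:

```python
import tensorflow as tf

VOCAB_SIZE = 26   # one-hot encoded characters A-Z
SEQ_LEN = 26      # length of each input sequence

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(128, activation="tanh", input_shape=(SEQ_LEN, VOCAB_SIZE)),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# X: (num_sequences, SEQ_LEN, VOCAB_SIZE) one-hot inputs
# y: (num_sequences, VOCAB_SIZE) one-hot next characters
# model.fit(X, y, epochs=50)
```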

As you can see, the implementation is now much shorter, and all we need is a little configuration for the network. TensorFlow provides optimized RNN building blocks, abstracting away most of the mathematical operations discussed earlier into simple layer construction. This facilitates incredibly quick prototyping and development at scale, but it might not be the best for learning.

The details of TensorFlow can be found in its documentation, so I won't bother you with them here.

In this post, we have covered the core components of simple RNN architectures, tracing the mathematical computations from input sequences to loss calculations and backpropagating gradients through time. By building up the RNN fundamentals from scratch and contrasting the raw NumPy and TensorFlow approaches, I hope you feel empowered to build and tweak these models for your own sequential datasets without being intimidated.

As we wrap up, I encourage you to take these learnings forward — play with network hyper-parameters, implement more complex gated units like LSTMs (I know I will), and scale to real-world time series data. Immerse yourself by coding along instead of just reading. The journey towards mastery requires getting hands-on experience and building intuitive understandings that come from experimenting yourself.

Happy coding!
