Building a Recurrent Neural Network From Scratch
I. Overview
In this blog post, we will explore Recurrent Neural Networks (RNNs) and the mathematics behind their forward and backward passes. We will get hands-on experience by building an RNN from scratch in Python using two approaches: first with only NumPy and then again with TensorFlow. By the end, you’ll understand the intricacies of RNN architectures and be able to train models with confidence irrespective of the underlying frameworks.
For those new to neural networks, I have a complementary blog post that walks through implementing a basic neural network architecture from the ground up without libraries, which you can view here.
In this post we will specifically focus on the sequential and recursive nature of RNN models for processing time-based patterns.
II. Recurrent Neural Network (RNN)
Recurrent neural networks are a type of neural network architecture well-suited for processing sequential data such as text, audio, time series, and more. The key aspect that makes RNNs unique is the presence of recursive computational loops that span adjacent time steps. This recurrence enables RNN models to effectively maintain a persistent internal state or ‘memory’ of prior inputs that can inform processing of data points later in the sequence.
Visually, a typical RNN structure consists of a single computational unit with a self-connected hidden state, through which information cycles across timesteps:
As data passes through the RNN, the activations from previous time steps flow as inputs to influence the network’s computations on current data. This allows the model to dynamically incorporate temporal context and sequence history, crucial for many seq2seq prediction tasks.
Unfolding RNN
Instead of looking purely at the condensed visualisation of the network, unfolding or unrolling it (with the same parameters replicated across each time step) reveals the recurrent nature of RNN computations, turning it into a more traditional directed acyclic computational graph consisting of multiple time steps. We can then visually see how the input sequence is processed across all time steps, which helps with understanding the forward/backward-through-time calculations used to train the parameters later on.
Parameter sharing
Parameter sharing makes it possible to extend and apply the model to examples of different forms/lengths and generalise across them.
- Source: Chapter 10: Sequence Modeling: Recurrent and Recursive Nets -
Consider training a traditional feedforward neural network vs an RNN to predict the next character on English text sequences. The feedforward net would require separate weights for each input-output mapping at every character position. For two similar sentences — “Today, I went to the dentist” and “I went to the dentist today”, it would learn the contextual relationships completely independently. In contrast, the RNN shares weights temporally, so if it learns rules in one part of a sequence, those can be reapplied later even if the sequence order changes slightly.
Hence, for every new sequence, the feedforward network relearns from scratch even if similar semantic patterns were seen before. It cannot develop a unified grammar model. The RNN embeds language rules as reusable transition functions in the shared weights, so they are retained and applied regardless of input sequence length or order. This means that, for sequential data, using an RNN can result in simpler models with fewer parameters to train while achieving better performance than a traditional feedforward network.
III. The Maths
To better understand how an RNN works internally, let's take a look at the maths behind it.
This is a more detailed view of the unfolded network with multiple variables to be used in the formula calculation.
1. Variables
- `W`: Recurrent weight matrix, the weights connecting hidden states
- `U`: Input weight matrix, the weights connecting the input to the hidden state
- `V`: Output weight matrix, the weights connecting the hidden state to the outputs
- The weighted sum of the previous hidden state and the current input (the pre-activation value)
- `a`: Hidden state activation value
- `o`: Weighted output of the activation value
- `y^` ("y hat"): Predicted value (normalised probabilities)
- `y`: True target value
- `L`: Loss value
- `t`: Time step
- `b`: Bias matrix for the hidden states
- `c`: Bias matrix for the output
- `x`: Input vector
2. Forward pass
At each time step, we calculate the weighted sum of the current input (`x[t]`) and the previous hidden state's activation value (`a[t-1]`), then pass it through the non-linear activation function `tanh` so that the model can learn non-linear relationships. Here, `W` and `U` are the weight matrices that jointly transform the previous state and the current input, and `b` provides a bias offset.
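Written out (I'll use `z[t]` as a label for that weighted sum before the activation):

$$z[t] = W\,a[t-1] + U\,x[t] + b, \qquad a[t] = \tanh\left(z[t]\right)$$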
Once we've got the activation value `a[t]`, the output of this time step is the weighted activation (with weights `V`) plus the bias `c`. For classification tasks, this output commonly represents logits (unnormalised, raw predictions). The softmax function is then applied to the logits as a post-processing step to create `y-hat`, a normalised probability distribution over the outputs.
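In equation form:

$$o[t] = V\,a[t] + c, \qquad \hat{y}[t] = \mathrm{softmax}\left(o[t]\right) = \frac{\exp\left(o[t]\right)}{\sum_{k}\exp\left(o_{k}[t]\right)}$$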
When you start at time step `t=0`, the previous hidden state's activation `a[t-1]` should be initialised as an all-zero matrix to avoid an index-out-of-range error.
This is an example of the data flow: we're trying to predict the next character given a sequence of characters in alphabetical order.
Inputs are discrete characters, which are converted to one-hot encoded vectors. The hidden layer and output layer perform the calculations given by the formulas above. Based on the output of the softmax function, we can convert the prediction back to a character. Each time step has a predicted output, and the `y-hat` value at the last time step is the final prediction of the next character of the input sequence.
One-hot encoding
One-hot encoding is a way to express a discrete element (e.g. a word) numerically. It can also be thought of as a probability distribution with all of its mass on a single value. It's common to use one-hot encoding for the inputs, but you could choose other representations as well (pre-trained word embeddings, an embedding layer, integer index representation, etc.).
In the coding example, I'm using the English alphabet as the vocabulary (known size 26), and the RNN is supposed to predict the next character given any random sequence within the alphabet. I'll be using one-hot encoding for simplicity: we know the total number of representable characters, and since our prediction is also a probability distribution (because of the softmax function), it makes sense for the input to be one as well.
For example, for English letters A-Z with vocab size 26:
- ‘A’ becomes [1,0,0,0,…0]
- ‘C’ becomes [0,0,1,0,…0]
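A minimal way to produce these vectors with NumPy (the helper name here is mine, not the post's):

```python
import numpy as np

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def one_hot(char):
    """Return the one-hot vector for a single uppercase letter."""
    vec = np.zeros(len(ALPHABET))
    vec[ALPHABET.index(char)] = 1.0
    return vec

print(one_hot("A"))  # [1. 0. 0. ... 0.]
print(one_hot("C"))  # [0. 0. 1. ... 0.]
```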
Activation function
`tanh` is a popular choice of activation function in RNNs, so this is what we're going to use in this tutorial. One commonly cited reason is that `tanh` has a larger derivative than sigmoid over much of its range, which gives stronger gradient signals and can help the loss decrease and converge faster.
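For reference, the derivative identity we will rely on during backpropagation:

$$\frac{d}{dx}\tanh(x) = 1 - \tanh^{2}(x)$$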
Loss function
For training the parameters of our RNN model, we need a loss function to measure how well the predicted character probabilities match the true sequence. Cross-entropy loss is a common choice when working with categorical distributions: it measures the difference between two probability distributions (the predicted values and the true target values). Since the network consists of multiple time steps, we calculate the loss at each time step, and the overall loss is the sum of all steps' losses.
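With one-hot targets `y` and predictions `y-hat`, this is the standard categorical cross-entropy summed over time steps:

$$L[t] = -\sum_{i} y_{i}[t]\,\log \hat{y}_{i}[t], \qquad L = \sum_{t} L[t]$$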
By comparing the predicted and actual character distributions at each sequence position, cross-entropy loss provides a gradient signal to update network parameters towards outputs that better match the desired targets. Computing losses timestep-wise is needed for proper attribution during backpropagation later. Our goal is to minimize this objective function to increase prediction accuracy over character sequences.
3. Backpropagation Through Time (BPTT)
For the unrolled computational graph, just as the sequence flows through the network step by step in the forward pass, we calculate the gradients in the backward pass step by step as well, but from the last time step back towards the initial time step (in reverse order), hence the name "through time".
Let's zoom into time step `t` as an example to derive the formulas for the gradients, which will be used to update the shared weights and biases. The idea is that if we want to minimise the loss, we need to know in which direction to "nudge" each parameter at every time step so that the loss decreases, and by how much.
Because the parameters do not contribute to the loss function directly but appear inside multiple nested functions, finding the derivatives requires using the chain rule and sum rule extensively, starting from the loss function and working down to the functions that contain the weights and biases.
Starting from the loss function and using the derivative of the natural log, we first get the derivative of the loss w.r.t. the predicted value. Moving down to the derivative of the loss w.r.t. the weighted output, and using the derivative of the softmax function, we get a conveniently simple expression.
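In the notation above, these two derivatives are:

$$\frac{\partial L[t]}{\partial \hat{y}_{i}[t]} = -\frac{y_{i}[t]}{\hat{y}_{i}[t]}, \qquad \frac{\partial L[t]}{\partial o[t]} = \hat{y}[t] - y[t]$$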
Gradients of V and c:
With the gradient of the weighted output in mind, we can calculate the gradient of its bias `c` using the chain rule, and likewise the gradient of the weight matrix `V`.
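Carrying the chain rule through, the per-step gradients for the output parameters are:

$$\frac{\partial L[t]}{\partial c} = \frac{\partial L[t]}{\partial o[t]}\,\frac{\partial o[t]}{\partial c} = \hat{y}[t] - y[t], \qquad \frac{\partial L[t]}{\partial V} = \left(\hat{y}[t] - y[t]\right)\,a[t]^{\top}$$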
Gradients of U, W and b:
Before the gradients of these parameters are calculated, some intermediate values need to be computed.
Since the activation at step `t` contributes to the output at step `t` as well as to the activation of the next step `t+1`, we need to add the gradient flowing in from the current output to the gradient flowing in from the next time step's loss in order to get the total gradient of the current activation.
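Adding the two paths gives the following (with `z[t]` again denoting the weighted sum, and the second term taken as zero at the last time step):

$$\frac{\partial L}{\partial a[t]} = V^{\top}\left(\hat{y}[t] - y[t]\right) + W^{\top}\,\frac{\partial L}{\partial z[t+1]}$$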
For the gradient of the weighted sum of the current time step, we use the derivative of `tanh`. Now we can find the gradients of `W`, `U`, and `b`.
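In the same notation:

$$\frac{\partial L}{\partial z[t]} = \frac{\partial L}{\partial a[t]} \odot \left(1 - a[t]^{2}\right)$$

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z[t]}\;a[t-1]^{\top}, \qquad \frac{\partial L}{\partial U} = \frac{\partial L}{\partial z[t]}\;x[t]^{\top}, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial z[t]}$$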
That's all of the gradients we need to update the weights and biases! Before we do that, notice how the gradient at time step `t` depends on a term carried over from time step `t+1`; similarly, step `t-1` will need a term computed at step `t`. In the code, we'll be calculating this value at every time step `t` and storing it in a variable so that it can be used at the next step of the backward pass, `t-1`.
The gradient of the current step's loss w.r.t. the previous activation is:
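This is the quantity stored between steps in the code (as `next_delta_activation` in the hidden layer later on):

$$\frac{\partial L[t]}{\partial a[t-1]} = W^{\top}\,\frac{\partial L}{\partial z[t]}$$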
Finally, we update all parameters based on the gradients of the current time step `t`.
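With learning rate `alpha`, each parameter moves against its gradient:

$$\theta \leftarrow \theta - \alpha\,\frac{\partial L}{\partial \theta} \qquad \text{for each } \theta \in \{U, W, V, b, c\}$$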
4. Gradient vanishing/exploding
Because we'll be repeating the weight-update logic for every time step, there's a high chance that we might run into the notorious vanishing/exploding gradient problem with RNNs. Repeated multiplication by either large or small values can cause the gradients to exponentially increase (explode) or decrease (vanish).
With vanishing gradients, the gradient becomes so small by the time it reaches the initial time steps that long-range dependency information is lost and convergence is slow. On the other hand, exploding gradients (gradients getting too big) can make optimisation very unstable, with the loss bouncing back and forth away from the global minimum.
We can replicate this when running the code by assuming:
- a high learning rate: 0.5
- a large hidden state: 128 units

We can then observe the symptom of exploding gradients: the loss grows rapidly and eventually overflows to NaN.
```
epoch=0
Loss: [138.87545009]
epoch=1
Loss: [1615.33636469]
epoch=2
Loss: [2980.75967274]
epoch=3
Loss: [nan]
epoch=4
Loss: [3193.67601878]
epoch=5
Loss: [nan]
epoch=6
Loss: [nan]
epoch=7
Loss: [nan]
epoch=8
Loss: [nan]
epoch=9
Loss: [nan]
```
There are a couple of ways to deal with this. We can:
- reduce the learning rate
- reduce the size of the hidden state
- apply gradient clipping
Reducing the learning rate might help, but at the cost of really slow convergence. Reducing the size of the hidden state can also work, but at the cost of the network not being able to capture long-term dependencies, which could hinder improvement. Gradient clipping is one of the most popular approaches: it is effective and takes little effort to implement, as sketched below.
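As a minimal sketch of norm-based clipping (the threshold value and function name here are illustrative choices, not from the original code):

```python
import numpy as np

def clip_gradient(grad, threshold=1.0):
    """Scale the gradient down if its L2 norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

# applied to each parameter's gradient right before the weight update, e.g.
# delta_w = clip_gradient(delta_w)
# W -= alpha * delta_w
```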
There are other architectures that could further address this issue more efficiently such as LSTM which we will cover in the future.
IV. Implementing a Vanilla RNN with NumPy
Armed with the mathematical foundations, we now have all the pieces needed to implement our own RNN architecture in code for sequence predictions.
As mentioned, we’ll train the network to model English alphabet characters — feeding it subsequences and predicting probable next letters. Our vocabulary is the 26 letters A-Z, so vectors will be length 26 after one-hot encoding each input character into a vector.
This is what the sample data for training looks like.
I created a utility function that converts the list of strings into a list of one-hot encoded vectors.
Each input sequence has 26 characters, and each character (e.g. "A", "B") becomes a list of 26 items, with the item matching its index in the alphabet set to 1 while the rest are 0. So each input sequence has the shape (26, 26, 1).
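A sketch of one way such a utility could be written to match the shapes described above (the function name is illustrative):

```python
import numpy as np

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_sequences(sequences):
    """Convert a list of strings into lists of one-hot (26, 1) column vectors."""
    encoded = []
    for sequence in sequences:
        vectors = []
        for char in sequence:
            vec = np.zeros((len(ALPHABET), 1))
            vec[ALPHABET.index(char)] = 1.0
            vectors.append(vec)
        encoded.append(vectors)
    return encoded

sample = encode_sequences(["ABCDEFGHIJKLMNOPQRSTUVWXYZ"])
print(np.array(sample[0]).shape)  # (26, 26, 1)
```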
Input Layer
Let's start with the InputLayer class.
Variables
- `inputs`: the sequential data, in the form of NumPy arrays.
- `U`: the weight matrix connecting the input to the hidden layer.
- `delta_U`: the gradient of `U` calculated during BPTT.
Functions
- `get_input`: return the one-hot encoded vector of the character at a given time step.
- `weighted_sum`: return the result of `U . x[t]`, to be used in the hidden state calculation.
- `calculate_deltas_per_step`: calculate the gradient of `U` at a given time step. The code `delta_weighted_sum @ self.get_input(time_step).T` maps to the gradient formula for `U` derived in the maths section.
- `update_weights_and_bias`: update the parameters using the gradient (see the sketch below).
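To make this concrete, here is a minimal sketch of what such a class could look like. The constructor arguments, weight initialisation, and exact signatures are my assumptions; only the attribute and method names follow the description above.

```python
import numpy as np

class InputLayer:
    def __init__(self, hidden_size, vocab_size):
        self.inputs = None                         # set to the current sequence before each forward pass
        self.U = np.random.randn(hidden_size, vocab_size) * 0.01
        self.delta_U = np.zeros_like(self.U)

    def get_input(self, time_step):
        # one-hot column vector, shape (vocab_size, 1), for the character at this step
        return self.inputs[time_step]

    def weighted_sum(self, time_step):
        # U . x[t], combined with W . a[t-1] + b inside the hidden layer
        return self.U @ self.get_input(time_step)

    def calculate_deltas_per_step(self, time_step, delta_weighted_sum):
        # dL/dU at this step: dL/dz[t] . x[t]^T
        self.delta_U = delta_weighted_sum @ self.get_input(time_step).T

    def update_weights_and_bias(self, alpha):
        self.U -= alpha * self.delta_U
```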
Hidden Layer
Variables
- `states`: stores the activations of all time steps (the internal memory of the network).
- `W`: the recurrent weight matrix.
- `delta_w`: the gradient of `W` during BPTT.
- `bias`: `b` in the maths formulas.
- `delta_bias`: the gradient of `b`.
- `next_delta_activation`: stores the derivative of the next step's loss w.r.t. the current activation, i.e. the term passed backwards between steps in the formula above.
Functions
- `get_hidden_state`: return the hidden state value at a given time step. If the time step is less than 0, default to an all-zero matrix.
- `set_hidden_state`: update the state at a time step after the forward pass calculation.
- `activate`: the forward pass calculation.
- `calculate_deltas_per_step`: compute the gradients of `W` and `b`.
- `update_weights_and_bias`: update the parameters using the gradients (see the sketch below).
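A possible shape for this class, matching the variables and functions above (the dictionary used for `states`, the constructor, and the method signatures are my own choices):

```python
class HiddenLayer:
    def __init__(self, hidden_size):
        self.hidden_size = hidden_size
        self.states = {}                            # time step -> a[t], the network's internal memory
        self.W = np.random.randn(hidden_size, hidden_size) * 0.01
        self.bias = np.zeros((hidden_size, 1))
        self.delta_w = np.zeros_like(self.W)
        self.delta_bias = np.zeros_like(self.bias)
        # dL[t+1]/da[t], carried backwards from the step processed just before
        self.next_delta_activation = np.zeros((hidden_size, 1))

    def get_hidden_state(self, time_step):
        # a[-1] defaults to all zeros so that t=0 has a valid "previous" state
        if time_step < 0:
            return np.zeros((self.hidden_size, 1))
        return self.states[time_step]

    def set_hidden_state(self, time_step, state):
        self.states[time_step] = state

    def activate(self, time_step, input_weighted_sum):
        # a[t] = tanh(W . a[t-1] + U . x[t] + b)
        z = self.W @ self.get_hidden_state(time_step - 1) + input_weighted_sum + self.bias
        return np.tanh(z)

    def calculate_deltas_per_step(self, time_step, delta_activation):
        # dL/dz[t] = dL/da[t] * (1 - a[t]^2), using the tanh derivative
        a = self.get_hidden_state(time_step)
        delta_weighted_sum = delta_activation * (1 - a ** 2)
        self.delta_w = delta_weighted_sum @ self.get_hidden_state(time_step - 1).T
        self.delta_bias = delta_weighted_sum
        # dL[t]/da[t-1] = W^T . dL/dz[t], used when the backward pass moves to step t-1
        self.next_delta_activation = self.W.T @ delta_weighted_sum
        return delta_weighted_sum

    def update_weights_and_bias(self, alpha):
        self.W -= alpha * self.delta_w
        self.bias -= alpha * self.delta_bias
```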
Output Layer
Variables
- `states`: stores the predictions of all time steps.
- `V`: the output weight matrix.
- `delta_v`: the gradient of `V` during BPTT.
- `bias`: `c` in the maths formulas.
- `delta_bias`: the gradient of `c`.
Functions
- `predict`: the forward pass to calculate the weighted output and the probability distribution with softmax.
- `get_state`: return the output state (prediction) at a given time step.
- `set_state`: update the output state at a time step after the forward pass calculation.
- `calculate_deltas_per_step`: compute the gradients of `V` and `c`.
- `update_weights_and_bias`: update the parameters using the gradients (see the sketch below).
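Again as a sketch under the same assumptions:

```python
class OutputLayer:
    def __init__(self, hidden_size, vocab_size):
        self.states = {}                            # time step -> y_hat[t]
        self.V = np.random.randn(vocab_size, hidden_size) * 0.01
        self.bias = np.zeros((vocab_size, 1))
        self.delta_v = np.zeros_like(self.V)
        self.delta_bias = np.zeros_like(self.bias)

    def predict(self, activation):
        # o[t] = V . a[t] + c, then softmax to get y_hat[t]
        o = self.V @ activation + self.bias
        exp = np.exp(o - np.max(o))                 # shift for numerical stability
        return exp / np.sum(exp)

    def get_state(self, time_step):
        return self.states[time_step]

    def set_state(self, time_step, state):
        self.states[time_step] = state

    def calculate_deltas_per_step(self, time_step, target, activation):
        # dL/do[t] = y_hat[t] - y[t]
        delta_output = self.get_state(time_step) - target
        self.delta_v = delta_output @ activation.T
        self.delta_bias = delta_output
        # return dL[t]/da[t] through the output path: V^T . (y_hat - y)
        return self.V.T @ delta_output

    def update_weights_and_bias(self, alpha):
        self.V -= alpha * self.delta_v
        self.bias -= alpha * self.delta_bias
```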
RNN
Variables
- `hidden_layer`: an instance of the hidden layer.
- `output_layer`: an instance of the output layer.
- `input_layer`: an instance of the input layer.
- `alpha`: the learning rate.
Functions
- `feed_forward`: the forward pass. Iterates through the input sequence and coordinates functions from the other layers to compute the prediction at each time step.
- `backpropagation`: the backward pass. Iterates through the sequence in reverse order, calculating the gradients for all parameters and updating the weights and biases.
- `loss`: the cross-entropy loss function.
- `train`: runs through the given number of epochs; for each input sequence it performs both the forward and backward pass and calculates the loss (see the sketch below).
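Tying the layers together, one possible coordinating class, reusing the layer sketches above and consistent with the per-step update scheme from the maths section (the constructor arguments and training-loop details are my assumptions):

```python
class RNN:
    def __init__(self, hidden_size, vocab_size, alpha=0.01):
        self.input_layer = InputLayer(hidden_size, vocab_size)
        self.hidden_layer = HiddenLayer(hidden_size)
        self.output_layer = OutputLayer(hidden_size, vocab_size)
        self.alpha = alpha

    def feed_forward(self, inputs):
        self.input_layer.inputs = inputs
        for t in range(len(inputs)):
            activation = self.hidden_layer.activate(t, self.input_layer.weighted_sum(t))
            self.hidden_layer.set_hidden_state(t, activation)
            self.output_layer.set_state(t, self.output_layer.predict(activation))

    def loss(self, targets):
        # cross-entropy summed over all time steps
        total = 0.0
        for t, y in enumerate(targets):
            y_hat = self.output_layer.get_state(t)
            total -= float(np.sum(y * np.log(y_hat + 1e-12)))
        return total

    def backpropagation(self, targets):
        self.hidden_layer.next_delta_activation = np.zeros_like(self.hidden_layer.bias)
        for t in reversed(range(len(targets))):
            activation = self.hidden_layer.get_hidden_state(t)
            # total dL/da[t]: gradient through this step's output plus the term from step t+1
            delta_activation = (
                self.output_layer.calculate_deltas_per_step(t, targets[t], activation)
                + self.hidden_layer.next_delta_activation
            )
            delta_weighted_sum = self.hidden_layer.calculate_deltas_per_step(t, delta_activation)
            self.input_layer.calculate_deltas_per_step(t, delta_weighted_sum)
            # update every parameter with this step's gradients
            self.output_layer.update_weights_and_bias(self.alpha)
            self.hidden_layer.update_weights_and_bias(self.alpha)
            self.input_layer.update_weights_and_bias(self.alpha)

    def train(self, sequences, targets, epochs):
        for epoch in range(epochs):
            epoch_loss = 0.0
            for inputs, target in zip(sequences, targets):
                self.feed_forward(inputs)
                epoch_loss += self.loss(target)
                self.backpropagation(target)
            print(f"epoch={epoch}")
            print(f"Loss: [{epoch_loss}]")
```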
Main.py
The main file consists of some mock data for sample training and testing. An RNN instance is created and trained with the one-hot encoded vectors, then tested with a sample `new_inputs`.
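As an illustration of how these pieces could be wired together (the mock data, sequence length, and hyper-parameters below are my own choices for the sketch, not the original script):

```python
import numpy as np

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(sequence):
    """One-hot encode a string into a list of (26, 1) column vectors."""
    vectors = []
    for char in sequence:
        vec = np.zeros((len(ALPHABET), 1))
        vec[ALPHABET.index(char)] = 1.0
        vectors.append(vec)
    return vectors

# next-character prediction: targets are the inputs shifted by one position
training_sequences = ["ABCDEF", "GHIJKL", "MNOPQR", "STUVWX"]
inputs = [encode(seq[:-1]) for seq in training_sequences]
targets = [encode(seq[1:]) for seq in training_sequences]

rnn = RNN(hidden_size=32, vocab_size=26, alpha=0.01)
rnn.train(inputs, targets, epochs=100)

# test on an unseen sequence; the prediction at the last time step is the next character
new_inputs = encode("JKLMN")
rnn.feed_forward(new_inputs)
prediction = rnn.output_layer.get_state(len(new_inputs) - 1)
print("Predicted next character:", ALPHABET[int(np.argmax(prediction))])
```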
This is what the output in the terminal looks like.
And that concludes our vanilla RNN implementation using only NumPy and basic data structures! While the process is math-heavy, fully understanding the equations governing recurrent networks equips us to architect them from scratch before reaching for libraries.
V. Implementing a Vanilla RNN with TensorFlow
Now as a contrast, let’s showcase an implementation powered completely by TensorFlow with just a few lines of high-level code:
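A comparable setup using `tf.keras` might look like the sketch below; the layer sizes, optimiser, and mock data here are illustrative choices rather than the exact configuration used for the results above:

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 26
HIDDEN_SIZE = 32

# mock data: one-hot sequences of length 5 with one-hot next-character targets
x_train = np.zeros((100, 5, VOCAB_SIZE), dtype="float32")
y_train = np.zeros((100, VOCAB_SIZE), dtype="float32")
for i in range(100):
    start = i % (VOCAB_SIZE - 5)
    for t in range(5):
        x_train[i, t, start + t] = 1.0
    y_train[i, start + 5] = 1.0

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, VOCAB_SIZE)),
    tf.keras.layers.SimpleRNN(HIDDEN_SIZE, activation="tanh"),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=100, verbose=0)
model.summary()
```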
As you can see, the implementation is much shorter, and all we need is a bit of configuration for the network. TensorFlow provides optimized RNN building blocks, abstracting away most of the mathematical operations discussed earlier into simple layer construction. This facilitates incredibly quick prototyping and development at scale, but might not be the best for learning.
The details of TensorFlow can be found in its documentation, so I won't repeat them here.
In this post, we have covered the core components of simple RNN architectures, tracing the mathematical computations from input sequences to loss calculations and backpropagating gradients through time. By building up RNN fundamentals from the ground up and contrasting the raw NumPy and TensorFlow approaches, I hope you feel empowered to build and tweak these models for your own sequential datasets without being intimidated.
As we wrap up, I encourage you to take these learnings forward — play with network hyper-parameters, implement more complex gated units like LSTMs (I know I will), and scale to real-world time series data. Immerse yourself by coding along instead of just reading. The journey towards mastery requires getting hands-on experience and building intuitive understandings that come from experimenting yourself.
Happy coding!