Building a Neural Network Zoo From Scratch: The Recurrent Neural Network

Gavin Hull
9 min read · Nov 1, 2022


Visualization of the Vanilla Recurrent Neural Network from Asimov Institute.

Recurrent Neural Networks (RNNs) are a class of deep learning architectures designed to represent sequences of data. There are many different RNN architectures, some of which will be covered in future articles, but in this article I will focus on the Vanilla RNN, the original model. I suggest giving my previous articles a read if you aren’t comfortable with deep learning architectures in general.

Despite their impressive results, Recurrent Neural Networks have had a rather checkered history. Much of their early success came from the work of David Rumelhart in 1986 (much like Multilayer Perceptrons), and they were among the first deep learning models to achieve superhuman performance on some tasks. However, much of the RNN’s popularity is due to its adaptations, such as the LSTM and GRU, rather than its original design.

What are Recurrent Neural Networks?

Recurrent Neural Networks are sequence-based learning models, which means they’re used to predict the next event in a sequence of data. They were inspired by the way the brain works as it forms and recalls memories by continuously analyzing previous events in order to decide how to act in the present.

The magic of the RNN is the hidden state. The hidden state represents the information the model would like to remember in the future. At each time step, the RNN combines a weighted sum of the current input and the previous hidden state into a new hidden state; at the final time step, it uses that hidden state to calculate and return the final output.

Rolled vs. unrolled RNN.

The figure above shows the two ways to think of RNNs: rolled and unrolled. In the rolled model, you can clearly see the input, hidden, and output layers, and how the hidden layer connects to itself. The unrolled model, in contrast, is useful because it shows the intermediary steps of the network: the layers remain the same, but you can see the process over time. In the unrolled model, the network receives X₀ and returns Y₀ at the first timestep. The network then passes a hidden state to itself at the second timestep, where it uses X₁ and this hidden state to calculate and return Y₁. The same process repeats until the network has calculated and returned the final output, Yₙ. This means that each timestep of the network contains information about all of the previous timesteps as well as its own. It is for this reason that RNNs were originally designed: to solve the “amnesia” of Multilayer Perceptrons. However, due to their structure, RNNs are still prone to the vanishing/exploding gradient problem: as the error is propagated back through many timesteps, the gradients of the early timesteps shrink toward 0 (or grow toward infinity) and therefore stop contributing useful updates to the network.

How does it work?

There really isn’t anything exceptional about the forward pass of the Vanilla RNN. For each time step, we generate the next hidden state with the formula below, where HSₙ and xₙ are the hidden state and input at time step n.

Formula for the hidden state: HSₙ = tanh(W₁·xₙ + W₂·HSₙ₋₁ + b₂)

Once all time steps have been processed, the final output of the network is calculated using the following, where HS₋₁ (index −1, as in Python) is the final hidden state.

RNN output function: y = W₃·HS₋₁ + b₃

The backpropagation algorithm is also similar to that of the previously discussed networks, the difference being that the error has to be propagated back through each time step. This is called backpropagation through time (BPTT).

Computational graph of the hidden layers of an RNN.


As I’ve explained in previous posts, the error at any given step can be calculated as the derivative of the upstream function with respect to the current function, multiplied by the upstream error. Here is a diagram to clarify:

Diagram showing the construction of a computational graph.

Using these diagrams, we can calculate the following (written out beneath the list):

Error of the input at timestep n.
Error of W1.
Error of the hidden state at timestep n.
Error of W2.
Error of b2.
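The original images spell these expressions out exactly; as a rough sketch (not a transcription), writing dHSₙ for the error arriving at hidden state n and ⊙ for elementwise multiplication, they come out to approximately:

```latex
\delta_n  &= (1 - HS_n^2) \odot dHS_n                  && \text{error through the tanh at step } n \\
dx_n      &= W_1^\top \delta_n                         && \text{error of the input at timestep } n \\
dW_1      &= \textstyle\sum_n \delta_n\, x_n^\top      && \text{error of } W_1 \\
dHS_{n-1} &= W_2^\top \delta_n                         && \text{error of the hidden state} \\
dW_2      &= \textstyle\sum_n \delta_n\, HS_{n-1}^\top && \text{error of } W_2 \\
db_2      &= \textstyle\sum_n \delta_n                 && \text{error of } b_2
```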

Unlike regular backpropagation, in backpropagation through time the error of each variable is calculated as the sum of its errors at each time step. This will be demonstrated more concretely in the code. The error of the final layer can then be calculated using the following diagram:

Computational graph of the output layer of an RNN.

Finally, using the same techniques as before, the error of each variable is calculated as follows (again written out beneath the list):

Error of b3.
Error of W3.
Error of the final hidden state.
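In the same notation, with dy the error of the output, these are approximately:

```latex
db_3     &= dy                   && \text{error of } b_3 \\
dW_3     &= dy \, HS_{-1}^\top   && \text{error of } W_3 \\
dHS_{-1} &= W_3^\top dy          && \text{error of the final hidden state}
```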

With the prerequisite math out of the way, let’s look at the code.

The code!

As always, the only package we will be using for the network itself is NumPy. I also use a package called tqdm to add a progress bar to the training stage, but that is a personal choice.
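The full code is on GitHub; the snippets below are minimal sketches of each step rather than the exact source. They all share two imports:

```python
import numpy as np
from tqdm import tqdm  # optional: only used for the training progress bar
```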

While RNNs can be used for lots of different problems, one of the most interesting (at least to me) is natural language processing. For this article, we’ll be training our RNN to determine whether a sentence is happy or sad.

The train_X and train_y lists hold our training sentences and labels, where 1 is happy and 0 is sad. In the same way, test_X and test_y hold the test data. Finally, we create a set called vocab which holds all of the words used in our training and testing data, and create a dictionary called word_to_index which we will use to encode our language data into numbers, which the network can better understand.
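The sentences below are stand-ins to show the shape of the data; the real lists are in the repository:

```python
# Stand-in data: 1 = happy, 0 = sad.
train_X = ["i am happy", "i am good", "i am sad", "i am bad"]
train_y = [1, 1, 0, 0]
test_X = ["i am very happy", "i am very sad"]
test_y = [1, 0]

# Every word used in the data, plus a word -> index lookup for encoding.
vocab = set(word for sentence in train_X + test_X for word in sentence.split())
word_to_index = {word: i for i, word in enumerate(vocab)}
```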

Next we define a function called oneHotEncode that takes a sentence, text, and returns a list of one-hot vectors (one per word) representing the input text. Neural networks only take numbers as inputs, so one-hot encoding translates between our language and the language of the neural net. I’ve also defined a function called initWeights in this section, which will be used to create the weights for each layer. I’ve used Xavier initialization here, mostly out of preference. Simply put, initialization schemes shape the distribution of a network’s random starting weights. If you’re interested in the math behind this, I’d recommend this article.
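A minimal version of the two helpers, assuming one-hot column vectors of length len(vocab):

```python
def oneHotEncode(text):
    # One column vector per word: all zeros except a 1 at the word's index.
    vectors = []
    for word in text.split():
        vector = np.zeros((len(vocab), 1))
        vector[word_to_index[word]] = 1
        vectors.append(vector)
    return vectors

def initWeights(input_size, output_size):
    # Xavier initialization: uniform noise scaled by the layer's fan-in and fan-out.
    return np.random.uniform(-1, 1, (output_size, input_size)) * np.sqrt(6 / (input_size + output_size))
```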

The activation function for this network will be tanh(x), whose derivative is 1 − tanh²(x). However, you may notice in the code that it is defined as 1 − x². This is because we will store the values of tanh(x) in our network rather than x itself, so when we need the derivative, we need only square the stored value and subtract it from one. This makes the code more readable and saves us from storing extra values over time. We will also be using softmax in our error calculation, which is defined as:

Definition of softmax for some vector x: softmax(x)ᵢ = exp(xᵢ) / Σⱼ exp(xⱼ)
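In code, both fit in a few lines (the derivative flag and the max-shift in softmax are this sketch’s conventions):

```python
def tanh(x, derivative=False):
    if derivative:
        # x is assumed to already be a stored tanh value,
        # so the derivative of tanh is simply 1 - x**2.
        return 1 - x ** 2
    return np.tanh(x)

def softmax(x):
    # Shifting by the max keeps the exponentials numerically stable.
    exponents = np.exp(x - np.max(x))
    return exponents / np.sum(exponents)
```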

Now to initialize our network. Our RNN class defines three weights (w1, w2, and w3) using our initWeights function, and defines two biases. Notice that there is no bias for the first layer. This is because the hidden state is calculated as W1(x) + W2(HS) + b2, and a second constant bias term would simply fold into b2. Having said that, adding a b1 term wouldn’t break anything either, so if you want one for consistency, go ahead.
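A sketch of that constructor (the learning rate is stored here for the update step later):

```python
class RNN:
    def __init__(self, input_size, hidden_size, output_size, learning_rate):
        self.learning_rate = learning_rate

        # Three weight matrices: input -> hidden, hidden -> hidden, hidden -> output.
        self.w1 = initWeights(input_size, hidden_size)
        self.w2 = initWeights(hidden_size, hidden_size)
        self.w3 = initWeights(hidden_size, output_size)

        # Two biases; the first layer needs none, as discussed above.
        self.b2 = np.zeros((hidden_size, 1))
        self.b3 = np.zeros((output_size, 1))
```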

As mentioned earlier, the forward propagation function is very similar to that of the single and multilayer perceptrons. The difference is that we forward propagate for every input given (in this case, for every word in the input sentence). We initialize a hidden state and start iterating through the inputs. For each input, layer1_output is calculated as the input multiplied by the weights of the first layer, w1. layer2_output is the previous hidden state, hidden_state[-1], multiplied by w2, plus the second-layer bias, b2. Our new hidden state (the hyperbolic tangent of the first-layer output plus the second-layer output) is then appended to our list of hidden states, and the process repeats for each input. Finally, the output is reached by multiplying the final hidden state by the output layer weights, w3, and adding the bias b3.
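Putting that paragraph into code, something like:

```python
    def forward(self, inputs):
        self.inputs = inputs
        # hidden_state[0] is the initial zero state;
        # hidden_state[-1] is always the most recent one.
        self.hidden_state = [np.zeros((self.w2.shape[0], 1))]

        for x in inputs:
            layer1_output = self.w1 @ x
            layer2_output = self.w2 @ self.hidden_state[-1] + self.b2
            self.hidden_state.append(tanh(layer1_output + layer2_output))

        # The final hidden state through the output layer gives the output.
        return self.w3 @ self.hidden_state[-1] + self.b3
```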

Backpropagation function.

The backpropagation function is going to look a bit different from those of the other neural networks we’ve looked at; a sketch follows below. We can find the errors of the output layer (d_w3 and d_b3) as described above. The other gradients we initialize to zero, then fill in by iterating back through the inputs, adding the error from each timestep. The error of bias 2 (d_b2) will be the sum of the hidden-state errors for each input, and the error of weights 2 (d_w2) will be the sum of the hidden state multiplied by the error of the hidden state for each input. Finally, the error of the first layer (d_w1) will be the sum of the input multiplied by the error of the hidden state for each input.
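Here is a sketch of that loop (d_y is the error of the output, as computed in the train function below):

```python
    def backward(self, d_y):
        # Output-layer errors, exactly as in the diagram above.
        d_w3 = d_y @ self.hidden_state[-1].T
        d_b3 = d_y

        # The remaining errors start at zero and accumulate over timesteps.
        d_w1 = np.zeros_like(self.w1)
        d_w2 = np.zeros_like(self.w2)
        d_b2 = np.zeros_like(self.b2)

        d_hs = self.w3.T @ d_y  # error arriving at the final hidden state
        for t in reversed(range(len(self.inputs))):
            # hidden_state stores tanh values, so the derivative is 1 - hs**2.
            d_raw = tanh(self.hidden_state[t + 1], derivative=True) * d_hs
            d_b2 += d_raw
            d_w2 += d_raw @ self.hidden_state[t].T
            d_w1 += d_raw @ self.inputs[t].T
            d_hs = self.w2.T @ d_raw  # pass the error back one timestep
```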

After calculating these error values, there’s one more step before we can update the network. As I mentioned previously, RNNs often suffer from the vanishing/exploding gradient problem, meaning that their errors either vanish to incredibly small numbers or explode to numbers approaching infinity. Both scenarios are bad for the network. There isn’t much you can do to prevent a vanishing gradient. The exploding gradient, however, can be mitigated using gradient clipping, which simply means capping any gradient larger than n at n. In this case, I’ve set my clipping threshold to 1, but feel free to play with that.
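Continuing the same backward method from the sketch above, clipping and then updating:

```python
        # Still inside backward: clip every gradient into [-1, 1]...
        for gradient in (d_w1, d_w2, d_w3, d_b2, d_b3):
            np.clip(gradient, -1, 1, out=gradient)

        # ...then take a plain gradient-descent step.
        self.w1 -= self.learning_rate * d_w1
        self.w2 -= self.learning_rate * d_w2
        self.w3 -= self.learning_rate * d_w3
        self.b2 -= self.learning_rate * d_b2
        self.b3 -= self.learning_rate * d_b3
```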

Train & Test function.

The train and test functions are nothing new. The train function iterates over the inputs and labels (using tqdm to display a progress bar while doing so) and both forward propagates and backpropagates the network. The output is passed through softmax, which turns the network’s raw scores into probabilities before the error is computed. The test function is similar, iterating over the inputs and labels provided and forward propagating. The difference is that it prints the network’s predictions, so you can see whether the network is improving, and then updates the accuracy variable so we can get a percentage accuracy when it completes.
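Sketches of both, assuming the label indexes the output vector (0 = sad, 1 = happy) and that the epoch count is passed to train:

```python
    def train(self, inputs, labels, epochs):
        for _ in tqdm(range(epochs)):
            for text, label in zip(inputs, labels):
                probabilities = softmax(self.forward(oneHotEncode(text)))
                # Gradient of cross-entropy after softmax: p - one_hot(label).
                d_y = probabilities.copy()
                d_y[label] -= 1
                self.backward(d_y)

    def test(self, inputs, labels):
        correct = 0
        for text, label in zip(inputs, labels):
            probabilities = softmax(self.forward(oneHotEncode(text)))
            prediction = int(np.argmax(probabilities))
            print(f"'{text}' -> {'happy' if prediction == 1 else 'sad'}")
            correct += prediction == label
        print(f"Accuracy: {100 * correct / len(inputs)}%")
```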

Finally, we initialize the network, train, and test! I’ve used a hidden size of 64, a learning rate of 0.02, and 1000 epochs, but as always, I highly recommend playing around with these hyperparameters to get the best results. Also, try adding some new training and testing data and see what the network can and cannot understand. The best way to get a grasp on neural networks is to play with them and discover them for yourself.
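As a usage sketch (output_size=2 because the labels index two classes; the exact constructor signature here is this sketch’s, not necessarily the repository’s):

```python
rnn = RNN(
    input_size=len(vocab),  # length of each one-hot vector
    hidden_size=64,
    output_size=2,          # two classes: sad (0) and happy (1)
    learning_rate=0.02,
)
rnn.train(train_X, train_y, epochs=1000)
rnn.test(test_X, test_y)
```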

This concludes the third article in my “Building a Neural Network Zoo from Scratch” series. I hope you have enjoyed reading this article. If you did, I’d appreciate it if you could share this article with your friends and colleagues. As always, the full code is available on GitHub.

A big thanks to Emily Hull for her help editing this article.

