Building a Neural Network Zoo From Scratch: Feed Forward Neural Networks

Gavin Hull
7 min read · Oct 8, 2022


Visualization of a Feed Forward Neural Network from the Asimov Institute.

Feed Forward Neural Networks are a broad class of neural networks encompassing any network in which information moves only forward, from input to output (i.e. most of them). For the purposes of this article, we’ll mainly be focusing on the simplest of them: Multilayer Perceptrons.

If you haven’t read my previous article on single-layer Perceptrons, I’d highly recommend doing so before you continue.

There’s a reason the 70s are referred to as the “dark ages” of AI: with little to no progress on neural networks since Frank Rosenblatt’s paper in 1958, the prospect of neural networks had become somewhat of a joke in the cognitive research community, with some even going so far as to provide formal proofs of why neural networks could never work. However, in 1985, twenty-seven years after the original paper, a small study group at UC San Diego presented this paper and forever changed the world of AI.

Single-layer Perceptron vs. Multilayer Perceptron

Truth be told, “Multilayer Perceptron” is a bit of a misnomer. The crucial thing this study group proposed (the real difference between single-layer Perceptrons and Multilayer Perceptrons) isn’t the number of Perceptrons, but the non-linearities between them.

A non-linearity is exactly what it sounds like: a function whose graph is not a straight line. There are many non-linear activation functions nowadays, but the original one proposed by the study group was called sigmoid.

The formula for the sigmoid function.
Graph of the sigmoid function.
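For reference, written out, the sigmoid function (and its derivative, which we’ll need later for backpropagation) is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,\big(1 - \sigma(x)\big)$$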

If you remember, in my previous article I compared a single-layer Perceptron to the equation of a line, and explained how this presented a problem when trying to separate functions like XOR. Multilayer Perceptrons solve this problem by adding non-linear activation functions like sigmoid between layers to warp the ‘Perceptron line’.

One possible separation performed by a Multilayer Perceptron on the XOR graph.

Unlike single-layer Perceptrons, Multilayer Perceptrons are still used today for tasks like sentiment analysis, weather forecasting and even basic image recognition, which is what we will be covering in this tutorial. This is due in part to the Universal Approximation Theorem, which states that a feed forward network with a non-linear activation and a sufficiently large hidden layer can approximate any continuous function f(x) arbitrarily well. In other words, because of these non-linearities, given enough time and enough accurate data, a neural network can learn almost anything.

Okay, so how does it work?

Like the Perceptron, the Multilayer Perceptron has a forward pass and a backward pass. The forward pass remains relatively unchanged between these two networks, although it is longer due to the multiple layers.

Layer 1 function.
Layer 2 function.
Total network function.
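Written out (my reconstruction of the pictured formulas, with σ denoting the sigmoid activation and a₁, a₂ the outputs of the two layers):

$$a_1 = \sigma(W_1 x + b_1)$$
$$a_2 = \sigma(W_2 a_1 + b_2)$$
$$f(x) = \sigma\big(W_2\,\sigma(W_1 x + b_1) + b_2\big)$$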

As the two layers of the network have the same form, differing only in their weights w and biases b, we can imagine the network being split into two separate ‘Perceptron functions’ as shown above.

The backward pass gets a little more complicated with these added non-linearities. For this reason, we’ll split our computational graph into two parts as well.

Computational graph of layer 2.

As before, we place each of our variables on the graph and forward propagate the current function at each step above the connections in green. Then, starting from the right, we backpropagate our errors in red: the last error will always be e and every other will be the upstream error (the error to the right) multiplied by the derivative of the upstream function (the function to the right) with respect to the current function (the function above the connection).

Error of W2x + b2.
Error of b2.
Error of W2x.
Error of W2.
Error of x.
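In symbols (my notation, not necessarily the exact form of the figures above), writing e for the network error and x for the input to the second layer, the first three errors are all the same quantity,

$$\delta_2 = e \odot \sigma'(W_2 x + b_2),$$

since W₂x + b₂ differentiated with respect to b₂ or to W₂x just gives 1, and the remaining two are

$$e_{W_2} = \delta_2\, x^{T}, \qquad e_{x} = W_2^{T}\, \delta_2.$$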

Note that as this is the second layer of the network, x will really be the output of the first layer. This also means that the error we backpropagate through the first layer, the e value, will be the error of the x we found in the second layer. It then follows that to find the error of the first layer, we just take the errors found for the second layer and replace e with our error of x.

This lets us skip the intermediate steps of wx + b, wx, and so on.

Error of b1.
Error of W1.
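That is, reusing the layer-2 results with e replaced by the error of x (again in my notation, where x is now the original network input):

$$\delta_1 = \big(W_2^{T}\,\delta_2\big) \odot \sigma'(W_1 x + b_1), \qquad e_{b_1} = \delta_1, \qquad e_{W_1} = \delta_1\, x^{T}$$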

This will undoubtedly be confusing at first, but like anything, practice makes perfect. I’ll again draw your attention to this video from Stanford University that goes deeper into this process, if you’re still struggling.

Enough with the math, let’s code it up!

As mentioned in my previous article, for the sake of clarity I will only be using NumPy to implement the structure of these networks. In this particular tutorial, however, I will also be using Matplotlib to visualize the network’s inputs. Therefore, go ahead and import NumPy and Matplotlib.

Import NumPy & Matplotlib.
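In code that’s just two lines (I alias pyplot as plt, the usual convention):

```python
import numpy as np
import matplotlib.pyplot as plt
```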

A new addition to our code this time around is our activation function, sigmoid.

Activation function.

This function takes two inputs: input and derivative, which is set to False by default. input will be our actual mathematical input, whereas derivative will be a boolean value which tells our function whether to return the sigmoid of our input or its derivative (σ(x) or σ’(x)). If you’re curious as to how the derivative of the sigmoid function is derived, this is a great place to start, but for our purposes, it’s a negligible detail.
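Here’s a minimal sketch of what that function might look like (the exact body in the original may differ slightly):

```python
def sigmoid(input, derivative=False):
    # Standard logistic function: sigma(x) = 1 / (1 + e^(-x)).
    output = 1 / (1 + np.exp(-input))
    if derivative:
        # sigma'(x) = sigma(x) * (1 - sigma(x)).
        return output * (1 - output)
    return output
```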

Multilayer Perceptron class.

Next we can define our Multilayer Perceptron class, which takes 5 inputs. input_size is the length of our input, hidden_size is how large we want our hidden layer to be, output_size is the length of our output, and num_epochs and learning_rate are our hyperparameters, which control how long and how fast we learn respectively. In our initialization function, we also create the layers of the network, where w1 and b1 are our weights and biases for the first layer, and w2 and b2 are our weights and biases for the second layer.
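A sketch of the constructor, assuming random weight initialization and zero biases (the class name and the exact initialization here are my own choices):

```python
class MultilayerPerceptron:
    def __init__(self, input_size, hidden_size, output_size, num_epochs, learning_rate):
        self.num_epochs = num_epochs
        self.learning_rate = learning_rate

        # Layer 1: input -> hidden.
        self.w1 = np.random.randn(hidden_size, input_size)
        self.b1 = np.zeros(hidden_size)

        # Layer 2: hidden -> output.
        self.w2 = np.random.randn(output_size, hidden_size)
        self.b2 = np.zeros(output_size)
```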

Forward propagation function.

As mentioned previously, the forward() function is pretty straightforward: our layer1_output is W1x + b1, activation1_output is the sigmoid of that, layer2_output is W2x + b2 where x is now activation1_output, and activation2_output is the sigmoid of layer2_output. Notice that we use class variables to store these values instead of returning them from the function. This keeps our network classes much cleaner, as we will need each of these variables for backpropagation.
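Continuing the class above, the forward pass might look like this (the stored attributes mirror the names used in the paragraph):

```python
    def forward(self, input):
        # Layer 1: W1 x + b1, then sigmoid.
        self.layer1_output = self.w1 @ input + self.b1
        self.activation1_output = sigmoid(self.layer1_output)

        # Layer 2: W2 a1 + b2, then sigmoid.
        self.layer2_output = self.w2 @ self.activation1_output + self.b2
        self.activation2_output = sigmoid(self.layer2_output)
```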

Backpropagation function.

Now for backpropagation. Our backward() function takes error and input, which are the final error of the network and the original network input respectively. error2 is the error of b2, which is the error of the network times the sigmoid derivative of the second layer output. dw2 is the error for the weights on the second layer, which is equal to the error of b2 multiplied by the input of the second layer. Then dw1 and error1 are the errors for the weights and biases of the first layer, which are calculated as shown above. Note that the extra operations are only there to fix the sizes of the matrices for the necessary calculations: .T transposes a matrix and .flatten() turns a (1, n) 2-dimensional matrix into an array of length n.
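A sketch of the backward pass. I’ve folded the weight updates in at the end, though the original may apply them in train() instead, and here np.outer does the shape juggling that the original handles with .T and .flatten():

```python
    def backward(self, error, input):
        # Error of b2: network error times the sigmoid derivative of the layer 2 output.
        error2 = error * sigmoid(self.layer2_output, derivative=True)
        # Error of w2: error of b2 times the input to layer 2 (the first layer's activation).
        dw2 = np.outer(error2, self.activation1_output)

        # Error of b1: error2 backpropagated through w2, times the first sigmoid's derivative.
        error1 = (self.w2.T @ error2) * sigmoid(self.layer1_output, derivative=True)
        # Error of w1: error of b1 times the original network input.
        dw1 = np.outer(error1, input)

        # Nudge every parameter along its error (assumes error = target - prediction).
        self.w1 += self.learning_rate * dw1
        self.b1 += self.learning_rate * error1
        self.w2 += self.learning_rate * dw2
        self.b2 += self.learning_rate * error2
```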

Train & test functions.

The train() function for the Multilayer Perceptron is identical to the single-layer Perceptron and therefore requires no explanation. The test() function is also quite similar between the two Perceptrons, iterating through inputs and forward propagating to get a prediction. The difference is that this time we’ll use Matplotlib to visualize our input and determine manually whether or not our network is accurate.
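Roughly, the pair might look like this; the plotting calls in test() are my own choice:

```python
    def train(self, inputs, targets):
        for epoch in range(self.num_epochs):
            for input, target in zip(inputs, targets):
                self.forward(input)
                # The network's error: how far the prediction is from the target.
                error = target - self.activation2_output
                self.backward(error, input)

    def test(self, inputs):
        for input in inputs:
            self.forward(input)
            # Show the 5x6 input image (swap the reshape dimensions if your images
            # are stored the other way) and print the network's best guess.
            plt.imshow(np.reshape(input, (6, 5)), cmap="gray")
            plt.title(f"Prediction: {np.argmax(self.activation2_output)}")
            plt.show()
```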

Multilayer Perceptron initialization and utilization.

Finally, we initialize our network. Our input_size is set to 30, because our network is being trained on 5x6 images. These images can be seen below.

Inputs for our network.

Our hidden_size is set to 5 (as always, I’d recommend playing around with some of these numbers to see how they affect training; a good rule of thumb is for the hidden size to fall somewhere between the input and output sizes). The output_size is set to 3, as there are three possible classifications for each input, and num_epochs and learning_rate are set to 1000 and 0.1, again somewhat arbitrarily.
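Putting it all together might look like the following; train_inputs, train_targets and test_inputs are placeholders here, standing in for the flattened 5x6 images and their labels in the full code linked at the end of the article:

```python
mlp = MultilayerPerceptron(input_size=30, hidden_size=5, output_size=3,
                           num_epochs=1000, learning_rate=0.1)

# train_inputs: flattened 5x6 images, train_targets: their length-3 label vectors.
mlp.train(train_inputs, train_targets)

# Visualize each test image alongside the network's prediction.
mlp.test(test_inputs)
```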

That brings us to the end of the second article in this series. I hope you found this useful, or at least interesting. Feel free to share this article with your friends and colleagues and stay tuned for my next one. Full code for this article can be found here.

A big thanks to Emily Hull for editing.
