When I decided I wanted to understand neural networks (NNs), I assumed that each new NN was crafted from scratch with a unique structure. I was disappointed to find out that NNs (outside of research into novel algorithms) are generally built using specialty libraries like TensorFlow and PyTorch, which let you set up complex neural networks in just a few lines of code. At the time, that seemed to take all the fun out of it.
Since then I have learned that it would be absurd to code each NN from scratch: partly because it would be pointless to rewrite code that can be generalized for later use, and partly because libraries like TensorFlow use CUDA, which lets the GPU perform efficient, massively parallel matrix computations.
Even so, I still wanted to code my own NN using (nearly) base Python.
For this project I wanted the code to be as simple as possible (so no complex class hierarchies!), just to prove to myself how mechanical these algorithms are. The only exception was NumPy, without which the program would have run for longer than the universe has existed.
I went with the classic, simple dataset of handwritten digits, as I'm not trying to break any paradigms or predict the future; I just wanted my code to work as intended. That meant 400 input parameters, one for the greyscale value of each pixel in an image, of which there were 5000.
The only complexity I added was redundant hidden layers. For the task of identifying digits I could have got away with one hidden layer, but this is an experiment, so why not use four layers with many more weights than I could possibly need? Strangely, it did not increase the complexity of the code at all.
I started by defining the activation functions. I chose ReLU for the hidden layers and sigmoid for the output layer, with softmax:
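In NumPy these can be sketched roughly as follows (the max-subtraction in softmax is a common numerical-stability trick; exact details here are illustrative):

```python
import numpy as np

def relu(z):
    # ReLU: max(0, z), applied elementwise
    return np.maximum(0, z)

def sigmoid(z):
    # Logistic sigmoid: squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the column-wise max before exponentiating for stability,
    # then normalize each column so it sums to 1
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)
```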
Next I wrote a function to initialize random weights:
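A common choice is small uniform values centred on zero, which breaks the symmetry between units; a sketch (the epsilon value is illustrative):

```python
import numpy as np

def init_weights(n_in, n_out, epsilon=0.12):
    # Uniform random values in [-epsilon, epsilon); small enough to keep
    # the initial activations in a sensible range, nonzero to break symmetry
    return np.random.rand(n_out, n_in) * 2 * epsilon - epsilon
```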
and then I initialized the weights and biases:
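With deliberately oversized layers, that looks something like this (the layer sizes here are illustrative):

```python
import numpy as np

np.random.seed(0)
# 400 input pixels, four (deliberately oversized) hidden layers,
# and 10 output classes for the digits 0-9
layer_sizes = [400, 64, 64, 64, 64, 10]

weights, biases = [], []
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    # Small uniform weights around zero; biases start at zero
    weights.append(np.random.rand(n_out, n_in) * 0.24 - 0.12)
    biases.append(np.zeros((n_out, 1)))
```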
next the cost function:
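A cross-entropy cost with an L2 penalty can be sketched like this (the regularization term is an assumption, based on the regularization mentioned later in the post):

```python
import numpy as np

def cost(predictions, labels, weights, lam=1.0):
    # Cross-entropy averaged over the m training examples, plus an L2
    # penalty on the weights; lam controls the regularization strength
    m = labels.shape[1]
    eps = 1e-12  # avoid taking log(0)
    ce = -np.sum(labels * np.log(predictions + eps)
                 + (1 - labels) * np.log(1 - predictions + eps)) / m
    reg = lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)
    return ce + reg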
Forward propagation actually came down to a single line of code, repeated for every layer:
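Each layer is just `a = g(W @ a + b)` for some activation `g`; looping over the layers gives a forward pass like this sketch:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights, biases):
    # X has shape (n_features, m). Each layer is one matrix multiply,
    # a bias add, and an activation: ReLU for hidden layers, sigmoid last.
    activations = [X]
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = W @ activations[-1] + b
        g = sigmoid if i == len(weights) - 1 else relu
        activations.append(g(z))
    return activations
```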
The thing that made me nervous was backpropagation; I never feel confident that I have fully grasped the concept. However, coding it this way showed me that it is essentially the same four lines of code for every layer:
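A sketch of that per-layer pattern (assuming a sigmoid output with cross-entropy, so the first error term is simply `a - y`):

```python
import numpy as np

def backward(activations, Y, weights):
    # activations comes from the forward pass; Y holds one-hot labels.
    # With sigmoid + cross-entropy at the output, the first delta is a - y.
    m = Y.shape[1]
    grads_W, grads_b = [], []
    delta = activations[-1] - Y
    for i in reversed(range(len(weights))):
        dW = delta @ activations[i].T / m              # weight gradient
        db = np.sum(delta, axis=1, keepdims=True) / m  # bias gradient
        if i > 0:
            # Propagate the error back through the ReLU hidden layers;
            # (a > 0) is the ReLU derivative evaluated at the pre-activation
            delta = (weights[i].T @ delta) * (activations[i] > 0)
        grads_W.insert(0, dW)
        grads_b.insert(0, db)
    return grads_W, grads_b
```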
Then I used the derivatives calculated during backpropagation to update the weights:
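The update itself is plain gradient descent; a sketch (parameters and gradients are NumPy arrays):

```python
import numpy as np  # weights, biases, and gradients are NumPy arrays

def gradient_step(weights, biases, grads_W, grads_b, lr=0.1):
    # Vanilla gradient descent: move each parameter against its gradient,
    # scaled by the learning rate
    for i in range(len(weights)):
        weights[i] -= lr * grads_W[i]
        biases[i] -= lr * grads_b[i]
```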
And finally, I put it all together in a surprisingly small number of lines:
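A condensed, self-contained sketch of the whole loop, run on synthetic stand-in data (the real run used the 5000 digit images, and more layers; one hidden layer keeps this demo fast):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
# Stand-in data: 100 "images" of 400 pixels, labelled by which of the
# first 10 pixels is brightest (a made-up but learnable task)
X = np.random.rand(400, 100)
labels = np.argmax(X[:10], axis=0)
Y = np.eye(10)[:, labels]  # one-hot, shape (10, 100)

sizes = [400, 25, 10]
weights = [np.random.rand(o, i) * 0.24 - 0.12
           for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((o, 1)) for o in sizes[1:]]

m, lr, costs = Y.shape[1], 0.1, []
for step in range(300):
    # Forward pass: one matrix multiply + activation per layer
    acts = [X]
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = W @ acts[-1] + b
        acts.append(sigmoid(z) if i == len(weights) - 1 else relu(z))
    # Track the cross-entropy cost being minimized
    eps = 1e-12
    costs.append(-np.mean(np.sum(Y * np.log(acts[-1] + eps)
                                 + (1 - Y) * np.log(1 - acts[-1] + eps),
                                 axis=0)))
    # Backward pass and gradient descent step, layer by layer
    delta = acts[-1] - Y
    for i in reversed(range(len(weights))):
        dW = delta @ acts[i].T / m
        db = np.sum(delta, axis=1, keepdims=True) / m
        if i > 0:
            delta = (weights[i].T @ delta) * (acts[i] > 0)
        weights[i] -= lr * dW
        biases[i] -= lr * db
```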
And it worked!
So. The code works. Well, the cost is decreasing… I could let this run for another 5000 iterations to make sure an optimum has been reached, but it took 5 minutes to learn this much. Realistically, this is not a good way to build a model: bugs are no easy fix, optimizing the algorithm is far more complicated than it needs to be, and, crucially, every iteration takes an age. So, in conclusion, I'm really glad we have libraries like TensorFlow today, and I look forward to trying to master them in the future.
Update: getting it to learn
I decided that I actually wanted this to work, so I updated a few features. After a few tries I realized that I was suffering from vanishing gradients, so I had to update my weight initialization function. Oops!
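I switched to something like He initialization, which scales the random weights by the layer's fan-in so the signal doesn't shrink as it passes through many ReLU layers (a sketch):

```python
import numpy as np

def init_weights_he(n_in, n_out):
    # He initialization: sampling from N(0, 2/n_in) keeps the variance of
    # activations roughly constant through ReLU layers, so the gradients
    # flowing back through a deep network neither vanish nor explode
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
```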
I also reduced the number of weights to a sensible number so that each pass wasn't unnecessarily long.
And I removed the softmax function, as it wasn't helping anything.
Lastly, I reduced the strength of the regularization.
This is a reminder that even if implementing a neural net is relatively straightforward, getting it to actually learn is a little more finicky and requires some tinkering.