TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Building an Artificial Neural Network using pure Numpy

6 min read · Nov 11, 2018



Neural networks have been all the hype due to their stellar performance in several domains like computer vision and natural language processing. Not only do they outperform their peers by huge margins, but they are also extremely versatile and are used in almost every field imaginable.

But what the hell do these two words mean?

In this brief post, we’ll do a deep dive into the concept of neural networks and then code our own in Python using pure NumPy to classify MNIST digits (It’ll be fun, I promise). I’ll try to keep it as short and concise as possible, primarily to prevent you from closing this tab due to my unprofessional writing style. So let’s jump straight in:

What the hell is a Neural Network?

As the name suggests, a neural network is a computational system inspired by the biological neural networks in our brains. If you didn't sleep through your biology classes like me, you might remember that the network in our brains consists of a crazy amount of neurons.


For our purposes, we can model this neuron as a function that takes in a bunch of inputs, computes a weighted sum of them using some weights, adds a bias, and outputs a number based on some activation function. Makes sense? I thought so too lol.

Mathematical working of a single neuron
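As a quick illustration, here is that single neuron in NumPy (a minimal sketch; the sigmoid choice and the example numbers are just placeholders):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs, plus a bias, passed through an activation
    return sigmoid(np.dot(inputs, weights) + bias)

# Example: a neuron with 3 inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
print(neuron(x, w, bias=0.3))
```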

The weights can be thought of as a bunch of knobs that we can tweak to get different outputs.

The bias is another knob that decides when a neuron stays inactive, or in other words, it decides how high the weighted sum needs to be for the neuron to be meaningfully active.

The activation function is a function that maps the raw output of the weighted sum (the logits) to a specific range of values. It's usually used to add some non-linearity to our model. This allows the network to combine the inputs in more complex ways and in turn provides a richer capability in the functions it can model. The most commonly used activation functions are sigmoid, softmax, ReLU, and tanh.
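For reference, a few of those activations written out in NumPy (nothing more than the textbook definitions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                # squashes values into (-1, 1)

def relu(z):
    return np.maximum(0, z)          # zeroes out negatives, keeps positives
```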

Structure of the network

Now that we know how a single neuron works, we can connect neurons together in layers to form a network. So an artificial neural network is just an overrated composite function.

A simple neural network, also called a multilayer perceptron

A typical neural network consists of 3 types of layers:

  1. The input layer: The given data points are fed into this layer. There can be only 1 input layer. The number of neurons in this layer is equal to the number of inputs.
  2. The hidden layers: This is the meat of the whole network. These are the layers that try to find patterns in the inputs to get the outputs we need. A network can have any number of hidden layers.
  3. The output layer: This layer gives us the predictions of the network, i.e. the outputs that the network thinks should be correct given its current parameters (the weights and biases of each neuron). The number of neurons in this layer is equal to the number of values we need to predict. Since our task is to classify MNIST digits, we will have 10 neurons as there are 10 digits to compute predictions for.

Therefore a basic network layer can be defined as:
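Something like this minimal base class, which every layer in the sketches below will build on (the names Layer and forward are my own placeholders, not a fixed API):

```python
import numpy as np

class Layer:
    """A basic building block: takes an input and produces an output."""

    def forward(self, inputs):
        # A do-nothing layer simply passes its input through unchanged
        return inputs
```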

The layer processes the input to finally produce an output value. This is called a forward pass on the layer.

In our implementation, we will be using two types of layers:

  1. Dense Layer — Where each neuron in a layer is connected to every neuron in the layer just after it.
  2. ReLU Activation Layer — A layer that sits on top of a Dense layer which applies a ReLU activation function to the outputs of the Dense layer. I could have used the most common sigmoid function, but I try to be edgy at times, so I’ll go with the ReLU function.
ReLU activation function

These 2 layers can be defined as:
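A possible NumPy version of both, building on the Layer base class above and showing only the forward pass for now (the weight initialization scale and the stored learning rate are arbitrary choices):

```python
class Dense(Layer):
    def __init__(self, n_inputs, n_outputs, learning_rate=0.1):
        # Small random weights and zero biases to start from
        self.weights = np.random.randn(n_inputs, n_outputs) * 0.01
        self.biases = np.zeros(n_outputs)
        self.learning_rate = learning_rate

    def forward(self, inputs):
        # Weighted sum plus bias for every neuron in the layer
        return np.dot(inputs, self.weights) + self.biases


class ReLU(Layer):
    def forward(self, inputs):
        # Element-wise max(0, x): negatives become 0, positives pass through
        return np.maximum(0, inputs)
```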

Training the network

So now that we have defined the architecture of the network, how in the world do we train it? You guessed it: by tweaking the parameters, i.e. the weights and biases.

Since our network has more than one neuron, and each neuron has a unique set of weights and biases, this gives us thousands of knobs to tweak. If you are a bit of a masochist who would voluntarily tweak these thousands of knobs by hand to find the best possible combination, go right ahead. If you are a normal person, we can use the gradient descent algorithm.


Gradient descent is a general algorithm that can be used to optimize any differentiable function. The way it works is that it calculates the gradient of the function at the current point. This gradient gives us the direction that increases the function the fastest (gradient ascent). But we usually need to minimize a function, so we reverse the direction of the computed gradient to get the direction that decreases it (gradient descent). If you are a little slow like me, you can visualize it as a ball rolling down a hill, which ends up at the lowest point due to gravity.
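Here is a tiny self-contained example of what one such descent looks like in code, minimizing f(x) = x², whose gradient is 2x (the learning rate and step count are arbitrary):

```python
x = 5.0                 # starting point
learning_rate = 0.1

for step in range(50):
    grad = 2 * x                   # gradient of f(x) = x^2 at the current point
    x -= learning_rate * grad      # step *against* the gradient to go downhill

print(x)  # ends up very close to 0, the minimum of f
```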

To optimize our network we need such a function, so we define a loss function. The one we use is called the log-softmax cross-entropy loss (being edgy again).

A scary-looking loss function

Let’s break it down word by word:

  1. Softmax

The softmax function takes an N-dimensional vector of real numbers and transforms it into a vector of real numbers in the range (0, 1) that add up to 1. It thus outputs a probability distribution, which makes it suitable for probabilistic interpretation in classification tasks.

Softmax function
Graph of Softmax function

  2. Cross-Entropy Loss

Cross-entropy indicates the distance between what the model believes the output distribution should be and what the true distribution actually is.

Cross-Entropy Loss

Let’s write it in code:
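One way to write both pieces with NumPy, using the max-subtraction / log-sum-exp trick for numerical stability (the function names and the batch-averaged gradient are my own choices, not a fixed convention):

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max so np.exp never overflows
    exps = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exps / exps.sum(axis=-1, keepdims=True)

def softmax_crossentropy_with_logits(logits, labels):
    # logits: (batch, n_classes) raw outputs; labels: (batch,) integer class indices
    # log(softmax(x)) = x - logsumexp(x), computed stably
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-probability of the correct class for each example
    return -log_probs[np.arange(len(labels)), labels]

def grad_softmax_crossentropy_with_logits(logits, labels):
    # Gradient w.r.t. the logits: softmax(logits) - one_hot(labels), averaged over the batch
    one_hot = np.zeros_like(logits)
    one_hot[np.arange(len(labels)), labels] = 1
    return (softmax(logits) - one_hot) / logits.shape[0]
```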

So the way we train the network is the following: The output of the network is compared to the expected output and the loss is calculated. This loss is then propagated back through the network, one layer at a time, and the weights and biases are updated according to the amount that they contributed to the error. This propagation is carried out by the backpropagation algorithm.

Now, this algorithm is pretty complex and would require a whole dedicated article to explain, so I'll just give you the gist because I am lazy.

So for every layer, to calculate the effect of the layer’s parameters on the total loss, we need to calculate the derivative of the loss with respect to these parameters. To ease our troubles, we can exploit the chain rule.

Backpropagation Cheat Sheet
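Concretely, for a dense layer computing y = xW + b with row-vector inputs, the chain rule gives the three gradients we need (writing ∂L/∂y for the gradient flowing in from the layer above; this layout convention matches the NumPy code below):

```latex
\frac{\partial L}{\partial W} = x^{\top}\frac{\partial L}{\partial y}, \qquad
\frac{\partial L}{\partial b} = \sum_{\text{batch}}\frac{\partial L}{\partial y}, \qquad
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\,W^{\top}
```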

So for every layer, we can add a backward pass, where the gradient from the layer after it is taken as input, used to calculate the required derivatives, and finally the gradient with respect to this layer's input is returned as the output:
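Building on the sketches above, the two layers could grow a backward method like this (redefining them in full here; performing the gradient descent update inside Dense.backward is one design choice among many):

```python
class ReLU(Layer):
    def forward(self, inputs):
        self.inputs = inputs                       # stash the input for the backward pass
        return np.maximum(0, inputs)

    def backward(self, grad_output):
        # Gradient flows through only where the input was positive
        return grad_output * (self.inputs > 0)


class Dense(Layer):
    def __init__(self, n_inputs, n_outputs, learning_rate=0.1):
        self.weights = np.random.randn(n_inputs, n_outputs) * 0.01
        self.biases = np.zeros(n_outputs)
        self.learning_rate = learning_rate

    def forward(self, inputs):
        self.inputs = inputs                       # stash the input for the backward pass
        return np.dot(inputs, self.weights) + self.biases

    def backward(self, grad_output):
        # Chain rule (see the cheat sheet above)
        grad_weights = np.dot(self.inputs.T, grad_output)
        grad_biases = grad_output.sum(axis=0)
        grad_inputs = np.dot(grad_output, self.weights.T)
        # Gradient descent step on this layer's own parameters
        self.weights -= self.learning_rate * grad_weights
        self.biases -= self.learning_rate * grad_biases
        return grad_inputs
```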

Running the code

The whole code, along with an accuracy plot, is given below.
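A training loop stitching the pieces above together might look roughly like this. It assumes X_train / y_train and X_val / y_val already exist as NumPy arrays (flattened 784-pixel images scaled to [0, 1] and integer labels, loaded however you prefer, e.g. via keras.datasets.mnist); the layer sizes, batch size, and epoch count are arbitrary choices:

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed to exist already: X_train, y_train, X_val, y_val
# (images of shape (n, 784) scaled to [0, 1], labels as integers 0-9)

network = [
    Dense(784, 100),
    ReLU(),
    Dense(100, 10),   # 10 output neurons, one per digit
]

def forward(network, X):
    # Run a forward pass through every layer, keeping each activation
    activations = []
    for layer in network:
        X = layer.forward(X)
        activations.append(X)
    return activations

def predict(network, X):
    logits = forward(network, X)[-1]
    return logits.argmax(axis=-1)

def train_step(network, X, y):
    # Forward pass through the whole network
    logits = forward(network, X)[-1]

    # Loss and its gradient with respect to the logits
    loss = softmax_crossentropy_with_logits(logits, y)
    grad = grad_softmax_crossentropy_with_logits(logits, y)

    # Backward pass: walk the layers in reverse, propagating the gradient
    for layer in reversed(network):
        grad = layer.backward(grad)
    return loss.mean()

val_accuracy = []
batch_size = 32
for epoch in range(10):
    # Shuffle and iterate over mini-batches
    indices = np.random.permutation(len(X_train))
    for start in range(0, len(X_train), batch_size):
        batch = indices[start:start + batch_size]
        train_step(network, X_train[batch], y_train[batch])

    acc = (predict(network, X_val) == y_val).mean()
    val_accuracy.append(acc)
    print(f"Epoch {epoch + 1}: validation accuracy {acc:.4f}")

plt.plot(val_accuracy)
plt.xlabel("Epoch")
plt.ylabel("Validation accuracy")
plt.show()
```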

Accuracy Plot

Thanks for sticking around till the end. Writing this article sure helped me solidify some of these complex concepts in my brain. This is the first time I have tried to write a technical article, so if anyone has any pointers, leave a response below!
