Neural network from scratch

Prashant Singh
Published in Analytics Vidhya · Aug 25, 2020

Photo by Artyom Kim on Unsplash

Nowadays, neural networks are a big name in the tech space, much like Darth Vader is in Star Wars. A lot of techies admire the beauty of neural nets but at the same time treat them like a black box: things go in and things come out. So I decided to write this article to explain the intuition, theory and mathematical structure behind neural networks for a better understanding.

The idea behind a neural network is the mapping of neurons inside the human brain. The human brain has a huge number of neurons (86 billion according to a recent study) that it uses to perform a wide variety of tasks. The brain has three main parts: the cerebrum, the cerebellum and the brain stem.

Cerebrum: is the largest part of the brain and is composed of right and left hemispheres. It performs higher functions like interpreting touch, vision and hearing, speech, reasoning, emotions, learning, and fine control of movement.

Cerebellum: is located under the cerebrum. Its function is to coordinate muscle movements, maintain posture, and balance.

Brain stem: acts as a relay center connecting the cerebrum and cerebellum to the spinal cord. It performs many automatic functions such as breathing, heart rate, body temperature, wake and sleep cycles and swallowing etc.

For a majority of researchers, ‘Why’ has always been an important question, and the same goes for neural nets: ‘If we have a map of neurons in our head, then why can’t we implement it digitally?’ This led to the inception of neural networks in 1944 by Warren McCullough and Walter Pitts, two University of Chicago researchers who moved to MIT in 1952 as founding members of what’s called the first cognitive science department.

From here we’ll dive deep into theory & mathematics behind these networks.

What is a neural network?

There are a lot of definitions out there emphasizing the resemblance of a digital neuron (perceptron) to a human neuron, so instead of going there I’ll try to keep it fresh and more mathematical.

A neural network is a set of algorithms that takes a mathematical quantity as input, processes it and yields an output, while continuously updating the weights (between neurons) and biases (within neurons) so as to reduce the uncertainty of the prediction.

Applications of Artificial Neural Networks:

  1. Anomaly detection
  2. Speech recognition
  3. Classification of data
  4. Time series analysis
  5. Computer vision
  6. And many more.

Mathematical structure of a neural network:

A neural network, like a neuron, is divided into 3 stages: an input stage, a processing stage and an output stage.

Perceptron: A perceptron is a neural network without any hidden layer. A perceptron has only an input layer and an output layer.
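In symbols, a single perceptron takes an input x, multiplies it by a weight w, adds a bias b and passes the result through an activation function σ: output = σ(w · x + b).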

The example network illustrated here has an input layer, 2 hidden layers and an output layer. The input layer contains 5 perceptrons, the hidden layers contain 5 (3 + 2) perceptrons between them, and the output layer has 1 perceptron.

A neural network consists of the following components:

  • An input layer, x
  • An arbitrary number of hidden layers (2 in this case)
  • An output layer, ŷ
  • A set of weights and biases between each layer, W and b
  • A choice of activation function for each hidden layer, σ.

A simple neural network class based on this theory:

A class with an input x, an output y, 2 weight matrices and 1 output prediction variable.
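A minimal sketch of such a class (assuming NumPy, a 4-neuron hidden layer and no bias terms, to keep it short) could look like this:

import numpy as np

class NeuralNetwork:
    def __init__(self, x, y):
        self.input = x                                            # training inputs
        self.weights1 = np.random.rand(self.input.shape[1], 4)    # input layer  -> hidden layer
        self.weights2 = np.random.rand(4, 1)                      # hidden layer -> output layer
        self.y = y                                                # true outputs
        self.output = np.zeros(self.y.shape)                      # predicted outputs ŷ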

Theory behind neural networks:

The structure of a neural net has already been explained above. Now let’s focus on how this arrangement works.

Training the neural network: The process of fine-tuning the weights and biases of the network for the input data is known as training the neural network. This training involves 2 processes, which are described below.

The backbone of any neural network is the feedforward process and backpropagation. Combined, these two give some of the best predictions in the machine learning landscape.

Feedforward & Backpropagation

Feedforward process: The process of multiplying an input value by a weight, adding a bias, and passing the result through an activation function at each layer in order to predict the output is known as the feedforward process. In simple terms, calculating the predicted output ŷ is known as feedforward.

Illustration of the process for a 2 layer system

The output of a 2 layer neural network is :
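ŷ = σ(W₂ · σ(W₁ · x + b₁) + b₂), where W₁ and W₂ are the weight matrices, b₁ and b₂ the bias vectors, and σ the activation function applied at each layer.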

For a 3-layer or n-layer system, the equations can be extended by applying weights and biases to the output of the (n − 1)th layer, i.e. the previous layer.

In the expression above, there’s a function applied to the weighted input at each perceptron, and that function is known as the activation function. In our case, we’ve used the sigmoid function, which limits the output to a value between 0 and 1.
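The sigmoid itself is σ(z) = 1 / (1 + e^(−z)), which squashes any real-valued input z into the range (0, 1).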

Backpropagation: The continuous process of updating the weights and biases of the perceptrons is known as backpropagation. Backpropagation is used to reduce the error in the output prediction. The inputs are constant, so our only way to reduce the error is to update the weight and bias matrices until we obtain the desired values.

Edit: The input values are in the form of matrices, and all mathematical operations on these matrices are carried out element-wise, i.e. element by element. The equations described above and below are written for a scalar value; in practice, the same equations operate on each element of the input matrix.

Anyway, let’s get back to the topic. We need to update the weight and bias matrices, but with what values are we going to update them? This is where the cost function comes into the picture.

Cost function: It is a function that measures the performance of a Machine Learning model for given data. Cost Function quantifies the error between predicted values and expected values and presents it in the form of a single real number.

The goal of a predictive or classification system is to minimize the cost function so as to attain higher accuracy.

In our case, the cost function can be described in terms of the difference between the true output value (Y) and the predicted output value (Ŷ).

In this example, we’ll use the sum-of-squares error as our cost function, but I’ll also add code for different cost functions that I’ve written.

Sum-of-squares error: SSE = Σ (yᵢ − ŷᵢ)², summed over the training examples i.

The sum-of-squares error is simply the sum of the squared differences between each predicted value and the actual value. The difference is squared so that the error is measured by its magnitude, regardless of sign.

Now that we’ve found the error, we need a method to propagate the calculated error backwards in order to update the weights and biases. Here we’re going to need calculus, specifically the chain rule: the rule that lets us find the gradient (or change) of one quantity w.r.t. another when there are intermediate dependencies.

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function.
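Concretely, each weight is nudged a small step against the gradient of the loss: w ← w − η · ∂Loss/∂w, where η (the learning rate) is a small step size we choose.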

Now, if we want to update the weights and biases, we need to figure out how the loss function responds to a change in a weight (w), and for that we need to apply the chain rule, because there’s no direct relationship between the loss function and the weights of the network.

The gradient descent algorithm tries to find a local minimum on the graph: the value of the weights for which the cost function is at its lowest, i.e. for which the most accurate results can be predicted.
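Writing z = W · x + b for the weighted input (the notation here is mine) and using the sum-of-squares loss with a sigmoid activation, as above, the chain rule gives:

∂Loss/∂W = ∂Loss/∂ŷ · ∂ŷ/∂z · ∂z/∂W = 2(ŷ − y) · σ′(z) · x

where σ′(z) = σ(z) · (1 − σ(z)) is the derivative of the sigmoid.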

Here we have the derivative of the loss function w.r.t. the weights, which helps us backpropagate the error and update the weights and biases.

Now that the theory has been explained, we should go for an implementation in Python. So here I am adding 2 more functions to the neural net class.

Feedforward & Backpropagation
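As a sketch of how those two methods might look (assuming the NumPy class from earlier, sigmoid activation and the sum-of-squares loss, with biases again left out so the gradient expressions stay short):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(s):
    # s is already a sigmoid output, so the derivative is s * (1 - s)
    return s * (1.0 - s)

class NeuralNetwork:
    def __init__(self, x, y):
        self.input = x
        self.weights1 = np.random.rand(self.input.shape[1], 4)   # input  -> hidden
        self.weights2 = np.random.rand(4, 1)                     # hidden -> output
        self.y = y
        self.output = np.zeros(self.y.shape)

    def feedforward(self):
        # pass the input through both layers to get the prediction ŷ
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))
        self.output = sigmoid(np.dot(self.layer1, self.weights2))

    def backprop(self):
        # chain rule: derivative of the sum-of-squares loss w.r.t. weights2 and weights1
        d_output = 2 * (self.y - self.output) * sigmoid_derivative(self.output)
        d_weights2 = np.dot(self.layer1.T, d_output)
        d_weights1 = np.dot(self.input.T,
                            np.dot(d_output, self.weights2.T) * sigmoid_derivative(self.layer1))

        # nudge the weights in the direction that reduces the loss
        self.weights1 += d_weights1
        self.weights2 += d_weights2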

For more visual insights in neural networks, do watch this playlist.

Wrapping it up

Now that everything’s almost done, we should try to apply the basic neural net model to some input values.

X = [[0,0,1], [0,1,1], [1,0,1], [1,1,1]]  # Training set inputs

Y=[0,1,1,0] # Training set outputs

As the dataset is small, we should run the training, i.e. feedforward and backpropagation, several times. Let’s pick 2500 as our number of iterations.
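Putting it together (a sketch assuming the NeuralNetwork class above, with X and Y converted to NumPy arrays and Y reshaped into a column vector):

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])           # training set inputs
Y = np.array([[0], [1], [1], [0]])  # training set outputs as a column vector

nn = NeuralNetwork(X, Y)
for i in range(2500):               # 2500 training iterations
    nn.feedforward()
    nn.backprop()

print(nn.output)                    # predictions should end up close to Y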

The error per iteration decreases monotonically towards the minimum. Over-iterating can cause the error to increase again, as the model fits too tightly to the training data and hence performs poorly on the test data.

It’s done!!! Our model is working perfectly. After 2500 iterations, it has trained the model and fine-tuned the weights and biases, which gives predictions very close to our target values (Y).

finally we’re at peace

Extras

In this section, we’ll explore different activation and cost functions.

Activation functions

Activation functions are functions that limit the value of a weighted and biased input so that the value stays within a known range and the system can process it. They bound the output, and the value obtained is passed to the next layer as an input.

In the majority of cases, neural network practitioners use the sigmoid and ReLU (rectified linear unit) functions, but there are many others to explore. Keeping this in mind, I’ve written several activation functions which can be used in different circumstances. Here are a few of them.

Different types of activation functions
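For example, simple NumPy versions of a few common ones might look like this:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # squashes z into (0, 1)

def tanh(z):
    return np.tanh(z)                       # squashes z into (-1, 1)

def relu(z):
    return np.maximum(0, z)                 # 0 for negative z, identity otherwise

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)    # small slope instead of a hard zero

def softmax(z):
    e = np.exp(z - np.max(z))               # subtract the max for numerical stability
    return e / e.sum()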

I don’t find it suitable to discuss the mathematical formulae of these activation functions here, as it would increase the length of this article, which is already long, so I am posting this link, which helps in understanding activation functions.

Cost function

As I’ve already discussed what cost functions are, I guess we should go over the different types of them. Here’s a good source for understanding different types of cost functions: Source.

Some cost functions that I’ve written are mentioned below:
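For example (assuming NumPy, with y_true the target values and y_pred the predictions):

import numpy as np

def sum_of_squares_error(y_true, y_pred):
    return np.sum((y_true - y_pred) ** 2)

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))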

What’s next?

There are a lot of things to learn and understand in the neural network space, like convolutional networks for image classification, generative adversarial networks and many more. I’ll write about them as well in the future.

It’s always better to go beyond the bar and try to understand the algorithms running deeper behind black boxes.

It is literally the case that learning languages makes you smarter. The neural networks in the brain strengthen as a result of language learning.

The End
