# Building a Deep Neural Network from Scratch Using Python

This article is about building a deep neural network from scratch, without using libraries such as TensorFlow, Keras or PyTorch. It consists of two parts. In the first part, we will see what a deep neural network is, how it learns from data and the mathematics behind it. In the second part, we will build one from scratch using Python.

If you are familiar with the concepts of neural networks, feel free to skip the first part and jump straight to the “Building a Network to identify Handwritten Digits” section.

# What is a Deep Neural Network?

Before we look at what an artificial neuron and a neural network are, let’s see how a biological neural network functions.

A biological neural network is a network of interconnected neurons. Each neuron has **dendrites**, which gather information from the surrounding environment. The information reaches the neuron in the form of electrical or chemical signals. Once a neuron receives a signal, it processes it, and if the signal reaches a certain threshold, the neuron emits an output signal through an *axon*, which is connected to the next neuron. The next neuron, upon receiving the signal, does the same, and the process continues.

An **artificial neural network (ANN)** is loosely inspired by the biological neural network. It is a collection of connected artificial neurons. Just like a biological neuron, an artificial neuron takes input from other neurons, does some calculations and transmits the signal to the neurons connected to it.

A **deep neural network (DNN)** is an artificial neural network with multiple layers between the input and output layers. Each neuron in one layer connects to all the neurons in the next layer. The layers between the input and output layers are called hidden layers.

Each connection from a neuron in the previous layer to a neuron in the current layer has a weight **w**, which tells how sensitive the current neuron’s activation is to the activation of the neuron in the previous layer. Each neuron in a given layer also has a bias **b**. If you are familiar with linear regression, the bias term acts like the intercept “c” in **y = mx + c**. If the weighted sum of the inputs does not cross the threshold but the neuron needs to fire, the bias can be adjusted to effectively lower that neuron’s threshold and make it fire.

The whole network looks very complicated, right? But it’s not. Think of it as a giant function **y = f(x)**, where x is your input and y is the output. Inside, *f(x)* calls a chain of functions, where one function’s output is passed on to the next. These internal functions are nothing but the hidden layers.

Now let’s zoom in on a single artificial neuron. An artificial neuron has two parts. In the first part, it takes the inputs from the previous layer together with the corresponding weights and its bias, and applies a linear transformation to them. The linear transformation is nothing but the weighted sum of the inputs plus the bias.

In the second part, it passes the result of this linear transformation through a non-linear activation function such as *sigmoid* and emits the output of the activation function. There are various other activation functions, such as **ReLU**, but we are using sigmoid in this post. This combination of linear and non-linear transformations, stacked across multiple layers, is what makes a deep neural network powerful enough to fit complex data.

The sigmoid function takes the weighted sum and squashes it into a value between 0 and 1. As the input approaches **-infinity** the output approaches 0, and as the input approaches **+infinity** the output approaches 1. The value between 0 and 1 represents the activation strength of a particular neuron.

The activation of a neuron at a given layer can be written as below
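Following the description above (the weighted sum of the inputs plus the bias, passed through sigmoid), one way to write the activation of a single neuron is shown here. The symbols $x_i$ for the inputs from the previous layer, $w_i$ for the corresponding weights and $n$ for the number of inputs are notation assumed for this sketch.

```latex
z = \sum_{i=1}^{n} w_i x_i + b, \qquad
a = \sigma(z) = \frac{1}{1 + e^{-z}}
```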

In a typical neural network, we will have more than one neuron in a given layer. The above equation can be represented in matrix form to include all neurons.
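In matrix form, all the activations of a layer come out of a single matrix product. Here is a minimal NumPy sketch, assuming the inputs are column vectors. The names `layer_activation`, `W`, `b` and `a_prev` are illustrative and the numbers are made up.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_activation(W, b, a_prev):
    """Activation of a whole layer: sigmoid(W . a_prev + b).

    W has shape (neurons_in_layer, neurons_in_previous_layer),
    b has shape (neurons_in_layer, 1) and a_prev is a column vector.
    """
    z = np.dot(W, a_prev) + b
    return sigmoid(z)

# Illustrative values: a layer of 2 neurons fed by 3 inputs.
W = np.array([[0.2, -0.5, 0.1],
              [0.7, 0.3, -0.4]])
b = np.zeros((2, 1))
a_prev = np.array([[1.0], [0.5], [-1.0]])
a = layer_activation(W, b, a_prev)
print(a.shape)  # (2, 1): one activation per neuron in the layer
```

Each row of `W` holds the weights of one neuron, so adding a neuron to the layer just means adding a row to `W` and an entry to `b`.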

# Training Deep Neural Network

A deep neural network learns from the given data by itself and is then used to make predictions on unseen data. *But what do we mean by learning from data?*

As we already discussed, a DNN has a set of weights and biases at each layer. The activation of a neuron depends on the corresponding weights and biases. So learning from data means finding the best weights and biases for the network. *But how do we find the weights and biases?*

To find the weights and biases, a deep neural network does the following:

1. Assigns some random values to the weights and biases.
2. Runs the training data (which has inputs and actual outputs) through the network using these randomly assigned weights and biases. During this step, the output of the activation function in one layer is passed as input to the next layer until we get the output from the output layer. This process is called *forward propagation*.
3. The initial output from the network will usually be poor, since we have used random weights and biases. We compute the error (the difference between the network’s prediction and the actual output) using a cost or error function. In this post we are going to use the *Sum of Squared Errors*.
4. Since all neurons in the network contributed to the error above, the error proportion (error gradient) is passed back from the output layer to all layers except the input layer so that the weights and biases can be adjusted. This process of propagating the error to adjust the weights and biases is called *back propagation*.

Since the cost function is a function of the weights and biases, the error gradient is calculated using the partial derivatives of the cost function with respect to the weights and biases.

To understand this better, let’s take a simple network with one input layer, one hidden layer and one output layer. After the first pass of forward propagation, we will have the error. Now we need to pass the error proportion back to all neurons in all layers.

First, let’s calculate the error gradient for a small change in the weights and biases at the output layer. For simplicity, let’s write the cost as a composition of functions.
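Written as a composition, and borrowing the subscripts `o` (output layer) and `h` (hidden layer) that the pseudo code later in this post uses, the cost for a single training sample with one neuron per layer can be sketched as follows. Here $y$ is the actual output and $a_h$ is the activation of the hidden neuron.

```latex
z_o = w_o \, a_h + b_o, \qquad
a_o = \sigma(z_o), \qquad
C = (a_o - y)^2
```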

It’s time to refresh our high school/college multivariate calculus and find the partial derivatives of the cost function C with respect to both the weight and the bias. Using the **chain rule**, the partial derivatives of C w.r.t. w and b can be written as follows.
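Assuming the cost of one sample is $C = (a_o - y)^2$ with $a_o = \sigma(z_o)$ and $z_o = w_o a_h + b_o$ (the `o`/`h` subscripts match the pseudo code later in this post), the chain rule factors each derivative into three links:

```latex
\frac{\partial C}{\partial w_o}
  = \frac{\partial C}{\partial a_o}\,
    \frac{\partial a_o}{\partial z_o}\,
    \frac{\partial z_o}{\partial w_o},
\qquad
\frac{\partial C}{\partial b_o}
  = \frac{\partial C}{\partial a_o}\,
    \frac{\partial a_o}{\partial z_o}\,
    \frac{\partial z_o}{\partial b_o}
```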

The partial derivatives of each component in the above equation are
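For a single sample with $C = (a_o - y)^2$, $a_o = \sigma(z_o)$ and $z_o = w_o a_h + b_o$ (notation assumed to match the pseudo code later in this post), the individual links work out to:

```latex
\frac{\partial C}{\partial a_o} = 2\,(a_o - y), \qquad
\frac{\partial a_o}{\partial z_o} = \sigma'(z_o) = \sigma(z_o)\,\bigl(1 - \sigma(z_o)\bigr), \qquad
\frac{\partial z_o}{\partial w_o} = a_h, \qquad
\frac{\partial z_o}{\partial b_o} = 1
```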

Substituting the above partial derivatives into the chain rule equation, the error gradients at the output layer w.r.t. the weights and biases are
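In the notation of the pseudo code later in this post ($a_o$ is the prediction, $a_h$ the hidden activation, $z_o$ the weighted input of the output neuron), the result for one training sample should read:

```latex
\frac{\partial C}{\partial w_o} = 2\,(a_o - y)\,\sigma'(z_o)\,a_h, \qquad
\frac{\partial C}{\partial b_o} = 2\,(a_o - y)\,\sigma'(z_o)
```

The common factor $2\,(a_o - y)\,\sigma'(z_o)$ is what the pseudo code calls `delta`.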

Now, let’s calculate the error gradients for the hidden layer.
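For the hidden layer the chain is two links longer, because the hidden weight influences the cost only through the hidden activation $a_h = \sigma(z_h)$ with $z_h = w_h x + b_h$. A sketch for a single sample with one neuron per layer, assuming $C = (a_o - y)^2$ and $z_o = w_o a_h + b_o$:

```latex
\begin{aligned}
\frac{\partial C}{\partial w_h}
  &= \frac{\partial C}{\partial a_o}\,
     \frac{\partial a_o}{\partial z_o}\,
     \frac{\partial z_o}{\partial a_h}\,
     \frac{\partial a_h}{\partial z_h}\,
     \frac{\partial z_h}{\partial w_h}
   = 2\,(a_o - y)\,\sigma'(z_o)\,w_o\,\sigma'(z_h)\,x \\
\frac{\partial C}{\partial b_h}
  &= 2\,(a_o - y)\,\sigma'(z_o)\,w_o\,\sigma'(z_h)
\end{aligned}
```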

Note: Although L and L-1 represent the output layer and the hidden layer respectively, I have used the subscripts ‘o’ for the output layer and ‘h’ for the hidden layer to be clearer.

Similarly we can calculate the error gradient at all hidden layers if we have more than one. Since we only have one hidden layer, the back propagation stops here.
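A quick way to gain confidence in chain-rule gradients is to compare them with finite differences. The sketch below does that for a network with one neuron per layer and a squared-error cost on a single sample. The parameter values, the sample and the helper names are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(w_h, b_h, w_o, b_o, x, y):
    """Squared error of a 1-1-1 network for one sample."""
    a_h = sigmoid(w_h * x + b_h)
    a_o = sigmoid(w_o * a_h + b_o)
    return (a_o - y) ** 2

def analytic_gradients(w_h, b_h, w_o, b_o, x, y):
    """Chain-rule gradients, in the order (w_h, b_h, w_o, b_o)."""
    z_h = w_h * x + b_h
    a_h = sigmoid(z_h)
    z_o = w_o * a_h + b_o
    a_o = sigmoid(z_o)
    delta_o = 2.0 * (a_o - y) * sigmoid_prime(z_o)   # output layer error
    delta_h = delta_o * w_o * sigmoid_prime(z_h)     # error passed back to hidden layer
    return [delta_h * x, delta_h, delta_o * a_h, delta_o]

def numeric_gradients(params, x, y, eps=1e-6):
    """Central finite differences, perturbing one parameter at a time."""
    grads = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((cost(*plus, x, y) - cost(*minus, x, y)) / (2 * eps))
    return grads

params = [0.4, -0.1, 0.7, 0.2]   # illustrative w_h, b_h, w_o, b_o
x, y = 1.5, 1.0                  # one illustrative training sample
analytic = analytic_gradients(*params, x, y)
numeric = numeric_gradients(params, x, y)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)  # should be tiny if the derivation is right
```

If the two sets of gradients disagree by more than rounding noise, the derivation (or the code) has a mistake somewhere.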

5. The forward and backward propagation above are repeated iteratively, and the weights and biases are adjusted until we find the optimal values. Instead of a brute-force search, we will use the **gradient descent** algorithm.

## Gradient Descent

Gradient descent is an iterative optimization technique that can find a *minimum* of a function. It is used when finding the optimal values of a function’s parameters algebraically is difficult.

**Intuition:** Imagine a person standing on the steep slope of a valley. The person wants to get to the bottom of the valley but doesn’t know which direction leads there. He takes one step and decides the next position based on the current position. If the step he took is towards the bottom, he continues in that direction; otherwise he changes direction. He takes larger steps when the slope of the valley is steep, and smaller steps as he approaches the bottom. Finally, he stops once he reaches the bottom of the valley.

Our objective here is to find the optimum values of the weights and biases such that the cost function is at its minimum. The following steps are involved in the gradient descent algorithm:

1. Assign random values to the weights *w* and biases *b*, and a constant value to the learning rate.
2. Update the weights and biases using the gradient (which we calculated using partial derivatives) and the learning rate.
3. Repeat step 2 until we find the minimum value or reach the maximum number of iterations.
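The update in step 2 is usually written as `w = w - learning_rate * gradient`. The steps above can be sketched on a one-parameter cost as follows. The function `gradient_descent`, the cost C(w) = (w - 3)^2 and the fixed starting point (used instead of a random one so the run is repeatable) are all assumptions made for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1,
                     max_iterations=1000, tolerance=1e-8):
    """Minimise a one-parameter function given its derivative `grad`."""
    w = start
    for _ in range(max_iterations):
        step = learning_rate * grad(w)   # large slope, large step
        w = w - step
        if abs(step) < tolerance:        # steps have become negligible
            break
    return w

# Illustrative cost C(w) = (w - 3)^2, whose derivative is 2 * (w - 3)
# and whose minimum is at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), start=10.0)
print(round(w_min, 4))  # close to 3.0
```

Because the step is proportional to the gradient, it shrinks automatically as the slope flattens near the minimum, just like the person in the valley.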

## Training Summary

Let’s summarize the whole training process by writing pseudo code for a network that has 1 input, 1 hidden and 1 output layer.

    initialize_weights_and_biases():
        output_w = initialize_random_w
        output_b = initialize_random_b
        hidden_w = initialize_random_w
        hidden_b = initialize_random_b

    train(x_train, y_train, no_of_iterations, learning_rate):
        # 1. initialize network weights and biases
        initialize_weights_and_biases()

        # Run gradient descent algorithm no_of_iterations times
        for iteration in range(no_of_iterations):
            # reset the accumulated deltas of weights and biases
            wo_delta = 0
            bo_delta = 0
            wh_delta = 0
            bh_delta = 0

            # Iterate through each sample in the training data
            for x, y in zip(x_train, y_train):
                # 2. forward propagation
                z_h = hidden_w * x + hidden_b
                a_h = sigmoid(z_h)
                z_o = output_w * a_h + output_b
                predicted = sigmoid(z_o)

                # 3. find the error
                error = (predicted - y)

                # 4. back propagate the error
                delta = 2 * error * sigmoid_prime(z_o)
                wo_delta += delta * a_h
                bo_delta += delta
                wh_delta += delta * output_w * sigmoid_prime(z_h) * x
                bh_delta += delta * output_w * sigmoid_prime(z_h)

            # 5. after 1 pass of all the inputs, update the network weights
            output_w = output_w - learning_rate * wo_delta
            output_b = output_b - learning_rate * bo_delta
            hidden_w = hidden_w - learning_rate * wh_delta
            hidden_b = hidden_b - learning_rate * bh_delta
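For readers who want to run the training loop, here is one possible runnable version of the same idea for a network with a single neuron per layer. It uses only the standard library. The toy data set (negative inputs labelled 0, positive inputs labelled 1), the fixed random seed and the helper names `forward` and `total_cost` are assumptions made for this sketch.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(params, x):
    """Return (z_h, a_h, z_o, predicted) for one scalar input."""
    hidden_w, hidden_b, output_w, output_b = params
    z_h = hidden_w * x + hidden_b
    a_h = sigmoid(z_h)
    z_o = output_w * a_h + output_b
    return z_h, a_h, z_o, sigmoid(z_o)

def total_cost(params, x_train, y_train):
    """Sum of squared errors over the whole training set."""
    return sum((forward(params, x)[3] - y) ** 2
               for x, y in zip(x_train, y_train))

def train(x_train, y_train, no_of_iterations, learning_rate, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # 1. initialize weights and biases: [hidden_w, hidden_b, output_w, output_b]
    params = [rng.uniform(-1, 1) for _ in range(4)]
    for _ in range(no_of_iterations):
        wh_delta = bh_delta = wo_delta = bo_delta = 0.0
        for x, y in zip(x_train, y_train):
            # 2. forward propagation
            z_h, a_h, z_o, predicted = forward(params, x)
            # 3. find the error and 4. back propagate it
            delta = 2 * (predicted - y) * sigmoid_prime(z_o)
            wo_delta += delta * a_h
            bo_delta += delta
            wh_delta += delta * params[2] * sigmoid_prime(z_h) * x
            bh_delta += delta * params[2] * sigmoid_prime(z_h)
        # 5. after one pass over all the inputs, update the parameters
        params[0] -= learning_rate * wh_delta
        params[1] -= learning_rate * bh_delta
        params[2] -= learning_rate * wo_delta
        params[3] -= learning_rate * bo_delta
    return params

x_train = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y_train = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
untrained = train(x_train, y_train, no_of_iterations=0, learning_rate=0.1)
trained = train(x_train, y_train, no_of_iterations=5000, learning_rate=0.1)
cost_before = total_cost(untrained, x_train, y_train)
cost_after = total_cost(trained, x_train, y_train)
print(cost_before, cost_after)  # the second value should be the smaller one
```

Calling `train` with zero iterations returns the initial random parameters, which gives a baseline cost to compare against.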

# Prediction

After training the neural network, we will have the learned values of the weights and biases at each layer. Prediction is nothing but performing one pass of forward propagation on the test data.

# Building a Network to identify Handwritten Digits

Enough of the theory, let’s get our hands dirty by writing a Python program to build a deep neural network. We are going to use the **MNIST** dataset and build a network that recognizes handwritten digits, the **hello world program of deep neural networks**.

The **MNIST** data consists of scanned images of handwritten digits, each 28 x 28 pixels in size.

Again we will consider building a network with **1 input layer, 1 hidden layer and 1 output layer**.

The following program is the Python version of the pseudo code we discussed above. The only difference is that we have introduced batches, because the MNIST data has 60000 rows of training data. Running all 60000 rows through the network before every weight update would be slow and memory hungry, so we update the weights after each small batch instead.

**\_\_init\_\_** initializes the weights and biases randomly for the output and hidden layers.

**forward_propagation** performs forward propagation for the given input.

**update_mini_batch** runs forward and back propagation for every record in the given batch. We sum the error deltas because we are using the sum of squared errors, whose partial derivative is the sum of the error gradients of all samples.

    o_del_b, h_del_b, o_del_w, h_del_w = self.backprop(x, y)
    o_b = o_b + o_del_b
    h_b = h_b + h_del_b
    o_w = o_w + o_del_w
    h_w = h_w + h_del_w

After every batch run, it will update the network weights and biases

    self.o_weights = self.o_weights - (l_rate/len(batch))*o_w
    self.h_weights = self.h_weights - (l_rate/len(batch))*h_w
    self.o_biases = self.o_biases - (l_rate/len(batch))*o_b
    self.h_biases = self.h_biases - (l_rate/len(batch))*h_b

**backprop** propagates the error gradient back to all layers except the input layer. It is the heart of the neural network. As we discussed earlier, we calculate the partial derivatives of the error function with respect to the weights and biases at each layer. In the code we have used the *.transpose()* method so that the operands follow the matrix multiplication rule (*A X B is only possible if A is an m x n matrix and B is an n x p matrix; the result is an m x p matrix*).

    delta = (predicted - y) * sigmoid_prime(z_o)
    o_del_b = delta
    o_del_w = np.dot(delta, a_h.transpose())

    delta = np.dot(self.o_weights.transpose(), delta) * sigmoid_prime(z_h)
    h_del_b = delta
    h_del_w = np.dot(delta, x.transpose())

The **fit** method trains the network. It takes the input, shuffles it randomly and splits the data into batches. It then invokes the update_mini_batch function for each batch, and repeats these steps for every epoch.
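To see how these methods could fit together, here is a compact sketch of such a class. It is not the original program: the class name `Network`, the constructor arguments, the `cost` helper, the seeds and the tiny two-input data set are assumptions made for illustration. In particular, the third argument of `fit` is treated here as the number of samples per batch, which may differ from how the original program defines it.

```python
import random
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

class Network:
    """Sketch of a network with one hidden layer (method names follow the text)."""

    def __init__(self, n_input, n_hidden, n_output, seed=0):
        rng = np.random.default_rng(seed)  # seeded only to make runs repeatable
        self.h_weights = rng.standard_normal((n_hidden, n_input))
        self.h_biases = rng.standard_normal((n_hidden, 1))
        self.o_weights = rng.standard_normal((n_output, n_hidden))
        self.o_biases = rng.standard_normal((n_output, 1))

    def forward_propagation(self, x):
        z_h = np.dot(self.h_weights, x) + self.h_biases
        a_h = sigmoid(z_h)
        z_o = np.dot(self.o_weights, a_h) + self.o_biases
        return z_h, a_h, z_o, sigmoid(z_o)

    def backprop(self, x, y):
        z_h, a_h, z_o, predicted = self.forward_propagation(x)
        delta = (predicted - y) * sigmoid_prime(z_o)
        o_del_b = delta
        o_del_w = np.dot(delta, a_h.transpose())
        delta = np.dot(self.o_weights.transpose(), delta) * sigmoid_prime(z_h)
        h_del_b = delta
        h_del_w = np.dot(delta, x.transpose())
        return o_del_b, h_del_b, o_del_w, h_del_w

    def update_mini_batch(self, batch, l_rate):
        # Accumulate the gradients of every sample in the batch.
        o_b = np.zeros_like(self.o_biases)
        h_b = np.zeros_like(self.h_biases)
        o_w = np.zeros_like(self.o_weights)
        h_w = np.zeros_like(self.h_weights)
        for x, y in batch:
            o_del_b, h_del_b, o_del_w, h_del_w = self.backprop(x, y)
            o_b = o_b + o_del_b
            h_b = h_b + h_del_b
            o_w = o_w + o_del_w
            h_w = h_w + h_del_w
        # One gradient descent step using the batch average.
        self.o_weights = self.o_weights - (l_rate / len(batch)) * o_w
        self.h_weights = self.h_weights - (l_rate / len(batch)) * h_w
        self.o_biases = self.o_biases - (l_rate / len(batch)) * o_b
        self.h_biases = self.h_biases - (l_rate / len(batch)) * h_b

    def fit(self, train_data, epochs, batch_size, l_rate, seed=0):
        shuffler = random.Random(seed)
        train_data = list(train_data)
        for _ in range(epochs):
            shuffler.shuffle(train_data)
            for start in range(0, len(train_data), batch_size):
                self.update_mini_batch(train_data[start:start + batch_size], l_rate)

    def cost(self, data):
        """Sum of squared errors over (input, target) pairs."""
        return sum(float(np.sum((self.forward_propagation(x)[3] - y) ** 2))
                   for x, y in data)

# Tiny made-up data set: 2 inputs, 2 output classes.
data = [
    (np.array([[1.0], [0.0]]), np.array([[1.0], [0.0]])),
    (np.array([[0.0], [1.0]]), np.array([[0.0], [1.0]])),
    (np.array([[0.9], [0.1]]), np.array([[1.0], [0.0]])),
    (np.array([[0.1], [0.9]]), np.array([[0.0], [1.0]])),
]
network = Network(2, 4, 2)
before = network.cost(data)
network.fit(data, epochs=300, batch_size=2, l_rate=1.0)
after = network.cost(data)
print(before, after)  # the cost should drop after training
```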

To read the MNIST data, we are going to use **fetch_openml** from the *sklearn.datasets* package. We will also use sklearn to split the data into train and test sets.

The MNIST data has digitized images of handwritten digits, so the pixel values range from 0 to 255. To normalize the data, divide the input by 255 so that the pixel values lie between 0 and 1.

`X = (X/255).astype('float32')`

Since each image is 28 x 28 pixels and the network expects its input as a vector, each image is reshaped to (784, 1), because 28 * 28 = 784.

`X = [np.reshape(x, (784, 1)) for x in X]`
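Putting the loading, normalizing and reshaping steps together could look roughly like this. The OpenML dataset name `"mnist_784"` and the helper names `load_mnist` and `prepare_inputs` are assumptions, since the post does not spell them out. The download needs network access, so the helper is tried here on random stand-in rows instead.

```python
import numpy as np

def load_mnist():
    """Download MNIST via scikit-learn (needs network access on first call)."""
    # Imported here so the helper below can be tried without scikit-learn installed.
    from sklearn.datasets import fetch_openml
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    return X, y.astype(int)   # labels arrive as strings, convert to integers

def prepare_inputs(X):
    """Scale pixel values to [0, 1] and turn each row into a (784, 1) column."""
    X = (np.asarray(X) / 255).astype("float32")
    return [np.reshape(x, (784, 1)) for x in X]

# Stand-in for two MNIST rows so the helper can be tried without a download.
fake_rows = np.random.default_rng(0).integers(0, 256, size=(2, 784))
columns = prepare_inputs(fake_rows)
print(len(columns), columns[0].shape)  # 2 (784, 1)
```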

The network we are going to build has 10 neurons in the output layer, as we need to identify the digits from 0 to 9. If the network identifies a given digit as 3, then the output neuron that is meant for 3 should have a value of 1 and all other neurons a value of 0.

`[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]`

Since the MNIST dataset has the y value in the form of a digit, we need to vectorize it (one-hot encode it) so that it is in the form mentioned above.
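A minimal sketch of this vectorization, assuming the label is a digit from 0 to 9 and the target should be a (10, 1) column to match the output layer. The function name `vectorize_label` is made up for this example.

```python
import numpy as np

def vectorize_label(digit, num_classes=10):
    """Return a (num_classes, 1) column with 1.0 at the digit's index."""
    vector = np.zeros((num_classes, 1))
    vector[int(digit)] = 1.0
    return vector

encoded = vectorize_label(3)
print(encoded.ravel().astype(int).tolist())  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```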

Let’s create a network object by specifying the number of neurons at each layer, and train the network with the train data.

Here we have a network with an input layer of 784 neurons, a hidden layer of 100 neurons (why 100 neurons? It’s a choice; we can use any number of neurons and see how the network behaves) and an output layer of 10 neurons.

`network.fit(train_data, 30, 10, 3.0)`

The above statement splits the input into 10 batches and runs for 30 iterations with a learning rate of 3. The number of iterations, the batch setting and the learning rate are hyperparameters of the network. We need to do hyperparameter tuning to find the best combination.

## Accuracy

To find the accuracy of the model against the test data, perform forward propagation for every test sample and take the index of the output neuron with the maximum value as the predicted digit. If it matches the y value of the test sample, the prediction is correct. The number of correct predictions divided by the total number of test samples gives the accuracy.
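The accuracy calculation can be sketched as below. The `predict` argument stands for any function that runs forward propagation and returns the (10, 1) output column. Both it and the stand-in predictor `always_three` are hypothetical names used only for this example.

```python
import numpy as np

def accuracy(predict, test_data):
    """Fraction of samples whose arg-max output equals the true digit.

    `predict` maps a (784, 1) input to a (10, 1) output column and
    `test_data` is a list of (input, digit) pairs.
    """
    correct = sum(int(np.argmax(predict(x)) == y) for x, y in test_data)
    return correct / len(test_data)

# Stand-in predictor for illustration: always favours the digit 3.
def always_three(x):
    out = np.zeros((10, 1))
    out[3] = 1.0
    return out

samples = [(np.zeros((784, 1)), 3), (np.zeros((784, 1)), 7)]
print(accuracy(always_three, samples))  # 0.5
```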

For the network we built, the test accuracy is **96.59 %**, which is very good.

The full program is available at my git repository

Note: This post is inspired by the book *Neural Networks and Deep Learning* by Michael Nielsen.

I hope you enjoyed the post. Happy Learning !!