Building a Deep Neural Network from Scratch Using Python

Naga Durga Sreenivasulu kedari
Published in Analytics Vidhya · 12 min read · Apr 16, 2021

This article is about building a deep neural network from scratch, without using libraries like TensorFlow, Keras or PyTorch. It consists of two parts. In the first part, we will see what a deep neural network is, how it can learn from data and the mathematics behind it; in the second part, we will build one from scratch using Python.

If you are familiar with the concepts of neural networks, feel free to skip the first part and jump straight to the “Building a Network to identify Handwritten Digits” section.

What is Deep Neural Network?

Before we jump into what an artificial neuron and a neural network are, let’s see how our biological neural network functions.

Biological neuron. Image Source

The biological neural network is a network of interconnected neurons. Each neuron has dendrites, which gather information from the surrounding environment. The information comes to the neuron in the form of electrical/chemical signals. Once a neuron receives a signal, it processes it, and if it crosses a certain threshold, the neuron emits an output signal through its axon, which is connected to the next neuron. The next neuron does the same upon receiving the signal, and the process continues.

An artificial neural network (ANN) is vaguely inspired by the biological neural network. It is a collection of connected artificial neurons. Just like a biological neuron, an artificial neuron takes input from the neurons connected to it, does some calculations and transmits a signal to the next connected neuron.

A deep neural network (DNN) is an artificial neural network with multiple layers between the input and output layers. Each neuron in one layer connects to all the neurons in the next layer. The one or more layers between the input and output layers are called hidden layers.

Each connection between a neuron in one layer and a neuron in the previous layer has a weight w, which tells how sensitive the current neuron’s activation is to the activation of the neuron in the previous layer. Each neuron in a given layer also has a bias b. If you are familiar with linear regression, the bias acts like the intercept c in y = mx + c: if sum(mx) does not cross the threshold but the neuron needs to fire, the bias is adjusted to lower that neuron’s threshold so it fires.

Deep Neural Network. Image Source

The whole network looks very complicated, right? But it’s not. Think of it as one giant function y = f(x), where x is your input and y is the output. Inside, f(x) calls a chain of functions, where one function’s output is passed on to the next. These internal functions are nothing but the hidden layers.

Single Artificial Neuron

Now let’s zoom in on a single artificial neuron. An artificial neuron has two parts to it. In the first part, it takes the inputs from the previous layer along with the corresponding weights and bias, and performs a linear transformation of those: the sum of the weighted inputs plus the bias.

Linear transformation of inputs
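For a neuron with inputs $x_1, \dots, x_n$ coming from the previous layer, weights $w_1, \dots, w_n$ and bias $b$, this linear transformation is

$$z = \sum_{i=1}^{n} w_i x_i + b$$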

In the second part, it converts this linear transformation into a non-linear one using an activation function such as the sigmoid, and emits the output of the activation function. There are various other activation functions such as ReLU, but we are using the sigmoid in this post. This combination of linear and non-linear transformations, stacked across multiple layers, is what makes a deep neural network powerful enough to fit complex data.

Activation function

The sigmoid function takes the weighted sum and converts it to a value between 0 and 1: -infinity maps to 0 and +infinity maps to 1. The value between 0 and 1 represents the activation strength of that particular neuron.

Sigmoid Function
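Writing the weighted sum as $z$, the sigmoid and its derivative (referred to as sigmoid_prime in the code later) are

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)\,\big(1 - \sigma(z)\big)$$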

The activation of a neuron at a given layer can be written as below

Activation function of single neuron at layer L
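In the usual notation, the activation of neuron $j$ in layer $L$ is computed from the activations of layer $L-1$ as

$$a_j^{L} = \sigma\left(\sum_k w_{jk}^{L}\, a_k^{L-1} + b_j^{L}\right)$$

where $w_{jk}^{L}$ is the weight connecting neuron $k$ in layer $L-1$ to neuron $j$ in layer $L$, and $b_j^{L}$ is the bias of neuron $j$.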

In a typical neural network, we will have more than one neuron in a given layer. The above equation can be represented in matrix form to include all neurons.

Matrix form of Activation at Layer L
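Collecting the weights of layer $L$ into a matrix $W^{L}$ and the biases into a vector $b^{L}$, the activations of the whole layer become

$$a^{L} = \sigma\left(W^{L} a^{L-1} + b^{L}\right)$$

with $\sigma$ applied element-wise.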

Training Deep Neural Network

A deep neural network learns from the given data by itself and is then used to make predictions on unseen data. But what do we mean by learning from data?

As we already discussed, a DNN has a set of weights and biases at each layer. The activation of a neuron depends on those weights and biases, so learning from the data means finding the best weights and biases for the network. But how do we find them?

To find the weights and biases, a deep neural network does the following:

  1. Assigns some random values to weights and biases
  2. Runs the training data (which has inputs and actual outputs) through the network using these randomly assigned weights and biases. During this, the output of the activation function in one layer is passed as input to the next layer until we get the output from the output layer. This process is called forward propagation.
  3. The initial output of the network will be terrible, since we used random weights and biases. We compute the error (the difference between the network’s prediction and the actual output) using a cost or error function. In this post we are going to use the sum of squared errors, shown below.
Sum Of Squared Errors
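With $y$ as the actual output and $\hat{y}$ as the network’s prediction, the cost over the training samples is

$$C = \sum_{x} \big(\hat{y}(x) - y(x)\big)^2$$

(There is no 1/2 factor in front here, which is why a factor of 2 appears in the gradients below.)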

4. Since all neurons in the network contributed to the error above, the error proportion (error gradient) is passed back from the output layer to all layers except the input layer, so that the weights and biases can be adjusted. This process of propagating the error back to adjust the weights and biases is called back propagation.

Since the cost function is a function of the weights and biases, the error gradient is calculated using the partial derivatives of the cost function with respect to the weights and biases.

To understand this better, let’s take a simple network with one input layer, one hidden layer and one output layer. After the first pass of forward propagation, we have the error. Now we need to pass the error proportion back to the neurons in all layers.

First, let’s calculate the error gradient for a small change in the weights and biases at the output layer. For simplicity, let’s write the activation as a composition of functions.

It’s time to refresh our high school/college multivariate calculus and find the partial derivatives of the cost function C with respect to both the weights and the biases. Using the chain rule, the partial derivatives of C w.r.t. w and b can be written as follows.

Chain Rule
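For our network, writing $z_o = w_o a_h + b_o$ and $a_o = \sigma(z_o)$ for the output layer, the chain rule reads

$$\frac{\partial C}{\partial w_o} = \frac{\partial C}{\partial a_o}\cdot\frac{\partial a_o}{\partial z_o}\cdot\frac{\partial z_o}{\partial w_o},
\qquad
\frac{\partial C}{\partial b_o} = \frac{\partial C}{\partial a_o}\cdot\frac{\partial a_o}{\partial z_o}\cdot\frac{\partial z_o}{\partial b_o}$$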

The partial derivatives of each component in the above equation are
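Using the sum of squared errors cost and the sigmoid activation defined above,

$$\frac{\partial C}{\partial a_o} = 2\,(a_o - y), \qquad \frac{\partial a_o}{\partial z_o} = \sigma'(z_o), \qquad \frac{\partial z_o}{\partial w_o} = a_h, \qquad \frac{\partial z_o}{\partial b_o} = 1$$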

Substituting these values into the chain rule equation, the error gradients at the output layer w.r.t. the weights and biases are
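$$\frac{\partial C}{\partial w_o} = 2\,(a_o - y)\,\sigma'(z_o)\,a_h, \qquad \frac{\partial C}{\partial b_o} = 2\,(a_o - y)\,\sigma'(z_o)$$

These match the wo_delta and bo_delta terms accumulated in the pseudo code below.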

Now, let’s do the same calculation for the hidden layer.

Note: Although L and L-1 denote the output layer and the hidden layer respectively, I have used the subscripts ‘o’ for the output layer and ‘h’ for the hidden layer to make things clearer.
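Applying the chain rule one layer further back, with $z_h = w_h x + b_h$ and $a_h = \sigma(z_h)$, the gradients for the hidden layer are

$$\frac{\partial C}{\partial w_h} = 2\,(a_o - y)\,\sigma'(z_o)\,w_o\,\sigma'(z_h)\,x, \qquad \frac{\partial C}{\partial b_h} = 2\,(a_o - y)\,\sigma'(z_o)\,w_o\,\sigma'(z_h)$$

These are the terms accumulated as wh_delta and bh_delta in the pseudo code below.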

Similarly, we can calculate the error gradients at every hidden layer if we have more than one. Since we only have one hidden layer, back propagation stops here.

5. The above forward and backward propagation is done iteratively, and the weights and biases are adjusted until we find the optimal values. Instead of doing this by brute force, we use the gradient descent algorithm.

Gradient Descent

Gradient descent is an iterative optimization technique that can find the minimum of a function. It is used when finding the optimal values of a function’s parameters algebraically is difficult.

Intuition: imagine a person standing on the slope of a valley. The person wants to get to the bottom of the valley, but he doesn’t know which direction will take him there. He takes one step and decides his next position based on his current position. If the step takes him towards the bottom, he continues in that direction; otherwise, he changes direction. He takes larger steps when the slope of the valley is steep, and as he nears the bottom he takes smaller steps, finally stopping once he reaches it.

Gradient Descent

Our objective here is to find the optimum values of the weights and biases such that the cost function is at its minimum. The gradient descent algorithm involves the following steps:

  1. Assign random values to the weights w and biases b, and a constant value for the learning rate.
  2. Update the weights and biases using the gradients (which we calculated using the partial derivatives) and the learning rate, as shown in the update rule below.
  3. Repeat step 2 until we find the minimum value or reach the maximum number of iterations.
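With a learning rate $\eta$, the update rule in step 2 is

$$w := w - \eta\,\frac{\partial C}{\partial w}, \qquad b := b - \eta\,\frac{\partial C}{\partial b}$$

where the partial derivatives are the error gradients we derived in the back propagation step.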

Training Summary

Let’s summarize the whole training process by writing pseudo code for a network that has one input layer, one hidden layer and one output layer.

initialize_weights_and_biases():
    output_w = initialize_random_w
    output_b = initialize_random_b
    hidden_w = initialize_random_w
    hidden_b = initialize_random_b

train(x_train, y_train, no_of_iterations, learning_rate):
    # 1. initialize network weights and biases
    initialize_weights_and_biases()
    for iteration in range(no_of_iterations):  # run gradient descent no_of_iterations times
        # initialize the accumulated deltas of weights and biases to zero
        wo_delta = 0
        bo_delta = 0
        wh_delta = 0
        bh_delta = 0
        for x, y in zip(x_train, y_train):  # iterate through each sample in the training data
            # 2. forward propagation
            z_h = hidden_w * x + hidden_b
            a_h = sigmoid(z_h)
            z_o = output_w * a_h + output_b
            predicted = sigmoid(z_o)
            # 3. find the error
            error = (predicted - y)
            # 4. back propagate the error
            delta = 2 * error * sigmoid_prime(z_o)
            wo_delta += delta * a_h
            bo_delta += delta
            wh_delta += delta * output_w * sigmoid_prime(z_h) * x
            bh_delta += delta * output_w * sigmoid_prime(z_h)

        # 5. after one pass over all the inputs, update the network weights and biases
        output_w = output_w - learning_rate * wo_delta
        output_b = output_b - learning_rate * bo_delta
        hidden_w = hidden_w - learning_rate * wh_delta
        hidden_b = hidden_b - learning_rate * bh_delta

Prediction

After training the neural network, we will have the optimal values of weights and biases at each layer. Prediction is nothing but performing one pass of forward propagation for the test data.

Building a Network to identify Handwritten Digits

Enough of the theory; let’s get our hands dirty by writing a Python program to build a deep neural network. We are going to use the MNIST dataset and build a network that recognizes handwritten digits, the “hello world” program of deep neural networks.

The MNIST data consists of scanned handwritten digit images of size 28 x 28 pixels.

mnist data. Image Source

Again we will consider building a network with 1 input layer, 1 hidden layer and 1 output layer.

The following program is the Python version of the pseudo code we discussed above. The only difference is that we have introduced batches, because the MNIST data has 60,000 rows; loading all 60,000 rows into memory for every iteration would kill the memory.

__init__ initializes the weights and biases randomly for output and hidden layers.

forward_propagation performs forward propagation for the given input
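The full implementation is in the repository linked at the end; as a rough sketch (not the exact code), forward propagation with NumPy, using the o_weights/h_weights and o_biases/h_biases attributes that appear in the snippets below and the sigmoid from the first section, looks something like this:

import numpy as np

def sigmoid(z):
    # squash the weighted sum into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(self, x):  # method of the network class
    # hidden layer: weighted sum of the inputs plus bias, then sigmoid
    z_h = np.dot(self.h_weights, x) + self.h_biases
    a_h = sigmoid(z_h)
    # output layer: weighted sum of the hidden activations plus bias, then sigmoid
    z_o = np.dot(self.o_weights, a_h) + self.o_biases
    return sigmoid(z_o)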

update_mini_batch runs forward and back propagation for every record in the given batch. We sum the error deltas because we are using the sum of squared errors, and the partial derivative of the total cost is the sum of the error gradients over all samples.

o_del_b, h_del_b, o_del_w, h_del_w = self.backprop(x, y)

o_b = o_b + o_del_b
h_b = h_b + h_del_b
o_w = o_w + o_del_w
h_w = h_w + h_del_w

After every batch run, it will update the network weights and biases

self.o_weights = self.o_weights - (l_rate/len(batch))*o_w
self.h_weights = self.h_weights - (l_rate/len(batch))*h_w
self.o_biases = self.o_biases - (l_rate/len(batch))*o_b
self.h_biases = self.h_biases - (l_rate/len(batch))*h_b

backprop propagates the error gradient back to all layers except the input layer. It is the heart of the neural network. As we discussed earlier, we calculate the partial derivatives of the error function with respect to the weights and biases at each layer. In the code we use the .transpose() method so that the shapes follow the matrix multiplication rule (A x B is only possible if A is an m x n matrix and B is an n x p matrix; the result is an m x p matrix).

delta = (predicted - y) * sigmoid_prime(z_o)

o_del_b = delta
o_del_w = np.dot(delta, a_h.transpose())

delta = np.dot(self.o_weights.transpose(), delta) * sigmoid_prime(z_h)

h_del_b = delta
h_del_w = np.dot(delta, x.transpose())

The fit method trains the network. It takes the input, shuffles it randomly and splits the data into batches. It then invokes the update_mini_batch function for each batch, and it repeats these steps for every epoch.

To read the MNIST data, we are going to use fetch_openml from the sklearn.datasets package. We will also use sklearn to split the data into train and test sets.
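A minimal sketch of this loading and splitting step (the 'mnist_784' dataset name on OpenML and the 80/20 split below are illustrative choices):

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# download all 70,000 MNIST images as a (70000, 784) array plus their digit labels
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# hold out a portion of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)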

The MNIST data consists of digitized images of handwritten digits, so the pixel values range from 0 to 255. To normalize the data, we divide the input by 255 so that the values lie between 0 and 1.

X = (X/255).astype('float32')

Since each image is 28 x 28 pixels and the deep neural network expects its input as a vector, each input is reshaped to (784, 1), because 28 * 28 = 784.

X = [np.reshape(x, (784, 1)) for x in X]

The network we are going to build has 10 neurons in the output layer, as we need to identify the digits from 0 to 9. If the network identifies a given digit as 3, the output neuron that stands for 3 should have a value of 1 and all other neurons a value of 0.

[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

Since the MNIST dataset has the y values as plain digits, we need to vectorize them so that they are in the form mentioned above.
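A small helper for this one-hot encoding could look like the following (the function name vectorize_y is illustrative, not necessarily what the repository uses):

import numpy as np

def vectorize_y(digit):
    # turn a digit label such as 3 into a (10, 1) column vector with a 1 at index 3
    v = np.zeros((10, 1))
    v[int(digit)] = 1.0
    return v

y_train = [vectorize_y(d) for d in y_train]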

Let’s create a network object by specifying the number of neurons at each layer and train the network with the training data.

Here we have a network with an input layer of 784 neurons, a hidden layer of 100 neurons (why 100 neurons? It’s a choice; we can use any number of neurons and see how the network behaves) and an output layer of 10 neurons.
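Assuming the class takes the layer sizes in its constructor and expects the training data as (input, one-hot label) pairs (the class name Network and the train_data preparation below are illustrative; see the repository for the actual code):

# pair each (784, 1) input vector with its (10, 1) one-hot label
train_data = list(zip(X_train, y_train))

# 784 input neurons, 100 hidden neurons, 10 output neurons
network = Network([784, 100, 10])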

network.fit(train_data, 30, 10, 3.0)

The above statement trains the network for 30 iterations with mini-batches of size 10 and a learning rate of 3.0. The number of iterations, the batch size and the learning rate are hyperparameters of the network. We need to do hyperparameter tuning to find the best combination.

Accuracy

To find the accuracy of the model against the test data, we perform forward propagation for every test sample and take the index of the maximum output value. If that index matches the y test value, the model’s prediction is correct. The number of correct predictions over the total number of test samples gives the accuracy.
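A minimal sketch of this accuracy check, assuming the forward_propagation method and the reshaped (784, 1) test inputs from above:

correct = 0
for x, y in zip(X_test, y_test):
    output = network.forward_propagation(x)  # (10, 1) vector of output activations
    if np.argmax(output) == int(y):          # most activated neuron = predicted digit
        correct += 1

accuracy = correct / len(X_test)
print(f"Test accuracy: {accuracy:.2%}")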

For the network we built, the test accuracy is 96.59%, which is very good.

The full program is available in my git repository.

Note: This post is inspired by the book Neural Networks and Deep Learning by Michael Nielsen.

I hope you enjoyed the post. Happy Learning !!
