Build a Deep Neural Network using Python

Jayani Hewavitharana
Published in Analytics Vidhya · Jul 8, 2020
[Image: Example of a Neural Network]

Deep Learning and Neural Networks are very popular concepts in Data Science, making use of huge amounts of data to train models that can perform highly accurate classification. Neural networks are widely used in image and speech recognition and natural language processing.

I did some basic studying on neural networks (there are many resources online to learn just about anything these days) and it turns out, the concept is fairly easy to grasp. Having a bit of knowledge on linear algebra and calculus will suffice to understand the basics of neural networks.

There are many frameworks, such as TensorFlow and Keras, that can be used to model and train a neural network. These frameworks make implementation easy, but why not try things the hard way and implement a neural network from scratch?

So let’s build a model in Python to train and test a neural network with any number of layers, each consisting of any number of hidden units (neurons).

Before diving into implementation, first, let’s see what needs to be done. We can break down the process into a few steps.

  1. Initialize the parameters
  2. Forward propagation
  3. Compute the cost
  4. Backward propagation
  5. Update the parameters

Now let’s see how to implement these steps.

First things first. We need a dataset. The dataset I used to train and test the model is make_moons from sklearn datasets. You can see the documentation for the dataset here.

You can also explore many other datasets from sklearn (the make_moons dataset is in the ‘Generated Datasets’ section).

First, install scikit-learn and import the dataset generator.

from sklearn.datasets import make_moons

Now let’s load the dataset and set it up for training and testing.

X, Y = make_moons(n_samples=2000, shuffle=True, noise=None, random_state=None)
X = X.T                 # X becomes an n*m matrix (features by examples)
m = X.shape[1]          # number of examples
Y = Y.reshape(1, m)     # Y becomes a 1*m vector

In this implementation,

X is an n*m matrix where n is the number of features and m is the number of examples.

Y is a 1*m vector.

You can switch the dimensions of X and Y but then you should switch the dimensions of the weights accordingly.

We will use 80% of the dataset to train the model and 20% for testing.

margin = m // 10 * 8    # use the first 80% of the examples for training
X_train, X_test = X[:, :margin], X[:, margin:]
Y_train, Y_test = Y[:, :margin], Y[:, margin:]

Now that we have the dataset ready, it’s time to implement the model. We’ll implement the model step by step.

Initialize the Parameters

This is where we initialize the weights and biases of each layer. To initialize the parameters, we take as input an array containing the number of units (neurons) in each layer. For example, [n, 4, 1] would mean three layers: an input layer with n features, a hidden layer with 4 neurons, and an output layer with 1 neuron.

Then we loop through the layers and initialize the weights and biases and store them in a dictionary ({W1: _ , b1: _ , W2: _ , b2: _ , ………})

In this implementation,

The dimensions of the weights matrix of a particular layer l (W[l]) is, (number of neurons in layer l, number of neurons in layer l-1)

The dimensions of the biases vector of a particular layer l (b[l]) is, (number of neurons in layer l, 1)

These dimensions depend on the dimensions of X and Y. If you switch the dimensions of X and Y, the dimensions of W and b should be switched as well.

Always initialize the weights with random values to break the symmetry. If all weights and biases are initialized to zeros, the activation in all the neurons of a single layer will be the same.
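Putting that together, here is a minimal sketch of the initialization (the function name initialize_parameters, the random seed, and the 0.01 scaling factor are my choices, not fixed requirements):

import numpy as np

def initialize_parameters(layer_units):
    # layer_units: e.g. [n, 5, 3, 1] -- the number of units in each layer
    np.random.seed(1)                      # optional, for reproducible runs
    parameters = {}
    L = len(layer_units)                   # total number of layers, including the input layer
    for l in range(1, L):
        # W[l] has shape (units in layer l, units in layer l-1)
        parameters["W" + str(l)] = np.random.randn(layer_units[l], layer_units[l - 1]) * 0.01
        # b[l] has shape (units in layer l, 1); zeros are fine for the biases
        parameters["b" + str(l)] = np.zeros((layer_units[l], 1))
    return parameters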

Forward Propagation

In this step, the activation for each layer is computed by taking the activation of the previous layer as the input.

Forward propagation can be broken down into 2 sub-steps:

  1. The linear step: Z[l] = W[l] · A[l-1] + b[l]
  2. The activation step: A[l] = g(Z[l]), where g is the activation function

Although I am using the sigmoid activation function to compute the activation in all the layers, it is not the best option. There are other activation functions such as tanh and ReLU that work better than sigmoid for the hidden layers.

To implement forward propagation, we need to implement the sigmoid function.
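A straightforward NumPy version could look like this (assuming numpy is imported as np, as in the initialization sketch above):

def sigmoid(Z):
    # element-wise sigmoid: 1 / (1 + e^(-Z))
    return 1 / (1 + np.exp(-Z))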

Now let’s calculate the activation of a single layer.
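One way to write it, as a sketch that reuses the sigmoid helper above (the function name layer_forward is mine):

def layer_forward(A_prev, W, b):
    # linear step: Z[l] = W[l] . A[l-1] + b[l]
    Z = np.dot(W, A_prev) + b
    # activation step: A[l] = sigmoid(Z[l])
    A = sigmoid(Z)
    # keep the values backprop will need later
    cache = (A_prev, Z)
    return A, cache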

You may have noticed that aside from the activation for the layer, we return a cache where we store some values from this layer. You’ll see why when we get to the backward propagation part.

Now that we have the function to calculate the activation of a single layer, we can use it for forward propagation. What we need to do is calculate the activation of each layer using the activation of the previous layer up to the final layer.
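A sketch of the full forward pass, built on layer_forward above (the function name forward_propagation is my choice):

def forward_propagation(X, parameters):
    caches = []
    A = X                                    # A[0] is the input
    L = len(parameters) // 2                 # number of layers (each has a W and a b)
    for l in range(1, L + 1):
        A, cache = layer_forward(A, parameters["W" + str(l)], parameters["b" + str(l)])
        caches.append(cache)
    AL = A                                   # activation of the final layer (y hat)
    return AL, caches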

From this function, we return AL, the activation of the final layer (or y hat). We also return the caches, an array containing all the caches that we collect in each layer. You’ll see the use of the caches in a bit.

Compute the Cost

Now we compute the cost to see how far the AL we calculated with forward prop is from Y. For this, we will use the cross-entropy cost:

Cross-entropy cost: J = -(1/m) Σ [ y(i) log(a(i)) + (1 - y(i)) log(1 - a(i)) ]

Cross-entropy cost, vectorized over all m examples: J = -(1/m) np.sum( Y * np.log(AL) + (1 - Y) * np.log(1 - AL) )

You can also use the mean squared error to compute the cost.
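Here is a minimal sketch of the cross-entropy cost in NumPy (the name compute_cost is mine; it assumes AL never hits exactly 0 or 1):

def compute_cost(AL, Y):
    m = Y.shape[1]
    # vectorized cross-entropy cost over all m examples
    cost = -(1 / m) * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
    return cost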

Backward Propagation

Now comes the tricky part, backprop! Our ultimate goal is to minimize the cost by adjusting the parameters. For that, we use gradient descent. To adjust the parameters using gradient descent, we need to find the partial derivative of the cost with respect to each of the parameters (dW1, db1, dW2, db2, …).

If you know the basics of calculus, backprop is not that hard. By using the chain rule, we can find the partial derivatives of the cost with respect to each of the parameters.

To help with backprop, here are the equations we used for forward prop:

Z[l] = W[l] · A[l-1] + b[l]

A[l] = sigmoid(Z[l]),  with A[0] = X

Now let’s derive dW[L] (the derivative of the cost with respect to W[L]), where L is the final layer, using the chain rule:

dW[L] = dJ/dW[L] = (dJ/dA[L]) · (dA[L]/dZ[L]) · (dZ[L]/dW[L])

Now let’s see if we can derive each of these terms.

dJ/dA[L] = -( Y/A[L] - (1 - Y)/(1 - A[L]) )

This is the derivative of the cost w.r.t. A[L], written per example (if you use mean squared error for the cost, this term will be different); the averaging over the m examples comes back in when we compute dW and db below.

dA[L]/dZ[L] = sigmoid'(Z[L])

dZ[L]/dW[L] = A[L-1]

Let’s take the product of the first two derivatives as dZ[L] (the derivative of the cost with respect to Z[L]). We will need it later.

So if we have Z[L] and the activation of the previous layer (A[L-1]), we will be able to compute dW[L]. Remember that we stored some values in a cache in each layer during forward prop? If you don’t remember what we stored, scroll back up and see. We stored A_prev and Z! So if we access the cache of the final layer, we’ll be able to compute dW[L] easily.

Now let’s go back one more layer and see if we can derive a pattern for each dW[l].

Recall that,

A[L-1] = sigmoid(Z[L-1])  and  Z[L-1] = W[L-1] · A[L-2] + b[L-1]

Then we can reduce the chain to,

dW[L-1] = (dJ/dA[L]) · (dA[L]/dZ[L]) · (dZ[L]/dA[L-1]) · (dA[L-1]/dZ[L-1]) · (dZ[L-1]/dW[L-1])

Now let’s see if we can derive the rest of the terms.

dZ[L]/dA[L-1] = W[L]  and  dZ[L-1]/dW[L-1] = A[L-2]  (from Z[l] = W[l] · A[l-1] + b[l])

dA[L-1]/dZ[L-1] = sigmoid'(Z[L-1])

So if we have W[L], Z[L-1], and A[L-2], we can compute dW[L-1]. And if you go back and see, we have stored the values we need in the cache.

If we take the derivative of the cost with respect to A[L-1] as,

dA[L-1] = W[L]ᵀ · dZ[L]

then we can generalize an equation to get the derivative of the cost with respect to the weights in a single layer as,

dW[l] = (1/m) · dZ[l] · A[l-1]ᵀ,  where dZ[l] = dA[l] * sigmoid'(Z[l])

The derivative of the cost with respect to the biases of the layer can be written using the same chain rule, with dZ[l]/db[l] as the last term. In this case, the last term will be 1. So,

db[l] = (1/m) · Σ dZ[l]  (summed over the m examples)

Phew! So now that the derivations are out of the way, let’s implement backprop.

First of all, we need a function to calculate the sigmoid derivative.
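A small sketch, reusing the sigmoid helper from the forward-prop section:

def sigmoid_derivative(Z):
    # sigmoid'(Z) = sigmoid(Z) * (1 - sigmoid(Z))
    s = sigmoid(Z)
    return s * (1 - s)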

Now, let’s apply our derived equation to a single layer.
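A sketch of the per-layer backward step, following the equations above (the function name layer_backward and its argument order are my choices):

def layer_backward(dA, cache, W):
    A_prev, Z = cache                                   # the values we stored during forward prop
    m = A_prev.shape[1]
    dZ = dA * sigmoid_derivative(Z)                     # dZ[l] = dA[l] * sigmoid'(Z[l])
    dW = (1 / m) * np.dot(dZ, A_prev.T)                 # dW[l] = (1/m) dZ[l] . A[l-1]^T
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)    # db[l] = (1/m) sum of dZ[l] over the examples
    dA_prev = np.dot(W.T, dZ)                           # dA[l-1] = W[l]^T . dZ[l]
    return dA_prev, dW, db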

Now let’s use that function and apply it to all the layers starting from the last layer down to the first layer.

We use the caches we stored during forward prop to compute the derivatives in each layer and store them in a dictionary. ({dW1: _ , db1: _ , dW2: _ , db2: _ , ………})
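A sketch of the full backward pass, walking from the last layer down to the first (the name backward_propagation is mine):

def backward_propagation(AL, Y, parameters, caches):
    grads = {}
    L = len(caches)                                      # number of layers
    # derivative of the cross-entropy cost with respect to AL
    dA = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    for l in reversed(range(1, L + 1)):
        dA, dW, db = layer_backward(dA, caches[l - 1], parameters["W" + str(l)])
        grads["dW" + str(l)] = dW
        grads["db" + str(l)] = db
    return grads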

Update Parameters

Finally, we can use the derivatives we computed and update the parameters.
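A minimal sketch of the gradient-descent update (the name update_parameters is mine):

def update_parameters(parameters, grads, learning_rate):
    L = len(parameters) // 2
    for l in range(1, L + 1):
        # gradient descent step: parameter = parameter - learning_rate * gradient
        parameters["W" + str(l)] -= learning_rate * grads["dW" + str(l)]
        parameters["b" + str(l)] -= learning_rate * grads["db" + str(l)]
    return parameters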

So we have finished all the steps. Now it’s time to put it all together. We initialize the parameters once, then execute the rest of the steps in a loop until the cost is minimized. (We can decide how many iterations we need)
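Putting the pieces together, a sketch of the full training loop might look like this (the helper names are the ones used in the sketches above; printing the cost every 1000 iterations is optional):

def neural_network(X, Y, layer_units, learning_rate, num_iterations):
    parameters = initialize_parameters(layer_units)
    for i in range(num_iterations):
        AL, caches = forward_propagation(X, parameters)          # forward prop
        cost = compute_cost(AL, Y)                               # compute the cost
        grads = backward_propagation(AL, Y, parameters, caches)  # backprop
        parameters = update_parameters(parameters, grads, learning_rate)
        if i % 1000 == 0:
            print("Cost after iteration", i, ":", cost)
    return parameters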

And that’s our model. You can also store the cost for each iteration in an array and plot it against the number of iterations to see the cost decreasing.

Now, all we have to do is run this function on the dataset we loaded earlier. We’ll train a model with three layers of 5, 3, and 1 units respectively, with a learning rate of 0.005 for 10,000 iterations.

layer_units = [X_train.shape[0], 5, 3, 1]
parameters = neural_network(X_train, Y_train, layer_units, 0.005, 10000)

And that’s it!

Now you can implement a predictor function to predict labels for the train and test sets and calculate the accuracy. You can try different datasets and activation functions, and play around with the learning rate, the number of hidden layers, and the number of hidden units to see how the model performs.
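For example, a simple predictor could reuse forward_propagation and threshold the output at 0.5 (a sketch; the name predict and the threshold are my choices):

def predict(X, Y, parameters):
    AL, _ = forward_propagation(X, parameters)
    predictions = (AL > 0.5).astype(int)     # class 1 if the activation is above 0.5
    accuracy = np.mean(predictions == Y)
    return predictions, accuracy

_, train_accuracy = predict(X_train, Y_train, parameters)
_, test_accuracy = predict(X_test, Y_test, parameters)
print("Train accuracy:", train_accuracy, "Test accuracy:", test_accuracy)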

Hope you understood the code and most importantly, the concept. The complete code can be viewed in my GitHub repository.
