Neural Network Implementation: Derivatives, chain rule and multiplications.

Varsha Dhumal · Analytics Vidhya · Aug 20, 2019

In this article we focus on implementing the mathematics behind neural networks. Before reading this blog, it is recommended that you have a basic understanding of neural networks.

There are two major phases in training a neural network:

  1. Forward Propagation
  2. Back Propagation

Neural Network (NN): A neural network is a supervised learning algorithm, where we have input data (independent variables) and output labels (dependent variable). Using the training data, we train the NN to predict the output variable. In the beginning, the NN makes predictions which are almost random. These predictions are compared with the actual output, and the error is computed as the difference between the predicted and actual output. Our objective is to train the NN to reduce this error/cost function.

Now, let us understand the mathematics behind the NN implementation with the help of an example. The NN that we are going to create has the following visual representation.

Neural Network with two hidden layers

The above diagram represents a NN with two hidden layers, each having 4 neurons. At the input layer we have two neurons, corresponding to the two feature columns in the training data.
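As a concrete reference for the shapes involved, here is a small NumPy sketch of the parameters of this network. The two input features and the two hidden layers of 4 neurons come from the diagram; the single output neuron and the random initialization are assumptions made only for this sketch.

    import numpy as np

    np.random.seed(42)                            # for reproducible examples

    n_inputs, n_hidden1, n_hidden2, n_out = 2, 4, 4, 1   # output size assumed

    # Weight matrices and biases for each layer of the network in the diagram.
    wh1 = np.random.rand(n_inputs,  n_hidden1)    # input -> hidden layer 1
    bh1 = np.random.rand(n_hidden1)
    wh2 = np.random.rand(n_hidden1, n_hidden2)    # hidden layer 1 -> hidden layer 2
    bh2 = np.random.rand(n_hidden2)
    wo  = np.random.rand(n_hidden2, n_out)        # hidden layer 2 -> output
    bo  = np.random.rand(n_out)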

  1. Forward Propagation: During the feed-forward step of the NN, predictions are made using the input node values, the weights and the bias.
    Mathematically, for the first neuron of hidden layer 1:
    zh1 = x1*w1 + x2*w2 + b
    ah1 = activation_function(zh1)
    A similar computation occurs at each node, and the computation propagates forward until the last node (see the short code sketch after this list).
  2. Back-Propagation (BP): Once we finish the forward propagation, we get the prediction at the output layer as ao. From this output and the actual label, the cost function / error is computed, for example as the squared difference between the predicted and actual output.
  • Our objective is to fine-tune the network parameters to minimize the error term (cost function), in such a way that the predictions get closer to the actual labels.
  • If you look at our neural network, you'll notice that we can only control the weights and the bias. In order to minimize the cost, we need to find the weight and bias values for which the cost function returns the smallest value possible. The smaller the cost, the more accurate our predictions are. This is an optimization problem where we have to find the function minima.
  • To find the minima of a function, we can use the gradient descent algorithm.
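To make the forward-propagation equations and the cost above concrete, here is a minimal sketch that continues from the parameters defined earlier. The sigmoid activation and the half-sum-of-squares cost are assumptions; the article does not fix either choice.

    def sigmoid(z):
        # assumed activation function
        return 1.0 / (1.0 + np.exp(-z))

    # single-neuron version of the equations above
    x1, x2 = 0.5, 0.1                        # two example feature values
    w1, w2, b = wh1[0, 0], wh1[1, 0], bh1[0]
    zh1 = x1 * w1 + x2 * w2 + b
    ah1 = sigmoid(zh1)

    # vectorized feed-forward pass for a whole batch X of shape (n_samples, 2)
    def forward(X):
        ah1 = sigmoid(X @ wh1 + bh1)         # hidden layer 1 activations
        ah2 = sigmoid(ah1 @ wh2 + bh2)       # hidden layer 2 activations
        ao  = sigmoid(ah2 @ wo + bo)         # output layer prediction
        return ah1, ah2, ao

    # assumed cost: half of the sum of squared errors between prediction and label
    def cost(ao, y):
        return np.sum((ao - y) ** 2) / 2.0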

Mathematical Implementation of Back-Propagation:

  • The weights introduced at each layer of the neural network are responsible for introducing error into the prediction, so we have to update the weights at each layer. During back-propagation, we update the weights in reverse order, starting from the output layer and moving back to the first hidden layer.
  • As shown in the diagram above, we have to sequentially update the weights at three layers, i.e. the output layer (BP Phase 1), hidden layer 2 (BP Phase 2) and hidden layer 1 (BP Phase 3).
  • While updating the weights we will use gradient descent as:
    weight = weight - learning_rate * (∂J/∂weight)
    bias = bias - learning_rate * (∂J/∂bias)
  • Here, J is the cost function. Basically, what the above equation says is: find the partial derivative of the cost function with respect to each weight and bias, scale it by the learning rate, and subtract the result from the existing values to get the new weight and bias values.
  • Now, step by step, we will update the weights at all three layers as mentioned above.

Gradient Descent: Please refer to this video for an in-depth understanding.

  • Take the derivative of the loss function with respect to each parameter (intercept/weight) in it.
  • Pick random values for the parameters.
  • Plug the parameter values into the derivatives.
  • Calculate the step sizes: step_size = slope * learning_rate
  • Calculate the new parameters:
    new_parameter = old_parameter - step_size
  • Repeat the last three steps until the step size is very small or the maximum number of steps has been reached (a one-parameter sketch of these steps is shown below).
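A minimal one-parameter sketch of these steps, using a toy quadratic loss purely for illustration:

    def loss(w):
        return (w - 3.0) ** 2               # toy loss with its minimum at w = 3

    def dloss_dw(w):
        return 2.0 * (w - 3.0)              # derivative of the loss w.r.t. w

    w = 10.0                                # pick a (here fixed) starting value
    lr = 0.1                                # learning rate
    for step in range(1000):                # stop at a maximum number of steps...
        slope = dloss_dw(w)                 # plug the parameter into the derivative
        step_size = slope * lr              # step_size = slope * learning_rate
        w = w - step_size                   # new parameter = old parameter - step_size
        if abs(step_size) < 1e-6:           # ...or when the step size is very small
            break

    print(w)                                # converges to ~3.0, the minimum of the toy loss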

A) BP Phase 1: In this phase, we will update the weights at the output layer. The mathematics at play here is derivatives, the chain rule and multiplication.

  • We will compute the derivative of the cost function w.r.t. the weights at this layer as dcost_dwo.
    As we do not have the value of this term directly, we will use the chain rule to compute it:
    dcost_dwo = dcost_dao * dao_dzo * dzo_dwo

Now, we are ready to compute equation 2.
We can now update the weights at the output layer using the above term as:
wo -= lr * dcost_dwo
where lr is the learning rate.
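In code, continuing with the sigmoid/half-sum-of-squares assumptions from the earlier sketches (so dcost_dao is ao - y and dao_dzo is ao * (1 - ao)), the output-layer update might look as follows. The toy inputs X, labels y and learning rate lr are made up for this sketch:

    # toy data and learning rate, made up for this sketch
    X = np.array([[0.5, 0.1],
                  [0.2, 0.8],
                  [0.9, 0.4]])               # (n_samples, 2) inputs
    y = np.array([[1.0], [0.0], [1.0]])      # (n_samples, 1) labels
    lr = 0.05                                # learning rate

    def sigmoid_derivative(a):
        # derivative of the sigmoid, expressed in terms of its output a
        return a * (1.0 - a)

    ah1, ah2, ao = forward(X)                # forward pass from the earlier sketch

    dcost_dao = ao - y                       # derivative of the (assumed) cost w.r.t. ao
    dao_dzo   = sigmoid_derivative(ao)       # derivative of the activation w.r.t. zo
    dzo_dwo   = ah2                          # zo = ah2 @ wo + bo, so dzo/dwo is ah2

    # chain rule: dcost_dwo = dzo_dwo^T @ (dcost_dao * dao_dzo)
    dcost_dwo = dzo_dwo.T @ (dcost_dao * dao_dzo)

    wo -= lr * dcost_dwo                     # weight update at the output layer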

B) BP Phase 2: In this phase, we will update the weights at hidden layer 2.

Now, we are ready to compute equation 3.
We can now update the weights at hidden layer 2 using the above term as:
wh2 -= lr * dcost_dwh2
where lr is the learning rate.
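The same chain rule pushed one layer back: the output-layer error is first propagated through wo, and only then reaches the hidden-layer-2 weights. A sketch that reuses the quantities from the Phase 1 sketch:

    dcost_dzo  = dcost_dao * dao_dzo         # output-layer error signal, from Phase 1
    dzo_dah2   = wo                          # zo = ah2 @ wo + bo
    dcost_dah2 = dcost_dzo @ dzo_dah2.T      # how the cost changes with ah2
    dah2_dzh2  = sigmoid_derivative(ah2)     # activation derivative at hidden layer 2
    dzh2_dwh2  = ah1                         # zh2 = ah1 @ wh2 + bh2

    # chain rule: dcost_dwh2 = dzh2_dwh2^T @ (dcost_dah2 * dah2_dzh2)
    dcost_dwh2 = dzh2_dwh2.T @ (dcost_dah2 * dah2_dzh2)

    wh2 -= lr * dcost_dwh2                   # weight update at hidden layer 2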

C) BP Phase 3: In this phase, we will update the weights at hidden layer 1.

Now, we are ready to compute equation 4.
We can now update the weights at hidden layer 1 using the above term as:
wh1 -= lr * dcost_dwh1
where lr is the learning rate.
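And one layer further back for the hidden-layer-1 weights, chaining the hidden-layer-2 error through wh2 down to the inputs (same assumptions as before):

    dcost_dzh2 = dcost_dah2 * dah2_dzh2      # hidden-layer-2 error signal, from Phase 2
    dcost_dah1 = dcost_dzh2 @ wh2.T          # zh2 = ah1 @ wh2 + bh2
    dah1_dzh1  = sigmoid_derivative(ah1)     # activation derivative at hidden layer 1
    dzh1_dwh1  = X                           # zh1 = X @ wh1 + bh1

    # chain rule: dcost_dwh1 = dzh1_dwh1^T @ (dcost_dah1 * dah1_dzh1)
    dcost_dwh1 = dzh1_dwh1.T @ (dcost_dah1 * dah1_dzh1)

    wh1 -= lr * dcost_dwh1                   # weight update at hidden layer 1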

Finally, we have updated all the weights and hence completed our first epoch. For the next epoch these updated weights will be used, and the process continues for the chosen number of epochs.

Source Code:

Complete Script in Python
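Below is a minimal end-to-end sketch that puts the pieces above together into a training loop. The sigmoid activation, half-sum-of-squares cost, single output neuron, toy data, learning rate and number of epochs are all assumptions made only for this sketch, and bias updates are included even though the walkthrough above covers only the weights:

    import numpy as np

    np.random.seed(42)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_derivative(a):
        return a * (1.0 - a)

    # toy dataset: 2 features, one binary label (assumed, for illustration only)
    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    y = np.array([[0.0], [1.0], [1.0], [0.0]])

    # network from the diagram: 2 inputs, two hidden layers of 4 neurons, 1 output (assumed)
    wh1, bh1 = np.random.rand(2, 4), np.zeros(4)
    wh2, bh2 = np.random.rand(4, 4), np.zeros(4)
    wo,  bo  = np.random.rand(4, 1), np.zeros(1)

    lr = 0.5            # learning rate (assumed)
    epochs = 10000      # number of epochs (assumed)

    for epoch in range(epochs):
        # ---- forward propagation ----
        ah1 = sigmoid(X @ wh1 + bh1)
        ah2 = sigmoid(ah1 @ wh2 + bh2)
        ao  = sigmoid(ah2 @ wo + bo)

        # ---- back-propagation, phase 1: output layer ----
        dcost_dzo = (ao - y) * sigmoid_derivative(ao)
        dcost_dwo = ah2.T @ dcost_dzo

        # ---- phase 2: hidden layer 2 ----
        dcost_dzh2 = (dcost_dzo @ wo.T) * sigmoid_derivative(ah2)
        dcost_dwh2 = ah1.T @ dcost_dzh2

        # ---- phase 3: hidden layer 1 ----
        dcost_dzh1 = (dcost_dzh2 @ wh2.T) * sigmoid_derivative(ah1)
        dcost_dwh1 = X.T @ dcost_dzh1

        # ---- gradient descent updates (weights and biases) ----
        wo  -= lr * dcost_dwo
        bo  -= lr * dcost_dzo.sum(axis=0)
        wh2 -= lr * dcost_dwh2
        bh2 -= lr * dcost_dzh2.sum(axis=0)
        wh1 -= lr * dcost_dwh1
        bh1 -= lr * dcost_dzh1.sum(axis=0)

    print(np.round(ao, 3))   # predictions after training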

To keep this article small and focused on the mathematical part, we have not explained all the terminology in detail.
