Neural networks cost and gradient calculation deep dive 104

Shaun Enslin
Jul 9 · 3 min read

After understanding forward and backward propagation, let's move on to calculating the cost and gradient. This is a vital component of neural networks.

This is part 4 in my series on neural networks. You are welcome to start at part 1 or skip to part 5 if you just want the code.

So, to perform gradient descent or cost optimisation, we need to write a cost function which performs:

  1. Forward propagation
  2. Backward propagation
  3. Calculate cost
  4. Calculate gradient

In this article we will deal with (3) and (4). You can click on the links above for a deep dive on forward/back prop.

So, just as a reminder, below is our neural network. We used forward and backward propagation to calculate Z, A and S.

Figure 1

Cost calculation

After forward propagation, we have calculated A3 (as per figure 1). We can think of A3 as the hypothesis produced from our features (x) by this set of weights. So, let's go ahead and calculate its cost to see how well these weights have performed.

Our first step is to calculate a penalty which can be used to regularise our cost. Note that the first column of each Theta is left out, since the bias weights are conventionally not regularised. If you want an explanation of regularisation, then have a look at this article.

% calculate penalty without theta0 (the bias column)
p = sum(sum(Theta1(:, 2:end).^2, 2)) + sum(sum(Theta2(:, 2:end).^2, 2));

Now that we have a penalty, we can calculate the cost and apply the penalty. Later on, the cost optimisation function will use this value to come up with the best weights we can use for predictions.
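
One note before the cost line below: it uses two values not shown in this snippet, m (the number of training examples) and yv (the labels expanded into an m x K one-hot matrix). A minimal sketch of how they can be built, assuming y holds integer class labels from 1 to num_labels (the toy values here are purely illustrative):

% Sketch: build m and yv from integer class labels y (toy values for illustration)
y = [3; 1; 2];                % e.g. 3 training examples, labels in 1..3
num_labels = 3;
m = size(y, 1);               % number of training examples
yv = (1:num_labels) == y;     % m x num_labels one-hot matrix via broadcasting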

% Calculate the cost of our forward prop
J = sum(sum(-yv .* log(a3) - (1 - yv) .* log(1 - a3), 2))/m + lambda*p/(2*m);
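
For reference, that single line implements the regularised cross-entropy cost. Written out as an equation (m training examples, K output classes, and p the penalty computed above), it is:

J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[-y^{(i)}_k \log\big(a3^{(i)}_k\big) - \big(1 - y^{(i)}_k\big)\log\big(1 - a3^{(i)}_k\big)\Big] + \frac{\lambda}{2m}\,p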

Gradients

For cost optimisation, we also need to feed back the gradient of this particular set of weights. Figure 2 shows what a gradient is once it has been plotted: for the set of weights being fed to our cost function, it is the slope of the plotted line.

Figure 2
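
Put more precisely, the gradient we need is the collection of partial derivatives of the cost with respect to every individual weight, i.e. the slope of the curve in figure 2 at the current value of each weight:

\frac{\partial J(\Theta)}{\partial \Theta^{(l)}_{ij}}

These are exactly the values Theta1_grad and Theta2_grad will hold further down.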

Now that we have this understanding, let's cover the calculations. Using matrix multiplication, we first calculate the accumulated deltas from S2, S3, A1 and A2. Figure 3 visualises these deltas for each of the thetas.

% Calculate DELTAs (accumulated deltas)
delta_1 = (s2'*a1);
delta_2 = (s3'*a2);
Figure 3
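
As a quick sanity check on those matrix products, each accumulated delta should come out with the same shape as the Theta matrix it will eventually update. A toy shape check with illustrative sizes (m examples, 400 inputs, 25 hidden units, 10 classes; not necessarily the sizes used in this series):

% Toy shape check; sizes are illustrative only
m  = 5;
a1 = rand(m, 401);   % layer 1 activations incl. bias column
a2 = rand(m, 26);    % layer 2 activations incl. bias column
s2 = rand(m, 25);    % layer 2 deltas (bias column already removed)
s3 = rand(m, 10);    % layer 3 deltas
size(s2' * a1)       % 25 x 401, same shape as Theta1
size(s3' * a2)       % 10 x 26, same shape as Theta2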

Again, we want to regularise our gradients, so we need to calculate a penalty.

% calculate regularized gradient, replace 1st column with zeros
p1 = (lambda/m)*[zeros(size(Theta1, 1), 1) Theta1(:, 2:end)];
p2 = (lambda/m)*[zeros(size(Theta2, 1), 1) Theta2(:, 2:end)];

Finally, we calculate the gradients for each theta and apply the penalty. Figure 4 shows the gradients.

% gradients / partial derivatives
Theta1_grad = delta_1./m + p1;
Theta2_grad = delta_2./m + p2;
Figure 4
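
Before handing these gradients to an optimiser, it is worth checking them numerically on a small network. The helper below is my own sketch rather than code from this series: it nudges each weight up and down by a small epsilon and estimates the slope from the change in cost. If this numerical estimate and the unrolled analytical gradient agree to within roughly 1e-9 relative difference, the back prop and gradient code are almost certainly correct.

% Sketch of a numerical gradient check. costFn must take an unrolled weight
% vector and return the cost J (e.g. a handle onto the cost function above).
function numgrad = computeNumericalGradient(costFn, theta)
  numgrad = zeros(size(theta));
  perturb = zeros(size(theta));
  e = 1e-4;                                 % small perturbation
  for i = 1:numel(theta)
    perturb(i) = e;
    loss1 = costFn(theta - perturb);        % cost with theta(i) nudged down
    loss2 = costFn(theta + perturb);        % cost with theta(i) nudged up
    numgrad(i) = (loss2 - loss1) / (2 * e); % centred difference = slope estimate
    perturb(i) = 0;
  end
end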

However, the cost optimisation functions don't know how to work with two separate theta matrices, so let's unroll them into a single vector, with the result shown in figure 5.

% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];
Figure 5
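
To close the loop, here is a sketch of how the unrolled grad is actually consumed. The names below (nnCostFunction, initial_nn_params and the layer sizes) are placeholders for whatever your own implementation uses; the key point is that the optimiser wants one function returning [J, grad] and one unrolled weight vector:

% Sketch only: function and variable names are placeholders
options = optimset('GradObj', 'on', 'MaxIter', 50);   % we supply the gradient ourselves
costFn = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                             num_labels, X, y, lambda);
[nn_params, cost] = fminunc(costFn, initial_nn_params, options);

Octave's built-in fminunc works here; course-style fmincg routines use the same calling convention.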

Conclusion

Having been through the 4 parts of this series, you are now ready to put it all together in part 5.
