After understanding forward and backward propagation, let's move on to calculating cost and gradient. This is a vital component of neural networks.
To perform gradient descent or cost optimisation, we need to write a cost function which performs (1) forward propagation, (2) backward propagation, (3) the cost calculation and (4) the gradient calculation.
In this article we will deal with (3) and (4). The earlier parts of this series give a deep dive on forward/back prop.
As a reminder, below is our neural network, for which we used forward and backward propagation to calculate Z, A and S.
After forward propagation, we have calculated A3 (as per figure 1). We can think of A3 as the hypothesis produced by our features (x) and this set of weights. So, let's go ahead and calculate its cost to see how well these weights have performed.
Our first step is to calculate a penalty which can be used to regularise our cost. If you want an explanation of regularisation, have a look at this article.
% calculate penalty without theta0
p = sum(sum(Theta1(:, 2:end).^2, 2)) + sum(sum(Theta2(:, 2:end).^2, 2));
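To make the penalty concrete, here is a small sketch with made-up weights. The values of Theta1 and Theta2 below are hypothetical and chosen only so the arithmetic is easy to follow by hand; the first column of each matrix plays the role of theta0 and is skipped.

```octave
% Hypothetical 2x3 and 1x3 weight matrices, for illustration only
Theta1 = [0.5 1 -2; 0.1 3 0];
Theta2 = [9 -1 2];

% Drop the first (bias) column, square element-wise, then sum everything
p = sum(sum(Theta1(:, 2:end).^2, 2)) + sum(sum(Theta2(:, 2:end).^2, 2));

disp(p)   % 1 + 4 + 9 + 0 + 1 + 4 = 19
```

Notice that the large bias weight (9) in Theta2 has no effect on the result.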
Now that we have a penalty, we can calculate the cost and apply the penalty. Later on, the cost optimisation function will use this value to come up with the best weights we can use for predictions.
% Calculate the cost of our forward prop
J = sum(sum(-yv .* log(a3) - (1 - yv) .* log(1 - a3), 2))/m + lambda*p/(2*m);
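Written out as a formula, the cost line corresponds to the regularised cross-entropy cost below. The symbol K (the number of output units) and the index notation are introduced here for illustration; m is the number of training examples, y is yv, a^(3) is a3 and p is the penalty calculated earlier.

```latex
J = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}
    \left[ -y_k^{(i)} \log a_k^{(3)(i)}
           - \left(1 - y_k^{(i)}\right) \log\left(1 - a_k^{(3)(i)}\right) \right]
    + \frac{\lambda}{2m}\, p
```

The inner sum runs across the output units of one example (the `sum(..., 2)` in the code) and the outer sum runs across all examples.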
For cost optimisation, we also need to feed back the gradient of this particular set of weights. Figure 2 indicates what a gradient is once it's been plotted: for the set of weights being fed to our cost function, it is the slope of the plotted line at that point.
Now that we have this understanding, let's cover the calculations. Using matrix multiplication, we first calculate the deltas from S2 and A1, and from S3 and A2. Figure 3 visualises this delta for each of the thetas.
% Calculate DELTAs (accumulated deltas)
delta_1 = (s2'*a1);
delta_2 = (s3'*a2);
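A quick way to sanity-check these products is to confirm that each delta has the same shape as the theta it belongs to. The sketch below uses hypothetical layer sizes and random values, and it assumes that a1 and a2 include the bias column while s2 has had its bias column removed; your own forward and backward propagation code may be laid out differently.

```octave
% Hypothetical sizes, for illustration only
m = 5; n_in = 3; n_hid = 4; n_out = 2;

Theta1 = zeros(n_hid, n_in + 1);
Theta2 = zeros(n_out, n_hid + 1);

a1 = [ones(m, 1) rand(m, n_in)];   % inputs with bias column
a2 = [ones(m, 1) rand(m, n_hid)];  % hidden activations with bias column
s2 = rand(m, n_hid);               % hidden-layer errors, bias column removed (assumed)
s3 = rand(m, n_out);               % output-layer errors

delta_1 = s2' * a1;
delta_2 = s3' * a2;

disp(isequal(size(delta_1), size(Theta1)))  % 1
disp(isequal(size(delta_2), size(Theta2)))  % 1
```

If either check prints 0, the usual suspect is a bias column that was kept or dropped in the wrong place.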
Again, we want to regularise our gradients, so we need to calculate a penalty.
% calculate regularized gradient, replace 1st column with zeros
p1 = (lambda/m)*[zeros(size(Theta1, 1), 1) Theta1(:, 2:end)];
p2 = (lambda/m)*[zeros(size(Theta2, 1), 1) Theta2(:, 2:end)];
Finally, we calculate the gradients for each theta and apply the penalty. Figure 4 shows the gradients.
% gradients / partial derivatives
Theta1_grad = delta_1./m + p1;
Theta2_grad = delta_2./m + p2;
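In formula form, the two gradient lines can be read as the piecewise expression below. The layer index l and the convention that column j = 0 is the bias (theta0) column are notation introduced here for illustration.

```latex
\frac{\partial J}{\partial \Theta_{ij}^{(l)}} =
\begin{cases}
  \dfrac{1}{m}\, \Delta_{ij}^{(l)} & \text{if } j = 0 \\[2ex]
  \dfrac{1}{m}\, \Delta_{ij}^{(l)} + \dfrac{\lambda}{m}\, \Theta_{ij}^{(l)} & \text{if } j \geq 1
\end{cases}
```

The j = 0 case is why the first column of p1 and p2 is filled with zeros: the bias weights are not regularised.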
However, the cost optimisation functions don't know how to work with 2 thetas, so let's unroll these into a vector, with results shown in figure 5.
% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];
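Unrolling is reversible, which is useful whenever the matrices need to be recovered from the vector. The sketch below uses small hypothetical gradient matrices and the made-up names n1, T1 and T2 to show the round trip with reshape.

```octave
% Hypothetical gradient matrices, for illustration only
Theta1_grad = [1 2 3; 4 5 6];   % 2x3
Theta2_grad = [7 8 9];          % 1x3

% Unroll column by column into a single 9x1 vector
grad = [Theta1_grad(:) ; Theta2_grad(:)];

% Recover the original matrices from the vector
n1 = numel(Theta1_grad);
T1 = reshape(grad(1:n1), size(Theta1_grad));
T2 = reshape(grad(n1+1:end), size(Theta2_grad));

disp(isequal(T1, Theta1_grad) && isequal(T2, Theta2_grad))  % 1
```

Because `(:)` and reshape both work column by column, the values land back in their original positions.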
Having been through the 4 parts of this series, you are now ready to put it all together in part 5.