All the Backpropagation derivatives
So you’ve completed Andrew Ng’s Deep Learning course on Coursera.
You know that ForwardProp looks like this:
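Roughly, in the course’s vectorized notation for a layer l:

Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}
A^{[l]} = g^{[l]}(Z^{[l]})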
And you know that Backprop looks like this:
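Again roughly, for a batch of m training examples:

dZ^{[l]} = dA^{[l]} \ast g^{[l]\prime}(Z^{[l]})
dW^{[l]} = \tfrac{1}{m} \, dZ^{[l]} A^{[l-1]T}
db^{[l]} = \tfrac{1}{m} \textstyle\sum_{i=1}^{m} dZ^{[l](i)}
dA^{[l-1]} = W^{[l]T} dZ^{[l]}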
But do you know how to derive these formulas?
TL;DR
Full derivations of all the Backpropagation derivatives used in the Coursera Deep Learning course, using both the chain rule and direct computation.
If you’ve been through backpropagation and not understood how results such as
and
are derived, or if you want to understand the direct computation as well as simply using the chain rule, then read on…
Our Neural Network
This is the simple Neural Net we will be working with, where x, W and b are our inputs, the “z’s” are the linear functions of our inputs, the “a’s” are the (sigmoid) activation functions, and the final term is our Cross Entropy or Negative Log Likelihood cost function.
So here’s the plan: we will work backwards from our cost function and compute directly the derivative of the cost with respect to (w.r.t) each of the preceding elements in our Neural Network.
As well as computing these values directly, we will also show the chain rule derivation.
# Note: we don’t differentiate our input ‘X’ because these are fixed values that we are given and therefore don’t optimize over.
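To make the pieces concrete, here is a minimal NumPy sketch of the forward pass and cost for a single-unit version of this network (the variable names and example values are mine, not the course’s starter code):

```python
import numpy as np

def sigmoid(z):
    # logistic (sigmoid) activation: 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

# one training example; scalar weight and bias to keep the shapes trivial
x, y = 2.0, 1.0      # input and label
w, b = 0.1, -0.3     # the parameters we optimise

z = w * x + b                                    # linear function of the input
a = sigmoid(z)                                   # sigmoid activation
L = -(y * np.log(a) + (1 - y) * np.log(1 - a))   # cross entropy / negative log likelihood cost
```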
[1] Derivative w.r.t activation function
To start, we will take the derivative of our cost function w.r.t the activation function ‘a’.
We are taking the derivative of the Negative Log Likelihood function (Cross Entropy), which when expanded looks like this:
First, let’s move the minus sign on the left of the brackets and distribute it inside the brackets, so we get:
Next we differentiate the left hand side:
The right hand side is more complex, as the derivative of ln(1-a) is not simply 1/(1-a); we must use the chain rule to multiply the derivative of the inner function by that of the outer.
The derivative of (1-a) is -1, which gives the final result:
And the proof of the derivative of a log being the inverse is as follows:
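In short (writing L for the cost), since the derivative of ln(a) with respect to a is 1/a:

L = -[y \ln(a) + (1-y) \ln(1-a)]
\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}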
[2] Derivative of sigmoid
It is useful at this stage to compute the derivative of the sigmoid activation function, as we will need it later on.
Our logistic function (sigmoid) is given as:
First, it is convenient to rearrange this function into the following form, as it allows us to use the chain rule to differentiate:
Now, using the chain rule (multiplying the derivative of the outer function by that of the inner), we get
which rearranged gives
Here’s the clever part: we can separate this into the product of two fractions and, with a bit of algebraic magic, add a ‘1’ to the second numerator and immediately take it away again:
The RHS then simplifies to
Which is nothing more than
Which gives a final result of
Or alternatively:
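Collected in one place, the steps are:

a = \sigma(z) = \frac{1}{1+e^{-z}} = (1+e^{-z})^{-1}
\frac{da}{dz} = \frac{e^{-z}}{(1+e^{-z})^{2}} = \frac{1}{1+e^{-z}} \cdot \frac{(1+e^{-z}) - 1}{1+e^{-z}} = a(1-a) = \sigma(z)(1-\sigma(z))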
[3] Derivative w.r.t linear function
To get this result we can use the chain rule, multiplying the two results we’ve already calculated in [1] and [2]:
If we can get a common denominator on the left hand side of the equation, then we can simplify it, so let’s multiply the top and bottom of the first fraction by ‘(1-a)’ and the top and bottom of the second fraction by ‘a’:
With a common denominator we can simplify to
Now we multiply the LHS by the RHS; the a(1-a) terms cancel out and we are left with just the numerator from the LHS!
which if we expand out gives:
note that ‘ya’ is the same as ‘ay’, so they cancel to give
which rearranges to give our final result for the derivative:
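That is:

\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \cdot \frac{da}{dz} = \frac{a-y}{a(1-a)} \cdot a(1-a) = a - y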
[4] Derivative w.r.t weights
This derivative is trivial to compute, as z is simply
and the derivative of z w.r.t the weights evaluates to
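In other words:

z = wx + b \quad\Rightarrow\quad \frac{\partial z}{\partial w} = x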
[5] Derivative w.r.t weights (2)
This derivative can be computed in two different ways! We can use the chain rule or compute it directly. We will do both, as it provides great intuition behind the backprop calculation.
To use the chain rule to get derivative [5], we note that we have already computed the following:
Noting that the product of the first two equations gives us
if we then continue using the chain rule and multiply this result by
then we get
which is nothing more than
or written out long hand
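Explicitly:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{da}{dz} \cdot \frac{\partial z}{\partial w} = (a-y) \, x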
So that’s the ‘chain rule way’. Now let’s compute ‘dw’ directly:
To compute it directly, we first take our cost function
We can notice that the first log term ‘ln(a)’ can be expanded to
Which simplifies to:
And if we take the second log term ‘ln(1-a)’, which can be shown as
Taking the log of the numerator (we will leave the denominator), we get
This result comes from the rule of logs, which states: log(p/q) = log(p) - log(q).
Plugging these formulas back into our original cost function, we get
Expanding the term in the square brackets we get
The first and last terms, ‘y ln(1+e^-z)’, cancel out, leaving:
Which we can rearrange by pulling the ‘yz’ term to the outside to give
Here’s where it gets interesting: we take the exp of the ‘z’ inside the square brackets and then immediately take its log, i.e. we rewrite z as ln(e^z):
Next we can take advantage of the rule for the sum of logs, ln(a) + ln(b) = ln(a·b), combined with the rule for products of exponentials, e^a * e^b = e^(a+b), to get
followed by
Pulling the ‘yz’ term inside the brackets, we get:
Finally, we note that z = Wx + b; therefore, taking the derivative w.r.t W:
The first term ‘yz’ becomes ‘yx’ and the second term becomes:
Note that the 2nd term is nothing but
Which gives a final result of
We can rearrange by pulling ‘x’ out to give
which gives
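In summary, the direct route runs:

L = -[yz - \ln(1+e^{z})]
\frac{\partial L}{\partial w} = -\Big[yx - \frac{e^{z}}{1+e^{z}} x\Big] = -[yx - ax] = (a-y)x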
[6] Derivative w.r.t bias
Again we could use the chain rule, which would be:
This is easy to solve, as we have already computed ‘dz’, and the second term is the derivative of ‘z’ (which is ‘wX + b’) w.r.t ‘b’, which is simply 1!
so the derivative w.r.t b is simply
which we already calculated earlier as
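In symbols:

\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial b} = (a-y) \cdot 1 = a - y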
For completeness we will also show how to calculate ‘db’ directly. To do this we will pick up from the above calculation for ‘dw’, just before we did the differentiation:
Remembering that z = wX + b, and that we are trying to find the derivative of the function w.r.t b, if we take the derivative w.r.t b of both terms, ‘yz’ and ‘ln(1+e^z)’, we get
It’s important to note the parentheses here, as they clarify how we get our derivative.
Taking the LHS first, the derivative of ‘wX’ w.r.t ‘b’ is zero as it doesn’t contain b! The derivative of ‘b’ is simply 1, so we are just left with the ‘y’ outside the parentheses.
For the RHS, we do the same as we did when calculating ‘dw’, except this time, when taking the derivative of the inner function ‘e^(wX+b)’, we take it w.r.t ‘b’ (instead of ‘w’), which gives the following result (this is because the derivative of the exponent w.r.t ‘b’ evaluates to 1):
This term is simply our original
So, putting the whole thing together, we get
which we have already shown is simply ‘dz’!
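In full:

\frac{\partial L}{\partial b} = -\Big[y - \frac{e^{z}}{1+e^{z}}\Big] = -[y - a] = a - y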
So that concludes all the derivatives of our Neural Network. We have calculated all of the following:
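namely dz = a - y, dw = (a - y)x and db = a - y. A quick way to convince yourself of these results is to compare them against numerical gradients. Here is a minimal sketch of such a check (the helper names and test values are my own, chosen purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, x, y):
    # forward pass: linear -> sigmoid -> cross entropy
    a = sigmoid(w * x + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

x, y = 1.5, 1.0
w, b = 0.4, -0.2

# analytic derivatives derived above
a = sigmoid(w * x + b)
dz = a - y            # derivative of the cost w.r.t. z (identical to db)
dw = (a - y) * x      # derivative of the cost w.r.t. w
db = a - y            # derivative of the cost w.r.t. b

# numerical (finite difference) derivatives for comparison
eps = 1e-6
dw_num = (cost(w + eps, b, x, y) - cost(w - eps, b, x, y)) / (2 * eps)
db_num = (cost(w, b + eps, x, y) - cost(w, b - eps, x, y)) / (2 * eps)

print(dw, dw_num)   # the two values should agree to several decimal places
print(db, db_num)
```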
Wrapping up
And what about the result:
well, we can unpack the chain rule to explain:
Note that the term
is simply ‘dz’, the term we calculated earlier:
and the term
evaluates to W[l]; in other words, the derivative of our linear function Z = ‘Wa + b’ w.r.t ‘a’ equals ‘W’.
and finally the last term
is simply ‘da/dz’, the derivative of the sigmoid function that we calculated earlier!
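In the course’s layer indexing, the whole chain reads roughly:

dZ^{[l-1]} = \underbrace{W^{[l]T} dZ^{[l]}}_{dA^{[l-1]}} \ast \underbrace{g^{[l-1]\prime}(Z^{[l-1]})}_{da/dz}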
As a final note on the notation used in the Coursera Deep Learning course, in the result
we perform element-wise multiplication between DZ and g’(Z); this ensures that all the dimensions of our matrix operations match up as expected.
So there we have it…
… all the derivatives required for backprop as shown in Andrew Ng’s Deep Learning course.
Simply reading through these calculus calculations (or any others for that matter) won’t be enough to make them stick in your mind. The best way to learn is to lock yourself in a room and practice, practice, practice!
What next?
If you got something out of this post, please share with others who may benefit, follow me Patrick David for more ML posts or on twitter @pdquant and give it a cynical/pity/genuine round of applause!