# All the Backpropagation derivatives

So you’ve completed Andrew Ng’s Deep Learning course on Coursera,

You know that ForwardProp looks like this:

And you know that Backprop looks like this:

But do you know how to derive these formulas?

# TL;DR

*Full derivations of all Backpropagation derivatives used in Coursera Deep Learning, using both chain rule and direct computation.*

If you’ve been through backpropagation and not understood how results such as

and

are derived, if you want to understand the direct computation as well as simply using chain rule, then read on…

# Our Neural Network

This is the simple Neural Net we will be working with, where x, W and b are our inputs, the “z’s” are the linear functions of our inputs, the “a’s” are the (sigmoid) activation functions and the final $L$ is our Cross Entropy or Negative Log Likelihood cost function.

So here’s the plan: we will work backwards from our cost function $L$ and compute directly the derivative of $L$ with respect to (*w.r.t.*) each of the preceding elements in our Neural Network: the activation $a$, the linear function $z$, the weights $w$ and the bias $b$.

As well as computing these values *directly*, we will also show the *chain rule* derivation.

**# Note: we don’t differentiate our input ‘X’ because these are fixed values that we are given and therefore don’t optimize over.**

# [1] Derivative w.r.t activation function

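So to start we will take the derivative of our cost function $L$ w.r.t. the activation function $a$:

$$\frac{\partial L}{\partial a}$$

So we are taking the derivative of the Negative Log Likelihood function (Cross Entropy), which when expanded looks like this:

$$L(a, y) = -\left[\, y\ln(a) + (1-y)\ln(1-a) \,\right]$$

First let’s take the minus sign on the left of the brackets and distribute it inside the brackets, so we get:

$$L(a, y) = -y\ln(a) - (1-y)\ln(1-a)$$

Next we differentiate the left-hand term:

$$\frac{\partial}{\partial a}\left[-y\ln(a)\right] = -\frac{y}{a}$$

The right-hand term is more complex, as the derivative of ln(1-a) is not simply 1/(1-a); we must use chain rule and multiply the derivative of the outer function by the derivative of the inner function:

$$\frac{\partial}{\partial a}\left[-(1-y)\ln(1-a)\right] = -(1-y)\cdot\frac{1}{1-a}\cdot\frac{\partial}{\partial a}(1-a)$$

The derivative of (1-a) is -1, which gives the final result:

$$\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}$$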
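The result for this step, dL/da = -y/a + (1-y)/(1-a), can be sanity-checked numerically. The sketch below is not from the course; the helper names (`cross_entropy`, `dL_da`, `central_difference`) and the sample values are made up for illustration. It compares the analytic derivative against a central finite difference.

```python
import math

def cross_entropy(a, y):
    """Loss for a single example: L = -[y ln(a) + (1 - y) ln(1 - a)]."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def dL_da(a, y):
    """Analytic derivative of the loss w.r.t. the activation a."""
    return -y / a + (1 - y) / (1 - a)

def central_difference(f, x, eps=1e-6):
    """Numerical derivative of f at x (hypothetical helper)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Illustrative values only: an activation of 0.7 with true label 1.
a, y = 0.7, 1.0
analytic = dL_da(a, y)
numeric = central_difference(lambda t: cross_entropy(t, y), a)
print(analytic, numeric)  # the two values should agree closely
```

Trying a label of 0 instead exercises the second term of the formula.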

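And the proof that the derivative of a log is the inverse is as follows. Let $u = \ln(x)$, so that $x = e^{u}$. Differentiating $x$ w.r.t. $u$ gives $\frac{dx}{du} = e^{u} = x$, and inverting this gives

$$\frac{du}{dx} = \frac{d}{dx}\ln(x) = \frac{1}{x}$$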

# [2] Derivative of sigmoid

It is useful at this stage to compute the derivative of the sigmoid activation function, as we will need it later on.

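Our logistic function (sigmoid) is given as:

$$a = \sigma(z) = \frac{1}{1+e^{-z}}$$

First it is convenient to rearrange this function to the following form, as it allows us to use the chain rule to differentiate:

$$\sigma(z) = \left(1+e^{-z}\right)^{-1}$$

Now using chain rule, multiplying the outer derivative by the inner, gives

$$\frac{\partial \sigma}{\partial z} = -\left(1+e^{-z}\right)^{-2}\cdot\left(-e^{-z}\right)$$

which rearranged gives

$$\frac{\partial \sigma}{\partial z} = \frac{e^{-z}}{\left(1+e^{-z}\right)^{2}}$$

Here’s the clever part. We can separate this into the product of two fractions and, with a bit of algebraic magic, add a ‘1’ to the second numerator and immediately take it away again:

$$\frac{\partial \sigma}{\partial z} = \frac{1}{1+e^{-z}}\cdot\frac{1+e^{-z}-1}{1+e^{-z}}$$

The RHS then simplifies to

$$\frac{1}{1+e^{-z}}\cdot\left(\frac{1+e^{-z}}{1+e^{-z}} - \frac{1}{1+e^{-z}}\right)$$

Which is nothing more than

$$\frac{1}{1+e^{-z}}\cdot\left(1 - \frac{1}{1+e^{-z}}\right)$$

Which gives a final result of

$$\frac{\partial \sigma}{\partial z} = \sigma(z)\left(1-\sigma(z)\right)$$

Or alternatively:

$$\frac{\partial a}{\partial z} = a(1-a)$$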
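The sigmoid identity da/dz = a(1-a) is easy to check numerically. This is a minimal sketch, assuming nothing beyond the definition of the sigmoid; the function names and test points are my own choices.

```python
import math

def sigmoid(z):
    """Logistic function: 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative written in terms of the activation: a * (1 - a)."""
    a = sigmoid(z)
    return a * (1.0 - a)

def central_difference(f, x, eps=1e-6):
    """Numerical derivative of f at x (hypothetical helper)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Compare the closed form with a finite difference at a few arbitrary points.
for z in (-2.0, 0.0, 3.0):
    print(z, sigmoid_prime(z), central_difference(sigmoid, z))
```

At z = 0 the activation is 0.5, so the derivative takes its maximum value of 0.25.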

# [3] Derivative w.r.t linear function

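To get this result we can use chain rule, multiplying the two results we’ve already calculated in [1] and [2]:

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z} = \left[-\frac{y}{a} + \frac{1-y}{1-a}\right]\cdot a(1-a)$$

If we can get a common denominator in the left-hand factor, then we can simplify the equation, so let’s multiply the top and bottom of the first fraction by ‘(1-a)’ and of the second fraction by ‘a’:

$$\left[\frac{-y(1-a)}{a(1-a)} + \frac{a(1-y)}{a(1-a)}\right]\cdot a(1-a)$$

With a common denominator we can simplify to

$$\left[\frac{-y(1-a) + a(1-y)}{a(1-a)}\right]\cdot a(1-a)$$

Now we multiply the LHS by the RHS; the a(1-a) terms cancel out and we are left with just the numerator from the LHS!

$$-y(1-a) + a(1-y)$$

which if we expand out gives:

$$-y + ya + a - ay$$

Note that ‘ya’ is the same as ‘ay’, so they cancel to give

$$-y + a$$

which rearranges to give our final result for the derivative:

$$\frac{\partial L}{\partial z} = a - y$$

or, in the shorthand used in the course, $dz = a - y$.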
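To see the cancellation at work, the sketch below multiplies results [1] and [2] numerically and compares the product with the simplified form a - y. The function names and sample inputs are illustrative assumptions, not course code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dz_chain(z, y):
    """dL/dz as the product of result [1] (dL/da) and result [2] (da/dz)."""
    a = sigmoid(z)
    dL_da = -y / a + (1 - y) / (1 - a)
    da_dz = a * (1 - a)
    return dL_da * da_dz

def dz_simplified(z, y):
    """dL/dz after the algebra above: a - y."""
    return sigmoid(z) - y

# Arbitrary (z, y) pairs covering both labels.
for z, y in ((0.3, 1.0), (-1.2, 0.0), (2.5, 1.0)):
    print(z, y, dz_chain(z, y), dz_simplified(z, y))
```

The simplified form is also the safer one to evaluate, since it never divides by a or (1 - a).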

# [4] Derivative w.r.t weights

This derivative is trivial to compute, as z is simply

$$z = wx + b$$

and the derivative simply evaluates to

$$\frac{\partial z}{\partial w} = x$$

# [5] Derivative w.r.t weights (2)

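This derivative can be computed **two different ways!** We can use **chain rule** or **compute directly**. We will do both, as it provides a great intuition behind the backprop calculation.

To use chain rule to get derivative [5] we note that we have already computed the following:

$$\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}, \qquad \frac{\partial a}{\partial z} = a(1-a), \qquad \frac{\partial z}{\partial w} = x$$

Noting that the product of the first two equations gives us

$$\frac{\partial L}{\partial z} = a - y$$

if we then continue using the chain rule and multiply this result by

$$\frac{\partial z}{\partial w} = x$$

then we get

$$\frac{\partial L}{\partial z}\cdot\frac{\partial z}{\partial w} = (a - y)\,x$$

which is nothing more than

$$\frac{\partial L}{\partial w} = (a - y)\,x$$

or written out longhand

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial w}$$

So that’s the ‘*chain rule way*’. Now let’s compute ‘dw’ *directly*: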

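To compute **directly**, we first take our cost function

$$L = -\left[\, y\ln(a) + (1-y)\ln(1-a) \,\right]$$

We can notice that the first log term ‘ln(a)’ can be expanded to

$$\ln(a) = \ln\left(\frac{1}{1+e^{-z}}\right) = \ln(1) - \ln\left(1+e^{-z}\right)$$

Which simplifies (since ln(1) = 0) to:

$$\ln(a) = -\ln\left(1+e^{-z}\right)$$

And if we take the second log function ‘ln(1-a)’, which can be shown as

$$\ln(1-a) = \ln\left(\frac{e^{-z}}{1+e^{-z}}\right)$$

taking the log of the numerator (we will leave the denominator) we get

$$\ln(1-a) = -z - \ln\left(1+e^{-z}\right)$$

This result comes from the rule of logs, which states: log(p/q) = log(p) - log(q).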
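The two log expansions used here, ln(a) = -ln(1+e^-z) and ln(1-a) = -z - ln(1+e^-z), can be confirmed numerically before they are substituted into the cost function. The helper names below are hypothetical and the z values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_a_expanded(z):
    """ln(a) = ln(1) - ln(1 + e^-z) = -ln(1 + e^-z)."""
    return -math.log(1.0 + math.exp(-z))

def log_one_minus_a_expanded(z):
    """ln(1 - a) = ln(e^-z) - ln(1 + e^-z) = -z - ln(1 + e^-z)."""
    return -z - math.log(1.0 + math.exp(-z))

# Print the gap between the direct logs and the expanded forms.
for z in (-1.5, 0.0, 2.0):
    a = sigmoid(z)
    gap_first = math.log(a) - log_a_expanded(z)
    gap_second = math.log(1.0 - a) - log_one_minus_a_expanded(z)
    print(z, gap_first, gap_second)  # both gaps should be about zero
```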

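Plugging these formulas back into our original cost function we get

$$L = -\left[\, -y\ln\left(1+e^{-z}\right) + (1-y)\left(-z - \ln\left(1+e^{-z}\right)\right) \right]$$

Expanding the term in the square brackets we get

$$L = -\left[\, -y\ln\left(1+e^{-z}\right) - z - \ln\left(1+e^{-z}\right) + yz + y\ln\left(1+e^{-z}\right) \right]$$

The first and last terms ‘yln(1+e^-z)’ cancel out, leaving:

$$L = -\left[\, -z - \ln\left(1+e^{-z}\right) + yz \,\right]$$

Which we can rearrange by pulling the ‘yz’ term to the outside to give

$$L = -yz + \left[\, z + \ln\left(1+e^{-z}\right) \right]$$

Here’s where it gets interesting: we add an exp term to the ‘z’ inside the square brackets and then immediately take its log

$$L = -yz + \left[\, \ln\left(e^{z}\right) + \ln\left(1+e^{-z}\right) \right]$$

Next we can take advantage of the rule for a sum of logs, $\ln(p) + \ln(q) = \ln(pq)$, combined with the rule for products of exponentials, $e^{p}\cdot e^{q} = e^{p+q}$, to get

$$L = -yz + \ln\left(e^{z} + e^{z-z}\right)$$

followed by

$$L = -yz + \ln\left(1+e^{z}\right)$$

Pulling the ‘yz’ term inside the brackets we get:

$$L = -\left[\, yz - \ln\left(1+e^{z}\right) \right]$$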
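The end point of this algebra is the compact form L = -[yz - ln(1+e^z)]. As a quick check that no sign was lost along the way, the sketch below evaluates the original cross-entropy and the compact form side by side; the function names and inputs are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_original(z, y):
    """Cross entropy written in terms of the activation a = sigmoid(z)."""
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def loss_simplified(z, y):
    """The same loss after the log manipulations: -[y z - ln(1 + e^z)]."""
    return -(y * z - math.log(1.0 + math.exp(z)))

# Arbitrary (z, y) pairs; the two columns should match.
for z, y in ((0.4, 1.0), (-2.0, 0.0), (1.7, 0.0)):
    print(z, y, loss_original(z, y), loss_simplified(z, y))
```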

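Finally we note that z = wx + b, therefore taking the derivative w.r.t. w:

$$\frac{\partial L}{\partial w} = -\left[\, \frac{\partial}{\partial w}(yz) - \frac{\partial}{\partial w}\ln\left(1+e^{z}\right) \right]$$

The first term ‘yz’ becomes ‘yx’ and the second term becomes:

$$\frac{e^{z}}{1+e^{z}}\cdot x$$

Note that the 2nd term (dividing top and bottom by e^z) is nothing but

$$\frac{1}{1+e^{-z}}\cdot x = a\,x$$

Which gives a final result of

$$\frac{\partial L}{\partial w} = -\left[\, yx - ax \,\right]$$

We can rearrange by pulling ‘x’ out to give

$$\frac{\partial L}{\partial w} = -x\left[\, y - a \,\right]$$

which gives

$$\frac{\partial L}{\partial w} = (a - y)\,x$$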
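Both routes arrive at dL/dw = (a - y)x. A finite-difference check on w gives a third, independent confirmation. This is only a sketch for a single scalar weight; the values chosen for w, b, x and y are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Cross entropy of a single-neuron network with z = w*x + b."""
    a = sigmoid(w * x + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def dL_dw(w, b, x, y):
    """Analytic gradient from the derivation: (a - y) * x."""
    return (sigmoid(w * x + b) - y) * x

# Illustrative parameter and data values.
w, b, x, y = 0.5, -0.3, 2.0, 1.0
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
analytic = dL_dw(w, b, x, y)
print(analytic, numeric)  # the two values should agree closely
```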

# [6] Derivative w.r.t bias

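Again we could use **chain rule**, which would be

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z}\cdot\frac{\partial z}{\partial b}$$

This is easy to solve, as we already computed ‘dz’, and the second term is the derivative of ‘z’ (which is ‘wx + b’) w.r.t. ‘b’, which is simply 1!

So the derivative w.r.t. b is simply

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z}$$

which we already calculated earlier as

$$\frac{\partial L}{\partial z} = a - y$$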

For completeness we will also show how to calculate ‘db’ **directly**. To calculate this we will take a step from the above calculation for ‘dw’ (from just before we did the differentiation):

$$L = -\left[\, yz - \ln\left(1+e^{z}\right) \right]$$

Remembering that z = wx + b and that we are trying to find the derivative of the function w.r.t. b, if we take the derivative w.r.t. b of both terms ‘yz’ and ‘ln(1+e^z)’ we get

$$\frac{\partial L}{\partial b} = -\left[\, y\,\frac{\partial}{\partial b}(wx + b) - \frac{\partial}{\partial b}\ln\left(1+e^{wx+b}\right) \right]$$

It’s important to note the parentheses here, as they clarify how we get our derivative.

Taking the LHS first, the derivative of ‘wx’ w.r.t. ‘b’ is zero as it doesn’t contain b! The derivative of ‘b’ is simply 1, so we are just left with the ‘y’ outside the parentheses.

For the RHS, we do the same as we did when calculating ‘dw’, except this time when taking the derivative of the inner function ‘e^(wx+b)’ we take it w.r.t. ‘b’ (instead of ‘w’), which gives the following result (this is because the derivative of the exponent w.r.t. b evaluates to 1):

$$\frac{e^{wx+b}}{1+e^{wx+b}}\cdot 1$$

This term is simply our original

$$a = \frac{1}{1+e^{-z}}$$

so putting the whole thing together we get

$$\frac{\partial L}{\partial b} = -\left[\, y - a \,\right] = a - y$$

which we have already shown is simply ‘dz’!
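The direct route for ‘db’ can be mirrored in a few lines: differentiate the compact loss -[yz - ln(1+e^z)] w.r.t. b term by term, then compare with the chain rule answer a - y. The names `db_direct` and `db_chain` and the sample values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def db_direct(w, b, x, y):
    """Term-by-term derivative of -[y z - ln(1 + e^z)] w.r.t. b, with z = w*x + b."""
    z = w * x + b
    first_term = y * 1.0                                      # d(yz)/db = y, since dz/db = 1
    second_term = math.exp(z) / (1.0 + math.exp(z)) * 1.0     # d ln(1 + e^z)/db
    return -(first_term - second_term)

def db_chain(w, b, x, y):
    """Chain rule result: dL/db = dL/dz * 1 = a - y."""
    return sigmoid(w * x + b) - y

# Illustrative values only.
w, b, x, y = -0.8, 0.25, 1.5, 0.0
print(db_direct(w, b, x, y), db_chain(w, b, x, y))  # should match
```

The second term inside `db_direct` is the sigmoid in disguise, which is why the two functions agree.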

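So that concludes all the derivatives of our Neural Network. **We have calculated all of the following:**

$$\begin{aligned}
[1]\quad \frac{\partial L}{\partial a} &= -\frac{y}{a} + \frac{1-y}{1-a} \\
[2]\quad \frac{\partial a}{\partial z} &= a(1-a) \\
[3]\quad \frac{\partial L}{\partial z} &= a - y \\
[4]\quad \frac{\partial z}{\partial w} &= x \\
[5]\quad \frac{\partial L}{\partial w} &= (a - y)\,x \\
[6]\quad \frac{\partial L}{\partial b} &= a - y
\end{aligned}$$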

# Wrapping up

And what about the result:

well, we can unpack the chain rule to explain:

Note that the term

is simply ‘dz’ the term we calculated earlier:

and the term

evaluates to W[l]; in other words, the derivative of our linear function ‘Z = Wa + b’ w.r.t. ‘a’ equals ‘W’.

and finally the term in blue

is simply ‘da/dz’, the derivative of the sigmoid function that we calculated earlier!

As a final note on the notation used in the Coursera Deep Learning course, in the result

we perform element-wise multiplication between DZ and g’(Z); this ensures that all the dimensions of our matrix multiplications match up as expected.
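To make the element-wise step concrete, here is a small sketch of a two-layer sigmoid network in plain Python lists. It assumes the Coursera form of the hidden-layer result, dZ[1] = W[2]ᵀ dZ[2] * g’(Z[1]); the layer sizes, parameter values and function names are all made up for illustration. Because dL/db[1] equals dZ[1] for a single example, the result is checked against a finite difference on each hidden bias.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Two hidden sigmoid units feeding one sigmoid output unit."""
    z1 = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W1, b1)]
    a1 = [sigmoid(z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2
    return a1, sigmoid(z2)

def loss(x, y, W1, b1, W2, b2):
    _, a2 = forward(x, W1, b1, W2, b2)
    return -(y * math.log(a2) + (1 - y) * math.log(1 - a2))

def hidden_dz(x, y, W1, b1, W2, b2):
    """dZ1 = (W2^T dZ2) * g'(Z1), multiplied element by element."""
    a1, a2 = forward(x, W1, b1, W2, b2)
    dz2 = a2 - y                                  # output layer: dz = a - y
    da1 = [w * dz2 for w in W2]                   # W2^T dZ2, one entry per hidden unit
    g_prime = [a * (1 - a) for a in a1]           # sigmoid derivative per hidden unit
    return [d * g for d, g in zip(da1, g_prime)]  # element-wise product

# Hypothetical parameters and a single training example.
x, y = [0.5, -1.2], 1.0
W1, b1 = [[0.1, -0.3], [0.8, 0.2]], [0.0, 0.1]
W2, b2 = [0.7, -0.5], 0.2

dz1 = hidden_dz(x, y, W1, b1, W2, b2)

# Finite-difference check: perturb each hidden bias in turn.
eps = 1e-6
numeric = []
for i in range(len(b1)):
    plus, minus = list(b1), list(b1)
    plus[i] += eps
    minus[i] -= eps
    numeric.append((loss(x, y, W1, plus, W2, b2) - loss(x, y, W1, minus, W2, b2)) / (2 * eps))

print(dz1, numeric)  # the two lists should agree closely
```

Both `da1` and `g_prime` have one entry per hidden unit, so multiplying them element by element keeps the shape of dZ1 equal to the shape of Z1.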

# So there we have it…

… all the derivatives required for backprop as shown in Andrew Ng’s Deep Learning course.

Simply reading through these calculus calculations (*or any others for that matter*) won’t be enough to make it stick in your mind. The best way to learn is to lock yourself in a room and **practice, practice, practice!**

# What next?

If you got something out of this post, please share with others who may benefit, follow me, Patrick David, for more ML posts or on Twitter @pdquant, and give it a cynical/pity/genuine round of applause!