All the Backpropagation derivatives
So you’ve completed Andrew Ng’s Deep Learning course on Coursera.
You know that ForwardProp looks like this:
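Roughly, in the course’s vectorized notation for a layer l:

Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}
A^{[l]} = g^{[l]}(Z^{[l]})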
And you know that Backprop looks like this:
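Again roughly, for a batch of m training examples:

dZ^{[l]} = dA^{[l]} \ast g^{[l]\prime}(Z^{[l]})
dW^{[l]} = \tfrac{1}{m} \, dZ^{[l]} A^{[l-1]T}
db^{[l]} = \tfrac{1}{m} \textstyle\sum_{i=1}^{m} dZ^{[l](i)}
dA^{[l-1]} = W^{[l]T} dZ^{[l]}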
But do you know how to derive these formulas?
TL;DR
Full derivations of all the Backpropagation derivatives used in the Coursera Deep Learning course, using both the chain rule and direct computation.
If you’ve been through backpropagation and not understood how results such as
and
are derived, or if you want to understand the direct computation as well as simply using the chain rule, then read on…
Our Neural Network
This is the simple Neural Net we will be working with, where x, W and b are our inputs, the “z’s” are the linear functions of our inputs, the “a’s” are the (sigmoid) activation functions, and the final term is our Cross Entropy or Negative Log Likelihood cost function.
So here’s the plan: we will work backwards from our cost function and compute directly the derivative of the cost with respect to (w.r.t) each of the preceding elements in our Neural Network.
As well as computing these values directly, we will also show the chain rule derivation.
# Note: we don’t differentiate our input ‘X’ because these are fixed values that we are given and therefore don’t optimize over.
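To make the pieces concrete, here is a minimal NumPy sketch of the forward pass and cost for a single-unit version of this network (the variable names and example values are mine, not the course’s starter code):

```python
import numpy as np

def sigmoid(z):
    # logistic (sigmoid) activation: 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

# one training example; scalar weight and bias to keep the shapes trivial
x, y = 2.0, 1.0      # input and label
w, b = 0.1, -0.3     # the parameters we optimise

z = w * x + b                                    # linear function of the input
a = sigmoid(z)                                   # sigmoid activation
L = -(y * np.log(a) + (1 - y) * np.log(1 - a))   # cross entropy / negative log likelihood cost
```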
[1] Derivative w.r.t activation function
To start, we will take the derivative of our cost function w.r.t the activation function ‘a’.
We are taking the derivative of the Negative Log Likelihood function (Cross Entropy), which when expanded looks like this:
First, let’s move the minus sign on the left of the brackets and distribute it inside the brackets, so we get:
Next we differentiate the left hand side:
The right hand side is more complex, as the derivative of ln(1-a) is not simply 1/(1-a); we must use the chain rule to multiply the derivative of the inner function by that of the outer.
The derivative of (1-a) is -1, which gives the final result:
And the proof of the derivative of a log being the inverse is as follows:
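In short (writing L for the cost), since the derivative of ln(a) with respect to a is 1/a:

L = -[y \ln(a) + (1-y) \ln(1-a)]
\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}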
[2] Derivative of sigmoid
It is useful at this stage to compute the derivative of the sigmoid activation function, as we will need it later on.
Our logistic function (sigmoid) is given as:
First, it is convenient to rearrange this function into the following form, as it allows us to use the chain rule to differentiate:
Now, using the chain rule (multiplying the derivative of the outer function by that of the inner), we get
which rearranged gives
Here’s the clever part: we can separate this into the product of two fractions and, with a bit of algebraic magic, add a ‘1’ to the second numerator and immediately take it away again:
The RHS then simplifies to
Which is nothing more than
Which gives a final result of
Or alternatively:
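Collected in one place, the steps are:

a = \sigma(z) = \frac{1}{1+e^{-z}} = (1+e^{-z})^{-1}
\frac{da}{dz} = \frac{e^{-z}}{(1+e^{-z})^{2}} = \frac{1}{1+e^{-z}} \cdot \frac{(1+e^{-z}) - 1}{1+e^{-z}} = a(1-a) = \sigma(z)(1-\sigma(z))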
[3] Derivative w.r.t linear function
To get this result we can use the chain rule, multiplying the two results we’ve already calculated in [1] and [2]:
If we can get a common denominator on the left hand side of the equation, then we can simplify it, so let’s multiply the top and bottom of the first fraction by ‘(1-a)’ and the top and bottom of the second fraction by ‘a’:
With a common denominator we can simplify to
Now we multiply the LHS by the RHS; the a(1-a) terms cancel out and we are left with just the numerator from the LHS!
which if we expand out gives:
note that ‘ya’ is the same as ‘ay’, so they cancel to give
which rearranges to give our final result for the derivative:
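That is:

\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \cdot \frac{da}{dz} = \frac{a-y}{a(1-a)} \cdot a(1-a) = a - y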
[4] Derivative w.r.t weights
This derivative is trivial to compute, as z is simply
and the derivative of z w.r.t the weights evaluates to
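In other words:

z = wx + b \quad\Rightarrow\quad \frac{\partial z}{\partial w} = x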
[5] Derivative w.r.t weights (2)
This derivative can be computed in two different ways! We can use the chain rule or compute it directly. We will do both, as it provides great intuition behind the backprop calculation.
To use the chain rule to get derivative [5], we note that we have already computed the following:
Noting that the product of the first two equations gives us
if we then continue using the chain rule and multiply this result by
then we get
which is nothing more than
or written out long hand
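Explicitly:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{da}{dz} \cdot \frac{\partial z}{\partial w} = (a-y) \, x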
So that’s the ‘chain rule way’. Now let’s compute ‘dw’ directly:
To compute it directly, we first take our cost function
We can notice that the first log term ‘ln(a)’ can be expanded to
Which simplifies to:
And if we take the second log term ‘ln(1-a)’, which can be shown as
Taking the log of the numerator (we will leave the denominator), we get
This result comes from the rule of logs, which states: log(p/q) = log(p) - log(q).
Plugging these formulas back into our original cost function, we get
Expanding the term in the square brackets we get
The first and last terms, ‘y ln(1+e^-z)’, cancel out, leaving:
Which we can rearrange by pulling the ‘yz’ term to the outside to give
Here’s where it gets interesting: we take the exp of the ‘z’ inside the square brackets and then immediately take its log, i.e. we rewrite z as ln(e^z):
Next we can take advantage of the rule for the sum of logs, ln(a) + ln(b) = ln(a·b), combined with the rule for products of exponentials, e^a * e^b = e^(a+b), to get
followed by
Pulling the ‘yz’ term inside the brackets, we get:
Finally, we note that z = Wx + b; therefore, taking the derivative w.r.t W:
The first term ‘yz’ becomes ‘yx’ and the second term becomes:
Note that the 2nd term is nothing but
Which gives a final result of
We can rearrange by pulling ‘x’ out to give
which gives
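In summary, the direct route runs:

L = -[yz - \ln(1+e^{z})]
\frac{\partial L}{\partial w} = -\Big[yx - \frac{e^{z}}{1+e^{z}} x\Big] = -[yx - ax] = (a-y)x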
[6] Derivative w.r.t bias
Again we could use the chain rule, which would be:
This is easy to solve, as we have already computed ‘dz’, and the second term is the derivative of ‘z’ (which is ‘wX + b’) w.r.t ‘b’, which is simply 1!
so the derivative w.r.t b is simply
which we already calculated earlier as
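In symbols:

\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial b} = (a-y) \cdot 1 = a - y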
For completeness we will also show how to calculate ‘db’ directly. To do this we will pick up from the above calculation for ‘dw’, just before we did the differentiation:
Remembering that z = wX + b, and that we are trying to find the derivative of the function w.r.t b, if we take the derivative w.r.t b of both terms, ‘yz’ and ‘ln(1+e^z)’, we get
It’s important to note the parentheses here, as they clarify how we get our derivative.
Taking the LHS first, the derivative of ‘wX’ w.r.t ‘b’ is zero as it doesn’t contain b! The derivative of ‘b’ is simply 1, so we are just left with the ‘y’ outside the parentheses.
For the RHS, we do the same as we did when calculating ‘dw’, except this time, when taking the derivative of the inner function ‘e^(wX+b)’, we take it w.r.t ‘b’ (instead of ‘w’), which gives the following result (this is because the derivative of the exponent w.r.t ‘b’ evaluates to 1):
This term is simply our original
So, putting the whole thing together, we get
which we have already shown is simply ‘dz’!
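In full:

\frac{\partial L}{\partial b} = -\Big[y - \frac{e^{z}}{1+e^{z}}\Big] = -[y - a] = a - y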
So that concludes all the derivatives of our Neural Network. We have calculated all of the following:
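namely dz = a - y, dw = (a - y)x and db = a - y. A quick way to convince yourself of these results is to compare them against numerical gradients. Here is a minimal sketch of such a check (the helper names and test values are my own, chosen purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, x, y):
    # forward pass: linear -> sigmoid -> cross entropy
    a = sigmoid(w * x + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

x, y = 1.5, 1.0
w, b = 0.4, -0.2

# analytic derivatives derived above
a = sigmoid(w * x + b)
dz = a - y            # derivative of the cost w.r.t. z (identical to db)
dw = (a - y) * x      # derivative of the cost w.r.t. w
db = a - y            # derivative of the cost w.r.t. b

# numerical (finite difference) derivatives for comparison
eps = 1e-6
dw_num = (cost(w + eps, b, x, y) - cost(w - eps, b, x, y)) / (2 * eps)
db_num = (cost(w, b + eps, x, y) - cost(w, b - eps, x, y)) / (2 * eps)

print(dw, dw_num)   # the two values should agree to several decimal places
print(db, db_num)
```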
Wrapping up
And what about the result:
well, we can unpack the chain rule to explain:
Note that the term
is simply ‘dz’, the term we calculated earlier:
and the term
evaluates to W[l]; in other words, the derivative of our linear function Z = ‘Wa + b’ w.r.t ‘a’ equals ‘W’.
and finally the last term
is simply ‘da/dz’, the derivative of the sigmoid function that we calculated earlier!
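In the course’s layer indexing, the whole chain reads roughly:

dZ^{[l-1]} = \underbrace{W^{[l]T} dZ^{[l]}}_{dA^{[l-1]}} \ast \underbrace{g^{[l-1]\prime}(Z^{[l-1]})}_{da/dz}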
As a final note on the notation used in the Coursera Deep Learning course, in the result
we perform element-wise multiplication between DZ and g’(Z); this ensures that all the dimensions of our matrix operations match up as expected.
So there we have it…
… all the derivatives required for backprop as shown in Andrew Ng’s Deep Learning course.
Simply reading through these calculus calculations (or any others for that matter) won’t be enough to make them stick in your mind. The best way to learn is to lock yourself in a room and practice, practice, practice!
What next?
If you got something out of this post, please share with others who may benefit, follow me Patrick David for more ML posts or on twitter @pdquant and give it a cynical/pity/genuine round of applause!