An exploratory guide to deep learning — Backpropagation

vinnu vinay
5 min readFeb 25, 2018


This is part 2 of the series; for the introduction, see part 1. In machine learning we often start with an inductive bias, i.e. before we let the data determine the right set of weights or parameters, we constrain the function space. For example, in the case of linear models (SVMs, linear regression, etc.) we expect the underlying distribution to fall into our hypothesis space before actually looking at the data. Deep learning, though it offers enormous flexibility in the functions it can approximate, has historically been hard to train for big networks (with lots of weights).

Taken from Michael Nielsen’s book.

Consider the above figure from part 1. We initially start with random weights, which produce a random output. Now, if we could find a sensible way of measuring how far off this output is, and then tweak the weights (preferably all of them at once), we could expect the network to eventually arrive at the right set of weights. Alternatively, we could change them one by one, even in a network with millions of weights; that would spare us this post (no backprop needed). Is that a reasonable policy? Let’s figure it out. We will start with a linear model and see if we can find the right weight by constructively trying out different values.
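The snippet the next paragraph refers to was an image in the original post; here is a minimal reconstruction of the idea. The toy data (y = 3x, so the right weight is 3) and the summed squared-error loss are assumptions chosen so the search has a well-defined minimum:

```python
import numpy as np

# Toy data (an assumption): the target relationship is y = 3x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

def l_sum(w):
    """Total squared-error loss over the dataset."""
    y_pred = w * x
    return np.sum((y_pred - y) ** 2)

w = -7.3      # an arbitrary starting weight
step = 0.5    # arbitrary step size, as in the post
for _ in range(100):
    # Move w by 0.5 in whichever direction decreases l_sum;
    # stop when neither direction helps.
    if l_sum(w + step) < l_sum(w):
        w += step
    elif l_sum(w - step) < l_sum(w):
        w -= step
    else:
        break
```

The loop halts once w is within half a step of the true weight, which is the best this crude search can do.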

Look at the above code carefully: we started with a random ‘w’ and changed it by 0.5 (an arbitrary choice) in the direction of decreasing l_sum. Here the loss is defined as (y_pred - y).

loss vs w

But in the case of neural networks we have no way of knowing which direction to change ‘w’ to decrease the loss, because the relationship between the loss and the parameters is no longer linear (i.e. we can’t simply expect that reducing ‘w’ reduces the loss). Also, if we have multiple weights, say w1 and w2, we have to perturb them together (4 combinations of w1 and w2: increase both slightly, increase w1 and decrease w2, and so on) to find the right combination. You can quickly see how a network with millions of weights becomes infeasible to tweak this way.
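For the two-weight case, the joint search described above can be sketched as follows. The toy model (y_pred = w1·x1 + w2·x2 with true weights (2, -1)) and the data are assumptions for illustration; note that each step already requires trying all 2² = 4 sign combinations, and this doubles with every additional weight:

```python
import numpy as np
from itertools import product

# Toy two-weight model (an assumption): true weights are (2, -1).
X = np.array([[1.0, 2.0], [3.0, 1.0], [2.0, 2.0]])
y_true = X @ np.array([2.0, -1.0])

def l_sum(w):
    return np.sum((X @ w - y_true) ** 2)

w = np.array([0.0, 0.0])
step = 0.5
for _ in range(100):
    # Try all 2**2 = 4 joint perturbations of (w1, w2), keep the best.
    candidates = [w + step * np.array(signs)
                  for signs in product([1, -1], repeat=2)]
    best = min(candidates, key=l_sum)
    if l_sum(best) < l_sum(w):
        w = best
    else:
        break
```

With n weights the candidate list grows to 2**n entries per step, which is exactly why this brute-force policy collapses for large networks.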

CALCULUS AND THE CHAIN RULE:

This won’t be a complete introduction to calculus, but a refresher if you have already tackled it once. In particular, we will try to figure out how to tweak each variable of a multivariable function to move it in the right direction.

Khan Academy.

Consider the above function. For a function of one variable, the derivative is the ratio of the change in ‘y’ to a slight nudge in ‘x’. But if ‘f’ depends on more variables, we can define a partial derivative: the change in f when we change only one variable, keeping the rest constant. In the above function, if we hold y = 1 constant, then the partial derivative w.r.t. ‘x’ is the slope of the function on that plane. Similarly, we can take the partial derivative w.r.t. ‘y’ (holding ‘x’ constant).
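The exact function in the figure didn’t survive here, so as an illustration (f(x, y) = x²·y is an assumption), a partial derivative is just a one-variable numerical derivative with the other variable frozen:

```python
def f(x, y):
    return x**2 * y   # an assumed example function, for illustration only

def partial_x(g, x, y, h=1e-6):
    # Nudge only x, hold y fixed: numerical partial derivative w.r.t. x.
    return (g(x + h, y) - g(x - h, y)) / (2 * h)

def partial_y(g, x, y, h=1e-6):
    # Nudge only y, hold x fixed.
    return (g(x, y + h) - g(x, y - h)) / (2 * h)

# Analytically df/dx = 2*x*y and df/dy = x**2, so at (2, 1) both are 4.
dx = partial_x(f, 2.0, 1.0)
dy = partial_y(f, 2.0, 1.0)
```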

A cut with the plane x = -1.

But at a given point on the function’s surface, we could slice it with a plane in any direction. Thus, our nudge in the independent variables is a vector.

A slight nudge can be in any arbitrary direction.

This brings us to the concept of the directional derivative. Our task now is to find the right direction to decrease our loss function; this would let us tweak all the variables at once and eventually reach a minimum. We will shortly see that the gradient is the direction of steepest increase of a function, so we have to move in the opposite direction to decrease the function most rapidly, at least locally.

Gradient, or the direction of steepest ascent:

The gradient of a function is a vector indicating the direction of steepest increase at a point.

It is natural to see the directional derivative as the change in the function when we move along a given direction. It can be represented as a dot product between the gradient and the unit vector in that direction.

directional derivative.

To prove that the gradient is the direction of steepest ascent, it helps to view the directional derivative as the dot product between the gradient and a unit vector w.

directional derivative as dot product.

So, to maximize the change in the function for a given step along w, it makes sense to have the gradient and w point in the same direction: the dot product equals the length of the gradient times the cosine of the angle between them, and the cosine is largest when that angle is zero.
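This argument can be checked numerically. The function below (f(x, y) = x² + 3y²) and the evaluation point are assumptions chosen just to have a nontrivial gradient; we sweep unit vectors around the circle and see which direction gives the largest directional derivative:

```python
import numpy as np

def grad_f(x, y):
    # Analytic gradient of the assumed function f(x, y) = x**2 + 3*y**2.
    return np.array([2 * x, 6 * y])

g = grad_f(1.0, 1.0)   # gradient at the point (1, 1)

best_d, best_w = -np.inf, None
for theta in np.linspace(0.0, 2 * np.pi, 3600, endpoint=False):
    w = np.array([np.cos(theta), np.sin(theta)])  # a unit vector
    d = g @ w                                     # directional derivative along w
    if d > best_d:
        best_d, best_w = d, w
```

The winning direction lines up with the normalized gradient, and the maximal directional derivative equals the gradient’s length.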

Thus, a function increases most rapidly along the direction of its gradient. Now, how do we compute the gradient of a function with millions of weights quickly?
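Putting the two ideas together, repeatedly stepping opposite the gradient is plain gradient descent. Here is a minimal sketch on a toy squared-error loss; the data (y = 3x) and the learning rate are assumptions:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x        # toy data (an assumption): true weight is 3

w = 0.0
lr = 0.01          # assumed learning rate
for _ in range(200):
    y_pred = w * x
    # Gradient of the summed squared-error loss w.r.t. w.
    grad = np.sum(2 * (y_pred - y) * x)
    w -= lr * grad  # step opposite the gradient
```

For one weight this buys little over the brute-force search, but the same update rule costs only one gradient evaluation per step no matter how many weights there are, which is why computing that gradient efficiently is the whole game.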

computational graph for backprop.

The idea of backprop is pretty simple once you see how the chain rule makes sense: the change in f due to a slight nudge in x equals the change in the intermediate variable q caused by that nudge, multiplied by the change in f due to that change in q. The reason for defining these intermediate variables might seem arbitrary at first, but if each node keeps track of the partial derivatives of its output w.r.t. its inputs, that lets us update all the weights at once (look at the figure carefully to verify).

Thus a complex loss function can be represented as a computational graph (like the one shown below), with each node keeping track of its forward computation and the gradients of its output w.r.t. its inputs.
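As a tiny sketch of such a graph (the expression f = (x + y) · z and the input values are assumptions; the figure’s actual graph may differ), the forward pass records each node’s value, and the backward pass multiplies local gradients from the output back to the inputs:

```python
# Forward pass: compute and remember the intermediate value q.
x, y, z = -2.0, 5.0, -4.0
q = x + y          # add node:      dq/dx = 1, dq/dy = 1
f = q * z          # multiply node: df/dq = z, df/dz = q

# Backward pass: chain rule, multiplying local gradients output-to-input.
df_dq = z                 # local gradient at the multiply node
df_dz = q
df_dx = df_dq * 1.0       # chain through q: df/dx = df/dq * dq/dx
df_dy = df_dq * 1.0       # likewise df/dy = df/dq * dq/dy
```

One forward and one backward sweep produce the partial derivative of f w.r.t. every input at once, which is exactly what gradient descent needs.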

Backpropagation.
