Gradient Descent in Python with Examples

Prashant
Analytics Vidhya
6 min read · Feb 10, 2021


If we have a function and we want to find its extremum, whether a minimum or a maximum, we can use the gradient descent optimization technique. So in this blog we will see gradient descent and its variants in Python. This blog assumes that you already know a little bit about derivatives and linear regression.

Let us take the function y = f(x) = x**2 + 3.

Here we can see the minimum value occurs when x = 0. In general, to find the minimum of a function we take its derivative with respect to the variable (x) and equate it to 0.

y = x**2 + 3

dy/dx = d(x**2 + 3)/dx

dy/dx = 2x

2x = 0  # equating the derivative to zero

x = 0

So when x = 0 we get the minimum of this function:

y = 0**2 + 3 = 3

So 3 is the minimum value we can get here.
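As a quick sanity check (not part of the original post), here is a short sympy sketch that reproduces the derivative and the minimum we just computed by hand:

```python
# Verify dy/dx = 2x and the minimum value 3 at x = 0 using sympy.
import sympy as sp

x = sp.symbols('x')
y = x**2 + 3

dy_dx = sp.diff(y, x)                 # 2*x
critical_points = sp.solve(dy_dx, x)  # [0]
min_value = y.subs(x, critical_points[0])

print(dy_dx, critical_points, min_value)  # 2*x [0] 3
```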

A function can have multiple local minima or maxima, but it can have only one global minimum value and one global maximum value.

But sometimes the function is more complex, and finding the minima or maxima just by equating the derivative to zero (dy/dx = 0) is not always possible. For example, for f(x) = log(1 + exp(ax)), solving the resulting equation is not trivial, so here gradient descent comes to the rescue.

Gradient descent:

Gradient descent is an iterative algorithm. First we make a guess of what our variable (x) should be.

Take f(x) = x**2. In this equation the only variable is x. A function can have multiple variables too, but for understanding purposes I am taking a simple example.

So x* is the optimal or best value, where our function reaches its minimum. Initially we randomly pick a point x0, and then we try to find an x1 which is closer to x* than x0:

x1 = x0 - Γ * [df/dx]x0

The above equation is called the update function, where Γ (gamma) is the step size and [df/dx]x0 is the derivative evaluated at x0; applying it takes us from x0 to x1. We keep repeating this update until we reach our optimal point x*.

In the first iteration it makes a big jump, and then the size of the jump keeps reducing with each iteration. How big a jump it makes also depends on the learning rate, so we have to choose an appropriate learning rate.
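To make the update rule concrete, here is a minimal sketch of gradient descent on f(x) = x**2. The starting point x0 = 10, the step size gamma = 0.1, and the number of iterations are illustrative choices of mine, not values from the post:

```python
# Gradient descent on f(x) = x**2 using the update x1 = x0 - gamma * df/dx(x0).
def df_dx(x):
    return 2 * x  # derivative of x**2

gamma = 0.1  # step size (learning rate), chosen for illustration
x = 10.0     # initial guess x0

for i in range(25):
    step = gamma * df_dx(x)
    x = x - step
    print(f"iteration {i + 1}: x = {x:.6f}, jump size = {abs(step):.6f}")

# The jump size shrinks every iteration because the gradient 2x gets
# smaller as x approaches the optimum x* = 0.
```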

We know the equation of a line, y = mx + c, and this is the same function we use for linear regression. For understanding, we will design a simple example, and I will ignore the bias term (intercept) for easy calculation.

So let y_pred = w * x, and here we will take w = 0.5. In practice we have to learn the value of w, but here we are designing the experiment ourselves for understanding purposes.

Say that, given the height of a person, we have to predict their weight, and we have 10 students.

So given a height, we can predict the weight if we know w. Hence we have to calculate the optimal w* which minimizes the error between the actual and predicted values. We know the formulation of linear regression.

And we can find the optimal w* by gradient descent.
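Written out in the same notation as above, the standard squared-error formulation (which is what the post refers to as the formulation of linear regression) is, for n points:

error(w) = (1/n) * sum of (y_i - w * x_i)**2 over all points i

d(error)/dw = -(2/n) * sum of x_i * (y_i - w * x_i) over all points i

and the gradient descent update is w_new = w_old - gamma * d(error)/dw evaluated at w_old.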

Why stochastic gradient descent?

In the above update, if the dataset is huge, say millions of points, then for each weight update we have to go through all the data points and calculate the derivative millions of times, which is very computationally expensive. Here stochastic gradient descent is very useful; it is one of the variants of gradient descent. In fact there are three variants of gradient descent.

Let n=total number of data points

1] Stochastic gradient descent: batch size = 1

2] Mini batch gradient descent: batch size = k (where 1 < k < n)

3] Batch gradient descent: batch size = n

1] Stochastic gradient descent (SGD): In SGD the weight update happens after processing each data point, so (in a neural network) backpropagation also takes place after each data point.

At each iteration we use 1 point, calculate the gradient, and update the weight.

Let n=1000

So batch size=1

So how many iterations do we need to traverse the whole dataset?

1000/1=1000

So we perform 1000 iterations, and that is called one epoch.

2] Mini batch gradient descent:

n=1000

Mini batch size = k = 5 (where 1 < k < n)

Now at each iteration we use 5 data points, calculate the average gradient, and update the weight.

So how many iterations do we need to traverse the whole dataset?

1000/5=200

So we perform 200 iterations, and that is called one epoch.

3] Batch gradient descent:

n=1000

Batch size=1000

Now at each iteration we use all 1000 points, calculate the average gradient, and update the weight.

So how many iterations do we need to traverse the whole dataset?

1000/1000=1

This is nothing but regular gradient descent: when the batch size equals the number of data points, it is simply gradient descent.

Stochastic gradient descent converges faster than batch gradient descent; it may be noisy, but it converges faster.

Let's see the code in Python.

First we create our dataset: x = height of the person, y = weight of the person.
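The dataset code did not carry over into this copy of the post, so here is a minimal sketch. The ten height values are illustrative (the original table is not shown), and the weights are generated from the designed relation y = 0.5 * x described earlier:

```python
import numpy as np

# Illustrative dataset of 10 students: x = height (cm), y = weight (kg).
# The weights follow the designed relation y = 0.5 * x, so the true w is 0.5.
x = np.array([120, 125, 130, 135, 140, 145, 150, 155, 160, 165], dtype=float)
w_true = 0.5
y = w_true * x

n = len(x)
print(n, x, y)
```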

Then we have the cost function.
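A sketch of the cost function, assuming the usual mean squared error for y_pred = w * x and the x, y arrays from the dataset sketch above:

```python
import numpy as np

def cost(w, x, y):
    # mean squared error between the actual y and the prediction w * x
    y_pred = w * x
    return np.mean((y - y_pred) ** 2)

print(cost(0.5, x, y))  # 0.0 for the designed dataset, since y = 0.5 * x exactly
```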

Next we calculate the derivative for each point.
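A sketch of the per-point derivative, i.e. the derivative of the squared error (y_i - w * x_i)**2 with respect to w; the function name is my own:

```python
def point_gradient(w, x_i, y_i):
    # d/dw of (y_i - w * x_i)**2  =  -2 * x_i * (y_i - w * x_i)
    return -2 * x_i * (y_i - w * x_i)
```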

Batch gradient descent or regular gradient descent

Here I have taken a very small learning rate, gamma = 0.000001, because my dataset is very small (n = 10). Typically we take a learning rate around 0.1, 0.01, or 0.001.
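A sketch of how the batch gradient descent loop might look, reusing the dataset and point_gradient sketches above. The learning rate gamma = 0.000001 follows the post; the starting weight w0 = 0 and the number of epochs are my own illustrative choices:

```python
import numpy as np

def batch_gradient_descent(x, y, gamma=0.000001, epochs=1000, w0=0.0):
    w = w0
    history = [w]
    for _ in range(epochs):
        # average the gradient over all n points, then do one weight update
        grad = np.mean([point_gradient(w, xi, yi) for xi, yi in zip(x, y)])
        w = w - gamma * grad
        history.append(w)
    return w, history

w_batch, batch_history = batch_gradient_descent(x, y)
print("optimal value is", w_batch)
```

The exact value you get depends on the starting weight and the number of epochs.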

optimal value is 0.48597498598079936

We can see how it is converging towards the optimal w, which is 0.5; here we have reached close to it, with w = 0.48597498598079936.

Stochastic gradient descent
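A sketch of the stochastic gradient descent loop under the same assumptions (dataset and point_gradient from above; learning rate, starting weight, and epoch count chosen for illustration). Here the weight is updated after every single point:

```python
import numpy as np

def stochastic_gradient_descent(x, y, gamma=0.000001, epochs=100, w0=0.0):
    w = w0
    history = [w]
    n = len(x)
    for _ in range(epochs):
        # one epoch = n iterations, each using a single randomly chosen point
        for i in np.random.permutation(n):
            w = w - gamma * point_gradient(w, x[i], y[i])
            history.append(w)
    return w, history

w_sgd, sgd_history = stochastic_gradient_descent(x, y)
print("optimal value w is", w_sgd)
```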

optimal value w is  0.4999999999999996

Mini batch gradient descent
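A sketch of mini batch gradient descent with batch size k = 5, again reusing the dataset and point_gradient sketches above; the other parameter values are illustrative:

```python
import numpy as np

def mini_batch_gradient_descent(x, y, k=5, gamma=0.000001, epochs=200, w0=0.0):
    w = w0
    history = [w]
    n = len(x)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, k):
            batch = order[start:start + k]
            # average the gradient over the k points of this mini batch
            grad = np.mean([point_gradient(w, x[i], y[i]) for i in batch])
            w = w - gamma * grad
            history.append(w)
    return w, history

w_mini, mini_history = mini_batch_gradient_descent(x, y)
print("optimal value w is", w_mini)
```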

optimal value w is  0.4999314426019738

Comparison of all variants of gradient descent
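The comparison plot itself is not reproduced in this copy; here is a sketch of how the three weight histories collected above could be plotted against the true w = 0.5:

```python
import matplotlib.pyplot as plt

plt.plot(batch_history, label="batch gradient descent")
plt.plot(sgd_history, label="stochastic gradient descent")
plt.plot(mini_history, label="mini batch gradient descent")
plt.axhline(0.5, linestyle="--", color="black", label="true w = 0.5")
plt.xlabel("weight updates")  # note: updates per epoch differ across variants
plt.ylabel("w")
plt.legend()
plt.show()
```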

Conclusion:

We can see that SGD converges faster than regular gradient descent, though it can be noisy. We can think of mini batch gradient descent as a compromise between batch gradient descent and stochastic gradient descent.
