Chapter 1.2: Gradient Descent with Math.

Madhu Sanjeevi ( Mady )
Deep Math Machine learning.ai
Sep 26, 2017

In this story I want to talk about a famous machine learning algorithm called Gradient Descent, which is used for optimizing machine learning algorithms, and explain how it works, including the math.

From chapter 1 we know that we need to update the m and b values; in machine learning we call them weights. Let's alias b and m as θ0 and θ1 (theta 0 and theta 1) respectively.

The first time, we take random values for θ0 and θ1, and we calculate y:

y = θ0 + θ1*X
In machine learning we call this the hypothesis, so h(X) = θ0 + θ1*X

h(X) = y, but this y is not the actual value in our data-set; it is the predicted y from our hypothesis.
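In code, this hypothesis is a one-liner. Here is a minimal Python sketch (the function and variable names are my own):

# Hypothesis: predict y from x using the current weights theta0 and theta1
def h(x, theta0, theta1):
    return theta0 + theta1 * x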

For example, let's say our data-set contains a point with x = 10 and actual y = 5, and we take random values 1 and 0.5 for θ0 and θ1 respectively, so the predicted value is h(x) = 1 + 0.5 * 10 = 6.

From this we calculate the error, which is

error = (h(x) - y)² --> (Predicted - Actual)²
error = (6 - 5)² = 1

The square is there to get rid of negative values (what if the actual y were 6 and the predicted y were 5? the difference would be -1, but the squared error is still 1).

We just calculated the error for one data point in our data-set; we need to repeat this for all data points and combine all the errors into one number, which is called the Cost Function 'J(θ)' in machine learning:

J(θ) = (1/2m) * Σ (h(xᵢ) - yᵢ)²   summed over all m data points

(The 1/m averages the errors over the data-set, and the extra 1/2 just makes the derivative cleaner later.)
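As a quick sketch, here is that cost function in plain Python (the names and the tiny data-set are my own, chosen to match the example above):

# Cost function J(theta): half the average squared error over the data-set
def cost(xs, ys, theta0, theta1):
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

xs, ys = [10], [5]            # the single example point from above
print(cost(xs, ys, 1, 0.5))   # (6 - 5)**2 / 2 = 0.5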

Our goal is to minimize the cost function (error); we want our error to be close to zero. Period.

We have an error of 1 for the first data point, so for the sake of understanding let's treat that as the whole error and try to reduce it to zero.

For the (h(x) - y)² function we always get positive values, and when we plot the error against θ the graph is a U-shaped curve (a parabola).

Here is where gradient descent comes into the picture.

The error curve for positive θ values (left) and negative θ values (right)

Gradient descent takes little steps down the curve to reach the minimum value (the bottom of the curve), changing the θ values in the process.

How does it know how far it should step down???

The answer is in the math.

  1. It draws the tangent line to the curve at the current point.
  2. It finds the slope of that line.
  3. It identifies how much change is required by taking the partial derivative of the function with respect to θ.
  4. That change value is multiplied by a variable called alpha (the learning rate); we provide the value for alpha, usually 0.01.
  5. It subtracts this change value from the earlier θ value to get the new θ value.

Putting those steps together, the update rule for each θ is

θj := θj - α * ∂J(θ0, θ1)/∂θj   (repeated until convergence, for j = 0 and j = 1)

And alpha here is the learning rate; we usually give it 0.01, but it depends. It tells how big the step size is towards reaching the minimum value.

Update equations for θ0 and θ1 (left), and for more than two θ's (right)

Again, we know our J(θ0, θ1), so if we plug it into the above equations for θ0 and θ1, we get our new θ0 and θ1 values.
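Before we derive those derivatives by hand, here is a small illustration of one update step (entirely my own sketch, with made-up numbers), approximating ∂J/∂θ numerically with finite differences:

# Cost for the single made-up data point (x = 10, y = 5) from earlier
J = lambda t0, t1: ((t0 + t1 * 10) - 5) ** 2 / 2

def slope(f, t0, t1, wrt, eps=1e-6):
    # Finite-difference estimate of the partial derivative of f
    if wrt == 0:
        return (f(t0 + eps, t1) - f(t0 - eps, t1)) / (2 * eps)
    return (f(t0, t1 + eps) - f(t0, t1 - eps)) / (2 * eps)

alpha = 0.01                 # learning rate
theta0, theta1 = 1.0, 0.5    # the random starting values from earlier
# Simultaneous update: compute both slopes first, then move both thetas
g0, g1 = slope(J, theta0, theta1, 0), slope(J, theta0, theta1, 1)
theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1
print(theta0, theta1)        # both values moved a little toward the minimum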

How to calculate the derivatives???

For example, f(x) = x² → df/dx = 2x. How??? Straight from the definition of the derivative:

df/dx = lim(h→0) [(x + h)² - x²] / h = lim(h→0) [2xh + h²] / h = lim(h→0) (2x + h) = 2x
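If you ever want to double-check a derivative, a symbolic math library such as SymPy can compute it for you (a small sketch, assuming SymPy is installed):

import sympy as sp

x = sp.symbols('x')
print(sp.diff(x**2, x))   # prints 2*x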

How to calculate the partial derivatives???

It's the same as calculating ordinary derivatives, but here we take the derivative with respect to one variable and treat all the others as constants (so d/dx(constant) = 0). For example, for f(x, y) = x²y we get ∂f/∂x = 2xy (treating y as a constant) and ∂f/∂y = x² (treating x as a constant).

We can apply the same idea to calculate the partial derivatives of our cost function with respect to θ0 and θ1.

How come the boxed term disappeared in the next step above? Just wait and see.

Calculating the partial derivative with respect to θ1 is the same as above, except one little part is added.

The θ0 box disappeared because its value is 1: by the chain rule, ∂/∂θ0 (h(x) - y)² = 2(h(x) - y) * ∂/∂θ0 (θ0 + θ1*x - y), and that inner derivative is just 1, so multiplying by it changes nothing. For θ1 the inner derivative is x, which is the little part that gets added.
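We can sanity-check both partial derivatives with SymPy as well (my own check, applied to a single error term):

import sympy as sp

theta0, theta1, x, y = sp.symbols('theta0 theta1 x y')
err = (theta0 + theta1 * x - y) ** 2
print(sp.diff(err, theta0))   # equals 2*(theta0 + theta1*x - y) -- the inner derivative is 1
print(sp.diff(err, theta1))   # equals 2*x*(theta0 + theta1*x - y) -- the inner derivative is x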

So the final update rules are:

θ0 := θ0 - α * (1/m) * Σ (h(xᵢ) - yᵢ)
θ1 := θ1 - α * (1/m) * Σ (h(xᵢ) - yᵢ) * xᵢ

(The 2 that comes down from the square cancels against the 1/2 in the cost function, which is exactly why that 1/2 was there.)

I hope it's not confusing. I know it's a little hard to grasp in the beginning, but I am sure it will make sense as you go through it again and again.

So that's it for this story. In the next story I will cover another interesting topic in machine learning, so see ya!

Update: Code for Gradient Descent and linear regression
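As a minimal, self-contained Python sketch (my own, with made-up data, not necessarily the linked code), here is batch gradient descent for one-feature linear regression using the final update rules above:

# Batch gradient descent for y = theta0 + theta1 * x
def gradient_descent(xs, ys, alpha=0.01, iterations=1000):
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J (the 2 from the square cancels the 1/2)
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both weights
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Made-up data lying exactly on y = 1 + 0.5 * x
xs = [1, 2, 3, 4, 5]
ys = [1.5, 2.0, 2.5, 3.0, 3.5]
print(gradient_descent(xs, ys, alpha=0.05, iterations=5000))  # approaches (1.0, 0.5)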
