Gradient Descent

Pradyumna Yadav
Published in Analytics Vidhya
6 min read · Feb 9, 2021

Gradient Descent is a basic parameter-optimization technique used throughout machine learning. It works by following the slope of the cost function with respect to each parameter. Let’s consider an example:

You are an adventurous person and you got the chance to climb Mount Everest, Earth’s highest peak above sea level. Your team starts climbing, and somehow, after a lot of effort, you all reach the top. You all decide to celebrate this victory, so one of your friends pops a champagne bottle. Being a keen observer, you watch the cork rolling down the mountain, always in the direction where the terrain is steepest.

The Gradient Descent algorithm shows the same behavior as the cork on that hilly terrain, but here the hilly terrain is the cost function.

Cost Function

We have to go down this cost function hill to reach the global minimum, i.e., the optimal value of the cost function.

The slope of our cost function is determined by partial derivatives of the parameters. Let us discuss cost functions in detail.

Machine Learning Model

Consider two variables X and Y, each with 100 values, where X is the feature variable and Y is the target variable.

This is the scatter-plot of our data

Let us consider a hypothesis : Y’ = mX + b

The X data is fed into our hypothesis model, which predicts a value Y’ for each value of X; here m and b are the slope of the hypothesis function and the bias, respectively.

Hypothesis model

Thus Y is the actual value for the corresponding X, whereas Y’ is the predicted outcome when X is fed into the model.
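As a quick sketch (the values of m, b, and X here are made up for illustration), the hypothesis simply maps each X to a prediction Y’:

```python
# Made-up illustration of the hypothesis Y' = m*X + b.
m, b = 2.0, 1.0                       # arbitrary slope and bias
X = [1.0, 2.0, 3.0]
Y_pred = [m * x + b for x in X]       # one prediction per X value
print(Y_pred)                         # [3.0, 5.0, 7.0]
```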

Hence Y and Y’ may well have different values. Right?

Difference = Y’(predicted) - Y(actual)

This difference is the error of the model. More specifically this error helps us to define our Cost Function.

Cost Function

The cost function (or loss function) helps to evaluate the performance of the model. It gives the average loss over the whole training set, where N is the total number of training examples.

The cost function is given by the average of the squared errors:

J(m, b) = (1 / 2N) · Σ (Y’ᵢ − Yᵢ)²

Cost Function

We use the squared error because it magnifies large error distances, making bad predictions more noticeable than good ones, and because the derivative of a squared function is easy to compute.
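As a minimal sketch (the data values here are made up), the cost can be computed like this, using the same 1/(2N) scaling as the code later in the post:

```python
import numpy as np

def cost(m, b, X, Y):
    # Average squared error of the hypothesis Y' = m*X + b over N examples
    n = len(X)
    errors = (m * X + b) - Y          # Y'(predicted) - Y(actual)
    return (1 / (2 * n)) * np.sum(errors ** 2)

X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])         # perfectly linear data: Y = 2X

print(cost(2.0, 0.0, X, Y))           # 0.0 -- a perfect fit has zero cost
print(cost(1.0, 0.0, X, Y))           # errors -1, -2, -3 give a larger cost
```

Notice how the squared errors make the second (worse) set of parameters stand out clearly from the perfect fit.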

The cost function hill

Now that we have defined our hilly terrain, we are ready to start our journey along the slope and let the gravity do all of the work. Wheeeeeee.

Minimizing the Cost Function

At the end of the day, our goal is simply to reduce the cost, obtaining the optimal parameter values corresponding to that minimum of the cost function.

To minimize the cost, we move down the slope of the cost function until we reach the lowest point (the global minimum).

The slope of the cost function is determined by the partial derivatives of the cost with respect to the parameters m (the slope of the hypothesis function) and b (the bias).

Derivative w.r.t. m: ∂J/∂m = (1/N) · Σ (Y’ᵢ − Yᵢ) · Xᵢ
Derivative w.r.t. b: ∂J/∂b = (1/N) · Σ (Y’ᵢ − Yᵢ)
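The two derivatives can be sketched in code and sanity-checked against a numerical (finite-difference) estimate of the slope; the data here is made up:

```python
import numpy as np

def cost(m, b, X, Y):
    n = len(X)
    return (1 / (2 * n)) * np.sum(((m * X + b) - Y) ** 2)

def gradients(m, b, X, Y):
    # Partial derivatives of the cost with respect to m and b
    n = len(X)
    err = (m * X + b) - Y
    return (1 / n) * np.sum(err * X), (1 / n) * np.sum(err)

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.0, 5.0, 7.0, 9.0])    # Y = 2X + 1

d_m, d_b = gradients(0.0, 0.0, X, Y)

# Finite-difference check of dJ/dm at m = 0
eps = 1e-6
num_d_m = (cost(eps, 0.0, X, Y) - cost(-eps, 0.0, X, Y)) / (2 * eps)
print(abs(d_m - num_d_m) < 1e-6)      # True: analytic slope matches numeric
```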

These derivatives give us the slope of our cost function hill. But how are we going to proceed down this hill?

To proceed along the slope, we need to define our steps along the slope such that we will reach the lowest point (global minima).

Learning Rate: α

Our steps are determined by the learning rate. The higher the learning rate, the bigger our steps, so we may overshoot the desired point; with a small learning rate, it may take a large number of steps to reach the lowest point.

A wrongly chosen learning rate may result in our cost function never arriving at the global minimum.

There is no universal value for α; it varies from one cost function to another, so one has to tune it to reach the optimal result for a given cost function.
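To see the effect concretely, here is a toy one-dimensional illustration (not the regression problem above): we minimize J(m) = m², whose slope is 2m, with different learning rates:

```python
def descend(alpha, steps=20, m=1.0):
    # Repeatedly step against the slope of J(m) = m**2, which is 2*m
    for _ in range(steps):
        m = m - alpha * 2 * m
    return m

print(descend(0.1))    # small steps: creeps steadily towards the minimum at 0
print(descend(1.1))    # steps too big: each one overshoots further; diverges
```

With α = 0.1 each step shrinks m by a factor of 0.8, while with α = 1.1 each step multiplies it by −1.2, so the iterate bounces across the minimum and grows without bound.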

Updating the parameters after each step

Now that we have our slope of the hill and defined our steps too, we are good to go along the slope.

When we feed our data X and Y to the hypothesis function, the cost function calculates the loss after each iteration. From that loss, the derivatives with respect to the parameters m and b give us the steepest direction. Stepping along that direction brings the value of the cost function closer to its minimum, and the parameters m and b are updated.

Value of m is updated: m := m − α · ∂J/∂m
Value of b is updated: b := b − α · ∂J/∂b

This is how the linear regression line gets updated with the slope and the bias.

And finally we get the optimized values of the slope and bias; here is the final graph:

Optimized value of slope and bias

This is the Python implementation of the gradient descent algorithm; you can find it at the link provided at the end.

import numpy as np

def gradient_descent(X, y):
    # X and y are expected to be NumPy arrays of equal length
    m = b = 0
    learning_rate = 0.01
    epochs = 1000
    n = len(X)
    for epoch in range(epochs):
        y_predicted = m * X + b
        cost = (1 / (2 * n)) * np.sum((y_predicted - y) ** 2)
        derivative_wrt_m = (1 / n) * np.sum((y_predicted - y) * X)
        derivative_wrt_b = (1 / n) * np.sum(y_predicted - y)
        m = m - learning_rate * derivative_wrt_m
        b = b - learning_rate * derivative_wrt_b
    return m, b, cost

In this way, we get the optimal values of the parameters m and b, corresponding to the minimum value of the cost function. Hence we reach our minimum.
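As a usage sketch on made-up data (the function is repeated here so the snippet runs on its own, with a larger learning rate and more epochs than the listing above so the parameters converge on this data), the recovered m and b should land near the true slope and intercept:

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.05, epochs=5000):
    m = b = 0.0
    n = len(X)
    for epoch in range(epochs):
        y_predicted = m * X + b
        cost = (1 / (2 * n)) * np.sum((y_predicted - y) ** 2)
        m = m - learning_rate * (1 / n) * np.sum((y_predicted - y) * X)
        b = b - learning_rate * (1 / n) * np.sum(y_predicted - y)
    return m, b, cost

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3 * X + 4                          # synthetic data: true m = 3, b = 4

m, b, cost = gradient_descent(X, y)
print(round(m, 2), round(b, 2))        # approximately 3.0 and 4.0
```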

Reaching the minimum

I have implemented this algorithm in a Jupyter notebook, which is in my GitHub repository. Please have a look.

Here, we have reached the global minimum, which was the desired value of the cost function. But sometimes, instead of reaching the global minimum, the value of the cost function gets stuck at a local minimum.

Well, that’s the story for the next part.

Thanks.
