Simple Linear Regression, Cost Function & Gradient Descent

Sanathkumar Sunkad · Published in Analytics Vidhya · 6 min read · Dec 31, 2020

Linear regression is a powerful statistical technique and machine learning algorithm used to model the relationship between two variables and predict one from the other, typically for continuous data.

Fig 1: Linear Regression | GIF: Towards Data Science

Applications of Linear Regression

Linear regression is mainly used to understand a business and the factors influencing its profitability, and to evaluate trends and make forecasts. It is used to analyze market effectiveness, pricing, and product promotions. It is also used to assess risk in the banking sector and in recommender systems.

Let's consider a simple example of house price prediction, where the price of a house is predicted from its area.

Fig 2: House Data

Using the above data we can construct a scatter plot through which a regression line is passed. The regression line is drawn so that it is close to as many of the points as possible (Fig 1); this is achieved by minimizing the cost function. Since there is one independent variable, the area, which we take as X, and the price to be predicted is the dependent variable Y, we can write the linear equation Y = c1 + c2*X, where Y can be easily calculated for any given X.
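To make the equation concrete, here is a minimal sketch in Python (the function name and the example numbers are my own, not from the article) of the hypothesis Y = c1 + c2*X:

```python
def predict(x, c1, c2):
    # Hypothesis: Y = c1 + c2 * X, e.g. the predicted price for a house of area x.
    return c1 + c2 * x

# Hypothetical values: with c1 = 50 and c2 = 2,
# a house of area 100 gets a predicted price of 250.
print(predict(100, c1=50, c2=2))  # 250
```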

What should the values of c1 & c2 be?

The values of these constants decide the regression line, and we want the line that is closest to the maximum number of points. Let's consider 3 simple data points where X and Y are predefined, and find the optimal line fit for this data.

Fig 3: Data Points

By observation we can see that X = Y, so what should the values of c1 and c2 be such that the line passes through most of the data points?

Left to Right: Fig 4a, Fig 4b, Fig 4c

We can again observe that by varying c1 and c2 in the equation Y = c1 + c2*X we get different lines. Among them, in Fig 4c the line passes through all the points, which is the best fit. So the optimal values for c1 and c2 are 0 and 1 respectively; substituting them into the equation gives Y = 0 + 1*X → Y = X. Now we can predict Y for any given X using this equation. Similarly, we can plot a scatter plot for the House Data (Fig 2) and find the best fit for those points, as sketched below.
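As a quick sketch of that search, we can compare the three candidate lines the article evaluates later in Fig 5 against the data X = Y = {1, 2, 3}, using total absolute error just for an eyeball comparison (the article's squared-error cost comes next):

```python
data = [(1, 1), (2, 2), (3, 3)]  # (X, Y) pairs from Fig 3

# Candidate (c1, c2) pairs; only the last one fits every point exactly.
candidates = [(2.0, 0.0), (0.0, 0.5), (0.0, 1.0)]

for c1, c2 in candidates:
    total_error = sum(abs((c1 + c2 * x) - y) for x, y in data)
    print(f"c1={c1}, c2={c2}: total absolute error = {total_error}")
# c1=2.0, c2=0.0: total absolute error = 2.0
# c1=0.0, c2=0.5: total absolute error = 3.0
# c1=0.0, c2=1.0: total absolute error = 0.0
```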

It’s not as simple as it was for the 3 data points above; for the House Data we may have millions of data points.

How does a computer know which line is the best fit?

This is where the cost function comes into the picture. The cost function calculates the average error (averaging the loss function over all points), and our goal is to reduce the cost function as much as possible to get the best-fit line.

Fig 5: Error | Left to Right: Fig 5a, 5b, 5c

The cost function is J(c1, c2) = (1/2m) ∑(Ŷ − Y)², commonly written as the equation below. Note: (c1, c2) = (θ₀, θ₁), and Ŷ (Y-hat) is the hypothesis. A code sketch of this computation follows the list below.

  • Y → actual value (ground truth)
  • Ŷ = c1 + c2*X → predicted value, known as the hypothesis
  • m → number of data points
  • the 1/2m factor and the squared error average the loss and simplify the math
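Putting the formula into code, here is a minimal sketch (the function and variable names are my own) of J(c1, c2) = (1/2m) ∑(Ŷ − Y)²:

```python
def cost(c1, c2, data):
    # J(c1, c2) = (1/2m) * sum((c1 + c2*x - y)**2) over all (X, Y) pairs.
    m = len(data)
    return sum((c1 + c2 * x - y) ** 2 for x, y in data) / (2 * m)
```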

Cost Function:

Now let's understand the equation J(c1, c2) = (1/2m) ∑(Ŷ − Y)² by solving it using the examples in Fig 5.

Fig 5a → c1 = 2 & c2 = 0, therefore Ŷ = 2 and m = 3

J(c1, c2) = (1/(2*3)) * ((2−1)² + (2−2)² + (2−3)²) = 0.33

Fig 5b → c1 = 0 & c2 = 0.5, so Ŷ = 0 + 0.5X → Ŷ = 0.5X and m = 3

J(c1, c2) = (1/(2*3)) * ((0.5−1)² + (1−2)² + (1.5−3)²) = 0.58

Fig 5c → c1 = 0 & c2 = 1, so Ŷ = 0 + X → Ŷ = X and m = 3

J(c1, c2) = (1/(2*3)) * ((1−1)² + (2−2)² + (3−3)²) = 0

Comparing all the examples above, Fig 5c gives the least cost, so we can say that Fig 5c with c1 = 0 & c2 = 1 is the best fit. Hence we now know which line is the best fit and how to calculate the cost function to find the line that is closest to most of the data points.
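Reusing the cost function sketched above, we can reproduce all three numbers:

```python
data = [(1, 1), (2, 2), (3, 3)]

print(round(cost(2.0, 0.0, data), 2))  # 0.33 (Fig 5a)
print(round(cost(0.0, 0.5, data), 2))  # 0.58 (Fig 5b)
print(round(cost(0.0, 1.0, data), 2))  # 0.0  (Fig 5c)
```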

Note: To calculate the cost function we need to know the values of c1 and c2 in advance. c1 and c2 can vary over a range that depends on the data set, and they can be negative as well for a better fit. So, how do we update the values of c1 and c2 dynamically until we reach the best fit?

Fig 6: Data having a negative c2 value

How to update the values of c1 and c2 dynamically?

This can be solved by an algorithm called gradient descent, which finds the local minimum, i.e., the values of c1 and c2 for which the cost function is smallest. When the cost function is at its minimum, our regression line fits best.

Note: c1 and c2 are commonly known as parameters, which we tweak to get the best fit for the regression line: (c1, c2) → (θ₀, θ₁). These are also known as weights, which are calculated by machine learning algorithms and stored as a model that predicts the output Ŷ for a given input X.

The whole idea of gradient descent is that we can start with any random initial values for c1 and c2, then use the gradient descent algorithm to update c1 and c2 on every iteration, considering all the data and evaluating the cost function in each iteration. Each update takes a step towards the local minimum.

Gradient Descent

c1 := c1 − α · ∂J(c1, c2)/∂c1
c2 := c2 − α · ∂J(c1, c2)/∂c2

  • c1, c2 → the two parameters from which the cost function is calculated
  • J(c1, c2) → the cost function explained above in Fig 5
  • α → the learning rate, which controls the size of each gradient descent step

We can see in the above equations that c1 and c2 are updated by computing the partial derivative of the cost function with respect to each parameter, multiplying it by the learning rate (α), and subtracting the result.

Note: c1 and c2, or (θ₀, θ₁, …) for any number of parameters, have to be updated simultaneously. I will publish a new article explaining why, along with the math behind gradient descent.
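Here is a minimal sketch of batch gradient descent for this cost function on the same toy data; the learning rate, iteration count, and starting values are my own choices. Note the tuple assignment that implements the simultaneous update mentioned in the note above:

```python
def gradient_descent(data, alpha=0.1, iterations=1000):
    # Minimizes J(c1, c2) = (1/2m) * sum((c1 + c2*x - y)**2).
    m = len(data)
    c1, c2 = 0.0, 0.0  # arbitrary initial values

    for _ in range(iterations):
        # Partial derivatives of J with respect to c1 and c2,
        # computed over ALL data points (batch gradient descent).
        grad_c1 = sum((c1 + c2 * x - y) for x, y in data) / m
        grad_c2 = sum((c1 + c2 * x - y) * x for x, y in data) / m

        # Simultaneous update: both new values are computed before assigning.
        c1, c2 = c1 - alpha * grad_c1, c2 - alpha * grad_c2

    return c1, c2

c1, c2 = gradient_descent([(1, 1), (2, 2), (3, 3)])
print(c1, c2)  # approaches (0, 1), the best fit found earlier
```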

Fig 7: Gradient Descent

Summary:

All you have to do is calculate the parameters using the equations above, and your model is ready to predict. Cheers!!
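For completeness, here is a sketch of the full loop on hypothetical house data (the areas and prices below are invented for illustration, not taken from Fig 2), reusing predict, cost, and gradient_descent from the sketches above:

```python
# Hypothetical house data: (area in sq ft, price) pairs, purely illustrative.
house_data = [(1000, 200), (1500, 280), (2000, 370), (2500, 450)]

# Scale areas down so a simple fixed learning rate converges;
# feature scaling is a common practical trick, not from the article.
scaled = [(x / 1000.0, y) for x, y in house_data]

c1, c2 = gradient_descent(scaled, alpha=0.1, iterations=10000)
print("cost:", cost(c1, c2, scaled))
print("predicted price for 1800 sq ft:", predict(1.8, c1, c2))
```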
