Linear Regression

1. Ordinary Least Squares / Residual Sum of Squares (RSS): the cost function is Σ(y(i) − y_pred(i))², which is minimized to find the values of β0 and β1 that give the best-fit predicted line.
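
A minimal sketch of the RSS cost for a simple linear model. The data and the candidate values of β0 and β1 below are made up purely for illustration:

```python
import numpy as np

# Hypothetical training data (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def rss(beta0, beta1, x, y):
    """Residual sum of squares for the line y_pred = beta0 + beta1 * x."""
    y_pred = beta0 + beta1 * x
    return np.sum((y - y_pred) ** 2)

print(rss(0.0, 2.0, x, y))  # cost for one candidate pair (beta0, beta1)
```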

2. Gradient Descent

Gradient descent is one optimization method that can be used to minimize the residual sum of squares cost function (other cost functions can be used as well). It starts with initial values of β0 and β1 and computes the cost function. It then increases or decreases the parameters, in the direction of the negative gradient, and recomputes the cost. This is repeated until a minimum is found. Gradient descent assumes that there are no local minima, i.e. that the graph of the cost function is convex.
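
A minimal gradient-descent sketch for simple linear regression. The data, learning rate and iteration count below are arbitrary choices for illustration, not recommendations:

```python
import numpy as np

# Hypothetical data (same illustrative values as above)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

beta0, beta1 = 0.0, 0.0     # initial guesses
lr, n_iters = 0.01, 5000    # step size and number of updates (assumed values)

for _ in range(n_iters):
    y_pred = beta0 + beta1 * x
    residual = y - y_pred
    # Partial derivatives of RSS with respect to beta0 and beta1
    grad_b0 = -2.0 * np.sum(residual)
    grad_b1 = -2.0 * np.sum(residual * x)
    beta0 -= lr * grad_b0
    beta1 -= lr * grad_b1

print(beta0, beta1)  # approaches the least-squares estimates
```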

3. Drawback of RSS

RSS is an absolute measure: it depends on the units of y, so if the scale of the actual and predicted y changes, the RSS changes with it. To get a scale-free measure we use the relative term R², which is 1 − RSS/TSS.

4. TSS : Total sum of squares

Instead of summing each actual value's squared difference from the predicted value, in the TSS we sum each actual value's squared difference from the mean ȳ.

(Figure: RSS)

(Figure: TSS; the dotted line is ȳ)

5. Relationship between TSS, RSS and R²

TSS works as the cost function for a model that has no independent variable, only an intercept (the mean ȳ). It measures how good the model is without any independent variable. When an independent variable is added, the model's error is given by RSS. The ratio RSS/TSS tells how good the model is compared to simply predicting the mean. The smaller this ratio, the smaller the residual error against the actual values relative to the residual error against the mean, which implies the model is better. So 1 − RSS/TSS is taken as the measure of the robustness (goodness of fit) of the model and is known as R².

PS : Whenever you compute TSS or RSS, you always take the actual data points of the training set. The difference between the two is the reference from which the deviations of the actual data points are measured. For RSS, the reference is the predicted value of each actual data point. For TSS, it is the mean ȳ of the actual values.
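
A short sketch computing RSS, TSS and R² side by side; the actual and predicted values below are hypothetical:

```python
import numpy as np

# Hypothetical actuals and predictions from a fitted line (illustrative values)
y_actual = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_pred   = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

rss = np.sum((y_actual - y_pred) ** 2)             # deviations from the predictions
tss = np.sum((y_actual - np.mean(y_actual)) ** 2)  # deviations from the mean ȳ
r_squared = 1 - rss / tss

print(rss, tss, r_squared)
```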

6. RSE : Residual standard error = sqrt(RSS / (n − 2)), where n is the number of training points.
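
A quick sketch of the RSE on the same hypothetical values used above:

```python
import numpy as np

# Hypothetical actuals and predictions (illustrative values)
y_actual = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_pred   = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

n = len(y_actual)
rss = np.sum((y_actual - y_pred) ** 2)
rse = np.sqrt(rss / (n - 2))  # n - 2 degrees of freedom: two estimated coefficients
print(rse)
```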

7. Assumptions of Linear regression
a. Linear relationship between X and Y
b. Error terms are normally distributed. (Not X and Y).
c. Error terms have zero mean
d. Error terms are independent of each other
e. Error terms have constant variance.

PS : There is no assumption about the distribution of X or Y; it is only the error terms that are assumed to be normally distributed. A rough check of these assumptions on the residuals is sketched below.
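
A minimal sketch of checking the residual assumptions; the residuals below are made-up values, and the Shapiro-Wilk test is just one common way to check normality:

```python
import numpy as np
from scipy import stats

# Hypothetical residuals from a fitted model (illustrative values only)
residuals = np.array([0.1, -0.1, 0.2, 0.1, -0.2, 0.05, -0.15, 0.0])

print(np.mean(residuals))        # should be close to zero
print(stats.shapiro(residuals))  # Shapiro-Wilk test for normality of the errors
# Constant variance and independence are usually judged from a plot of
# residuals against fitted values (no funnel shape, no visible pattern).
```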

8. Significance of the coefficients β1, β2, β3, ...

a. If X is related to Y, we say the coefficient is significant. The stronger the relationship, the more significant the coefficient. If there is no relationship, the coefficient is not significant (statistically indistinguishable from zero).
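
A sketch of checking coefficient significance with statsmodels; the data here is synthetic and purely illustrative, and the p-values in the output indicate how significant each coefficient is:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: only the first variable actually influences y (assumed setup)
np.random.seed(0)
X = np.random.rand(100, 2)
y = 3.0 + 2.0 * X[:, 0] + np.random.randn(100) * 0.5

X_with_const = sm.add_constant(X)       # add the intercept column
model = sm.OLS(y, X_with_const).fit()
print(model.pvalues)    # small p-value => coefficient is significant
print(model.summary())  # coefficients, standard errors, t-stats, p-values
```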
