

6 Linear Regression Concepts That Are Easy To Miss!

Photo by Isaac Smith on Unsplash

Statistics and data science rely heavily on regression, both to predict an output variable from the values of predictor variables and to detect anomalies. Regression diagnostics, originally intended for analyzing and improving regression models, can also be used to detect anomalies in X or Y values.

A regression model that fits the data well will accurately capture how Y changes when X changes. However, the regression equation does not prove the direction of causation. Hence conclusions about causation, e.g. that clicks on the ads lead to sales and not the other way round, must come from the contextual knowledge of the data scientist interpreting the model.

Correlation Vs. Regression

Simple Linear Regression (LR) models the relationship between the magnitudes of the X (input) and Y (output) variables. While correlation measures the strength of the relationship between two variables, LR quantifies the nature of the relationship, e.g. as X increases by one unit, Y changes by b1 units on average, where b1 is the slope of the fitted line.

Degrees of Freedom (N-1). Why does it matter?

Degrees of freedom (DoF) is the number of values that are free to vary in a calculation. N-1 is used as the denominator when estimating the variance or standard deviation from a sample, and the same idea determines the number of dummy variables to include in a regression model.
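As a small illustration of the N-1 denominator, the sketch below computes the variance of a made-up sample both ways and checks the results against the `statistics` module. The data values are assumptions chosen only to keep the arithmetic simple.

```python
from statistics import pvariance, variance

# Hypothetical sample of 8 measurements.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(sample)
sample_mean = sum(sample) / n

# Sum of squared deviations from the sample mean.
ss = sum((value - sample_mean) ** 2 for value in sample)

# Dividing by n treats the data as the entire population.
population_style = ss / n

# Dividing by n - 1 accounts for the one degree of freedom
# already spent on estimating the mean from the same data.
sample_style = ss / (n - 1)

print(population_style, pvariance(sample))  # both use n
print(sample_style, variance(sample))       # both use n - 1
```

The N-1 version is always slightly larger, and the gap shrinks as the sample grows.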

When a statistic is standardized for use in a hypothesis test, DoF is part of the standardization formula and determines which reference distribution (t-distribution, F-distribution, etc.) the standardized value should be compared against.

Regression algorithms choke if there are redundant variables. Take days of the week as an example: although there are seven days, once you know that a day is none of the six days Sunday to Friday, it can only be Saturday. While creating dummy variables, including all 7 days’ dummy variables alongside an intercept adds the redundant information of the seventh day, and the regression fit will fail or give unstable coefficients because of perfect multicollinearity (the dummy variable trap). Using N-1 = 6 dummy variables avoids this.
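The redundancy can be seen directly in a small sketch. The `one_hot` helper and the list of observations below are hypothetical, written only for this illustration. With all 7 dummy columns, every row sums to 1, which is exactly the intercept column, so one column is a linear combination of the others. Dropping one day removes the redundancy without losing information.

```python
days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

# Hypothetical observations of the day on which something happened.
observations = ["Mon", "Sat", "Wed", "Sun", "Fri", "Sat", "Tue", "Thu"]


def one_hot(values, categories):
    """Return one row per value with a 0/1 column per category."""
    return [[1 if value == cat else 0 for cat in categories] for value in values]


# All 7 dummies: each row sums to 1, duplicating the intercept column.
full = one_hot(observations, days)
row_sums_full = [sum(row) for row in full]

# 6 dummies (Saturday dropped): Saturday becomes the baseline,
# encoded as a row of all zeros.
reduced = one_hot(observations, days[:-1])

print(row_sums_full)  # every entry is 1
print(reduced[1])     # the "Sat" observation: all zeros
```

Libraries usually offer a switch for this, but the idea is the same: keep N-1 dummy columns and let the dropped category act as the reference level.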


RMSE Vs. MAE

RMSE, or root mean squared error, is a widely used metric to evaluate a regression model. However, since the errors are squared in this formula, outliers produce disproportionately large squared errors and hence carry more weight in RMSE. To reduce the sensitivity to outliers or extreme values, Mean Absolute Error (MAE) can be used. Since MAE averages the magnitude of the errors, the bigger errors do not get magnified. MAE is always less than or equal to RMSE, and the two are equal only when all errors have the same magnitude.
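Here is a quick sketch of how a single large error moves the two metrics. The actual and predicted values are invented for the example, and `mae` and `rmse` are small helper functions written here, not library calls.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error: average magnitude of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square, average, then take the root."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [10, 12, 14, 16, 18]

# Every prediction is off by exactly 1.
pred_even = [11, 11, 15, 15, 19]

# Same as above, except the last prediction is off by 10.
pred_outlier = [11, 11, 15, 15, 28]

print(mae(actual, pred_even), rmse(actual, pred_even))
print(mae(actual, pred_outlier), rmse(actual, pred_outlier))
```

When all errors have the same size, both metrics are 1. With the one outlier, MAE rises to 2.8 while RMSE rises to about 4.56, which shows how squaring lets a single large error dominate.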

R sq. Vs Adjusted R sq.

R sq., or the coefficient of determination, measures the proportion of variance in Y that is explained by the variables in the model. As more variables are added to the model, R sq. never decreases; it can only stay the same or increase. Chasing a higher R sq. therefore leads to a super complex model (too many variables), which is not desirable. To address this, Adjusted R sq. is used. It penalizes R sq. for the number of variables in the model, so it increases only when a new variable improves the fit more than would be expected by chance.
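A common textbook form of Adjusted R sq. is 1 - (1 - R sq.)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of predictors excluding the intercept. The sketch below applies it to two hypothetical models; the R sq. values, n, and p are assumptions picked for illustration, not results from real data.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n = 30  # hypothetical number of observations

# Model A: 3 predictors, R^2 = 0.800
adj_a = adjusted_r2(0.800, n, p=3)

# Model B: 10 predictors, R^2 = 0.805 (slightly higher raw R^2)
adj_b = adjusted_r2(0.805, n, p=10)

print(f"Model A: adjusted R^2 = {adj_a:.3f}")
print(f"Model B: adjusted R^2 = {adj_b:.3f}")
```

Model B has the higher raw R sq., but once the seven extra variables are accounted for, its adjusted value drops below Model A's.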

t-statistic or p-value of a coefficient

Both the t-statistic and the p-value measure the statistical significance of a variable. This is where being fooled by randomness is guarded against: by looking at the p-value we can judge whether to reject the null hypothesis that the coefficient = 0. If we can reject it (low p-value / high absolute t-statistic), we can conclude that the variable is a statistically significant addition to the model.

Occam’s Razor’s guide to choosing a better model: all things being equal, a simpler model should be used in preference to a more complicated model.

Model selection can be done in the following ways:

  1. Forward Step Regression
  2. Backward Step Regression
  3. Penalized Regression (Ridge/Lasso)


