All about Linear Regression Assumptions Part 1

Pallavi Padav
Women in Technology
4 min read · Mar 2, 2024


Cover image source: https://www.visitnorway.com/things-to-do/outdoor-activities/ziplines/

We all know that a parametric test makes assumptions about a population’s parameters.

Linear regression is a parametric model. Hence, given a data set with a continuous target variable, we must check certain conditions before choosing the linear regression model. These conditions are known as the linear regression assumptions.

Assumptions of Linear Regression

Based on the stage at which the assumptions are validated, we can divide them into two categories: assumptions about the explanatory variables and assumptions about the error terms.

1. Assumptions about the explanatory variables

During the exploratory data analysis phase, we need to check the relationship between the dependent variable and the feature variables.

1. a) Linearity and additivity

This assumption states that a linear relationship exists between the dependent and independent (feature) variables.

Equation of a line: Y = mX + C

The change in Y due to a one-unit change in X is constant, regardless of the value of X. An additive relationship means that the effect of X₁ on Y is independent of the other variables.

Consider a multilinear model where X₁, X₂, …, Xₚ are the independent variables (the predictor variables) and β₀, β₁, β₂, …, βₚ are their corresponding coefficients. In this case, the effects of the variables in the model should add up:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ
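To make the additivity concrete, here is a minimal sketch in Python (assuming NumPy and scikit-learn) that fits a linear model on synthetic data where the true relationship is additive, so the fitted coefficients recover each effect independently:

```python
# A minimal sketch on synthetic data: the true relationship is linear and
# additive, so each fitted coefficient reflects its own variable's effect.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))                 # two independent features X1, X2
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1]       # Y = b0 + b1*X1 + b2*X2 ...
y = y + rng.normal(scale=0.5, size=200)       # ... plus random noise

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)          # close to 3.0 and [2.0, -1.5]
```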

1. b) No Multicollinearity in the data

High collinearity means that two variables are strongly correlated and carry essentially the same information. There should not be any strong relationship between the independent variables. Since the effects of the variables in linear regression are additive, multicollinear variables provide redundant information and make it difficult to isolate the impact of each variable on the target.

salary = β₀ + β₁(years of experience) + β₂(age in years)

In the above equation, as years of experience increases, age also increases. The question arises: did the salary increase due to experience or due to age? The model cannot separate the two effects. Hence there is a need to remove multicollinearity; otherwise the coefficient estimates become unstable and hard to interpret.
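A common way to quantify multicollinearity is the variance inflation factor (VIF); a rule of thumb treats values above roughly 5–10 as a concern. Below is a minimal sketch (assuming pandas and statsmodels) on synthetic salary-style data where age tracks experience, so both VIFs come out large:

```python
# A sketch of detecting multicollinearity with the variance inflation factor
# (VIF). Age is constructed to track experience, so both VIFs come out large.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
experience = rng.uniform(0, 30, size=200)
age = experience + 22 + rng.normal(scale=2, size=200)   # age ~ experience + 22
X = pd.DataFrame({"experience": experience, "age": age})

X_const = sm.add_constant(X)                  # include the intercept column
for i, col in enumerate(X.columns, start=1):  # skip the constant at index 0
    print(col, variance_inflation_factor(X_const.values, i))
```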

2. Assumptions about the error terms (residuals)

Once the above two assumptions are satisfied and the linear model has been fit, we need to check the assumptions on the error terms.

2. a) Homoskedasticity of the error terms

The term homoskedasticity derives from the Greek words ‘homos’ meaning ‘same’, and ‘skedastikos’, which means ‘scattering’ or ‘dispersion’. Homoskedasticity, therefore, means ‘having the same scatter/variance.’

Image source: https://www.fireblazeaischool.in/blogs/assumptions-of-linear-regression/

Error terms should have a constant variance across all observations.

If the variance of the error terms is not constant, i.e. if you observe a pattern in the residuals (such as a funnel shape), the errors are heteroskedastic. In this case, the estimated standard errors are unreliable, so confidence intervals and significance tests become misleading and the model makes poorer predictions.
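As a minimal sketch (assuming statsmodels), one can plot the residuals against the fitted values and look for a funnel shape, or run the Breusch-Pagan test, where a small p-value suggests heteroskedasticity:

```python
# A sketch of checking homoskedasticity: synthetic data whose noise grows
# with x, then a Breusch-Pagan test (small p-value -> heteroskedastic).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=200)
y = 2.0 * x + rng.normal(scale=x)             # error variance increases with x

exog = sm.add_constant(x)
model = sm.OLS(y, exog).fit()

# A residuals-vs-fitted scatter would show a funnel here; the
# Breusch-Pagan test quantifies the same pattern.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, exog)
print("Breusch-Pagan p-value:", lm_pvalue)    # small -> heteroskedasticity
```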

2. b) No Autocorrelation of the error terms

Image source: https://www.benchmarksixsigma.com/forum/topic/39359-autocorrelation/

When the residuals are dependent on each other, there is autocorrelation. This usually occurs in time series models where the next instant is dependent on the previous instant. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard error.

Hence each error term should be independent of, and uncorrelated with, the other error terms.
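One standard check is the Durbin-Watson statistic: values near 2 indicate no autocorrelation, while values toward 0 or 4 indicate positive or negative autocorrelation. A minimal sketch (assuming statsmodels), with AR(1) errors built in on purpose:

```python
# A sketch of detecting autocorrelation: errors follow an AR(1) process,
# so the Durbin-Watson statistic comes out well below 2.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
n = 200
x = np.arange(n, dtype=float)
e = np.zeros(n)
for t in range(1, n):                         # each error depends on the last
    e[t] = 0.8 * e[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + e

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
print("Durbin-Watson:", durbin_watson(resid))  # ~2 means none; here, well below
```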

2. c) Normal Distribution of error terms

The error terms must be approximately normally distributed.

Image source: https://towardsdatascience.com/are-the-error-terms-normally-distributed-in-a-linear-regression-model-15e6882298a4

If this assumption is violated, it is not a big problem, especially with a large number of observations: the central limit theorem implies that the sampling distribution of the coefficient estimates will resemble a normal distribution irrespective of the parent distribution of the errors.

However, if the number of observations is small and the normality assumption is violated, the standard errors in your model’s output will be unreliable.
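A minimal sketch of checking this (assuming SciPy): a Q-Q plot, where normally distributed residuals hug the diagonal, plus the Shapiro-Wilk test for smaller samples, where a small p-value flags non-normality. The `resid` array here is just a stand-in for the residuals of a fitted model:

```python
# A sketch of checking residual normality with a Q-Q plot and the
# Shapiro-Wilk test; `resid` stands in for a fitted model's residuals.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

resid = np.random.default_rng(3).normal(size=150)   # stand-in residuals

stats.probplot(resid, dist="norm", plot=plt)        # Q-Q plot vs. the normal
plt.show()

stat, p_value = stats.shapiro(resid)                # small p -> non-normal
print("Shapiro-Wilk p-value:", p_value)
```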

2. d) Mean of Residuals is approximately zero

The residuals can be positive or negative due to the fact that data points can lie above, below, or on the regression line. Because ordinary least squares with an intercept places the line "in the middle" of the data points, the sum of the positive residuals and the sum of the negative residuals are equal in magnitude: the positive and negative errors cancel each other out, so the residuals sum to zero.
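This is easy to verify directly; a minimal sketch (assuming scikit-learn) where fitting with an intercept makes the residuals average out to zero up to floating-point error:

```python
# A minimal sketch: for a linear fit with an intercept, the residuals
# average to (approximately) zero by construction.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)
print("Mean of residuals:", residuals.mean())   # ~0, up to floating point
```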

Refer to Part 2 of Linear Regression Assumptions to understand how to validate these assumptions using Python.

EndNote:

I hope you enjoyed the article and got a clear picture of the assumptions of linear regression. Please drop your suggestions or queries in the comment section.

Would love to catch up with you on LinkedIn. Mail me here for any queries.

Happy reading!!!!

I believe in the power of continuous learning and sharing knowledge with the community. Your contributions are invaluable in helping me create meaningful content and resources that benefit everyone. Join me on this journey of exploration and innovation in the fascinating world of data science by donating to Buy Me a Coffee.
