The 4 Most Fundamental yet Overlooked Assumptions of Linear Regression

Parichay Pothepalli
5 min read · Oct 29, 2022


This is the second article in the ML_Algorithms_A_to_Z series:

* It focuses primarily on when to use Linear Regression and its 4 underlying assumptions.

Link to previous article 📝: you can refer to the previous article, which covers the basics of Linear Regression and Gradient Descent. This series can be used as a quick refresher for Data Science interview preparation, as we go from ground-up to intermediate-level concepts. The series of blogs aims 🎯 at a fundamental, conceptual understanding of various ML algorithms and does not emphasize code implementations much, as there are plenty of amazing resources out there for practicing on real-time data. Let's get started!

Let's take a scenario where you are given data with a dependent variable (e.g., house price, the easiest example ever) that is continuous in nature, and your task is to fit a model that uses the features (X) in the dataset to explain that dependent variable (y). You think it's obviously a regression task and do the following:

  • Clean the data, check for missing values
  • Split the data into Training data, Validation data and Testing data.
  • Scale the data and apply proper transformations for Feature Engineering.
  • Call the linear model from its respective estimator class (sklearn.linear_model)
  • Fit the linear model on the training data, blindly assuming a linear relationship. (.fit())
  • Score the model on the testing data using the default scoring metric, i.e. R² (.score()), and achieve a score of 0.2–0.6.
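As a rough sketch, the workflow above might look like the following in scikit-learn (the dataset here is synthetic, standing in for something like house prices):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for a house-price dataset
X, y = make_regression(n_samples=500, n_features=5, noise=25.0, random_state=0)

# Split, scale, fit, score -- the steps described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)
print(f"Test R²: {model.score(X_test, y_test):.3f}")
```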

Now, before we assume that Linear regression is not a good model and move onto a more complex model like SVR or Random Forests, one must first ensure the data complies with the underlying assumptions of Linear regression. This can result in a tremendous increase in R² results and a better Regression model overall.

THE 4 FAMOUS YET NEVER IMPLEMENTED ASSUMPTIONS:

It can be remembered by the acronym LIHN.

  1. Linear Relationship
  2. Independence
  3. Homoskedasticity
  4. Normality

Assumption 1: There exists a Linear relationship between X and y:

How to check?

Scatter plots can be very useful for judging whether two variables are linearly related. In Figure 1 we can clearly see that a relationship exists, but it isn't linear in nature.

Ways to fix it?

  1. Adding polynomial features (read more about this at the end of the article):
  • A useful way to recast a non-linear relationship as a linear one is to add higher-order terms according to the degree of the equation.
  • We can implement this in Scikit-Learn using PolynomialFeatures(), specifying the degree parameter to add terms to the equation.
Figure 1: X vs y where there is a non-linear relationship
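As a quick illustration (on synthetic quadratic data, not any particular dataset), PolynomialFeatures() can turn a hopeless linear fit into a near-perfect one:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data with a quadratic (non-linear) relationship: y = x² + noise
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=200)

# A plain linear fit cannot capture the curvature
linear = LinearRegression().fit(X, y)

# Adding degree-2 terms makes the relationship linear in the new features
poly = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                     LinearRegression()).fit(X, y)

print(f"Linear R²:     {linear.score(X, y):.3f}")
print(f"Polynomial R²: {poly.score(X, y):.3f}")
```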

2. Applying a non-linear transformation:

Note:- We do this because modeling non-linear relationships directly is relatively more complex and tends to produce more errors.

A non-linear transformation (such as a logarithmic, power, or exponential model) allows a very skewed dataset to move close to a normal distribution. Taking the log of one or both variables effectively changes the interpretation from a unit change to a percent change.
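For instance, a log transform dramatically reduces the skew of log-normally distributed prices (the data and parameters below are synthetic, purely for illustration):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Log-normal data: heavily right-skewed, as price data often is
prices = rng.lognormal(mean=12, sigma=0.8, size=5000)

# Taking the log brings the distribution close to normal
log_prices = np.log(prices)

print(f"Skewness before log transform: {skew(prices):.2f}")
print(f"Skewness after log transform:  {skew(log_prices):.2f}")
```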

Assumption 2: Consecutive residuals do not have a pattern among them (Independence)

This is mostly a concern when working with time-series data and can generally be ignored when working with cross-sectional data, i.e. data collected in a single time frame.
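If you do want to check this, the Durbin-Watson statistic is a common choice. Below is a minimal hand-rolled sketch on simulated residuals (statsmodels also provides this as statsmodels.stats.stattools.durbin_watson):

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: values near 2 suggest no first-order
    autocorrelation; values near 0 suggest positive autocorrelation."""
    resid = np.asarray(resid)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(1)

# Independent residuals -> statistic close to 2
independent = rng.normal(size=500)

# AR(1) residuals with strong positive autocorrelation -> statistic near 0
correlated = np.zeros(500)
for t in range(1, 500):
    correlated[t] = 0.9 * correlated[t - 1] + rng.normal()

print(f"DW, independent residuals: {durbin_watson(independent):.2f}")
print(f"DW, correlated residuals:  {durbin_watson(correlated):.2f}")
```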

Assumption 3: Homoskedasticity:

How to check?

Create a Fitted value vs Residual plot.

When we measure the variance of y at every level of X, we should find that the variance is constant. In Figure 2, the right plot shows the variance changing as X changes.

The problem described above is called heteroskedasticity. It does not bias the coefficient estimates themselves, but it distorts their standard errors, which can make variables that are not actually significant appear significant in tests of significance at a given level α.

Note: A cone shape similar to the one in Fig 2, right subplot is clearly indicative of Heteroskedasticity.

Figure 2: Variance of fitted value vs residuals.
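Alongside the plot, you can check numerically whether the residual spread grows with the fitted values. A rough sketch on simulated heteroskedastic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(1000, 1))

# Heteroskedastic data: noise standard deviation grows with X (the "cone" shape)
y = 3 * X[:, 0] + rng.normal(0, 0.5 + 0.5 * X[:, 0])

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
resid = y - fitted

# Compare residual spread on the low vs high half of the fitted values;
# roughly equal spreads would suggest homoskedasticity
low = resid[fitted < np.median(fitted)]
high = resid[fitted >= np.median(fitted)]
print(f"Residual std, low fitted values:  {low.std():.2f}")
print(f"Residual std, high fitted values: {high.std():.2f}")
```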

Ways to fix it?

  1. Transform the dependent variable(y):

Applying log(y), or a similar variance-stabilizing transformation such as the square root, to the dependent variable can often remove heteroskedasticity.

2. Redefining the dependent variable(y):

We can redefine the dependent variable so that its variance does not grow with scale, for example using a per-capita rate instead of a raw total. Raw totals (and their variance) tend to grow with population size, so dividing by the population removes much of the heteroskedasticity.
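A sketch of the log-transform fix on synthetic data with multiplicative noise: the residual spread grows with x on the raw scale but is roughly constant on the log scale.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 2000)

# Multiplicative noise: the spread of y grows with x (heteroskedastic)
y = 2 * x * np.exp(rng.normal(0, 0.3, size=2000))

# On the log scale the noise becomes additive with constant variance:
# log(y) = log(2) + log(x) + noise
log_y = np.log(y)

# Compare residual spread for small vs large x on each scale
low, high = x < 5.5, x >= 5.5
raw_resid = y - 2 * x
log_resid = log_y - np.log(2 * x)
print("Raw-scale resid std (low x, high x):",
      round(raw_resid[low].std(), 2), round(raw_resid[high].std(), 2))
print("Log-scale resid std (low x, high x):",
      round(log_resid[low].std(), 2), round(log_resid[high].std(), 2))
```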

Assumption 4: The residuals are normally distributed:

How to check?

  1. Q-Q plot:

We can use quantile-quantile plots to determine whether the residuals are normally distributed. A typical Q-Q plot looks like the one below in Figure 3; points falling along the diagonal indicate that the residuals are normally distributed.

Figure 3: Q-Q Plot showing normal distribution
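SciPy's stats.probplot computes the Q-Q points and a fitted line; the correlation r between the points and that line is close to 1 when the residuals are approximately normal. A sketch on simulated residuals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Stand-in for model residuals; pass plot=plt to probplot to actually draw it
resid = rng.normal(0, 1, size=500)
(osm, osr), (slope, intercept, r) = stats.probplot(resid)
print(f"Q-Q correlation, normal residuals: {r:.4f}")

# Skewed residuals bend away from the line, pulling the correlation down
skewed = rng.exponential(size=500)
_, (_, _, r_skewed) = stats.probplot(skewed)
print(f"Q-Q correlation, skewed residuals: {r_skewed:.4f}")
```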

2. Shapiro-Wilk and Kolmogorov-Smirnov tests:

These tests can also be used to check for normality. However, as Q-Q plots are visual, they generally provide a better intuition for the normality assumption of residuals.
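A minimal sketch using scipy.stats.shapiro on simulated residuals. The test's null hypothesis is normality, so a small p-value means we reject it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
normal_resid = rng.normal(size=300)       # residuals from a well-behaved fit
skewed_resid = rng.exponential(size=300)  # clearly non-normal residuals

# Shapiro-Wilk: small p-value -> reject the hypothesis of normality
_, p_normal = stats.shapiro(normal_resid)
_, p_skewed = stats.shapiro(skewed_resid)
print(f"p-value, normal residuals: {p_normal:.3f}")
print(f"p-value, skewed residuals: {p_skewed:.2e}")
```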

How to fix this?

  1. Apply transformations:

A general rule of thumb when most of the assumptions are not met is to apply various types of transformations such as (inverse, log, power, exponential) to ensure the underlying assumptions are met.

2. Check and remove huge outliers:

Datasets with huge outliers tend to reject the assumption, hence cleaning and transforming the data as needed is important.
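One common rule of thumb for flagging outliers (an illustration, not the only option) is the 1.5 × IQR rule, sketched here on synthetic data with injected outliers:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(50, 5, size=1000)
y_with_outliers = np.append(y, [500.0, -300.0, 800.0])  # inject huge outliers

# IQR rule: drop points beyond 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(y_with_outliers, [25, 75])
iqr = q3 - q1
mask = ((y_with_outliers >= q1 - 1.5 * iqr)
        & (y_with_outliers <= q3 + 1.5 * iqr))
cleaned = y_with_outliers[mask]

print(f"Dropped {len(y_with_outliers) - len(cleaned)} points as outliers")
```

Note that the rule also trims a few ordinary tail points, so inspect what it drops before committing to it.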

So far we have discussed the various assumptions of Linear Regression and ways to fix violations of each.

Hope this article has given you a good idea about when to use Linear regression and how to ensure the assumptions of Linear Regression are met.

The next article focuses on Polynomial Regression and regularized regression models.

Thanks for reading this article and stay tuned for more content. If you liked it, please drop a clap or a comment. Any feedback or scope for improvement would be highly appreciated.
