Assumptions of Linear Regression

Sakshi Manga · Published in Analytics Vidhya · May 18, 2020 · 4 min read

Linear Regression is a standard technique for analyzing the relationship between variables. It is a model that assumes a linear relationship between the input variables (x) and a single output variable (y). More specifically, it assumes that y can be calculated as a linear combination of the input variables (x).
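For example, with three input variables the model takes the form y = β0 + β1·x1 + β2·x2 + β3·x3 + ε, where the β coefficients are estimated from the data and ε is a random error term.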

In this blog, we are going to learn about some of its assumptions and how to check for them in a dataset.

  1. Linear Distribution
  2. Presence of Normality
  3. Multicollinearity
  4. Autocorrelation
  5. Homoscedasticity

Let’s discuss them one by one:

1. Linear Distribution: The relationship between each independent variable and the target variable should be linear, i.e., a change in an independent variable should correspond to a proportional change in the target. To check for linearity, we can simply plot a scatter plot of each feature against the target.

2. Presence of Normality: There are many distributions in statistics, and by the Central Limit Theorem the distribution of sample means approaches normality as the number of observations grows (a common rule of thumb is more than 30); for this reason, approximate normality is often assumed for large samples.

Presence of Normality simply means that all the features in the “X” feature matrix should follow a normal distribution, and to check this we can use a histogram.

3. Multicollinearity: It is defined as correlation among the features used for regression analysis, i.e., a measure of correlation among the columns of the “X” feature matrix.

For a good regression analysis, we don’t want the features to be heavily dependent on each other, since changing one might change another. We want very little or no multicollinearity, and to check for it we can use Pearson’s correlation coefficient or a heatmap.
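For reference, Pearson’s correlation coefficient between two variables x and y is r = cov(x, y) / (σx·σy); it ranges from −1 to +1, and values close to ±1 between two features signal strong linear dependence.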

4. Autocorrelation: It can be defined as correlation between adjacent observations in the vector of predictions (or of the dependent variable). Sometimes the value of y(x+1) depends on the value of y(x), which in turn depends on y(x−1). Stock market data, and time-series datasets in general, are typical examples of autocorrelated data, and we can use a line plot to check for its presence.

5. Homoscedasticity: (homo = same | scedasticity = scatter) It can be defined as a property of regression models where the errors (the “noise”, or random disturbance between input and output variables) have roughly the same variance across all values of the input variables.

If the error variance changes drastically across input values, the residual scatter plot takes a funnel shape; this condition is called heteroscedasticity and can break our regression model. We can use a scatter plot of residuals against fitted values to check for it in the dataset.

Now, let’s understand them one by one diagrammatically.

Consider a dataset having three features and one target variable.

What are the features?

TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)

Radio: advertising dollars spent on Radio

Newspaper: advertising dollars spent on Newspaper

What is the response?

Sales: sales of a single product in a given market (in thousands of widgets)
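To make the checks concrete, here is a minimal sketch in Python that loads such a dataset with pandas. The file name Advertising.csv and the column names TV, Radio, Newspaper, and Sales are assumptions based on the description above, so adjust them to match your copy of the data.

```python
import pandas as pd

# Load the advertising data (file name assumed; adjust to your copy)
df = pd.read_csv("Advertising.csv")

features = ["TV", "Radio", "Newspaper"]  # independent variables
target = "Sales"                         # dependent variable
print(df[features + [target]].head())
```

The snippets below reuse this df.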

1. Linear Distribution: To check this, we make a scatter plot between each independent variable and the target variable.
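A minimal sketch with matplotlib, assuming the df loaded above:

```python
import matplotlib.pyplot as plt

# One scatter plot per feature against the target; a roughly straight
# upward or downward trend suggests a linear relationship.
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, col in zip(axes, ["TV", "Radio", "Newspaper"]):
    ax.scatter(df[col], df["Sales"], alpha=0.6)
    ax.set_xlabel(col)
axes[0].set_ylabel("Sales")
plt.show()
```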

2. Presence of Normality: We draw a histogram for each independent variable and for the dependent variable.
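A minimal sketch, again assuming the df loaded above:

```python
import matplotlib.pyplot as plt

# One histogram per column; a roughly bell-shaped histogram suggests
# the variable is approximately normally distributed.
df[["TV", "Radio", "Newspaper", "Sales"]].hist(bins=20, figsize=(10, 6))
plt.show()
```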

3. Multicollinearity: To check for multicollinearity, we can use Pearson’s correlation coefficient or a heatmap.
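A minimal sketch using seaborn’s heatmap on the pairwise Pearson correlations, assuming the df loaded above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise Pearson correlations among the features; values close to
# +1 or -1 between two features indicate multicollinearity.
corr = df[["TV", "Radio", "Newspaper"]].corr(method="pearson")
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```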

4. Autocorrelation: We plot the residuals of the model in observation order and look for patterns.
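A minimal sketch, assuming the df loaded above; the model here uses scikit-learn’s LinearRegression, which the original post does not specify:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fit a linear model and plot residuals in observation order; a visible
# pattern (waves, long runs of same-sign residuals) suggests autocorrelation.
X = df[["TV", "Radio", "Newspaper"]]
y = df["Sales"]
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

plt.plot(residuals.values)
plt.xlabel("Observation order")
plt.ylabel("Residual")
plt.show()
```

Since this advertising data is not a time series, we would expect no strong pattern here.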

5. Homoscedasticity: We plot the residuals against the fitted values and check whether their spread stays roughly constant.
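A minimal sketch reusing model, X, and residuals from the previous step:

```python
import matplotlib.pyplot as plt

# Residuals vs fitted values; a constant band of points indicates
# homoscedasticity, while a funnel shape indicates heteroscedasticity.
fitted = model.predict(X)
plt.scatter(fitted, residuals, alpha=0.6)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residual")
plt.show()
```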
