Multicollinearity: A Beginner’s Guide

Arjith Babu · Published in Analytics Vidhya · Sep 19, 2020 · 8 min read

Regression is a way of describing the relationship between a dependent variable and independent variables. But what if the independent variables are related to each other? This situation is called multicollinearity.

Multicollinearity is a state of very high correlation among the independent variables, i.e. a predictor variable can be used to predict another predictor variable.

When independent features are highly correlated, they have a similar impact on the dependent variable, so the regression model fails to isolate the individual effect of each independent variable on the dependent variable.

Reasons for multicollinearity:

  • It can be caused by the inaccurate use of dummy variables.
  • It can be caused by the inclusion of a variable which is computed from other variables in the data set.
  • Multicollinearity can also result from the repetition of the same kind of variable, e.g. ‘Sex’ and ‘Gender’.

Let us consider the following scenario -

Salary of a person in an organization is a function of ‘Years of experience’, ‘Age’, ‘X3’, ‘X4’, …

Salary = β0 + β1(Years of experience) + β2(Age) + …

β1: the marginal effect on salary of one additional unit of ‘Years of experience’, holding the other variables constant

β2: the marginal effect on salary of one additional unit of ‘Age’, holding the other variables constant

Multicollinearity is when the independent variables themselves are correlated, so that their individual effects are obscured.

What regression does is tease apart the individual effects β1 and β2 on ‘Salary’.

But the problem here is that the more experienced a person is, the older they are likely to be. So regression cannot differentiate the impact of ‘Years of experience’ from that of ‘Age’ on ‘Salary’.

It cannot tell whether the increase in ‘Age’ or the increase in ‘Years of experience’ has led to the increased ‘Salary’.

So can we hold the other variables constant?

Let’s see with an example.

Consider the following Data frame -

(DataFrame with the columns ‘Years of experience’, ‘Age’ and ‘Salary’)

From this data, we will build a regression model to predict salary.

CODE:
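A minimal sketch of the model fit using statsmodels, assuming a pandas DataFrame df with columns named YearsExperience, Age and Salary; the small data set below is purely illustrative and not the article’s actual data.

```python
import pandas as pd
import statsmodels.api as sm

# Illustrative data; the real data frame will have its own values.
df = pd.DataFrame({
    "YearsExperience": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Age":             [22, 23, 25, 27, 28, 30, 31, 33, 35, 36],
    "Salary":          [30000, 35000, 42000, 48000, 55000,
                        61000, 68000, 74000, 81000, 90000],
})

# Regress Salary on YearsExperience and Age (with an intercept).
X = sm.add_constant(df[["YearsExperience", "Age"]])
y = df["Salary"]
model = sm.OLS(y, X).fit()

# The summary shows each coefficient, its standard error and its p-value.
print(model.summary())
```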

OUTPUT:

In this case, the independent variables (‘Years of experience’ and ‘Age’) are positively correlated with the dependent variable (‘Salary’).

  • A 1 unit increase in ‘Years of experience’ will lead to a 5592.85 unit increase in ‘Salary’.
  • A 1 unit increase in ‘Age’ will lead to a 324.49 unit increase in ‘Salary’.

We know that as ‘Years of experience’ increases, the age of the person also increases. They are positively correlated.

CODE:
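Continuing with the same assumed df, a sketch of the correlation check between the two predictors:

```python
# Pairwise (bivariate) correlation between the independent variables.
print(df[["YearsExperience", "Age"]].corr())

# Or just the single Pearson coefficient between the two predictors.
print(df["YearsExperience"].corr(df["Age"]))
```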

OUTPUT:

This is the reason for the high standard errors of both independent variables.

Their p-values are greater than the alpha value (0.05), which means neither of the independent variables is significant.

Multicollinearity makes it hard to assess the relative importance of the independent variables in explaining the variation in the dependent variable.

HOW DOES MULTICOLLINEARITY AFFECT THE MODEL?

The coefficients here are still unbiased; they remain our best guesses for β1 and β2, but they become very sensitive. Multicollinearity makes the estimates highly sensitive to minor changes in the model specification. If the variables behind β1 and β2 are correlated, the variance of the affected coefficients is inflated. Their standard errors shoot up, so both p-values become high and neither of the two variables appears statistically significant. It is as if both variables are fighting to explain the same effect on salary but get in each other’s way, because they move in the same direction. Regression has a hard time teasing apart the effect of each variable on salary, so the standard errors get very high.

HOW TO DETECT MULTICOLLINEARITY.

1. Bivariate correlation analysis -

0 to 0.5 — Low positive correlation

0.5 to 1.0 — High positive correlation

0 to -0.5 — Low negative correlation

-0.5 to -1.0 — High negative correlation

HOW MUCH IS TOO MUCH CORRELATION?

Commonly, a correlation value between -0.5 and 0.5 is considered okay. But there is no hard and fast rule; it depends on the data we handle.

2. Variance inflation factor (VIF) —

The variance inflation factor (VIF) quantifies the extent of correlation between one predictor and the other predictors in a model.

A VIF can be computed for each predictor in a predictive model.

  • A value of 1 means that the predictor is not correlated with other variables.
  • The higher the value, the greater the correlation of the variable with other variables.
  • Values of more than 4 or 5 are sometimes regarded as being moderate to high, with values of 10 or more being regarded as very high.

These numbers are just rules of thumb; in some contexts, a VIF of 2 could be a great problem (e.g., if estimating price elasticity), whereas, in straightforward predictive applications, very high VIFs may be unproblematic.

If one variable has a high VIF, at least one other variable must also have a high VIF. In the simplest case, two variables will be highly correlated, and each will have the same high VIF.

VIF is more robust because we are no longer looking only at bivariate relations. For VIF we take one feature at a time and regress it on all the other features in a linear model.

Suppose there are 4 independent features: X1, X2, X3, X4.

VIF shows how much X1 is being explained by the 3 other variables and how much of its information is already contained in the 3 other variables.

Manual Calculation of VIF

X1 = β0 + β1*X2 + β2*X3 + β3*X4

Then we fit this linear model and calculate its R².

VIF(X1) = 1/(1 - R²)

So suppose R² is 0.8:

VIF(X1) = 1/(1 - 0.8) = 5

CODE:
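A sketch of the VIF calculation using statsmodels’ variance_inflation_factor. The feature names X1 to X4 follow the example above, but the data here is synthetic (X3 is deliberately built from X1 and X2), so the resulting numbers, like the VIF of 12 quoted below, are only illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic example with four predictors; X3 is built from X1 and X2
# so that it ends up with a high VIF.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "X1": rng.normal(size=100),
    "X2": rng.normal(size=100),
    "X4": rng.normal(size=100),
})
X["X3"] = 0.6 * X["X1"] + 0.7 * X["X2"] + rng.normal(scale=0.2, size=100)
X = X[["X1", "X2", "X3", "X4"]]

# VIF is computed on the design matrix including the constant.
design = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(design.values, i) for i in range(1, design.shape[1])],
    index=X.columns,
    name="VIF",
)
print(vif)
```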

OUTPUT:

(Not the actual output; shown for illustration only.)

Here X3 has a VIF score of 12, which means X3 is largely explained by the other 3 variables.

HOW MUCH IS TOO MUCH VIF?

Normally the threshold is 5 or 10, depending on the data. A threshold of 3 is sometimes used when the data is large and of good quality, such as audited financial reports.

Why is VIF better than correlation analysis?

Correlation is bivariate, i.e. it can only show the connection between 2 variables at a time. With VIF, X1 can have a weak connection with each of the other variables, yet those weak connections together can still lead to multicollinearity.

SOLUTIONS TO MULTICOLLINEARITY

1. Ignore multicollinearity

  • If the model is used for prediction only, the standard errors of the coefficients are not important. As long as the coefficients themselves are unbiased, which they are, we can use the model for prediction. Multicollinearity does not invalidate the other variables in the model.
  • If the correlated variables are not of particular interest to the study question. For example, if our target variable is ‘Sales’ and our objective is to find the impact of ‘Advertisement’ on ‘Sales’, we can ignore correlation among the other independent features.
  • If the correlation is not extreme.

2. Remove one of the correlated variables -

If the variables provide the same information, e.g. ‘Height’, ‘Weight’ and ‘BMI’ will have high VIF scores. We can drop ‘Height’ and ‘Weight’, since ‘BMI’ is computed from height and weight.

NOTE: Beware of omitted variable bias.

3. Combine the correlated variables -

‘Years of experience’ and ‘Age’ can be combined into a new feature, a ‘Seniority score’, so that there is no information loss.
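A sketch of one simple way to do this, assuming the same df as before: standardise both variables and average them into a single combined feature (the name SeniorityScore and the averaging scheme are illustrative choices, not a prescribed method).

```python
# Standardise each variable so they are on the same scale, then average them.
z_exp = (df["YearsExperience"] - df["YearsExperience"].mean()) / df["YearsExperience"].std()
z_age = (df["Age"] - df["Age"].mean()) / df["Age"].std()

# A single combined predictor that keeps information from both variables.
df["SeniorityScore"] = (z_exp + z_age) / 2
```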

4. PCA -

PCA (Principal Component Analysis) is a way to deal with highly correlated variables without having to remove them: it transforms the correlated predictors into a set of uncorrelated components.
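A sketch with scikit-learn, again using the assumed df: PCA rotates the correlated predictors into uncorrelated principal components, which can then be used as regression features.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale the correlated predictors, then rotate them into uncorrelated components.
features = df[["YearsExperience", "Age"]]
scaled = StandardScaler().fit_transform(features)
components = PCA(n_components=2).fit_transform(scaled)

# The principal components are orthogonal, so they carry no multicollinearity.
df["PC1"], df["PC2"] = components[:, 0], components[:, 1]
```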

Using the same data frame, let us first build a linear regression model for ‘Salary’ with the ‘Years of experience’ variable alone.

CODE:
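A sketch of the single-predictor model, using the same assumed df and statsmodels setup as before:

```python
# Regress Salary on YearsExperience alone.
X_single = sm.add_constant(df[["YearsExperience"]])
simple_model = sm.OLS(df["Salary"], X_single).fit()

# Note the coefficient, standard error and p-value of YearsExperience.
print(simple_model.summary())
```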

OUTPUT:


The p-value is 0.00001, which is not surprising since years of experience affects salary.

Now add the ‘Age’ variable, which is highly correlated with ‘Years of experience’, to the regression model.

Will the coefficient of ‘Years of experience’ change? Will its standard error or p-value change?
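A sketch of the same model with Age added, so the coefficient, standard error and p-value of YearsExperience can be compared with the single-predictor model above:

```python
# Refit with the correlated Age predictor included.
X_both = sm.add_constant(df[["YearsExperience", "Age"]])
full_model = sm.OLS(df["Salary"], X_both).fit()

# Compare the YearsExperience coefficient and its standard error across the two models.
print(simple_model.params["YearsExperience"], simple_model.bse["YearsExperience"])
print(full_model.params["YearsExperience"], full_model.bse["YearsExperience"])
```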

OUTPUT:


The standard error increased but the coefficient remains almost the same. This shows that the coefficients stay unbiased while the standard errors ramp up significantly.

PERFECT MULTICOLLINEARITY

i.e. a correlation coefficient of exactly 1.0 (or -1.0).

Perfect multicollinearity occurs when one of the regressors is an exact linear function of the other regressors.

If two variables are perfectly collinear, the model fails and we cannot get any regression output: the model is trying to tease out their individual effects, but the data gives it no leverage to do so.
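A small numeric sketch of why this happens, using an example where one regressor is exactly 200 times another (like laps and distance in the example below): the design matrix loses a rank, so X'X cannot be inverted and the usual OLS solution does not exist.

```python
import numpy as np

laps = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
distance = laps * 200.0                      # an exact linear function of laps
X = np.column_stack([np.ones_like(laps), laps, distance])

# Three columns but only rank 2: X'X is singular (or numerically near-singular),
# so there is no unique set of coefficients.
print(np.linalg.matrix_rank(X))              # 2, not 3
print(np.linalg.cond(X.T @ X))               # an astronomically large condition number
```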

REAL WORLD EXAMPLES

  1. Calories burned by a runner on a track = β0 + β1(Distance) + β2(Laps) + β3(Hours slept the previous night) + β4(BMI) + …
  • If one lap is 200 m, then Distance = number of laps * 200, so both variables contain the same information, just in different units, and no extra information is provided. Here we have to remove one of the variables.

2. Dummy variables: when dummy variables are created for a categorical variable with 4 categories, we have to drop one of the 4 dummy variables to avoid multicollinearity.

Suppose the dummy variables for the categorical variable ‘City’ are city1, city2, city3 and city4.

DATASET:

In the above data frame with 4 dummy variables, 1 indicates True and 0 indicates False.

We can observe from the first row that the ‘City’ is city1.

Now check the correlation between city1, city2, city3 and city4:

Output:

We can see that these features are correlated among themselves. This is because any one of the 4 dummies can be calculated from the other 3, so it does not need to be a separate variable.

We can make use of the drop_first parameter while creating the dummies to avoid this correlation between the dummy variables, as sketched below.

Since the values in the first row for the columns city2, city3 and city4 are 0, 0 and 0 respectively, that row still represents city1.

city1 is a linear combination of city2, city3 and city4

i.e. city1 = 1 - city2 - city3 - city4
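A sketch of dummy creation with pandas, assuming a City column holding the four city labels: with all four dummies any one column equals 1 minus the sum of the other three, and drop_first=True removes the first dummy so the exact linear relationship disappears.

```python
import pandas as pd

cities = pd.DataFrame({"City": ["city1", "city2", "city3", "city4", "city1", "city3"]})

# All four dummies: perfectly collinear, since city1 = 1 - city2 - city3 - city4.
all_dummies = pd.get_dummies(cities["City"])
print(all_dummies)

# drop_first=True drops the city1 column and breaks the exact linear relationship.
reduced = pd.get_dummies(cities["City"], drop_first=True)
print(reduced)
```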

Wooldridge has a good discussion of multicollinearity in Chapter 3 of his book Introductory Econometrics: https://economics.ut.ac.ir/documents/3030266/14100645/Jeffrey_M._Wooldridge_Introductory_Econometrics_A_Modern_Approach__2012.pdf

Conclusion

In this article, we had a detailed walkthrough of the basics of multicollinearity: how it affects our model, how to detect it and how to resolve the problem, with some real-world examples.
