Removing Multicollinearity for Linear and Logistic Regression.

Introduction to Multicollinearity

Princeton Baretto
Analytics Vidhya
4 min read · Jun 3, 2020


What is Multicollinearity?

Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. [This was directly from Wikipedia]. Multicollinearity occurs when your model includes multiple factors that are correlated not just to your target variable, but also to each other.

Now let’s explain this in simple words…

When a column A in our dataset changes, another column B tends to change along with it. B may increase or decrease, but the two columns share a strongly similar pattern of behavior.

Assume we have a Dataset with 4 Features and 1 Continuous Target Value.

Dummy DataFrame with Y as Target Column
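The original table is shown as an image; a small synthetic DataFrame in the same spirit (the column names X1–X4 and Y, and the exact values, are just illustrative assumptions) can be built like this:

```python
import numpy as np
import pandas as pd

# Illustrative dummy data: X2 is deliberately constructed to follow X1 closely
rng = np.random.default_rng(42)
x1 = rng.normal(loc=50, scale=10, size=100)
x2 = 2 * x1 + rng.normal(scale=2, size=100)   # strongly tied to X1
x3 = rng.normal(loc=20, scale=5, size=100)    # unrelated to X1 and X2
x4 = rng.uniform(0, 1, size=100)              # unrelated to X1 and X2
y = 3 * x1 + 0.5 * x3 + rng.normal(scale=5, size=100)

df = pd.DataFrame({"X1": x1, "X2": x2, "X3": x3, "X4": x4, "Y": y})
print(df.head())
```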

Now, if we observe carefully, as the values in column X1 increase, the values in X2 also increase. This shows that X1 and X2 are related to each other; in statistical terms, the correlation coefficient between X1 and X2 is high.

In other words, X1 and X2 are highly correlated, and this situation is, in simple words, called multicollinearity.

Now you might be wondering: how will this affect the model I am building?

One of the assumptions of linear and logistic regression is that the feature columns are independent of each other. Multicollinearity clearly violates this assumption, because it means the supposedly independent features, i.e. the feature columns, actually depend on each other.

Even though your model may at times give high accuracy without eliminating multicollinearity, it can’t be relied on for real-world data. The coefficients also become very sensitive to small changes in the model. In simple terms, the model will not be able to generalize, which can cause serious failures once your model is in a production environment. Another important reason for removing multicollinearity from your dataset is to reduce the development and computational cost of your model, which brings you a step closer to the ‘perfect’ model. So be cautious and don’t skip this step!!

So now how can we Detect this multicollinearity?

Using Correlation Coefficient Heat Map

One simple approach is to inspect the correlation coefficient matrix and exclude the columns that are highly correlated with other features. The correlation coefficients for your dataframe can be found easily using pandas, and for better understanding the seaborn package helps build a heat map.

Heat Map using seaborn
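A minimal sketch of how such a heat map can be produced with pandas and seaborn (assuming the `df` built above):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the feature columns (drop the target first)
corr = df.drop(columns=["Y"]).corr()

# Visualize it as a heat map; strongly correlated pairs stand out immediately
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation coefficient heat map")
plt.show()
```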

But wait, won’t this method get complicated when we have many features?

YES!! This works for smaller datasets, but for larger datasets analyzing the heat map becomes difficult. Oops… did we get stuck? No worries, we have other methods too.

Using VIF (Variance Inflation Factor)

What’s the idea of VIF?

It takes one column at a time as the target and the remaining columns as features, and fits a linear regression model. It then calculates the R-squared value of that regression, and the VIF is the inverse of 1 − R-squared, i.e. VIF = 1 / (1 − R²).

Hence, after each iteration we get a VIF value for each column (the one taken as the target above). The higher the VIF value, the stronger the case for dropping that column before building the actual regression model; a common rule of thumb is to be suspicious of VIF values above 5 or 10.

Code Snippet for Calculating VIF for each column
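The article shows this snippet as an image; a minimal sketch of the same idea using statsmodels’ variance_inflation_factor (and assuming the `df` built earlier) could look like this:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Features only (no target); add a constant column so the intercept is accounted for
X = df.drop(columns=["Y"]).assign(const=1.0)

# Compute the VIF of each column by regressing it on all the others
vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif[vif["feature"] != "const"])
```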

But what if you don’t want to drop these columns because they may contain some crucial information? If so, you can use the small but useful trick mentioned below:

We can use Ridge or Lasso regression, because these techniques add an extra penalty term, controlled by a lambda value, that shrinks the coefficients of particular columns and in turn reduces the effect of multicollinearity.
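For example, a quick sketch with scikit-learn (here the alpha parameter plays the role of lambda, and the `df` from above is assumed):

```python
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

X = df.drop(columns=["Y"])
y = df["Y"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ridge = Ridge(alpha=1.0).fit(X_train, y_train)   # shrinks correlated coefficients
lasso = Lasso(alpha=0.1).fit(X_train, y_train)   # can push some coefficients to zero

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
```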

That’s how you can detect and remove multicollinearity in a dataset.

Pretty easy, right? Go try it out, and don’t forget to give a clap if you learned something new from this article!!

Check my GitHub Repository for the basic Python code: https://github.com/princebaretto99/removing_multiCollinearity
