Effects of Multicollinearity in Logistic Regression, SVM, and Random Forest (RF)

Shivam Raj
3 min read · Aug 9, 2019

What is multicollinearity?

Multicollinearity is a state in which two or more features of the dataset are highly correlated. In other words, if two features f₁ and f₂ can be related in the form:

f₂ = αf₁ + β

then they are said to be collinear.

If there are more than two features, say f₁, f₂, f₃, and f₄, such that

f₁ = α₁ + α₂f₂ + α₃f₃ + α₄f₄

then f₁, f₂, f₃ and f₄ are said to be multicollinear.
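
To see this concretely, here is a minimal sketch in NumPy (synthetic data; the names f1, f2, f3 mirror the notation above) showing two standard symptoms of collinearity: a correlation of exactly 1 and a rank-deficient design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.normal(size=100)
f2 = 1.5 * f1 + 2.0          # f2 = alpha*f1 + beta: perfectly collinear with f1
f3 = rng.normal(size=100)    # an independent feature

# Symptom 1: the correlation between f1 and f2 is exactly 1.
print(np.corrcoef(f1, f2)[0, 1])          # -> 1.0 (up to floating point)

# Symptom 2: with an intercept column, the design matrix loses a rank,
# because f2 = 1.5*f1 + 2*ones is a linear combination of other columns.
X = np.column_stack([np.ones(100), f1, f2, f3])
print(np.linalg.matrix_rank(X))           # -> 3, not 4
```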

How can multicollinearity be a problem?

There are two main problems when the features are multicollinear. The first is vulnerability to very small changes in the data: the weight vector can change abruptly whenever the data changes slightly, and this instability makes the model unreliable, resulting in bad performance.
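
Here is a minimal sketch of this instability (scikit-learn on synthetic data; the weak regularization via a large C is my choice, so that the default L2 penalty does not mask the effect):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 200
f1 = rng.normal(size=n)
f2 = f1 + rng.normal(scale=1e-3, size=n)          # nearly collinear with f1
y = (f1 + rng.normal(scale=0.5, size=n) > 0).astype(int)
X = np.column_stack([f1, f2])

# Weak regularization (large C) so collinearity is not hidden by the penalty.
w1 = LogisticRegression(C=1e6).fit(X, y).coef_

# A tiny perturbation of the data can shift weight between f1 and f2,
# even though their combined effect stays roughly the same.
X_perturbed = X + rng.normal(scale=1e-3, size=X.shape)
w2 = LogisticRegression(C=1e6).fit(X_perturbed, y).coef_

print(w1, w2, sep="\n")
```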

The second problem is that we cannot use the weights directly for feature importance. Training a Logistic Regression model gives us a weight vector, and in the absence of multicollinearity its components can be read directly as feature importances. So why can we not do the same when the features are multicollinear?

The answer is that when the features are multicollinear, the weight vector can change arbitrarily. Let's take an example. Suppose our weight vector is:

W* = (3, 4, 5)

and our query point is:

Xᵩ = (Xᵩ₁, Xᵩ₂, Xᵩ₃)

Now, Yᵩ = W*ᵀXᵩ = 3Xᵩ₁ + 4Xᵩ₂ + 5Xᵩ₃ … (i)

Let's say there is collinearity between the first and second features, i.e., f₂ = 1.5f₁. Then Xᵩ₂ = 1.5Xᵩ₁, so we can substitute 1.5Xᵩ₁ for Xᵩ₂ in equation (i), which becomes:

Yᵩ = 3Xᵩ₁ + 6Xᵩ₁ + 5Xᵩ₃ = 9Xᵩ₁ + 5Xᵩ₃

Now the new weight vector is :

W** = (9, 0, 5)

Hence under W* the most important feature is f₃, but under W** the most important feature is f₁, and f₂ has no importance at all, even though both weight vectors describe exactly the same model.

Therefore, when features are multicollinear, we cannot use the weight vector to get feature importance in Logistic Regression.
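
We can verify numerically that W* and W** define the same model whenever f₂ = 1.5f₁ holds in the data (a quick check in NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=5)
x3 = rng.normal(size=5)
X = np.column_stack([x1, 1.5 * x1, x3])   # enforce f2 = 1.5*f1

w_star = np.array([3.0, 4.0, 5.0])        # W*
w_dstar = np.array([9.0, 0.0, 5.0])       # W**

# Identical scores for every point, yet opposite feature-importance stories.
print(np.allclose(X @ w_star, X @ w_dstar))   # -> True
```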

Does multicollinearity affect Support Vector Machines?

The linear kernel of Support Vector Machines is very similar to Logistic Regression, so multicollinearity affects it in much the same way. If we want to use the weight vector directly for feature importance, we have to remove multicollinearity first.
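
One common way to detect (and then remove) multicollinearity before trusting the weights is the variance inflation factor. Here is a sketch using statsmodels; the usual rule of thumb (flag a VIF above roughly 5-10) is a convention, not a hard threshold:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
f1 = rng.normal(size=300)
f2 = 1.5 * f1 + rng.normal(scale=0.1, size=300)   # nearly collinear with f1
f3 = rng.normal(size=300)
X = np.column_stack([f1, f2, f3])

# VIF(i) = 1 / (1 - R^2) from regressing feature i on all the others.
for i in range(X.shape[1]):
    print(f"VIF(f{i + 1}) = {variance_inflation_factor(X, i):.1f}")
# f1 and f2 get very large VIFs; dropping one of them fixes the problem.
```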

The RBF kernel is based on distances between data points, similar to K-Nearest Neighbors. It doesn't make much sense to ask for per-feature importance here; instead, we interpret the model by looking at the data points that influenced the decision in favor of a class. So multicollinearity has no impact on extracting feature importance, simply because there is none to extract. The model itself, however, is another matter.

Let's say we have 5 features and one of them is repeated 4 times (an extreme case). That repeated feature then contributes 4 times as much to the squared distance as any other feature, so the model will probably be influenced much more by it.
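
The arithmetic is easy to check (a small NumPy sketch with made-up points). Since the RBF kernel is exp(-γ‖a - b‖²), whatever inflates the squared distance inflates the kernel's sensitivity to that feature:

```python
import numpy as np

a = np.array([1.0, 2.0])   # a point with features (f1, f2)
b = np.array([3.0, 5.0])

# Duplicate f1 four times: (f1, f1, f1, f1, f2)
a_dup = np.array([1.0, 1.0, 1.0, 1.0, 2.0])
b_dup = np.array([3.0, 3.0, 3.0, 3.0, 5.0])

print(np.sum((a - b) ** 2))          # -> 13.0 (f1 contributes 4, f2 contributes 9)
print(np.sum((a_dup - b_dup) ** 2))  # -> 25.0 (f1's contribution grew from 4 to 16)
```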

So the RBF kernel of Support Vector Machines is impacted by multicollinearity.

Does multicollinearity affect Random Forest?

Random Forest uses bootstrap sampling and feature sampling, i.e., row sampling and column sampling. Therefore Random Forest is not affected by multicollinearity that much, since each tree picks a different subset of features and, of course, sees a different subset of data points. But there is a chance of multicollinear features getting picked together, and when that happens we will see some trace of it.

Feature importance, however, will definitely be affected by multicollinearity. Intuitively, if two features have the same effect, or are related to each other, it is difficult to rank their relative importance: we cannot easily say which one matters, since we may be measuring one underlying effect through more than one variable, or both may be symptoms of a third effect.
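
Here is a sketch of this dilution effect (scikit-learn, synthetic data): duplicating the informative feature tends to split its impurity-based importance across the copies, making each copy look individually weaker.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)              # the informative feature
noise = rng.normal(size=n)               # an uninformative feature
y = (signal > 0).astype(int)

X_single = np.column_stack([signal, noise])
X_dup = np.column_stack([signal, signal, noise])   # signal duplicated

rf1 = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_single, y)
rf2 = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_dup, y)

print(rf1.feature_importances_)  # signal gets nearly all the importance
print(rf2.feature_importances_)  # roughly the same total, split between the copies
```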

Conclusion

Multicollinearity definitely affects linear models. We might get lucky with non-linear models, but some traces of it will show up there too.
