Multicollinearity — How does it create a problem?
Understand in-depth how multicollinearity can affect the model’s performance
Before performing regression analysis, we check several things: whether the independent variables are correlated with each other, whether the features we select are significant, and whether there are missing values and, if so, how to handle them.
First, let’s understand what Dependent and Independent Variables are —
- Dependent variable: the value that has to be predicted during regression. Also known as the target.
- Independent variables: the values we use to predict the dependent variable. Also known as predictors.
If we have an equation like this
y = w*x
Here, y is the dependent variable, x is the independent variable, and w is the coefficient (weight) learned by the regression.
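As a quick illustration, here is a minimal sketch (with made-up data, using scikit-learn) of fitting such an equation; the numbers and names are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# x is the independent variable (predictor), y is the dependent variable (target)
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 3 * x.ravel() + np.random.normal(scale=0.1, size=10)  # roughly y = 3*x

model = LinearRegression(fit_intercept=False).fit(x, y)
print(model.coef_)  # the learned weight w, close to 3
```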
We’ll see later how it is detected but first, let’s see what problem will be there if variables are correlated.
Understanding Conceptually —
Imagine you went to watch a rock band’s concert. There are two singers, a drummer, a keyboard player, and two guitarists. You can easily tell the singers apart because one is male and the other is female, but you have trouble telling which guitarist is playing better.
Both guitarists are playing with the same tone, at the same pitch, and at the same speed. If you removed one of them, it would hardly matter, since the two are almost the same.
The benefit of removing one guitarist is cost-cutting and a smaller team. In machine learning, it is fewer features for training, which leads to a less complex model.
Here, the two guitarists are collinear. If one plays slowly, the other also plays slowly; if one plays faster, the other plays faster too.
Similarly, if two variables are collinear, then when one increases the other also increases, and vice versa.
Understanding Mathematically —
Let’s consider the equation
y = w1*A + w2*B
where A and B are highly correlated.
The coefficient w1 is the increase in y for every unit increase in A while holding B constant. But in practice this is not possible, because A and B are correlated: if A increases by one unit, then B also increases by some amount. Hence, we cannot measure the individual contribution of either A or B. The solution is to remove one of them.
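To make this concrete, here is a small simulation (purely illustrative, with synthetic data and scikit-learn). B is built to be almost a copy of A, and the fitted coefficients w1 and w2 swing noticeably from run to run even though their sum stays stable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

for seed in range(3):
    rng = np.random.default_rng(seed)
    A = rng.normal(size=200)
    B = A + rng.normal(scale=0.01, size=200)          # B is nearly identical to A (highly correlated)
    y = 2 * A + 3 * B + rng.normal(scale=0.5, size=200)

    w1, w2 = LinearRegression().fit(np.column_stack([A, B]), y).coef_
    print(f"w1={w1:.2f}, w2={w2:.2f}, w1+w2={w1 + w2:.2f}")  # individual weights vary; the sum stays near 5
```

The model can still predict y well, because only the combined effect w1 + w2 matters for prediction; it is the individual coefficients that become unreliable.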
Checking for Multicollinearity —
There are two common ways to check for multicollinearity:
- Correlation Matrix
- Variance Inflation Factor (VIF)
Correlation Matrix — A correlation matrix is a table showing correlation coefficients between variables.
We are not going to cover how the correlation matrix is calculated.
I consider an absolute correlation above 0.75 to indicate high correlation.
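As a sketch of how this check might look in Python with pandas (the DataFrame and feature names below are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
A = rng.normal(size=100)
df = pd.DataFrame({
    "A": A,
    "B": A + rng.normal(scale=0.1, size=100),  # strongly correlated with A
    "C": rng.normal(size=100),                 # unrelated to A and B
})

corr = df.corr()
print(corr.round(2))

# flag feature pairs whose absolute correlation exceeds the 0.75 threshold
high = (corr.abs() > 0.75) & (corr.abs() < 1.0)
print(high)
```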
Variance Inflation Factor — The variance inflation factor (VIF) is the ratio of the variance of a coefficient in a model with multiple terms to its variance in a model containing that term alone. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis. A VIF value can be interpreted as follows:
- 1 (Not collinear)
- 1–5 (Moderately collinear)
- >5 (Highly collinear)
Features with a VIF value above 5 are removed.
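A possible implementation of this check using statsmodels (again with a made-up DataFrame; the column names are illustrative) is sketched below:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
A = rng.normal(size=100)
df = pd.DataFrame({
    "A": A,
    "B": A + rng.normal(scale=0.1, size=100),  # nearly a copy of A
    "C": rng.normal(size=100),
})

# add an intercept column, since the VIF computation expects a full design matrix
X = add_constant(df)

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
).drop("const")

print(vif.round(2))                  # A and B get a VIF well above 5; C stays close to 1
print(vif[vif > 5].index.tolist())   # candidate features to drop
```

In practice, you would drop one of the correlated features rather than all of them, and then recompute the VIFs, since removing a feature changes the values for the remaining ones.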
Conclusion —
Multicollinearity can significantly reduce the model’s performance, and we may not even know it is happening. Checking for it is a very important step in the feature selection process. Removing multicollinearity also reduces the number of features, which eventually results in a less complex model and less overhead for storing those features.
Make sure to run the multicollinearity test before performing any regression analysis.