Variance Inflation Factor (VIF) and its relationship with multicollinearity.
I was sitting in a tea shop with friends one weekend, chatting while our tea was being prepared, when one of them suddenly asked, “Do you know what matters most to the taste of tea?” One of us answered, “It’s the sugar”; someone else said, “No, it’s the tea leaves.” That made me think: what really determines the outcome of an event, and is there a way to measure it? That was how I was first introduced to the term ‘feature engineering’.
What is Feature Engineering?
Feature engineering is the science of analyzing the factors or events that determine or influence an outcome. How much the sugar matters to the taste of the tea, or how important the tea leaves are: each of these is a feature that determines the output, i.e., the taste of the tea (assuming it can be quantified).
While doing feature engineering we often come across the term multicollinearity. Multicollinearity simply means that there is a relationship, usually a linear one, between the feature variables. Let’s consider an example to understand this better:
The data frame for this example holds the salary of an individual, which is our target variable (the one we are going to predict), along with two features that determine the salary: ‘age’ and ‘experience’.
The example clearly shows some kind of relationship between age and experience: as people get older, they tend to have more years of experience.
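A quick way to see this kind of relationship is to look at the correlation between the two features. The sketch below uses pandas and a small made-up data set (the numbers are hypothetical, chosen only for illustration, and are not the table from this article):

```python
import pandas as pd

# Hypothetical values, invented for illustration only.
df = pd.DataFrame({
    "age":        [25, 28, 30, 35, 40, 45, 50, 55],
    "experience": [2, 5, 6, 12, 16, 22, 25, 32],
})

# Pearson correlation between the two features (ranges from -1 to 1).
r = df["age"].corr(df["experience"])
print(f"Correlation between age and experience: {r:.3f}")
```

A value close to 1 (or -1) means the two features carry almost the same information.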
So what if they are collinear? Is that a problem, and if so, how? The biggest problem is that when your features are correlated with each other, you cannot tell which one affects the target variable the most; in simple words, it becomes difficult to separate their individual effects. The whole point of feature engineering is to keep only the most significant features, which makes our model robust and efficient.
How do we measure multicollinearity?
This is where the VIF (Variance Inflation Factor) comes into the picture. The variance inflation factor measures how strongly each predictor (feature) is correlated with the other predictors. What is its formula?
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$
Here Rj (squared) is the R-squared value obtained by regressing feature j on all the other features, where j is the feature whose VIF we are calculating. Say that in the example above we want to calculate the VIF of age. To do so, we treat all the other features (here, only experience) as predictors and treat age, the feature whose VIF we want, as the target. So, to calculate the VIF of age:
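Formula to calculate R-squared:
$$R_j^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$
where SS_res is the sum of squared residuals of that regression and SS_tot is the total sum of squares of the feature being predicted (here, age).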
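The calculation described above can be sketched by hand with NumPy. This is a minimal illustration, not the exact computation from this article: the data values are hypothetical, and `r_squared` is a helper name I made up for the example.

```python
import numpy as np

# Hypothetical values, invented for illustration only.
age = np.array([25, 28, 30, 35, 40, 45, 50, 55], dtype=float)
experience = np.array([2, 5, 6, 12, 16, 22, 25, 32], dtype=float)

def r_squared(y, X):
    """R-squared of an ordinary least squares fit of y on X (with intercept)."""
    design = np.column_stack([np.ones(len(y)), X])  # add intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = np.sum(residuals ** 2)        # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# Treat age as the target and the remaining feature as the predictor.
r2_age = r_squared(age, experience)
vif_age = 1.0 / (1.0 - r2_age)

print(f"R-squared (age ~ experience): {r2_age:.4f}")
print(f"VIF for age: {vif_age:.1f}")
```

With more than two features, `X` would simply hold all the other feature columns side by side.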
The VIF comes out to be more than 10, which shows a high correlation between age and experience. A rule of thumb for interpreting the variance inflation factor:
- 1 = not correlated.
- Between 1 and 8 = moderately correlated.
- Greater than 8 = highly correlated.
Exactly how large a VIF has to be before it causes issues is a subject of debate. What is known is that the higher your VIF, the less reliable your regression results are going to be. In general, a VIF above 10 indicates high correlation and is a cause for concern.
A Pythonic Way of Calculating VIF
Though letting a library compute the VIF is a shortcut, knowing how it is calculated gives you a better understanding of the concept.
I hope this cleared up your understanding of VIF and multicollinearity. See you soon.