Variance Inflation Factor (VIF)
We hear about multicollinearity whenever regression models come up, but we rarely stop to ask how to check for it.
I know that plotting a correlation matrix with “Matplotlib” helps, but it only shows pairwise correlations, and that’s not what I’m looking for here.
We are going to discuss the Variance Inflation Factor (VIF), but before that, let’s have a quick discussion on multicollinearity.
- Multicollinearity means that two or more independent variables in a model are correlated with each other.
- Multicollinearity among independent variables can make the coefficient estimates unstable and the model harder to interpret.
- Multicollinearity can be a problem in multiple regression because the input variables influence each other. They are therefore not actually independent, and it is difficult to tell how much each independent variable, on its own, affects the dependent variable or outcome.
Hence we need the Variance Inflation Factor (VIF), a tool that helps measure the degree of multicollinearity.
The formula for VIF is simple and easy to understand; all you need to know is R²:
VIF = 1 / (1 − R²)
Here R² comes from regressing one feature on all the other features.
If you are not familiar with R², please check out my article on it. LINK
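As a quick sanity check, here is a minimal Python sketch that plugs a few R² values into VIF = 1 / (1 − R²). The function name `vif_from_r2` is only an illustrative choice, not something taken from a library.

```python
def vif_from_r2(r2):
    """Convert the R² of one feature regressed on the others into a VIF."""
    # R² = 1 would mean perfect collinearity and a division by zero.
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R² must be in the range [0, 1)")
    return 1.0 / (1.0 - r2)


print(vif_from_r2(0.0))   # 1.0, the other features explain nothing
print(vif_from_r2(0.5))   # 2.0
print(vif_from_r2(0.75))  # 4.0
```

The closer R² gets to 1, the faster the VIF grows: an R² of about 0.9 already gives a VIF of roughly 10.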
Suppose there are features X1, X2, X3, …, Xn.
We will calculate the VIF of each feature individually. For that we need R², which we get by regressing that feature on all the other features. For X1, the regression equation will be:
X1 = α0 + α2·X2 + α3·X3 + … + αn·Xn + error
We can calculate R² for each feature using an equation of this form and plug that R² into the VIF formula.
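The procedure above can be sketched in Python with NumPy: regress each feature on the remaining ones, compute R², and convert it to a VIF. This is a minimal sketch and not a reference implementation; the function name `vif_scores` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def vif_scores(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    scores = []
    for j in range(n_features):
        y = X[:, j]                       # feature treated as the target
        others = np.delete(X, j, axis=1)  # all remaining features
        # Intercept column so the auxiliary regression has a constant term.
        A = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = np.sum((y - A @ coef) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        # R² of (almost) 1 means perfect collinearity: report infinity.
        scores.append(np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2))
    return np.array(scores)


# Synthetic data (an assumption for the demo): x3 is almost x1 + x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)

collinear = vif_scores(np.column_stack([x1, x2, x3]))
independent = vif_scores(np.column_stack([x1, x2]))
print(collinear)    # all three values are large
print(independent)  # both values are close to 1
```

If statsmodels is available, its `variance_inflation_factor` function in `statsmodels.stats.outliers_influence` covers the same ground; the NumPy version here is only meant to make each step visible.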
The VIF value will always be greater than or equal to 1. Here are some rules of thumb for VIF:
- 1 = not correlated.
- Between 1 and 5 = moderately correlated.
- Greater than 5 = highly correlated.
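These rules of thumb are easy to turn into a small helper. The sketch below assumes that a VIF of exactly 5 still counts as "moderately correlated", which the rules above leave open, and the name `describe_vif` is purely illustrative.

```python
import math


def describe_vif(vif):
    """Label a VIF value using the 1 / 1-to-5 / above-5 rules of thumb."""
    # Allow for floating-point noise around exactly 1.
    if math.isclose(vif, 1.0, rel_tol=1e-9):
        return "not correlated"
    if vif < 1.0:
        raise ValueError("a VIF cannot be smaller than 1")
    if vif <= 5.0:  # assumption: the boundary value 5 counts as moderate
        return "moderately correlated"
    return "highly correlated"


for value in (1.0, 3.2, 12.0):
    print(value, "->", describe_vif(value))
# 1.0 -> not correlated
# 3.2 -> moderately correlated
# 12.0 -> highly correlated
```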
Where shouldn’t VIF be used?
- Polynomial terms.
- Dummy variables.
- Nominal variables.
Final Thoughts
Multicollinearity reduces the statistical significance of the independent variables, and VIF is used to detect which variables are affected. A large VIF on an independent variable indicates a highly collinear relationship with the other variables, which should be considered or adjusted for in the structure of the model and the selection of independent variables.