Correlation vs Collinearity vs Multicollinearity

Ankit songara
Mar 27, 2022


For a supervised machine learning model to produce good, reliable estimates of the target variable, it is important that the predictors are not strongly related to one another. Correlation analysis is the first step in checking this.

Correlation:

Correlation measures how two variables move together; in simple terms, it describes how strongly the variables are linearly related. Correlation between two variables can be positive, negative, or zero. A positive correlation means the variables tend to increase or decrease together; a negative correlation means that as one variable increases, the other tends to decrease; zero correlation means there is no linear relationship between them. The correlation coefficient ranges from -1 to 1.
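As a quick illustration, here is a minimal sketch (the column names and data are invented for this example) that builds variables with positive, negative, and near-zero correlation to a base variable and prints the correlation matrix:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=n)

df = pd.DataFrame({
    "x": x,
    "positively_related": 2 * x + rng.normal(scale=0.5, size=n),   # moves with x
    "negatively_related": -3 * x + rng.normal(scale=0.5, size=n),  # moves against x
    "unrelated": rng.normal(size=n),                               # no linear relation
})

# Pearson correlation coefficients range from -1 to 1.
print(df.corr().round(2))
```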

Collinearity and Multicollinearity:

Collinearity refers to a situation in which two independent variables (predictors) have a strong linear relationship: the two variables are highly correlated, and a change in one tends to be accompanied by a change in the other. Multicollinearity is the more general case in which a predictor is strongly correlated with one or more of the other predictors, or with a linear combination of them (a pairwise correlation coefficient above roughly 0.7 is a common warning sign).

Note: Correlation between predictor and target variable is a good thing for a model. However, correlation among the predictors causes multiple problems.

Issues because of multicollinearity

  1. The estimated effect of a predictor on the response variable becomes less precise and less reliable, because its coefficient's standard error is inflated.
  2. An important predictor may appear statistically insignificant when it has a collinear relationship with other predictors.
  3. Two correlated predictors may carry largely redundant information about the response/target variable, which makes the individual coefficient estimates unstable or misleading.
  4. Multicollinearity can contribute to overfitting, i.e., the model performs well on the training data but poorly on the test data, defeating the main objective of the model.

How to identify multicollinearity

  1. Correlation matrix: You can see how the predictors correlate with each other by creating a correlation matrix, often displayed as a heatmap with a color gradient. The correlation coefficient ranges from -1 to 1; pairs whose absolute value is close to 1 are the ones to worry about.
  2. Variance Inflation Factor (VIF): The VIF measures how much the variance of an estimated regression coefficient increases because the predictor is correlated with the other predictors. If there is no correlation, the VIF is 1; the higher the value, the more strongly that predictor is related to the others. A sketch of both checks follows this list.
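Here is a hedged sketch of both checks, assuming pandas and statsmodels are available; the variable names and data are invented for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
n = 300
age = rng.normal(40, 10, n)
income = 1000 * age + rng.normal(0, 5000, n)   # deliberately collinear with age
hours = rng.normal(40, 5, n)                   # roughly independent

X = pd.DataFrame({"age": age, "income": income, "hours": hours})

# 1) Correlation matrix: look for pairs with |r| close to 1.
print(X.corr().round(2))

# 2) VIF: computed per predictor on a design matrix that includes an intercept.
X_const = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif.round(1))  # VIF near 1 => little collinearity; large values => a problem
```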

Dealing with multicollinearity

  1. Feature engineering: If you can aggregate or combine two strongly correlated features into a single variable, you no longer have to deal with the correlation between them.
  2. Drop variables: Exclude one of the variables that is highly correlated with another. Which one should you drop? That depends on your data, but a good rule of thumb is to drop the one that is less strongly correlated with the target variable.
  3. PCA: Apply principal component analysis (a dimensionality reduction technique) to transform a set of potentially correlated predictors into a set of linearly uncorrelated components; a short sketch follows this list.
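A minimal sketch of option 3, assuming scikit-learn is available; the variable names are illustrative only. PCA turns correlated predictors into components that are uncorrelated by construction:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # strongly correlated with x1
x3 = rng.normal(size=n)

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Standardize first so PCA is not dominated by scale differences.
X_scaled = StandardScaler().fit_transform(X)
components = PCA(n_components=2).fit_transform(X_scaled)

# The resulting components are linearly uncorrelated (off-diagonals near 0).
print(np.corrcoef(components, rowvar=False).round(3))
```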

When the number of features is large, a practical workflow is to first identify and drop predictors with very high absolute pairwise correlations (> 0.8), then compute VIFs to remove further offenders, and finally, if necessary, apply a technique such as PCA to make the remaining features linearly uncorrelated before moving on to training. A rough sketch of the first two steps is shown below.
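This sketch assumes a pandas feature frame `X`; the helper names and the 0.8 threshold are illustrative choices, not prescriptions from the article:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant


def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Greedily drop one column from every pair with |correlation| above the threshold."""
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)


def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF per predictor, computed with an intercept in the design matrix."""
    X_const = add_constant(X)
    return pd.Series(
        [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
        index=X.columns,
    )


# Step 1: remove near-duplicate predictors; Step 2: inspect VIFs on what remains.
# X_reduced = drop_highly_correlated(X, threshold=0.8)
# print(vif_table(X_reduced))
```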
