Dealing with Multicollinearity in Multiple Linear Regression
One of the biggest issues hindering quality results in regression modeling
If you are familiar with data science, especially regression modeling, then you have probably encountered the concepts of covariance and correlation. This post will go over the issue of multicollinearity in multiple linear regression, show how to create and interpret scatter matrices and correlation matrices, and explain how to identify whether two or more predictors are collinear.
So… Why is Multicollinearity Bad?
When doing a regression analysis, the key purpose is to evaluate the relationship between each predictor and the outcome variable. A regression coefficient represents the average change in the dependent variable for every one-unit change in a predictor, assuming all other predictor variables are held constant. It’s precisely for that reason that multicollinearity can trigger issues. Since the interpretation of regression rests on changing one variable while keeping the others stable, correlation among predictors is a concern: when changes in one predictor tend to come with changes in another, the model cannot isolate their individual effects. As a consequence, small changes in the data or the model specification can produce major swings in the coefficient estimates, and you may not be able to correctly interpret the coefficients or trust the p-values associated with correlated predictors.
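To make this concrete, here is a minimal sketch with NumPy and made-up data (the variable names, sample size, and coefficients are all illustrative, not from any real dataset) showing how two nearly collinear predictors yield unstable individual coefficients even though their combined effect is estimated well:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # x2 is almost a copy of x1
y = 3 * x1 + 2 * x2 + rng.normal(size=n)  # true combined effect is 3 + 2 = 5

# Ordinary least squares via the normal equations
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# The individual estimates for x1 and x2 can swing wildly from sample
# to sample, but their sum stays close to the true combined effect.
print(coef[1] + coef[2])
```

Rerunning this with different random seeds shows the point: the split between the two coefficients is essentially arbitrary, which is exactly why their p-values stop being trustworthy.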
How to Identify Multicollinearity
You can start by looking at scatterplots among predictors to get an initial sense of how they relate. You can use Pandas to create a scatter matrix.
import pandas as pd
# Assuming `df` is your DataFrame of predictor columns
# figsize can be whatever you want based on your needs
pd.plotting.scatter_matrix(df, figsize=(10, 10))
The neat thing about this matrix is that it returns scatterplots for each pair of predictors, with histograms for each individual feature along the diagonal. This is good, but with a lot of features it becomes difficult to review each plot in depth. Since we are looking for correlation, scatterplots that show a linear relationship of some kind should capture your attention.
Now let’s look at a correlation matrix.
A correlation matrix returns the pairwise correlation coefficients rather than displaying scatterplots and histograms. You’ll notice that the values on the diagonal are always equal to one, since they reflect the correlation of a variable with itself.
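For example, a minimal correlation matrix built with the pandas `.corr()` method (the column names and values here are made up purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 4, 6, 8, 10],  # exactly 2 * x1, so perfectly correlated with x1
    'x3': [5, 1, 4, 2, 3],
})
corr = df.corr()
print(corr)
# The diagonal is all 1.0, and the x1/x2 entry is 1.0 as well,
# since x2 is an exact linear function of x1.
```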
*Note* Correlations take values between -1 and 1: a value of 1 indicates a perfectly positive linear relationship, and -1 a perfectly negative one.
But when can you consider a correlation to be high? Usually, a correlation is called high if it has an absolute value of 0.7–0.8 or higher.
*Tip* If you are working with a bigger correlation matrix, a clever workaround is to use stack and a subset to retrieve only the strongly correlated pairs! The code block below creates a concise list of variable pairs and their correlation.
# Assuming `data` is your DataFrame of predictors
df = data.corr().stack().reset_index()
df.columns = ['var_1', 'var_2', 'correlation']
df.drop_duplicates(subset='correlation', inplace=True)  # keeps one row per mirror pair
df[(df.correlation > .70) & (df.correlation < 1)]
Another option (and my personal favorite) is to visualize the correlation matrix as a heatmap with Seaborn. Below is an example from a project of mine. The redder the box, the higher the correlation between the two variables.
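Here is a sketch of how such a heatmap can be produced (the DataFrame and its column names are hypothetical stand-ins; swap in your own data):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical feature DataFrame standing in for real project data
df = pd.DataFrame({
    'sqft': [800, 1200, 1500, 2000, 2400],
    'bedrooms': [2, 2, 3, 4, 4],
    'age': [40, 12, 25, 5, 8],
})
# Pin the color scale to the full [-1, 1] range so shades are comparable
ax = sns.heatmap(df.corr(), cmap='Reds', annot=True, vmin=-1, vmax=1)
plt.show()
```

Setting `annot=True` prints the correlation value inside each cell, which makes the plot readable even before you study the colors.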
Once you have determined which method you want to use to investigate multicollinearity, and have identified your problem variables, the solution is to drop some (not all) of the problem variables. Your own discretion and domain knowledge can play a strong role in deciding which variables to drop (just make sure you don’t drop your target!).
I hope this blog post helped you better understand what multicollinearity is, why it’s a problem, and how to address it. Thank you for reading!