Unfortunate Linear Model Pitfalls and How To Avoid Them: Multicollinearity

Uluç Şengil
3 min read · Aug 13, 2018


As most people know, linear models (linear regressions, logistic (logit) regressions and maybe even probit and tobit models) are widely used for a first insightful look into your data set. They have the advantage of being easily interpretable and having less variance than more sophisticated models (such as deep-ish neural networks).

However, the ease of interpretation comes at the cost of needing to understand the underlying assumptions that power these models. You might face some unexpected problems if you play fast and loose with the limitations of the model you're using. Here are a couple of problems you might run into, and some methods to alleviate them.

NOTE: This is not to say that more sophisticated models don't suffer from the same issues. When a model is used for insight into the data, such as observing the relationships between the dependent and independent variables, the same kinds of problems will haunt it regardless of its sophistication. The saving grace of more advanced models is that they're usually used for prediction rather than insight, and prediction allows much more leeway for playing fast and loose.

Multicollinearity

I assume this one stands for “multiple column linearity” (never mind the geometric definition). The problem arises when some combination of columns, each multiplied by a constant, sums up to another column. I believe everyone has faced this at least once. To get exact multicollinearity, you might have accidentally duplicated a column and used both copies in a regression, or you might have included a dummy column for every level of a categorical variable while still keeping an intercept (the classic dummy variable trap). In these cases, most statistical software kindly alerts you about the multicollinearity and silently drops one of the variables, fixing the problem for you.
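Here is a minimal sketch of what exact multicollinearity looks like numerically, using made-up dummy data (the categories and values are purely illustrative): an intercept plus a dummy column for every category makes the design matrix rank-deficient.

```python
import numpy as np

n = 6
intercept = np.ones(n)
is_a = np.array([1, 0, 1, 0, 1, 0])   # dummy for category A (illustrative)
is_b = 1 - is_a                        # dummy for category B; note is_a + is_b == intercept

# Design matrix with an intercept AND a dummy for every category
X = np.column_stack([intercept, is_a, is_b])

# Rank is 2 instead of 3: one column is an exact combination of the others
print(np.linalg.matrix_rank(X))
```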

A more problematic situation arises with near-multicollinearity. A typical setup: you want to model demand for your product, and you have collected earnings, spending AND savings data for your target population. The problem is simply that spending + savings = earnings. Normally, your software would fix this automatically by dropping one of the columns. However, this type of data collection is almost never exact, so the software fails to detect the identity and runs the regression seemingly successfully. You then end up with unexpectedly huge coefficients on the columns responsible, scratching your head: “How can these affect our demand so much? Why do spending and saving both act in the same direction?”
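Here is a minimal simulation of that setup (all names, numbers and parameters are made up for illustration): earnings is almost exactly spending + savings, and the individual coefficient estimates become unstable as a result.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Made-up population data: spending + savings is (almost) earnings
spending = rng.normal(50, 10, n)
savings = rng.normal(20, 5, n)
earnings = spending + savings + rng.normal(0, 0.1, n)   # small measurement noise
demand = 2.0 * earnings + rng.normal(0, 5, n)           # "true" model depends only on earnings

X = np.column_stack([np.ones(n), earnings, spending, savings])
print(np.linalg.cond(X))   # very large condition number: the matrix is nearly singular

# Refit on two random halves of the data: the individual coefficients
# can move around a lot, even though the overall fit barely changes.
for idx in np.array_split(rng.permutation(n), 2):
    beta, *_ = np.linalg.lstsq(X[idx], demand[idx], rcond=None)
    print(beta.round(2))
```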

Solution: Ditch the Problematic Columns

I know, removing information from your data analysis seems counterproductive. However, having a multicollinearity problem means you're feeding the linear model redundant data it doesn't expect to handle. Even after you remove the culprits, your model retains the same expressive power, because the information in the dropped column was already carried by the others. You will also be left with lower standard errors and more robust coefficients.

When you come across this problem, the first thing you should do is build a correlation matrix of the variables. I also like to export any such matrix to Excel and use color scales so I can grasp it all in one look.
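A quick sketch of that workflow, using a small hypothetical DataFrame in place of real collected data (the column names follow the demand example above):

```python
import pandas as pd

# Hypothetical data; in practice this would be your collected data set.
df = pd.DataFrame({
    "demand":   [120, 95, 130, 88, 105],
    "earnings": [70, 55, 76, 50, 61],
    "spending": [50, 40, 55, 35, 45],
    "savings":  [20, 15, 21, 15, 16],
})

corr = df.corr()
print(corr.round(2))

# In a notebook, a quick colored view without leaving Python:
# corr.style.background_gradient(cmap="coolwarm")

# Export to Excel (requires an engine such as openpyxl), then apply
# conditional-formatting color scales there to grasp it in one look.
corr.to_excel("correlation_matrix.xlsx")
```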

Example of such a correlation matrix in Excel

The diagonal entries are always exactly one, since every variable is perfectly correlated with itself, so just ignore them. For the remaining values, drop one column from any pair whose correlation is exactly one. For values that are near but not exactly one, check whether your model shows the issues described above, and see whether removing one of those variables alleviates the problem.

The correlation matrix helps a great deal when two columns are very similar, but it falls short when the problem is caused by a combination of several columns (as in our earnings example). In that case, you can either naively try removing columns one at a time or check for identity equations among your columns, as in the sketch below. Once you've removed the problematic columns, the linear model will go back to acting as expected.
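A minimal sketch of that check, reusing the hypothetical df from the correlation sketch above: first test the suspected identity directly, then regress each column on all the others to flag any column that the rest can reproduce.

```python
import numpy as np

# Direct check of the suspected identity: spending + savings ≈ earnings
gap = df["earnings"] - (df["spending"] + df["savings"])
print(gap.abs().max())   # close to zero means a (near-)exact identity

# More general check: regress each column on all the others and look at
# the R-squared; a value very close to 1 flags a column that the rest
# of the columns can reproduce.
cols = ["earnings", "spending", "savings"]
for col in cols:
    y = df[col].to_numpy(dtype=float)
    others = [df[c].to_numpy(dtype=float) for c in cols if c != col]
    X = np.column_stack([np.ones(len(df))] + others)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - resid.var() / y.var()
    print(col, round(r2, 4))
```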

That's about as deep as it goes with multicollinearity. Next time, we will weigh in on a more unexpected problem you might have with your variables: endogeneity (or dependence).
