**Collinearity - What It Means, Why It's Bad, and How It Affects Other Models**

# Questions:

- What is collinearity or multicollinearity? Why is it bad? What does it look like?
- How does it affect our results?
- Does it affect decision trees?

**1.** In statistics, **multicollinearity** (also **collinearity**) is a phenomenon in which one feature variable in a regression model is highly linearly correlated with one or more of the other feature variables.

A **collinearity** is the special case in which two or more variables are exactly linearly related.

When the collinearity is exact, the regression coefficients are not uniquely determined; when it is merely high, they are still defined but unstable, with inflated standard errors. Either way it hurts the interpretability of the model, because each coefficient absorbs influence from the other correlated features. **The ability to interpret models is a key part of being a Data Scientist.**
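A minimal sketch of the non-uniqueness (using made-up numpy data, not the article's dataset): when one column is an exact multiple of another, the design matrix is rank-deficient, and different coefficient vectors produce identical fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 2.0 * x1                           # exact collinearity: x2 = 2 * x1

X = np.column_stack([x1, x2])
print(np.linalg.matrix_rank(X))         # 1, not 2: coefficients aren't unique

# Two different coefficient vectors give identical fitted values:
b_a = np.array([3.0, 0.0])              # all weight on x1
b_b = np.array([1.0, 1.0])              # 1*x1 + 1*(2*x1) = 3*x1
print(np.allclose(X @ b_a, X @ b_b))    # True
```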

That said, if you are purely in the business of prediction, you don't really care about collinearity. But to have a more interpretable model, you should avoid keeping pairs of features that are very highly correlated (roughly R² > 0.8).
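As a rough sketch of that screening step (the feature names and data here are made up, not the article's dataset), you could flag feature pairs whose squared pairwise correlation clears the threshold:

```python
import numpy as np

rng = np.random.default_rng(1)
limit = rng.normal(500, 100, size=200)
rating = limit * 0.7 + rng.normal(0, 5, size=200)   # strongly tied to limit
income = rng.normal(50, 10, size=200)               # independent noise

features = {"Limit": limit, "Rating": rating, "Income": income}
names = list(features)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        # squared Pearson correlation between the two columns
        r2 = np.corrcoef(features[a], features[b])[0, 1] ** 2
        if r2 > 0.8:
            print(f"{a} vs {b}: R^2 = {r2:.2f} -> candidate for removal")
```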

Below is an image of the dataset I am working with, which shows scatter plots of many of the variables. Notice how clearly **Limit** and **Rating** are correlated. This implies multicollinearity and takes away from our ability to interpret the beta coefficients of both.

So now, if we use linear regression to predict each person's balance, we can look at our beta coefficients. Unfortunately, because of the multicollinearity, it becomes harder to understand what is going on:
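Since the real data isn't reproduced here, a synthetic stand-in for **Limit**, **Rating**, and **Balance** shows the effect: the fit itself is good, but how the credit is split between the two correlated coefficients is hard to read.

```python
import numpy as np

rng = np.random.default_rng(2)
rating = rng.normal(350, 50, size=300)
limit = 15 * rating + rng.normal(0, 30, size=300)    # Limit ~ a noisy multiple of Rating
balance = 2.5 * rating + rng.normal(0, 20, size=300)

# Ordinary least squares with an intercept plus both correlated features
X = np.column_stack([np.ones(300), limit, rating])
coef, *_ = np.linalg.lstsq(X, balance, rcond=None)
print(dict(zip(["intercept", "Limit", "Rating"], coef.round(3))))

resid = balance - X @ coef
r_squared = 1 - (resid ** 2).sum() / ((balance - balance.mean()) ** 2).sum()
print(f"R^2 = {r_squared:.3f}")   # the fit is fine; the coefficient split isn't
```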

Both **Limit** and **Rating** have positive coefficients, but it is hard to tell whether the balance is higher because of the rating or because of the limit. I think the driving influence here is rating, because a high rating earns you a higher credit limit. So I would remove **Limit** to get a truer idea of how the rating affects the balance.

Here you can now see that **Rating** has a higher impact than it did in the **Limit + Rating** model. This is more interpretable to those who do not understand the math.
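A hedged sketch of the refit, again on synthetic stand-in data (here Balance is generated from Rating alone with coefficient 2.5): with **Limit** dropped, the single coefficient recovers the relationship cleanly.

```python
import numpy as np

rng = np.random.default_rng(3)
rating = rng.normal(350, 50, size=300)
balance = 2.5 * rating + rng.normal(0, 20, size=300)

# Intercept plus Rating only: one coefficient, one clear interpretation
X = np.column_stack([np.ones(300), rating])
coef, *_ = np.linalg.lstsq(X, balance, rcond=None)
print(f"Rating coefficient: {coef[1]:.2f}")  # close to the 2.5 used to generate the data
```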

**2.** Even so, between the two models, the one with both variables (**Limit** and **Rating**) performed better by R² score. This leads to a discussion of why we care in the first place. We want these models to help us understand the world around us and to guide our data exploration. **Therefore, when applying linear regression, you may want to maintain two models: one for prediction and one for interpretation/inference.**
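To sketch that trade-off (synthetic data again, generated so that Limit carries a little signal of its own): on training data, the two-feature model can only match or beat the one-feature model on R², even though its coefficients are harder to interpret.

```python
import numpy as np

rng = np.random.default_rng(4)
rating = rng.normal(350, 50, size=500)
limit = 15 * rating + rng.normal(0, 300, size=500)
# Balance depends mostly on Rating, but a little on Limit too
balance = 2.0 * rating + 0.03 * limit + rng.normal(0, 20, size=500)

def r2(X, y):
    """Training R^2 of an ordinary least-squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

ones = np.ones(500)
full = r2(np.column_stack([ones, limit, rating]), balance)
reduced = r2(np.column_stack([ones, rating]), balance)
print(f"Limit+Rating R^2: {full:.4f}, Rating-only R^2: {reduced:.4f}")
```

Nested models guarantee the direction of this comparison on training data: adding a feature can never lower the training R².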

The same concept applies to an exact **collinearity**, such as the one created by taking dummy variables for **Ethnicity**. In this case, by keeping all of the dummy variables, you lose the ability to interpret how each level affects the results. Because the collinearity is exact, removing one dummy column **does not affect the predictions.**
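A small sketch of this dummy-variable situation with a made-up three-level categorical (not the real Ethnicity column): keeping every dummy alongside an intercept makes the design matrix rank-deficient, yet dropping one dummy leaves the fitted values unchanged.

```python
import numpy as np

rng = np.random.default_rng(5)
levels = rng.integers(0, 3, size=120)            # 3 made-up category labels
y = np.array([10.0, 12.0, 9.0])[levels] + rng.normal(0, 1, size=120)

ones = np.ones(120)
dummies = np.eye(3)[levels]                      # full one-hot encoding

X_full = np.column_stack([ones, dummies])        # dummies sum to the intercept column
X_drop = np.column_stack([ones, dummies[:, 1:]]) # drop the first level

print(np.linalg.matrix_rank(X_full))             # 3, not 4: rank-deficient

# Fitted values are the projection onto the column space, which is the
# same for both designs, so predictions match exactly:
fit = lambda X: X @ np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(fit(X_full), fit(X_drop)))     # True
```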

**3.** Finally, since these issues affect the interpretability of models, that is, the ability to make **inferences** from the results, we can safely say that multicollinearity and collinearity will not hurt the predictive results of decision trees. When drawing inferences from decision tree models, though, it is important to take into account how each feature may be affected by another in order to make valuable business decisions.
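One way to see why trees shrug off collinearity is a one-split regression "stump," the building block of a decision tree (a minimal hand-rolled sketch, not a full tree implementation): adding an exact copy of a feature offers no new partitions of the data, so the best split's error is unchanged.

```python
import numpy as np

rng = np.random.default_rng(6)
rating = rng.normal(350, 50, size=100)
# Balance steps up sharply around rating 350, plus a little noise
balance = np.where(rating > 350, 900.0, 400.0) + rng.normal(0, 10, 100)

def best_split(X, y):
    """Greedy search over features and thresholds, minimising squared error."""
    best = (np.inf, None, None)                  # (sse, feature index, threshold)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:        # every candidate threshold
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            sse = ((left - left.mean()) ** 2).sum() + \
                  ((right - right.mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t)
    return best

X1 = rating[:, None]                             # Rating alone
X2 = np.column_stack([rating, 3 * rating])       # plus an exactly collinear copy

sse1 = best_split(X1, balance)[0]
sse2 = best_split(X2, balance)[0]
print(np.isclose(sse1, sse2))                    # True: same best partition
```

Every split on the collinear copy induces the same left/right partition as some split on the original feature, so the search finds nothing new, and the resulting predictions are identical.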