Ensembles of tree-based models: why correlated features do not trip them — and why NA matters

Laurae
Aug 22, 2016 · 7 min read

Laurae: This post is about decision tree ensembles (e.g. Random Forests, Extremely Randomized Trees, Extreme Gradient Boosting…) and correlated features. It explains why an ensemble of tree models is able to learn when some features are highly or even perfectly correlated. It also gives a practical demonstration of why choosing the NA (missing value) encoding wisely matters, and shows how the way you present the data to an algorithm changes the way the algorithm perceives it. This is part of a post (edited for better reading) originally published on Kaggle.

Part 1 Summary:

  • Correlated features do not trip ensembles of tree-based models
  • Showing an identical feature in a different form typically does trip algorithms
  • NA values are to be treated as features, since they also typically trip algorithms

Ferris wrote:

That’s weird. I encoded categoricals with numeric IDs, and when I remove either v91 or v107, my local CV score drops by 2x the logloss standard deviation.

It depends on the learning model you are using. For instance, if you are using xgboost, it will practically never (I have not seen a single case myself) be caught out by very highly correlated input features (or even computed features). Due to its nature, if the number of trees were set to an imaginary infinite value, xgboost would pick each of two perfectly correlated variables exactly 50%|50% of the time. When parsing the importance of these variables, xgboost cannot tell you that it is 50%|50%; it is your job to find out whether it is 50% + 50% or just 100%.

Example: feature 1 has a gain of 0.8, feature 2 has a gain of 0.15, and features 3 and 4 have gains of 0.045 and 0.005 respectively. Features 3 and 4 are perfectly correlated but were not picked at a 50%|50% rate because of random sampling of features (column subsampling).
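To see this in practice, here is a minimal sketch (not from the original post) using xgboost's Python API: duplicate one informative column and compare normalized total gain with and without the duplicate. The data, feature names and parameters are made up for illustration; the point is that the summed gain of the two copies roughly matches the gain the single column had alone, while the split between them is not exactly 50%|50% because of column subsampling.

```python
# Minimal sketch (not the original post's code): duplicate one informative
# column and compare normalized total gain with and without the duplicate.
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(42)
n = 5000
X = pd.DataFrame({
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
    "f3": rng.normal(size=n),
})
y = (X["f1"] + 0.5 * X["f3"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1,
          "subsample": 0.8, "colsample_bytree": 0.8, "seed": 1}

def normalized_gain(frame):
    booster = xgb.train(params, xgb.DMatrix(frame, label=y), num_boost_round=200)
    gain = booster.get_score(importance_type="total_gain")
    total = sum(gain.values())
    return {k: round(v / total, 3) for k, v in gain.items()}

print(normalized_gain(X))                 # share of gain for f3 alone
X_dup = X.assign(f3_copy=X["f3"])         # perfectly correlated duplicate of f3
print(normalized_gain(X_dup))             # f3 + f3_copy roughly equals f3 alone
```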

It “could” be useful to simplify the model by removing feature 4, which adds only 0.5% information gain. However, since features 3 and 4 are perfectly correlated (let us call the underlying variable C), you have the following:

Because the model then has only one choice (C instead of choosing between 3 and 4), it “should” lower your local CV score rather than increase it (assuming the target metric is one you minimize). However, this holds because 3 and 4 are related by a perfect linear regression, not by a numerical conversion between two categorical variables.

Example of xgboost on BNP Paribas Cardif data, without any feature engineering, without removing any variable, and setting NAs to -1:

Adding one perfectly correlated variable to the previous set (duplicating v50):

Comparing both results:

You get exactly the same CV.
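The duplication check can be reproduced along these lines. This is an illustrative sketch, not the post's original code; the file path, parameter values and fold count are assumptions.

```python
# Illustrative sketch of the v50 duplication check; path, parameters and
# fold count are assumptions, not the post's actual setup.
import pandas as pd
import xgboost as xgb

train = pd.read_csv("train.csv")   # BNP Paribas Cardif training set (path assumed)
num_cols = [c for c in train.select_dtypes(include="number").columns
            if c not in ("ID", "target")]

params = {"objective": "binary:logistic", "eval_metric": "logloss",
          "max_depth": 6, "eta": 0.05, "seed": 1}

def cv_logloss(frame):
    dtrain = xgb.DMatrix(frame, label=train["target"])
    res = xgb.cv(params, dtrain, num_boost_round=500, nfold=5,
                 early_stopping_rounds=50, seed=1)
    return res["test-logloss-mean"].iloc[-1], res["test-logloss-std"].iloc[-1]

base = train[num_cols].fillna(-1)          # NAs set to -1, as in the post
print(cv_logloss(base))

dup = base.assign(v502=base["v50"])        # perfectly correlated duplicate of v50
print(cv_logloss(dup))                     # CV stays essentially the same
```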

However, if you change how you calculate the newly created “v502” (giving it a negative slope, i.e. v502 = 20 - v50), using three different methods, you get the following:

Comparison:

Conclusion: choosing how you set your NAs can improve or worsen your CV scores. In this case, there is no clear-cut conclusion about the v502 + NA=21 version: 0.471399 ± 0.004293 is not better than 0.471810 ± 0.003501 (look at the SD before drawing a conclusion). Also, choose wisely how the variables are presented: changing the slope of a single feature can change the way a model looks at the data.
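The exact three methods behind the screenshots are not recoverable here, but one plausible reading (my assumption) is that they differ only in where the missing-value code ends up once the slope is flipped:

```python
# One plausible reading (an assumption) of the three "negative slope" methods:
# the only difference is where the missing-value code lands after 20 - v50.
import numpy as np
import pandas as pd

v50 = pd.Series([0.5, 3.2, np.nan, 7.1])

a = 20 - v50.fillna(-1)     # Method A: fill NA with -1, then flip -> NA becomes 21
b = (20 - v50).fillna(-1)   # Method B: flip first, then fill NA -> NA stays -1
c = 20 - v50                # Method C: flip and leave NA missing for xgboost to route

print(pd.DataFrame({"A (NA=21)": a, "B (NA=-1)": b, "C (NA=NaN)": c}))
```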

If you are using Amelia-imputed values, they should be able to kick up your CV by about 0.005 (not bad in itself) when comparing two slow learners. However, it multiplies the number of models you need to train (one for each imputed data set), and you then have to average their predictions (one way to organize this is sketched below).
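Amelia itself is an R package; as a rough Python analog (my assumption, not the post's setup), several imputed copies can be produced with scikit-learn's IterativeImputer, with one model fitted per copy before averaging the predictions:

```python
# Rough Python analog (an assumption; Amelia is an R package): build several
# imputed copies, fit one model per copy, then average the predictions.
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 5)), columns=[f"f{i}" for i in range(5)])
y = (X["f0"] + X["f1"] > 0).astype(int)
X[X > 1.5] = np.nan                          # punch holes so there is something to impute

params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 4}
preds = []
for seed in range(5):                        # 5 imputed data sets -> 5 models
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    X_imp = imputer.fit_transform(X)
    booster = xgb.train(params, xgb.DMatrix(X_imp, label=y), num_boost_round=100)
    preds.append(booster.predict(xgb.DMatrix(X_imp)))

final_pred = np.mean(preds, axis=0)          # average over the imputed data sets
```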

N.B.: if you use xgboost with fast learning parameters and a small depth window, you can literally drop ~120 features on the BNP Paribas Cardif data set while improving your local CV (but not by much). The major issue is that you must then calibrate the model predictions, as you end up with an insane amount of false alarms; done properly, this should in return improve your CV widely. For the predicted values that land around 0.50 (a disputable choice, given that the mean of the target is 76.12%), just harvest them manually in your favorite programming language and use something else to calibrate them appropriately (one option is sketched below). There seems to be room for feature engineering in this. A CV of 0.469 ± 0.005 also seems an acceptable target to check for improvements.
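The post does not name its calibrator; isotonic regression fitted on out-of-fold predictions is one common choice, shown here as an assumption with synthetic stand-ins for the real predictions and labels, applied only to the uncertain band around 0.50:

```python
# Hedged sketch: the original calibrator is not specified; isotonic regression
# is used here as an assumed example, on synthetic stand-in predictions.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)
oof_pred = np.clip(0.3 * y_true + rng.uniform(0.2, 0.7, size=2000), 0, 1)

iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(oof_pred, y_true)                     # learn the mapping on out-of-fold preds

test_pred = rng.uniform(0, 1, size=1000)
band = (test_pred > 0.4) & (test_pred < 0.6)  # the uncertain zone around 0.50
test_pred[band] = iso.predict(test_pred[band])
```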

Part 2 Summary:

  • With tree-based models, you can safely ignore correlation issues
  • With non-tree-based models, you must take care of correlation issues, especially when uncorrelated inputs are an explicit hypothesis of the algorithm

Aarshay Jain wrote:

Hi Guys,

Thanks for the great discussion. I just have a question: when we consider linear transformations, we should be thinking of them in terms of linear models, right?

So if a variable is a linear combination of a few others, that information might be helpful, but a tree-based model will never be able to figure it out. So if we drop it, we might end up performing more poorly.

Just sharing my thoughts, as I am using a tree-based model.

Tree-based models are innately robust to correlated features. When you drop a variable that is correlated with others, it leaves room for the tree to use one more variable in its trees. Because you are opening room for one more variable, it is possible to end up performing more poorly. However, you potentially harvest another variable, and the importance of the correlated feature you removed is spread among all other variables (and more specifically among the features that were correlated with the one you removed).

GBM-based models, however, innately assume uncorrelated inputs, which can cause major issues.

For xgboost users: since you are using a combination of both (a tree-based model trained by gradient boosting), adding or removing correlated variables should not hit your scores, only the computing time needed. You probably know this yourself: one-hot encoding generates a massive list of chained correlated variables, yet it does not hit xgboost scores (a quick sketch follows below).
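A quick sketch of that one-hot point, with synthetic data and illustrative parameters (not from the original post): the dummy columns are mutually correlated by construction, yet cross-validated log loss with the raw integer codes and with the one-hot copies comes out very close.

```python
# Synthetic sketch (illustrative data and parameters): one-hot columns are
# correlated by construction, yet CV log loss stays close to the integer-coded
# version of the same categorical.
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(7)
n = 5000
cat = rng.integers(0, 6, size=n)                  # a 6-level categorical
noise = rng.normal(size=n)
y = ((cat >= 3).astype(float) + 0.3 * noise > 0.5).astype(int)

params = {"objective": "binary:logistic", "eval_metric": "logloss",
          "max_depth": 4, "eta": 0.1, "seed": 1}

def cv_logloss(frame):
    res = xgb.cv(params, xgb.DMatrix(frame, label=y), num_boost_round=200,
                 nfold=5, seed=1)
    return res["test-logloss-mean"].iloc[-1]

as_codes = pd.DataFrame({"cat": cat, "noise": noise})
as_onehot = pd.concat(
    [pd.get_dummies(pd.Series(cat, name="cat"), prefix="cat").astype(int),
     pd.Series(noise, name="noise")], axis=1)

print(cv_logloss(as_codes), cv_logloss(as_onehot))   # very similar scores
```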
