Ensembles of tree-based models: why correlated features do not trip them — and why NA matters

Laurae: This post is about decision tree ensembles (e.g. Random Forests, Extremely Randomized Trees, Extreme Gradient Boosting) and correlated features. It explains why an ensemble of tree models can still learn when some features are highly, or even perfectly, correlated. It also gives a practical demonstration of why the value you choose to encode NA (missing values) matters, and briefly shows that the way you present the data to an algorithm changes the way the algorithm perceives it. This is part of a post (edited to include parts for better reading), originally at Kaggle.

Part 1 Summary:

  • Correlated features do not trip ensembles of tree-based models
  • Presenting an identical feature in a different way typically trips algorithms
  • NA values should be treated as features, as they also typically trip algorithms

Ferris wrote:

That’s weird. I encoded categorical with numeric IDs, and when I removed either v91 or v107, my local CV score drops 2x logloss std.

It depends on the learning model you are using. For instance, if you are using xgboost, it will NEVER be caught out (I have not seen a single case myself yet) by very highly correlated input features, or even by computed features. Due to its nature, if the number of trees were set to an imaginary infinite value, xgboost would pick two perfectly correlated variables at exactly a 50%|50% rate. When you read the importance of these variables, xgboost cannot tell you that they were picked 50%|50%: it is your job to find out whether you are looking at 50% + 50% of the same information or at 100% carried by a single feature.


Example: feature 1 has a gain of 0.8, feature 2 has a gain of 0.15, and features 3 and 4 have gains of 0.045 and 0.005 respectively. Features 3 and 4 are perfectly correlated but were not picked at a 50%|50% rate because of the random sampling of features.

Variable   Gain
Feat. 1    0.8
Feat. 2    0.15
Feat. 3    0.045
Feat. 4    0.005

It “could” be useful to simplify the model by removing feature 4, which adds only 0.5% of the information gain. However, since we know features 3 and 4 are perfectly correlated (let’s call the merged feature C), you have the following:

Variable   Gain
Feat. 1    0.8
Feat. 2    0.15
Feat. C    0.050

Because the model now has only one choice (C instead of 3 or 4), it “should” lower your local CV score rather than increase it (assuming the metric is to be minimized). However, this holds because features 3 and 4 are related by a perfect linear regression, not by a numerical conversion between two categorical variables.
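Merging features 3 and 4 into C is plain arithmetic on the gain table. Here is a minimal sketch in base R, where the `gain` values are typed in by hand from the example above (they are not xgboost output) and the `group` mapping is a hypothetical way of recording which features you already know to be perfectly correlated:

```r
# Gains copied by hand from the example table (not produced by a model here)
gain <- c("Feat. 1" = 0.8, "Feat. 2" = 0.15, "Feat. 3" = 0.045, "Feat. 4" = 0.005)

# Hypothetical mapping: features carrying the same information share a group name
group <- c("Feat. 1" = "Feat. 1", "Feat. 2" = "Feat. 2",
           "Feat. 3" = "Feat. C", "Feat. 4" = "Feat. C")

# Sum the gain inside each group of perfectly correlated features
merged <- vapply(split(gain, group[names(gain)]), sum, numeric(1))
print(round(merged, 3))
```

The importance table cannot do this grouping for you: you have to supply the mapping yourself, for instance from a correlation matrix of the inputs.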


Example of xgboost on BNP Paribas Cardif data, without any feature engineering, without removing any variable, and setting NAs to -1:

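library(xgboost)
# train and test are assumed to be the competition data frames, already loaded
train_temp <- train[, -1] # drop the first column (the row ID in the competition files)
train_temp[is.na(train_temp)] <- -1 # encode every NA as -1
test_temp <- test[, -1]
test_temp[is.na(test_temp)] <- -1
# drop the target (now the first column) from the features and pass it as the label
train_temp <- xgb.DMatrix(data = data.matrix(train_temp[, -1]), label = data.matrix(train$target))
test_temp <- xgb.DMatrix(data = data.matrix(test_temp))
set.seed(11111) # ensures reproducibility
# early.stop.round is the argument name in the xgboost version used here; newer releases call it early_stopping_rounds
modelization <- xgb.cv(data = train_temp, nthread = 4, max.depth = 5, eta = 0.5, nrounds = 30, objective = "binary:logistic", eval_metric = "logloss", nfold = 10, verbose = TRUE, early.stop.round = 10, maximize = FALSE, watchlist = train_temp)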
Output:
[0] train-logloss:0.552161+0.000833 test-logloss:0.553072+0.001434
[1] train-logloss:0.505766+0.000963 test-logloss:0.507329+0.002392
[2] train-logloss:0.486590+0.001291 test-logloss:0.488928+0.002621
[3] train-logloss:0.478175+0.001016 test-logloss:0.481401+0.003447
[4] train-logloss:0.473670+0.001187 test-logloss:0.477915+0.003624
[5] train-logloss:0.470747+0.001372 test-logloss:0.476212+0.003340
[6] train-logloss:0.468461+0.001092 test-logloss:0.475084+0.003657
[7] train-logloss:0.466052+0.001594 test-logloss:0.473863+0.003589
[8] train-logloss:0.464015+0.001681 test-logloss:0.473059+0.003768
[9] train-logloss:0.462295+0.001403 test-logloss:0.472334+0.004069
[10] train-logloss:0.460983+0.001315 test-logloss:0.472382+0.004054
[11] train-logloss:0.459649+0.001690 test-logloss:0.471929+0.003639
[12] train-logloss:0.458469+0.002070 test-logloss:0.471840+0.003659
[13] train-logloss:0.457102+0.001630 test-logloss:0.471549+0.003627
[14] train-logloss:0.455672+0.001716 test-logloss:0.471236+0.003961
[15] train-logloss:0.454985+0.001908 test-logloss:0.471251+0.004156
[16] train-logloss:0.453844+0.001930 test-logloss:0.471212+0.004162
[17] train-logloss:0.452801+0.001523 test-logloss:0.471122+0.004354
[18] train-logloss:0.451768+0.001602 test-logloss:0.471291+0.004543
[19] train-logloss:0.450714+0.001922 test-logloss:0.471487+0.004362
[20] train-logloss:0.449989+0.002093 test-logloss:0.471538+0.004406
[21] train-logloss:0.448820+0.001753 test-logloss:0.471507+0.004500
[22] train-logloss:0.447932+0.001405 test-logloss:0.471652+0.004470
[23] train-logloss:0.447014+0.001314 test-logloss:0.471867+0.004634
[24] train-logloss:0.446108+0.001667 test-logloss:0.471940+0.004895
[25] train-logloss:0.445270+0.001457 test-logloss:0.471994+0.004931
[26] train-logloss:0.444400+0.001449 test-logloss:0.472205+0.004977
[27] train-logloss:0.443418+0.001545 test-logloss:0.472617+0.004976
Stopping. Best iteration: 18

Adding one perfectly correlated variable to the previous set (duplicating v50):

train_temp <- train[, -1]
train_temp[is.na(train_temp)] <- -1
train_temp$v502 <- train_temp$v50
test_temp <- test[, -1]
test_temp[is.na(test_temp)] <- -1
test_temp$v502 <- test_temp$v50
train_temp <- xgb.DMatrix(data = data.matrix(train_temp[, -1]), label = data.matrix(train$target))
test_temp <- xgb.DMatrix(data = data.matrix(test_temp))
set.seed(11111) # ensures reproducibility
modelization <- xgb.cv(data = train_temp, nthread = 4, max.depth = 5, eta = 0.5, nrounds = 30, objective = "binary:logistic", eval_metric = "logloss", nfold = 4, verbose = TRUE, early.stop.round = 10, maximize = FALSE, watchlist = train_temp)
Output:
[0] train-logloss:0.552161+0.000834 test-logloss:0.553072+0.001434
[1] train-logloss:0.505766+0.000963 test-logloss:0.507329+0.002392
[2] train-logloss:0.486590+0.001291 test-logloss:0.488928+0.002621
[3] train-logloss:0.478175+0.001016 test-logloss:0.481401+0.003447
[4] train-logloss:0.473670+0.001187 test-logloss:0.477915+0.003624
[5] train-logloss:0.470747+0.001372 test-logloss:0.476212+0.003340
[6] train-logloss:0.468461+0.001092 test-logloss:0.475084+0.003657
[7] train-logloss:0.466052+0.001594 test-logloss:0.473863+0.003589
[8] train-logloss:0.464015+0.001681 test-logloss:0.473059+0.003768
[9] train-logloss:0.462295+0.001403 test-logloss:0.472334+0.004069
[10] train-logloss:0.460983+0.001315 test-logloss:0.472382+0.004054
[11] train-logloss:0.459649+0.001690 test-logloss:0.471929+0.003639
[12] train-logloss:0.458469+0.002070 test-logloss:0.471840+0.003659
[13] train-logloss:0.457102+0.001630 test-logloss:0.471549+0.003627
[14] train-logloss:0.455672+0.001716 test-logloss:0.471236+0.003961
[15] train-logloss:0.454985+0.001908 test-logloss:0.471251+0.004156
[16] train-logloss:0.453844+0.001930 test-logloss:0.471212+0.004162
[17] train-logloss:0.452801+0.001523 test-logloss:0.471122+0.004354
[18] train-logloss:0.451768+0.001602 test-logloss:0.471291+0.004543
[19] train-logloss:0.450714+0.001922 test-logloss:0.471487+0.004362
[20] train-logloss:0.449989+0.002093 test-logloss:0.471538+0.004406
[21] train-logloss:0.448820+0.001753 test-logloss:0.471507+0.004500
[22] train-logloss:0.447932+0.001405 test-logloss:0.471652+0.004470
[23] train-logloss:0.447014+0.001314 test-logloss:0.471867+0.004634
[24] train-logloss:0.446108+0.001667 test-logloss:0.471940+0.004895
[25] train-logloss:0.445270+0.001457 test-logloss:0.471994+0.004931
[26] train-logloss:0.444400+0.001449 test-logloss:0.472205+0.004977
[27] train-logloss:0.443418+0.001545 test-logloss:0.472617+0.004976
Stopping. Best iteration: 18

Comparing both results:

[17] train-logloss:0.452801+0.001523 test-logloss:0.471122+0.004354 #without v502
[17] train-logloss:0.452801+0.001523 test-logloss:0.471122+0.004354 #with v502 = v50

You get exactly the same CV.
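One way to see why the duplicate changes nothing: a tree only searches each feature for its best threshold, and a copy of a feature offers exactly the same thresholds with exactly the same gains. The sketch below is base R on made-up toy data. `best_gain` is a hypothetical helper, and it uses variance reduction as a simple stand-in for the gain xgboost actually computes:

```r
# Hypothetical helper: best variance reduction reachable with one split "x < cut"
best_gain <- function(x, y) {
  sse <- function(v) sum((v - mean(v))^2)
  cuts <- sort(unique(x))[-1]   # every distinct value except the smallest
  max(vapply(cuts, function(cut) {
    sse(y) - sse(y[x < cut]) - sse(y[x >= cut])
  }, numeric(1)))
}

set.seed(1)
X <- data.frame(a = runif(200), b = runif(200))   # toy features
y <- as.numeric(X$a + 0.3 * X$b + rnorm(200, sd = 0.1) > 0.6)

X_dup <- cbind(X, a2 = X$a)   # add a perfect copy of feature a

gains     <- vapply(X, best_gain, numeric(1), y = y)
gains_dup <- vapply(X_dup, best_gain, numeric(1), y = y)

print(gains)
print(gains_dup)
```

The copy `a2` gets the same best gain as `a`, and the best gain over all features is unchanged, so whichever of the two the tree picks, it builds the same split.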

However, if you change how you calculate the “v502” feature just created (negative slope, i.e. v502 = 20 - v50) and try three different NA encodings (plus one variant that overwrites v50 itself), you get the following:

#encoding NA = -1 (impact all other variables), v502 = 20 - v50
test_temp$v502 <- 20 - test_temp$v50
test_temp[is.na(test_temp)] <- -1
[19] train-logloss:0.450267+0.001919 test-logloss:0.471810+0.003501
Stopping. Best iteration: 20
#encoding NA = 21 (impact all other variables), v502 = 20 - v50
test_temp$v502 <- 20 - test_temp$v50
test_temp[is.na(test_temp)] <- 21
[15] train-logloss:0.454807+0.001185 test-logloss:0.471399+0.004293
Stopping. Best iteration: 16
#encoding NA = -999 (impact all other variables), v502 = 20 - v50
test_temp$v502 <- 20 - test_temp$v50
test_temp[is.na(test_temp)] <- -999
[19] train-logloss:0.450267+0.001919 test-logloss:0.471810+0.003501
Stopping. Best iteration: 20
#encoding NA = -999 (impact all other variables), v50 = 20 - v50 (overwriting)
test_temp$v50 <- 20 - test_temp$v50
test_temp[is.na(test_temp)] <- -999
[19] train-logloss:0.450068+0.001653 test-logloss:0.471861+0.003456
Stopping. Best iteration: 20

Comparison:

[17] train-logloss:0.452801+0.001523 test-logloss:0.471122+0.004354 #baseline
[19] train-logloss:0.450267+0.001919 test-logloss:0.471810+0.003501 #v502, NA = -1
[15] train-logloss:0.454807+0.001185 test-logloss:0.471399+0.004293 #v502, NA = 21
[19] train-logloss:0.450267+0.001919 test-logloss:0.471810+0.003501 #v502, NA = -999
[19] train-logloss:0.450068+0.001653 test-logloss:0.471861+0.003456 #v502 overwrite, NA = -999

Conclusion: how you encode your NAs can improve or worsen your CV scores. In this case, no firm conclusion can be drawn about the v502 + NA = 21 version: 0.471399+0.004293 is not clearly better than 0.471810+0.003501 (look at the SD before drawing a conclusion). Also, choose wisely how you present your variables: changing the slope of a single feature can change the way a model looks at the data.
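A toy illustration of why the NA encoding and the slope interact. A threshold split only cares about the order of the values, so what matters is where the encoded NA rows land relative to the real ones. The sketch below is base R on seven made-up values, not the competition data. `encode_na` and `partitions` are hypothetical helpers, and it looks at a single feature only, whereas the runs above also changed the NA encoding of every other variable:

```r
# Hypothetical helper: replace missing values by a chosen constant
encode_na <- function(x, na_value) {
  x[is.na(x)] <- na_value
  x
}

# Hypothetical helper: every distinct two-way partition of the rows that a
# threshold split "x < cut" can produce. A partition and its mirror image are
# written the same way (row 1 is always on the "0" side).
partitions <- function(x) {
  cuts <- sort(unique(x))[-1]
  vapply(cuts, function(cut) {
    left <- x < cut
    if (left[1]) left <- !left
    paste(as.integer(left), collapse = "")
  }, character(1))
}

v50 <- c(2, 5, NA, 11, 18, NA, 7)   # toy values between 0 and 20

base     <- partitions(encode_na(v50, -1))
far_away <- partitions(encode_na(v50, -999))
flip_m1  <- partitions(encode_na(20 - v50, -1))
flip_21  <- partitions(encode_na(20 - v50, 21))

# Do -1 and -999 offer the same partitions?
print(setequal(base, far_away))
# How many partitions does the flipped copy add on top of v50 (NA = -1)?
print(length(setdiff(flip_21, base)))   # flipped copy, NA = 21
print(length(setdiff(flip_m1, base)))   # flipped copy, NA = -1
```

On this toy vector, -1 and -999 are interchangeable because both sit below every real value. The flipped copy with NA = 21 adds no partition that v50 with NA = -1 did not already offer. The flipped copy with NA = -1 does add new ones, because the missing rows now sit next to the large v50 values instead of the small ones.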

If you use values imputed with Amelia, you should be able to improve your CV by about 0.005 (not bad in itself) when comparing two slow learners. However, it multiplies the number of models you need to train (one for each imputed data set), and you must then average their predictions.
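The "one model per imputed data set, then average" workflow can be sketched as follows. This is base R only and swaps in two stand-ins that are assumptions of the sketch, not the method described above: the imputations are random draws from the observed values (which is not what Amelia does), and the model is a logistic regression fitted with `glm` instead of xgboost. All data are simulated.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(x1 + x2))   # simulated binary target
x2[sample(n, 40)] <- NA               # knock out some values of x2

# Stand-in for m imputed data sets: fill the NAs with random observed values
m <- 5
imputed_sets <- lapply(seq_len(m), function(i) {
  filled <- x2
  miss <- is.na(filled)
  filled[miss] <- sample(x2[!miss], sum(miss), replace = TRUE)
  data.frame(y = y, x1 = x1, x2 = filled)
})

# One model per imputed data set (glm used here as a stand-in learner)
preds <- lapply(imputed_sets, function(d) {
  fit <- glm(y ~ x1 + x2, data = d, family = binomial())
  unname(predict(fit, newdata = d, type = "response"))
})

# Average the m prediction vectors row by row
avg_pred <- rowMeans(do.call(cbind, preds))
print(head(avg_pred))
```

The cost mentioned above is visible here: training time grows linearly with `m`, since every imputed data set needs its own fit.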


N.B.: If you use xgboost with fast learning parameters and a small depth, you can literally drop ~120 features from the BNP Paribas Cardif data set while improving your local CV (but not by much). However, the major issue is that you must calibrate the model predictions, as you end up with an insane number of false alarms. Calibrating them should, in turn, improve your CV considerably. For the predicted values that land around 0.50 (a disputable choice, since the mean of the target is 76.12%), just harvest them manually in your favorite programming language and use something else to calibrate them appropriately! There seems to be some feature engineering involved in this. A CV of 0.469 ± 0.005 also seems to be an acceptable target to check for improvements.


Part 2 Summary:

  • With tree-based models, you can safely ignore correlation issues
  • With non-tree-based models, you must take care of correlation issues, especially when uncorrelated inputs are an explicit assumption of the algorithm

Aarshay Jain wrote:

Hi Guys,
Thanks for the great discussion. I just have a question, while we consider linear transformations we should be doing them in terms of linear models right?
So if a variable is a linear combination of a few others, this information might be helpful but a tree-based model will never be able to figure this out. So if we drop it, we might end up performing poorer.
Just sharing my thought as I am using a tree-based model.

Tree-based models are innately robust to correlated features. When you drop a variable that is correlated with others, you leave room for the trees to use one more variable. Because you are opening room for one more variable, it is possible to end up performing worse. However, you potentially harvest another variable, and the importance of the correlated feature you removed is spread among all the other variables (and more specifically among the features that were correlated with the one you removed).

GBM-based models innately assume uncorrelated inputs, which can therefore cause major issues.

For xgboost users: as you are using a combination of both (tree-based model, GBM-based model), adding or removing correlated variables should not hurt your scores; it should only affect the computing time needed. You probably know this already, as one-hot encoding generates a massive chain of correlated variables (but it does not hurt xgboost scores).
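To see the chain of correlated variables that one-hot encoding creates, here is a small base R sketch on a made-up factor (the `color` column is a hypothetical example, not a competition variable):

```r
# Toy categorical feature
df <- data.frame(color = factor(c("red", "green", "blue", "green", "red", "blue", "red")))

# One indicator column per level ("- 1" removes the intercept so no level is dropped)
onehot <- model.matrix(~ color - 1, data = df)

print(rowSums(onehot))         # each row has exactly one indicator set
print(round(cor(onehot), 2))   # pairwise correlations between the indicators
```

Because exactly one indicator is set per row, each column is fully determined by the others, and every pair of indicators is negatively correlated.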
