How not to use random forest

Toma Gulea
Published in Turo Engineering
6 min read · Jul 12, 2018


Random forest is a very popular model in the data science community, praised for its ease of use and robustness. Nevertheless, it is very common to see the model used incorrectly.

In this blog post, we will showcase how you can pull completely wrong insights from a random forest regressor using the most popular machine learning library: scikit-learn.

Let’s assume you are asked which features are most important in explaining a variable Y. It’s common to see the following approach:

  • import sklearn
  • train a random forest with default parameters
  • print the feature importances
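The steps above can be sketched in a few lines. The feature names and data below are illustrative placeholders, not the article's actual dataset:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Placeholder data standing in for a real dataset.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'feature_a': rng.random(200),
    'feature_b': rng.random(200),
})
y = rng.random(200)

# Train a random forest with default parameters...
model = RandomForestRegressor()
model.fit(X, y)

# ...and print the (impurity-based) feature importances.
for name, importance in zip(X.columns, model.feature_importances_):
    print(f'{name}: {importance:.3f}')
```

Nothing here is wrong syntactically, which is exactly why this workflow is so tempting, as we are about to see.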

Let’s apply this approach with simulated data and see how it goes. Simulated data will allow us to know exactly the effects of each variable prior to fitting any model.

Building the example:

The notebook is available on Github.

Simulate data:

We create four features: two of them (important_feature_poisson, important_feature_dummy) explain Y by construction, and two of them are orthogonal to Y (random_feature_normal, random_feature_dummy).

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'important_feature_poisson': np.random.poisson(3, 1000),
    # the remaining three features are truncated in the original; distributions assumed
    'important_feature_dummy': np.random.binomial(1, 0.5, 1000),
    'random_feature_normal': np.random.normal(0, 1, 1000),
    'random_feature_dummy': np.random.binomial(1, 0.5, 1000),
})