How not to use random forest

Toma Gulea
Published in Turo Engineering
6 min read · Jul 12, 2018


Random forest is a very popular model in the data science community, praised for its ease of use and robustness. Nevertheless, it is very common to see the model used incorrectly.

In this blog post, we will showcase how you can pull completely wrong insights from a random forest regressor using the most popular machine learning library: scikit-learn.

Let’s assume you are asked which features are most important in explaining a variable Y. It’s common to see the following approach:

  • import sklearn
  • train a random forest with default parameters
  • print the feature importances
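The steps above can be sketched in a few lines. The feature names and data below are illustrative placeholders, not the article's actual dataset:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Placeholder data standing in for a real dataset.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'feature_a': rng.random(200),
    'feature_b': rng.random(200),
})
y = rng.random(200)

# Train a random forest with default parameters...
model = RandomForestRegressor()
model.fit(X, y)

# ...and print the (impurity-based) feature importances.
for name, importance in zip(X.columns, model.feature_importances_):
    print(f'{name}: {importance:.3f}')
```

Nothing here is wrong syntactically, which is exactly why this workflow is so tempting, as we are about to see.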

Let’s apply this approach with simulated data and see how it goes. Simulated data will allow us to know exactly the effects of each variable prior to fitting any model.

Building the example:

The notebook is available on Github.

Simulate data:

We create four features: two of them (important_feature_poisson, important_feature_dummy) explain Y by construction, and two of them are orthogonal to Y (random_feature_normal, random_feature_dummy).

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'important_feature_poisson': np.random.poisson(3, 1000),
    # the remaining three features are truncated in the original; distributions assumed
    'important_feature_dummy': np.random.binomial(1, 0.5, 1000),
    'random_feature_normal': np.random.normal(0, 1, 1000),
    'random_feature_dummy': np.random.binomial(1, 0.5, 1000),
})