Random Forest Interpretation

Part 1: Standard Deviation and Feature Importance

Rrohan.Arrora
Analytics Vidhya
4 min read · Aug 4, 2019


Today, we will discuss interpretation techniques, some well known, some new. Interpretation techniques help us understand our dataset better and therefore train our model better. They come in handy when you are unsure about your results, and especially when you want to climb the Kaggle leaderboard. In this post, I will cover two interpretation techniques. Let’s get started. I am using the fastai library. I would also recommend that readers go through my previous post to understand this one better.

Are you confident about your predictions?

Generally, if we have ten estimators, then for every row of our validation dataset we predict the outcome with each estimator, take the mean of all the predictions, and compare it with the actual value. But what if we want to know how confident we are about each prediction?
Here comes the very interesting idea of the standard deviation. Rather than just taking the mean of the predictions, we should also look at their standard deviation. If the standard deviation is large, the trees disagree with one another, so each estimator is predicting something quite different. In that case we need to reconsider our hyperparameters before we pass the model to production; the mean of the predictions alone will not reveal this. If the standard deviation is small, the estimators agree and we can trust them more. So the standard deviation of the predictions across the trees gives us at least a relative sense of how confident we are in a given prediction. This is not something that exists in scikit-learn, so we have to build it ourselves.

It is effortless to calculate the standard deviation of predictions.
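Here is a minimal sketch of the idea, assuming a fitted scikit-learn RandomForestRegressor called `m` and a validation DataFrame `X_valid` (both names are just illustrative):

```python
import numpy as np

# One row of predictions per tree: shape (n_trees, n_rows)
preds = np.stack([t.predict(X_valid) for t in m.estimators_])

mean_pred = preds.mean(axis=0)  # the usual forest prediction
std_pred = preds.std(axis=0)    # per-row disagreement between the trees

# Rows with a high std relative to the mean are the ones we are least sure about
relative_uncertainty = std_pred / mean_pred
```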

The standard deviation is especially handy when we have little data for a particular category. In that case the deviation can be very large, which tells us to be less certain about our decisions for those rows.

Great! Now you can be much more confident about your predictions.

Which features are critical, and which are not?

Now we are going to discuss an essential part of training your model: FEATURE IMPORTANCE. In real life, too, we focus only on the things that genuinely matter or are more influential than others. The same goes for our model. Wouldn’t it be better if we could identify the more significant features in our dataset and focus on those parameters, rather than on the entire universe of parameters? YES, it is better, and it is far less of a burden on our model. So, let’s explore this term in detail.

The fastai library provides a convenient function to calculate feature importance, rf_feat_importance. It returns a pandas DataFrame with the columns and their importance, sorted in descending order.
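For instance, a minimal sketch of the typical call, assuming a trained model `m` and a training DataFrame `df_trn` (names illustrative; the import follows the older fastai 0.7 `structured` module):

```python
from fastai.structured import rf_feat_importance  # fastai 0.7-style API

fi = rf_feat_importance(m, df_trn)  # DataFrame with 'cols' and 'imp', sorted descending
fi[:10]                             # the ten most important features

# Without fastai, the same numbers come straight from scikit-learn:
# fi = pd.DataFrame({'cols': df_trn.columns, 'imp': m.feature_importances_}
#                   ).sort_values('imp', ascending=False)
```

Plotting `fi` as a horizontal bar chart makes the drop-off in importance easy to see.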

Now that you have the important features, you may retrain your model using only the ones that matter. Your predictions are usually not going to get worse after dropping unimportant features; the score will either stay about the same or improve. If it does get worse, in an unanticipated scenario, that means the dropped features were not redundant after all and carried independent information. In that case, you have to keep all the features. Bad luck!
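As a hedged sketch, here is what keeping only the columns above some importance threshold and retraining might look like (the 0.005 cutoff, `df_trn`, `y_trn`, and the `fi` DataFrame from above are all illustrative assumptions):

```python
from sklearn.ensemble import RandomForestRegressor

to_keep = fi[fi.imp > 0.005].cols          # arbitrary cutoff; tune it for your data
df_keep = df_trn[to_keep].copy()

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(df_keep, y_trn)                      # retrain, then compare the validation score
```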
Broadly speaking, if your validation score gets worse after feature engineering, there may be two reasons for it.

  • Overfitting: If you are overfitting, that means you are working with a large dataset but a small set_rf_samples, and you have done feature engineering based only on that small sample. You have not taken into account variations that would be useful for training. You have to be a careful practitioner to address this kind of issue.
  • Unexpected conditions: The second reason your validation score can get worse, if you are not overfitting, is that you are doing something that works on the training set but not on the validation set. This can only happen when your validation set is not a random sample (for example, a time-based split).

How is the importance of a feature measured?

  1. We train our model as usual.
  2. Now, if we want to measure how important a given feature is, there are two ways.
    a. Remove that independent variable from the dataset and retrain the model. Then compare the outcome, which may be RMSE, R² or any other metric, depending on the rules of the Kaggle competition.
    b. Shuffle the values of that variable’s column and re-score the already-trained model; no retraining is needed.
  3. If you remove the column and retrain in order to compare outcomes, this approach has certain disadvantages.
    a. Firstly, it is very time-consuming, since you retrain once per feature.
    b. Secondly, retraining without the column lets other, correlated variables pick up its role, so you underestimate the interactions among the variables.
  4. Shuffling the values is therefore the smarter approach. Shuffling breaks the relationship between that column and the target while keeping the column’s distribution, so with a single trained model we can capture both the interactions and the importance of the feature (see the sketch after this list).
  5. In this way, you find out which columns are vital.
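A minimal sketch of this shuffling trick, assuming a fitted regressor `m`, a validation set `X_valid` / `y_valid`, and R² as the metric (all names are illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score

def shuffle_importance(m, X_valid, y_valid):
    """Per-column drop in R² when that column is shuffled, using the already-trained model."""
    baseline = r2_score(y_valid, m.predict(X_valid))
    importances = {}
    for col in X_valid.columns:
        X_shuffled = X_valid.copy()
        X_shuffled[col] = np.random.permutation(X_shuffled[col].values)
        importances[col] = baseline - r2_score(y_valid, m.predict(X_shuffled))
    return importances
```

Newer scikit-learn releases ship the same idea as sklearn.inspection.permutation_importance.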

What is data leakage?
Data leakage is a problem in which we train our model on information that should not really be available to it, or that has no business being in the dataset, so the model produces results that are hard to believe. This information may come from outside the data-collection process.

Data leakage is mainly a problem in complex datasets. One way to detect it is to have a proper validation set: if the predictions on the training data are far better than those on the validation data, leakage may be the cause.

I will cover other interpretation techniques in season 2.
