Random forests model interpretation

Guru Pradeep Reddy
6 min read · Feb 5, 2019


Random forest models are very robust and work well on most datasets. This post explains how to interpret their output, which in turn helps us improve model performance.

Before going into the details, if you are interested you can check out a post I wrote explaining how random forests work here. For this post, I will be using the Blue Book for Bulldozers dataset. The goal is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration.

Feature importance

Feature importance is the first thing we need to look at, whether we are working on a real-life project or a Kaggle competition. Usually, a dataset has a lot of columns, and it is often infeasible to explore all the features in detail. In nearly every real-life dataset there will be only a handful of columns that you actually care about. To find them, we calculate the feature importance, which tells us which features really mattered in making the right predictions according to the random forest we built. Based on that, we can explore only the important features in detail.

In random forests, variable importance is calculated from the relative importance of each feature, which depends on whether the variable was selected for a split in a tree, at what level of the tree it was chosen, and how much the split improved the error. The actual value is calculated as the decrease in error weighted by the probability of reaching that node. The node probability is the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature. The error can be measured in different ways, such as the Gini index or mean squared error.
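As a minimal sketch (assuming the features and target have already been prepared as df_trn and y_trn, which are hypothetical names here), scikit-learn exposes these impurity-based importances as feature_importances_:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest on the prepared training data (df_trn and y_trn are
# assumed to exist from earlier preprocessing).
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
m.fit(df_trn, y_trn)

# Collect the importance of every column and sort in descending order.
fi = pd.DataFrame({'feature': df_trn.columns,
                   'importance': m.feature_importances_})
fi = fi.sort_values('importance', ascending=False)

# Plot the top 15 features, as in the figure below.
fi.head(15).plot('feature', 'importance', kind='barh', figsize=(10, 6), legend=False)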

Given below is the feature importance plot (top 15 only) obtained from a random forest on the Blue Book for Bulldozers dataset. Using this feature importance, we can try to remove the features that didn't help the model make predictions, concentrate on the important ones, and try to enrich them. As we can see, YearMade is the most important feature, which makes sense: the most recent machines go for a higher price because they are newer.

Feature Importance Plot

Removing redundant features using a Dendrogram

Sometimes our dataset may contain features that measure almost the same thing. Such features can confuse our variable importance plot and hurt the performance of the model, since it has to do more computation to learn the same thing. One way to remove redundant features is with the help of a dendrogram.

A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as the output of hierarchical clustering. In hierarchical (or agglomerative) clustering, we look at every pair of points and find the two that are closest. We then take that closest pair, remove them, and replace them with their midpoint, and repeat this again and again. Since we are removing points and replacing them with their averages, we gradually reduce the number of points by pairwise combining them. To calculate feature similarity, instead of looking at points we look at variables and see which two variables are the most similar.

We use the correlation coefficient as the metric to determine the closeness of features. There are many correlation coefficients, but for random forests rank correlation works better, since they do not care about linearity, only about the order of the values. Spearman's R was used to plot the dendrogram given below.
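A minimal sketch of how such a dendrogram can be produced with scipy, assuming the kept features live in a DataFrame called df_keep (a hypothetical name):

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import squareform

# Spearman rank correlation between every pair of features; rounding keeps the
# matrix numerically symmetric for squareform.
corr = np.round(scipy.stats.spearmanr(df_keep).correlation, 4)

# Turn similarity into a distance matrix and run agglomerative clustering.
dist = squareform(1 - corr)
z = hc.linkage(dist, method='average')

# Draw the dendrogram with the feature names on the vertical axis.
plt.figure(figsize=(14, 10))
hc.dendrogram(z, labels=df_keep.columns, orientation='left', leaf_font_size=14)
plt.show()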

The horizontal axis of the dendrogram represents the distance, or dissimilarity, between features, and the vertical axis lists the features. The horizontal position of a split, shown by a short vertical bar, gives the distance between the two features it joins. The dendrogram is intuitive to interpret. From the plot above, we can say that saleYear and saleElapsed are similar. Likewise, Grouser_Tracks, Hydraulics_Flow and Coupler_System seem to be similar, and so do fiBaseModel and fiModelDesc.

After finding all such groups of similar features, we can loop through them, as sketched below, removing one feature at a time from each group (three groups in this case) and checking whether it hurts model performance. If it doesn't, we can safely remove that feature. Obviously, we cannot remove every feature in a similarity group, but we can try to keep only one if doing so doesn't hurt model performance. After removing such redundant features, we get a simpler, smaller, and more interpretable model.
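A minimal sketch of that check, assuming the same df_keep and y_trn as above and using the forest's out-of-bag score as a quick quality measure (just one reasonable choice of metric):

from sklearn.ensemble import RandomForestRegressor

def get_oob(df):
    # Fit a forest and return its out-of-bag R^2 as a cheap quality estimate.
    m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
    m.fit(df, y_trn)
    return m.oob_score_

baseline = get_oob(df_keep)

# Candidates taken from the dendrogram groups above; drop each one in turn and
# compare the score against the baseline.
for col in ['saleYear', 'saleElapsed', 'Grouser_Tracks', 'Hydraulics_Flow',
            'Coupler_System', 'fiBaseModel', 'fiModelDesc']:
    print(col, get_oob(df_keep.drop(col, axis=1)))
print('baseline', baseline)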

Tree Interpreter

Imagine a situation where we have used a random forest to build a model for estimating the price of a product. Given a product, our model estimates some price, but then someone asks why the model gave that price, i.e. how much each feature contributed to the outcome. Won't it be nice to have a tool handy in such situations? That's where the Tree Interpreter comes into the picture. It tells us how much each feature has impacted the predicted value.

Before seeing the output of the tree interpreter, it is helpful to visualize how a single tree predicts a value for a piece of equipment.

Single Decision Tree

Assume we are only using two features, ProductSize and fiBaseModel, to predict the value. If a product has those feature values as 1.0 and 32.0 respectively, we would predict its value as 54016.27 (this can be obtained by following the tree path in the figure above). Now let's try to find out how those feature values contributed to the final price. The mean price over the whole dataset is 31132.56 (the root node's mean value).

  • As ProductSize is less than 1.5, the value decreased from 31132.56 to 23865.101
  • But as fiBaseModel is greater than 30.5, the value increased from 23865.101 to 54016.277 (this decomposition is written out below)
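Written out, the prediction is just the root mean plus each split's change in node value (numbers taken from the tree above):

root_mean = 31132.56                      # mean price at the root node
product_size = 23865.101 - 31132.56       # change from the ProductSize < 1.5 split
fi_base_model = 54016.277 - 23865.101     # change from the fiBaseModel > 30.5 split
print(root_mean + product_size + fi_base_model)   # 54016.277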

So, using a single tree, we can break down why the value is predicted as 54016.27. At each decision point we add or subtract a little from the value, so depending on which feature we split on and how much the value changes, we know how each feature impacts the final decision. The same approach can be followed for random forests, except that instead of one tree we have many, so we take the average across trees.
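A minimal sketch using the treeinterpreter package (linked in the further reading), assuming a fitted forest m and a single-row DataFrame row holding the piece of equipment we want to explain:

from treeinterpreter import treeinterpreter as ti

# For each row: prediction = bias + sum(contributions).
# `bias` is the mean sale price at the root of the trees, and `contributions`
# holds one value per feature.
prediction, bias, contributions = ti.predict(m, row.values)

print(prediction[0], bias[0])

# Pair each feature with its value and contribution, sorted as in the list below.
for name, value, contrib in sorted(zip(row.columns, row.values[0], contributions[0]),
                                   key=lambda x: x[2]):
    print(name, value, contrib)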

Given below is how a random forest predicted the final price of a piece of equipment, starting from the mean price and adjusting it based on each feature value. The first column shows the feature name, the second shows its value, and the third tells us how much that feature value impacted the final price.

[('ProductSize', 'Mini', -10711.476138131464),
('fiProductClassDesc',
'Hydraulic Excavator, Track - 3.0 to 4.0 Metric Tons',
-2649.14348062972),
('saleElapsed', 1284595200, -1683.9341310699049),
('Enclosure', 'EROPS', -1377.9297130991024),
('saleYear', 2010, -1023.9973954073218),
('fiModelDescriptor', nan, -893.4251635183798),
('SalesID', 4364751, -841.8769367770632),
('fiBaseModel', 'KX121', -824.3615641477994),
('fiModelDesc', 'KX1212', -451.67046975023675),
('saleDayofyear', 259, -396.7574060509695),
('fiSecondaryDesc', nan, -382.5772399901316),
('Tire_Size', nan, -347.40121901413283),
('MachineID', 2300944, -344.1756379740019),
('Blade_Type', nan, -211.418896998982),
('saleWeek', 37, -37.30072767993065),
('saleDay', 16, -34.155274963846026),
('Ripper', nan, -25.3176179137301),
('Blade_Extension', nan, 0.0),
('Hydraulics_Flow', nan, 0.0),
('Blade_Width', nan, 0.0),
('Hydraulics', 'Standard', 51.52887544603082),
('state', 'Ohio', 94.42013245805178),
('Grouser_Tracks', nan, 127.84089954337814),
('Coupler_System', nan, 228.59995699361053),
('fiModelSeries', '2', 248.62146227924796),
('ModelID', 665, 1829.0961287686841),
('YearMade', 1999, 2014.078071794381)]

Thus, using the Tree Interpreter, we can explain how the model makes a prediction at the level of a single row. A lot of people say that a random forest is a black-box model, but as we have seen, with the right methods we can interpret its output. That's it for this one. If you want more information, see the links given below. Thanks for reading!

All the code used in this post can be found here.

Further Reading

  1. https://medium.com/@srnghn/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3
  2. http://docs.h2o.ai/h2o/latest-stable/h2o-docs/variable-importance.html
  3. https://github.com/andosa/treeinterpreter

