Random forest is considered one of the most loved machine learning algorithms among data scientists because of its relatively good accuracy, robustness and ease of use.
The reason random forests and other ensemble methods are excellent models for some data science tasks is that they don’t require as much pre-processing compared to other methods and can work well on both categorical and numerical input data. A single decision tree isn’t very robust, but a random forest, which trains many decision trees and aggregates their outputs for prediction, produces a very robust, high-performing model and can even control over-fitting.
Now the question is: if everything is so good, then what’s the problem with random forest?
A forest consists of a large number of decision trees, where each tree is trained on bagged data using a random selection of features. Gaining a full understanding of the decision process by examining each individual tree is therefore infeasible, which is why random forests are often considered a black box.
But is it really so? In this blog we will explain how a random forest works behind the scenes and visualize its results.
To keep it simple, let’s understand it using the iris data. The dataset consists of 3 classes, namely setosa, versicolour and virginica, and we have to predict the class on the basis of four features: sepal length, sepal width, petal length and petal width.
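To make the later snippets concrete, here is a minimal sketch of how such a model could be set up with scikit-learn. The names `estimator`, `X_train` and `X_test`, the 70/30 split and the hyperparameters are assumptions for illustration, not necessarily the setup behind the plots in this post.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the iris data: 150 samples, 4 features, 3 classes
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the data for testing (the split ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit a random forest; n_estimators and random_state are illustrative choices
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
estimator.fit(X_train, y_train)

print("Classes:", list(iris.target_names))
print("Test accuracy:", estimator.score(X_test, y_test))
```

Each fitted tree is then available in `estimator.estimators_`, which is what the visualizations in this post work with.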
1. Feature importance
Among all the features (independent variables) used to train a random forest, it is informative to know their relative importance.
Feature importance tells us which features matter most when training the model. Sometimes training the model on only these features gives comparatively better results.
The above plot suggests that 2 features are highly informative, while the remaining ones are not. The plot gives the relative importance of all the features used to train the model. It can be used in several ways, for example to explain what the model has learned or for feature selection.
Implementation of feature importance plot in Python
import numpy as np
import matplotlib.pyplot as plt

# `estimator` is assumed to be an already fitted RandomForestClassifier
col = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
y = estimator.feature_importances_  # modelname.feature_importances_
fig, ax = plt.subplots()
width = 0.4  # the width of the bars
ind = np.arange(len(y))  # the y locations of the bars
ax.barh(ind, y, width, color="green")
ax.set_yticks(ind)  # one tick per bar, so the labels line up
ax.set_yticklabels(col, minor=False)
plt.title('Feature importance in RandomForest Classifier')
fig.set_size_inches(6.5, 4.5, forward=True)
plt.show()
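The idea of retraining on only the most important features can be sketched as follows. This is an illustrative example, not the procedure used for the plot above: the choice of keeping the top 2 features, the split and the model names `full_model` and `small_model` are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Fit on all four features and rank them by importance (highest first)
full_model = RandomForestClassifier(n_estimators=100, random_state=0)
full_model.fit(X_train, y_train)
ranking = np.argsort(full_model.feature_importances_)[::-1]
top_k = ranking[:2]  # keep the two most important features (k=2 is an assumption)

# Retrain using only the selected columns
small_model = RandomForestClassifier(n_estimators=100, random_state=0)
small_model.fit(X_train[:, top_k], y_train)

print("Selected features:", [iris.feature_names[i] for i in top_k])
print("Accuracy, all features:", full_model.score(X_test, y_test))
print("Accuracy, top 2 features:", small_model.score(X_test[:, top_k], y_test))
```

Whether the smaller model does better depends on the data, so it is worth comparing both scores on held-out data before dropping features.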
2. Tree plot
A random forest is built from several decision trees. Plotting them gives an idea of how the model predicts the value of a target variable by learning simple decision rules inferred from the data features.
Every decision at a node is made by a test on a single feature. Plotting a decision tree shows the split value, the number of data points at every node, etc.
Because random forest relies on majority voting, data scientists usually prefer a large number of trees (even up to 200) to build the forest, so it is almost impracticable to inspect all the decision trees. But visualizing any 2–3 trees picked at random gives a fairly good intuition of what the model has learned.
A tree plot is very informative, but retrieving most of the information from it is a tricky task.
Every intermediate node carries the following information: feature name, split value, splitting criterion used (default ‘gini’), number of samples, and number of samples of each class.
Here’s an explanation of the tree and its parameters.
1. Feature name: the feature tested at the node.
- The feature at every node is chosen from a subset of all features.
- The subset is formed at random.
- The feature from the subset is selected using gini (or information gain).
(Note: either gini or information gain can be used; gini is usually used because it is less computationally complex.)
2. Split value: the threshold that gives the highest information gain for that split.
(information gain = entropy(parent) - weighted sum of entropy(children), where each child is weighted by its share of the parent’s samples)
This value is selected from the range of the feature, i.e. the best value is picked between feature_val_min and feature_val_max.
3. Gini: the impurity of the node. It is the deciding factor used to select the feature at the next node, to pick the best split value, etc.
4. Samples: number of samples remaining at that particular node.
5. Values: number of samples of each class remaining at that particular node.
(theoretically, sum of values at a node = samples)
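As a worked illustration of the impurity measures mentioned in the list, here is a small sketch in plain Python. The helper names and the 6-sample node are hypothetical, and the information gain uses the size-weighted entropy of the children.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the classes present in `labels`."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node with 6 samples, split perfectly into two pure children
parent = ["setosa"] * 3 + ["virginica"] * 3
left = ["setosa"] * 3
right = ["virginica"] * 3

print(gini(parent))                           # 0.5
print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0
```

A split that separates the classes perfectly recovers all of the parent's entropy, which is why it would be preferred over any other threshold.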
NOTE: In the tree plot, the sum of values at a node is greater than samples. This is because random forest works with duplicates generated by bootstrap sampling: samples counts the distinct training points at the node, while values also counts their duplicates.
On average about 63.2% (that is, 1 - 1/e) of the values are distinct points and the remaining ones are duplicates.
(Just to cross-check, compute 63.2% of the sum of values at any node; it is roughly equal to the number of samples.)
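The 63.2% figure can be checked with a quick simulation of bootstrap sampling. This is only a sketch: the training-set size `n` and the number of repeats are arbitrary choices for illustration.

```python
import random
from math import exp

random.seed(0)   # fixed seed so the run is repeatable
n = 10_000       # hypothetical training-set size
n_repeats = 20

fractions = []
for _ in range(n_repeats):
    # Bootstrap: draw n indices with replacement from range(n)
    sample = [random.randrange(n) for _ in range(n)]
    # Fraction of distinct points that made it into the sample
    fractions.append(len(set(sample)) / n)

mean_fraction = sum(fractions) / n_repeats
print("mean unique fraction:", round(mean_fraction, 3))
print("theoretical limit 1 - 1/e:", round(1 - exp(-1), 3))
```

The simulated fraction should land close to 1 - 1/e, which is where the 63.2% comes from.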
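Implementation of tree plot in Python
import os
from sklearn.tree import export_graphviz

# `estimator` is assumed to be a fitted RandomForestClassifier and `col` the list of feature names.
# Rendering needs the Graphviz `dot` command to be installed and on the PATH.
# Only the first three trees are exported, since 2-3 trees are enough for intuition.
for i_tree, tree_in_forest in enumerate(estimator.estimators_[:3]):
    name = 'tree' + str(i_tree)
    # Write the tree to a .dot file, then render it as a PNG image
    export_graphviz(tree_in_forest, out_file=name + '.dot', feature_names=col, filled=True)
    os.system('dot -Tpng ' + name + '.dot -o ' + name + '.png')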
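If Graphviz is not installed, a tree can also be drawn directly with matplotlib using scikit-learn's `plot_tree`. The sketch below is an alternative to the approach above, not a replacement for it; the small forest, the figure size and the output file name `tree0.png` are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree

iris = load_iris()
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(iris.data, iris.target)

# Draw the first tree of the forest; each box shows the split rule,
# the gini impurity, the sample count and the per-class values
fig, ax = plt.subplots(figsize=(12, 8))
annotations = plot_tree(
    forest.estimators_[0],
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
    ax=ax,
)
fig.savefig("tree0.png", dpi=100)
```

Changing the index in `forest.estimators_[0]` draws a different tree from the same forest.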
Both of the above methods visualize what the model has learned.
But in many domains, typically finance and medicine, experts are much more interested in explaining why the model assigns a particular class label to a given test sample.
Hence single-sample interpretability is much more important.
Some methods for visualizing a single sample are:
1. Decision path
2. Feature contribution
3. Waterfall_plot (useful for 2 class classification)
Traversing the path of a single test sample through the decision trees of an ensemble model can sometimes be very informative.
The decision tree structure can therefore be analysed to gain further insight into the relation between the features and the target to predict.
This can be carried out using the attributes of the decision tree estimator. The estimator has an attribute called tree_ which stores the entire tree structure and allows access to low-level attributes. The binary tree tree_ is represented as a number of parallel arrays. The i-th element of each array holds information about the node `i`. Node 0 is the tree's root.
NOTE: Some of the arrays only apply to either leaves or split nodes, respectively. In this case the values of nodes of the other type are arbitrary! Among those arrays, we have:
- children_left, id of the left child of the node
- children_right, id of the right child of the node
- feature, feature used for splitting the node
- threshold, threshold value at the node
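A short sketch of reading these parallel arrays is given below. It uses a single shallow `DecisionTreeClassifier` for brevity (the depth limit is an assumption to keep the output short); every tree in a forest's `estimators_` exposes the same `tree_` attribute.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
tree = clf.tree_

# Walk the parallel arrays node by node; node 0 is the root
for node_id in range(tree.node_count):
    left, right = tree.children_left[node_id], tree.children_right[node_id]
    if left == right:  # both children are -1 for a leaf
        print(f"node {node_id}: leaf, samples={tree.n_node_samples[node_id]}")
    else:
        name = iris.feature_names[tree.feature[node_id]]
        print(f"node {node_id}: {name} <= {tree.threshold[node_id]:.2f} "
              f"-> left {left}, right {right}")
```

For leaf nodes the `feature` and `threshold` entries carry no meaning, which is why the loop checks the children first.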
For a single test sample we can traverse the decision path and visualize how that sample is assigned a class label in the different decision trees of the ensemble model.
Implementation of decision path in Python
# Refer to the scikit-learn example: http://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html
# The following code prints every tree in the desired format, following the scikit-learn example.
for i, e in enumerate(estimator.estimators_):
    print("Tree %d\n" % i)
        suffix=i, sample_id=1, feature_names=["Feature_%d" % i for i in range(X.shape[1])])
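A self-contained sketch of printing the decision path of one sample through each tree is shown below. It relies on scikit-learn's `decision_path` method rather than the helper used above; the tiny 3-tree forest and the choice of `sample_id = 1` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target
forest = RandomForestClassifier(n_estimators=3, random_state=0).fit(X, y)

sample_id = 1                        # index of the sample to explain
sample = X[sample_id:sample_id + 1]  # keep it 2-D: shape (1, n_features)

for i, tree_clf in enumerate(forest.estimators_):
    tree = tree_clf.tree_
    # decision_path returns a sparse indicator matrix: row = sample, column = node
    node_ids = tree_clf.decision_path(sample).indices
    print("Tree %d" % i)
    for node_id in node_ids:
        if tree.children_left[node_id] == tree.children_right[node_id]:
            predicted = iris.target_names[int(tree_clf.predict(sample)[0])]
            print("  leaf node %d reached, predicted class: %s" % (node_id, predicted))
            continue
        feat = tree.feature[node_id]
        sign = "<=" if sample[0, feat] <= tree.threshold[node_id] else ">"
        print("  node %d: %s = %.1f %s %.2f"
              % (node_id, iris.feature_names[feat], sample[0, feat],
                 sign, tree.threshold[node_id]))
```

Each printed line is one test the sample passed through, so reading a tree's block from top to bottom gives the rule that led to its vote.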
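Any prediction on a test sample can be decomposed into contributions from features, such that:
prediction = bias + feature_1 contribution + ... + feature_n contribution
where the bias is the value at the root of the tree, i.e. the average over the training set.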
Table 2 shows some test samples picked at random from the dataset. Our objective is to determine the contribution of every feature to the class label; the values are shown in table 3.
It shows that petal length and sepal width contribute the most to determining the class label.
A negative value means the feature shifts the prediction away from the corresponding class, and a positive value shifts it towards that class.
Plotting this data as a bar plot gives the contribution plot of the features.
Contribution plots are very helpful in domains such as finance and medicine, where experts are curious to know which feature or factor is responsible for the predicted class label. Contribution plots are also useful for simulating the model.
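To show where such contributions come from, here is a minimal sketch that computes them by hand for a single decision tree, using only scikit-learn and NumPy. It illustrates the idea of crediting each change in class distribution along the decision path to the feature that was split on; it is not the treeinterpreter implementation, and the tree depth and the chosen sample are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree = clf.tree_

# Class distribution at every node, normalised so each row sums to 1
values = tree.value[:, 0, :]
values = values / values.sum(axis=1, keepdims=True)

sample = X[100:101]                       # one sample to explain
path = clf.decision_path(sample).indices  # node ids from root to leaf

bias = values[path[0]]                    # class distribution at the root
contributions = np.zeros((X.shape[1], values.shape[1]))
for parent, child in zip(path[:-1], path[1:]):
    # Credit the change in class distribution to the feature split on at `parent`
    contributions[tree.feature[parent]] += values[child] - values[parent]

prediction = bias + contributions.sum(axis=0)
print("bias:", np.round(bias, 3))
for name, row in zip(iris.feature_names, contributions):
    print(name, np.round(row, 3))
print("prediction:", np.round(prediction, 3))
```

Because the changes along the path telescope, the bias plus all contributions equals the class distribution at the leaf, which is the tree's predicted probability. For a forest, the same quantities are averaged over the trees.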
Implementation of feature contribution plot in Python
import numpy as np
import matplotlib.pyplot as plt
from treeinterpreter import treeinterpreter as ti

# `estimator` is the fitted forest, `X_test` the test data and `col` the list of feature names
prediction, bias, contributions = ti.predict(estimator, X_test[6:7])
N = 5  # number of entries in the plot: 4 features and 1 class label
setosa = []
versicolor = []
virginica = []
list_ = [setosa, versicolor, virginica]
for j in range(3):  # classes
    for i in range(4):  # features
        val = contributions[0, i, j]
        list_[j].append(val)
# fifth bar: predicted probability of each class, divided by 5
setosa.append(prediction[0, 0] / 5)
versicolor.append(prediction[0, 1] / 5)
virginica.append(prediction[0, 2] / 5)
fig, ax = plt.subplots()
ind = np.arange(N)
width = 0.15
p1 = ax.bar(ind, setosa, width, color='red', bottom=0)
p2 = ax.bar(ind + width, versicolor, width, color='green', bottom=0)
p3 = ax.bar(ind + (2 * width), virginica, width, color='yellow', bottom=0)
ax.set_title('Contribution of all feature for a particular \n sample of flower ')
ax.set_xticks(ind + width)  # centre the tick under each group of three bars
ax.set_xticklabels(col + ['prediction'], rotation=90)
ax.legend((p1, p2, p3), ('setosa', 'versicolor', 'virginica'), bbox_to_anchor=(1.04, 1), loc="upper left")
ax.autoscale_view()
Random forest is a commonly used model in machine learning, and is often referred to as a black box model. In many cases it outperforms many of its parametric equivalents, and is less computationally intensive to boot. Using the visualization methods above, we can understand the model and its training, and help others understand it too.
For any suggestions or queries, leave your comments below.
You can connect with me on LinkedIn.