Random Forest is a supervised learning algorithm which, as the name suggests, is an ensemble of several trees (i.e. the Decision Tree algorithm). The individual trees are generally trained via the bagging method, or sometimes pasting. For this post, I'll not go into detail on what Ensemble Learning or Bagging is, but you can get enlightened on the matter here.
The Decision Tree is a great algorithm; nevertheless, a single tree trained on the data will probably not return the best prediction. But what if you train several decision trees on different subsets of your training data and, in the end, take the most commonly given answer for each instance? This is the basis of Ensemble Learning, and thus of Random Forests. Just look at the picture for a simple example.
You have your dataset X. You subdivide it into several subsets (either by bagging or pasting) and train a Decision Tree on each subset. In the end, you have several trained decision trees, each returning its own prediction for every observation in the dataset. You are now able to choose, per observation, the most probable answer. Individually, the predictions made by each model may not be accurate, but combined they will be closer to the mark on average.
Looking at it step-by-step, this is what a random forest model does:
- Random subsets are created from the original dataset by sampling with replacement (bootstrapping).
- A decision tree model is fitted on each of these subsets.
- At each node in a tree, only a random subset of the features is considered when deciding the best split.
- The final prediction is obtained by aggregating the predictions from all decision trees (majority vote for classification, averaging for regression).
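The steps above can be sketched by hand with scikit-learn building blocks. This is purely illustrative (in practice you would use RandomForestClassifier directly, as shown below); the number of trees and the dataset are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris["data"], iris["target"]
rng = np.random.RandomState(1)

trees = []
for _ in range(10):
    # Bootstrapping: sample observations with replacement
    idx = rng.randint(0, len(X), len(X))
    # max_features="sqrt": only a random subset of features per split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=1)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Aggregate: majority vote across the trees, per observation
all_preds = np.array([t.predict(X) for t in trees])
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                               0, all_preds)
```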
Training a Random Forest in Python
Let’s train a Random Forest classifier with 500 trees (each limited to a maximum of 16 leaf nodes):
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], test_size=0.3, random_state=1)

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)
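To check how the classifier does on the held-out data, you can compare its predictions against the test labels with accuracy_score (a self-contained rerun of the same setup, for illustration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], test_size=0.3, random_state=1)

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

# Fraction of test observations predicted correctly
print(accuracy_score(y_test, y_pred_rf))
```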
RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier to control the trees’ growth (with a few exceptions), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself. Moreover, the Random Forest introduces extra randomness when growing trees: instead of searching for the best feature when splitting a node, it searches for the best feature among a random subset of features. This yields greater tree diversity, trading a slightly higher bias for a lower variance and generally producing a better overall model.
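This relationship can be made concrete: a BaggingClassifier wrapped around randomized decision trees behaves roughly like a Random Forest. The sketch below mirrors the classifier trained earlier (the equivalence is approximate, not exact):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# splitter="random" adds the per-split feature randomness;
# bootstrap=True gives the bagging behaviour
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True,
    n_jobs=-1, random_state=1)
bag_clf.fit(iris["data"], iris["target"])
```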
You have probably already heard the expression “feature importance”. Well, this is one of the best qualities of Random Forests, because it allows you to measure the impact each predictor has on the final prediction.
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)
This feature comes in very handy when you’re trying to figure out which features actually matter, in particular, if you need to perform feature selection.
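One way to turn these importances into actual feature selection is scikit-learn's SelectFromModel, which by default keeps only the features whose importance is above the mean (a sketch; the estimator settings are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

# Fit a forest and keep only features with above-average importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=1))
X_reduced = selector.fit_transform(iris["data"], iris["target"])
print(X_reduced.shape)
```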
By tuning the hyperparameters you can pursue two goals: either increase the predictive power of the model or make it faster.
- Increasing predictive power
n_estimators : the number of trees the algorithm grows before making the prediction. A higher number of trees is expected to increase performance and make the prediction more stable, with the disadvantage of slowing down the computation;
max_features : the maximum number of features the Random Forest is allowed to consider in an individual tree when looking for the best split;
min_samples_leaf : the minimum number of samples required at a leaf node; a split is only considered if it leaves at least this many samples in each branch.
- Increasing processing speed
n_jobs : tells the engine how many processors it is allowed to use. A value of -1 means it may use all available processors;
random_state : makes the model’s output replicable. Given a fixed value, the same hyperparameters, and the same training data, it will always produce the same results;
oob_score : enables out-of-bag evaluation, a Random Forest cross-validation method in which each tree is scored on the bootstrap samples it never saw during training. It is similar in spirit to the leave-one-out cross-validation method, but comes at almost no additional computational cost.
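Out-of-bag evaluation in practice is just a constructor flag; after fitting, the score is available on the model (a brief sketch on the iris data):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# oob_score=True: each tree is evaluated on the samples left out
# of its bootstrap subset, giving a free validation estimate
rnd_clf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                 n_jobs=-1, random_state=1)
rnd_clf.fit(iris["data"], iris["target"])
print(rnd_clf.oob_score_)
```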
Random Forest is a great algorithm for producing a predictive model, for both classification and regression problems. Its default hyperparameters already return good results, and the algorithm is good at avoiding overfitting. Moreover, it provides a pretty good indicator of the importance it assigns to your features.
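For the regression case, the API mirrors the classifier: RandomForestRegressor averages the trees' numeric predictions instead of taking a vote. A quick sketch on a synthetic dataset (the data and settings are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, purely for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=0.1,
                       random_state=1)

reg = RandomForestRegressor(n_estimators=100, random_state=1)
reg.fit(X, y)

# Predictions are the average of the individual trees' outputs
print(reg.predict(X[:3]))
```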
The great disadvantage is that a high number of trees can make the computation much slower, which may be impractical for real-time predictions.
If you liked it, follow me for more publications and, please, don’t forget to give it some applause!
- Hands-on Machine Learning with Scikit-Learn & TensorFlow by Aurélien Géron, Chapter 7
- Analytics Vidhya, A Comprehensive Guide to Ensemble Learning (with Python codes)