Random Forest Classifier and its Hyperparameters

Ankit Chauhan · Published in Analytics Vidhya · Feb 23, 2021

Understanding how the Random Forest Classifier works

Data science provides a plethora of classification algorithms, such as Support Vector Machines, the Naïve Bayes classifier, Logistic Regression, and Decision Trees. But near the top of the classifier hierarchy sits the Random Forest Classifier (there is also the Random Forest Regressor, but that is a topic for another day).

To understand the working of a Random Forest classifier, we need to first understand the concept of Decision Trees.

If you are not familiar with the decision tree classifier, please spend some time understanding how it works before you start learning about the random forest algorithm. If you would like to learn how to implement a decision tree classifier, you can check out the articles below.

· Implementing the decision tree classifier in Python

· Building decision tree classifier in R programming language

· How to visualize the modeled decision tree classifier

Basic Decision Tree concept:

A decision tree is essentially a rule-based system. Given a training dataset with features and targets, the decision tree algorithm comes up with a set of rules. The same set of rules can then be used to make predictions on the test dataset.

Suppose you would like to predict whether your friend will like a newly released animated movie. To build the decision tree you use a training dataset, for example, the animated characters your friend liked in past movies.

Once you pass this dataset to the decision tree classifier, with the target being whether your friend likes the movie, the tree starts building rules with the characters your friend likes as internal nodes and the targets (like or dislike) as leaf nodes. By following the path from the root node to a leaf node, you can read off a rule.

A simple rule could be: if character x is playing the leading role, then your friend will like the movie. You can think of a few more rules along these lines.

Then, to predict whether your friend will like a newly released movie, you simply check it against the rules the decision tree has created.

In a decision tree algorithm, selecting these nodes and forming the rules is done using information gain or Gini index calculations.
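As a small illustration of this, here is what training such a tree and reading off its rules could look like with scikit-learn. The tiny movie dataset and the feature names below are made up purely for the example.

```python
# Illustration only: a made-up movie dataset with hypothetical features.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [favourite character in the lead role, is animated, runtime over 2h]
X = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
]
y = [1, 1, 0, 0, 1, 0]  # 1 = friend liked the movie, 0 = did not

# criterion="gini" (default) or "entropy" chooses how candidate splits are scored
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

# Each root-to-leaf path printed here is one learned rule
print(export_text(tree, feature_names=["fav_character_lead", "is_animated", "runtime_over_2h"]))
```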

In a random forest algorithm, the trees still use information gain or the Gini index to score splits, but at each split only a randomly chosen subset of the features is considered. As a result, the root node and the feature nodes differ from tree to tree. We will look at this in detail in the coming sections.

Random Forest Classifier:

Decision trees have low bias and high variance, so they tend to overfit the data. The bagging technique is therefore a very good way to decrease the variance of a decision tree. Instead of using a plain bagging model with decision trees as the underlying model, we can use a random forest, which is more convenient and better optimized for decision trees. The main issue with bagging is that there is not much independence among the sampled datasets, i.e. the resulting trees are correlated.

The advantage of random forests over bagging models is that random forests tweak the bagging algorithm to decrease the correlation between trees: they introduce extra randomness while growing each tree, which helps reduce that correlation.

Random forest is a supervised learning algorithm used for both classification and regression, though it is mainly used for classification problems. Just as a forest is made up of trees, and more trees make a more robust forest, a random forest algorithm builds decision trees on data samples, gets a prediction from each of them, and finally selects the best solution by voting.

In general, the more trees in the forest, the more robust it is. In the same way, in a random forest classifier, a higher number of trees tends to give more accurate and more stable results.

The fundamental concept behind random forest is a simple but powerful one — the wisdom of crowds. In data science-speak, the reason that the random forest model works so well is:

A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.

The low correlation between models is the key.

Uncorrelated models can produce ensemble predictions that are more accurate than any of the individual predictions. The reason for this wonderful effect is that the trees protect each other from their individual errors (as long as they don’t constantly all err in the same direction). While some trees may be wrong, many other trees will be right, so as a group the trees can move in the correct direction. So the prerequisites for random forest to perform well are:

1. There needs to be some actual signal in our features so that models built using those features do better than random guessing.

2. The predictions (and therefore the errors) made by the individual trees need to have low correlations with each other.
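As a rough illustration of why low correlation matters, here is a toy simulation (assuming NumPy) in which each simulated "tree" is right 65% of the time and the trees' errors are completely independent, the ideal uncorrelated case. The majority vote ends up right far more often than any single tree.

```python
# Toy simulation of the "wisdom of crowds" effect (assumes NumPy).
# Each simulated "tree" is correct on a given example with probability 0.65,
# and the trees' errors are independent (i.e. fully uncorrelated) in this toy setup.
import numpy as np

rng = np.random.default_rng(42)
n_trees, n_examples = 101, 10_000

# correct[i, j] is True if tree i classifies example j correctly
correct = rng.random((n_trees, n_examples)) < 0.65

single_tree_accuracy = correct[0].mean()
# Majority vote: the ensemble is correct when more than half of the trees are correct
ensemble_accuracy = (correct.sum(axis=0) > n_trees / 2).mean()

print(f"single tree: {single_tree_accuracy:.3f}")                    # ~0.65
print(f"majority vote of {n_trees} trees: {ensemble_accuracy:.3f}")  # close to 1.0
```

In a real forest the trees are never fully independent, which is exactly why the extra randomness described below matters.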

The pseudocode for the random forest algorithm can be split into two stages.

  • Random forest creation pseudocode.
  • Pseudocode to perform prediction from the created random forest classifier.

First, let’s begin with the random forest creation pseudocode.

Random Forest pseudocode:

1. Just like in bagging, different samples are collected from the training dataset using bootstrapping.

2. On each sample, we train a tree model, allowing the tree to grow to a high depth.

Now, the difference with random forests is how the trees are formed. In plain bagging, all the predictors are available when splitting a node, but not in random forests: each time a split is to happen while building a decision tree, a random sample of ‘m’ predictors is chosen from the total ‘p’ predictors, and only those ‘m’ predictors may be used for that split.

m ≪ p

Why is that?

Suppose that among the ‘p’ predictors, one predictor is very strong. In every bootstrap sample this predictor will remain the strongest, so every tree built on these samples will pick it for its top splits, resulting in very similar trees for each bootstrap model. This introduces correlation among the trees, and averaging correlated predictions does not reduce variance much. That is why, in a random forest, the choice of features available for each split is limited, which introduces additional randomness into the formation of the trees.

Most of the predictors are not allowed to be considered for any given split.

Generally, the value of ‘m’ is taken as m ≈√p, where ‘p’ is the number of predictors in the sample.

When m = p, the random forest model reduces to a bagging model.

3. Among the “m” features, calculate the node “d” using the best split point.

4. Split the node into daughter nodes using the best split.

5. Repeat steps 3 and 4 until the “l” number of nodes has been reached.

6. Build the forest by repeating steps 1 to 5 “n” times to create “n” trees.

Within each tree, the algorithm starts by randomly selecting “m” features out of the total “p” features.

In the next stage, we use the randomly selected “m” features to find the root node using the best-split approach.

After that, we calculate the daughter nodes using the same best-split approach, drawing a fresh random subset of “m” features at each split, and we repeat this until the tree has a root node and leaf nodes holding the targets.

Finally, we repeat steps 1 to 5 to create “n” randomly grown trees. These trees together form the random forest.
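To make the creation steps concrete, here is a compact from-scratch sketch using NumPy and scikit-learn decision trees. The function name build_random_forest and its details are illustrative assumptions (X and y are assumed to be NumPy arrays), not a library API; in practice, scikit-learn's RandomForestClassifier performs all of this internally.

```python
# A from-scratch sketch of the forest-creation pseudocode above (illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_random_forest(X, y, n_trees=100, random_state=0):
    rng = np.random.default_rng(random_state)
    n_samples = X.shape[0]
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample (sampling rows with replacement)
        idx = rng.integers(0, n_samples, size=n_samples)
        # Steps 2-5: grow a deep tree; max_features="sqrt" means that at every
        # split only m ≈ √p randomly chosen features are considered
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1_000_000)))
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest
```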

Random forest prediction pseudocode:

To perform prediction with the trained random forest, use the pseudocode below.

1. Take the test features, use the rules of each randomly created decision tree to predict the outcome, and store each predicted outcome (target).

2. Calculate the votes for each predicted target.

3. Take the highest-voted predicted target as the final prediction of the random forest algorithm.

To make a prediction with the trained random forest, we pass the test features through the rules of each randomly created tree. Suppose the random forest consists of 100 random decision trees.

Each random tree may predict a different target (outcome) for the same test example. Votes are then counted over these predicted targets. Suppose the 100 random decision trees predicted 3 unique targets x, y, and z; the vote count for x is simply the number of trees, out of 100, whose prediction is x.

Likewise for the other two targets (y and z). If x gets the highest number of votes, say 60 out of the 100 random decision trees predict x, then the random forest returns x as the final predicted target.

This concept of voting is known as majority voting.
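Continuing the from-scratch sketch above, majority-vote prediction could look like this; predict_majority_vote is again an illustrative name, not a library function.

```python
# Prediction by majority voting over the trees built by build_random_forest above.
import numpy as np

def predict_majority_vote(forest, X_test):
    # Collect one prediction per tree: shape (n_trees, n_test_samples)
    all_preds = np.array([tree.predict(X_test) for tree in forest])
    final = []
    for column in all_preds.T:  # one column per test example
        targets, votes = np.unique(column, return_counts=True)
        final.append(targets[np.argmax(votes)])  # highest-voted target wins
    return np.array(final)
```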

Visual representation of a Random Forest Classifier:

[Figure: a set of decision trees, each making its own prediction, with the individual predictions combined by majority vote into the final output.]

Python Implementation of Random Forest Classifier:

· Importing all the necessary libraries
· Splitting the train and test data
· Fitting the model to the training data
· Calculating the accuracy
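A minimal sketch of those four steps, assuming scikit-learn; since the original dataset is not shown here, the built-in Iris dataset stands in.

```python
# Importing all the necessary libraries
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# The Iris dataset is used here only as a stand-in example
X, y = load_iris(return_X_y=True)

# Splitting the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fitting the model to the training data
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)

# Calculating the accuracy
y_pred = rfc.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```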

Hyperparameters of Random Forest Classifier:

1. max_depth: The max_depth of a tree in a Random Forest is defined as the length of the longest path from the root node down to a leaf node.

2. min_samples_split: The minimum number of observations required in a node for the decision tree to split it. Default = 2

3. max_leaf_nodes: This hyperparameter caps the number of leaf nodes a tree can have, and hence restricts the growth of the tree.

4. min_samples_leaf: This Random Forest hyperparameter specifies the minimum number of samples that should be present in the leaf node after splitting a node. Default = 1

5. n_estimators: Number of trees in the forest.

6. max_samples: Determines what fraction of the original dataset is given to any individual tree (used only when bootstrap=True).

7. max_features: The maximum number of features considered when looking for the best split at each node; this is the ‘m’ from the earlier discussion.

8. bootstrap: Whether bootstrap samples (sampling with replacement) are used when building trees; if False, the whole dataset is used to build each tree. Default = True

9. criterion: The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
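As a quick reference, this is roughly how the hyperparameters above map onto scikit-learn's RandomForestClassifier constructor; the particular values are arbitrary examples, not recommendations.

```python
# Illustrative only: arbitrary example values for the hyperparameters listed above.
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(
    n_estimators=200,        # number of trees in the forest
    criterion="gini",        # or "entropy"
    max_depth=10,            # longest allowed root-to-leaf path
    min_samples_split=5,     # minimum samples needed to split a node
    min_samples_leaf=2,      # minimum samples required in a leaf
    max_leaf_nodes=50,       # caps the number of leaves, limiting growth
    max_features="sqrt",     # features considered at each split (m ≈ √p)
    max_samples=0.8,         # fraction of the dataset given to each tree
    bootstrap=True,          # sample rows with replacement
    random_state=42,
)
```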

Now, having seen how to set the hyperparameters manually, let's use GridSearchCV for hyperparameter tuning:

· Using grid search to find the best parameters
· Fitting the training data with the best parameters and calculating the new accuracy
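A sketch of those two steps, reusing the train/test split from the earlier implementation sketch; the parameter grid below is an example I chose for illustration, not the article's exact settings.

```python
# Grid search over a small example grid (values are illustrative, not prescriptive).
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2"],
}

# Using grid search to find the best parameters
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)

# Fitting the training data with the best parameters and calculating the new accuracy
best_rfc = grid.best_estimator_  # refit on the full training data by default
y_pred = best_rfc.predict(X_test)
print("Tuned accuracy:", accuracy_score(y_test, y_pred))
```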

The final accuracy after hyperparameter tuning has increased.

Advantages of a Random Forest Classifier:

· It reduces the problem of overfitting by averaging or combining the results of different decision trees.

· Random forests work well over a larger range of data than a single decision tree does.

· Random forest has less variance than a single decision tree.

· Random forests are very flexible and generally achieve high accuracy.

· Scaling of the data is not required by the random forest algorithm; it maintains good accuracy even when the data is not scaled.

· Random Forest algorithms maintain good accuracy even when a large proportion of the data is missing.

Disadvantages of a Random Forest Classifier:

· Complexity is the main disadvantage of Random forest algorithms.

· Construction of random forests is much harder and more time-consuming than that of decision trees.

· More computational resources are required to implement the Random Forest algorithm.

· It is less intuitive and harder to interpret when we have a large collection of decision trees.

· The prediction process using random forests is very time-consuming in comparison with other algorithms.

Conclusion:

With this, we conclude our discussion of how the Random Forest Classifier works and how to tune its various hyperparameters. For a better understanding of hyperparameter tuning, you can try working on the “wine quality dataset” yourself.

Download the dataset — https://www.kaggle.com/rajyellow46/wine-quality

YouTube links for Random Forest Classifier:

1. StatQuest: https://www.youtube.com/watch?v=J4Wdy0Wc_xQ

2. Krish Naik: https://www.youtube.com/watch?v=nxFG5xdpDto

Check out my previous article!

ENSEMBLE METHODS — Bagging, Boosting, and Stacking.

I hope you like this post. If you have any questions, then feel free to comment below. If you want me to write on one particular topic in Machine Learning, then do tell it to me in the comments below.

Thanks for reading!

