A Trip to Random Forest…

Jocelyn D'Souza
Published in GreyAtom
Mar 20, 2018 · 3 min read

Welcome to a beginner's guide to the random forest. :)

Random Forest is an ensemble method and one of the most popular and powerful algorithms in Machine Learning. Before we move ahead, I hope you have a basic idea of a Decision Tree. In this post, we will learn how the Random Forest algorithm works and look at some of its other important features.


What is Random Forest?

Random Forest is a supervised learning algorithm capable of performing both regression and classification tasks. As the name suggests, the Random Forest algorithm creates a forest out of a number of decision trees, as shown below:

[Image: a forest built from many individual decision trees. Source: Analytics Vidhya]

How does Random Forest work?

As we can see, there are multiple decision trees acting as base learners. Each decision tree is given a random subset of samples from the dataset (thus the name ‘Random’). The Random Forest algorithm uses Bagging (Bootstrap Aggregating), which we learned about in ensemble methods. The general idea of ensemble methods is that a combination of learning models improves the overall result. In bagging, we train each base learner (i.e. a Decision Tree) on a different sample of the data, and the sampling of data points happens with replacement.

Let’s take an example:

We have a training dataset: [X1, X2, X3, … X10]. Using bagging, Random Forest may create three decision trees, each trained on a different bootstrapped subset of this dataset (see the short sketch below).

So finally, the forest predicts based on the majority of votes (in the case of classification) or the average of the individual outputs (in the case of regression) from the decision trees it built.
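To make this concrete, here is a minimal Python sketch of the bagging idea. The toy dataset and the choice of three trees are illustrative assumptions on my part; sklearn's real Random Forest additionally randomizes the features considered at each split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy stand-in for the training set [X1, X2, ... X10]: 10 samples, 4 features.
X, y = make_classification(n_samples=10, n_features=4, random_state=0)

trees = []
for _ in range(3):
    # Bootstrap sample: draw 10 row indices *with replacement*.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Majority vote: each tree predicts, and the most common label wins.
votes = np.stack([tree.predict(X) for tree in trees])  # shape (3, 10)
majority = np.round(votes.mean(axis=0)).astype(int)
print(majority)
```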

Advantages of Random Forest

  1. The Random Forest algorithm reduces overfitting compared to a single decision tree
  2. The same Random Forest algorithm can be used for both classification and regression tasks
  3. The Random Forest algorithm can be used to identify the most important features in the training dataset, which helps in feature engineering (see the short sketch after this list).
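As a quick illustration of point 3, here is a short sketch using sklearn's built-in feature_importances_ attribute; the iris dataset is just an example choice, not part of the original discussion.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Example dataset with named features (any tabular dataset would do).
iris = load_iris()

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(iris.data, iris.target)

# feature_importances_ scores how much each feature contributed to the splits.
for name, score in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {score:.3f}")
```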

Disadvantages of Random Forest

  1. Random Forest is difficult to interpret. Because it averages the results of many trees, it becomes hard for us to figure out why a random forest makes the predictions it does.
  2. Random Forest takes a longer time to train. It is computationally expensive compared to a single Decision Tree.

Tuning in Random Forest

As we know, the strings of a guitar may loosen up and sound different over time. Hence, the strings must be adjusted, i.e. tightened or loosened, so they produce the “right” sound. In the same manner, the parameters of a random forest are tuned to increase the predictive power or to make the model faster.

Let me take you through a few of sklearn's built-in parameters:

n_estimators is the number of decision trees that the algorithm creates. As the number of trees increases, the performance generally improves and the predictions become more stable, but the computation slows down.

max_features is the maximum number of features that the algorithm considers when looking for the best split at each node of an individual tree.

n_jobs is the number of jobs to run in parallel. If n_jobs=1, it uses one processor. If n_jobs=-1, then the number of jobs is set to the number of cores available.

max_depth is the maximum depth of the tree.

criterion is the function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
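Putting these parameters together, here is a minimal sketch of building and fitting a tuned forest; the specific values below are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data purely for demonstration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = RandomForestClassifier(
    n_estimators=200,     # more trees: more stable predictions, slower training
    max_features="sqrt",  # features considered when looking for the best split
    max_depth=10,         # cap tree depth to limit overfitting
    criterion="gini",     # or "entropy" for information gain
    n_jobs=-1,            # use all available cores
    random_state=0,
)
model.fit(X, y)
print(model.score(X, y))  # accuracy on the training data, as a sanity check
```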

To view the complete parameter list, visit sklearn’s Random Forest module.

Now we come to the end of this short trip. I hope I have walked you through the basics of the Random Forest algorithm. :)

Thanks for reading! ❤

Follow for more updates!
