Random Forest — Ensemble method

Yamini · Published in Geek Culture · Apr 18, 2021

Random Forest is one of the advanced techniques in supervised learning, widely used for both regression and classification problems on almost any data, including non-linear, real-world data. As the name says, it is a forest made of many trees, which combine to become a Random Forest.

Okay, what is an ensemble method?

Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), bias (boosting), or improve predictions (stacking).
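If you would like to see those three flavours side by side, here is a minimal sklearn sketch (my own toy comparison on a synthetic dataset, not something from this article): BaggingClassifier bags decision trees, GradientBoostingClassifier boosts them sequentially, and StackingClassifier stacks base models under a meta-model.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A toy dataset just so the comparison runs end to end
X, y = make_classification(n_samples=500, random_state=0)

models = {
    "bagging (variance)": BaggingClassifier(n_estimators=50, random_state=0),
    "boosting (bias)": GradientBoostingClassifier(random_state=0),
    "stacking (predictions)": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()), ("lr", LogisticRegression())],
        final_estimator=LogisticRegression()),
}
for name, model in models.items():
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))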

Ensemble methods are machine learning methods that construct a set of predictive models and combine their outputs into a single prediction. The purpose of combining several models is to achieve better predictive performance, and it has been shown in a number of cases that ensembles can be more accurate than single models. While some work on ensemble methods had already been done in the 1970s, it was not until the 1990s, with the introduction of methods such as bagging and boosting, that ensemble methods started to be more widely used. Today, they represent a standard machine learning approach that has to be considered whenever good predictive accuracy is demanded.

Yes, as stated, ensemble methods are widely used now that ML and advanced learning methods have found so many applications. Because an ensemble repeats the learning process many times on different views of the data, a model that uses bagging or boosting generally has a greater chance of making correct predictions than one that does not. When Linear Regression, Logistic Regression, or a single Decision Tree is used, the model trains just once on the whole set, whereas an ensemble does the task many times, like people who become experts at an activity by doing it many times.

Practice makes perfect, doesn't it? When things are done the right way, without giving up, one day the task becomes much easier and is no longer hard for you to do.

Random Forest is just a group of decision trees combined to give one output or result. What does a decision tree in ML do? If that is your question, you have to go to Decision Trees first, because without the main ingredient, without the base, there is no forest. Click here to know about the Decision tree.

How is one decision tree constructed? Yes, you now know it. At each node it evaluates candidate splits using Gini impurity or entropy, whichever split gives the higher information gain is used at that node, and the process goes on until a full decision tree is formed. Many such decision trees are combined to form Random Forest, which uses the bagging technique: Bootstrapping + Aggregating.
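As a small refresher on those splitting criteria, here is a toy sketch of mine (not from the Decision Tree post) that computes Gini impurity, entropy, and the information gain of one candidate split:

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A parent node with a 50/50 class mix, split into two child nodes
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])

# Information gain = parent entropy minus the weighted entropy of the children
gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
                       - (len(right) / len(parent)) * entropy(right)
print(gini(parent), entropy(parent), gain)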

In simple terms, bootstrapping is taking a sample from the data with replacement, and combining all such samples taken from the population is called bagging. Bagging is an ensemble technique in which different samples are collected for making a decision. In Random Forest we draw many such samples from the population and build a decision tree on each one, not just a single decision tree. The many trees built from the different bootstrap samples are all combined to form the Random Forest: for regression the prediction is the average of the predictions made by the 'n' decision trees, and for classification it is the majority vote. This type of ensembling is called parallel learning, whereas boosting uses sequential learning. Parallel means that, because the trees are trained on different samples, the base learners (the individual decision trees) are independent of each other, and all their outputs are then combined to find the final prediction. Sequential means there is dependence between the base learners, which is what all the boosting algorithms use. Random Forest uses only the bagging technique, which reduces the variance of the model.
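To make Bootstrapping + Aggregating concrete, here is a minimal hand-rolled sketch of the bagging part (my own illustration using sklearn decision trees as base learners; note that a real Random Forest additionally considers only a random subset of features at each split, which this sketch skips):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(10):                                  # 10 independent base learners
    idx = rng.integers(0, len(X), size=len(X))       # bootstrapping: sample rows with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Aggregating: average the trees' predictions (for classification you would take a majority vote)
print(np.mean([tree.predict(X[:5]) for tree in trees], axis=0))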

Random Forest diagram

Bagging helps in reducing variance because a single deep decision tree learns its training data extremely well and therefore has a high chance of overfitting. In bagging, the outputs of multiple trees trained on different samples of the training data are combined, which averages out the individual trees' mistakes and reduces the overall variance.
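One quick way to see that variance reduction in practice is to compare a single deep tree with a bagged ensemble on the same data. This is a toy sketch of mine using sklearn's BaggingRegressor, not code from the article; the exact scores will vary, but the bagged ensemble typically generalizes better:

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=10, noise=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

single_tree = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)
bagged = BaggingRegressor(n_estimators=50, random_state=1).fit(X_train, y_train)

# The lone deep tree fits the noise in its training sample; averaging 50 bootstrapped trees smooths that out
print("single tree R2:", round(single_tree.score(X_test, y_test), 3))
print("bagged trees R2:", round(bagged.score(X_test, y_test), 3))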

Once you know what a Random Forest is, the implementation is just a few lines away.

# X and y would normally be your own feature matrix and target; a toy dataset is used here so the snippet runs
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor   # swap in RandomForestClassifier for classification
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Here n_estimators is the number of trees to be constructed for the model, and random_state fixes the randomness used when drawing the bootstrap samples and choosing features, so the same samples get picked every time you execute the code with the same random state. 😮 Such a useful parameter, yes!!
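For a classification problem the code is almost identical. Here is a small sketch of mine, on a synthetic dataset, that also shows what random_state buys you: two forests built with the same seed produce exactly the same predictions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf_a = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)
clf_b = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

# Same random_state -> same bootstrap samples and feature choices -> identical forests and predictions
print((clf_a.predict(X_test) == clf_b.predict(X_test)).all())   # prints True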

To know more about the parameters in Random Forest, refer to the sklearn site for the Classifier and the Regressor.

Advantages and disadvantages of Random Forest:

Pros:-

  1. It reduces the overfitting seen in single decision trees and helps to improve accuracy
  2. It is flexible: it works for both classification and regression problems
  3. It works well with both categorical and continuous values
  4. It can handle missing values present in the data
  5. Normalizing the data is not required, as it uses a rule-based (threshold-splitting) approach.
  6. It has the power to handle large datasets with high dimensionality. It can handle thousands of input variables and identify the most significant ones, so it is sometimes considered a dimensionality reduction method. Further, the model outputs the importance of each variable, which can be useful for feature selection (see the sketch after this list).
  7. It has methods for balancing errors in data sets where classes are imbalanced.
  8. In bootstrap sampling, roughly one-third of the data is left out of each tree's training sample; these left-out rows are called out-of-bag samples, and the error estimated on them is known as the out-of-bag error. The out-of-bag estimate is about as accurate as using a test set of the same size as the training set, so it removes the need for a set-aside test set (see the sketch after this list).
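Here is the sketch promised in points 6 and 8: a toy example of mine that reads sklearn's feature_importances_ attribute and the out-of-bag score it computes when you pass oob_score=True.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A toy dataset with a few informative features buried among noise features
X, y = make_classification(n_samples=600, n_features=10, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X, y)

print("out-of-bag accuracy:", round(forest.oob_score_, 3))        # error estimate without a separate test set
print("feature importances:", forest.feature_importances_.round(3))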

Cons:-

  1. It surely does a good job at classification, but it is not as good for regression problems because it does not give precise continuous predictions. In the case of regression it cannot predict beyond the range seen in the training data, and it may over-fit data sets that are particularly noisy.
  2. Random Forest can feel like a black-box approach for statistical modelers, as there is very little control over what the model does; at best you can try different parameters and random seeds. Because it is an ensemble of many decision trees, it also loses the easy interpretability of a single tree and makes it harder to see exactly how each variable influences a prediction.
  3. It requires a lot of computational power and resources, as it builds numerous trees and combines their outputs.
  4. It also requires a lot of time for training, as it combines many decision trees to determine the class.

Stand strong, believe in yourself and chase your dreams

There are no limits to what you can achieve except the limits you place on your own thinking. Be limitless. Have the courage to do the impossible and make an impact.

If you have learned anything, let me know by showing some support, and share it with someone who will find it useful. If you have any queries or anything to say, let me know in the comment box. Bring some light to the world. Have a nice day. 🥰
