Random Forest Machine Learning Algorithm

Uma B
2 min read · Dec 9, 2021


Random forest can be used for both classification and regression problems. It works based on bootstrap sampling and is a classic example of a bagging ensemble method.

In a random forest, the decision trees are built in parallel. The training data set is split into bootstrap samples, one per tree, so each tree sees a different sample, although some data points may appear in more than one sample. This is because bootstrapping samples with replacement: each time we draw a data point, we put it back before drawing the next, so the same point can be drawn again.
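Here is a minimal sketch of bootstrap sampling using NumPy (the toy data set and the number of trees are made-up values, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)   # a toy data set of 10 points
n_trees = 3            # one bootstrap sample per tree

# Sampling WITH replacement: some points repeat within a sample,
# and the points never drawn become that tree's out-of-bag data.
for t in range(n_trees):
    sample_idx = rng.choice(len(data), size=len(data), replace=True)
    print(f"tree {t}: sample = {data[sample_idx]}")
```

Notice that each bootstrap sample has the same size as the original data set, but some points appear more than once while others do not appear at all.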

The final output of a random forest classifier is the majority vote of all the decision trees' outputs. For a random forest regressor, it is the mean of all the trees' outputs. Because we aggregate the outputs at the end, the method is called Bootstrap Aggregation (bagging).
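A toy sketch of the aggregation step, assuming we already have each tree's prediction for a single input (the per-tree outputs below are hypothetical values):

```python
from collections import Counter
import numpy as np

# Hypothetical per-tree outputs for one input.
class_votes = ["cat", "dog", "cat", "cat", "dog"]   # classifier trees
tree_values = [3.1, 2.8, 3.4, 3.0, 2.9]             # regressor trees

# Classifier: majority vote across all trees.
majority = Counter(class_votes).most_common(1)[0][0]   # -> "cat"

# Regressor: mean of all the trees' outputs.
mean_pred = np.mean(tree_values)                       # -> 3.04

print(majority, mean_pred)
```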

Each decision tree in the forest is grown to its largest extent. Because we take the aggregate at the end, the overfitting that each individual decision tree suffers from is reduced.

Out-of-Bag Error (OOB error): For each decision tree we take a bootstrap sample, so the remaining data that was not used to train that tree is called its out-of-bag data. This OOB data is used for testing: after each decision tree makes predictions on its OOB data, we calculate the error between the actual values and the predictions, which is called the out-of-bag error (OOB error). The fraction of correct predictions on the OOB data points is called the OOB score.
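In scikit-learn this validation comes almost for free: passing oob_score=True makes the forest evaluate every data point using only the trees that never saw it. A minimal sketch on a synthetic data set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# oob_score=True scores each data point using only the trees
# whose bootstrap samples did not contain it.
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

print("OOB score:", clf.oob_score_)   # accuracy on out-of-bag samples
```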

Hyperparameters in Random Forest:

  1. n_estimators: The number of decision trees used in the random forest. The default value is 100.
  2. max_depth: The maximum depth of each tree. The default value is None; if None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
  3. min_samples_split: The minimum number of samples required to split an internal node. The default value is 2.

There are many parameters we can use to tune a random forest, but the first two above are the most commonly used hyperparameters; a small tuning sketch follows below.
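As a sketch of how these hyperparameters are typically tuned with a grid search (the grid values below are arbitrary examples, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

param_grid = {
    "n_estimators": [100, 200],     # number of trees (default 100)
    "max_depth": [None, 5, 10],     # default None: grow until leaves are pure
    "min_samples_split": [2, 10],   # default 2
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```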

Advantages:

  1. The overfitting seen in a single decision tree is overcome in a random forest, because a random forest is a collection of decision trees and its output is a combination of all the trees' outputs.
  2. Random forest works efficiently with larger data sets as well.
  3. Validation takes place while training the model itself, with the help of the out-of-bag data.
  4. It can be used for both regression and classification.

Disadvantages:

  1. Although it works efficiently on larger data sets, it requires a lot of memory for them.
