Random Forest

Arpit Pathak (ML_with_Arpit_Pathak) · Jun 20, 2020

Hello readers, in this blog we will discuss one of the most preferred algorithms in Machine Learning for getting better results on non-linear data: the Random Forest algorithm. We will try to understand how this algorithm works and how it is applied to both continuous (regression) and classification data.

Random Forest

Random Forest is a supervised machine learning algorithm that works on the principle of ensemble learning and combines multiple decision trees to make predictions. The random forest model is trained using bagging and feature randomness to build each decision tree, so that each tree has very little correlation with the others.


The idea of Random Forest can be explained through its name itself. In a forest there are many trees, each different from the others, with its own characteristics and outputs. The inputs to all the trees are the same, i.e. sunlight, water, soil minerals, etc., but each tree uses these inputs in different quantities and in its own way to produce a different output. This is also the main idea behind the Random Forest algorithm.

In order to understand how random forest works, you first need to understand some basic terminology —

  1. Ensemble Method : An ensemble method trains multiple models on our data and combines the outputs/predictions of these models to get the final prediction. Such methods usually give better predictions than a single model.
  2. Bagging (Bootstrap Aggregating) : Bagging refers to creating multiple samples of the training data by selecting random data points from the original training set with replacement. This concept is used in ensemble methods to create one training sample for each model; because of bootstrapping, a new sample drawn from the original data can contain duplicate data points (see the sketch after this list). Bagging is used to reduce the variance of ensemble sub-models such as Classification and Regression Trees (CART), i.e. decision trees.
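To make the bootstrapping idea concrete, here is a minimal sketch using NumPy (the toy data and seed are just for illustration). Notice how some rows appear more than once in the bootstrap sample while others are left out entirely:

```python
import numpy as np

# A minimal sketch of bootstrap sampling: each "bag" is drawn from the
# original training set with replacement, so duplicates are allowed.
rng = np.random.default_rng(seed=42)

X = np.arange(10).reshape(-1, 1)   # toy training data (10 rows, 1 feature)
y = np.arange(10)                  # toy targets

n_samples = len(X)
indices = rng.integers(0, n_samples, size=n_samples)  # row indices sampled with replacement

X_bag, y_bag = X[indices], y[indices]
print("bootstrap indices:", indices)   # some rows repeat, some never appear
```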

WORKING OF RANDOM FOREST

Random Forest works by ensembling a number of decision trees for prediction. When the initial data is given to the Random Forest, n samples of the data are created using bootstrap aggregating, where n equals the number of decision trees being used. Each sample is given to one decision tree, which predicts an output value, and the predictions of all the decision trees are then combined to get the final output.

Working of Random Forest

Combining continuous outputs : If the predicted output is a continuous value, such as a salary prediction, then the mean (or median) of the predictions of the different decision trees is given as the final output.

Combining categorical outputs : If the predicted output is a categorical value, then the majority vote among the predictions of all the decision trees is given as the final output.
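Putting these pieces together, here is a simplified sketch (using scikit-learn decision trees on toy data, purely for illustration and not a full random forest implementation) that trains each tree on its own bootstrap sample and then combines the predictions — averaging for a continuous target and majority voting for a categorical one:

```python
import numpy as np
from sklearn.datasets import make_regression, make_classification
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

rng = np.random.default_rng(0)

def bootstrap(X, y):
    # Draw one bootstrap sample: rows chosen with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def majority_vote(votes):
    # votes has shape (n_trees, n_points); pick the most frequent label per column.
    return np.array([np.bincount(col).argmax() for col in votes.T])

# --- Regression: average the predictions of the trees ---
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg_trees = [DecisionTreeRegressor(max_depth=5).fit(*bootstrap(X, y)) for _ in range(10)]
reg_pred = np.mean([t.predict(X[:5]) for t in reg_trees], axis=0)
print("regression output (mean of trees):", reg_pred)

# --- Classification: majority vote among the trees ---
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf_trees = [DecisionTreeClassifier(max_depth=5).fit(*bootstrap(X, y)) for _ in range(10)]
votes = np.array([t.predict(X[:5]) for t in clf_trees])
print("classification output (majority vote):", majority_vote(votes))
```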

That is how a random forest works. Now let us look at the parameters we can change to obtain better accuracy from our Random Forest model —

Hyperparameters in Random Forest

Hyperparameters are values that can be changed in order to control the learning process of the model. These parameters are tweaked to make our model more accurate. Let us look at some of the hyperparameters we can tweak in Random Forest (a small scikit-learn example follows the list) —

  1. n_estimators : the number of decision trees to use.
  2. max_features : the maximum number of features considered when splitting a node in each decision tree.
  3. max_depth : the maximum number of levels in each decision tree.
  4. min_samples_split : the minimum number of data points required in a node for it to be split.
  5. min_samples_leaf : the minimum number of data points required in a leaf node of a decision tree.
  6. bootstrap : the sampling method; data points can be sampled with or without replacement.
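As an illustration, here is how these hyperparameters could be set with scikit-learn's RandomForestClassifier on a built-in toy dataset (the specific values chosen are arbitrary examples, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and split it for training / evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The hyperparameters listed above map directly onto the constructor arguments.
model = RandomForestClassifier(
    n_estimators=100,      # number of decision trees in the forest
    max_features="sqrt",   # features considered when looking for the best split
    max_depth=5,           # maximum depth (levels) of each tree
    min_samples_split=4,   # minimum samples required to split an internal node
    min_samples_leaf=2,    # minimum samples required in a leaf node
    bootstrap=True,        # sample the training data with replacement
    random_state=42,
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```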

Advantages of Random Forest

  • Robust to outliers, i.e. less affected by outliers in the data.
  • Can be used for both classification and regression problems.
  • Can handle missing values automatically (in some implementations).
  • Reduces the problem of high variance seen in a single decision tree.

Disadvantages of Random Forest

  • Computationally complex.
  • Requires more time to train than a single decision tree.

That is all for this blog. I hope it was an informative one. Thank you for reading…

//Machine Learning Internship at Internity Foundation
