About Random Forest Algorithms.

Dishant kharkar

Random Forest Algorithm

What is Random Forest?

  • Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique.
  • It can be used for both Classification and Regression problems in ML.
  • It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and improve the model’s performance.
  • Random Forest is an ensemble learning method combining bagging principles and decision trees.
  • As the name suggests, Random Forest is a classifier that builds a number of decision trees on various subsets of the given dataset and combines their outputs (majority vote for classification, average for regression) to improve predictive accuracy.

Before diving into the Random Forest algorithm, we first have to understand ensemble learning and bagging.

Ensemble Learning:

  • Ensemble learning is a machine learning technique that combines multiple individual models (called base models or weak learners) to create a stronger, more accurate model.
  • It is based on the idea that a group of weak learners can work together to make better predictions than a single strong learner.
  • The individual models in an ensemble can be trained using different algorithms or with the same algorithm but on different subsets of the training data.
  • The three main classes of ensemble learning methods are bagging, stacking, and boosting.

In this article, we focus on bagging, since it is the foundation of the Random Forest algorithm.

Bagging:

  • Bagging, which stands for Bootstrap Aggregating, is a popular ensemble learning technique in machine learning.
  • It aims to improve the accuracy and stability of supervised learning models by combining the predictions of multiple base models trained on different subsets of the training data.

The bagging process involves the following steps (a short code sketch of these steps appears below):

  1. Bootstrap Sampling: Given a training dataset of size N, bagging generates multiple random subsets, called bootstrap samples, by sampling N instances with replacement. Each bootstrap sample has the same size as the original dataset, but some instances may be repeated while others may be omitted.
  2. Base Model Training: A base model is trained on each bootstrap sample independently. The base model can be any supervised learning algorithm, such as decision trees, random forests, support vector machines, or neural networks. The goal is to create diverse models that capture different aspects of the data.
  3. Aggregation of Predictions: Once the base models are trained, they are used to make predictions on new, unseen data. In classification tasks, the predictions are often combined using majority voting, where the class with the most votes is selected as the final prediction. For regression tasks, the predictions can be averaged.
  • The key idea behind bagging is that by training multiple models on different bootstrap samples, the ensemble can effectively reduce the variance of the predictions. The individual models may have high variance or overfit the training data, but when combined, they tend to produce more stable and accurate predictions. Additionally, bagging can help in reducing the impact of outliers or noisy data points.
  • Bagging is commonly used with decision trees, resulting in a technique called Random Forests. Random Forests combine bagging with additional randomness in the tree-building process, such as randomly selecting a subset of features at each split. This further enhances the diversity and robustness of the ensemble.
  • Overall, bagging is a powerful technique in machine learning that can be applied to various algorithms to improve prediction accuracy and reduce overfitting. It is widely used in practice and has proven to be effective across different domains and datasets.
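
To make the steps above concrete, here is a minimal sketch of bagging written from scratch, with scikit-learn decision trees as base models. The synthetic dataset, the number of models, and all parameter values are illustrative assumptions, not part of this article.

```python
# A minimal bagging sketch: bootstrap sampling, independent base models,
# and majority-vote aggregation. All settings here are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

n_models = 25
rng = np.random.default_rng(42)
models = []

# Steps 1 & 2: draw a bootstrap sample (with replacement) and train one base model on it.
for _ in range(n_models):
    idx = rng.integers(0, len(X_train), size=len(X_train))
    models.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Step 3: aggregate predictions by majority vote (class labels here are 0/1).
all_preds = np.array([m.predict(X_test) for m in models])
majority_vote = (all_preds.mean(axis=0) >= 0.5).astype(int)
print("Bagged accuracy:", round((majority_vote == y_test).mean(), 3))
```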

Why use Random Forest when we already have Decision Trees?

  • Improved accuracy: Random Forest typically provides better accuracy than a single Decision Tree, especially when dealing with complex datasets. It reduces overfitting by aggregating predictions from multiple trees, which mainly reduces variance (a small comparison sketch follows this list).
  • Reduced overfitting: Decision Trees have a tendency to overfit the training data, meaning they can capture noise or outliers and make overly complex models. Random Forest addresses this issue by combining multiple trees and averaging their predictions, resulting in a more robust and generalized model.
  • Handling high-dimensional data: Random Forest handles high-dimensional data well. It can effectively deal with a large number of input variables without feature selection or dimensionality reduction techniques. In contrast, Decision Trees may struggle with high-dimensional data due to increased complexity and potential overfitting.
  • Feature importance: Random Forest provides a feature importance measure based on the average impurity reduction achieved by each feature across the ensemble of trees. This information can be valuable for feature selection or understanding the importance of different variables in the model.
  • Resistance to noise and outliers: Random Forest is inherently resistant to noise and outliers in the training data. Individual Decision Trees can be sensitive to these data points and may make biased predictions. The ensemble nature of Random Forest reduces the impact of outliers and increases model robustness.
  • Efficient for large datasets: Random Forest can efficiently handle large datasets and perform parallel computing. By constructing multiple trees simultaneously, it can exploit the power of multi-core processors and significantly speed up the training process.
  • Ability to handle missing data: depending on the implementation, Random Forest can cope with missing values with relatively little preprocessing, using the available features to make predictions; note, however, that many common implementations still expect missing values to be imputed before training.
  • However, it’s important to note that Decision Trees also have their advantages, such as being easier to interpret and visualize, and they can capture specific relationships between features and the target variable. In some cases, a Decision Tree may be sufficient if interpretability is crucial, the dataset is small, or there are specific requirements that favour a single tree model.
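
As a quick illustration of the variance-reduction point above, the following sketch compares a single decision tree with a random forest using cross-validation. The synthetic dataset and settings are arbitrary assumptions, not a benchmark; the exact numbers will vary, but the forest usually scores higher.

```python
# Illustrative comparison of a single decision tree vs. a random forest;
# the synthetic dataset and settings are assumptions, not a benchmark.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("Decision Tree CV accuracy:", round(cross_val_score(tree, X, y, cv=5).mean(), 3))
print("Random Forest CV accuracy:", round(cross_val_score(forest, X, y, cv=5).mean(), 3))
```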

How does the Random Forest algorithm work?

Random Forest works in two phases: the first is to create the forest by training N decision trees, and the second is to make predictions by passing new data through each of those trees and combining their votes.

The Working process can be explained in the below steps and diagram:

Step-1: Select K random data points (a bootstrap sample) from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N of decision trees you want to build.

Step-4: Repeat Steps 1 & 2 until N decision trees have been built.

Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority votes.

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images, and this dataset is given to the Random Forest classifier. The dataset is divided into subsets and each subset is given to one decision tree. During the training phase, each decision tree learns to produce its own prediction, and when a new data point arrives, the Random Forest classifier outputs the class chosen by the majority of the trees.
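
The same workflow can be reproduced in a few lines with scikit-learn's RandomForestClassifier. The iris dataset below stands in for the fruit-image example purely for illustration; the class and parameter names are scikit-learn's, and the values are arbitrary.

```python
# A minimal end-to-end sketch of the workflow described above; the iris
# dataset is a stand-in for the fruit example and is purely illustrative.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Steps 1-4: N trees are built on random bootstrap samples of the training data.
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X_train, y_train)

# Step 5: each tree votes on new data points and the majority class wins.
print("Predicted classes:", clf.predict(X_test[:5]))
print("Test accuracy:", round(clf.score(X_test, y_test), 3))
```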

Important Hyperparameters in Random Forest:

Hyperparameters are used in random forests to either enhance the performance and predictive power of models or to make the model faster.

Hyperparameters to Increase the Predictive Power:

  • n_estimators: Number of trees the algorithm builds before averaging the predictions.
  • max_features: Maximum number of features Random Forest considers when splitting a node.
  • bootstrap: Whether each tree is trained on a bootstrap sample drawn with replacement (True) or on the whole training set (False).
  • max_depth: The maximum depth of each tree.
  • min_samples_leaf: The minimum number of samples required to be at a leaf node.
  • criterion: The function used to measure the quality of a split at each node (Gini impurity, entropy, or log loss).
  • max_leaf_nodes: The maximum number of leaf nodes in each tree.

Hyperparameters to Increase the Speed:

  • n_jobs: Tells the engine how many processors it is allowed to use. A value of 1 means it can use only one processor; a value of -1 means there is no limit.
  • random_state: Controls the randomness of the sampling. Given the same value of random_state, the same hyperparameters, and the same training data, the model will always produce the same results.
  • oob_score: OOB stands for out-of-bag. It is a built-in way to validate a random forest: with bootstrap sampling, roughly one-third of the training samples are left out of each tree's bootstrap sample, and these out-of-bag samples are used to evaluate the model's performance. (An illustrative configuration using these parameter names follows below.)
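
For reference, here is how the parameters above map onto scikit-learn's RandomForestClassifier. The specific values are arbitrary examples chosen for illustration, not recommended defaults.

```python
# Illustrative scikit-learn spelling of the hyperparameters listed above;
# the values are arbitrary examples, not recommendations.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=300,      # number of trees
    max_features="sqrt",   # features considered at each split
    bootstrap=True,        # train each tree on a bootstrap sample
    max_depth=10,          # maximum depth of each tree
    min_samples_leaf=2,    # minimum samples required at a leaf node
    criterion="gini",      # split-quality measure: "gini", "entropy", or "log_loss"
    max_leaf_nodes=None,   # no cap on leaf nodes per tree
    n_jobs=-1,             # use all available processors
    random_state=42,       # reproducible sampling
    oob_score=True,        # evaluate on out-of-bag samples after fitting
)
# After clf.fit(X, y), clf.oob_score_ holds the out-of-bag accuracy estimate.
```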

Advantages of Random Forest:

  • Robustness: Random Forest is highly robust to noisy data and outliers. By aggregating predictions from multiple trees, it can mitigate the impact of individual trees that may be biased or overfit to the training data.
  • Feature Importance: Random Forest provides a measure of feature importance, indicating which features contribute the most to the predictions. This information can be valuable for feature selection and for understanding the underlying relationships in the data (see the short sketch after this list).
  • Non-linearity Handling: Random Forest can effectively model complex, non-linear relationships between input features and output variables. It can capture interactions and non-linear patterns that may be challenging for other algorithms, such as linear regression.
  • Scalability: Random Forest can handle large datasets with a high number of features. It can efficiently parallelize the training process, making it suitable for big data applications.
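
As a small example of the feature-importance measure mentioned above, a fitted scikit-learn forest exposes impurity-based importances through its feature_importances_ attribute; the iris dataset here is purely illustrative.

```python
# Reading impurity-based feature importances from a fitted forest;
# the dataset and settings are illustrative.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(data.data, data.target)

# One importance value per feature; the values sum to 1.
for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(f"{name:25s} {importance:.3f}")
```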

Limitations and Considerations:

  • Interpretability: While Random Forest provides useful feature importance measures, the overall model is often considered less interpretable compared to simpler models like decision trees. The inherent complexity of the ensemble makes it challenging to extract concise explanations.
  • Training Time: Random forests can be computationally expensive, especially when dealing with a large number of trees or high-dimensional data. However, advancements in hardware and parallel computing techniques have mitigated this limitation to a large extent.
  • Hyperparameter Tuning: Random Forest has several hyperparameters, such as the number of trees, the depth of trees, and the number of features considered at each split. Proper tuning of these parameters is essential to optimize the model’s performance, which may require additional computational resources (a small grid-search sketch follows this list).
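
One common way to handle the tuning burden mentioned above is a cross-validated grid search. The sketch below uses scikit-learn's GridSearchCV; the parameter grid and the synthetic dataset are illustrative choices, not recommendations.

```python
# A grid-search sketch for tuning the hyperparameters discussed above;
# the grid values and the synthetic dataset are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=15, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10],
    "max_features": ["sqrt", 0.5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=3, n_jobs=-1)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```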

Applications of Random Forest:

Random Forest has found extensive applications across various domains:

  • Finance: It is used for credit scoring, fraud detection, and stock market prediction.
  • Healthcare: Random Forest aids in disease diagnosis, predicting patient outcomes, and identifying risk factors.
  • Image Processing: It is applied in image classification, object recognition, and facial recognition systems.
  • Environmental Sciences: Random Forest is used to predict species distribution, land cover classification, and environmental monitoring.

Conclusion:

  • Random Forest has emerged as a powerful and versatile algorithm in the realm of machine learning.
  • Its ability to handle complex problems and large datasets while providing robust predictions makes it a popular choice among data scientists.
  • While it may have some limitations, proper understanding, parameter tuning, and interpretation techniques can help harness its full potential.
  • As machine learning continues to advance, Random Forest remains a valuable tool for a wide range of applications, empowering us to gain deeper insights from data and make more accurate predictions.

For model implementation in Python: https://github.com/Dishantkharkar/Machine_learning_Models/blob/main/Decision%20tree%20Random%20forest/Decision_Tree_RandomForest.ipynb

If you learned something from this blog, make sure you give it a 👏🏼

Will meet you in some other blog, till then Peace ✌🏼.

Thank You!
