Random Forest

Francesco Arconte
Published in AI Odyssey
Nov 5, 2023


Introduction

Random Forest (or Random Decision Forests) is a machine learning algorithm that works by combining the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems.

The first random decision forests were introduced in 1995 by the Hong Kong computer scientist Tin Kam Ho, whose work set the stage for a fundamental shift in the machine learning field. The algorithm was then developed substantially through the contributions of statisticians Leo Breiman and Adele Cutler. In 2001, Breiman and Cutler introduced a significant extension of the method, combining bagging with random feature selection, which made it considerably more effective and versatile. Recognizing its potential, they trademarked the name in 2006.

As the name suggests, a Random Forest is a classifier made up of a number of decision trees, each built on a different subset of the given dataset, whose outputs are combined to improve predictive accuracy. Instead of relying on a single decision tree, the Random Forest collects the prediction from each tree and, based on the majority vote of those predictions, produces the final output; up to a point, the greater the number of decision trees in the forest, the higher the accuracy.

Graphic summary of how Random Forest works

A familiar application of this idea is classifying an email as “spam” or “not spam”, or predicting the weather.

Example of a Random Forest to predict a destination for a vacation
Credits: Carolina Bento
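
To make the idea concrete, here is a minimal sketch (not taken from the article) using scikit-learn's RandomForestClassifier on a synthetic binary classification task; the dataset, parameter values, and variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a labeled "spam" / "not spam" dataset (assumption).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 100 trees; the forest's prediction is the majority vote of the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```

Each of the 100 trees votes on every test sample, and the class receiving the most votes becomes the forest's prediction.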

Random Forests vs. Decision Trees

The strength of Random Forest lies in its ability to mitigate the limitations of individual decision trees, so while a Random Forest model is a collection of them, there are some important differences. The decision trees within a Random Forest are usually trained with the “bagging” method, short for Bootstrap Aggregation, an ensemble machine learning technique.

Bootstrapping randomly samples rows (and, in Random Forests, features) from the dataset to form a training sample for each model.

Aggregation combines the outputs of the models trained on those samples, typically by averaging their predictions or taking a majority vote.

Bootstrap Aggregation is used to reduce the variance of Random Forests, the error that comes from sensitivity to small fluctuations in the training data. High variance causes “overfitting”: the model corresponds too closely, or even exactly, to one particular dataset and may fail to fit additional data or predict future observations reliably. An overfitted model performs well in training but cannot separate the signal from the noise on an actual test set.

Difference between a single decision tree and a Random Forest
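
The variance-reduction effect can be seen in a small experiment. The following sketch (an assumed setup, not from the article) compares a single fully grown decision tree with a Random Forest using cross-validation; exact scores depend on the data, but the forest typically generalizes better.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumed synthetic dataset for the comparison.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

tree = DecisionTreeClassifier(random_state=0)              # one fully grown tree
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Averaging many de-correlated trees reduces variance, so the forest's
# cross-validated accuracy is usually higher and less spread out across folds.
print("Single tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("Random Forest:", cross_val_score(forest, X, y, cv=5).mean())
```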

Hyperparameter Tuning

In machine learning, a hyperparameter is a parameter whose value is set before training and controls the learning process. As with any complex model, the hyperparameters of a Random Forest are a fundamental tool: carefully tuned, they let the model reach its full potential.

Key Hyperparameters:

1. Number of Trees (n_estimators): defines the number of decision trees in the ensemble. A higher number generally leads to more stable and accurate predictions, at the cost of longer training time; the gains level off beyond a certain point.

2. Tree Depth (max_depth): This parameter determines the maximum depth of each decision tree. A deeper tree can give more accurate predictions but may overfit the training data.

3. Minimum Samples Split (min_samples_split) and Leaf (min_samples_leaf): These hyperparameters control the minimum number of samples required to split an internal node or form a leaf node. Adjusting them helps in preventing the model from being too specific to the training data.

4. Feature Selection (max_features): This parameter sets the number of features considered when searching for the best split at each node. Considering all features makes the trees more similar to one another, which weakens the ensemble effect, while considering too few can leave each tree with too little information to split on.

Balancing these hyperparameters so that accuracy improves without the model memorizing the training data is the key to keeping overfitting under control.
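
A common way to strike that balance is a cross-validated grid search. The sketch below (illustrative parameter values and assumed synthetic data, not from the article) tunes the hyperparameters listed above with scikit-learn's GridSearchCV.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumed synthetic dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Candidate values are illustrative, not recommendations.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
    "max_features": ["sqrt", 0.5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                  # 5-fold cross-validation guards against overfitting
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```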

“Bagging” in detail

The training algorithm for Random Forests employs the general technique of bootstrap aggregating, or bagging, for tree learners.

Given a training set X = x1, …, xn with responses Y = y1, …, yn, bagging repeatedly (B times) selects a random sample with replacement from the training set and fits a tree to each sample. For b = 1, …, B:

1. Sample, with replacement, n training instances from X, Y; call these Xb, Yb.

2. Train a classification or regression tree fb on Xb, Yb.

After training, predictions for an unseen sample x′ are made by averaging the predictions of the individual regression trees on x′, or by taking the majority vote in the case of classification trees.
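
The procedure translates almost line for line into code. Here is a minimal from-scratch sketch of bagging for classification (assumed toy data and helper names; plain bagging only, without the per-split feature sampling that full Random Forests add).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Assumed toy dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

B = 25                      # number of bootstrap samples / trees
n = len(X)
trees = []

for b in range(B):
    # 1. Sample, with replacement, n training instances (X_b, Y_b).
    idx = rng.integers(0, n, size=n)
    X_b, y_b = X[idx], y[idx]
    # 2. Train a tree f_b on (X_b, Y_b).
    trees.append(DecisionTreeClassifier(random_state=b).fit(X_b, y_b))

# Prediction for unseen samples: majority vote over the B trees.
def predict(X_new):
    votes = np.stack([t.predict(X_new) for t in trees])   # shape (B, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

print("Training accuracy of the bagged ensemble:", (predict(X) == y).mean())
```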

Applications

Random Forest is used in lots of sectors to predict behavior and outcomes thanks to its ease of application, adaptability, and ability to perform both classification and regression tasks.

It is already widely used in the financial sector, where Random Forest is employed to predict mortgage defaults, i.e. to determine whether a customer is likely to default or not, and to identify or prevent fraud by analyzing a series of transactions and flagging those likely to be fraudulent. Another significant application is the prediction of future stock prices, where it has often outperformed other prediction tools for stock pricing, option pricing, and credit spread forecasting.
Random Forest also has great potential in healthcare, where it is opening up possibilities for early diagnosis that are far cheaper and more interpretable than neural networks.

It is safe to say that Random Forest will find an even broader range of applications in the future thanks to its versatility, potentially covering every digitalized field that aims to minimize risk and refine its decisions. Which sectors do you think could benefit from Random Forests, and why?
