Random forests

om pramod · Jan 29, 2023

Random Forest is a supervised machine learning algorithm that is used for both classification and regression tasks. It is a type of ensemble learning that combines multiple decision trees to produce a more robust model. In machine learning, an ensemble is a collection of models whose predictions are averaged (or aggregated in some way).

Random Forest uses two techniques, bootstrap sampling and bootstrap aggregation (bagging), to prevent overfitting.

Bootstrap sampling — Bootstrap sampling is a technique for creating multiple random subsets of the training data, sampled with replacement, to train multiple decision trees. A bootstrap sample is a sample of the training dataset in which an observation may appear more than once; this is what “sampling with replacement” means. Each bootstrapped sample has the same size as the original dataset but contains some repeated observations while omitting others.

For example, let’s say we have a sample of 10 data points: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. To generate a bootstrapped sample, we randomly select 10 data points from this original sample with replacement, which means that some data points may be repeated in the bootstrapped sample.

For instance, one possible bootstrapped sample could be [3, 4, 4, 5, 6, 7, 7, 8, 10, 10].
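
To see this in code, here is a minimal sketch using NumPy (the toy sample is the same one as above; the exact draw depends on the random seed):

import numpy as np

rng = np.random.default_rng(42)
original_sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Draw a bootstrapped sample: same size as the original, sampled with replacement,
# so some values appear more than once and others are left out entirely
bootstrapped_sample = rng.choice(original_sample, size=len(original_sample), replace=True)
print(bootstrapped_sample)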

Bootstrap Aggregation (bagging) — In bagging, multiple models are trained on different random samples of the data (obtained using bootstrap sampling), and their predictions are combined to produce a final prediction. Bagging can be used with any type of machine learning model, but is often used with decision trees to produce Random Forest models. In classification problems, the final prediction is based on majority voting among the decision trees. In regression problems, the final prediction is the average of the predictions of all decision trees.
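
To make the aggregation step concrete, here is a minimal sketch with made-up tree predictions (the numbers are purely illustrative):

import numpy as np

# Hypothetical predictions from five decision trees for a single input
class_votes = np.array([1, 0, 1, 1, 0])            # classification: predicted class labels
reg_preds = np.array([2.3, 2.9, 3.1, 2.7, 2.5])    # regression: predicted values

# Classification: majority vote among the trees
values, counts = np.unique(class_votes, return_counts=True)
final_class = values[np.argmax(counts)]            # -> 1

# Regression: average of the tree predictions
final_value = reg_preds.mean()                     # -> 2.7

print(final_class, final_value)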

Each decision tree in a random forest is trained on a different subset of the data and features, so the chances of all decision trees overfitting in the same way are low. Additionally, the randomness in the selection of subsets of data and features introduces more diversity among the decision trees, which further reduces overfitting.

The steps involved in the Random Forest algorithm are:

1. Bootstrap Sampling —

Create multiple bootstrapped samples. In Random Forest, bootstrap sampling is performed “n_estimators” times to create “n_estimators” different bootstrapped samples from the original dataset.

2. Building Decision Trees:

“n_estimators” is a hyperparameter of the Random Forest algorithm. It represents the number of decision trees that the Random Forest builds; its default value in scikit-learn is 100. Each bootstrapped sample is used to train a separate decision tree. The decision trees are grown until they reach the maximum depth specified (max_depth — default value is None; if None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples) or until a node contains fewer than the minimum number of samples required to split it (min_samples_split — default value is 2).
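
For example, a forest with explicit values for these hyperparameters can be created as follows (the values are arbitrary, chosen just to show where each hyperparameter goes):

from sklearn.ensemble import RandomForestClassifier

# 200 trees, each grown to at most depth 10, and a node is split
# only if it contains at least 4 samples
rf = RandomForestClassifier(n_estimators=200, max_depth=10, min_samples_split=4)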

Choosing Random Subset of Features: In each iteration of building a decision tree, a random subset of the total features is selected as the candidate features for splitting at each node. This helps in decorrelating the trees and makes the model more robust to the presence of irrelevant features.

Suppose you have a dataset with 10 features (A, B, C, D, E, F, G, H, I, and J). When building a decision tree in a random forest, instead of using all 10 features at each split, you can randomly choose a subset of features, for example, 3 features, at each split. So, in the first iteration, the selected features could be A, C, and G. In the next iteration, it could be B, D, and H. This process continues until all the trees in the random forest are built.

In scikit-learn’s implementation of the random forest algorithm, the size of the random subset of features is controlled by the “max_features” parameter. By default, it is the square root of the total number of features (“sqrt”; older scikit-learn versions called this default “auto”). However, you can also set it to an integer value, for example:

from sklearn.ensemble import RandomForestClassifier
# Using the default setting for max_features, which is sqrt(total number of features)
rf_clf = RandomForestClassifier(n_estimators=100)
# Using max_features = 3
rf_clf = RandomForestClassifier(n_estimators=100, max_features=3)

By using a random subset of features, random forest reduces overfitting, which can occur when every decision tree is built using all the features. Additionally, this step also reduces computation time, since only a subset of the features is considered at each split.

The best hyperparameters can be selected using techniques like cross-validation or GridSearchCV.

3. Making Predictions — the predictions of the individual decision trees are combined using bagging: majority voting for classification and averaging for regression, as shown in the sketch below.
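
As a minimal sketch (fitting on the iris data that is also used in the full example below), calling predict performs this aggregation internally:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# predict() combines the predictions of all 100 trees behind the scenes
print(rf.predict(X[:5]))         # predicted class labels
print(rf.predict_proba(X[:5]))   # class probabilities averaged across the trees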

Note –

Out-of-Bag (OOB) Error Estimation: Since each decision tree is built on a bootstrapped sample, some samples are left out of that tree’s bootstrapped sample; these are known as Out-of-Bag (OOB) samples. The predictions on these OOB samples can be used to estimate the performance of the model.

oob_score is a parameter of the random forest model in the scikit-learn library. When it is enabled, the fitted model exposes an oob_score_ attribute, which estimates the out-of-bag (OOB) accuracy: each training sample is scored only by the trees that did not see it during training. The OOB estimate is a convenient and efficient way of validating the performance of the random forest model, as it does not require an additional validation dataset or cross-validation. By default, oob_score is set to False in scikit-learn, and it can be turned on by passing oob_score=True when creating the random forest model.

# Initializing the Random Forest classifier
clf = RandomForestClassifier(oob_score=True)
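
A short sketch of how this looks end to end (again on iris; with enough trees, every sample ends up out-of-bag for at least some of them):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

# Accuracy estimated from the out-of-bag samples, without a separate validation set
print(clf.oob_score_)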

Python implementation:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# create a grid of hyperparameters to search over
param_grid = {
    'n_estimators': [100, 150],
    'max_features': ['sqrt', 2],   # 'auto' meant sqrt and is removed in newer scikit-learn
    'max_depth': [3, 6, 9, None],
    'min_samples_split': [2, 4, 6, 7, 9],
    'min_samples_leaf': [1, 2, 3, 4],
    'criterion': ['gini', 'entropy']
}
# oob_score=True is set directly on the estimator below, so it is not part of the grid

# instantiate the RandomForestClassifier
rf = RandomForestClassifier(oob_score=True, random_state=0)

# fit the grid search to the data
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# print the best hyperparameters
print("Best parameters: ", grid_search.best_params_)
# print the oob_score
print("OOB score: ", grid_search.best_estimator_.oob_score_)

The output would look like:

Best parameters: {'criterion': 'gini',
'max_depth': 9,
'max_features': 'sqrt',
'min_samples_leaf': 1,
'min_samples_split': 2,
'n_estimators': 100}
OOB score: 0.9428571428571428

The OOB score here is the accuracy of the model on its out-of-bag samples. The value of 0.9428571428571428 indicates that the model correctly predicts the target variable about 94.29% of the time on samples that were not used to train the corresponding trees. It is a measure of how well the random forest classifier generalizes to new, unseen data.

Note- best_estimator_ returns the trained machine learning model with the best hyperparameters found by the grid search, while best_params_ returns a dictionary that lists the best hyperparameters found by the grid search.
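
For instance, continuing the script above, the refit best estimator can be applied directly to the held-out test set:

# GridSearchCV refits best_estimator_ on the full training set (refit=True by default)
best_rf = grid_search.best_estimator_
print("Test accuracy: ", best_rf.score(X_test, y_test))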

Final Note: Thanks for reading! I hope you find this article informative.

Wanna connect with me? Hit me up on LinkedIn
