Building a Random Forest Model From Scratch with Python

Enozeren
5 min read · Nov 7, 2023


Hi! In this second article of my Decision Tree article series, we will implement a random forest model from scratch in Python.

The Decision Tree Medium Article series:

  1. Building a Decision Tree From Scratch: Check it out first, since we will build our random forest model on top of the DecisionTree class implemented there.
  2. Building a Random Forest From Scratch (This article)
  3. Building AdaBoost (Boosted Trees) From Scratch (TBA)

Short History

Random Forest was first proposed by Tin Kam Ho in the article “Random decision forests” (1995). The algorithm was then extended by Leo Breiman and published in the article “Random Forests” in 2001.

What is Random Forest?

Image 1 — by Author

Let’s first talk about some cons of basic decision trees so we can understand why we need random forest models. Decision trees are prone to overfitting. Even though you can tune some hyperparameters or prune your tree, it can still be hard to overcome the issue. Another issue (related to overfitting) is that even a small change in the training data can produce a completely different tree.

Ensemble models are a good solution for these kinds of problems, and random forest is an ensemble model. Basically, a random forest contains multiple basic decision trees (called “base learners”) and predicts by averaging the predictions of the individual trees. The model is called Random Forest: “forest” because it consists of trees, and “random” because the dataset used to train each tree is randomly sampled with the bootstrapping method.

Building a Random Forest Model From Scratch

In this part we will implement the random forest model from scratch. Below you can see the functions we need for the model. This class won’t need many functions, since it is built on top of the DecisionTree class from the previous article.

As always, we will have a train() function to train the model on a dataset, as well as predict() and predict_proba() functions to make predictions.

Image 2 — Random Forest Model Functions

NOTE: To see the full code, visit the GitHub repository by clicking here. In this article we won’t go over all of the code.

The hyperparameters for the random forest model in this implementation:

  • n_base_learner: number of base learners (basic decision trees)
  • numb_of_features_splitting: number of features considered for splitting at each node of the base learners
  • bootstrap_sample_size: size of the bootstrapped dataset for each base learner
  • max_depth, min_samples_leaf, min_information_gain: hyperparameters of the base learners (see the decision tree article)
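
For context, below is a minimal sketch of how a constructor could wire these hyperparameters together. The class name, defaults, and exact signature here are illustrative assumptions; check the repository for the actual code.

import numpy as np

class RandomForestClassifier:
    def __init__(self, n_base_learner: int = 100, max_depth: int = 5,
                 min_samples_leaf: int = 1, min_information_gain: float = 0.0,
                 numb_of_features_splitting: str = None,
                 bootstrap_sample_size: int = None) -> None:
        # Number of base learners (decision trees) in the forest
        self.n_base_learner = n_base_learner
        # Hyperparameters passed through to each base learner
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.min_information_gain = min_information_gain
        self.numb_of_features_splitting = numb_of_features_splitting
        # Size of each bootstrap sample; None means "use the full dataset size"
        self.bootstrap_sample_size = bootstrap_sample_size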

Bootstrapping

When training the model, we create as many bootstrap samples as the n_base_learner hyperparameter specifies, each with a size equal to the bootstrap_sample_size hyperparameter.

Image 3 — by Author from A/B Testing with Bootstrapping Article

Bootstrapping is basically sampling with replacement (see Image 3).
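
As a quick, self-contained illustration (not part of the model code), drawing five row indices from a dataset of five rows with replacement may repeat some indices and skip others:

import numpy as np

# Draw 5 row indices from range(5) with replacement:
# some indices can appear multiple times, others not at all
idx = np.random.choice(5, size=5, replace=True)
print(idx)  # e.g. [3 0 3 4 1] (output varies from run to run)

In our model, this is done once per base learner: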

def _create_bootstrap_samples(self, X, Y) -> tuple:
    """
    Creates bootstrap samples for each base learner
    """
    bootstrap_samples_X = []
    bootstrap_samples_Y = []

    for i in range(self.n_base_learner):

        # Default to the full dataset size if no sample size was given
        if not self.bootstrap_sample_size:
            self.bootstrap_sample_size = X.shape[0]

        # Sample row indices with replacement (bootstrapping)
        sampled_idx = np.random.choice(X.shape[0], size=self.bootstrap_sample_size, replace=True)
        bootstrap_samples_X.append(X[sampled_idx])
        bootstrap_samples_Y.append(Y[sampled_idx])

    return bootstrap_samples_X, bootstrap_samples_Y
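
Note that when bootstrap_sample_size is not set, it defaults to the number of training rows, which is the classic bootstrap setting: each base learner sees a sample of size N drawn with replacement, so on average only about 63.2% of the unique rows appear in a given sample and the rest are left “out of bag”.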

Training

With the bootstrap samples, we train n_base_learner base learners using the function below.

def train(self, X_train: np.array, Y_train: np.array) -> None:
    """Trains the model with given X and Y datasets"""
    bootstrap_samples_X, bootstrap_samples_Y = self._create_bootstrap_samples(X_train, Y_train)

    self.base_learner_list = []
    for base_learner_idx in range(self.n_base_learner):
        # Each base learner is a DecisionTree from the previous article
        base_learner = DecisionTree(max_depth=self.max_depth,
                                    min_samples_leaf=self.min_samples_leaf,
                                    min_information_gain=self.min_information_gain,
                                    numb_of_features_splitting=self.numb_of_features_splitting)

        # Train each tree on its own bootstrap sample
        base_learner.train(bootstrap_samples_X[base_learner_idx], bootstrap_samples_Y[base_learner_idx])
        self.base_learner_list.append(base_learner)

    # Calculate feature importance
    self.feature_importances = self._calculate_rf_feature_importance(self.base_learner_list)

After training, we basically have a list of trained base learners.
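
The train() function above also calls _calculate_rf_feature_importance(), which is not shown here. A minimal sketch, under the assumption that each trained DecisionTree exposes a feature_importances array with one value per feature (as in the decision tree article’s implementation), would simply average the importances across base learners:

def _calculate_rf_feature_importance(self, base_learners: list) -> np.array:
    """Averages feature importances over all base learners"""
    # Stack per-tree importances into a (n_base_learner, n_features) array
    all_importances = np.array([bl.feature_importances for bl in base_learners])
    # The forest's importance for a feature is its mean importance over all trees
    return all_importances.mean(axis=0)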

Predictions

To make a prediction with the list of trained base learners, we average each base learner’s predicted probabilities for every class. This average is the random forest model’s predicted probability.

def _predict_proba_w_base_learners(self, X_set: np.array) -> list:
    """
    Creates list of predictions for all base learners
    """
    pred_prob_list = []
    for base_learner in self.base_learner_list:
        pred_prob_list.append(base_learner.predict_proba(X_set))

    return pred_prob_list

def predict_proba(self, X_set: np.array) -> list:
    """Returns the predicted probs for a given data set"""
    pred_probs = []
    base_learners_pred_probs = self._predict_proba_w_base_learners(X_set)

    # Average the predicted probabilities of base learners
    for obs in range(X_set.shape[0]):
        # Collect this observation's probabilities from every base learner
        base_learner_probs_for_obs = [a[obs] for a in base_learners_pred_probs]
        # Calculate the average for each class index
        obs_average_pred_probs = np.mean(base_learner_probs_for_obs, axis=0)
        pred_probs.append(obs_average_pred_probs)

    return pred_probs
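
Finally, predict() can be a thin wrapper around predict_proba(): for each observation, pick the class with the highest average probability. A minimal sketch, assuming class labels are the integer indices 0..K-1 as in the decision tree article:

def predict(self, X_set: np.array) -> np.array:
    """Returns predicted labels for a given data set"""
    pred_probs = self.predict_proba(X_set)
    # For each observation, choose the class with the highest
    # average predicted probability across base learners
    return np.argmax(pred_probs, axis=1)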

Performance of Random Forest Model

To evaluate the performance of the random forest, we compare its results with the basic decision tree model from the previous article. You can also check the “modelling_examples.ipynb” file in the repository. In the table below I have gathered all the results from the notebook.

Table 1 — Results

We can see that our Random Forest model outperforms the Decision Tree on all datasets.
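
If you want to reproduce such a comparison end to end, a usage sketch could look like the one below. Here sklearn is used only for the dataset, the split, and the accuracy metric; RandomForestClassifier refers to the from-scratch class assumed in this article (not sklearn’s), and the hyperparameter values are arbitrary.

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small benchmark dataset and split it into train/test sets
X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Train the from-scratch random forest and evaluate it
rf = RandomForestClassifier(n_base_learner=50, max_depth=5)
rf.train(X_train, Y_train)
preds = rf.predict(X_test)
print("Random Forest accuracy:", accuracy_score(Y_test, preds))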

Conclusion

With the random forest model implemented from scratch, only one implementation remains: AdaBoost. Check out my page for the upcoming articles in this series.

The Decision Tree Medium Article series:

  1. Building a Decision Tree From Scratch
  2. Building a Random Forest From Scratch (This article)
  3. Building AdaBoost (Boosted Trees) From Scratch (TBA)

References

  1. Scikit-learn documentation: “sklearn.ensemble.RandomForestClassifier”, scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. Accessed 7 Nov. 2023.
  2. Introduction to Machine Learning (I2ML), lecture material by a team at LMU Munich, slds-lmu.github.io/i2ml/. Accessed 7 Nov. 2023.
  3. Ho, Tin Kam. “Random decision forests.” Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, IEEE, 1995.
  4. Breiman, Leo. “Random forests.” Machine Learning 45 (2001): 5–32.

Thanks for Reading

Let me know if you have any comments 🙂
