10 Must-Know Models for ML Beginners: Random Forest

Dagang Wei
5 min read · Feb 23, 2024


This article is part of the series 10 Must-Know Models for ML Beginners.

Introduction

Machine learning is transforming the way we interact with data, and among the many algorithms at our disposal, the Random Forest model stands out for its power and versatility. If you’re exploring machine learning, understanding Random Forests is a great way to gain insight into the world of predictive modeling. Let’s demystify this powerful technique!

What is a Random Forest?

In essence, a Random Forest is an ensemble learning method, meaning it harnesses the collective wisdom of multiple decision trees. Decision trees are simple models that learn to split data into groups based on a series of yes/no questions about the features. Think of a decision tree as a branching flowchart guiding your data towards specific outcomes.

A Random Forest trains many such decision trees, each on a slightly different, randomized view of the data, and combines their outputs. The final prediction is a “majority vote” (for classification) or an “average” (for regression) across all the individual trees.

Why Use a Random Forest?

  • Robustness: Random Forests often deliver excellent accuracy on a wide range of problems. They don’t demand meticulous data preparation and are resistant to overfitting (when a model becomes overly tuned to the training data).
  • Versatility: They can handle both classification tasks (predicting a category) and regression tasks (predicting a continuous value).
  • Interpretability: While not as directly interpretable as a single decision tree, Random Forests provide a way to assess “feature importance,” showing which input variables have the strongest impact on predictions (see the sketch below).
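
As an illustration, scikit-learn exposes this through the feature_importances_ attribute; here is a minimal sketch (the dataset and hyperparameter values are chosen only for demonstration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Fit a forest and inspect how much each feature contributes to the splits
data = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(data.data, data.target)
for name, score in zip(data.feature_names, model.feature_importances_):
    print(f'{name}: {score:.3f}')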

How Does a Random Forest Work?

Let’s break down the key ideas (a minimal end-to-end example follows the list):

  1. Bootstrap Sampling: Instead of training each decision tree on the entire dataset, a Random Forest draws smaller samples of data with replacement (meaning a data point can be selected multiple times). This creates diversity among the trees.
  2. Random Feature Selection: At each step where a decision tree considers splitting the data, only a random subset of the available features is considered. This prevents a few strong features from dominating every tree.
  3. Building the Forest: Many decision trees are trained independently using the techniques above.
  4. Making Predictions: For new data, each tree in the forest makes a prediction. The Random Forest’s final output is the most common prediction (classification) or the average (regression) of all the individual tree predictions.
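
To see these four steps in action before building anything by hand, here is a minimal sketch using scikit-learn’s RandomForestClassifier (the hyperparameter values are illustrative; bootstrap sampling is enabled by default):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# n_estimators = number of trees; max_features = size of the random feature subset per split
model = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)
model.fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))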

Constructing a Random Forest

Here’s a breakdown of the general steps involved in building a Random Forest model:

Data Preparation: Ensure your data is in a suitable format, handle missing values, and consider feature scaling if necessary.
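
For example, missing values can be imputed and features scaled with scikit-learn’s preprocessing utilities; a minimal sketch with a made-up feature matrix (note that Random Forests themselves are insensitive to feature scaling, so that step is optional):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with one missing value
X = np.array([[5.1, 3.5], [4.9, np.nan], [6.2, 3.4]])

X = SimpleImputer(strategy='mean').fit_transform(X)  # fill missing values with the column mean
X = StandardScaler().fit_transform(X)                # optional for tree-based models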

Bootstrap Sampling: Repeatedly draw random samples (with replacement) from your dataset. Each of these samples will be used to train a single decision tree.
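
A bootstrap sample is straightforward to draw with NumPy; a minimal sketch:

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

# Draw row indices with replacement: some rows appear several times, others not at all
idxs = rng.integers(0, len(X), size=len(X))
X_boot, y_boot = X[idxs], y[idxs]   # training set for one tree of the forest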

Feature Subset Selection: At each node (decision point) of a decision tree, randomly choose a subset of features to consider for the best split. This differs from a regular decision tree, which examines all features.
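
In code, the per-node restriction can be as simple as the following sketch; a common default is to consider roughly the square root of the total number of features at each split:

import numpy as np

n_features = 4                       # e.g. the four iris measurements
rng = np.random.default_rng(0)

# At this node, only a random subset of features is evaluated for the best split
k = max(1, int(np.sqrt(n_features)))
candidate_features = rng.choice(n_features, size=k, replace=False)
print(candidate_features)            # indices of the features considered at this node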

Decision Tree Building: Construct decision trees using the bootstrap samples and restricted feature sets. Individual trees are typically grown deep, with little or no pruning; each tree may overfit on its own, but averaging across the whole forest cancels out much of that variance.
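
Continuing the sketch from the bootstrap-sampling step above, one tree of the forest could be grown like this with scikit-learn (max_features='sqrt' applies the per-split feature restriction; max_depth=None lets the tree grow deep):

from sklearn.tree import DecisionTreeClassifier

# One member of the forest: an unpruned tree trained on a bootstrap sample,
# with a random subset of features considered at every split
tree = DecisionTreeClassifier(max_depth=None, max_features='sqrt', random_state=0)
tree.fit(X_boot, y_boot)   # X_boot, y_boot from the bootstrap-sampling sketch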

Forest Formation: Repeat the bootstrap sampling, feature subset selection, and tree-building steps many times to build your ensemble of trees, your “forest.”

Prediction: To make predictions on new data:

  • Pass the data point through each tree in the forest.
  • Aggregate the predictions from all trees: the majority vote wins for classification, and the average of predictions is used for regression (a small sketch follows).
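
A minimal sketch of the aggregation step, assuming the per-tree predictions for a single sample have already been collected in NumPy arrays (the values are made up):

import numpy as np

class_preds = np.array([0, 2, 2, 1, 2])        # classification: one class label per tree
value_preds = np.array([3.1, 2.8, 3.4, 3.0])   # regression: one continuous value per tree

print(np.bincount(class_preds).argmax())   # majority vote -> 2
print(value_preds.mean())                  # average -> 3.075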

Limitations of Random Forests

  • Complexity: They can be slower to train and use than simpler models, especially with very large datasets.
  • Interpretability Trade-off: While they offer some insights, the combined decision-making of many trees can be harder to fully understand than the logic of a single decision tree.

Example: Iris Classification with Random Forest

We will use the Iris dataset, a classic dataset in machine learning, for our example. It consists of 150 samples of iris flowers, each described by four features (sepal length, sepal width, petal length, and petal width) and a target variable indicating the species of the iris (setosa, versicolor, or virginica).

Our goal will be to predict the species of iris flowers based on these features. We will:

  1. Implement a simple decision tree as the base estimator.
  2. Build a random forest by creating a collection of decision trees trained on random subsets of the dataset.
  3. Evaluate the random forest’s performance using accuracy.
  4. Visualize the importance of each feature in the dataset.

The full code is shown below (it is also available as a Colab notebook):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Decision Tree (simplified for use within Random Forest)
class DecisionTree:
    def __init__(self, max_depth=5, depth=0):
        self.max_depth = max_depth
        self.depth = depth
        self.left = None
        self.right = None
        self.best_feature = None
        self.split_value = None
        self.leaf_value = None

    def fit(self, X, y):
        # Stop splitting when the node is pure or the depth limit is reached
        if len(np.unique(y)) == 1 or self.depth == self.max_depth:
            self.leaf_value = np.bincount(y).argmax()
            return
        # Search all features and thresholds for the split with the lowest Gini impurity
        best_gini = 1.0
        for feature_index in range(X.shape[1]):
            possible_values = np.unique(X[:, feature_index])
            for value in possible_values:
                left_idxs = np.where(X[:, feature_index] < value)
                right_idxs = np.where(X[:, feature_index] >= value)
                if len(left_idxs[0]) == 0 or len(right_idxs[0]) == 0:
                    continue
                gini = self._gini(y[left_idxs], y[right_idxs])
                if gini < best_gini:
                    best_gini = gini
                    self.best_feature = feature_index
                    self.split_value = value
        if self.best_feature is not None:
            left_idxs = np.where(X[:, self.best_feature] < self.split_value)
            right_idxs = np.where(X[:, self.best_feature] >= self.split_value)
            self.left = DecisionTree(max_depth=self.max_depth, depth=self.depth + 1)
            self.right = DecisionTree(max_depth=self.max_depth, depth=self.depth + 1)
            self.left.fit(X[left_idxs], y[left_idxs])
            self.right.fit(X[right_idxs], y[right_idxs])
        else:
            # Fallback: no valid split was found, so turn this node into a leaf
            self.leaf_value = np.bincount(y).argmax()

    def _gini(self, left_y, right_y):
        # Weighted Gini impurity of the two child nodes
        left_score = 1 - sum([(np.count_nonzero(left_y == c) / len(left_y)) ** 2 for c in np.unique(left_y)])
        right_score = 1 - sum([(np.count_nonzero(right_y == c) / len(right_y)) ** 2 for c in np.unique(right_y)])
        return (len(left_y) * left_score + len(right_y) * right_score) / (len(left_y) + len(right_y))

    def predict(self, X):
        # X is a single sample; walk down the tree until a leaf is reached
        if self.leaf_value is not None:
            return self.leaf_value
        elif X[self.best_feature] < self.split_value:
            return self.left.predict(X)
        else:
            return self.right.predict(X)

# Random Forest
class RandomForest:
    def __init__(self, n_estimators=10, max_depth=5):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.trees = []

    def fit(self, X, y):
        # Train each tree on its own bootstrap sample (drawn with replacement)
        for _ in range(self.n_estimators):
            idxs = np.random.choice(len(X), len(X))
            tree = DecisionTree(max_depth=self.max_depth)
            tree.fit(X[idxs], y[idxs])
            self.trees.append(tree)

    def predict(self, X):
        # Collect every tree's prediction for every sample, then take a majority vote per sample
        tree_preds = np.array([tree.predict(x) for tree in self.trees for x in X])
        tree_preds = tree_preds.reshape(self.n_estimators, X.shape[0])
        return np.array([np.bincount(tree_preds[:, i]).argmax() for i in range(X.shape[0])])

# Train and evaluate the Random Forest
rf = RandomForest(n_estimators=100, max_depth=5)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy}')

# Feature Importance Visualization (simplified approach)
importances = np.zeros(X_train.shape[1])
for tree in rf.trees:
    if tree.best_feature is not None:
        importances[tree.best_feature] += 1
importances /= rf.n_estimators

plt.bar(range(X_train.shape[1]), importances)
plt.xticks(range(X_train.shape[1]), ['sepal length', 'sepal width', 'petal length', 'petal width'])
plt.ylabel('Importance')
plt.title('Feature Importance in Random Forest')
plt.show()

Output:

Accuracy: 1.0
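
For comparison, scikit-learn’s built-in RandomForestClassifier can be run on the same split (reusing X_train, y_train, X_test, y_test and accuracy_score from the code above); the exact score depends on the random seed:

from sklearn.ensemble import RandomForestClassifier

sk_rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
sk_rf.fit(X_train, y_train)
print('sklearn accuracy:', accuracy_score(y_test, sk_rf.predict(X_test)))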

Conclusion

Random forests are a valuable tool for any data scientist or machine learning enthusiast. Their power, accuracy, and relative ease of use make them incredibly popular. If you’re working with tabular data (think spreadsheets) and seek a robust model to make predictions, give Random Forests a try!
