Surface Level Understanding of Random Forest Regression

Viswa · 6 min read · Aug 15, 2023

Random Forest Regression is a powerful machine learning algorithm that combines the principles of decision trees and ensemble learning to perform regression tasks. It adapts the Random Forest algorithm, best known for classification, to the prediction of continuous values. Random Forest Regression has gained popularity across many domains because it handles complex regression problems well and produces reliable predictions.

Table of Contents

  1. Introduction
  2. Random Forest Regression
  3. Random Forest Regression Practical Implementation
  4. Advantages of Random Forest Regression
  5. Conclusion

Introduction

To grasp the concept of Random Forest Regression, it’s essential to understand ensemble learning and decision trees. Ensemble learning refers to the process of combining multiple machine learning models to make more accurate and robust predictions. Decision trees, on the other hand, are hierarchical structures that partition the feature space into segments, making them suitable for both classification and regression tasks.

Working of the algorithm

Building Decision Trees in Random Forest Regression

Random Forest Regression creates an ensemble of decision trees. The process starts by constructing a predefined number of decision trees. Each tree is trained on a bootstrap sample of the training data, that is, rows drawn at random with replacement, so the same sample can appear in several trees while others are left out entirely. Additionally, for each tree, a random subset of features (predictor variables) is considered at each split. This randomness ensures diversity among the trees, making the ensemble more robust and less prone to overfitting.
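To make this concrete, here is a minimal NumPy sketch of the two sources of randomness; the array sizes and variable names are illustrative only, not part of any library API.

import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 1000, 8

# Bootstrap sample: draw row indices with replacement, so some rows
# appear several times in this tree's training set and others not at all.
row_idx = rng.integers(0, n_samples, size=n_samples)

# Feature subsampling: at each split, only a random subset of features
# is considered as split candidates (here, roughly one third of them).
candidate_features = rng.choice(n_features, size=n_features // 3, replace=False)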

Training Decision Trees

Once the decision trees are created, they are trained independently using a process called recursive partitioning. Recursive partitioning repeatedly splits the data on whichever condition most improves the fit, typically the split that minimizes the mean squared error (equivalently, the variance) of the target within the resulting nodes. Each split creates branches in the decision tree until a stopping criterion is met, such as reaching a maximum depth or a minimum number of samples per leaf.
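As a rough illustration of how a single split is chosen, the sketch below scans candidate thresholds on one feature and keeps the one with the lowest weighted mean squared error of the two child nodes; real implementations are far more efficient, and the function name here is made up for this example.

import numpy as np

def best_split(x, y):
    # Try each observed value of the feature as a split threshold and
    # keep the one that minimizes the weighted MSE of the child nodes.
    best_t, best_mse = None, np.inf
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        mse = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if mse < best_mse:
            best_t, best_mse = t, mse
    return best_t, best_mse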

Making Predictions with Random Forest Regression

After training the decision trees, Random Forest Regression makes predictions by aggregating the outputs of all the trees. For regression tasks, the predictions are typically obtained by averaging the predicted values from individual trees. By combining the predictions of multiple trees, Random Forest Regression leverages the collective knowledge of the ensemble, resulting in more accurate and stable predictions compared to a single decision tree.
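With a fitted scikit-learn forest (such as the one built in the next section), this averaging can be reproduced by hand from the model's estimators_ attribute; the snippet below is just a sanity check and assumes regressor and X_test already exist.

import numpy as np

# Each entry of estimators_ is one fitted decision tree; averaging their
# per-tree predictions reproduces regressor.predict(X_test).
per_tree = np.stack([tree.predict(X_test) for tree in regressor.estimators_])
manual_pred = per_tree.mean(axis=0)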

Random Forest Regression Practical Implementation

This section shows how to perform random forest regression in Python.

We will analyze data from a combined cycle power plant to attempt to build a predictive model for output power.

Step 1: Importing Python Libraries

The first step is to start your Jupyter notebook and load all the prerequisite libraries. Here are the important libraries we will need for this random forest regression.

  • NumPy (to perform certain mathematical operations)
  • pandas (to store the data in pandas DataFrames)
  • matplotlib.pyplot (you will use matplotlib to plot the data)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2: Loading the Dataset

Let us now import the data into a DataFrame. A DataFrame is a tabular data structure provided by pandas; the simplest way to understand it is that it stores all your data in rows and columns, like a table.

df = pd.read_csv('Data[1].csv')   # load the dataset into a DataFrame
df.head()                         # preview the first five rows

X = df.iloc[:, :-1].values        # all columns except the last as features
y = df.iloc[:, -1].values         # last column as the target (output power)

Step 3: Splitting the dataset into the Training and Test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

The first line imports the function train_test_split from the sklearn.model_selection module. This module provides various methods for splitting data into subsets for model training, evaluation, and validation.

Here, X and y represent your input features and corresponding target values, respectively. The test_size parameter specifies the proportion of the data that should be allocated for testing. In this case, test_size=0.25 means that 25% of the data will be reserved for testing, while the remaining 75% will be used for training.

The random_state parameter is an optional argument that allows you to set a seed value for the random number generator. By providing a specific random_state value (e.g., random_state=42), you ensure that the data is split in a reproducible manner.

The train_test_split function returns four separate arrays: X_train, X_test, y_train, and y_test. X_train and y_train represent the training data, while X_test and y_test represent the testing data.
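A quick way to confirm the 75/25 split is to inspect the shapes of the resulting arrays:

# Sanity check: roughly 75% of rows land in training, 25% in test.
print(X_train.shape, X_test.shape)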

Step 4: Training the Random Forest Regression model on the Training set

from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=100)  # an ensemble of 100 trees
regressor.fit(X_train, y_train)                      # train the forest on the training set

The first line imports the RandomForestRegressor class from the sklearn.ensemble module. This class is part of the scikit-learn library and provides the implementation of a random forest regressor, which is an ensemble learning technique based on multiple decision tree regressors.

The second line creates an instance of the RandomForestRegressor class and assigns it to the variable named regressor. This instance represents the random forest regression model that will be trained on the data. The n_estimators parameter is set to 100, which means the random forest will consist of 100 decision trees.

The third line trains the random forest regression model using the fit() method. It takes two parameters, X_train and y_train.

By calling the fit() method, the model learns from the provided training data: each of the 100 trees is grown on its own bootstrap sample, and together they capture the relationships between the features and the target values.
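Once fitting is done, the individual trees are exposed on the model, which makes it easy to confirm the shape of the forest; the exact depths will vary from run to run.

print(len(regressor.estimators_))            # 100 fitted decision trees
print(regressor.estimators_[0].get_depth())  # depth of the first tree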

Step 5: Predicting the Test set results

y_pred = regressor.predict(X_test)   # predict on the held-out test set
np.set_printoptions(precision=2)     # display floating-point numbers with two decimals
print(y_pred)

This line of code uses the predict method of the trained regressor object to generate predictions for the test data X_test. The predict() method takes the input features (X_test) as an argument and returns the predicted values for the target variable (y_pred).
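A common way to eyeball the quality of the fit is to print the predictions next to the actual test targets, one pair per row:

# Stack predicted and actual values side by side for visual comparison.
print(np.concatenate((y_pred.reshape(-1, 1), y_test.reshape(-1, 1)), axis=1))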

Step 6: Evaluating the Model Performance

from sklearn.metrics import r2_score

r2_score(y_test, y_pred)   # note the order: true values first, predictions second

This code imports the r2_score function from scikit-learn's metrics module. The r2_score function is a standard evaluation metric for regression models, including random forest regression. It measures the proportion of the variance in the target variable that is predictable from the input features. Note that it expects the true values as the first argument and the predictions as the second.

A higher R-squared score indicates a better fit of the regression model to the data: 1 represents a perfect fit, 0 means the model does no better than always predicting the mean of the target, and the score can even be negative when the model does worse than that baseline.

An R-squared score of 0.9603 for the regressor indicates that approximately 96.03% of the variance in the target variable is explained by the random forest's predictions, which is a very good fit of the model to the data.
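For reference, R-squared can also be computed by hand from its definition, one minus the ratio of the residual sum of squares to the total sum of squares; this should match the r2_score result above.

ss_res = np.sum((y_test - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot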


Advantages of Random Forest Regression

  1. Robustness against overfitting: Random Forest Regression reduces overfitting through its ensemble of decision trees and the random selection of features and training samples.
  2. Handling of missing values: in practice, missing values are imputed (for example with column means or medians) before training, and the ensemble's averaging keeps predictions robust even when the data was incomplete; a sketch of this preprocessing step follows the list.
  3. Capturing complex relationships: Random Forest Regression can capture complex non-linear relationships between features and the target variable, uncovering interactions and patterns that may not be evident with simpler regression models.
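As a sketch of the imputation mentioned in point 2, here is how missing values could be filled with column means using scikit-learn's SimpleImputer before training; the forest would then be fit on the imputed arrays.

from sklearn.impute import SimpleImputer

# Replace missing entries (NaN) with the mean of each column,
# learned on the training set and applied to both splits.
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)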

Conclusion

Random Forest Regression is a versatile machine learning algorithm that combines the power of decision trees and ensemble learning. By creating an ensemble of decision trees and leveraging their collective knowledge, Random Forest Regression can tackle complex regression tasks, remain robust on noisy or incomplete data, and produce reliable predictions. Its ability to capture non-linear relationships makes it valuable across various domains, enabling better decision-making and prediction accuracy.
