Random Forest Classification in Python

Shuvrajyoti Debroy
Feb 9, 2023


Machine Learning Classification Algorithm


Introduction

Random Forest is a popular machine learning algorithm that is used for classification and regression analysis. It is an ensemble of decision trees that work together to make more accurate predictions than a single decision tree.

The algorithm works by building multiple decision trees, each trained on a random (bootstrap) sample of the data and considering a random subset of the features at each split. The predictions from the individual trees are combined by taking the majority vote for classification tasks or by averaging for regression tasks. Combining many trees in this way helps to reduce overfitting and increase the overall accuracy of the model.
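
To make the voting step concrete, here is a minimal sketch on synthetic data (not the dataset used later in this article): it fits a small forest and shows how the individual trees' predictions for one sample are combined. Note that scikit-learn's RandomForestClassifier actually averages the trees' predicted class probabilities rather than counting hard votes, which usually gives the same result.

# Toy sketch: how trees in a forest combine predictions (synthetic data)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_toy, y_toy)

# Each tree predicts independently for the first sample ...
votes = np.array([tree.predict(X_toy[:1])[0] for tree in forest.estimators_], dtype=int)
# ... and the classification is the class receiving the most votes
print(votes, '-> majority vote:', np.bincount(votes).argmax())
print('forest.predict():', forest.predict(X_toy[:1]))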

Random Forest can handle both continuous and categorical features and can be used for both binary classification and multi-class classification. In addition, the algorithm is robust to irrelevant features, meaning that it can still produce accurate predictions even if some of the features are not useful for the classification task.

Image Source: Semantic Scholar

Implement Random Forest Classification in Python

In this example, we will use the Social Network Ads dataset, which records the Gender, Age, and Estimated Salary of several users. Based on these features, we will classify whether or not each user is likely to purchase the insurance.

Step 1: Import libraries

We need Pandas for data manipulation, NumPy for numerical calculations, and Matplotlib and Seaborn for visualizations. The scikit-learn (sklearn) modules are used for the machine learning operations.

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

Step 2: Import data

Download the dataset from here, upload it to your notebook, and read it into a pandas DataFrame.

# Read dataset
df_net = pd.read_csv('/content/Social_Network_Ads.csv')
df_net.head()

Step 3: Data Analysis / Preprocessing

Exploratory Data Analysis (EDA) is a process of analyzing and summarizing the main characteristics of a dataset, with the goal of gaining insight into the underlying structure, relationships, and patterns within the data. EDA helps to identify important features, anomalies, and trends in the data that can inform further analysis and modeling.

EDA typically involves several key steps, including:

  • Data cleaning and preparation involve removing missing or incorrect values, transforming variables, and handling outliers.
  • Data visualization is the process of creating graphs, charts, and other visual representations of the data to help identify patterns, relationships, and anomalies.
  • Statistical analysis involves applying mathematical and statistical methods to the data to identify important features and relationships.

Preprocessing aims to prepare the data in a way that will enable effective analysis and modeling and remove any biases or errors that may affect the results.
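
For the dataset loaded above, a few quick checks of this kind might look like the following (the column names are those in Social_Network_Ads.csv):

# Quick EDA checks: structure, missing values, and target balance
df_net.info()                              # column types and non-null counts
print(df_net.isnull().sum())               # missing values per column
print(df_net['Purchased'].value_counts())  # class distribution of the target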

Get required data

We don’t need the User ID column so we can drop it.

# Get required data
df_net.drop(columns = ['User ID'], inplace=True)
df_net.head()

Describe data

Get a statistical description of the data using Pandas’ describe() function. It shows the count, mean, standard deviation, and range of each numeric column.

# Describe data
df_net.describe()

Distribution of data

Check data distribution.

# Salary distribution
sns.histplot(df_net['EstimatedSalary'], kde=True)

Label encoding

Label encoding is a preprocessing technique in machine learning and data analysis where categorical data is converted into numerical values, to make it compatible with mathematical operations and models.

The categorical data is assigned an integer value, typically starting from 0, and each unique category in the data is given a unique integer value so that the categorical data can be treated as numerical data.

# Label encoding
le = LabelEncoder()
df_net['Gender']= le.fit_transform(df_net['Gender'])
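
To see which integer the encoder assigned to each category, you can inspect the fitted encoder (here the expected mapping is Female → 0, Male → 1, since classes are sorted alphabetically):

# Inspect the label encoding learned for Gender
print(dict(zip(le.classes_, le.transform(le.classes_))))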

Correlation matrix

A correlation matrix is a table that summarizes the relationship between multiple variables in a dataset. It shows the correlation coefficients between each pair of variables, which indicate the strength and direction of the relationship between the variables. It is useful for identifying highly correlated variables and selecting a subset of variables for further analysis.

The correlation coefficient can range from -1 to 1, where:

  • A correlation coefficient of -1 indicates a perfect negative linear relationship between two variables
  • A correlation coefficient of 0 indicates no linear relationship between two variables
  • A correlation coefficient of 1 indicates a perfect positive linear relationship between two variables

# Correlation matrix
df_net.corr()
sns.heatmap(df_net.corr())

Drop insignificant data

From the correlation matrix, we see that Gender is barely correlated with the other attributes or with the target Purchased, so we can drop it as well.

# Drop Gender column
df_net.drop(columns=['Gender'], inplace=True)

Step 4: Split data

Splitting data into independent and dependent variables involves separating the input features (independent variables) from the target variable (dependent variable). The independent variables are used to predict the value of the dependent variable.

The data is then split into a training set and a test set, with the training set used to fit the model and the test set used to evaluate its performance.

Independent / Dependent variables

In our data, Age and EstimatedSalary are the independent variables, assigned to X, and Purchased is the dependent variable, assigned to y.

# Split data into dependent/independent variables
X = df_net.iloc[:, :-1].values
y = df_net.iloc[:, -1].values

Train / Test split

The data is usually divided into two parts, with the majority of the data used for training the model and a smaller portion used for testing.

The training set is used to train the model and find the optimal parameters. The model is then tested on the test set to evaluate its performance and determine its accuracy. This is important because if the model is trained and tested on the same data, it may over-fit the data and perform poorly on new, unseen data.

We have split the data into 75% for training and 25% for testing.

# Split data into test/train set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1)

Step 5: Feature scaling

Feature scaling is a method of transforming the values of numeric variables so that they share a common scale, since many machine learning algorithms are sensitive to the scale of the input features.

There are two common methods of feature scaling: normalization and standardization.

  • Normalization scales the values of the variables so that they fall between 0 and 1. This is done by subtracting the minimum value of the feature and dividing it by the range (max-min).
  • Standardization transforms the values of the variables so that they have a mean of 0 and a standard deviation of 1. This is done by subtracting the mean and dividing it by the standard deviation.

Feature scaling is usually performed before training a model, as it can improve the performance of the model and reduce the time required to train it, and helps to ensure that the algorithm is not biased towards variables with larger values.
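
As a quick illustration of the two formulas, here is a minimal sketch on a few hypothetical salary values (not taken from the dataset):

# Normalization vs. standardization on hypothetical values
import numpy as np

x = np.array([15000.0, 58000.0, 97000.0, 150000.0])

x_norm = (x - x.min()) / (x.max() - x.min())  # min-max normalization: values in [0, 1]
x_std = (x - x.mean()) / x.std()              # standardization: mean 0, std 1

print(x_norm)
print(x_std)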

# Scale dataset
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Step 6: Train model

Training a machine learning model involves using a training dataset to estimate the parameters of the model. The training process uses a learning algorithm that iteratively updates the model parameters to minimize a loss function, which measures the difference between the predicted values and the actual values in the training data, thereby improving the accuracy of the model.

Pass the X_train and y_train data to the Random Forest classifier’s fit() method to train the model on our training data.

# Random Forest Classification
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

Step 7: Predict result / Score model

Once the model is trained, it can be used to make predictions on new data. Each tree is grown on a random subset of the training data (considering a random subset of the features at each split), and each tree in the forest makes its own prediction for an input sample. The final prediction is made by aggregating the predictions of all the trees, using majority voting for classification problems.

The accuracy of the model can be evaluated on a test set, which was previously held out from the training process.

# Prediction
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

Step 8: Evaluate model

Accuracy is a useful metric for assessing the performance of a model, but it can be misleading in some cases. For example, in a highly imbalanced dataset, a model that always predicts the majority class will have high accuracy, even though it may not be performing well. Therefore, it is important to consider other metrics, such as confusion matrix, precision, recall, F1-score, and ROC-AUC, along with accuracy, to get a more complete picture of the performance of a model.

Accuracy

Accuracy is a commonly used metric for evaluating the performance of a machine learning model. It measures the proportion of correct predictions made by the model on a given dataset.

In a binary classification problem, accuracy is defined as the number of correct predictions divided by the total number of predictions. The same definition applies in a multi-class classification problem: correct predictions across all classes divided by the total number of predictions.

# Accuracy
accuracy_score(y_test, y_pred)
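
Equivalently, since the predictions and test labels are both at hand, accuracy can be computed by hand as the fraction of matching predictions:

# Accuracy by hand: correct predictions / total predictions
print(np.mean(y_pred == y_test))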

Classification report

A classification report is a summary of the performance of a classification model. It provides several metrics for evaluating the performance of the model on a classification task, including precision, recall, f1-score, and support.

The classification report also provides macro and weighted averages of the individual class scores; the weighted average takes into account the imbalance in the distribution of classes in the dataset.

# Classification report
print(f'Classification Report: \n{classification_report(y_test, y_pred)}')

F1 score

F1-score is the harmonic mean of precision and recall. It provides a single score that balances precision and recall. Support is the number of instances of each class in the evaluation dataset.

# F1 score
print(f"F1 Score : {f1_score(y_test, y_pred)}")

Confusion matrix

A confusion matrix is used to evaluate the performance of a classification model. It summarizes the model’s performance by comparing the actual class labels of the data to the predicted class labels generated by the model.

True Positives (TP): Correctly predicted positive instances.
False Positives (FP): Incorrectly predicted positive instances.
True Negatives (TN): Correctly predicted negative instances.
False Negatives (FN): Incorrectly predicted negative instances.

It provides a clear and detailed understanding of how well the model is performing and helps to identify areas of improvement.

# Confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
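
For a binary problem, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], so the four counts described above can be unpacked directly:

# Unpack the binary confusion matrix into its four counts
tn, fp, fn, tp = cf_matrix.ravel()
print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')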

Precision-Recall curve

A precision-recall curve is a plot that summarizes the performance of a binary classification model as a trade-off between precision and recall and is useful for evaluating the model’s ability to make accurate positive predictions while finding as many positive instances as possible. Precision and Recall are two common metrics for evaluating the performance of a classification model.

Precision is the number of true positive predictions divided by the sum of true positive and false positive predictions. It measures the accuracy of the positive predictions made by the model.

Recall is the number of true positive predictions divided by the sum of true positive and false negative predictions. It measures the ability of the model to find all positive instances.

# Plot Precision-Recall Curve
y_pred_proba = classifier.predict_proba(X_test)[:,1]
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)

fig, ax = plt.subplots(figsize=(6,6))
ax.plot(recall, precision, label='Random Forest Classification', color = 'firebrick')
ax.set_title('Precision-Recall Curve')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
plt.box(False)
ax.legend();
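
One common use of this curve is threshold selection. As a small sketch building on the arrays computed above (not part of the original code), the threshold that maximizes F1 along the curve can be found like this:

# Pick the probability threshold that maximizes F1 along the curve
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = np.argmax(f1_scores)
print(f'Best threshold: {thresholds[best]:.3f} (F1 = {f1_scores[best]:.3f})')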

AUC/ROC curve

The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are commonly used metrics for evaluating the performance of a binary classification model.

A ROC curve plots the True Positive Rate (TPR) versus the False Positive Rate (FPR) for different thresholds of the model’s prediction probabilities. The TPR is the number of true positive predictions divided by the number of actual positive instances, while the FPR is the number of false positive predictions divided by the number of actual negative instances.

The AUC is the area under the ROC curve and provides a single-number metric that summarizes the performance of the model over the entire range of possible thresholds.

A high AUC indicates that the model is able to distinguish positive instances from negative instances well.

# Plot AUC/ROC curve
y_pred_proba = classifier.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba)

fig, ax = plt.subplots(figsize=(6,6))
ax.plot(fpr, tpr, label='Random Forest Classification', color = 'firebrick')
ax.set_title('ROC Curve')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
plt.box(False)
ax.legend();
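
The plot above does not print the AUC itself; it can be computed from the same arrays, or directly from the predicted probabilities:

# Area under the ROC curve
print('AUC:', metrics.auc(fpr, tpr))
print('AUC (from probabilities):', metrics.roc_auc_score(y_test, y_pred_proba))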

Visualizing predictions

Prediction results on the training set

Prediction results on the test set
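
The captions above refer to decision-region plots of the trained classifier over the two scaled features. A sketch of the kind of code that produces such a plot is shown below for the test set; the colors and grid step are arbitrary choices, not taken from the original article:

# Visualize the decision regions of the trained forest on the (scaled) test set
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
                     np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01))
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.4, cmap=ListedColormap(('salmon', 'dodgerblue')))
for i, label in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == label, 0], X_set[y_set == label, 1],
                color=ListedColormap(('red', 'blue'))(i), label=label)
plt.title('Random Forest Classification (Test set)')
plt.xlabel('Age (scaled)')
plt.ylabel('Estimated Salary (scaled)')
plt.legend()
plt.show()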

Example

Let’s take, as an example, a user with an Age of 45 and an Estimated Salary of 97,000, and check whether that user is likely to purchase the insurance.

# Predict purchase with Age(45) and Salary(97000)
print(classifier.predict(sc.transform([[45, 97000]])))

The predicted value [1] means the user is likely to purchase the insurance.
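
To see how confident the model is, rather than just the hard label, the same scaled input can be passed to predict_proba (the exact probabilities depend on the trained forest):

# Class probabilities for the same user: [P(no purchase), P(purchase)]
print(classifier.predict_proba(sc.transform([[45, 97000]])))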

Full Code at GitHub

You can get the full code in my GitHub repository.

Conclusion

One of the advantages of Random Forest is that it is easy to use and produces accurate results without the need for extensive tuning of parameters. In addition, the algorithm provides feature importance scores, which can be used to rank the importance of different features and understand the relationship between the features and the target variable.
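
For the model trained above, these scores are exposed through the feature_importances_ attribute; a minimal sketch, assuming the two features used in this article (Age and EstimatedSalary):

# Feature importance scores from the trained forest
for name, score in zip(['Age', 'EstimatedSalary'], classifier.feature_importances_):
    print(f'{name}: {score:.3f}')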

In conclusion, Random Forest is a versatile and powerful machine-learning algorithm that is used for both classification and regression analysis. It works by combining multiple decision trees and is robust to irrelevant features and overfitting. The algorithm is easy to use and provides feature importance scores, making it a popular choice for many applications.
