Random Forest Classifier in Python

wahyu.bodromurti
3 min read · Jun 8, 2023

Hi!

In this post, I will show how to build a Random Forest classifier in Python with scikit-learn. First, download the Melbourne Housing Data from Kaggle; it is the dataset I use throughout.

Melbourne Housing Data

Step 1 — Preprocessing Data

import pandas as pd
melb_data = pd.read_excel('melb_data.xlsx')  # load the data from the Excel file
melb_data.isnull().sum()  # count the missing values (NAs) in each variable
melb_data = melb_data.dropna()  # drop rows that contain any NA value
melb_data.head(5)  # preview the first five rows
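
The Kaggle download may arrive as a CSV file rather than an Excel workbook. As a minimal sketch, assuming a hypothetical helper named load_table (it is not part of pandas) and a tiny made-up table, the loader can be chosen from the file extension and the missing-value steps checked on data small enough to read by eye:

```python
from pathlib import Path

import pandas as pd


def load_table(path):
    """Hypothetical helper: read a CSV or Excel file based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in (".xlsx", ".xls"):
        # read_excel needs an Excel engine such as openpyxl to be installed
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix}")


# Tiny made-up table to show the missing-value steps without the real file
demo = pd.DataFrame({"Rooms": [2, 3, None], "Car": [1.0, None, 2.0]})
na_counts = demo.isnull().sum()  # missing values per column
demo_clean = demo.dropna()       # keep only fully populated rows
print(na_counts)
print(demo_clean)
```

Note that dropna() removes a row if any column is missing, so it is worth comparing the row count before and after on the real data.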

numeric_summary = melb_data.describe()  # descriptive statistics for all numeric variables
method_counts = melb_data['Method'].value_counts()  # frequency table for a categorical variable

import numpy as np
# add a new categorical column that splits the response into 'sold' and 'others'
melb_data['method_group'] = ''
melb_data.loc[(melb_data.Method == 'S') | (melb_data.Method == 'SA') | (melb_data.Method == 'SP'), 'method_group'] = 'sold'
melb_data['method_group'] = np.where(melb_data['method_group'] == '', 'others', melb_data['method_group'])
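
The grouping above can also be written in one step. This is a sketch of an equivalent alternative using isin, shown on a made-up column of sale-method codes rather than the real data:

```python
import numpy as np
import pandas as pd

# Made-up rows covering a few sale-method codes
toy = pd.DataFrame({"Method": ["S", "SP", "PI", "VB", "SA"]})

sold_codes = ["S", "SA", "SP"]  # the codes treated as 'sold' in this post
toy["method_group"] = np.where(toy["Method"].isin(sold_codes), "sold", "others")

print(toy["method_group"].value_counts())
```

Keeping the codes in one list makes it easier to change the definition of "sold" later.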

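# Select the explanatory variables, then one-hot encode the categorical ones
# ('Type' here) with pandas get_dummies; numeric columns pass through unchanged
variables1 = melb_data[['Rooms',
                        'Type',
                        'Price',
                        'Distance',
                        'Bedroom2',
                        'Bathroom',
                        'Car',
                        'Landsize']]
variables1 = pd.get_dummies(variables1)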
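
To see what get_dummies does to a mixed table, here is a small sketch on made-up rows. Only the non-numeric column is expanded into indicator columns; the numeric columns are left as they are:

```python
import pandas as pd

# Made-up rows; 'Type' is the only non-numeric column
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Type": ["h", "u", "t"],
    "Price": [850000.0, 620000.0, 1100000.0],
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())  # 'Type' is replaced by one column per category
```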

# define the explanatory variables (X) and the response variable (Y)
# the response holds the values we want to predict (Y variable)
response = melb_data['method_group']
# the features are the one-hot encoded columns selected above
variables = variables1

Step 2 — Splitting the Data into Train and Test datasets

# Training and Testing Sets
# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets (70:30)
train_var, test_var, train_resp, test_resp = train_test_split(variables, response, test_size = 0.30, random_state = 100)
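
When the classes are imbalanced, one optional variation is to pass stratify so that both sets keep the same class proportions. The following is a sketch on made-up data (80 'sold' rows and 20 'others' rows), not a change to the split used in this post:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 80 'sold' rows and 20 'others' rows
X = pd.DataFrame({"Rooms": range(100)})
y = pd.Series(["sold"] * 80 + ["others"] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=100, stratify=y
)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```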

Step 3 — Modelling

# Train Model
# Import the model we are using
from sklearn.ensemble import RandomForestClassifier
# Instantiate model with 1000 decision trees
rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(train_var, train_resp)
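
As an optional extra, a random forest can estimate its own accuracy from the samples each tree did not see during bootstrapping (the out-of-bag samples). This sketch uses synthetic data from make_classification as a stand-in, so the number it prints says nothing about the Melbourne data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the Melbourne dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# oob_score=True scores each sample using only trees that did not train on it
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf_oob.fit(X, y)
print("OOB accuracy estimate:", rf_oob.oob_score_)
```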

# Apply the model to the test data
# Use the forest's predict method on the test data
predictions = rf.predict(test_var)

Step 4 — Performance Evaluation Metrics of Classification

# convert the response variable to numeric (1 = 'sold', 0 = 'others')
y_test = np.where(test_resp == 'sold', 1, 0)
y_pred = np.where(predictions == 'sold', 1, 0)

# Calculate and display accuracy, precision, recall
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Output:
Accuracy: 0.7810650887573964
Precision: 0.8080459770114943
Recall: 0.9506423258958756
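
The 0/1 conversion above is one way to tell scikit-learn which class counts as positive. The metric functions also accept string labels directly when pos_label is given. A small sketch on made-up labels (not the model's real predictions):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels standing in for test_resp and predictions
true_labels = ["sold", "sold", "others", "sold", "others", "sold"]
pred_labels = ["sold", "sold", "sold", "others", "others", "sold"]

# pos_label names the class treated as positive, so no 0/1 conversion is needed
toy_accuracy = accuracy_score(true_labels, pred_labels)
toy_precision = precision_score(true_labels, pred_labels, pos_label="sold")
toy_recall = recall_score(true_labels, pred_labels, pos_label="sold")
print("Accuracy:", toy_accuracy)
print("Precision:", toy_precision)
print("Recall:", toy_recall)
```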

# Create the confusion matrix
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm).plot();

Confusion Matrix
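
Precision and recall can be read straight off the confusion matrix. As a worked sketch on made-up 0/1 labels (1 = 'sold'), the four cells are unpacked and the two metrics recomputed by hand:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up 0/1 labels (1 = 'sold'), not results from the model above
y_true = np.array([1, 1, 0, 1, 0, 1])
y_hat = np.array([1, 1, 1, 0, 0, 1])

# For binary 0/1 labels, rows are the true classes and columns the predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
precision_by_hand = tp / (tp + fp)  # share of predicted 'sold' that were sold
recall_by_hand = tp / (tp + fn)     # share of actual 'sold' that were found
print(tn, fp, fn, tp, precision_by_hand, recall_by_hand)
```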

# Create variable importances from the model
var_importances = pd.Series(rf.feature_importances_, index=train_var.columns).sort_values(ascending=False)
# Plot a simple bar chart
var_importances.plot.bar();

Variable Importance
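
The importances above are impurity-based and computed from the training data. A complementary check, sketched here on synthetic data with hypothetical column names, is scikit-learn's permutation_importance, which shuffles one column of held-out data at a time and records how much the score drops:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical column names
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=100
)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle each column of the held-out data and record the drop in accuracy
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
perm_importances = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(perm_importances)
```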

Done!

Warm wishes,
Bodro
