Random Forest with Grid Search

Published in Cloud Villains, Mar 13, 2024

Don’t miss the forest for the trees

This well-known saying applies to machine learning as well: a Random Forest usually performs better than a single decision tree.

That is why Random Forest is one of my favorite ML algorithms. In this post, we implement a Random Forest in Google Colab, step by step:

Getting dataset
Reading the dataset and plotting histogram for numerical variables
One-hot encoding categorical variables
Splitting train and test data
Oversampling the minor class instances
Fitting Random Forest
Validating the model performance
Hyperparameter tuning (Grid Search)
Feature Importance

Getting Dataset

This post uses flight booking data from a public Kaggle dataset. It has 50,000 records and 14 columns, including the target.

Flight Bookings data

Data of customer booking details for British Airways

www.kaggle.com

The columns are described below. The target is booking_complete, which indicates whether the customer completed the booking.

num_passengers = number of passengers traveling
sales_channel = sales channel booking was made on
trip_type = trip type (Round Trip, One Way, Circle Trip)
purchase_lead = number of days between travel date and booking date
length_of_stay = number of days spent at destination
flight_hour = hour of flight departure
flight_day = day of the week of flight departure
route = origin -> destination flight route
booking_origin = country from where the booking was made
wants_extra_baggage = if the customer wanted extra baggage in the booking
wants_preferred_seat = if the customer wanted a preferred seat in the booking
wants_in_flight_meals = if the customer wanted in-flight meals in the booking
flight_duration = total duration of flight (in hours)
booking_complete = flag indicating if the customer completed the booking

Reading the dataset and plotting histogram for numerical variables

import pandas as pd

# The encoding name needs a plain ASCII hyphen; an en dash raises a LookupError
df = pd.read_csv('customer_booking.csv', encoding='ISO-8859-1')
df.hist(figsize = (10, 10))

The histograms show that the target variable booking_complete is imbalanced. Five of the variables are categorical, so they need one-hot encoding.

A for loop is a simple way to identify the categorical variables: a column is treated as categorical if the value in its first row is a str.

cat_cols = []
num_cols = []

for col in df.columns:
  # A column counts as categorical if its first value is a string
  if isinstance(df[col].iloc[0], str):
    cat_cols.append(col)
  else:
    num_cols.append(col)
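As a possible alternative to checking the first row, the split can be based on each column's dtype. This is a minimal sketch on a toy DataFrame: the values are made up for illustration and only the column names come from the dataset. It assumes every non-numeric column should be treated as categorical.

```python
import pandas as pd

# Toy frame with hypothetical values; column names borrowed from the dataset.
toy = pd.DataFrame({
    "num_passengers": [1, 2, 1],
    "sales_channel": ["Internet", "Mobile", "Internet"],
    "flight_day": ["Mon", "Sat", "Wed"],
    "flight_duration": [5.5, 8.8, 5.5],
})

# Numeric columns are detected by dtype; everything else is treated as categorical.
num_cols = [c for c in toy.columns if pd.api.types.is_numeric_dtype(toy[c])]
cat_cols = [c for c in toy.columns if c not in num_cols]

print(cat_cols)
print(num_cols)
```

A dtype check does not depend on what happens to be in the first row, which may matter if that value is missing.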

One-hot encoding categorical variables

There are two main ways to one-hot encode: with sklearn or with pandas. pandas.get_dummies is the easier one to code. Note that axis=1 must be passed to pd.concat; otherwise the new columns are concatenated row-wise instead of side by side. After a column has been encoded, the original column is dropped.

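# Separate the features from the target column
X = df.drop('booking_complete', axis=1)
y = df['booking_complete']

for col in cat_cols:
  # Append the indicator columns side by side (axis=1), then drop the original column
  X = pd.concat([X, pd.get_dummies(X[col])], axis=1)
  X = X.drop(col, axis=1)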
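To see what the get_dummies, concat, and drop steps do, here is a small sketch on a toy frame. The three rows are hypothetical; only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical three-row sample; only the column names come from the dataset.
X_toy = pd.DataFrame({
    "flight_day": ["Mon", "Sat", "Mon"],
    "flight_hour": [7, 3, 17],
})

# Build indicator columns, append them column-wise, then drop the source column.
dummies = pd.get_dummies(X_toy["flight_day"])
X_toy = pd.concat([X_toy, dummies], axis=1).drop("flight_day", axis=1)

print(X_toy.columns.tolist())  # ['flight_hour', 'Mon', 'Sat']
```

Each category becomes its own indicator column named after the category value, which is why the encoded features later show up under names such as Mon or Sat.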

Splitting train and test data

Splitting the data is essential in ML modeling. If the training and test sets are not completely separated, data leakage occurs: information from the test set reaches the model during training. In that case both training and test accuracy can look good, yet the model fails in production, because leakage inflates the test accuracy and hides poor generalization. Data leakage comes up again in the oversampling step.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
y_train.hist()

Oversampling the minor class instances

As the histogram above shows, the target distribution is highly imbalanced. Class imbalance can degrade the performance of an ML model. Say a dataset is fraud-related and the fraud rate is 0.01%. If the model blindly predicts that no customer is fraudulent, the accuracy is 99.99%, yet the model cannot prevent a single fraud. In other words, the model is useless even though its accuracy is 99.99%. So the imbalance should be handled, either by oversampling the minority class or by undersampling the majority class. Oversampling is often preferred because it does not discard any records.
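The fraud example can be checked with a few lines of plain Python. The labels below are hypothetical and simply encode the 0.01% fraud rate from the text.

```python
# Hypothetical labels: 10,000 customers, exactly one fraud case (0.01%).
y_true = [0] * 9_999 + [1]
# A "model" that blindly predicts "not fraud" for everyone.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of the actual fraud cases that the model caught.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(f"accuracy = {accuracy:.4f}, recall = {recall:.4f}")
# accuracy = 0.9999, recall = 0.0000
```

Accuracy looks excellent while recall is zero, which is why accuracy alone is a poor guide on imbalanced data.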

SMOTE uses the K-nearest neighbors (KNN) algorithm to create new minority class data points. In detail, SMOTE selects one of the minority class data points, finds its nearest minority class neighbors with KNN, chooses one of them, and creates a synthetic point at a random position on the line segment between the two.

Note that only the training data should be oversampled, never the test data. Oversampling both would introduce data leakage.
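The idea behind SMOTE can be sketched in plain Python. This is a simplified illustration, not imblearn's implementation: the function name smote_like_sample and the sample points are hypothetical, and it generates a single synthetic point by interpolating between a minority point and one of its nearest minority neighbors.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point from a list of minority-class points."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # k nearest minority neighbors of the base point (excluding the base itself).
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    # Interpolate at a random position on the segment from base to neighbor.
    gap = rng.random()
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))

# Hypothetical minority-class points in two dimensions.
minority_points = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (5.0, 5.0)]
new_point = smote_like_sample(minority_points, k=2, rng=random.Random(0))
print(new_point)
```

Because the new point lies between two existing minority points, it stays inside the region the minority class already occupies rather than duplicating a record exactly.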

from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
y_train_resampled.hist()

Fitting Random Forest

The data is now ready for modeling. Let’s fit a Random Forest. It reaches 84% accuracy on the test set.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier()
rf.fit(X_train_resampled, y_train_resampled)

y_pred = rf.predict(X_test)
accuracy_score(y_test, y_pred)

Validating the model performance

A confusion matrix is usually the first choice for validating a classifier. True to its name, it is not easy to read and interpret correctly, and it confuses plenty of smart people.

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=rf.classes_)
disp.plot()
plt.show()

Recall is 23%: only 23% of the bookings that were actually completed are predicted as complete.

from sklearn.metrics import recall_score

recall_score(y_test, y_pred)

Precision is 42%: fewer than half of the bookings predicted as complete were actually completed.

from sklearn.metrics import precision_score

precision_score(y_test, y_pred)
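To connect recall and precision back to the confusion matrix, here is a small worked example. The counts are hypothetical and are not the results of this model; the layout follows scikit-learn's convention of rows for actual classes and columns for predicted classes.

```python
# Hypothetical 2x2 confusion matrix in scikit-learn's layout:
# rows = actual class, columns = predicted class.
cm = [[80, 5],   # actual 0: true negatives, false positives
      [10, 5]]   # actual 1: false negatives, true positives

(tn, fp), (fn, tp) = cm

recall = tp / (tp + fn)      # of the actual positives, how many were found
precision = tp / (tp + fp)   # of the predicted positives, how many were right
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"recall={recall:.3f} precision={precision:.3f} accuracy={accuracy:.3f}")
# recall=0.333 precision=0.500 accuracy=0.850
```

Recall reads along the bottom row and precision reads down the right column, so the same matrix can give a high accuracy with modest recall and precision.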

Hyperparameter tuning (Grid Search)

We can’t stop here, so let’s try to improve the model. Grid search with cross-validation (CV) helps here. We tune the following hyperparameters.

max_depth : the maximum depth of each tree. Deeper trees are more prone to overfitting, so a very high value can make the model generalize poorly.
n_estimators : the number of trees in the forest.
max_features : the number of features considered when looking for the best split. This is one of the main hyperparameters for controlling overfitting. The square root of the total number of features is a common recommendation.
min_samples_leaf : the minimum number of samples required at a leaf node of each tree.

Setting verbose = 3 prints progress as the search works through the grid.

from sklearn.model_selection import GridSearchCV

rf_grid = RandomForestClassifier()
gr_space = {
    'max_depth': [3,5,7,10],
    'n_estimators': [100, 200, 300, 400, 500],
    'max_features': [10, 20, 30 , 40],
    'min_samples_leaf': [1, 2, 4]
}

grid = GridSearchCV(rf_grid, gr_space, cv = 3, scoring='accuracy', verbose = 3)
model_grid = grid.fit(X_train_resampled, y_train_resampled)

print('Best hyperparameters are '+str(model_grid.best_params_))
print('Best score is: ' + str(model_grid.best_score_))
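Before launching a grid search, it can help to count how many models it will train. The sketch below reuses the grid and the cv = 3 setting from the code above; the variable names cv_folds, n_candidates, and n_fits are introduced here for illustration.

```python
from itertools import product

gr_space = {
    'max_depth': [3, 5, 7, 10],
    'n_estimators': [100, 200, 300, 400, 500],
    'max_features': [10, 20, 30, 40],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 3

# Every combination of the listed values is one candidate model.
n_candidates = len(list(product(*gr_space.values())))
# Each candidate is fitted once per fold (the final refit is not counted here).
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 240 720
```

With 720 forest fits, many of them with several hundred trees, a long run time on a single CPU is to be expected.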

The search took more than 4 hours on a Google Colab CPU, and accuracy ended up lower than before.

Recall is 47% and precision is 30%. Taking recall and precision together, the model looks better than before tuning: F1 rose from 0.297 to 0.366. In that sense, the tuning helped.
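The two F1 values follow directly from the reported precision and recall, since F1 is their harmonic mean. A quick check, using a helper function named here only for illustration:

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_before = f1_score_from(precision=0.42, recall=0.23)  # before tuning
f1_after = f1_score_from(precision=0.30, recall=0.47)   # after tuning

print(round(f1_before, 3), round(f1_after, 3))  # 0.297 0.366
```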

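rf_optimized = model_grid.best_estimator_
y_pred = rf_optimized.predict(X_test)

# Print each metric; in a notebook only the last bare expression would be displayed
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))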

Feature importance

Random Forest provides feature importances, which help us interpret the model. The top 10 features are the departure day (Mon, Tue, Wed, Fri, and Sun), booking origin (Australia and South Korea), flight duration, length of stay, and sales channel (mobile). Surprisingly, Thursday and Saturday have less impact on the model.
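A ranking like this can be read from the fitted model's feature_importances_ attribute. The sketch below is a hedged illustration on synthetic data, not the code used for this post: the dataset comes from make_classification, and the names f0 to f5, X_demo, and rf_demo are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the feature names f0..f5 are hypothetical.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_demo, y_demo)

# feature_importances_ is aligned with the training columns and sums to 1.
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
top = importances.sort_values(ascending=False).head(3)
print(top)
```

Assuming the training features are still a DataFrame, the same pattern should apply to the tuned model by swapping in rf_optimized and the training columns, with head(10) for the top 10.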

Conclusion

A random forest model was built using the airline booking dataset. During the data preprocessing stage, one-hot encoding was applied to the categorical variables, and oversampling was performed on the training data to address the class imbalance. The training and test data were kept separate to ensure a reliable evaluation of model performance. The initial random forest model achieved an accuracy of 84% but had low recall and precision. Hyperparameter tuning was therefore performed, and the F1 score improved from 0.297 to 0.366, although accuracy decreased. Feature importance analysis indicated that departure day, booking origin, flight duration, length of stay, and sales channel were the key predictive factors. Overall, handling the imbalanced data, tuning hyperparameters, and analyzing feature importance helped improve and interpret the random forest model. These insights can be useful for real-world airline booking prediction problems.