Random Forest with Grid Search
Don’t miss the forest for the trees
This is a well-known saying, and it holds in machine learning as well: a Random Forest usually performs better than a single decision tree.
That’s why Random Forest is one of my favorite ML algorithms. In this blog, we implement Random Forest in Google Colab step by step:
- Getting dataset
- Reading the dataset and plotting histogram for numerical variables
- One-hot encoding categorical variables
- Splitting train and test data
- Oversampling the minor class instances
- Fitting Random Forest
- Validating the model performance
- Hyperparameter tuning (Grid Search)
- Feature Importance
Getting Dataset
Flight booking data is used. It is a public dataset from Kaggle with 50,000 records and 14 columns (13 features plus the target).
The columns are described below. The target is booking_complete, which shows whether a customer completed a booking or not.
- num_passengers = number of passengers traveling
- sales_channel = sales channel booking was made on
- trip_type = trip type (Round Trip, One Way, Circle Trip)
- purchase_lead = number of days between travel date and booking date
- length_of_stay = number of days spent at destination
- flight_hour = hour of flight departure
- flight_day = day of the week of flight departure
- route = origin -> destination flight route
- booking_origin = country from where the booking was made
- wants_extra_baggage = if the customer wanted extra baggage in the booking
- wants_preferred_seat = if the customer wanted a preferred seat in the booking
- wants_in_flight_meals = if the customer wanted in-flight meals in the booking
- flight_duration = total duration of flight (in hours)
- booking_complete = flag indicating if the customer completed the booking
Reading the dataset and plotting histogram for numerical variables
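import pandas as pd
# Load the Kaggle CSV (the encoding name needs a plain hyphen, not an en dash)
df = pd.read_csv('customer_booking.csv', encoding='ISO-8859-1')
# Plot a histogram for every numerical column
df.hist(figsize=(10, 10))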
From the histograms above, the dependent variable booking_complete is imbalanced. Five variables are categorical, so they need one-hot encoding. A for loop is useful for identifying the categorical variables: a column is treated as categorical if the value in its first row is a str.
cat_cols = []
num_cols = []
for col in df.columns:
    # A column counts as categorical if its first value is a string
    if type(df[col][0]) == str:
        cat_cols.append(col)
    else:
        num_cols.append(col)
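The loop above looks only at the first row of each column. As an alternative sketch, pandas can select columns by dtype instead. The toy frame and its values below are hypothetical stand-ins for the booking data, not the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the booking data
toy = pd.DataFrame({
    "num_passengers": [1, 2, 1],
    "trip_type": ["RoundTrip", "OneWay", "RoundTrip"],
    "flight_day": ["Mon", "Sat", "Wed"],
    "flight_duration": [5.5, 7.0, 8.8],
})

# Columns whose dtype is not numeric are treated as categorical
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
num_cols = toy.select_dtypes(include="number").columns.tolist()

print(cat_cols)
print(num_cols)
```

This relies on the whole column's dtype rather than on a single value, so it does not depend on what happens to be in the first row.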
One-hot encoding categorical variables
There are two main ways to do one-hot encoding: with sklearn or with pandas. Using pandas.get_dummies is easier coding-wise. Note that axis=1 must be passed to pd.concat; otherwise, the dummy columns are concatenated in the row direction. After a column is one-hot encoded, the original column is dropped.
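# Separate the features from the target before encoding
X = df.drop('booking_complete', axis=1)
y = df['booking_complete']
for col in cat_cols:
    # Append the dummy columns, then drop the original categorical column
    X = pd.concat([X, pd.get_dummies(X[col])], axis=1)
    X = X.drop(col, axis=1)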
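As a possible shortcut, pd.get_dummies can also encode several columns in one call through its columns parameter. This is only a sketch on a hypothetical toy frame; unlike the loop above, it prefixes each dummy column with the original column name.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    "num_passengers": [1, 2, 1],
    "trip_type": ["RoundTrip", "OneWay", "RoundTrip"],
    "flight_day": ["Mon", "Sat", "Mon"],
})

# Encode the listed columns and drop the originals in a single step
encoded = pd.get_dummies(toy, columns=["trip_type", "flight_day"])

print(encoded.columns.tolist())
```

The prefixes make it easier to tell later which original variable a dummy column came from.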
Splitting train and test data
Splitting the data properly is very important in ML modeling. If the training and test sets are not completely separated, data leakage causes serious prediction problems. For example, both training accuracy and test accuracy may look good enough, yet the model fails in production. This is because data leakage inflates the test accuracy and hides poor generalization. Data leakage comes up again in the oversampling step.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
y_train.hist()
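Because the target is imbalanced, one optional refinement is a stratified split, which keeps the class ratio the same in the training and test sets. The post does not use it; the sketch below only illustrates the stratify parameter of train_test_split on hypothetical toy arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives and 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```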
Oversampling the minor class instances
As you can see in the histogram above, the distribution of the target is highly imbalanced. An imbalanced dataset degrades the performance of an ML model. Say a dataset is fraud-related and the fraud rate is 0.01%. If the model blindly predicts that no customer is fraudulent, the accuracy is 99.99%, yet the model doesn't catch a single fraud case. In other words, the model is useless even though its accuracy is 99.99%. So, the imbalance should be handled by oversampling the minority class or undersampling the majority class. Oversampling is often preferred because it does not discard any data.
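The fraud example can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical population of 100,000 customers, which gives 10 fraud cases at a 0.01% fraud rate, and a model that always predicts "not fraud".

```python
n_customers = 100_000
n_fraud = 10  # 0.01% of 100,000 (hypothetical population size)

# 1 = fraud, 0 = not fraud; the model blindly predicts 0 for everyone
y_true = [1] * n_fraud + [0] * (n_customers - n_fraud)
y_pred = [0] * n_customers

# Accuracy: share of all customers that were labelled correctly
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_customers

# Recall: share of actual fraud cases that were caught
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = caught / n_fraud

print(f"accuracy={accuracy:.4f}, recall={recall:.2f}")
```

Accuracy looks excellent while recall is zero, which is why accuracy alone is a poor yardstick on imbalanced data.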
SMOTE (Synthetic Minority Over-sampling Technique) uses the K-nearest neighbors (KNN) algorithm to create synthetic minority-class data points. In detail, SMOTE selects one minority-class data point, finds its nearest minority-class neighbors, chooses one of them, and creates a new point on the line segment between the two.
Importantly, only the training data should be oversampled, not the test data. Oversampling both (for example, by resampling before the split) leaks information between the two sets.
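To make the idea concrete, here is a simplified NumPy sketch of how one synthetic point could be generated. It is not imblearn's implementation; the function name smote_like_sample and the toy points are hypothetical, and it assumes the minority points contain no duplicates.

```python
import numpy as np

def smote_like_sample(X_min, k=2, rng=None):
    """Create one synthetic point from minority samples X_min (simplified SMOTE idea)."""
    rng = np.random.default_rng() if rng is None else rng
    # 1. Pick a random minority sample
    i = rng.integers(len(X_min))
    # 2. Find its k nearest minority neighbours (index 0 of the sort is the point itself)
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]
    # 3. Choose one neighbour and interpolate at a random position between the two
    j = rng.choice(neighbours)
    gap = rng.random()
    return X_min[i] + gap * (X_min[j] - X_min[i])

# Hypothetical minority-class points in two dimensions
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
rng = np.random.default_rng(0)
synthetic = np.array([smote_like_sample(X_min, k=2, rng=rng) for _ in range(5)])

print(synthetic)
```

Every synthetic point lies between two existing minority points, so the new data stays inside the region the minority class already occupies.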
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
y_train_resampled.hist()
Fitting Random Forest
We’ve finished preparing the data for modeling. Let’s fit a Random Forest. It gives 84% accuracy.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
rf = RandomForestClassifier()
rf.fit(X_train_resampled, y_train_resampled)
y_pred = rf.predict(X_test)
accuracy_score(y_test, y_pred)
Validating the model performance
A confusion matrix is the first choice when validating a classification model. True to its name, it isn’t easy to read and interpret correctly, and it confuses many smart people.
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Rows are the true labels and columns are the predicted labels
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=rf.classes_)
disp.plot()
plt.show()
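To make the matrix easier to read, the sketch below counts the four cells by hand on hypothetical labels (not the post's results). For binary labels 0 and 1, sklearn's confusion_matrix arranges them as [[TN, FP], [FN, TP]], and recall and precision follow directly from those cells.

```python
# Hypothetical labels: 1 = booking completed, 0 = not completed
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tn = pairs.count((0, 0))  # true negatives
fp = pairs.count((0, 1))  # false positives
fn = pairs.count((1, 0))  # false negatives
tp = pairs.count((1, 1))  # true positives

# Same layout as sklearn's confusion_matrix for labels [0, 1]
cm = [[tn, fp],
      [fn, tp]]

recall = tp / (tp + fn)     # share of actual positives that were found
precision = tp / (tp + fp)  # share of predicted positives that are correct

print(cm, recall, precision)
```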
The recall is 23%, so only 23% of the truly completed bookings are predicted as completed.
from sklearn.metrics import recall_score
recall_score(y_test, y_pred)
The precision is 42%, so fewer than half of the bookings predicted as completed are actually completed.
from sklearn.metrics import precision_score
precision_score(y_test, y_pred)
Hyperparameter tuning (Grid Search)
We can’t stop here. Let’s try to improve the model. Grid search with cross-validation (CV) will help us. We tune the following hyperparameters:
- max_depth: the maximum depth of each tree. Deeper trees are more prone to overfitting, so a high value can make the model fail to generalize.
- n_estimators: the number of trees in the forest.
- max_features: the number of features considered when looking for the best split. This is one of the main hyperparameters for preventing overfitting. The square root of the total number of features is a commonly recommended value.
- min_samples_leaf: the minimum number of samples required at a leaf node of each tree.
Setting verbose = 3 shows how the search progresses through the grid.
from sklearn.model_selection import GridSearchCV
rf_grid = RandomForestClassifier()
gr_space = {
    'max_depth': [3, 5, 7, 10],
    'n_estimators': [100, 200, 300, 400, 500],
    'max_features': [10, 20, 30, 40],
    'min_samples_leaf': [1, 2, 4]
}
grid = GridSearchCV(rf_grid, gr_space, cv = 3, scoring='accuracy', verbose = 3)
model_grid = grid.fit(X_train_resampled, y_train_resampled)
print('Best hyperparameters are '+str(model_grid.best_params_))
print('Best score is: ' + str(model_grid.best_score_))
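It helps to estimate the size of the search before running it. The sketch below counts the combinations in the grid above with itertools.product and multiplies by the number of CV folds; the variable names cv_folds, combos, and n_fits are only illustrative.

```python
from itertools import product

gr_space = {
    'max_depth': [3, 5, 7, 10],
    'n_estimators': [100, 200, 300, 400, 500],
    'max_features': [10, 20, 30, 40],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 3

# Every combination of the listed values is one candidate model
combos = list(product(*gr_space.values()))
# Each candidate is trained once per CV fold
n_fits = len(combos) * cv_folds

print(len(combos), n_fits)
```

With 240 candidates and 3 folds, the search trains 720 forests, plus one final refit of the best candidate because GridSearchCV uses refit=True by default. Passing n_jobs=-1 to GridSearchCV may shorten the run on a multi-core machine, although that was not tried in this post.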
The search took more than 4 hours on a Google Colab CPU, and the accuracy is lower than before. Recall is 47% and precision is 30%. In terms of the balance between recall and precision, however, the tuned model looks better: F1 was 0.297 before hyperparameter tuning and is 0.366 after it. Hence, what we have done works.
rf_optimized = model_grid.best_estimator_
y_pred = rf_optimized.predict(X_test)
accuracy_score(y_test, y_pred)
recall_score(y_test, y_pred)
precision_score(y_test, y_pred)
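The F1 values quoted above follow from the recall and precision figures, since F1 is the harmonic mean of the two. The helper name f1_from below is hypothetical; sklearn.metrics.f1_score computes the same quantity directly from labels.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Figures reported in this post, before and after tuning
f1_before = f1_from(precision=0.42, recall=0.23)
f1_after = f1_from(precision=0.30, recall=0.47)

print(round(f1_before, 3), round(f1_after, 3))
```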
Feature importance
Random Forest provides feature importances, which help us interpret the model. The top 10 features are the day of departure (Mon, Tue, Wed, Fri, and Sun), booking origin (Australia and South Korea), flight duration, length of stay, and sales channel (mobile). Surprisingly, Thursday and Saturday have less impact on the model.
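A ranking like this can be read from the fitted model's feature_importances_ attribute. The sketch below is a minimal, self-contained illustration on synthetic data from make_classification; the names X_toy, y_toy, and rf_toy are hypothetical, and in this post the fitted model and the one-hot encoded training columns would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data with 15 named feature columns
X_arr, y_toy = make_classification(
    n_samples=300, n_features=15, n_informative=5, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(15)])

rf_toy = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# One impurity-based score per column; the scores sum to 1
importances = pd.Series(rf_toy.feature_importances_, index=X_toy.columns)
top10 = importances.sort_values(ascending=False).head(10)

print(top10)
```

Calling top10.plot.barh() would draw the ranking as a horizontal bar chart.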
Conclusion
A random forest model was built using the airline booking dataset. During the data preprocessing stage, one-hot encoding was applied to categorical variables, and oversampling was performed to address the imbalanced data. The training and test data were separated to ensure the reliability of the model performance evaluation. The initial random forest model achieved an accuracy of 84%, but had low recall and precision. As a result, hyperparameter tuning was performed, and the F1 score improved from 0.297 to 0.366. Feature importance analysis revealed that departure day, booking origin, flight duration, length of stay, and sales channel were the key predictive factors. Overall, effective preprocessing of the imbalanced data, hyperparameter tuning, and feature importance analysis helped improve and interpret the random forest model. These insights can provide valuable information for real-world airline booking prediction problems.