Navigating the Turbulent Waters of Hotel Booking Cancellations: Predictive Modelling (Phase 2)

Lisa Asafo-Adjei
6 min read · May 25, 2023


In the previous blog post, we explored the hotel cancellations dataset through data visualizations and discussed insights from the analysis along with recommendations.

Why do we need to use predictive models then?

While data visuals help us understand who our customers are and explain why things happen, predictive modelling can unearth new customer insights and predict behaviours from inputs, allowing organizations to tailor marketing strategies and retain valuable customers.

So in this follow-up blog, we delve deeper into the topic and compare four popular algorithms: Logistic Regression, Decision Tree, Random Forest, and XGBoost. Our analysis reveals that Random Forest emerges as the best-performing model for predicting hotel cancellations. Let's explore these algorithms and their performance in detail.

To help us understand, let’s first go through a few definitions here:

Logistic Regression

Logistic Regression is a widely used algorithm for classification problems. It models the relationship between a dependent variable and one or more independent variables. In the context of hotel cancellations prediction, Logistic Regression can be used to estimate the probability of a booking being cancelled based on various factors such as booking lead time, previous cancellations, and customer information.
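
To make this concrete, here is a minimal sketch on synthetic data (not the hotel dataset): the fitted model outputs a probability for each class, which is then thresholded into a predicted label.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data standing in for booking features; the key point is that the model
# returns probabilities, not just hard labels.
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
log_reg = LogisticRegression(max_iter=1000).fit(X, y)

print(log_reg.predict_proba(X[:3]))  # per row: [P(not cancelled), P(cancelled)]
print(log_reg.predict(X[:3]))        # hard labels at the default 0.5 threshold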

Decision Tree

Decision Tree algorithms partition the feature space into a hierarchical structure of decision nodes and leaf nodes. Each decision node applies a splitting criterion to determine which feature to use for the split. Decision Trees are intuitive to interpret and can handle both categorical and numerical data. However, they tend to overfit the training data, leading to poor generalization on unseen data.
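
That overfitting tendency is easy to see on synthetic data (a quick illustration, not the hotel dataset): an unconstrained tree usually scores near-perfectly on its training split but drops on held-out data, while limiting max_depth trades some training accuracy for better generalization.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Compare an unconstrained tree with a depth-limited one on the same split.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

for depth in (None, 4):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))  # train vs. test accuracy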

Random Forest

Random Forest is an ensemble learning algorithm that combines multiple decision trees to make predictions. It addresses the overfitting issue of Decision Trees by creating a diverse set of trees through a combination of random feature selection and bootstrap aggregating (bagging). Random Forest improves prediction accuracy and reduces variance by averaging the predictions of individual trees. It is known for its robustness and ability to handle high-dimensional data.
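
In scikit-learn, the forest's predicted probability is the average of the probabilities produced by its individual trees; a small sketch on synthetic data shows the averaging directly.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# The forest's probability estimate equals the mean of its trees' estimates.
X, y = make_classification(n_samples=500, random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

avg_tree_probs = np.mean([t.predict_proba(X[:5]) for t in forest.estimators_], axis=0)
print(np.allclose(avg_tree_probs, forest.predict_proba(X[:5])))  # True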

XGBoost

XGBoost (Extreme Gradient Boosting) is another popular ensemble learning algorithm that excels in predictive tasks. It is an optimized implementation of the Gradient Boosting algorithm, which combines weak learners sequentially to form a strong learner. XGBoost uses a gradient descent algorithm to minimize a loss function and improve prediction performance. It offers better speed and efficiency compared to traditional Gradient Boosting algorithms.

Comparative Analysis

To compare the performance of these four algorithms, we split the dataset into training and testing sets, trained each model on the training set, and evaluated their performance on the testing set using relevant evaluation metrics such as accuracy.

Our results indicate that Random Forest outperformed the other three algorithms in terms of accuracy. Its ensemble of many trees, combined with the diversity introduced by random feature selection and bagging, helps it capture complex patterns and make robust predictions.

Predictive Modelling on the hotel reservations dataset

You can access the dataset here… https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset

We will first check the correlation between the variables before feature selection. We will then compare the four models to see which performs better and use that model for prediction.

The first step is to import all the necessary libraries and then load the dataset.

# Import all the necessary libraries first
import numpy as np
import pandas as pd
from pandas import DataFrame
from matplotlib import pyplot as plt
import seaborn as sns
import xgboost as xgb

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer

# Load the dataset
df = pd.read_csv("/kaggle/input/hotel-reservations-classification-dataset/Hotel Reservations.csv")

Our target label, booking_status, is categorical (Canceled / Not_Canceled), so it needs to be encoded as numbers.

df = df.rename(columns={'booking_status': 'is_canceled'})
df['is_canceled'].replace('Canceled', '1', inplace=True)
df['is_canceled'].replace('Not_Canceled', '0', inplace=True)

We have to change the datatype of the encoded variable from string to an integer:

df['is_canceled'] = df['is_canceled'].astype(int)
hotel_df=train_set.copy()
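
Note that hotel_df is copied from train_set, which is not created in the snippets shown here; presumably df was split earlier in the notebook into a training portion and a hold-out test_set (used at the end for the confusion matrix). A minimal sketch of such a split, with an assumed 80/20 ratio and seed:

from sklearn.model_selection import train_test_split

# Assumed step (not shown in the original snippets): carve off a hold-out set
# that is only touched at the very end for the final evaluation.
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)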

What is the correlation between the variables?

We can analyse this using a correlation plot like this one:

hotel_df_corr=hotel_df.corr(numeric_only=True)  # correlation of the numeric columns

# rank features by absolute correlation with the target label
corr_df= DataFrame(hotel_df_corr['is_canceled'].abs().sort_values(ascending=False))

sns.heatmap(hotel_df_corr)

Feature Selection
We will need to choose numerical and categorical columns for our model.

numeric_cols=['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
'no_of_week_nights', 'required_car_parking_space', 'lead_time',
'arrival_year', 'arrival_month', 'arrival_date', 'repeated_guest',
'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
'avg_price_per_room', 'no_of_special_requests']

cat_cols=[ 'type_of_meal_plan', 'room_type_reserved',
'market_segment_type']

Define the feature set and the target variable

hotel_df_actual=hotel_df[numeric_cols + cat_cols]
hotel_df_y=hotel_df['is_canceled']

Preprocessing pipelines

We convert the categorical values to numeric codes and fill in missing values in both the numeric and categorical variables.

num_transformer=SimpleImputer(strategy="constant",fill_value=0)

#num_transformer.fit_transform(hotel_df[numeric_cols])
cat_transformer=Pipeline([('imputer',SimpleImputer(strategy="constant")),('ordinal',OrdinalEncoder())])

#combine the two so we can use it for other data sets

col_trans=ColumnTransformer([("num",num_transformer,numeric_cols),("cat",cat_transformer,cat_cols)])
col_trans
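
As an optional sanity check (not part of the original notebook), the transformer can be fitted and applied to the feature frame to confirm the output is fully numeric, with one column per selected feature:

# Imputation plus ordinal encoding should leave a purely numeric array of
# shape (n_rows, len(numeric_cols) + len(cat_cols)).
transformed = col_trans.fit_transform(hotel_df_actual)
print(transformed.shape)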

Next, we will split the data into training and test sets, and then use the training set to train the model.

X_train,X_test,y_train,y_test=train_test_split(hotel_df_actual,hotel_df_y,test_size=0.2,random_state=42,stratify=hotel_df_y)

Logistic Regression Model

model=LogisticRegression(random_state=42,n_jobs=-1)

model_steps=Pipeline([('col_trans',col_trans),('model',model)])

model_steps.fit(X_train,y_train)

y_pred=model_steps.predict(X_test)

model_steps.score(X_test,y_test)
0.7953135768435562

Cross-validation

Then, estimate the model's accuracy with cross-validation, using the built-in function to evaluate estimator performance.

cv_results=cross_val_score(model_steps,X_train,y_train,cv=5,n_jobs=-1,scoring='accuracy')
cv_results
array([0.77971576, 0.80142149, 0.78785268, 0.7964678 , 0.78677579])
cv_results=[x for x in cv_results if str(x) !="nan"]
np.mean(cv_results)
0.7904467061915885

Now that we have our base model, let's choose the models we want to compare and work on those.

clf1=LogisticRegression(penalty='l2',C=0.001,random_state=42)
clf2=DecisionTreeClassifier(criterion='entropy',max_depth=4,random_state=42)
clf3=RandomForestClassifier(n_estimators=100,criterion='entropy',random_state=42)
clf4=xgb.XGBClassifier(n_estimators=100,learning_rate=0.01,max_depth=4,random_state=1,use_label_encoder=False)

models_name=['Logistic Reg','Decision Tree','RandomForest','XGB']

for clf,name in zip([clf1,clf2,clf3,clf4],models_name):
    model_steps=Pipeline([('col_trans',col_trans),('model',clf)])
    scores=cross_val_score(model_steps,X_train,y_train,cv=4,n_jobs=-1,scoring="accuracy")
    scores=[x for x in scores if str(x) !="nan"]
    print('accuracy:{:.2f},{}'.format(np.mean(scores),name))
accuracy:0.78,Logistic Reg
accuracy:0.82,Decision Tree
accuracy:0.90,RandomForest
accuracy:0.84,XGB

Our best model is Random Forest, so let's take it and do some hyperparameter tuning on it.

RF=RandomForestClassifier(n_estimators=100,criterion='entropy',random_state=42)

model=Pipeline([('col_trans',col_trans),('model',RF)])

model.fit(X_train,y_train)

model.get_params()
clf3=RandomForestClassifier(n_estimators=100,criterion='entropy',random_state=42)
model=Pipeline([('col_trans',col_trans),('model',clf3)])

param_range=[100,160]
#param_criterion=['gini','entropy']
param_grids=[{'model__n_estimators':param_range,'model__criterion':['entropy']}]
#{'model__n_estimators':param_range,'model__criterion':['entropy'],'model__min_samples_split':param_min_samples_split}]

gs=GridSearchCV(estimator=model,param_grid=param_grids,scoring='accuracy',cv=10,refit=True,n_jobs=-1)

gs=gs.fit(X_train,y_train)

The grid search evaluates each hyperparameter combination with cross-validation on the training set; here are the best score and parameters:


print(gs.best_score_)

print(gs.best_params_)
0.8981306321601703
{'model__criterion': 'entropy', 'model__n_estimators': 160}
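
Since refit=True, the grid search has already retrained the winning configuration on the full training set, so the tuned pipeline can be pulled out and, optionally, sanity-checked on the test split:

# The refit best pipeline is available directly on the grid search object.
best_model = gs.best_estimator_
print(best_model.score(X_test, y_test))  # accuracy on the 20% test split

We can also re-build the forest by hand with further tweaks (for example, max_features) and re-check it with cross-validation:
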
clf3=RandomForestClassifier(n_estimators=160,max_features=0.4,min_samples_split=2,n_jobs=-1,random_state=42)
model_clf=Pipeline([('col_trans',col_trans),('model',clf3)])

cv_result_score=cross_val_score(model_clf,X_train,y_train,cv=5,scoring='accuracy',n_jobs=-1)
cv_result_score=[x for x in cv_result_score if str(x) !="nan"]

np.mean(cv_result_score)
0.8964936055369768

Display Pipeline

Fit the final pipeline on the training data (in a notebook, the cell output shows the fitted pipeline object):

model_clf.fit(X_train,y_train)

Finding feature importance

# feature names in the order the ColumnTransformer outputs them
num=model_clf.named_steps['col_trans'].transformers_[0][2]
cat=model_clf.named_steps['col_trans'].transformers_[1][2]
features=num+cat

# pair each feature with the forest's importance weight and show the top 10
feat=DataFrame(features)
weight=DataFrame(model_clf.steps[1][1].feature_importances_)
imp_df=pd.concat([feat,weight],axis=1,keys=['Feature','Weight'])
imp_df=imp_df.droplevel(1,axis=1)
imp_df.sort_values("Weight",ascending=False).head(10)

Confusion Matrix

X_valid=test_set[numeric_cols + cat_cols]
y_valid=test_set['is_canceled']


from sklearn.metrics import confusion_matrix
model_clf.fit(X_train,y_train)
ypred=model_clf.predict(X_valid)
confmat=confusion_matrix(y_valid,ypred)
print(confmat)
[[4579  260]
 [ 443 1973]]

import seaborn as sns
fig,ax=plt.subplots(figsize=(5.5,5.5))
sns.heatmap(confmat,annot=True)
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
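
Reading the matrix: rows are the actual labels (0 = not cancelled, 1 = cancelled) and columns are the predictions, so correct predictions sit on the diagonal. The accuracy reported in the next section follows directly from these four counts:

# Accuracy from the confusion matrix: correct predictions / all predictions
tn, fp, fn, tp = 4579, 260, 443, 1973
print((tn + tp) / (tn + fp + fn + tp))  # ≈ 0.9031, matching accuracy_score below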

Accuracy Score

from sklearn.metrics import accuracy_score

model_clf.fit(X_train,y_train)
ypred=model_clf.predict(X_valid)
accuracy_score(y_valid,ypred)
0.9031013094417643

Conclusion

In this follow-up blog, we compared four popular machine learning algorithms — Logistic Regression, Decision Tree, Random Forest, and XGBoost — for hotel cancellations prediction. While all four algorithms have their advantages, our analysis demonstrated that Random Forest emerged as the best-performing model in terms of prediction accuracy.

It is always advisable to experiment with multiple algorithms and conduct thorough evaluations to select the best model for a specific prediction task.

Thank you for reading up to this point! Please like this post if it helps, and follow me on LinkedIn here https://www.linkedin.com/in/lisa-asafo-adjei-377901196/

Phase 1…https://medium.com/@asafoadjeilisa/navigating-the-turbulent-waters-of-hotel-booking-cancellations-the-menace-solution-phase-1-9005c71cba23
