Learning Pitstop: Predicting hotel booking cancellations using Classification Techniques
Motivation behind our project
Classification, a form of supervised machine learning technique, is often used to get a predicted result from population data. There are many classification algorithms in machine learning such as Random Forest, K Nearest Neighbors, and XGBoost. Being new to the field of data science, our group has decided to delve deeper into the area of classification, and we have chosen to perform predictions on hotel booking cancellations.
We obtained the Hotel Booking Demand dataset from Kaggle, as it is well suited to building predictive models. The dataset was compiled by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019, and consists of hotel booking data and transactions from 2015 to 2017.
It contains booking information for a city hotel and a resort hotel, including when the booking was made, the length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.
Exploratory Data Analysis
According to D-Edge Hospitality Solutions, part of the Accor-owned hotel technology group, global hotel cancellation rates on bookings have reached 40% on average. Booking cancellations are always a headache for hotels, in turn causing hotels to lose profits. It is thus imperative from a cost-saving perspective to find out what causes hotel booking cancellations to rise, and how to mitigate this rise.
To better understand the dataset, we have come up with a list of questions to answer.
- How many bookings were canceled?
- Which month has the highest number of visitors?
- What is the monthly average daily rate per person over the year?
- Which country do most hotel visitors come from?
- Which customer type contributes to the most hotel booking cancellations?
- Which month has the highest number of cancellations?
Out of 119,390 hotel bookings, a total of 37.04% of bookings were canceled.
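The overall cancellation rate is a one-liner in pandas. A minimal sketch on a toy frame (the real dataset's `is_canceled` column is 1 for a canceled booking):

```python
import pandas as pd

# toy stand-in for the Kaggle dataset; 1 marks a canceled booking
df = pd.DataFrame({"is_canceled": [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]})

# the mean of a 0/1 column is the fraction of canceled bookings
cancel_rate = df["is_canceled"].mean() * 100
print(f"{cancel_rate:.2f}% of bookings were canceled")
```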
We identified that the highest number of bookings occurred in August, while the lowest number occurred in January.
We can observe a steady increase in bookings from January to August, with a slight dip in June and peak in August. This is followed by a sharp fall in bookings from August back to January.
The monthly average daily rate per person (ADR PP) closely follows the volume of bookings per month, with the highest ADR PP in August and lowest ADR PP in January.
Most of the hotel visitors are Europeans, with a majority coming from Portugal (PRT), followed by Great Britain (GBR), France (FRA), Spain (ESP) and Germany (DEU).
The largest percentage of customers who cancel bookings belong to the Transient customer type (Transient — when the booking is not part of a group or contract and is not associated with other transient bookings).
This is not surprising, given that transient customers also make up the largest percentage of guests in both hotels in this dataset.
August sees the highest number of cancellations. This is also unsurprising, given that the number of cancellations per month closely follows that of the volume of bookings per month.
We have utilized pandas for our data processing. The steps are as described below:
1. Checking and handling of missing and erroneous values
df.isna().sum().sort_values(ascending = False)
We observed that ‘company’, ‘agent’, ‘country’ and ‘children’ have non-applicable or missing values. Since the ‘company’ column is missing 112,593 values, or 94% of our dataset, we decided to drop it. Furthermore, we decided not to include the ‘agent’ and ‘country’ columns in our classification model, to improve generalisability.
As the ‘children’ column may be useful for predicting hotel cancellations (e.g. guests with children may be less likely to cancel hotel bookings), we decided to replace the null values with 0. We assume that if the values in the ‘children’ column were not indicated, the guests did not have any children.
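The missing-value handling above can be sketched as follows (toy data; the intermediate name `df2` is an assumption carried over from the later snippets):

```python
import numpy as np
import pandas as pd

# toy stand-in for the raw dataset; 'company' is mostly missing in the real data
df = pd.DataFrame({
    "company": [np.nan, np.nan, 40.0],
    "agent": [9.0, np.nan, 9.0],
    "country": ["PRT", None, "GBR"],
    "children": [np.nan, 1.0, np.nan],
})

# drop the mostly-missing 'company' column, plus 'agent' and 'country'
# which we exclude for generalisability
df2 = df.drop(columns=["company", "agent", "country"])

# assume a missing 'children' value means the guests had no children
df2["children"] = df2["children"].fillna(0)
```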
df2["meal"].replace("Undefined", "SC", inplace=True)

zero_guests = list(df2.loc[df2["adults"] + df2["babies"] == 0].index)
df2.drop(df2.index[zero_guests], inplace=True)
To standardize the dataset values, we replaced rows that contain “Undefined” values for the “meal” column with “SC”, as they mean the same thing (mentioned by the contributor of the dataset).
Additionally, certain bookings have no hotel guests at all, thus we drop the rows with zero guests.
2. Feature selection
A. Dropping irrelevant features
After inspecting the dataset and taking into account the correlation of each numerical feature with ‘is_canceled’, we feel that the ‘arrival_date_year’, ‘arrival_date_day_of_month’, ‘reservation_status’, ‘stays_in_weekend_nights’ and ‘reservation_status_date’ columns are also irrelevant to our analysis, and hence we dropped those columns.
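A sketch of the column drop described above (column names are from the dataset; the `df2` to `df3` naming is an assumption to match the later snippets):

```python
import pandas as pd

# minimal frame carrying the columns we drop plus two we keep
df2 = pd.DataFrame(columns=[
    "arrival_date_year", "arrival_date_day_of_month", "reservation_status",
    "stays_in_weekend_nights", "reservation_status_date",
    "lead_time", "is_canceled",
])

drop_cols = [
    "arrival_date_year", "arrival_date_day_of_month", "reservation_status",
    "stays_in_weekend_nights", "reservation_status_date",
]
df3 = df2.drop(columns=drop_cols)
```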
B. Filtering out the numerical features from the categorical features
numeric_features = df3.select_dtypes(include=['int64', 'float64']).drop(['is_canceled'], axis=1).columns
categorical_features = df3.select_dtypes(include=['object']).columns
Prior to the transformation of our data, we stored separate lists of our numerical and categorical columns using the pandas select_dtypes method.
3. Data transformation

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])
Next, we used the ColumnTransformer to apply transformations to the numerical and categorical columns. Sklearn Pipelines were used to simplify the transformation process.
The numeric transformer applies StandardScaler to normalize our numerical columns. Normalisation is necessary as the features have different ranges, for example, ‘lead_time’ has a range of 0 to 737 while ‘children’ has a range of 0 to 10.
The categorical transformer, on the other hand, utilizes OneHotEncoder to transform categorical values into numerical indicator columns. Encoding categorical features is necessary as machine learning models require all input and output variables to be numeric.
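To make the encoding concrete, here is a minimal OneHotEncoder illustration on a toy `meal` column (values are from the dataset; the variable names are made up):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# toy 'meal' column with three distinct categories
meals = np.array([["BB"], ["SC"], ["HB"], ["BB"]])

enc = OneHotEncoder(handle_unknown="ignore")
# each row becomes one 0/1 indicator per category,
# in sorted category order: BB, HB, SC
onehot = enc.fit_transform(meals).toarray()
```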
4. Split the dataset into train and test sets
We then proceed to split the dataset into training and testing sets for model building in the next steps. This split is important as it helps us detect over-fitting or under-fitting and lets us validate the model on unseen data, so that the patterns it learns generalise to similar datasets.
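A sketch of the split with scikit-learn's train_test_split (toy data; the 80/20 ratio, `random_state` and stratification are assumptions, but the `train_X`/`test_X` names match the later snippets):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy features and target; in the project, X would be df3 minus 'is_canceled'
X = pd.DataFrame({"lead_time": range(10), "adr": range(10)})
y = pd.Series([0, 1] * 5, name="is_canceled")

# hold out 20% for testing, stratified so both sets keep the class balance
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```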
5. Fitting Classifiers and Model Selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier

classifiers = [
    KNeighborsClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
]

for classifier in classifiers:
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', classifier)])
    pipe.fit(train_X, train_y)
    print("model score: %.3f" % pipe.score(test_X, test_y))
There are many classification algorithms for machine learning. We decided to make use of the following 4 algorithms to predict hotel cancellations: K-Nearest Neighbours, Random Forest, AdaBoost and Gradient Boosting.
Out of the 4 algorithms, Random Forest achieved the highest model score at 0.865.
6. Extracting Feature Importance
import eli5

pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('classifier', RandomForestClassifier(n_estimators=100))])
pipe.fit(train_X, train_y)

onehot_columns = list(pipe.named_steps['preprocessor']
                          .named_transformers_['cat']
                          .named_steps['onehot']
                          .get_feature_names(input_features=categorical_features))
numeric_features_list = list(numeric_features)
numeric_features_list.extend(onehot_columns)

eli5.explain_weights(pipe.named_steps['classifier'], top=50,
                     feature_names=numeric_features_list)
We made use of the eli5 library to investigate which features have the highest weightage or importance in our Random Forest model. From the above, we can observe that ‘lead_time’, ‘adr’, and ‘deposit_type’ are the top 3 features affecting hotel cancellations.
From the dataset, we found that 99% of customers who paid a non-refundable deposit canceled their bookings. However, this is unlikely to be true, as customers with refundable deposits should be the ones more likely to cancel. A 99% rate does not seem realistic, so we believe the data might be flawed in this aspect.
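This kind of sanity check is a simple groupby. A sketch on toy data (the real columns are `deposit_type` and `is_canceled`; the values and rates here are illustrative, not the dataset's):

```python
import pandas as pd

df = pd.DataFrame({
    "deposit_type": ["Non Refund", "Non Refund", "No Deposit", "Refundable"],
    "is_canceled": [1, 1, 0, 0],
})

# mean of the 0/1 target per deposit type = cancellation rate per type
cancel_by_deposit = df.groupby("deposit_type")["is_canceled"].mean()
print(cancel_by_deposit)
```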
Diversify Your Customer Mix
From our EDA we observed that a majority of the hotel customers were transient customers, with less than 5% of bookings from groups or contracts. While this might not be true for all hotels, hotels included in this dataset, specifically, should consider diversifying their customer mix by investing more resources in establishing contracts with event companies or attracting travel groups.
Similarly, the hotels should not rely only on Portuguese visitors and should explore strategies to attract more visitors from other countries such as France, Great Britain, Germany, and Spain, where they already have some form of presence, or even from countries outside of Europe.
Overall, diversifying their customer mix and offerings would better protect hotels from unsystematic risks.
From observing the monthly number of bookings across the year, hotels should find ways to increase sales and bookings in off-peak periods, such as January. Apart from seasonal pricing, these strategies can include, for example, short-retreat or staycation packages for local customers, special rates for loyal customers, or organizing conferences and special events such as weddings.
Hotels should also dive deeper into operational processes during their peak season, such as August, to ensure that there are no bottlenecks preventing them from maximising their occupancy rate during this season.
The bookings trends, combined with the booking cancellation predictions, can help hotels better forecast their occupancy rates in the different months, allowing for better cost control.
In addition, hotels can also make use of feature importance to select the most relevant features to predict hotel cancellations in the future. For example, from our analysis, the top 5 important features are lead_time, adr, deposit_type_Non_Refund, arrival_date_week_number, and total_of_special_requests.
If hotels are able to make use of factors that are within their control to implement measures such as improving their services or setting a favorable price, they will be able to reduce the frequency of booking cancellations.
Strategies to tackle booking cancellations
1. Using arrival_date_week_number
Hotels can look into the specific time periods when cancellation rate is high and price accordingly. During periods with high cancellation rates, hotels can oversell by implementing some time-related discounts or certain cancellation conditions to optimise the occupancy rate. Implementing this strategy would help to mitigate the consequences of last-minute cancellations.
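Identifying high-cancellation weeks can be sketched with a per-week groupby (toy data; the real columns are `arrival_date_week_number` and `is_canceled`, and the 50% threshold is an arbitrary illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "arrival_date_week_number": [1, 1, 1, 33, 33, 33],
    "is_canceled": [0, 0, 1, 1, 1, 0],
})

# cancellation rate per arrival week
weekly_rate = df.groupby("arrival_date_week_number")["is_canceled"].mean()

# flag weeks above an illustrative 50% threshold for pricing/overselling review
high_risk_weeks = weekly_rate[weekly_rate > 0.5].index.tolist()
```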
2. Using lead_time
Looking into the lead time can help to reduce hotel cancellations. Hotels can consider changing their booking conditions by setting a maximum advance-booking restriction. For instance, if guests can only book rooms at most 2 months in advance, long lead times are eliminated and, in turn, the number of cancellations should fall. However, it is important to note that setting such restrictions could have negative consequences on the hotel’s sales and bottom line.
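The lead-time effect can be inspected by bucketing lead times and comparing cancellation rates per bucket. A sketch on toy data (the bin edges and bucket labels are illustrative assumptions; 737 is the maximum lead time mentioned earlier):

```python
import pandas as pd

df = pd.DataFrame({
    "lead_time": [5, 20, 90, 200, 400, 600],
    "is_canceled": [0, 0, 1, 1, 1, 1],
})

# bucket lead times into short/medium/long and compare cancellation rates
bins = [0, 30, 180, 737]
labels = ["short", "medium", "long"]
df["lead_bucket"] = pd.cut(df["lead_time"], bins=bins,
                           labels=labels, include_lowest=True)
rate_by_bucket = df.groupby("lead_bucket", observed=True)["is_canceled"].mean()
```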
3. Using Total_of_special_requests
Hotels can take into account the total number of special requests to reduce the possibility of cancellations. They can do so by improving their customer service, such as being able to cater to most of the customers’ special requests.
We have made use of and learned various techniques throughout this project:
- Data Transformation using Pipeline, OneHotEncoding, StandardScaler and ColumnTransformer
- Building various classification models such as RandomForest, K-Nearest Neighbours, AdaBoost, GradientBoost
- Extraction of Feature Importance using the eli5 library
The project’s full code can be viewed via https://github.com/ZiRong27/TLGProject. The project is part of The Learning Pitstop Programme under Tech For She. The Learning Pitstop is a 4-week learning sprint for anyone to develop essential industry-ready skills in UI/UX, Front End Development, Data Analytics, Data Science, and more. Stay tuned for upcoming The Learning Pitstop programmes! For more information, follow our Medium blog at TechForShe, or follow us on Instagram/Facebook at @techforshe.
References
- EDA of bookings and ML to predict cancelations (Kaggle Notebook, Hotel Booking Demand dataset)
- Pros and cons of various Classification ML algorithms
- A Simple Guide to Scikit-learn Pipelines
- Global Cancellation Rate of Hotel Reservations Reaches 40% on Average (D-Edge research)
- 3 Ways to Encode Categorical Variables for Deep Learning (Machine Learning Mastery)
- A Super Quick Guide to Randomized (or Grid) Search with Pipeline
- Extracting Feature Importances from Scikit-Learn Pipelines