Exploratory Data Analysis of the Hotel Booking Demand with Python

Published in Analytics Vidhya, Apr 12, 2020

Dataset:

We will use the Hotel Booking Demand dataset from Kaggle.
You can download it from here:
https://www.kaggle.com/jessemostipak/hotel-booking-demand

This data set contains booking information for a city hotel and a resort hotel and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data.

We will perform exploratory data analysis with Python to get insights from the data.

We will try to answer the following questions:

How Many Bookings Were Cancelled?
What is the booking ratio between Resort Hotel and City Hotel?
What is the percentage of bookings for each year?
Which is the busiest month for hotels?
From which country do most guests come?
How Long Do People Stay in the hotel?
Which was the most booked accommodation type (Single, Couple, Family)?

After that, we will build a predictive model to predict whether a future booking will be canceled or not.

We will:

Perform the Feature Engineering to make new features
Perform the Feature Selection to select only relevant features
Transform the Data (Categorical to Numerical)
Split the data (Train Test Split)
Model the data (Fit the Data)
And finally, Evaluate our model

Let’s start:

Import Packages

First, import the necessary packages.

Now import and display the dataset
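As a rough sketch of this step, the snippet below reads a CSV with pandas and shows the first rows. The three inline rows are only a stand-in so the example runs on its own; with the real download you would pass the file path to pd.read_csv instead (the file name hotel_bookings.csv in the comment is an assumption about how the Kaggle file is saved).

```python
import io

import pandas as pd

# Stand-in for the downloaded file: a few rows in the same CSV layout.
# With the real data: data = pd.read_csv("hotel_bookings.csv")  (assumed file name)
sample_csv = io.StringIO(
    "hotel,is_canceled,adults,children,babies,country\n"
    "Resort Hotel,0,2,0,0,PRT\n"
    "City Hotel,1,1,,0,GBR\n"
    "City Hotel,0,2,1,0,\n"
)

data = pd.read_csv(sample_csv)

# Display the first rows and the overall size of the dataset.
print(data.head())
print(data.shape)
```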

Data Preprocessing

First copy the dataset, so our original dataset remains unchanged

1. Dealing with Missing Values

Check if our data contains any missing values

We have 4 features with missing values.

In the agent and company columns, we have an ID number for each agent or company, so we will just replace all the missing values with 0.

The children column contains the count of children, so we will replace all the missing values with the rounded mean value.

The country column contains country codes representing different countries. It is a categorical feature, so I will replace its missing values with the mode, the value that appears more often than any other. In this case, that is the country that appears most often.
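The three replacement rules above could look roughly like this in pandas. The small dataframe is made up for illustration; only the column names (agent, company, children, country) come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with the four columns that contain missing values.
df = pd.DataFrame({
    "agent":    [9.0, np.nan, 240.0, np.nan],
    "company":  [np.nan, 40.0, np.nan, np.nan],
    "children": [0.0, 2.0, np.nan, 0.0],
    "country":  ["PRT", "GBR", None, "PRT"],
})

# Count the missing values in each column.
print(df.isnull().sum())

# agent / company hold ID numbers, so 0 stands for "no agent / no company".
df[["agent", "company"]] = df[["agent", "company"]].fillna(0)

# children is a count: fill with the mean rounded to a whole number.
df["children"] = df["children"].fillna(round(df["children"].mean()))

# country is categorical: fill with the most frequent value (the mode).
df["country"] = df["country"].fillna(df["country"].mode()[0])

print(df)
```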

There are many rows that have zero guests, counting adults, children, and babies. These rows do not make sense.

We have 180 such rows, so we will just remove them.
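One way to find and drop such rows is a boolean mask over the three guest-count columns. This is a minimal sketch on made-up data, not necessarily the exact code behind the article.

```python
import pandas as pd

# Toy data: the second row has no adults, children, or babies.
df = pd.DataFrame({
    "adults":   [2, 0, 1, 0],
    "children": [0, 0, 1, 2],
    "babies":   [0, 0, 0, 0],
})

# A row has no guests when all three counts are zero.
no_guests = (df["adults"] == 0) & (df["children"] == 0) & (df["babies"] == 0)
print(no_guests.sum(), "row(s) without guests")

# Keep only the rows with at least one guest.
df = df[~no_guests].reset_index(drop=True)
```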

2. Converting Datatype

Let’s check the datatype of each column in our dataset.

We can see different data types for different columns.

There are some columns, like children, company, and agent, that are of float type but contain only integer values.

So we will convert them to the integer type.
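A short sketch of the conversion with astype, on illustrative values. Note that it only works once the missing values have been filled, because NaN cannot be cast to an integer.

```python
import pandas as pd

# Float columns whose values are all whole numbers.
df = pd.DataFrame({
    "children": [0.0, 2.0, 1.0],
    "company":  [0.0, 40.0, 0.0],
    "agent":    [9.0, 0.0, 240.0],
})
print(df.dtypes)

# Cast the three columns to integers in one call.
df = df.astype({"children": int, "company": int, "agent": int})
print(df.dtypes)
```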

Exploratory Data Analysis

Now let’s do the fun part, extract the information from our data and try to answer our questions.

1. How Many Bookings Were Cancelled?

Let’s write the function to get the percentage of different values.

This function takes a series or data frame column and returns two arrays:

x is our unique values
y is the percentage value of each unique value
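A function matching this description might look like the sketch below. The name get_count is the one used in this article; the body is an assumption built on pandas value_counts.

```python
import pandas as pd


def get_count(series):
    """Return (x, y): the unique values and the percentage of rows holding each."""
    # normalize=True gives fractions; multiply by 100 for percentages.
    percentages = series.value_counts(normalize=True) * 100
    x = percentages.index.to_numpy()
    y = percentages.to_numpy()
    return x, y


# Toy example: 5 bookings kept (0) and 3 canceled (1).
is_canceled = pd.Series([0, 0, 0, 1, 1, 0, 1, 0])
x, y = get_count(is_canceled)
print(x, y)
```

The values come back ordered from most to least frequent, which is the order value_counts uses by default.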

Now let’s use this function on our is_canceled feature and see the result

is_canceled has two unique values: 1 if the booking got canceled, else 0.

Now let’s plot this result. I will write another function to plot the diagram. The good thing about writing a function is that we can reuse the code again and again.

This function takes two arrays, x, and y and displays the required diagram. The default plot type is a bar plot, but it can also plot the line plot. Optional arguments can be given to display title and labels.
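Here is one possible shape for such a helper, written with plain matplotlib. The parameter names (x_label, y_label, title, kind) are hypothetical, chosen only to match the description above.

```python
import matplotlib.pyplot as plt


def plot(x, y, x_label=None, y_label=None, title=None, kind="bar"):
    """Draw a bar plot (default) or a line plot of y against x and return the Axes."""
    fig, ax = plt.subplots(figsize=(8, 5))
    labels = [str(value) for value in x]  # treat x values as categories

    if kind == "bar":
        ax.bar(labels, y)
    elif kind == "line":
        ax.plot(labels, y, marker="o")
    else:
        raise ValueError("kind must be 'bar' or 'line'")

    # Optional decorations.
    if x_label is not None:
        ax.set_xlabel(x_label)
    if y_label is not None:
        ax.set_ylabel(y_label)
    if title is not None:
        ax.set_title(title)
    return ax


ax = plot([0, 1], [62.5, 37.5], x_label="is_canceled",
          y_label="Bookings (%)", title="Booking cancellations")
```

Call plt.show() afterwards to display the figure when running outside a notebook.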

Now let’s call the function

Bookings got canceled 37% of the time, while guests checked in (did not cancel the booking) almost 63% of the time.

For further analysis, we will select only those bookings which did not get canceled.

2. What is the booking ratio between Resort Hotel and City Hotel?

Let’s answer another question, how many bookings were made for each type of hotel.

We can now reuse the functions that we created earlier. All we have to do is pass the dataframe column to the get_count() function and pass its result (the x and y arrays) to the plot function.

More than 60% of the bookings were for the City hotel.

3. What is the percentage of bookings for each year?

More than twice as many bookings were made in 2016 as in the previous year, but bookings decreased by almost 15% the following year.

Let’s separate it by the hotel and then plot the diagram. We will change our code to display the countplot.

Year-wise and Hotel-wise (side-by-side) comparison

4. Which is the busiest month for hotels?

To answer this question, we will select the arrival_date_month feature and get its value count. Now the resulting data will not be sorted according to month order so we have to sort it. We will make the new list with the names of months in order to sort our data according to this list.
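The sorting step can be sketched with reindex, which reorders the value counts to follow a list of month names. The sample series is made up; arrival_date_month in the real data holds full English month names like these.

```python
import pandas as pd

# Calendar order used to sort the counts.
month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Toy stand-in for the arrival_date_month column.
arrival_months = pd.Series(
    ["August", "July", "August", "January", "July", "August", "March"]
)

# Percentage of bookings per month, in descending order of frequency.
counts = arrival_months.value_counts(normalize=True) * 100

# Reorder to calendar order; months absent from the data get 0.
sorted_counts = counts.reindex(month_order, fill_value=0)

x = sorted_counts.index.to_numpy()
y = sorted_counts.to_numpy()
print(sorted_counts)
```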

We will draw a line plot to show the trend.

Line plot to show the monthly hotel booking trend

As we can see, most bookings were made from July to August, and the fewest bookings were made at the start and end of the year.

Let’s separate the data for each hotel type and then see the trend.

Line plot to show the monthly hotel booking trend (separate line for each hotel type)

We can see that the trends are similar, with small differences. The Resort hotel has more bookings at the start and end of the year, and fewer bookings in June and September.

5. From which country do most guests come?

To see the country-wise comparison, plot the country column. In the country column, we have codes for each country, like PRT for Portugal.

To get the country names we will use pycountry, a very useful Python package that is available on GitHub and PyPI.

We will use this package to get country names from country codes
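A minimal sketch of the lookup, assuming the codes are ISO 3166-1 alpha-3 codes such as PRT. The helper name country_name is hypothetical. It returns the code unchanged when the code is unknown or when pycountry is not installed, so the rest of the analysis keeps working either way.

```python
try:
    import pycountry
except ImportError:  # keep the sketch usable without the package
    pycountry = None


def country_name(code):
    """Return the country name for a three-letter code, or the code itself if unknown."""
    if pycountry is None:
        return code
    try:
        country = pycountry.countries.get(alpha_3=code)
    except LookupError:  # some older pycountry versions raise instead of returning None
        country = None
    return country.name if country is not None else code


codes = ["PRT", "GBR", "FRA"]
print([country_name(code) for code in codes])
```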

Portugal, the UK, France, Spain, and Germany are the countries most guests come from; more than 80% come from these 5 countries.

6. How Long Do People Stay in the hotel?

Most people stay for one, two, or three nights. More than 60% of guests fall into these three options.

Let’s see the stay duration trend for each hotel type.

Number of Nights People stay (For each hotel type)

For the Resort hotel, the most popular stay durations are three, two, one, and four days, respectively.
For the City hotel, the most popular stay durations are one, two, seven (a week), and three days, respectively.

7. Which was the most booked accommodation type (Single, Couple, Family)?

We will divide people staying in the hotel into 3 categories.

Single: 1 Adult only

Couple: 2 adults. We can’t say for sure whether these two people are an actual couple, since the data does not tell us anything about this, but we will assume they are a couple :P

Family or Friends: More than 2 people including adults, children, and babies. (or alternatively, we can call it a group)
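These three categories could be assigned with numpy.select, as in the sketch below. Two details are my assumptions rather than the article's: Single and Couple are taken to mean no children or babies, and bookings that fit none of the three definitions (for example 1 adult with 1 child) are labeled Other.

```python
import numpy as np
import pandas as pd

# Toy bookings covering each case.
df = pd.DataFrame({
    "adults":   [1, 2, 2, 3, 1],
    "children": [0, 0, 1, 0, 1],
    "babies":   [0, 0, 0, 0, 0],
})

no_kids = (df["children"] == 0) & (df["babies"] == 0)
total_guests = df["adults"] + df["children"] + df["babies"]

# Conditions are checked in order; the first match wins.
conditions = [
    (df["adults"] == 1) & no_kids,   # Single: 1 adult only
    (df["adults"] == 2) & no_kids,   # Couple: 2 adults only
    total_guests > 2,                # Family or Friends: more than 2 people
]
labels = ["Single", "Couple", "Family"]

# "Other" is an assumed catch-all for rows matching none of the above.
df["accommodation"] = np.select(conditions, labels, default="Other")
print(df["accommodation"].value_counts())
```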

Accommodation Type (Single, Couple, Family)

Couple (or 2 adults) is the most popular accommodation type, so hotels can make plans accordingly.

Feature Selection and Feature Engineering

Before we start making a predictive model, let’s do the feature selection and feature engineering. We will create more relevant features and remove irrelevant or less important features.

First, make a copy of the dataframe.

Feature engineering is a very important part and a very difficult one. Take some time and try to think about what new features we can create from our existing features.

Now let’s create some new features.

Our dataset has two features, reserved_room_type and assigned_room_type. We will make a new feature, let’s call it Room, which will contain 1 if the guest was assigned the same room that was reserved, else 0. A guest may cancel the booking if they did not get the room they reserved. Clever, right?
Another feature will be net_cancelled. It will contain 1 if the current customer has canceled more bookings in the past than the number of bookings they did not cancel, else 0.
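Both features reduce to a comparison between two columns. The sketch below assumes that the past booking history lives in the dataset's previous_cancellations and previous_bookings_not_canceled columns; the rows are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "reserved_room_type":             ["A", "A", "D", "E"],
    "assigned_room_type":             ["A", "C", "D", "E"],
    "previous_cancellations":         [0, 2, 1, 0],
    "previous_bookings_not_canceled": [3, 1, 1, 0],
})

# Room: 1 if the guest was assigned the room type that was reserved, else 0.
df["Room"] = (df["reserved_room_type"] == df["assigned_room_type"]).astype(int)

# net_cancelled: 1 if past cancellations outnumber past non-canceled bookings, else 0.
df["net_cancelled"] = (
    df["previous_cancellations"] > df["previous_bookings_not_canceled"]
).astype(int)

print(df[["Room", "net_cancelled"]])
```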

Now remove these unnecessary features

Let’s also remove reservation_status. Although it is a very important feature, it already contains the information about whether the booking was canceled. Furthermore, its value is only known after the booking was canceled or the guest checked in, so we won’t have it when predicting future bookings and it will not be useful in our predictive model.

Let’s plot the heatmap and see the correlation

We can see our new features, Room and net_cancelled, have a higher correlation with is_canceled than most of the other columns.

Modeling

1. Converting Categorical variables to Numerical

Let’s convert categorical values into numerical form.

We will use LabelEncoder from Sklearn to encode in an ordinal fashion.
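As an illustration, the loop below fits one LabelEncoder per categorical column and keeps the fitted encoders so the original labels can be recovered later. The column list and values are examples only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "hotel":     ["Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel"],
    "meal":      ["BB", "HB", "BB", "SC"],
    "lead_time": [342, 7, 13, 14],
})

# Columns to encode (listed by hand here; an illustrative subset).
categorical_columns = ["hotel", "meal"]

encoders = {}
for column in categorical_columns:
    encoders[column] = LabelEncoder()
    # Each distinct label is mapped to an integer, in sorted label order.
    df[column] = encoders[column].fit_transform(df[column])

print(df)
print(list(encoders["meal"].classes_))
```

Keeping the encoders around makes it possible to call inverse_transform on a column when the readable labels are needed again.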

2. Train Test Split

Now let’s split the dataset into train and test sets. The default split ratio is 3:1 (75% train, 25% test).
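A small sketch of the split with scikit-learn's train_test_split, using randomly generated data in place of the real features. The random_state value is arbitrary and only there to make the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared dataset (100 rows).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "lead_time":   rng.integers(0, 300, size=100),
    "Room":        rng.integers(0, 2, size=100),
    "is_canceled": rng.integers(0, 2, size=100),
})

# Features and target.
X = df.drop(columns="is_canceled")
y = df["is_canceled"]

# With no test_size given, 25% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape, X_test.shape)
```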

3. Machine Learning Model (Decision Tree)

We will use the decision tree as our predictive model. Let’s fit the data.

4. Evaluation of the Model

Let’s now evaluate our model. We will print the training and testing accuracy
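Fitting and scoring could be sketched as follows. The data is synthetic and the target is deliberately a simple function of one feature, so the accuracies printed here say nothing about the real dataset; the point is only the fit / score / predict calls.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic features; the target is made up purely for illustration.
rng = np.random.default_rng(0)
n_rows = 400
X = pd.DataFrame({
    "lead_time": rng.integers(0, 300, size=n_rows),
    "Room":      rng.integers(0, 2, size=n_rows),
})
y = (X["Room"] == 0).astype(int)  # pretend a changed room always means cancellation

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the decision tree on the training set.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# score() returns the mean accuracy on the given data.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Training accuracy:", train_accuracy)
print("Testing accuracy:", test_accuracy)

# Compare the prediction for one test row with its actual value.
sample = X_test.iloc[[0]]
predicted = model.predict(sample)[0]
actual = y_test.iloc[0]
print("Predicted:", predicted, "Actual:", actual)
```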

Aha! Almost perfect accuracy.

Let’s pick any random sample and try to make the prediction and compare it with the actual values

Predicted and Actual Value

Our model correctly predicted that the guest will not cancel the booking.

Conclusion:

We used the dataset that contains data about hotel bookings

We cleaned and preprocessed the data and then we performed the exploratory data analysis to extract information from the data to answer the following questions.

How Many Bookings Were Cancelled?
What is the booking ratio between Resort Hotel and City Hotel?
What is the percentage of bookings for each year?
Which is the busiest month for hotels?
From which country do most guests come?
How Long Do People Stay in the hotel?
Which was the most booked accommodation type (Single, Couple, Family)?

We learned that

Almost 37% of bookings were canceled.
More than 60% of the bookings were for the City hotel.
More than twice as many bookings were made in 2016 as in the previous year, but bookings decreased by almost 15% the following year.
Most bookings were made from July to August, and the fewest bookings were made at the start and end of the year.
Portugal, the UK, France, Spain, and Germany are the countries most guests come from; more than 80% come from these 5 countries.
Most people stay for one, two, or three nights.
-> For the Resort hotel, the most popular stay durations are three, two, one, and four days, respectively.
-> For the City hotel, the most popular stay durations are one, two, seven (a week), and three days, respectively.
Couple (or 2 adults) is the most popular accommodation type, so hotels can make arrangement plans accordingly.

We then performed feature selection and feature engineering, and then made the predictive model using the Decision Tree to predict whether our customer/guest will cancel the booking or not. And we achieved 99% accuracy.

You can download the entire source code and dataset from GitHub:
https://github.com/aaqibqadeer/Hotel-booking-demand