Hotel Reservations Data Study

4 min readJun 26, 2020

We always thought we knew what it meant to be in a ‘high season’ and ‘low season’ when going on vacation but are these terms real or is it just an invented term for businesses to be able to increase and decrease their prices as they wish.

I set out to prove with the power of data and the fresh skilled I learned from Mr. Galarnyk’s class that ‘high season’ and ‘low season’ might be myths creating by the industries.

The dataset was extracted from kaggle:
https://www.kaggle.com/jessemostipak/hotel-booking-demand

The data in the csv is explained here:
https://www.sciencedirect.com/science/article/pii/S2352340918315191

The data represents hotels reservations made in 2015, 2016, and 2017. You can find all the information that is provided in the data using the link provided by sciencedirect.

The first thing to do was to import the data and remove all of the null values by replacing them by an appropriate value so that we wouldn’t have any issues later while reading it.

hotel_bookings.isnull().sum()
hotel_bookings.children.fillna(0, inplace=True)
hotel_bookings.country.fillna(‘’, inplace=True)
hotel_bookings.agent.fillna(-1, inplace=True)
hotel_bookings.company.fillna(-1, inplace=True)
hotel_bookings.isnull().sum()

This code allowed us to check for null values and replace them and then check again to see if the null values were dealt with.

Reservation Periods

The first piece of data that was intereseting to obtain was the periods that people were reserving for, when did they want to go on vacation?

month_order = [‘January’, ‘February’, ‘March’, ‘April’, ‘May’, ‘June’, ‘July’, ‘August’, ‘September’,
‘October’, ‘November’, ‘December’]
# seperating the years 2015,2016,2017
bar_2015 = hotel_bookings.loc[hotel_bookings.arrival_date_year == 2015]
bar_2016 = hotel_bookings.loc[hotel_bookings.arrival_date_year == 2016]
bar_2017 = hotel_bookings.loc[hotel_bookings.arrival_date_year == 2017]
# Sorting based on month
bar_2015 = month_order_sort(bar_2015, month_order)
bar_2016 = month_order_sort(bar_2016, month_order)
bar_2017 = month_order_sort(bar_2017, month_order)
# counting the amount of reservations per month
bar_2015_months = month_counts(bar_2015.arrival_date_month)
bar_2016_months = month_counts(bar_2016.arrival_date_month)
bar_2017_months = month_counts(bar_2017.arrival_date_month)
# Plotting using 2015,2016,2017 in a bar plot graph
pd.concat([bar_2015_months.rename(‘2015’), bar_2016_months.rename(‘2016’), bar_2017_months.rename(‘2017’)], axis=1).plot.bar()

In 2015, most people reserved for the winter, or maybe the data didn’t show when people were reserving for the other seasons.

In 2016 and 2017, there were no trends, people did not prefer reserving at a specific period but there wasn’t a significant difference between each month.

Lead Time

Lead time represents how many days in advance people were reserving.

# initializing list
average_lead_time_2015 = []
average_lead_time_2016 = []
average_lead_time_2017 = []
# iterating through months to append to list based on month
for month in month_order:
average_lead_time_2015.append(bar_2015[‘lead_time’].loc[bar_2015.arrival_date_month == month].mean())
average_lead_time_2016.append(bar_2016[‘lead_time’].loc[bar_2016.arrival_date_month == month].mean())
average_lead_time_2017.append(bar_2017[‘lead_time’].loc[bar_2017.arrival_date_month == month].mean())

# converting to pandas Series
average_lead_time_2015 = pd.Series(average_lead_time_2015, index = month_order)
average_lead_time_2016 = pd.Series(average_lead_time_2016, index = month_order)
average_lead_time_2017 = pd.Series(average_lead_time_2017, index = month_order)
# plotting
pd.concat([average_lead_time_2015.rename(‘2015’),
average_lead_time_2016.rename(‘2016’),
average_lead_time_2017.rename(‘2017’)],axis=1).plot.bar()

We can obviously see that the industries myth about there being a high season is very effective since the graph shows people reserving up to an average of 120 days for the high season (summer and october and september) where as in january the average is around 30–40.

Reservations Canceled

Amount of reservations canceled based on the month the reservation was for. We often think that by reserving beforehand, we save money on hotel and flight fees. I wanted to see if the amount of reservation that were canceled were in the ‘high season’ period.

# initializing list
canceled_2015 = []
canceled_2016 = []
canceled_2017 = []
# iterating through months to append to list based on month
for month in month_order:
canceled_2015.append(bar_2015[‘is_canceled’].loc[bar_2015.arrival_date_month == month].sum())
canceled_2016.append(bar_2016[‘is_canceled’].loc[bar_2016.arrival_date_month == month].sum())
canceled_2017.append(bar_2017[‘is_canceled’].loc[bar_2017.arrival_date_month == month].sum())

# converting to pandas Series
canceled_2015 = pd.Series(canceled_2015, index = month_order)
canceled_2016 = pd.Series(canceled_2016, index = month_order)
canceled_2017 = pd.Series(canceled_2017, index = month_order)
# plotting
pd.concat([canceled_2015.rename(‘2015’),
canceled_2016.rename(‘2016’),
canceled_2017.rename(‘2017’)],axis=1).plot.bar()

Again the trend is obvious here, people tend to cancel their reservations for april, may, june, july. The ‘summer high season’ period. Maybe the industry also make their money on cancellation fees.

Conclusion

This data study wasn’t everything I expected, I expected a more significant difference between the low season and the high season. Nonetheless, I was able to redeem the hypothesis I initially set by looking at the Lead time and Reservations canceled. These 2 bar graphs showed obvious trends that prove that people do get worried about the high and low season and try to reserve a long time in advance to get the best prices. They also end up cancelling these reservations and never get to benefit from the high lead time before the reservation date.

Hotel Reservations Data Study

Reservation Periods

Lead Time

Reservations Canceled

Conclusion

Written by Ralph K.