Building a hotel recommendation model

Adam
Aug 19, 2020


Problem Description

In this problem, Expedia has asked us to make hotel recommendations for its users. The data consists of 37 million clicks and bookings by 1.2 million unique users between 2013 and 2014. When a user is ready to book a hotel, we want to be able to predict which hotel the user will book. This would allow Expedia to highlight the hotel it thinks the user will book, target email marketing, and so on. Kaggle sponsored this dataset and problem.

In the first section, I explain the data in more detail. In the second section, I explore the data. In the third section, I show how I modeled the problem and what the results were.

Data Background

We have the following data:

  1. What the user searched for, i.e., what location, what timeframe, how many kids, etc.
  2. The environment the user searched in, i.e., whether it was on a mobile device, which Expedia site the user was on, etc.
  3. Details about the user: where they are located and how far they are from the hotel being searched for.

The response variable we are trying to predict is hotel_cluster. This variable represents a hotel; to make the problem easier, Expedia has clustered the hotels into 100 groups of similar hotels across attributes such as price, distance from the city center, etc.

Since the data consists of both clicks and bookings, we only want to predict hotel_cluster for booking events.

Data Exploration

User exploration

When a user books a hotel, we might be interested in how much activity the user had prior to the booking. The following plot shows that users who book have more prior click activity than those who do not.

Number of clicks in users who book vs those who don’t

We can see that about 80% of users who did not book had between 0–20 click events, compared to about 40% of users who did book. As the number of click events increases, users who did book make up a larger portion of the distribution. Therefore we can say that, in general, users who book have more click events than users who do not.
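As a rough sketch of how this comparison can be computed with pandas (assuming the raw events are in a DataFrame df with the competition's user_id and is_booking columns, where is_booking is 0 for a click and 1 for a booking; the function name is mine):

    import pandas as pd

    def click_distribution_by_booker(df: pd.DataFrame) -> pd.Series:
        # Per-user click count and whether the user ever booked.
        per_user = df.groupby("user_id")["is_booking"].agg(
            n_clicks=lambda s: (s == 0).sum(),
            ever_booked="max",
        )
        # Bucket the click counts so the two groups compare as distributions.
        buckets = pd.cut(per_user["n_clicks"],
                         [0, 20, 40, 60, 80, float("inf")], right=False)
        counts = per_user.groupby(["ever_booked", buckets]).size()
        # Normalize within each group so the shares sum to 1 per group.
        return counts / counts.groupby(level=0).transform("sum")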

In the following plot, we look at the distribution of number of times a user makes a booking within the timespan of the data (1 year).

Distribution of number of bookings by user

We can see that ~40% of users make just 1 booking, about 30% make 2–3 bookings, and ~30% make 4 or more.

Next, we look at the distribution of distance from the user to the hotel being searched for.

Distribution of origin to destination distance

We can see that generally, users are relatively close to the hotel they are searching for, with a long tail of people relatively further away. The distribution is slightly bimodal, with a peak around 1,000 miles, a drop at 3,000 miles, and another slight peak around 5,000 miles. This may mean people search either close to home or far away, but not as much in between.

We can also see that users are more likely to book if they search for a hotel closer to where they are.

Distribution of mean distance to destination by is_booking

In the above plot, we compare the mean distance to all hotels a user searched for, split by whether or not the user made a booking. We can see that a much higher percentage of searches close to the origin resulted in a booking. Between 1,000–2,000 miles a higher percentage resulted in no booking, while beyond that bookings and non-bookings are about even.

Hotel exploration

In the following plot we look at each hotel type’s popularity in terms of click rate (the hotel’s clicks as a fraction of all clicks), book rate (the hotel’s bookings as a fraction of all bookings), and conversion rate (the hotel’s bookings as a fraction of its clicks). The plot is sorted descending by book rate.

Distribution of hotels by popularity

We can see some interesting things. Hotel 91 (the first bar) has higher rates than all other hotels. Generally, a higher click rate correlates with a higher book rate and conversion rate. There is one interesting outlier, hotel 65, which has a relatively low conversion rate despite a high click rate. This means that for this hotel type, many people click but few book.
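These three rates are straightforward to compute. A minimal sketch, assuming the events are again in a DataFrame df with hotel_cluster and is_booking columns:

    import pandas as pd

    def hotel_popularity(df: pd.DataFrame) -> pd.DataFrame:
        clicks = df[df["is_booking"] == 0].groupby("hotel_cluster").size()
        books = df[df["is_booking"] == 1].groupby("hotel_cluster").size()
        rates = pd.DataFrame({
            "click_rate": clicks / clicks.sum(),  # share of all clicks
            "book_rate": books / books.sum(),     # share of all bookings
            "conversion_rate": books / clicks,    # bookings per click for this hotel
        })
        return rates.sort_values("book_rate", ascending=False)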

In the next plot we look at the distribution of distances from hotels grouped by bookings.

Distribution of distance from hotel by is_booking

We can see a similar trend as described above: in general, the further away you are from the hotel, the less likely you are to book. An extreme example is the first hotel (27), for which the median distance of those who did not book is much greater than that of those who did. In general, the median distance across all hotel types is greater for those who don’t book than for those who do. There are a few exceptions: for hotels 86 and 44, for example, the users who do book are on average located further away than those who do not.

Finally, we look at book rates in those who want to stay in a hotel with kids vs those without kids.

Distribution of book rate in those with vs without kids

We can see that overall the book rate is higher for those without kids. There are a few exceptions, but overall this holds, and for some hotels the difference is quite large.

Modeling

The purpose of the model is to predict which hotel (out of the 100 hotel clusters) a user will book.

Performance metric

The offline performance metric used in this problem is MAP@K, which stands for mean average precision at K. It works as follows. There is a single correct hotel that the user booked. We provide up to 5 predictions, so k = 5, sorted from most confident to least. If the first prediction is correct, the precision is 1.0; if the second is correct, 1/2 = 0.5; if the third, 1/3 ≈ 0.333, and so on. If none are correct, the average precision is 0.0. We then take the mean of the average precisions across all records.

Here is the formal definition:

MAP@5 = (1/|U|) * Σ_u Σ_{k=1..min(5, n_u)} P_u(k)

where |U| is the number of bookings, n_u is the number of predictions made for booking u, and P_u(k) is the precision at cutoff k. Since each booking has exactly one correct hotel, this reduces to 1/rank of the correct prediction, or 0 if it is not in the top 5.
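Because of this reduction to 1/rank, the metric fits in a few lines of Python. A self-contained implementation matching the definition above:

    def map_at_k(actual, predicted, k=5):
        # actual: the true hotel_cluster for each booking
        # predicted: a ranked list of predictions for each booking
        total = 0.0
        for truth, preds in zip(actual, predicted):
            for rank, p in enumerate(preds[:k], start=1):
                if p == truth:
                    total += 1.0 / rank
                    break
        return total / len(actual)

    # Correct hotel at rank 1 and rank 3 -> (1.0 + 1/3) / 2 ≈ 0.667
    print(map_at_k([91, 48], [[91, 42, 5], [12, 7, 48, 3, 9]]))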

Evaluation

For all models, I evaluate performance using cross-validation: 5-fold cross-validation for the first 2 models, and 5-fold time-series validation for the last model.

Model 1: Baseline

The simplest method I could think of to predict which hotel a user will book is to look at hotel popularity. In particular, I looked at popularity by srch_destination_id, which can be thought of as a level in a taxonomy of locations; examples include “New York and vicinity”, “New York City”, and “JFK Airport”. I grouped by this variable because hotel popularity might differ between srch_destination_ids.

Popularity is defined as follows:

P = num_bookings + click_weight * num_clicks

It was defined this way in order to allow experimenting with weighting bookings differently from clicks.

For each srch_destination_id we predict the top 5 most popular hotels.

Cold start problem

Because srch_destination_id was used as a grouping, if a given srch_destination_id does not exist in the training set, then we cannot calculate popularity for it. To fix this, we simply predict the overall top 5 most popular hotels for any srch_destination_id that was not seen in training.
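A minimal sketch of this baseline in pandas, assuming a training DataFrame train with the competition’s srch_destination_id, hotel_cluster, and is_booking columns (the helper names are mine):

    import pandas as pd

    def fit_popularity(train: pd.DataFrame, click_weight: float = 0.05):
        # One event per row: a booking contributes 1 and a click contributes
        # click_weight, so the sum equals num_bookings + click_weight * num_clicks.
        scored = train.assign(
            score=train["is_booking"] + click_weight * (1 - train["is_booking"]))
        by_dest = (scored.groupby(["srch_destination_id", "hotel_cluster"])["score"]
                   .sum().reset_index())
        top5_by_dest = (by_dest.sort_values("score", ascending=False)
                        .groupby("srch_destination_id")["hotel_cluster"]
                        .apply(lambda s: list(s.head(5))))
        # Overall top 5 as the cold-start fallback for unseen destinations.
        global_top5 = list(scored.groupby("hotel_cluster")["score"]
                           .sum().nlargest(5).index)
        return top5_by_dest, global_top5

    def predict_top5(dest_id, top5_by_dest, global_top5):
        return top5_by_dest[dest_id] if dest_id in top5_by_dest.index else global_top5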

Results

We can experiment with different values of click_weight.

MAP@K for different values of click_weight

We can see that as click_weight increases, performance decreases a little. The best value of click_weight is near 0, with MAP@5 ≈ .328.

Model 2: Collaborative Filtering

Note: for this experiment, I randomly sampled 1% of users to reduce computation time and memory. This resulted in ~12,000 users and ~367,000 interactions.

The idea behind this model is that a user who clicks/books certain hotels is similar to other users who click/book similar hotels. Therefore, if user A clicks/books certain hotels that a similar user B has not clicked on, user B might like the hotels that user A has clicked/booked.

For this problem, we build a sparse user-by-hotel matrix of dimensions ~12,000 x 100. The matrix values encode how much a given user interacted with a given hotel, calculated the same way as in the previous model.

P = num_bookings + click_weight * num_clicks

We then use the NMF algorithm in sklearn, which produces 2 low-rank matrices W and H of rank K that, when multiplied together, approximate the original matrix. The low-rank matrices capture latent features in the original data, which lets us make predictions for unseen user/hotel combinations: we can get a prediction of how much a user will like any hotel by multiplying the matrices together.
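A sketch of the factorization step using sklearn’s NMF; the toy data here just stands in for the real ~12,000 x 100 interaction matrix, and the hyperparameter values are illustrative:

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    n_users, n_hotels = 1000, 100
    rows = rng.integers(0, n_users, size=5000)
    cols = rng.integers(0, n_hotels, size=5000)
    vals = rng.random(5000)  # stand-in for num_bookings + click_weight * num_clicks
    X = csr_matrix((vals, (rows, cols)), shape=(n_users, n_hotels))

    model = NMF(n_components=50, init="nndsvda", max_iter=400, random_state=0)
    W = model.fit_transform(X)  # n_users x K user factors
    H = model.components_       # K x n_hotels hotel factors

    scores = W @ H                             # predicted affinity per user/hotel pair
    top5 = np.argsort(-scores, axis=1)[:, :5]  # 5 best hotel clusters per user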

Cold start problem

The cold start problem occurs here because if a user/hotel is not in the training set but is in the test set, then we cannot make a prediction. To address this, we take the average over each column of the H matrix, which gives a popularity score for each hotel, and take the 5 columns with the highest values as the “most popular” hotels.
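Continuing the sketch above, this fallback can be read straight off H:

    # Average each column of H to get a popularity-like score per hotel cluster,
    # then take the 5 highest-scoring columns as the cold-start prediction.
    fallback_top5 = np.argsort(-H.mean(axis=0))[:5]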

Results

I evaluated different values of K as well as click_weight using 5-fold cross-validation.

Evaluation of collab filtering model

The results show that a higher click_weight of 1 is preferred, as well as a higher n_components of 50.

The performance is close to that of the baseline model, but actually not quite as good.

Model 3: Supervised Learning model

Whereas the previous models were unsupervised, this model is a supervised learning model. Neither of the previous models made use of all the additional data available: the search terms (when the requested booking is, how many adults, how many kids, how many rooms), when the search was made (day of week, month, season), and environment data such as whether the user is on a mobile device and how they landed on the site. This seems like valuable information to incorporate.

The approach I used was to model it as a multi-class classification problem where the classes are the hotels being booked. The features are described below at a high level. In addition, I engineered fairly sophisticated features representing a user’s historical interactions across hotels up until the time of booking.

Note: as before, I only used 1% of available users so that the problem would fit in memory and to speed up computation.

Feature Engineering

Historical interactions

The historical interaction features are described in more detail below. To calculate a single interaction with a hotel, I use the same equation as above:

P = num_bookings + click_weight * num_clicks

I keep a cumulative sum of all interactions up to, but not including, a new interaction. For example, if a user has the following interactions:

click

click

book

then with a click_weight of 1.0, the first interaction has a value of 0, the second 1, and the third 2.

To normalize the interaction values, I divide each value by the sum of all cumulative interaction counts across all hotels the user has interacted with. For example, if a user has interaction values of 1 for hotel 1, 2 for hotel 2, and 3 for hotel 3, the normalized interaction values would be 1/6, 2/6, and 3/6 respectively. In this way all interaction values lie in [0, 1].

We can also limit the cumulative sums to a given time frame. I tried limiting by “day”, which restricts the cumulative sum to only the day of the hotel booking.
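A sketch of these leak-free cumulative features in pandas, assuming the events are in a DataFrame df sorted by timestamp with user_id, hotel_cluster, and is_booking columns. The real feature set repeats this per-hotel score across all 100 clusters; here we compute it only for each event’s own hotel:

    import pandas as pd

    def prior_interaction_feature(df: pd.DataFrame, click_weight: float = 1.0) -> pd.Series:
        score = df["is_booking"] + click_weight * (1 - df["is_booking"])
        # Shift by one so the sum covers strictly earlier events
        # (up to, but not including, the current interaction).
        per_hotel = (score.groupby([df["user_id"], df["hotel_cluster"]])
                     .transform(lambda s: s.cumsum().shift(fill_value=0.0)))
        # Denominator: the user's total prior interactions across all hotels.
        total = (score.groupby(df["user_id"])
                 .transform(lambda s: s.cumsum().shift(fill_value=0.0)))
        return (per_hotel / total).fillna(0.0)  # a user's first event is 0/0 -> 0

Limiting to the day of booking then amounts to adding the event’s calendar date to the groupby keys, so the cumulative sums reset each day.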

Other features

I also engineered other features: for example, the month, day of week, and season in which the user is making the search, as well as the month and season the user wants to book in.
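A sketch of these calendar features, assuming df has the competition’s date_time (search timestamp) and srch_ci (requested check-in) columns:

    import pandas as pd

    def add_calendar_features(df: pd.DataFrame) -> pd.DataFrame:
        search = pd.to_datetime(df["date_time"], errors="coerce")
        checkin = pd.to_datetime(df["srch_ci"], errors="coerce")
        return df.assign(
            search_month=search.dt.month,
            search_dow=search.dt.dayofweek,
            search_season=(search.dt.month % 12) // 3,  # 0=winter ... 3=autumn
            stay_month=checkin.dt.month,
            stay_season=(checkin.dt.month % 12) // 3,
        )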

Evaluation

I evaluated the model using a time-series split in which the validation set always falls after the training set. This makes validation more realistic, since we only ever want to predict the future, not the past. I also used a held-out test set to evaluate performance. The train/test split is .67/.33.
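sklearn’s TimeSeriesSplit implements exactly this kind of split. A minimal sketch, where X and y stand in for the real feature matrix and hotel_cluster labels sorted by search timestamp:

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.random.rand(1000, 196)             # toy features, time-ordered
    y = np.random.randint(0, 100, size=1000)  # toy hotel_cluster labels

    tscv = TimeSeriesSplit(n_splits=5)
    for train_idx, val_idx in tscv.split(X):
        # Every validation fold falls strictly after its training fold.
        X_tr, X_val = X[train_idx], X[val_idx]
        y_tr, y_val = y[train_idx], y[val_idx]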

Results

I evaluated a random forest model using different hyperparameters. I also tried computing the historical interaction features over all time vs. just the day of booking. The following are the results.

Hyperparameter tuning random forest

I tried different values of max_depth and max_features. The model performed best with unlimited depth and max_features = sqrt(# features), which is 14 features considered at each split.
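A sketch of such a search, reusing X_tr/y_tr and X_val/y_val from the split above; the grid values and n_estimators here are illustrative rather than the exact settings used:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    for max_depth in [10, 20, None]:
        for max_features in ["sqrt", 0.5]:
            rf = RandomForestClassifier(n_estimators=100, max_depth=max_depth,
                                        max_features=max_features,
                                        n_jobs=-1, random_state=0)
            rf.fit(X_tr, y_tr)
            # Rank the classes by predicted probability and keep the top 5,
            # which can then be scored with the map_at_k function from earlier.
            proba = rf.predict_proba(X_val)
            top5 = rf.classes_[np.argsort(-proba, axis=1)[:, :5]]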

We can see that including only interactions from the day of booking significantly improves performance. This indicates that same-day interactions are highly indicative of a booking, whereas older history is much less so.

We can also see that the random forest overfits this problem and achieves nearly perfect performance on the training set.

The best CV performance is .81, while on the test set it is .82.

The following are the top 10 most important features using random forest feature importance metrics.

Feature importances

The numbered features are the historical user interactions with specific hotels, and several of these are important. Hotels 91 and 48 are the 2 most popular hotels, so their interaction features apply to the most records, which is why they rank highest. Other important features include which srch_destination_id a user is searching for, where the user is located, and how far away the user is from the hotel being searched for.
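This ranking is read straight off the fitted forest; a sketch, assuming rf is the trained model and feature_names lists its input columns in order:

    import pandas as pd

    importances = pd.Series(rf.feature_importances_, index=feature_names)
    print(importances.nlargest(10))  # the 10 most important features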

Conclusion

In this problem, I tried to solve the hotel recommendation problem for Expedia: predicting which hotel a user would book out of 100 different hotel clusters. Three different models were tried:

  1. Popularity-based
  2. Collaborative filtering
  3. Supervised learning

The supervised learning model performed by far the best, with a MAP@5 of .82 compared to .32 for the popularity-based model. This is probably because it considers the recency of interactions, which highly impacts performance, as well as many other factors, such as where the user lives and how far the searched hotel is from the user.

Further steps

The first further step to tackle is using the entire dataset for the last 2 models; I could not fit the full problem into memory on my local machine. We could use a larger machine or a Spark cluster, for example.

We could also do an error analysis to see where the model does not perform well.

Finally, we could perform A/B tests on the models to confirm a statistically significant increase in book rate.

Code

Data

https://www.kaggle.com/c/expedia-hotel-recommendations/data
