New AirBnB User Booking Prediction — Using Machine Learning
One of the most valuable pieces of information for any company is knowing its users’ behavior. Good strategic decisions are often made based on knowing customers’ buying patterns and trends. Furthermore, the internet has connected the world, and our data is being shared and used for every marketing campaign on the web. Data has become the most valuable commodity of the 21st century.
In this post, I want to use machine learning to predict new AirBnB users’ behavior: classifying whether a user will make a booking within 5 days of signing up for an account. This blog explains the workflow of this project and some of the decisions that went into it. Enjoy!
You can find this project repo here.
I also hosted a flask app on Heroku, feel free to check it out here.
Problem Statement(s):
Can we use machine learning to uncover insights on new AirBnB users?
How can we use user behavior prediction to maximize revenue on a hypothetical marketing campaign?
TL;DR
We started with an imbalanced dataset that we found on Kaggle. We used ADASYN to oversample our training set and fed it into a basic logistic regression model. We chose the AUC score as our evaluation metric because we want to optimize the probability threshold according to our hypothetical business constraint. In conclusion, using our best performing model (CatBoost), we should advertise to users who have at least a 16% probability of making a booking to maximize net revenue. Along the way, we uncovered that users signing up through the web app and through specific signup flows convert at a much higher rate overall.
Here’s my approach:
- Data Preprocessing
- Model Training
- Model Evaluation and Metrics
- Takeaways and Recommendations
- Flask App
Data Preprocessing
I used an older dataset from Kaggle for this project. You can find the dataset here. The original Kaggle problem statement was predicting a new user’s first travel destination. I was more interested in whether a first-time AirBnB user will make a booking or not.
Here are some of the features I used:
- date_account_created: the date of account creation
- timestamp_first_active: timestamp of the first activity, note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
- date_first_booking: date of first booking
- gender
- age
- signup_method
- signup_flow: the page a user came to sign up from
- language: international language preference
- affiliate_channel: what kind of paid marketing
- affiliate_provider: where the marketing is e.g. google, craigslist, other
- first_affiliate_tracked: the first marketing the user interacted with before signing up
- signup_app
- first_device_type
- first_browser
Since we are deviating from the original Kaggle problem statement, we have to define our own output label. I decided that the positive class will be users who make a booking within 5 days of signing up for an AirBnB account; everyone else falls into the negative class. The number of users who make an account and also make a booking drops off significantly after 5 days, so I used 5 days as the class separation threshold.
Users who make a booking within 5 days of signing up are labeled as 1.
Users who do not make a booking within 5 days of signing up are labeled as 0.
Using this definition, we have about 40,000 users in the positive class (1) and 111,500 in the negative class (0). It is fairly common in practice for the positive class to be underrepresented; this is called an imbalanced dataset, and it is especially an issue in problems like fraud or other anomaly detection. A common technique to overcome imbalanced data is to oversample the training set. We never want to oversample the validation or test sets. For this project I used ADASYN, which is short for Adaptive Synthetic Sampling. After splitting the data into train, validation, and test sets, I applied ADASYN to the training set. We are now working with 66,880 positive and 69,685 negative labels. Finally, let’s get to modeling…
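For reference, here is a minimal sketch of the labeling and resampling steps, assuming a DataFrame `df` with the date columns listed above, features that are already numerically encoded (ADASYN needs numeric input), and an illustrative 60/20/20 split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import ADASYN  # pip install imbalanced-learn

# Label construction: positive class = booked within 5 days of signup.
days_to_booking = (
    pd.to_datetime(df["date_first_booking"])
    - pd.to_datetime(df["date_account_created"])
).dt.days
df["target"] = days_to_booking.between(0, 5).astype(int)  # NaN (no booking) -> 0

# Remaining feature columns are assumed to be numerically encoded already.
X = df.drop(columns=["target", "date_first_booking", "date_account_created"])
y = df["target"]

# Split BEFORE resampling: synthetic samples must never leak into
# the validation or test sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42
)

# Oversample the minority class on the training set only.
X_train_res, y_train_res = ADASYN(random_state=42).fit_resample(X_train, y_train)
```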
Model Training
It is best practice to always build a vanilla baseline model before applying any advanced machine learning algorithms. I used scikit-learn’s LogisticRegression to train a baseline model. Here are the results:
As you can see, our model does better at predicting the negative class (0) than the positive class (1) due to the imbalanced distribution. For this project, we want to maximize the AUC score with respect to the positive class (1). I will go into more detail on why we chose this metric in the evaluation section below.
I moved on to other classification algorithms: Naive Bayes, KNN, Decision Tree, Random Forest, XGBoost, and CatBoost. I picked the top 3 performing models based on AUC score and performed parameter tuning on each of them. I will leave the details of how to efficiently use GridSearchCV to tune the parameters on an imbalanced dataset for another post. Below are the results for the top 3 models with the highest AUC scores after parameter tuning.
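A sketch of that comparison loop, reusing the resampled training data and the untouched validation set from the snippet above (model settings here are defaults, not the tuned configurations):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
    "xgboost": XGBClassifier(eval_metric="logloss"),
    "catboost": CatBoostClassifier(verbose=0),
}

# Rank models by AUC on the (un-resampled) validation set.
for name, clf in models.items():
    clf.fit(X_train_res, y_train_res)
    probs = clf.predict_proba(X_val)[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_val, probs):.3f}")
```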
There are a few things to consider when picking between models with similar performance. I will go into more detail in the next section, model evaluation.
Model Evaluation and Metrics
As mentioned above, we are primarily focused on maximizing the AUC score. AUC stands for Area Under the Curve. It measures model performance across all possible probability thresholds. The default probability threshold for a classification algorithm is 0.50: if the model outputs a probability of 0.50 or greater, it predicts the positive class (1); otherwise, it predicts the negative class (0). Where we set our threshold is a constraint specific to the business case. Before we define the hypothetical business case, let’s take a look at other things we might consider during model selection.
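To make the threshold concrete, here is a minimal sketch (reusing a fitted classifier `clf` and the validation set from the snippets above); the 0.16 value anticipates the cost/benefit analysis below:

```python
# Probability of the positive class for each validation user.
probs = clf.predict_proba(X_val)[:, 1]

# Default decision rule: cut off at 0.50.
preds_default = (probs >= 0.50).astype(int)

# A business-driven rule simply moves the cutoff.
preds_custom = (probs >= 0.16).astype(int)
```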
Model interpretability is one of the most important things to consider when choosing a model. Ultimately, we want to be able to interpret the results of our model and make insightful recommendations about the problem. Of the three models (logistic regression, XGBoost, CatBoost), logistic regression is by far the most interpretable.
Model training speed may be another factor to consider. One might want the training speed to be as fast as possible if the model will be retrained often. Someone else might want the prediction speed to be as fast as possible because prediction latency directly affects a customer’s experience on the web. In our case, we are more interested in training time because we want to be able to update our model periodically as more data becomes available. Here are the training times for the top models:
Again, logistic regression wins on speed, but CatBoost isn’t terribly slow either.
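For reference, timings like these can be collected with a simple wrapper around `fit`, reusing the `models` dictionary from the comparison sketch above:

```python
import time

for name, clf in models.items():
    start = time.perf_counter()
    clf.fit(X_train_res, y_train_res)
    print(f"{name}: trained in {time.perf_counter() - start:.1f}s")
```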
Lastly, but most importantly, let’s connect the dots to our hypothetical AirBnB business problem.
How can we use user behavior prediction to maximize revenue on a hypothetical marketing campaign?
Our goal is to maximize revenue for a hypothetical marketing campaign. Let’s put some context around this problem. After some research, here are the estimates I came up with (they are by no means 100% accurate):
About 2.2 million new users are predicted to be using AirBnB in 2020. 23% of all first-time users are expected to make a booking (this matches the share of the positive class in our sample dataset). AirBnB generates about $32 of revenue per booking, given an average booking price of $160 per night (a 17% fee from bookers and 3% from hosts). I also estimated an average advertising cost of $5 per user. Do not get too hung up on where these numbers came from; the point is that we need business constraints like these to reach a reasonable conclusion. Using this information, we have the costs and benefits of our predictions per user.
By knowing the costs and benefits of our predictions, we can vary the probability threshold to maximize profit for an advertising campaign. Each probability threshold produces a different count of true positives, true negatives, false positives, and false negatives. We can then plot the probability threshold against the net revenue.
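Here is a sketch of that calculation under one plausible accounting: we pay $5 for every user we target (every predicted positive) and earn $32 for each targeted user who actually books (a true positive). The per-user figures are the rough estimates above, not exact numbers:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

REVENUE_PER_BOOKING = 32  # estimated AirBnB cut of an average booking
AD_COST_PER_USER = 5      # estimated advertising cost per targeted user

thresholds = np.linspace(0.01, 0.99, 99)
net_revenue = []
for t in thresholds:
    preds = (probs >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_val, preds).ravel()
    # Target every predicted positive; only true positives convert.
    net_revenue.append(REVENUE_PER_BOOKING * tp - AD_COST_PER_USER * (tp + fp))

best_t = thresholds[int(np.argmax(net_revenue))]
plt.plot(thresholds, net_revenue)
plt.xlabel("Probability threshold")
plt.ylabel("Net revenue ($)")
plt.title(f"Best threshold = {best_t:.2f}")
plt.show()
```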
It is clear that the CatBoost model yields the highest revenue for our hypothetical marketing campaign. This aligns well with our hypothesis that the largest AUC score will produce the best results. Even though CatBoost is slower to train than logistic regression, it beats logistic regression by $150,000 in total revenue. In the next section, I will discuss the importance of setting our probability threshold to 0.16 and what it means.
Takeaways and Recommendations
As mentioned above, we need to set the probability threshold at 0.16 to reach maximum revenue for our hypothetical campaign. This means that if our model predicts a new user has a 16% chance or higher of making a booking, it is worth targeting that user in our campaign.
Let’s look at the feature importances from our selected CatBoost model.
The downside of using a tree-based model is that we lose interpretability. Depending on which industry you are in, it is sometimes more important to be able to interpret your model than to maximize model performance. From the feature importances, all we know is that signup_flow, signup_method, and signup_app are important factors in predicting booking behavior.
To dive deeper, let’s look at the coefficients from the logistic regression model for high interpretability. It looks like they align well with the CatBoost model’s feature importances. With this, we can have some confidence in how each feature affects user behavior.
Let’s take ‘signup_app_Web’ for example. New users signing up through the web have roughly 6-to-1 odds of making a booking within 5 days of signing up for an account compared to users who sign up elsewhere, with other features held constant. There must be something about the AirBnB web experience that yields a higher conversion rate. Similarly, the same analysis applies to signup flows. A signup flow is essentially the sequence of pages a user sees from the initial encounter with the product to the signup page. Signup flows 24, 3, and 12 tend to do well.
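These odds ratios come straight from exponentiating the logistic regression coefficients. A sketch, assuming a fitted model named `logreg` and one-hot encoded training columns as in the earlier snippets:

```python
import numpy as np
import pandas as pd

# exp(coefficient) = multiplicative change in the odds of booking when the
# feature flips from 0 to 1, with other features held constant.
odds_ratios = pd.Series(np.exp(logreg.coef_[0]), index=X_train_res.columns)
print(odds_ratios.sort_values(ascending=False).head(10))
# e.g. an odds ratio near 6 for signup_app_Web is the "6-to-1" effect above
```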
The Flask App
If you’ve made it this far, I am proud of you! Below is a snippet of what the Flask application looks like. If you are interested, feel free to check it out for yourself! I will leave the details of how to create the Flask app for another day. I deployed the app on Heroku, a free hosting platform!
Another remark I should make about CatBoost is that, unlike other algorithms, CatBoost can take in features in their literal form. We do not need to one-hot encode categorical features, which helps simplify the backend Python script of the Flask app.
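As a sketch of what that looks like, using the column names from the feature list above (`X_train_raw` is a hypothetical un-encoded frame with missing categorical values filled in, since CatBoost expects strings or integers for categorical columns):

```python
from catboost import CatBoostClassifier, Pool

# Columns CatBoost should treat as categorical; no one-hot encoding needed.
cat_cols = [
    "gender", "signup_method", "signup_flow", "language",
    "affiliate_channel", "affiliate_provider", "first_affiliate_tracked",
    "signup_app", "first_device_type", "first_browser",
]

train_pool = Pool(X_train_raw, label=y_train, cat_features=cat_cols)
cb_model = CatBoostClassifier(verbose=0)
cb_model.fit(train_pool)
```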
Thank you for reading!
You can find this project repository here.
You can also find this post on my LinkedIn.