You Live You Learn

Owen Tran
Points San Francisco
4 min read · Jan 24, 2018

The title of this article comes from a song by a Canadian/American singer-songwriter (the great Alanis Morissette), and it fits our little San Francisco/Toronto team-up to apply a bit of machine learning to our hotel sort algorithm remarkably well.

Before

In the early days, the Points Travel hotel sort used a combination of the 3rd-party supplier’s priority ranking, the hotel’s star rating, the displayed average rate, the number of nights, and the number of points earned, each multiplied by a different weight, to calculate a score for each hotel.
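For illustration, here’s a minimal sketch of that style of weighted scoring; the weights, feature names, and numbers are all hypothetical stand-ins, not our production values:

# Hypothetical weighted-sum scoring, in the spirit of the original sort.
WEIGHTS = {
    "supplier_priority": 2.0,
    "star_rating": 1.5,
    "display_avg_rate": -0.01,  # cheaper hotels score a bit higher
    "nights": 0.1,
    "points_earned": 0.002,
}

def score_hotel(hotel: dict) -> float:
    """Weighted sum of a hotel's attributes."""
    return sum(WEIGHTS[k] * hotel.get(k, 0) for k in WEIGHTS)

hotels = [
    {"supplier_priority": 3, "star_rating": 4.0, "display_avg_rate": 120.0,
     "nights": 1, "points_earned": 2500},
    {"supplier_priority": 1, "star_rating": 3.5, "display_avg_rate": 95.0,
     "nights": 1, "points_earned": 4000},
]

hotels.sort(key=score_hotel, reverse=True)  # best score first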

Here’s a search against PointsHound for New York from 2/23–2/24:

This tended to favor hotels with high points per dollar (PPD), a favorite for the true points hounds out there, and sometimes uncovered hidden gems: hotels with a decent three-star rating or better, a big earn bonus in the thousands, and a rate of a little over a hundred bucks a night.

A/B Testing

Before we jump ahead, be sure you have a way to test your changes. An easy way to start is to pick a tool or library that supports A/B testing. We started with Optimizely, but it became cost-prohibitive once we wanted to make API calls, which required the “Enterprise” license.

Fortunately, we use Rails, so there’s a gem for it. We settled on a very lightweight gem called field test and enhanced it a bit to play nicely with the multi-tenant apartment gem.

Sanity Checks

We first ran an A/A test just to make sure the random distribution was working, and then tried a few tests with our existing sort algorithms to validate that our existing “weights” algorithm was better than the 3rd-party sorted results.
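Conceptually, the A/A sanity check boils down to something like the sketch below. This is not the field test gem’s internals; the hash-based bucketing and the chi-square check are assumptions for illustration:

import hashlib
from scipy.stats import chisquare

def assign_variant(user_id: str, variants=("a1", "a2")) -> str:
    """Deterministically bucket a user by hashing their id."""
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return variants[digest % len(variants)]

# Simulate an A/A test: both arms serve the identical sort.
counts = {"a1": 0, "a2": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}")] += 1

# A lopsided split would mean the bucketing itself is biased.
stat, p = chisquare(list(counts.values()))
print(counts, f"p-value={p:.3f}")  # p well above 0.05 -> split looks random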

Getting Fancy

Our data scientists up in Toronto decided to play with our marketing data, a dump of all hotel searches and click-stream data that is imported from AWS S3 into a Vertica cluster once a day. And being data scientists, their idea of playing is following a structured process called CRISP-DM (the Cross-Industry Standard Process for Data Mining).

This sums up the first few weeks of data sharing between the teams…

With both teams busy on different projects and priorities, we had to find a few hours here and there over a few months for the business understanding, data understanding, and data preparation steps. Some of the fun things you’ll discover are standardizing distances (why is the US still on the English system when the rest of the world is metric?), normalizing nightly rates and star ratings within a search so the data isn’t skewed by densely populated regions, and just making sure the right data is accurately collected.
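Here’s a sketch of that kind of preparation, assuming a pandas DataFrame with one row per hotel impression; the column names are illustrative:

import pandas as pd

df = pd.DataFrame({
    "search_id": [1, 1, 1, 2, 2],
    "total_cost": [150.0, 320.0, 95.0, 80.0, 110.0],
    "star_rating": [3.0, 4.5, 2.5, 3.0, 3.5],
    "to_city_miles": [2.1, 0.4, 5.8, 1.0, 0.7],
})

# Standardize distances to metric.
df["to_city_km"] = df["to_city_miles"] * 1.60934

# Normalize cost within each search, so a $300 room in Manhattan and a
# $300 room in a small town aren't treated as the same signal.
grouped = df.groupby("search_id")["total_cost"]
df["total_cost_stddev"] = (df["total_cost"] - grouped.transform("mean")) / grouped.transform("std")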

Modeling

A quick overview of the tools used:

  • Docker — runtime consistency across environments
  • Python — NumPy, SciPy, pandas, scikit-learn
  • Jupyter — browser-based interactive programming notebooks
  • GitLab — code versioning and version tags for releases

Step 1: Data preparation (pandas) and simple calculations

"features": {
  "adults": 2,
  "cost_per_star_stddev": 0.0799,
  "days_to_checkin": 31,
  "leisure": 1,
  "nights": 1,
  "ppd": 2.7701,
  "priority": 1.0324,
  "refundable": false,
  "to_airport_km": 4.6,
  "to_city_km": 17.6,
  "total_cost_stddev": -0.3921,
  "travelers": 2
},
"target": ["0 - not booked", "1 - booked"]

Step 2: Split into Train (67%) and Test (33%) Samples

A majority of the data is used to train the model, and a subset is held back to test the model’s accuracy. Here are some nice visuals of what under-fitting and over-fitting look like. We’re looking for the Cinderella graph of “Just right!”.
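In scikit-learn that split is a one-liner. A sketch, assuming df is the prepared feature frame from Step 1 with a 0/1 booked column as the target:

from sklearn.model_selection import train_test_split

# df is the prepared feature frame from Step 1 (assumed), one row per
# hotel impression, with a 0/1 "booked" column as the target.
X = df.drop(columns=["booked"])
y = df["booked"]

# 67% to fit the model, 33% held out to measure generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)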

Step 3: Decision Tree Algorithms

There’s a host of algorithms out there: CHAID, C&RT, C5.0, bagging, and boosting, all of which are far beyond the scope of this post. We applied a grid search to select the best model parameters and then used scikit-learn to generate a decision tree.
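A sketch of that search, continuing from the split above; the hyperparameter grid is illustrative, since the post doesn’t record the actual parameters tuned:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [50, 100, 500],
    "class_weight": [None, "balanced"],  # bookings are rare events
}

# Cross-validated grid search over the tree's hyperparameters.
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

model = search.best_estimator_
print(search.best_params_)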

Step 4: Evaluate the Model Performance
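The evaluation charts from the original post don’t survive here, but the checks amount to something like the following sketch. Since bookings are rare, raw accuracy alone would be misleading, so AUC and per-class precision/recall are better signals:

from sklearn.metrics import classification_report, roc_auc_score

# Score the held-out 33% the model never saw during training.
probs = model.predict_proba(X_test)[:, 1]  # P(booked)
print("AUC:", roc_auc_score(y_test, probs))
print(classification_report(y_test, model.predict(X_test)))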

Step 5: Publish a PMML (Predictive Model Markup Language)

RESTful APIs have JSON; machine learning models have PMML, which lets you build a model with your favorite language and tools (Python/scikit-learn) and then execute it in another environment (Rails).
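One common route from scikit-learn to PMML is the sklearn2pmml package; the post doesn’t name the exporter that was actually used, so treat this as an assumption:

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Wrap the fitted tree in a PMML-aware pipeline and serialize it.
pipeline = PMMLPipeline([("classifier", model)])
pipeline.fit(X_train, y_train)
sklearn2pmml(pipeline, "decision_tree.pmml")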

Step 6: Deploy into Production

Our Rails application uses the scoruby gem (the maintainer is super nice!) to read the PMML and do real-time scoring of the results. Each hotel is scored, and the entire search result set is then sorted by that score.

decision_tree = Scoruby.load_model 'decision_tree.pmml'  # load the PMML once at boot
features = { f1: v1, ... }  # same feature names the model was trained on
decision_tree.decide(features)

=> #<Decision:0x007fc232384180 @score="0", @score_distribution={"0"=>"0.999615579933873", "1"=>"0.000384420066126561"}>

Final Result

The weights made sense in isolation, but machine learning on actual conversion data surfaces a wider range of hotels that wouldn’t normally have been discovered in a result set of 500 hotels. As always, we’ll continue to live, to learn, to love, and to learn…
