Machine Learning ecosystem @Travel Triangle

Bipin Deep Singh
TravelTriangle

--

Travel Triangle (TT) is an online holiday marketplace that connects travelers to multiple local travel agents to help them create a customized and memorable holiday.

To facilitate this, the tech and analytics teams at TT have worked over the last few years to build a scalable data pipeline that collects data from various sources, validates it, and brings it all together in one place for analysis (detailed blog here). The next step was to build models on top of this data to help business teams make better and faster decisions.

Our key focus areas in building a robust machine learning ecosystem are:

  1. Model deployment: gluing models into existing production systems efficiently
  2. Automating repetitive tasks: having a framework in place that allows the data science team to work in a fail-fast manner

This blog will take you through a few of the solutions we developed as part of building this ecosystem.

Model Deployment

Our major challenge was deploying models in real-time. For models to add value to the business, they need to help with decision-making in real-time. We solved this in two stages.

Stage I

Our initial models used logistic regression to solve binary classification problems. To deploy them, we made use of a rule engine built by our tech team, which helped us launch the initial models quickly without building any new framework.
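
This works because a trained logistic regression model reduces to a single closed-form equation, which is exactly what a rule engine can evaluate. A rough illustration in Python (the feature names and coefficients below are made up, not our production model):

```python
import math

# Hypothetical coefficients exported from an offline logistic regression fit;
# the rule engine only needs to evaluate this closed-form equation.
INTERCEPT = -1.2
WEIGHTS = {"trip_budget_log": 0.8, "days_to_travel": -0.03, "is_repeat_user": 1.1}

def lead_conversion_score(features: dict) -> float:
    """Score a lead as sigmoid(b0 + sum(b_i * x_i))."""
    z = INTERCEPT + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

print(lead_conversion_score({"trip_budget_log": 4.6, "days_to_travel": 20, "is_repeat_user": 1}))
```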

Some use cases solved using this framework were:

  • Identifying leads that are more likely to convert at the time of lead creation, to segregate workable vs non-workable leads
  • Predicting how likely a customer is to answer our agents’ calls at any given moment

The limitations of this solution were:

  • The rule engine can deploy only equation-based models, limiting us to regression-based models
  • It requires the tech team to code each feature in the model individually, resulting in longer deployment times

Stage II

To overcome the limitations posed by the rule-engine based Stage I solution, we built a custom framework to deploy models in real-time.

One use case was dynamic lead scoring, which has to happen in real-time for agents to prioritize their leads effectively. We deployed a production-ready system in which lead scores are updated in real-time with every event update.

The solution was built on EMR, with event updates received via Kafka. Lead data till the previous day is extracted from the ETL layer in Redshift, and the event updates from Kafka are used to refresh the features in real-time. Pre-trained models stored in an S3 bucket are then used to recalculate the lead score, which is pushed to the DB via a Kafka event. This framework removed the Stage I restriction of using only equation-based models.
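
In outline, the real-time scoring loop looks something like the sketch below. This is a simplified, hypothetical Python version: the topic names, feature list, model path and the daily-feature lookup are placeholders rather than our production code.

```python
import json
import joblib
import boto3
from kafka import KafkaConsumer, KafkaProducer

# Illustrative feature list; the real models use many more signals.
FEATURE_NAMES = ["trip_budget", "days_to_travel", "agent_responses", "page_views"]

# Pre-trained model pulled from S3 (serialised offline with joblib).
boto3.client("s3").download_file("ml-models", "lead_scoring/model.pkl", "/tmp/model.pkl")
model = joblib.load("/tmp/model.pkl")

consumer = KafkaConsumer(
    "lead-events",                                   # hypothetical events topic
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def daily_features(lead_id):
    """Placeholder for the lookup of lead data till the previous day,
    extracted from the ETL layer in Redshift."""
    return {}

for message in consumer:
    event = message.value
    features = daily_features(event["lead_id"])
    features.update(event.get("features", {}))       # real-time update from the event
    row = [[features.get(name, 0.0) for name in FEATURE_NAMES]]
    score = float(model.predict_proba(row)[0][1])
    # Push the refreshed score back as a Kafka event so it ends up in the DB.
    producer.send("lead-scores", {"lead_id": event["lead_id"], "score": score})
```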

After evaluating data variance across multiple days, four gradient-boosted tree models were selected for conversion optimization:

  • D0 Model — Uses no interaction data and is optimized for static trip features, for early prioritization
  • D1-D2 — Uses initial user interactions to enable ops to rank leads better in the initial stages
  • D3-D4 — With some feedback from the customers by D3, higher precision can be achieved
  • D5-D6-D7 — Significant customer information has come into the system by D5, and the richer dataset yields high precision and high recall

The ensemble of these gradient-boosted trees is further fine-tuned to enhance performance. A shallow neural network with two hidden layers and sigmoid activations combines the outputs of the gradient-boosted models based on the number of days since the trip was activated. The boosted trees are also fine-tuned using negative feedback from the final conversion results.
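
A toy version of this combiner, sketched with scikit-learn, is shown below. The data is synthetic and the four classifiers merely stand in for the day-specific models, so treat it as an illustration of the architecture rather than the production implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic leads: 10 features, a binary conversion label, and the
# number of days since the trip was activated.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 10)), rng.integers(0, 2, size=500)
days_since_activation = rng.integers(0, 8, size=500)

# Stand-ins for the D0 / D1-D2 / D3-D4 / D5-D7 gradient-boosted models.
day_models = [GradientBoostingClassifier().fit(X, y) for _ in range(4)]

# Combiner input: each model's score plus the day index.
scores = np.column_stack([m.predict_proba(X)[:, 1] for m in day_models])
combiner_input = np.column_stack([scores, days_since_activation])

# Shallow network: two hidden layers with sigmoid ("logistic") activations.
combiner = MLPClassifier(hidden_layer_sizes=(8, 4), activation="logistic", max_iter=2000)
combiner.fit(combiner_input, y)
print(combiner.predict_proba(combiner_input[:3])[:, 1])
```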

Building models the fail-fast way

We are always looking for ways to improve the workflow of the data science team. To ensure the team can work on as many problems as possible, we needed a framework that can build and test initial versions of models quickly, to identify whether a problem is worth solving.

We tested a few third-party tools for this, key among which was AutoML by Google. Some of the issues we faced with it were:

  • Its solution for modeling structured data sets was still in beta
  • Long model build times (it took 14 hours to train a model on a 300 MB .csv file)
  • Adding new features to the model was very time-consuming

Since most readily available solutions weren’t in line with our fail-fast ideology, we built a library in R that quickly churns out classification models (most problems currently being worked on require classification models), doesn’t require extensive background knowledge of modeling techniques, and takes care of the basic good practices of building stable models.

Our solution took 10 minutes, compared to 14 hours on AutoML, to build a model on the same data set with comparable accuracy. This gave us the flexibility to run multiple variations in the same amount of time.

In order to test the effectiveness of this approach, we have started with functions to build logistic regression models. Future versions of this library will include more classification techniques.

To use this, all the user has to do is collect all the variables and run basic data sanity checks (check for missing values and take care of extreme values). Since the logistic regression functions don’t accept categorical predictors directly, these should be one-hot encoded, e.g. a gender variable with two levels, male and female, can be replaced with a single variable gender_male that takes the value 1 where the gender is male and 0 where it is female.
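
For illustration, the same encoding expressed in Python with pandas (the library itself is in R, where the equivalent can be done with model.matrix or a simple ifelse):

```python
import pandas as pd

# The gender example from above: replace the categorical column
# with a 0/1 indicator before fitting the model.
leads = pd.DataFrame({"gender": ["male", "female", "male"]})
leads["gender_male"] = (leads["gender"] == "male").astype(int)
print(leads)
#    gender  gender_male
# 0    male            1
# 1  female            0
# 2    male            1
```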

The functions in the library automate the following modeling steps:

  1. Removing variables with zero standard deviation
  2. Removing multicollinearity from the data: the threshold for acceptable multicollinearity is defined by specifying a maximum acceptable VIF (variance inflation factor); the default is set to 10
  3. Variable selection using a backward selection process based on variable significance; the default threshold is p-value ≤ 0.05 but can be modified by the user
  4. Selecting a classification threshold using the ROC curve and creating confusion matrices for the train and test data sets to compare accuracy

There is one overarching function that performs all the above steps and outputs the model object, the confusion matrix, a data frame comparing train and test accuracy parameters, and a data frame listing the variables removed during the selection process.
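
A rough Python equivalent of that wrapper is sketched below. The real library is in R, and the function and argument names here are invented; only the defaults of VIF ≤ 10 and p-value ≤ 0.05 come from the description above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import roc_curve, confusion_matrix

def build_logistic_model(train: pd.DataFrame, test: pd.DataFrame, target: str,
                         max_vif: float = 10.0, max_p: float = 0.05):
    """Hypothetical wrapper mirroring the steps listed above (numeric,
    one-hot-encoded inputs assumed)."""
    X_train, y_train = train.drop(columns=[target]), train[target]
    removed = []

    # 1. Drop variables with zero standard deviation.
    zero_sd = [c for c in X_train.columns if X_train[c].std() == 0]
    X_train = X_train.drop(columns=zero_sd)
    removed += zero_sd

    # 2. Iteratively drop the variable with the highest VIF until all VIFs <= max_vif.
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
            index=X_train.columns,
        )
        if vifs.max() <= max_vif:
            break
        worst = vifs.idxmax()
        X_train = X_train.drop(columns=[worst])
        removed.append(worst)

    # 3. Backward selection: refit and drop the least significant variable
    #    until every remaining p-value is <= max_p.
    while True:
        model = sm.Logit(y_train, sm.add_constant(X_train)).fit(disp=0)
        pvals = model.pvalues.drop("const")
        if pvals.max() <= max_p:
            break
        worst = pvals.idxmax()
        X_train = X_train.drop(columns=[worst])
        removed.append(worst)

    # 4. Pick a probability cut-off from the ROC curve (Youden's J statistic)
    #    and build confusion matrices for both the train and test sets.
    train_prob = model.predict(sm.add_constant(X_train))
    fpr, tpr, thresholds = roc_curve(y_train, train_prob)
    cutoff = thresholds[np.argmax(tpr - fpr)]

    X_test = test[X_train.columns]
    test_prob = model.predict(sm.add_constant(X_test, has_constant="add"))
    return {
        "model": model,
        "removed_variables": removed,
        "cutoff": cutoff,
        "train_confusion": confusion_matrix(y_train, train_prob >= cutoff),
        "test_confusion": confusion_matrix(test[target], test_prob >= cutoff),
    }
```

Calling build_logistic_model(train_df, test_df, target="converted") would then return the fitted model along with the removed variables, the chosen cut-off, and the train and test confusion matrices.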

Two more functions are available. One of them returns a data frame after removing the correlated variables. The other one builds models directly if the user has data ready to go.

Currently, the library supports only logistic regression. To make it more versatile, we plan to add more algorithms for both classification and prediction of continuous outcomes, along with options to build ensemble models.

When you deliberately build an open, collaborative culture, where people are inspired by and are working towards some common goals, amazing things will happen. — Suzi Edwards-Alexander, Global Head of Recruiting at ThoughtWorks

We at Travel Triangle strongly believe in and are passionate about contributing to open source. Following this philosophy, the library can be downloaded from git and will be available on CRAN soon.

Next steps

  • Improving our model deployment framework to reduce turnaround time and latency
  • Adding more features to the existing library and building additional tools to automate repetitive work
  • Building our data lake (work on this has already started) to streamline future data needs
