Getting into the Consumer’s Mind: Predictive Modelling with Contextual Data

Disha Mendiratta
AirAsia MOVE Tech Blog
6 min read · Oct 31, 2021


Close to 20 years ago, the sci-fi action flick Minority Report starring Tom Cruise captivated film-goers with the notion of predicting criminal activity before it happened. Today, the quest to understand the unknown remains, fuelled by exciting applications of data science and machine learning.

What does this mean for digital businesses and mobile-first consumers today?

Imagine this: in the past you have always watched action movies like Tom Cruise’s Mission: Impossible, but today you are watching a movie with your mom, so you search for family-friendly comedies. Would you like to see movie recommendations based on your recent searches? Or would you be happy spending time hunting for a good movie yourself?

Have you ever used Instagram Explore or Reels? Once you open Reels and start scrolling, you’ll find increasingly relevant content out of billions of options. Think about it: would you prefer to watch Reels based on what you just tapped on, or on something you were interested in ten days ago?

The most relevant content is thus a blend of recommendations based on historical interests & choices, geo-locations, supply & demand, and fresh content.

The above are examples of real-time predictions, where content is recommended based on the user’s live interactions.

Real-time predictions thus have multiple use cases. This article focuses on how we, at the Airasia superapp, display the most suitable content to those who engage with our platform.

Problem

Airasia superapp offers multiple products, from flights and hotels to home delivery of food and beauty products. We have carousels for each of these on our homepage. A static order of carousels cannot serve the differing needs of the millions of users who visit our platform every day.

A need thus arises for personalisation to show the Line of Business (LOB) carousels in the most relevant order, and recommend the right content to the user at the right time.

A user’s intent in visiting the website, based on historical bookings across all LOBs, is computed using the User Behaviour Model. These are batch predictions: the predictions are generated from a batch of historical data, stored in a database, and scheduled to refresh on a daily, weekly or quarterly basis.

The catch with batch predictions is that we miss the real-time interaction and engagement of users with our website, which might indicate the user’s purpose better. Thus, we take such interactions into account as real-time inputs for the User Context-Based Model.

The carousels on the homepage are ordered based on the output scores from this model which indicate the user’s interest in each LOB.

Carousels on the airasia homepage: What service will you use today?

Features

Features such as device capabilities (desktop/app), website interactions (which LOB the user clicked on, how many different locations the user searched for), the time of day the user visits the website, and the country from which the user has logged in give strong signals about a user’s current preference and help make better recommendations. After feature engineering, the data is used for training the model.

Modeling

As mentioned earlier, batch predictions (offline) are made from historical batch data and the resulting scores are stored in a database. In a real-time model (online), by contrast, predictions are made from inputs received at inference time. This adds complexity to the overall system: the features have to be inferred and then pre-processed into the same format used at training time before the model can produce an output. The model is exposed as a REST API.
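As a rough illustration of that pre-processing step, a real-time event has to be mapped onto the exact feature columns the model was trained on. The feature names and raw-event fields below are hypothetical, chosen only to mirror the examples later in this post:

```python
# Sketch: turning a raw real-time event into the feature vector the model
# saw at training time. All names here are illustrative, not production ones.
TRAINING_FEATURES = [
    "searched_domestic",
    "device_platform_mobileapp",
    "searched_on_friday",
    "country_thailand",
    "distinct_routes",
]

def preprocess(raw_event: dict) -> list[float]:
    """Map a raw interaction event onto the training-time feature order."""
    features = {
        "searched_domestic": 1.0 if raw_event.get("trip_type") == "domestic" else 0.0,
        "device_platform_mobileapp": 1.0 if raw_event.get("platform") == "mobileapp" else 0.0,
        "searched_on_friday": 1.0 if raw_event.get("day") == "Friday" else 0.0,
        "country_thailand": 1.0 if raw_event.get("country") == "TH" else 0.0,
        "distinct_routes": float(raw_event.get("distinct_routes", 0)),
    }
    # Emit values in the exact column order used during training.
    return [features[name] for name in TRAINING_FEATURES]

vector = preprocess({"trip_type": "domestic", "platform": "mobileapp",
                     "day": "Friday", "country": "TH", "distinct_routes": 2})
```

The key point is the fixed column order: if the online feature vector drifts out of sync with the training layout, the stored weights silently score the wrong features.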

When we choose models for final deployment, accuracy is our top priority. But another important factor comes into play when deploying an ML model in production: latency, the delay between a user’s action and the web application’s response to it. With batch predictions, the predictions are made offline and stored in a database, so latency is not much of a priority and we can focus fully on the accuracy of the model.

In the live model, however, latency is a key metric. Think of it this way: if the website load time increases because of the recommendation step, there is a high chance the user experience will suffer regardless of how good the recommendation is. To avoid this, a model capable of making fast predictions is required. Thus, we chose to power this recommendation engine with a lightweight model: logistic regression.

The problem statement is to predict which LOB a particular user is most likely to be interested in. Since we have multiple LOBs, this is a multi-class classification problem. We split it into multiple binary classification problems and fit a standard logistic regression model on each sub-problem. This is the “One-vs-All” technique, which requires training one model per class.
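One-vs-All can be sketched in a few lines with scikit-learn, whose `OneVsRestClassifier` implements exactly this strategy. The data below is synthetic, standing in for per-LOB click labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the LOB-click data: 4 classes, e.g.
# 0=Flights, 1=Hotels, 2=Deals, 3=Shop.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=6,
                           n_classes=4, random_state=42)

# One binary logistic regression is fitted per class (that class vs the rest).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print(len(ovr.estimators_))        # one fitted model per LOB
probs = ovr.predict_proba(X[:1])   # per-class probabilities for one user
```

Each of the four fitted estimators answers one binary question (“Hotels or not?”, “Deals or not?”, …), and the per-class probabilities are what the carousel ranking later consumes.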

The training data was highly imbalanced, as the newer LOBs such as hotels and deals had far fewer clicks than flights. A model trained on this data would be biased towards the majority class (Flights) and would not perform well on the minority classes (Hotels, Deals, etc.). This in turn would lead to mis-classification errors for minority classes, reducing the model’s predictive performance. Imbalance can be treated with various techniques, such as generating synthetic data, resampling, or trying different algorithms. Here, we fixed it with a model hyper-parameter: although logistic regression does not directly support imbalanced data, it has a class_weight argument which can be set to “balanced”. By default this is None, meaning all classes receive equal weight; when set to “balanced”, each class is weighted inversely proportional to its frequency.
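Concretely, the “balanced” heuristic assigns each class the weight n_samples / (n_classes × count(class)). A quick sketch on a synthetic imbalanced dataset (90 flights clicks vs 10 hotels clicks):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced labels: 90 "flights" clicks (0) vs 10 "hotels" clicks (1).
y = np.array([0] * 90 + [1] * 10)
X = np.random.RandomState(0).randn(100, 3)

# "balanced" weight per class = n_samples / (n_classes * count(class)):
# class 0 -> 100 / (2 * 90) ~= 0.556, class 1 -> 100 / (2 * 10) = 5.0
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(weights)

# The same heuristic applied inside the model itself:
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

The minority class thus contributes roughly nine times more per sample to the loss, counteracting the 9:1 imbalance without touching the data itself.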

Consider the model for predicting the user’s interest in the Hotels LOB. This model predicts Hotels vs. [Flights, Deals, Shop, etc.].

When we deal with imbalanced data, the F1 score is a more informative evaluation metric than plain accuracy.

The F1 score is the harmonic mean of precision and recall: F1 = 2 · precision · recall / (precision + recall).
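A small sketch of the metric, computed both by hand and via scikit-learn, on toy predictions for a minority class:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy predictions for the minority "Hotels" class (1 = Hotels).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)   # 2 TP / (2 TP + 1 FP) = 2/3
r = recall_score(y_true, y_pred)      # 2 TP / (2 TP + 1 FN) = 2/3
f1 = 2 * p * r / (p + r)              # harmonic mean of precision and recall

# The hand computation matches sklearn's f1_score.
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
```

Because both false positives and false negatives drag the harmonic mean down, a model cannot score well on F1 by simply predicting the majority class everywhere, which is exactly the failure mode accuracy hides.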

So, we prepared a dataset, handled the imbalance, and built the multiple models. But the question remains: which LOB is the user most interested in? To answer it, we look at the feature coefficients of each model.

Here are sample features with coefficients from a trained model for predicting a user’s interest in the Hotels LOB:

Hotel LOB Model — Sample Coefficients

Interpret the coefficients this way:

If the user searched for a domestic flight (searched_domestic is a categorical feature), then the contribution is

-0.037 × 1 = -0.037.

If the user searched on the mobile app (device_platform_mobileapp is a categorical feature), then the contribution is

1.7 × 1 = 1.7.

Consider the scenario of a user who searched for two different domestic flight routes on Friday via desktop from Thailand.

If we compute all the effects and add them up, we have -0.03 (Searched Domestic = Yes) + 1.7 (Mobile App = Yes) + 0.24 (Searched on Friday) - 0.22 (Thailand = Yes) + 0.48 (0.24 × 2; Two Distinct Routes) = 2.17.

We then add the intercept, also called the constant, which gives us 2.17 - 1.53 = 0.64 (the score).

Logistic Transformation

The score is passed through the logistic (sigmoid) function, Probability = 1 / (1 + exp(-x)), where x is the score (0.64).

Thus, Probability = 1 / (1 + exp(-0.64)) = 0.65 = 65%

The user thus has a 65% chance of being interested in Hotels.
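The arithmetic above can be reproduced in a few lines, using the per-feature contributions and intercept from the worked example:

```python
import math

# Feature contributions from the worked example above.
contributions = {
    "searched_domestic": -0.03,
    "device_platform_mobileapp": 1.7,
    "searched_on_friday": 0.24,
    "country_thailand": -0.22,
    "two_distinct_routes": 0.24 * 2,   # 0.24 per distinct route, 2 routes
}
intercept = -1.53

# Linear score, then the logistic (sigmoid) transformation.
score = sum(contributions.values()) + intercept   # 0.64
probability = 1 / (1 + math.exp(-score))          # ~0.65

print(round(score, 2), round(probability, 2))
```

This is all a logistic regression prediction is at serving time: a weighted sum plus an intercept, squashed through the sigmoid, which is what makes it cheap enough for real-time use.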

Using this method, the probability of interest in each of the other LOBs is also computed, and the carousels on the homepage are ranked accordingly.

The weights for each feature for all the LOBs are stored offline. The real-time inputs are pre-processed into the same format as at training time, then multiplied with the stored weights to compute a score per LOB. The LOB carousels are ranked based on these scores.
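Putting it together, the serving step reduces to a dot product per LOB followed by a sort. A minimal sketch, with illustrative weights only (the hotels row reuses the worked example; the others are made up):

```python
import math

# Offline-stored weights per LOB: one coefficient per feature plus an
# intercept. Values are illustrative, not production weights.
LOB_WEIGHTS = {
    "hotels":  {"coef": [-0.03, 1.7, 0.24, -0.22, 0.24], "intercept": -1.53},
    "flights": {"coef": [0.9, 0.1, 0.05, 0.3, 0.6],      "intercept": -0.4},
    "deals":   {"coef": [0.1, 0.4, -0.1, 0.05, 0.02],    "intercept": -1.1},
}

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

def rank_carousels(features: list[float]) -> list[str]:
    """Score every LOB against the real-time feature vector, highest first."""
    scores = {
        lob: sigmoid(sum(w * f for w, f in zip(m["coef"], features)) + m["intercept"])
        for lob, m in LOB_WEIGHTS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Real-time inputs, already pre-processed into the training-time feature order
# (domestic search, mobile app, Friday, Thailand, two distinct routes).
order = rank_carousels([1.0, 1.0, 1.0, 1.0, 2.0])
print(order)
```

Because the weights are precomputed offline, the only online work per request is a handful of multiplications and a sort over a few LOBs, which keeps the added page latency negligible.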

This is one of the many projects that we’ve embarked on to improve the user experience of our customers on our super app. Stay tuned for more!
