Universal Machine Learning approach for targeted marketing campaigns

Building a flexible ML framework enabling marketers to create limitless ML-powered targeted campaigns.

Andrey Krivonogov
Tinyclues Vision
14 min read · Mar 24, 2023


This article gives a high-level view of the topic; all technical details are discussed in the next blog post.

When building a Data Science-based product, many unexpected (and complex) challenges arise from the mix of real clients’ needs and computational constraints. At Tinyclues, we develop a platform that chooses the best target audience for a client’s marketing needs.

Those needs can be formulated as an offer that a client wants to promote and can vary significantly depending on the concrete use case.

Having flexibility in offer creation across a large number of clients implies important technical constraints. For example:

  • Clients should receive close-to-real-time results, even for complex offers (more about inference-time optimization in a dedicated Medium article)
  • Clients should be able to use our solution even for niche products with few existing data points (more on the cold-start challenge in a dedicated Medium article)
  • Client setups can vary significantly from each other in their data infrastructure and availability, so we need a generic framework with as few setup steps and parameters as possible (more about multi-tenant setups in a dedicated Medium article)

Additionally, we want to give our clients full freedom in defining offers.

For a retailer, an offer might be a specific SKU, brand, or custom collection (summer sale), whereas, for an airline, it might be a destination city or holiday period.

This article explores the complexity of that last point, that is, how to predict audiences for offers at various levels of granularity.

More specifically, we want to use an ML model to predict the propensity of a user to buy a certain offer:
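
In its simplest form (the notation here is ours, only to fix ideas), the quantity we want to estimate for every user-offer pair is

score(u, o) = P(user u buys an item matching offer o | u’s purchase history, o’s attributes)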

Such a formulation can be seen as a “transposed” Recommender System problem: whereas a typical recommender system returns products for users, we want to predict users for a given offer. But what exactly constitutes an offer?

What is an offer?

In general, an offer is nothing but an SQL filter over a transaction table, defined by the client on the platform at the moment of campaign creation.

Imagine that Client A, a classic retailer, has sent us a table of purchase transactions with the following columns:
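
(The table below is only a toy illustration with assumed column names and values, not Client A’s actual data.)

import pandas as pd

# Hypothetical purchase transactions for a retailer (illustrative only)
transactions_a = pd.DataFrame({
    "user_id":       ["u1", "u2", "u1", "u3"],
    "product_id":    ["p001", "p002", "p003", "p004"],
    "model":         ["Nike casual T-Shirt model 1", "Adidas sneakers model 2",
                      "Adidas T-Shirt model 3", "Nike casual T-Shirt model 1"],
    "brand":         ["Nike", "Adidas", "Adidas", "Nike"],
    "category":      ["T-shirts", "Sneakers", "T-shirts", "T-shirts"],
    "size":          ["M", "42", "L", "S"],
    "purchase_date": ["2023-01-10", "2023-01-12", "2023-02-01", "2023-02-03"],
})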

Here, product_id is a unique identifier (different sizes of the same product model get different identifiers), and products can generally be grouped by brand or category.

Client A could thus define offers such as:

brand = "Adidas" OR brand = "Nike"
brand = "Adidas" OR category = "T-shirts"
brand = "Adidas" AND category = "T-shirts"
model = "Nike casual T-Shirt model 1" OR model = "Adidas sneakers model 2"

Visualizing this gives us a glimpse into the problem:

We quickly notice that there are virtually endless ways to choose an offer, and we cannot know beforehand which ones the clients will choose.

Now consider Client B, a travel company, that has sent us a table with purchase transactions but with completely different columns:
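
(Again, the values below are a toy illustration, not Client B’s actual data.)

import pandas as pd

# Hypothetical travel bookings (illustrative only)
transactions_b = pd.DataFrame({
    "user_id":             ["u1", "u2", "u3"],
    "origin_city":         ["Paris", "Berlin", "Brussels"],
    "origin_country":      ["France", "Germany", "Belgium"],
    "destination_city":    ["New York", "New York", "Miami"],
    "destination_country": ["USA", "USA", "USA"],
    "travel_date":         ["2023-07-14", "2023-08-02", "2023-07-20"],
    "fare_class":          ["economy", "business", "economy"],
})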

Note that there is no unique item identifier in this case. We could create one artificially by combining all attributes, but it would end up with a unique value for almost every transaction. Client B’s offers might include:

destination_city = "New York" AND fare_class = "business"
(origin_country = "France" AND destination_country = "USA") OR (origin_country = "Belgium" AND destination_country = "USA")
MONTH(travel_date) = "July" OR MONTH(travel_date) = "August"

As such, we can see that the term “offer” is indeed quite broad.

Whereas a generic marketing campaign for a brand like Adidas might be targeted at millions of customers, an offer for a flight ticket between two cities on a specific date targets a much more select audience of buyers.

At Tinyclues, we have the ambition to build software that marketers can use to solve both problems with the same tool: marketers should be able to choose any level of granularity in how they define an offer and get out-of-the-box predictive marketing, no matter the offer selection.

Machine Learning Design Choices

Let’s consider different ML approaches that could be used to model propensity scores in such a context and discuss their tradeoffs.

Offer-specific model (learned on the fly)

Tinyclues’ first predictive engine followed a very natural idea: train an offer-specific binary classifier over a labeled user dataset (1 for the offer’s recent buyers, 0 otherwise). This approach faithfully captured the offer’s semantics and was relatively easy to implement.

Due to the vast number of offers a client can choose, we cannot precompute their scores in advance, so training for this approach has to happen on demand (at the moment of campaign creation, after a client chooses an offer). This also means that training becomes client-facing and must respect a tight time budget (leaving us less time to learn a complex model or validate its quality).
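
As a rough sketch of what this first engine boils down to (simplified; a generic classifier stands in here for the actual model), on-demand training labels users by the offer filter and fits a binary classifier on sparse user features:

from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

def train_offer_specific_model(user_features: csr_matrix, user_ids, offer_recent_buyers: set):
    """Fit a classifier for one offer at campaign-creation time.

    user_features: sparse user features (e.g. bag of previously bought product_ids)
    offer_recent_buyers: users who recently bought an item matching the offer's SQL filter
    """
    labels = [1 if uid in offer_recent_buyers else 0 for uid in user_ids]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(user_features, labels)
    return clf  # clf.predict_proba(user_features)[:, 1] gives the propensity scores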

The main conceptual difficulty here is that in many relevant cases, the number of recent offer buyers is quite low, which makes it difficult to build a robust classifier on top of very sparse user features (such as the list of product_id values a user has bought before).

Moreover, a real cold-start scenario can’t be addressed in the same framework (we will simply have no labels to learn on).

One could solve those problems by introducing additional systems that would:

  • Produce low-dimensional dense user features using unsupervised algorithms.
  • For a given offer, find a similar offer with enough recent buyers using some similarity measure (again learned in an unsupervised manner) or offer attributes.

But in the end, such a framework becomes hard to supervise and maintain due to many independent systems. So naturally, we want to have a single end-to-end model in which we could leverage modern deep learning techniques.

Hard choice indeed

Model for a subset of offer attributes

A more general approach than an offer-specific model would be a model that takes the attributes used in the offer’s definition as input.

For example, for client A, we can choose two offer attributes, brand and category, pick a model architecture of our liking (any deep learning Recommender System model can be used here, like DIN, xDeepFM, …), and fit this model on the complete transaction dataset using some user features and the chosen offer attributes (brand and category).

Now, when the client defines an offer as a combination of those attributes, brand = "x" AND category = "y", we apply the trained model to the user set with fixed values for the offer features: “x” for brand and “y” for category.
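
As an illustration of this inference step (the model interface below is an assumption, e.g. a deep learning model with named inputs, not our exact production API), the chosen offer values are simply broadcast over all users:

import numpy as np

def score_offer_for_all_users(model, user_features: dict, brand: str, category: str) -> np.ndarray:
    """Score every user for the offer brand = <brand> AND category = <category>."""
    n_users = len(next(iter(user_features.values())))
    offer_features = {
        "brand":    np.full(n_users, brand),     # same brand value for every user
        "category": np.full(n_users, category),  # same category value for every user
    }
    return model.predict({**user_features, **offer_features})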

Compared to the offer-specific models, training can be done asynchronously, so we have fewer constraints and can use more complex models and evaluate their quality.

Another advantage is the usage of offer attributes as features, which allows the model to share information between different offers. Typically, user-side features will be shared, so the model will likely learn better embedded representations for them and, consequently, better user-offer interactions.

In theory, we can train such a model for any choice of offer attributes, with a total of 2ⁿᵘᵐᵇᵉʳ ᵒᶠ ᵃᵗᵗʳⁱᵇᵘᵗᵉˢ models and then, at the inference time, choose a model corresponding to the client’s choice of the offer.

But it would also mean supervising and controlling the quality of a huge number of models. Additionally, the cold-start scenario remains a weak point for this approach.

We thus found that this approach, while useful as a predictive baseline to compare with our final implementation, was infeasible in practice. To allow true flexibility in offer creation, we need a unified model that uses all available offer attributes. Doing so also greatly simplifies the training process, as we’ll only have to train one such model.

Additionally, we take advantage of full data mutualization. That is, every offer will contribute to the learning of both user and offer features’ embeddings, allowing us to make good predictions for rare offers and to have a good baseline for the cold-start scenario. But how to use this model during inference?

Aggregation of scores

Let’s get back to our travel use case and consider real trips booked by customers, each with a defined origin, destination, date, and fare class. At inference time, a client can choose any offer definition; that is, client B can choose a travel destination regardless of its origin or even its date. How should one fill in the missing offer features to be able to apply the model?

Naturally, we can look in the transaction table at the items bought with a given destination value and the corresponding origin and fare class values. So an offer can be seen as a union of items already purchased in the past. More formally, if we suppose that each user wants to buy only one item, then for a given offer, we can just sum up the probabilities of items belonging to it:
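
Written out in our notation, for an offer O and the set of items i matching its SQL filter (under the single-purchase assumption above):

P(user u buys offer O) = Σ over items i in O of P(user u buys item i)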

So, for example, when predicting for client B whether a user is interested in going to New York (the offer is defined as destination_city = “New York”), we will sum the probabilities for this user to buy a ticket Paris → New York economy, Paris → New York business, Berlin → New York economy, Berlin → New York business, etc.

As this approach uses all offer attributes, we can finally address the cold-start scenario by relying on higher-level attributes (like brand for client A) when making a prediction for a rare value of low-level attributes (a rare product model, for example).

Despite its mathematical precision, this approach has two major limitations when implemented in practice. First, the probability formula relies heavily on model calibration: the scores must be as close as possible to the true probability distribution.

More importantly, the size of an offer can be large enough to make it computationally infeasible to apply a model on every item.

So, while this approach might work for small collections of items, such as a rare destination city or a retail sale on a few dozen fashion items, it quickly fails when we want to describe larger collections of items. In the example above, one can go to New York from hundreds of different cities, and this number will grow very fast when we take into account more offer attributes. In the retail case, we might have to aggregate hundreds of thousands of predictions to generate a score for a certain brand or category!

Aggregation of features’ embeddings

One can imagine several approaches to solve those challenges, but most require computationally heavy steps during inference.

We were thus seeking a form of internal aggregation of item features that would not depend on user features and, therefore, can be computed only once for all users:

The simplest example of such aggregation is averaging of features’ embeddings. More precisely, let’s consider the class of models that embed offer features independently into some vectorial space. Given an offer definition, we consider each feature as a bag of possible values it can take for items from the offer. For the example above, we will consider the following lists of values:

Then we embed each of those values independently and take the (possibly weighted) average of vectors, which gives an embedding vector for a feature we will use downstream in the model.
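
A minimal sketch of this averaging step, assuming a PyTorch embedding table and pre-computed integer ids for the feature values (both are assumptions for illustration, not the production code):

import torch

def offer_feature_embedding(embedding: torch.nn.Embedding, value_ids, weights=None) -> torch.Tensor:
    """Average (optionally weighted) the embeddings of all values a feature takes in the offer.

    E.g. for the offer destination_city = "New York", the bag for origin_city could be
    the ids of ["Paris", "Berlin", ...] and the bag for fare_class the ids of
    ["economy", "business"].
    """
    vectors = embedding(torch.tensor(value_ids))          # (n_values, dim)
    if weights is None:
        return vectors.mean(dim=0)                        # simple average
    w = torch.tensor(weights, dtype=vectors.dtype).unsqueeze(1)
    return (vectors * w).sum(dim=0) / w.sum()             # weighted average

# The per-feature vectors are concatenated and fed to the rest of the model,
# computed once per offer, independently of the number of users.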

The resulting concatenated vector indeed captures the offer’s semantics, in the same way a paragraph vector can be represented by the sum of the embeddings of its words in the classical Word2Vec setting.

Sketch of the final architecture with a single model and the averaging of offer features’ embeddings

Such an approach has the following advantages:

  • Calculating an embedding vector for an offer is very fast: some pre-aggregation is done in SQL, and vector averaging is a simple matrix multiplication, so the computation time doesn’t depend on the number of customers.
  • The embeddings we can now compute for any offer are interesting in themselves, allowing us to use offer similarities to help our clients better navigate their catalog.
  • At inference time, we need to apply the model only once per offer, keeping inference time reasonable.

Trade-offs summary

To summarize the system choices we had, let’s look again at their advantages and disadvantages:

That said, feature aggregation isn’t a silver bullet either, and depending on the model, it can give much worse predictive results than score averaging.

To avoid such problems, we use a more complex aggregation formula that will be learned as a part of a specific model.

What could go wrong with naive feature averaging?

To evaluate whether the feature-averaging approach generalizes well over all offer definitions, we ran a large-scale comparison of this model against baselines specialized in certain offer attributes (as in the second approach described above).

For many clients, we saw that a single model performed as well as the specialized models, but we also discovered some problematic cases: the more we tested complex, non-hierarchical datasets, the more examples we found where our model significantly underperformed the baseline.

Most of those cases are characterized by a pair of very highly correlated features, one on the user side and one on the offer side. This can force the model to rely essentially only on this correlation, resulting in poor predictions for offers defined with other attributes.

Let’s consider the example of retail client A. We will train a simple model that uses four offer features (brand, category, model, and size), and as user features, we will use purchase history (the lists of brands, categories, models, and sizes that a user bought in the past).

For any given user, the model quickly deduces the user’s size preferences and finds that size is a remarkably strong predictor of purchases. Naturally, this can lead to two types of problems:

  • First, during training, the model will see a pair of highly correlated features: the item’s size and the user’s history of previously bought sizes. Naturally, it is easy to predict that any given user will only buy an item with a matching size, and the model will give huge importance to this pair of features while almost never using other ones. That means that interactions of users with brands or categories are very poorly learned.
  • Second, when averaging size embeddings, we might obtain a large (because of the feature’s importance) vector pointing in a “random” direction that adds a lot of noise to our prediction. Moreover, semantically this vector doesn’t make any sense when compared to the vector of the user’s size history, so calculating interaction with the user side will only amplify this noise.
Averaging size vectors results in an unsuitable shift

So when the model aggregates inputs from different offer attributes, it will result in a suboptimal prediction for offers based on brand, category, or model.
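
A toy numerical illustration of this dilution effect (random vectors stand in for learned size embeddings; the numbers mean nothing beyond the contrast):

import numpy as np

rng = np.random.default_rng(0)
size_embeddings = rng.normal(size=(10, 32))                  # embeddings of 10 sizes (toy, random)
user_size_pref = size_embeddings[3]                          # a user who always buys size index 3

matching_item_score = user_size_pref @ size_embeddings[3]    # item of the user's size: strong signal
brand_offer_vector = size_embeddings.mean(axis=0)            # brand-level offer: all sizes averaged
brand_offer_score = user_size_pref @ brand_offer_vector      # the size signal is mostly washed out

print(matching_item_score, brand_offer_score)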

One can easily imagine other examples like this:

  • users’ preferred stores of a retail chain are likely to correlate with their zip code
  • travel discount cards (youth discount, senior discount) are implicitly explained by age groups.

In general, in the majority of our clients’ datasets, there are some non-hierarchical features that can lead to similar problems.

Going back to the measurements showing such problems, let’s look at how the single model with a simple architecture compares to a baseline specialized in only one offer attribute when evaluating offers defined with that attribute. The metric we report is the weighted macro AUC over the most popular offers defined with the chosen attribute (for more details on the evaluation protocol, refer to the article with full technical details and to the notebooks on public datasets):

Tinyclues client: online retail company || Tinyclues client: a store chain

We see that in all examples, there is a substantial difference between the single model we used and the mono-feature baseline.

Solution hint

The problem exposed above is clearly a learning problem: the learned embeddings are not well adapted to some offer definitions, and the interaction step only makes things worse. During our tests, we saw that simply switching to a more complex model, without taking embedding averaging into account in the training process, doesn’t help much: it can naturally increase AUC, but the gap with the mono-feature baseline stays the same or even grows (a more complex model can learn the problematic correlation even better).

To address this issue, we implemented a novel model architecture that learns on data with group-by-like augmentations resembling the offers that can appear at inference time, and that better takes into account averaged embeddings as well as their variance. This model is described in detail in the next article, along with the various benchmarks and ablation studies we conducted to make sure this approach works as expected.
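
To give a rough intuition of what a group-by-like augmentation can look like (a simplified sketch under our own assumptions; the actual architecture and training procedure are detailed in the next article), one can build training examples that already resemble client-defined offers:

import random
import pandas as pd

OFFER_COLUMNS = ["brand", "category", "model", "size"]  # hypothetical offer attributes

def sample_offer_like_group(transactions: pd.DataFrame) -> dict:
    """Turn a random item into an offer-like group of items.

    A random subset of offer attributes plays the role of the client's offer definition;
    the remaining attributes are kept as bags of values whose embeddings the model averages,
    exactly as it will have to do at inference time.
    """
    kept = random.sample(OFFER_COLUMNS, k=random.randint(1, len(OFFER_COLUMNS)))
    anchor = transactions.sample(1).iloc[0]                     # random item defining the group
    mask = (transactions[kept] == anchor[kept]).all(axis=1)     # items sharing the kept attributes
    group = transactions[mask]
    return {col: [anchor[col]] if col in kept else group[col].tolist()
            for col in OFFER_COLUMNS}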

Going back to the comparison with mono-feature baselines, we reran the benchmarks against baselines built with the new architecture (which, in general, has a better AUC than the simple model above), and we can see that the new model allowed us to almost completely close the performance gap seen before:

Tinyclues client: online retail company || Tinyclues client: a store chain

Conclusion

As we can see, our “transposed recommender system” setup brings with it some curious challenges that affect our machine learning design choices.

Providing our clients unlimited flexibility in defining their offers means that we need to adapt our ML system to guarantee high accuracy in all scenarios while respecting time constraints.

For this purpose, we designed one single model that handles all kinds of clients’ requests. This universal model automatically leverages data mutualization and, by aggregating offer features, provides consistently fast and reliable predictions.
