Optimizing Marketing Strategies using A/B Testing with Machine Learning (Part 2)

Published in

Novo Nordisk — AI and Analytics Centre of Excellence

11 min read · Aug 24, 2022

This is a continuation of the article “Optimizing Marketing Strategies using A/B Testing with Machine Learning (Part 1)”. It is highly recommended to read that article before this one.

The Machine Learning Approach for Uplift Modelling: Which Customers should be Targeted?

It is important to note that a good uplift model tries to estimate the “increase in probability of responding” when a customer is targeted, rather than “the probability of responding”.

Specific analytical tools are needed to find and predict the change in customer behaviour that results from being targeted, because the same customer cannot be both targeted and non-targeted, so you cannot directly compare the two outcomes for a specific customer. You therefore rely on the overall statistical similarity of the customers in your target and non-target groups. There are several different ways of modelling this with Machine Learning approaches:

The Two-Model Method
Lo’s and Lai’s Methods
Advanced Tree- and Ensemble based Methods

The goal is to be able to predict the true uplift for customers when targeted based on their characteristics such as age, purchase history, location etc. so you can rank your customers from most to least profitable when it comes to selecting marketing targets.

The best approach normally varies from use case to use case, so it is often necessary to experiment with several different approaches and pick the best one. In general, the advanced tree/ensemble methods have been shown to be slightly more accurate, but also more difficult to implement. This is not always the case, however.

Two-Model Approach

The two-model approach is also called the “naïve” approach. You simply build two machine learning models and combine them to predict the true targeting uplift. Any supervised model can be used — one for estimating the response rate / sales uplift for the targeted group (Mₜ) (using only the targeted customers) and one for the untargeted group (Mₙₜ) (using only the non-targeted customers). You then use the customer characteristics (e.g., location, age, purchase history, profession etc.) as model features and the A/B test results for response rate / sales uplift as target values and just like that you have two different models. The prediction of the “true uplift” (U) for a specific customer cₓ is then the difference between the two prediction results for that customer:

U = Mₜ(cₓ) - Mₙₜ(cₓ)

In other words, it is the prediction from the model for targeted customers minus the prediction from the model for non-targeted customers. The benefit of this approach is its simplicity, but it also comes with several drawbacks. First, each of the models only considers data for its own subgroup (targeted or non-targeted customers). This means that the models are optimized for estimating response rates for two separate groups of customers individually, and they are not built directly with the purpose of estimating true uplift. Secondly, both models need to be very accurate, as the errors and uncertainties in the models are amplified by taking the difference between the prediction results.
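The two-model idea can be sketched in a few lines of Python. This is a minimal illustration and not a production implementation: the "model" here is just the average response rate per customer segment, standing in for any supervised learner, and the segment names, the `ab_test` data, and all function names are hypothetical.

```python
from collections import defaultdict

def fit_rate_model(rows):
    """Stand-in 'model': mean observed response per customer segment.

    rows is a list of (segment, responded) pairs. Any supervised learner
    could be used here instead; this is only for illustration.
    """
    totals = defaultdict(lambda: [0, 0])  # segment -> [responses, customers]
    for segment, responded in rows:
        totals[segment][0] += responded
        totals[segment][1] += 1
    overall = sum(r for r, _ in totals.values()) / sum(n for _, n in totals.values())
    rates = {seg: r / n for seg, (r, n) in totals.items()}
    # Fall back to the overall rate for segments not seen during fitting.
    return lambda segment: rates.get(segment, overall)

# Hypothetical A/B test results as (segment, t, y):
# t = 1 for targeted customers, y = 1 if the customer responded.
ab_test = (
    [("young", 1, 1)] * 30 + [("young", 1, 0)] * 70
    + [("young", 0, 1)] * 10 + [("young", 0, 0)] * 90
    + [("senior", 1, 1)] * 20 + [("senior", 1, 0)] * 80
    + [("senior", 0, 1)] * 20 + [("senior", 0, 0)] * 80
)

# One model fitted on targeted customers only, one on non-targeted only.
model_t = fit_rate_model([(s, y) for s, t, y in ab_test if t == 1])
model_nt = fit_rate_model([(s, y) for s, t, y in ab_test if t == 0])

def predicted_uplift(segment):
    """U = M_t(c) - M_nt(c) for a customer in the given segment."""
    return model_t(segment) - model_nt(segment)

# Rank segments from most to least promising marketing target.
ranking = sorted({s for s, _, _ in ab_test}, key=predicted_uplift, reverse=True)
print(ranking)
```

In this toy data the "young" segment responds far more often when targeted, while the "senior" segment responds equally often either way, so only the former shows a positive predicted uplift.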

A more robust and consistent way of doing uplift modelling would be to build a single machine learning model with the purpose of predicting the true uplift directly — using a combined dataset of both targeted and non-targeted customers.

Lo’s and Lai’s Methods

As mentioned in the previous section, it would be beneficial to define one single machine learning model built directly for predicting true uplift using the full dataset of both targeted and non-targeted customers. This section introduces two different approaches to this problem: Lo’s and Lai’s methods (Lo (2002) and Lai et al. (2006)).

Lo’s Method:

This method takes a very direct approach to estimating the uplift: in addition to the standard customer features, a dummy variable (t) is introduced which indicates whether the customer is in the treatment group (1) or the control group (0), and a machine learning model is trained to predict whether the customer responded (y=1) or not (y=0), as illustrated in the table below.

The method was originally developed to be used with logistic regression but can be generalized to any supervised learning algorithm. This model allows for a straightforward calculation of the “true uplift” (U) for a customer as the difference in predicted response probability (or sales uplift) for t=1 and t=0 using the same machine learning model (M):

U = M(t=1) - M(t=0)

This gives you a direct measure of which customers have the highest predicted values for “true uplift”. You should strongly consider targeting the customers with the highest predicted values and not targeting the customers with the lowest values, unless there are good reasons to do otherwise. This prioritization of customers has the potential to boost your sales without increasing the spend on marketing efforts, by focusing the efforts on the most promising customers.

In some cases, the treatment variable (t) is not picked up by the machine learning algorithm. This can be due to two very different reasons: 1) it is truly not significant and your marketing efforts are not bringing any value, or 2) it is highly correlated with other parameters in your model. In the second case you might want to exclude some of those highly correlated parameters.

In Lo’s approach an additional set of features x*t is sometimes included in the model as well. These are simply all your normal features multiplied by t (which is 0 or 1 depending on whether the customer is being “treated” or not). This allows the model to more directly pick out effects related to the uplift of treated customers.
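As a rough sketch of Lo’s method with the interaction features x*t, the following Python code fits a single logistic regression on the combined data and scores the uplift as U = M(t=1) - M(t=0). It is illustrative only: the tiny gradient-descent fit stands in for whatever logistic regression implementation you normally use, and the feature `is_recent_buyer`, the toy counts in `cells`, and all function names are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(x, t):
    """Intercept, standard features x, treatment dummy t, and interactions x*t."""
    return [1.0] + list(x) + [float(t)] + [xi * t for xi in x]

def fit_logistic(rows, labels, lr=0.5, epochs=3000):
    """Plain batch gradient descent on the logistic loss (illustrative only)."""
    w = [0.0] * len(rows[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, y in zip(rows, labels):
            err = sigmoid(sum(wj * rj for wj, rj in zip(w, row))) - y
            for j, rj in enumerate(row):
                grad[j] += err * rj
        w = [wj - lr * gj / len(rows) for wj, gj in zip(w, grad)]
    return w

def predict(w, x, t):
    """Predicted response probability M(x, t) from the single model."""
    return sigmoid(sum(wj * fj for wj, fj in zip(w, features(x, t))))

def uplift(w, x):
    """U = M(t=1) - M(t=0) for the same customer features x."""
    return predict(w, x, 1) - predict(w, x, 0)

# Hypothetical A/B test with one feature x = [is_recent_buyer],
# given as (x, t, y, number of customers with that combination).
cells = [
    ([1.0], 1, 1, 8), ([1.0], 1, 0, 2),
    ([1.0], 0, 1, 4), ([1.0], 0, 0, 6),
    ([0.0], 1, 1, 2), ([0.0], 1, 0, 8),
    ([0.0], 0, 1, 2), ([0.0], 0, 0, 8),
]
rows, labels = [], []
for x, t, y, count in cells:
    rows += [features(x, t)] * count
    labels += [y] * count

# One model trained on treated and control customers together.
w = fit_logistic(rows, labels)
print(round(uplift(w, [1.0]), 2), round(uplift(w, [0.0]), 2))
```

Without the x*t terms, a logistic regression can only shift the log-odds by the same amount for every customer when t changes; the interaction terms let the estimated treatment effect differ between customer profiles, which is what the ranking relies on.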

Lai’s Method:

This method ties back to the four different types of customers defined in the beginning of this article — “Sure Things”, “Lost Causes”, “Persuadables” and “Do-Not-Disturbs” — with the aim of identifying “Persuadables” as this is the only group of customers benefiting from the marketing efforts.

The approach for doing that is first to split the customers into four different groups (g) based on what was actually observed in the A/B test:

Control Responders (CR): Non-target customers responding
Control Non-Responders (CN): Non-target customers not responding
Treatment Responders (TR): Target customers responding
Treatment Non-Responders (TN): Target customers not responding

The standard Lai’s approach is to define TR and CN as good targets (y’ = 1) because these two groups together contain all “Persuadables” and do not contain any “Do-Not-Disturbs”. In the same way, TN and CR are defined as bad targets (y’ = 0), as illustrated in the table below.

The problem has then been reduced to creating a binary classifier (M) to predict good or bad marketing targets. Alternatively, you can use a machine learning algorithm (M) to directly estimate probabilities for each of the four groups: TR, CN, TN, and CR. The uplift score (U) can then be defined as:

U = M(TR) + M(CN) - M(TN) - M(CR) = M(good) - M(bad)

These methods are more complicated to implement if your target variable (y) is continuous, as you have to define a threshold for positive response (corresponding to y=1) and non-response (corresponding to y=0). The approach is also not optimal in the sense that it does not exclude “Lost Causes” and “Sure Things” from the good target variable.

If the sizes of the control and treatment groups are different, you should instead investigate the generalized Lai’s approach as introduced by Kane et al. (2014).
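The standard Lai labelling described above can be sketched as follows. This is a simplified illustration under stated assumptions: the treatment and control groups are of equal size, the "classifier" is just the share of good targets per customer segment, and the segment names, data, and function names are all hypothetical.

```python
from collections import defaultdict

def lai_group(t, y):
    """Map treatment flag t and response y to TR, TN, CR or CN."""
    return ("T" if t == 1 else "C") + ("R" if y == 1 else "N")

def lai_label(t, y):
    """y' = 1 for good targets (TR, CN), y' = 0 for bad targets (TN, CR)."""
    return 1 if lai_group(t, y) in ("TR", "CN") else 0

def fit_good_target_model(rows):
    """Stand-in classifier: share of good targets per segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [good targets, customers]
    for segment, t, y in rows:
        counts[segment][0] += lai_label(t, y)
        counts[segment][1] += 1
    return {seg: good / total for seg, (good, total) in counts.items()}

def lai_uplift_score(p_good):
    """U = M(good) - M(bad), where M(bad) = 1 - M(good)."""
    return p_good - (1.0 - p_good)

# Hypothetical A/B test results as (segment, t, y), equal group sizes.
ab_test = (
    [("young", 1, 1)] * 30 + [("young", 1, 0)] * 70
    + [("young", 0, 1)] * 10 + [("young", 0, 0)] * 90
    + [("senior", 1, 1)] * 20 + [("senior", 1, 0)] * 80
    + [("senior", 0, 1)] * 20 + [("senior", 0, 0)] * 80
)

p_good = fit_good_target_model(ab_test)
scores = {seg: lai_uplift_score(p) for seg, p in p_good.items()}
print(scores)
```

With equally sized groups, the score for a segment works out to the treated response rate minus the control response rate, which is why the equal-size assumption matters for this simple version.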

Advanced Tree- and Ensemble based Methods

There is a whole family of tree- and ensemble-based methods designed to tackle this problem as well. A deep dive into this field is outside the scope of this article but might be addressed in another article in the future. Some examples are significance-based uplift trees (Radcliffe and Surry (2011), Hansotia and Rukstales (2002), Chickering and Heckerman (2000)), divergence-based uplift trees (Rzepakowski and Jaroszewicz (2012), Soltys et al. (2015)), uplift random forests (Guelman et al. (2012), Guelman et al. (2014)) and uplift bagging (Soltys et al. (2015)). Many of the advanced algorithms have been implemented in R (e.g., check out the tools4uplift package on CRAN (r-project.org) from Belbahri et al. (2019)).

Evaluation of the Uplift Models

In this section we will go through different ways to evaluate and visualize the results of the uplift model.

Uplift by Decile Graph

The uplift by decile graph is a popular visualization for getting an idea of the quality of your uplift model. All the observations/customers in both the treatment and control groups are combined, scored with the uplift model, and then ranked from high to low predicted uplift and split into deciles. The next step is to calculate the average actual increase in sales/response rate in the A/B test for both the treatment and the control group in each decile. A good uplift result is when the actual measured increase in the treatment group is higher than in the control group, as this means that the marketing campaign is working. A good model result is a bigger difference between the actual increase in sales/response rate for the treatment and control groups (the true uplift) in the leftmost deciles of the bottom figure below, as these deciles contain the best target customers according to the model and should therefore experience the highest increase compared to not being targeted. As there might be “Do-Not-Disturbs” among your customers, you can get negative true uplift values for the low-scoring customers. The figure below illustrates a promising uplift model.

It should be noted that for many use cases the ranking of customers is much more important than the actual uplift values predicted by the model.
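The calculation behind an uplift by decile graph can be sketched roughly as below. The function name, its arguments, and the synthetic data are assumptions made for illustration; the scores would normally come from your uplift model and the outcomes from your A/B test.

```python
def uplift_by_decile(scores, treated, outcomes, n_bins=10):
    """Actual uplift (treated mean minus control mean) per score decile.

    scores   : predicted uplift per customer (higher = better target)
    treated  : 1 if the customer was in the treatment group, else 0
    outcomes : observed response (0/1) or sales per customer
    Returns a list with one value per decile, best-scored decile first.
    A decile without both treated and control customers yields None.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        end = (b + 1) * len(order) // n_bins
        idx = order[start:end]
        t_out = [outcomes[i] for i in idx if treated[i] == 1]
        c_out = [outcomes[i] for i in idx if treated[i] == 0]
        if not t_out or not c_out:
            result.append(None)
            continue
        result.append(sum(t_out) / len(t_out) - sum(c_out) / len(c_out))
    return result

# Synthetic example: 200 customers, 10 treated and 10 control per decile.
# In every decile 2 of 10 control customers respond; the number of
# treated responders falls as the model score falls.
treated_responders = [8, 7, 6, 5, 4, 3, 2, 2, 1, 1]
scores, treated, outcomes = [], [], []
for n_resp in treated_responders:
    for k in range(10):  # treated customers in this decile
        scores.append(-len(scores))  # earlier customers get higher scores
        treated.append(1)
        outcomes.append(1 if k < n_resp else 0)
    for k in range(10):  # control customers in this decile
        scores.append(-len(scores))
        treated.append(0)
        outcomes.append(1 if k < 2 else 0)

deciles = uplift_by_decile(scores, treated, outcomes)
print([round(u, 2) for u in deciles])
```

In this synthetic case the actual uplift declines from the first to the last decile and turns slightly negative at the end, which is the pattern described above for a promising model with a few “Do-Not-Disturbs” among the low-scoring customers.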

Qini Curve

The Qini curve is constructed by first sorting the customers from highest to lowest predicted uplift according to the uplift model (as in the top figure below). The accumulated uplift (lower figure below) is then calculated, simulating the actual sales uplift if you roll out your marketing starting with the most promising customers and ending with the least promising ones, at which point 100 % of customers have been reached. The results can easily be compared to a baseline of random selection of target customers, where you do not start by targeting the most promising customers. If the cost of targeting customers with the marketing is known, a cut-off can easily be defined for “good targets”, where the sales uplift of the customers is higher than the marketing costs, as also illustrated below:

The Qini curve gives you a good picture of the additional sales you can achieve by selecting target customers using the uplift model compared to a random selection of target customers. The baseline can also be redefined as the accumulated sales uplift from the existing targeting strategy. The Qini curve also tells you how many customers it makes sense to target before your marketing expenses will be higher than your increase in sales, e.g., around 25 % of the customers in the figure above.

The Qini curve can also be used to calculate the “Qini measure”, which is a model performance metric calculated as the area between the Qini curve and the baseline. It tells you how much better you can do by targeting your customers according to your uplift model than by targeting your customers randomly or using the existing marketing strategy in your team.
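A minimal sketch of the Qini curve and the Qini measure is shown below. It uses one common formulation, in which the cumulative control responses are rescaled to the number of treated customers seen so far, and a random-targeting baseline drawn as a straight line to the final value; other variants and normalizations exist, and the function names and the eight-customer example are hypothetical.

```python
def qini_curve(scores, treated, outcomes):
    """Cumulative incremental responses when targeting by descending score.

    After the top-k customers: treated responses minus control responses,
    with control responses rescaled to the treated group size so far.
    Returns a list of length len(scores) + 1, starting at 0.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_t = n_c = 0
    resp_t = resp_c = 0.0
    curve = [0.0]
    for i in order:
        if treated[i] == 1:
            n_t += 1
            resp_t += outcomes[i]
        else:
            n_c += 1
            resp_c += outcomes[i]
        scaled_control = resp_c * n_t / n_c if n_c else 0.0
        curve.append(resp_t - scaled_control)
    return curve

def qini_measure(curve):
    """Area between the Qini curve and the random-targeting baseline.

    The x-axis is the fraction of customers targeted (0 to 1) and the
    baseline is a straight line from 0 to the final curve value.
    """
    n = len(curve) - 1
    diffs = [curve[k] - curve[-1] * k / n for k in range(n + 1)]
    # Trapezoidal rule over n equally wide steps of width 1/n.
    return sum((diffs[k] + diffs[k + 1]) / 2 for k in range(n)) / n

# Hypothetical example: 8 customers, already listed from best to worst score.
scores = [8, 7, 6, 5, 4, 3, 2, 1]
treated = [1, 0, 1, 0, 1, 0, 1, 0]
outcomes = [1, 0, 1, 0, 0, 0, 0, 1]

curve = qini_curve(scores, treated, outcomes)
qini = qini_measure(curve)
print(curve, qini)
```

A positive value means that the model ranking captures incremental responses earlier than random targeting would; a value near zero means that the ranking adds little over random selection.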

A/B Testing

The most robust and straightforward method to validate your uplift modelling results is simply to run another A/B test on your customers. Let group “A” be customers where you use your current targeting strategy or random targeting (depending on your baseline) and group “B” be customers where the targeting strategy suggested by the uplift model is used, and compare the results. This is of course more time-consuming and comes at an additional cost, but there are many benefits to this approach. If the A/B test is a success, it is one of the strongest possible arguments that the model results can be trusted, and it is definitely a decisive argument with a stubborn stakeholder who does not believe in the models. It is simply hard to argue with a well-performed A/B testing experiment, provided of course that the target/non-target split is truly random!

A/B testing is the most trustworthy method to compare your existing marketing strategy with the strategy suggested by the uplift models.

What if you already have an effective Marketing Strategy?

If you already have an effective marketing strategy for targeting your customers, it is also possible to design your A/B test in a slightly different way. Again, we randomly split the customers into group “A” and “B”, with customers in group “A” not being targeted by the marketing. For group “B” you can then use your existing marketing strategy as usual and calculate the true uplift after the A/B test. This will give you a direct estimate of the actual sales uplift from your targeting, and by applying the same Machine Learning approaches on top, the algorithms can learn how to further improve your existing marketing strategy. They will simply suggest corrections to your already effective marketing strategy to make it even more effective.

Conclusions

This article explained how to utilize A/B testing and Machine Learning techniques to optimize your marketing strategies and increase sales without wasting valuable marketing resources on the wrong customers. It identified one segment of customers, the “Persuadables”, that benefits from your marketing efforts and three segments that do not bring any value. A/B testing is a powerful technique to assess whether a marketing campaign will be a success without making a full investment in your full (potential) customer base. Machine Learning added on top will learn from your A/B test which kinds of customers have the highest sales uplift when targeted, and it can be used to rank and prioritize your customers from most to least promising marketing targets. This will help you spend your marketing resources on the right customers and boost your sales numbers.

The downside of A/B testing and uplift modelling is that a relatively elaborate experimental design needs to be established to gather the required data and validate performance. Is it worth it? You decide.

Literature List

Kane, Kathleen & Lo, Victor & Zheng, Jane. (2014). “Mining for the truly responsive customers and prospects using true-lift modelling: Comparison of new and existing methods”. Journal of Marketing Analytics. 2. 10.1057/jma.2014.18.

Lo, Victor. (2002). “The True Lift Model — A Novel Data Mining Approach to Response Modelling in Database Marketing”. SIGKDD Explorations. 4. 78–86.

Lai, Yi-Ting & Wang, Ke & Ling, Daymond & Shi, Hua & Zhang, Jason. (2006). “Direct Marketing When There Are Voluntary Buyers”. IEEE Computer Society. 10.1109/ICDM.2006.54

Verbeke, Wouter & Baesens, Bart & Bravo, Christian. (2018). “Profit-Driven Business Analytics”. John Wiley & Sons.

Radcliffe, N.J. and Surry, P.D. (2011). “Real-World Uplift Modelling with Significance-Based Uplift Trees”, Stochastic Solutions, section 6.

Hansotia, B. and Rukstales, B. (2002). “Incremental Value Modelling”. Journal of Interactive Marketing, 16(3): 35–46

Chickering, D.M. and Heckerman, D. (2000). “A Decision Theoretic Approach to Targeted Advertising”. Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence.

Rzepakowski, P. and Jaroszewicz, S. (2012). “Decision Trees for Uplift Modelling”. Data Mining (ICDM).

Soltys, M., Jaroszewicz, S. and Rzepakowski, P. (2015). “Ensemble Methods for Uplift Modelling”. Data Mining and Knowledge Discovery, 29(6): 1531–1559.

Guelman, L., Guillen, M. and Perez-Marin, A.M. (2012). “Random Forests for Uplift Modelling: An Insurance Company Retention Case”, Lecture Notes in Business Information Processing 115 LNBIP, 123–133.

Guelman, L., Guillen, M. and Perez-Marin, A.M. (2014). “Optimal Personalized Treatment Rules for Marketing Interventions: A Review of Methods, a New Proposal, and an Insurance Case Study”

Radcliffe, N.J. (2007). “Using Control Groups to Target on Predicted Lift: Building and Assessing Uplift Model”. Direct Marketing Analytics Journal, 3:14–21.

Belbahri M, Murua A, Gandouet O, Partovi Nia V (2019). “Qini-based Uplift Regression.” arXiv preprint arXiv:1911.12474.