Simple Machine Learning Techniques To Improve Your Marketing Strategy: Demystifying Uplift Models
In this article, we will demonstrate how the use of simple uplift models can drastically improve your targeted marketing strategy
Have you recently received a discount coupon for a new laundry detergent brand that you have never tried before? Or have you noticed more apparel advertisements on your Facebook page after browsing the latest Autumn collection on H&M’s webpage? If you answered yes to either of these questions, you are likely the recipient of a targeted marketing campaign.
You might wonder how these marketing campaigns identify the right individuals to receive discount coupons or personalized advertisements. Certainly, there is a cost to running these campaigns, and it is unfeasible to send discount coupons to everyone in the country. The cost of printing the coupons, plus the loss of potential revenue from offering the discount, might outweigh the additional revenue generated from new customers.
In addition, what if the customer was already planning to buy the new laundry detergent before receiving the coupon? In this case, the retailer will generate less profit, since the customer will be paying less than what he or she already intended to pay.
So how do you identify individuals who are only likely to purchase your product after receiving your promotional coupon, but would not have done so otherwise?
The solution to this problem is uplift modelling.
So what is uplift modeling? Wikipedia describes it wonderfully: “uplift modelling, also known as incremental modelling, true lift modelling, or net modelling is a predictive modelling technique that directly models the incremental impact of a treatment (such as a direct marketing action) on an individual’s behavior.”
In other words, uplift modeling can help you identify individuals who will purchase your products only as a result of receiving a discount coupon or a personalized advertisement. Utilizing these models can help your firm maximize profits by keeping advertising costs to a minimum.
Furthermore, it helps your firm to avoid losing business from “sleeping dogs” customers. “Sleeping dogs” are individuals who buy your product, but will stop doing so if they are included in your marketing campaign.
For example, imagine that you are running a department store, and the maternity business is lucrative for you. Your store might want to predict which of your customers are likely to be pregnant so that you can send them promotional coupons for baby products. However, there is always the possibility that a targeted customer will sense a loss of privacy and respond negatively to your marketing campaign. In that case, you will lose a valuable customer.
While uplift models can be really beneficial, they can sometimes be tricky to implement. Notably, the biggest challenge is finding the optimal method to model the incremental impact of the treatment on the individual’s response. Thankfully, there exist relatively simple models that can help you right away.
Having laid out the benefits of uplift modeling, let us explore some common uplift models that can help you enhance the effectiveness of your marketing strategy. So stay tuned!
Note: The code accompanying this article can be viewed here.
What Kind of Data Do You Need
Before we can perform any modelling, we will need to acquire an accurate dataset that keeps track of the following features:
- Whether a person received the treatment. The treatment can be a coupon, a mailed advertisement, etc.
- Whether a person purchased the product.
- Any other information that you believe would be beneficial for the modeling process. This can include age, income, occupation, etc.
Of course, the acquisition of such data is no easy task, but assuming you do have these data, there are some simple uplift models that you can use to enhance your marketing strategy.
Note: Uplift models can also be extended to multiple treatment groups, but this is a more complex scenario that we will not be discussing today.
For this article, we will be using data from an old take-home assignment from Starbucks given to their job candidates. This is a relatively simple example, but it is perfectly suited for our purpose.
This dataset concerns an experiment involving a promotional campaign. As part of the experiment, some customers were given promotions to entice them to purchase a product. Each product has a purchase price of $10, and the cost of each promotion is $0.15. Ideally, it would be best to limit the promotion to those who are most receptive to it.
Any customer who received a promotion is classified as belonging to the treatment group. Note that the terms treatment and experimental are often used interchangeably.
A second group of customers, known as the control group, were not given the promotions. In both groups, the dataset also tracked if the customers ended up purchasing the product or not.
In addition, seven additional unnamed features, V1 … V7, associated with each data point were provided as well.
There are a total of 120,000 data points in this dataset. Although a single individual could be represented by multiple data points, let us assume that each data point represents a single individual for simplicity’s sake.
Approximately 2/3 of the data will be assigned for training, and the rest will be set aside for testing the models.
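As a rough sketch, the 2/3 split can be done by shuffling row indices; the function name and seed below are illustrative (in practice, scikit-learn’s train_test_split does the same job on the DataFrame directly).

```python
import random

def split_indices(n_rows, train_frac=2/3, seed=42):
    """Shuffle row indices and split them into train/test index lists.

    Hypothetical helper for illustration only.
    """
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)       # deterministic shuffle
    cut = round(n_rows * train_frac)       # size of the training split
    return idx[:cut], idx[cut:]
```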
An important thing to note for this example is that the classes are highly imbalanced. The number of data points with no purchases is approximately 80 times higher than the number of data points with purchases. Hence, you will need to apply techniques to handle this imbalance. If you choose not to, the machine learning model is likely to predict that everyone will make no purchases.
If you would like to read up on these techniques, here are some great articles: 7 Techniques to Handle Imbalanced Data and SMOTE explained for noobs — Synthetic Minority Over-sampling TEchnique line by line.
We will be up-sampling the minority class in the training data with SMOTE for simplicity’s sake. This will result in an equal number of data points for each class. Note that up-sampling should be performed only on your training data, as it is preferable to have your validation data mimic the test data. In the real world, it is very likely that only a minority of the individuals under study will actually make a purchase.
SMOTE allows us to create new observations with slightly different feature values from the original observations. This is often a better approach than just resampling the original data, which will create too many duplicated data points and lead to over-fitting in the machine learning model. If you are interested in how SMOTE works, this article describes it brilliantly.
Note: If you are interested in knowing the source for this dataset, it is provided by Udacity as part of their Data Scientist Nanodegree. The dataset can also be found in my GitHub repository, so feel free to try it out!
We will be using 2 key metrics to track the performance of our models:
Incremental Response Rate (IRR):
- IRR measures how many more customers purchased the product with the promotion than would have without it.
- Mathematically, it is the ratio of the number of purchasers in the promotion group to the total number of customers in the promotion group (treatment), minus the ratio of the number of purchasers in the non-promotion group to the total number of customers in the non-promotion group (control).
Net Incremental Revenue (NIR):
- NIR measures how much is made (or lost) by sending out the promotion.
- Mathematically, this is $10 times the number of purchasers who received the promotion, minus $0.15 times the number of promotions sent out, minus $10 times the number of purchasers who were not given the promotion.
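These two definitions translate directly into code. Here is a minimal sketch, assuming we already have the purchaser counts and group sizes; the function and argument names are my own.

```python
def irr(purch_treat, n_treat, purch_ctrl, n_ctrl):
    """Incremental Response Rate: treatment purchase rate
    minus control purchase rate."""
    return purch_treat / n_treat - purch_ctrl / n_ctrl

def nir(purch_treat, n_treat, purch_ctrl, price=10.0, promo_cost=0.15):
    """Net Incremental Revenue: revenue from treated purchasers,
    minus the cost of the promotions sent,
    minus revenue from the control purchasers."""
    return price * purch_treat - promo_cost * n_treat - price * purch_ctrl
```

For example, with 150 treated purchasers out of 10,000 promotions sent and 100 control purchasers, `nir(150, 10000, 100)` gives 10·150 − 0.15·10000 − 10·100 = −$1,000: the campaign lost money despite converting more customers.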
To demonstrate that simply sending a promotion to everyone isn’t a viable strategy, let us test this approach on the Starbucks dataset and use the resulting scores as our baseline. It turns out that employing such a strategy produces an IRR of 0.96%, but an NIR of -$1,132.20! Now, that is bad marketing!
For this assignment, Starbucks claimed to have a model that achieved an IRR of 1.88% and an NIR of $189.45 (albeit this assignment is dated, and it is likely that Starbucks has a better model now). These numbers will serve as the benchmarks to beat, so without further ado, let’s begin with the modeling!
Model #1: Traditional Approach
The traditional approach to uplift modeling is training a predictive model on only the treatment group (those who received the promotion in our case). This model will separate those who are likely to respond (purchase the product) from those who are less likely to respond (did not purchase the product).
Since only the treated customers will be modeled, this approach seems to be eschewing potentially useful data. After all, we did not collect data for the control group (those who did not receive the promotion) just so that we can cast them aside during the modeling phase.
Hence, let us modify this approach slightly. We will train the model on the entire dataset, but keep the original task of identifying customers who will make a purchase only after they were given a promotion.
We can do this by assigning a label of 1 to individuals who received a promotion and made a purchase, and a label of 0 to everyone else. Pretty easy, right?
Shown below is a simple way to do so. It may not be the most efficient way, but it is readable and serves the purpose.
# Only those who made a purchase after receiving a promotion will be
# assigned a label of 1, while the other individuals will be given a
# label of 0
response = []
for index, row in train_data.iterrows():
    if (row['purchase'] == 1) and (row['Promotion'] == 'Yes'):
        response.append(1)
    else:
        response.append(0)
train_data['response'] = response
Next, we will create a validation dataset from the original training dataset. Just as a reminder, the new training dataset should not contain any of the data found in the validation dataset. You can use the validation dataset to fine-tune your model through a parameter grid search, often combined with cross-validation.
As noted previously, we will up-sample only the training data, and not the validation and test datasets. This helps balance the number of data points for each class.
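One way to sketch that grid search in plain Python: the hypothetical `score_on_validation` callback stands in for “train a model with these parameters, then score it on the validation split”. In practice, scikit-learn’s GridSearchCV wraps this pattern together with cross-validation.

```python
from itertools import product

def grid_search(param_grid, score_on_validation):
    """Try every combination in param_grid; keep the best validation score.

    param_grid maps parameter names to lists of candidate values.
    score_on_validation(params) is assumed to return a validation
    score for a model trained with those parameters (higher is better).
    """
    best_params, best_score = None, float("-inf")
    for combo in product(*param_grid.values()):
        params = dict(zip(param_grid, combo))
        score = score_on_validation(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```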
# Up-sample only the training dataset with SMOTE
sm = SMOTE(random_state=42, ratio=1.0)
X_train_upsamp, Y_train_upsamp = sm.fit_sample(X_train, Y_train)
X_train_upsamp = pd.DataFrame(X_train_upsamp, columns=features)
Y_train_upsamp = pd.Series(Y_train_upsamp)
Now, we are ready to feed the data into a machine learning model. The model used for this example is the XGBoost Classifier, a relatively popular model often used in machine learning competitions. Of course, you should try other machine learning models as well to see which is the most suitable.
In case you are wondering, the features for the model are the unnamed V1, V2, …, V7 features.
# X_train_upsamp contains features V1, ..., V7
# Y_train_upsamp contains the labels
eval_set = [(X_train_upsamp, Y_train_upsamp), (X_valid, Y_valid)]
model = xgb.XGBClassifier(learning_rate=0.1,
                          max_depth=7,
                          min_child_weight=5,
                          objective='binary:logistic',
                          seed=42,
                          gamma=0.1,
                          silent=True)
model.fit(X_train_upsamp, Y_train_upsamp, eval_set=eval_set,
          eval_metric="auc", verbose=True, early_stopping_rounds=30)
So how do we predict whether a new individual should receive the promotion?
If the model predicts a label of 1 for that individual, then it is likely that the individual will respond favorably to the promotion campaign and we should send him or her the promotion. Otherwise, we should not send a promotion.
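That decision rule boils down to a one-liner. In the sketch below, `model` is any fitted classifier with a scikit-learn-style predict() (such as the XGBoost classifier above), and the helper name is my own.

```python
def should_send_promotion(model, customer_features):
    """Send the promotion only if the model predicts label 1 (responder)."""
    return model.predict([customer_features])[0] == 1
```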
The results for this simple model are remarkable! The model achieved an IRR of 2.19% and an NIR of $332.70 on the test data. These numbers actually outperform the Starbucks model (IRR of 1.88% and NIR of $189.45). We should never underestimate the capabilities of simple models.
These results serve as a perfect example of how applying uplift modeling to your marketing strategy, however simple the model may be, can drastically improve your bottom line compared to heuristic approaches.
Model #2: Two-Model Approach
The two-model approach is commonly described in uplift modeling literature. It is a simple and intuitive approach. Two separate models are trained: a control model and a treatment model.
The control model, which is trained only on the control data (individuals who did not receive the promotion), predicts how likely an individual is to make a purchase without the influence of the treatment (no promotion).
The treatment model, which is trained only on the treatment data (individuals who received the promotion), predicts how likely an individual is to make a purchase under the influence of the treatment (promotion).
Ideally, the difference in the predicted probabilities of the two models will indicate whether sending a promotion will increase an individual’s likelihood of making a purchase. We will refer to this difference in probabilities as the lift, a term commonly found in uplift modeling literature.
We can then select a cut-off percentile of the lift values, either through a grid search or manual selection, to identify which individuals should receive the promotions. For example, we can choose to send promotions only to individuals with lift values in the top 5 percent.
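A minimal sketch of that percentile cut-off, in plain Python (numpy.percentile does the same job on arrays); the function name is my own.

```python
def top_percentile_mask(lift_values, top_pct=5.0):
    """Flag individuals whose lift falls in the top `top_pct` percent."""
    cutoff_index = int(len(lift_values) * (1 - top_pct / 100.0))
    cutoff = sorted(lift_values)[cutoff_index]
    return [v >= cutoff for v in lift_values]
```

Promotions would then be sent only to individuals where the mask is True.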
The code for this model is relatively similar to the first model, except that we will be training two models on different datasets: one for the control group and the other for the treatment group. Remember to up-sample both the treatment and control training data separately as both groups are likely to have different characteristics.
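The two-model training can be sketched as follows. Here `make_model` is a factory returning a fresh classifier (e.g. `lambda: xgb.XGBClassifier(...)`), and all names are illustrative, not the article's exact code.

```python
def fit_two_models(make_model, X, y, received_promo):
    """Train one model on the treatment rows and one on the control rows."""
    X_t = [x for x, t in zip(X, received_promo) if t]
    y_t = [v for v, t in zip(y, received_promo) if t]
    X_c = [x for x, t in zip(X, received_promo) if not t]
    y_c = [v for v, t in zip(y, received_promo) if not t]
    return make_model().fit(X_t, y_t), make_model().fit(X_c, y_c)

def lift_scores(treat_model, ctrl_model, X_new):
    """Lift = P(purchase | promotion) - P(purchase | no promotion)."""
    p_t = treat_model.predict_proba(X_new)
    p_c = ctrl_model.predict_proba(X_new)
    return [pt[1] - pc[1] for pt, pc in zip(p_t, p_c)]
```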
Using an XGBoost model, the performance of this approach on the test dataset is less spectacular. If we send out promotions only to individuals with lift values in the top 3 percent, we achieve an IRR of 1.76% and an NIR of $12.70. Nevertheless, these numbers are far better than what we would have achieved if promotions had been sent to every single customer. As previously noted, such an approach would have yielded an NIR of -$1,132.20.
There are a few drawbacks to the two-model approach. Victor Lo, who has published extensively in the field of uplift modeling, addressed these drawbacks clearly. The two-model approach models lift indirectly. While both models might accurately model the probabilities of response in the treatment and control groups, the difference in the probabilities of the two models may not capture the lift precisely.
Differences in the scales of the two models could be responsible for this phenomenon. In addition, the amount of error could be doubled, since both the treatment model and the control model contribute errors.
Model #3: Using a Single Model with Treatment Indicator Variable
To alleviate the issues that arise with the use of two models, we will return to the single-model approach. But this presents another problem. How can we effectively capture the impact of the promotional campaign with a single model?
The answer is simple. In order to model the impact of treatment (receiving a promotion) with a single model, we can create a new indicator variable to track whether an individual receives the treatment during training. Not too complicated after all!
For individuals in the training set, the treatment indicator variable will be set to 1 if the individual received a promotion, and 0 if the individual did not receive a promotion. The XGBoost model is then trained on the original V1-V7 features and the indicator variable.
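Building the indicator is a one-line mapping; sketched here in plain Python (with pandas, `(train_data['Promotion'] == 'Yes').astype(int)` achieves the same).

```python
def treatment_indicator(promotion_column):
    """Map 'Yes'/'No' promotion flags to a 1/0 treatment indicator."""
    return [1 if value == 'Yes' else 0 for value in promotion_column]
```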
How do we then predict whether a new individual is likely to respond favorably to the promotion? Do we set the treatment indicator variable to 0, or to 1? After all, our task is to decide whether that individual belongs in the treatment group (we should send a promotion to the person) or the control group (we should not send a promotion to the person). The answer is both 0 and 1.
By setting the treatment indicator variable to 1 for the individual, we can predict the probability he or she will make a purchase in response to the promotion. Next, we should set the indicator variable to 0 and predict the probability he or she will make a purchase without receiving the promotion.
# To predict whether a new individual should receive a promotion,
# score every test row twice: once as treated, once as control

# Predict with treatment = 1 for all data points
test['treatment'] = 1.0
preds_treat = model.predict_proba(test, ntree_limit=model.best_ntree_limit)

# Predict with treatment = 0 for all data points
test['treatment'] = 0.0
preds_cont = model.predict_proba(test, ntree_limit=model.best_ntree_limit)

lift = preds_treat[:, 1] - preds_cont[:, 1]
The difference between the two probabilities, also known as the lift value, tells us how much we can improve the individual’s probability of making a purchase by sending a promotion.
This approach essentially captures the essence of the two model approach, but with the use of a single model. Hence, we can avoid the issues of scaling mismatches and amplification of errors.
With this approach, if we send out promotions to every individual with a positive lift value in the test dataset, we observe an IRR of 1.54% and an NIR of $42.60. While this is lower than what the Starbucks model achieved, it still represents a significant improvement over the two-model approach.
As was the case with the two-model approach, you can opt to send the promotions only to individuals with lift values above a cut-off percentile. This might lead to some improvements in both the IRR and NIR values.
Model #4: Four Quadrant Approach
This is the final approach that we will discuss. Don’t let the name intimidate you; it is really simple as well. This approach uses a single model to predict the probability of an individual belonging to one of 4 categories (quadrants), hence its name. These categories are:
- TR: the treatment and response group. Individuals in this group received a treatment (promotion) and responded (made a purchase)
- CR: the control and response group. Individuals in this group received no treatment (no promotion) but still responded (made a purchase)
- TN: the treatment and no response group. Individuals in this group received a treatment (promotion) but did not respond (made no purchase)
- CN: the control and no response group. Individuals in this group received no treatment (no promotion) and did not respond (made no purchase)
In other words, this is simply a multi-class classification model with four classes. Thankfully, XGBoost can handle multi-class classification out of the box, so relatively few modifications to the code are needed. Other classification models that can handle multiple classes will also suffice.
The only catch for this approach is that we need to separate the training data into the appropriate groups and assign the correct labels. One method to generate the labels for the XGBoost model is shown below.
target = []
for index, row in train_data.iterrows():
    if (row['Promotion'] == "Yes") & (row['purchase'] == 1):
        target.append(0)  # TR group
    elif (row['Promotion'] == "No") & (row['purchase'] == 1):
        target.append(1)  # CR group
    elif (row['Promotion'] == "Yes") & (row['purchase'] == 0):
        target.append(2)  # TN group
    else:
        target.append(3)  # CN group
train_data['target'] = target
The rest of the process is relatively similar to the previous approaches.
Next, let us discuss how predictions could be performed. If a model predicts that an individual belongs to class TR, it is likely that he or she will respond favorably to the promotion and we should send a promotion to that individual. If other classes are predicted, we should not send a promotion.
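As a sketch, the decision rule is again a small helper. Here `model` is any fitted scikit-learn-style multi-class classifier, and `TR_LABEL` is assumed to match whatever integer was assigned to the TR group when building the target column (0 in this illustration).

```python
TR_LABEL = 0  # assumed encoding of the TR (treatment & response) group

def should_send_promotion_4q(model, customer_features):
    """Send a promotion only when the predicted quadrant is TR."""
    return model.predict([customer_features])[0] == TR_LABEL
```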
This approach yielded an IRR of 1.55% but an NIR of only $5.90. While these numbers are among the lowest of the 4 approaches, they still represent an improvement over our baseline. Since the Starbucks dataset is by no means representative of all real-world data, you should still give this model a try on other datasets. Perhaps it might yield more favorable results.
So far, we have seen how uplift modeling can be used to better identify individuals who will respond favorably to your marketing campaigns. With these models, you can reduce the cost of marketing and enhance the value of your marketing campaign.
To reiterate, we have covered these uplift models in this article:
- Traditional Uplift Approach
- Two-Model Approach
- Single Model with Treatment Indicator Variable
- Four Quadrant Approach
You should definitely experiment with all of these approaches and see which approach best suits your needs. The list of uplift models presented here is by no means definitive, and there are numerous other approaches as well. If you are interested, some additional readings for the topic can be found here: link 1, link 2, link 3.
For this article, no feature selection or feature engineering was performed. In addition, little emphasis was placed on fine-tuning the choice of machine learning model, since we wanted to standardize the model to better compare the different approaches. It is certainly possible that the performance of these uplift models could be improved by incorporating these steps.
Real-world data can also be far more complex than the example presented here. There could be incomplete data, multiple treatment groups, non-constant product prices, cases of product returns, etc. All of these would require modifications to the modeling process.
Uplift models can be rather complex, but they can bring tremendous value to your business. Fortunately, there are a couple of pre-implemented libraries and services that you can use. Do note that some of them are paid services.
The code for this article can be found here.
Thank you! I hope you enjoyed this article and learned something new. I do not profess to be a master of this topic, so if you spot any errors, let me know. If you want to share alternative uplift modeling approaches and tips, leave a comment below! I am always happy to learn new techniques.