Computing incremental sales at Untie Nots

Julien Louis
Untie Nots
Oct 3, 2022 · 7 min read

Part 5: Using a more robust approach with causal inference

Looking for the start? Click here for part 1

In this part of the series, we will look into the details of using causal inference to measure the incremental sales at Untie Nots. For more information on the context, don’t forget to read part 1. A basic understanding of regression and classification machine learning models will also help you follow how it all works.

Reframing the goal as a causal inference problem

According to Wikipedia: “Causal inference is the process of determining the independent, actual effect of a particular phenomenon that is a component of a larger system.”

We can formalize the problem as 3 sets of random variables arranged in a causal graph.

X: ALL the confounding variables that affect the outcome and the intervention
T: the intervention, or the variable whose causal effect on the outcome we are trying to determine
Y: the outcome of the experiment we measure

The direction of the arrows shows causation

We can fit our problem in the above framework:
X: all the personal and sales data on the exposed customers before the campaign
T: whether or not the exposed customer has played
Y: the exposed customer sales during the campaign
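
As a toy illustration (the field names below are made up for this post, not our actual schema), a single exposed customer could look like this:

```python
# Hypothetical example of a single exposed customer (illustrative field names).
customer = {
    # X: confounding variables observed before the campaign
    "X": {"weekly_spend_history": [32.5, 41.0, 28.7], "visits_last_month": 6},
    # T: 1 if the customer played the campaign, 0 otherwise
    "T": 1,
    # Y: the customer's sales during the campaign
    "Y": 187.4,
}
```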

Why not use A/B testing?

In an ideal world, one would use an A/B test protocol to solve this issue: split the customers into two homogeneous groups according to X, apply a different treatment to each group, and measure the difference in outcomes. However, in some cases, especially in medicine or economics, doing so would be unethical or unfeasible. That is where causal inference has a key advantage: it doesn’t require a control group and works on the entire population directly.

That characteristic is very useful for our marketing case, as we do not want to restrict access to the campaign. For the rest of this blog post, “customers” will refer only to the exposed customers.

Formulas for the incremental sales

In the case of a binary treatment (the customer plays or does not play), there are two potential outcomes for every customer. These can be written as Y1, the treated outcome (the sales if the customer plays), and Y0, the control outcome (the sales if they do not play).

So for every customer, we can compute the Conditional Average Treatment Effect (CATE), which represents the incremental sales for a single customer due to the campaign.
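
In the standard potential-outcomes notation, the CATE for a customer with covariates x can be written as:

$$\mathrm{CATE}(x) = \mathbb{E}[\,Y_1 - Y_0 \mid X = x\,]$$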

By averaging over the entire population, we can compute the Average Treatment Effect (ATE).
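
In the same notation, the ATE is the effect averaged over the whole customer base:

$$\mathrm{ATE} = \mathbb{E}[\,Y_1 - Y_0\,] \approx \frac{1}{N}\sum_{i=1}^{N}\big(Y_{1,i} - Y_{0,i}\big)$$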

It is important to note that the ATE is not the incremental sales brought by the campaign: it is the difference between the total sales if everyone played and the total sales if no one played. We still write down the ATE formula, as it is often the preferred metric in causal inference problems.

For the incremental sales, what we are looking for is the Average Treatment Effect on the Treated (ATT), where we only account for the effect of the treatment on the player population.
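
Written out, the ATT is the same average restricted to the players:

$$\mathrm{ATT} = \mathbb{E}[\,Y_1 - Y_0 \mid T = 1\,] \approx \frac{1}{N_1}\sum_{i\,:\,T_i = 1}\big(Y_{1,i} - Y_{0,i}\big)$$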

How do we get the values?

Unfortunately for us, in the real world we can only observe one of the two potential outcomes: we only know Y1 for the players and Y0 for the non-players. The goal of the following methods is to estimate the missing outcomes so that we can compute the ATE and the ATT.

For the following models to work, three strong statistical assumptions must hold:

  • SUTVA: Stable Unit Treatment Value Assumption. The potential outcomes for any unit do not vary with the treatments assigned to other units.
  • Common support: the treatments overlap in terms of X. This means that for every value of X, some receive T=1 and others receive T=0.
  • Ignorability: the outcomes Y0 and Y1 are independent of treatment assignment, conditioned on X. This means there is no hidden confounding variable that affects both T and Y1 or Y0.

Let’s check whether those three assumptions hold in our case:

  • SUTVA: whether a customer plays, and their resulting spending patterns, should not influence any other customer.
  • Common support: this can be checked empirically; we can plot the distributions of X and see whether the players and non-players overlap (see the sketch after this list).
  • Ignorability: this is the hardest one to check. An external factor could affect both the likelihood of playing and the purchase amount, which is why it is important for us to collect as much data on the customers as possible.
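
Here is a minimal sketch of such an overlap check, assuming the customer data sits in a pandas DataFrame `df` with a binary `played` column (the column names are illustrative, not our actual schema):

```python
# Minimal sketch of an empirical common-support check.
import matplotlib.pyplot as plt
import pandas as pd

def plot_overlap(df: pd.DataFrame, feature: str, treatment_col: str = "played") -> None:
    """Overlay the distribution of one confounder for players vs. non-players."""
    players = df.loc[df[treatment_col] == 1, feature]
    non_players = df.loc[df[treatment_col] == 0, feature]
    plt.hist(non_players, bins=30, alpha=0.5, density=True, label="non-players (T=0)")
    plt.hist(players, bins=30, alpha=0.5, density=True, label="players (T=1)")
    plt.xlabel(feature)
    plt.ylabel("density")
    plt.legend()
    plt.title(f"Common support check on '{feature}'")
    plt.show()

# Example: plot_overlap(df, "pre_campaign_spend")
```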

Now let’s compute the incremental sales

1) Covariate adjustment: regression model

The completed dataset, with the factual and the counterfactual data

In this strategy, we will try to estimate the counterfactual with a machine learning model. We will use X and T as the inputs of a regression model to predict Y. The model should learn a customer’s expected sales (Y) given their past spending (X) and whether or not they play (T).

We get the ATE by averaging over the entire population.

Finally, to get the incremental sales (the ATT), we can focus solely on the player population.
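
Here is a minimal sketch of the covariate-adjustment computation, with a generic scikit-learn regressor standing in for the recurrent neural network we actually use (variable names are illustrative):

```python
# Sketch of covariate adjustment: fit Y ~ (X, T), then predict both potential
# outcomes for every customer and average the differences.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def covariate_adjustment(X: np.ndarray, T: np.ndarray, Y: np.ndarray):
    """Return (ATE, ATT) estimated by the regression model."""
    model = GradientBoostingRegressor()
    model.fit(np.column_stack([X, T]), Y)

    y1_hat = model.predict(np.column_stack([X, np.ones(len(X))]))   # Ỹ(Xi, T=1)
    y0_hat = model.predict(np.column_stack([X, np.zeros(len(X))]))  # Ỹ(Xi, T=0)

    ate = np.mean(y1_hat - y0_hat)                   # averaged over all customers
    att = np.mean(y1_hat[T == 1] - y0_hat[T == 1])   # averaged over players only
    return ate, att
```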

There are other ways to predict the counterfactual values. For example, we could use the twin method described in an earlier post to estimate the counterfactual value for each customer. But we decided to use a recurrent neural network because:

  • it is more robust to outliers in the dataset
  • it scales well with the customer dataset size
  • its temporal aspect better takes spending patterns into account

2) Inverse probability of treatment weighting

In a perfect randomised controlled trial, the ATE is straightforward to compute: take the average outcome of the treated group minus the average outcome of the control group.

The size of each circle is proportional to the customer’s weight. Non-players with a high propensity score (customers who were likely to play but did not) are weighted more in the control group.

The idea behind inverse probability of treatment weighting (IPTW) is to create a simulated randomised experiment by weighting the samples based on their propensity score: their likelihood of playing.

To predict the probability that a customer will play given X, we will use a classification machine learning model with the customer data as input (X) and whether they played or not (T) as the label.

We chose a recurrent neural network for the training, as we felt it takes into account the temporal nature of spending patterns. Other classification models might be better suited to your problem; however, it is key that the model is well calibrated.
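
As a minimal sketch, here is how the propensity scores could be obtained with a calibrated scikit-learn classifier standing in for our recurrent neural network:

```python
# Sketch of the propensity model: estimate p_i = P(T=1 | X_i) with a calibrated
# classifier (a generic stand-in for the recurrent neural network we use).
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

def fit_propensity_scores(X: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Return the estimated probability that each customer plays, given X."""
    model = CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic", cv=5)
    model.fit(X, T)
    return model.predict_proba(X)[:, 1]
```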

What does well calibrated mean? In a well calibrated model, if a customer has an output of 0.4, they indeed have a 40% chance of playing; not all ML classification models have this property. To check whether a model is well calibrated, it is important to plot its calibration curve.

Well calibrated model (left) and ill calibrated model (right)
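
A quick way to produce such a curve is scikit-learn’s calibration_curve; this sketch assumes `T` holds the true played/not-played labels and `p` the predicted probabilities from the model above:

```python
# Sketch of a calibration check: the curve should hug the diagonal for a
# well calibrated model.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

def plot_calibration(T, p, n_bins: int = 10) -> None:
    frac_played, mean_predicted = calibration_curve(T, p, n_bins=n_bins)
    plt.plot(mean_predicted, frac_played, marker="o", label="model")
    plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
    plt.xlabel("mean predicted probability")
    plt.ylabel("observed fraction of players")
    plt.legend()
    plt.show()
```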

Furthermore, be suspicious of any accuracy score that is too high, as it is probably a sign of overfitting. A key assumption is the common support between positive and negative data points; if the model can predict the treatment almost perfectly, that assumption is likely violated.

And finally, we can compute the ATE and ATT by weighting the sales of each population.
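
Using the notation listed at the end of this post, one textbook form of these weighted estimators is:

$$\widehat{\mathrm{ATE}} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{T_i\,Y_i}{p_i} - \frac{(1 - T_i)\,Y_i}{1 - p_i}\right)$$

$$\widehat{\mathrm{ATT}} = \frac{1}{N_1}\sum_{i=1}^{N}\left(T_i\,Y_i - (1 - T_i)\,\frac{p_i}{1 - p_i}\,Y_i\right)$$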

3) Combining both strategies in a doubly robust estimator

We can combine the two strategies above to compute the ATE and the ATT for the incremental sales. This method is known as a Doubly Robust Estimator.

The estimator combines the regression predictions with the propensity weights; for the ATT, the sum below is restricted to the players (Ti = 1).
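
With Ỹ1(Xi) and Ỹ0(Xi) denoting the regression model’s predictions for customer i with T forced to 1 and 0 respectively, one standard (AIPW-style) form reads:

$$\widehat{\mathrm{ATE}} = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{T_i\,\big(Y_i - \tilde{Y}_1(X_i)\big)}{p_i} + \tilde{Y}_1(X_i) - \frac{(1 - T_i)\,\big(Y_i - \tilde{Y}_0(X_i)\big)}{1 - p_i} - \tilde{Y}_0(X_i)\right]$$

$$\widehat{\mathrm{ATT}} = \frac{1}{N_1}\sum_{i=1}^{N}\left[T_i\,\big(Y_i - \tilde{Y}_0(X_i)\big) - (1 - T_i)\,\frac{p_i}{1 - p_i}\,\big(Y_i - \tilde{Y}_0(X_i)\big)\right]$$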

We decided to use this combination, even if it adds more complexity, because of one key advantage: it is a consistent estimator if either the propensity score model or the potential outcome model (but not necessarily both) is correctly specified. This makes it more robust to the messiness of real-world data and allows us to be confident in our computation of the incremental sales.
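
As a rough sketch of how the two models combine in code (taking regression predictions and propensity scores like those produced in the earlier sketches; this is the generic AIPW form, not necessarily our exact production code):

```python
# Sketch of the doubly robust combination of the two models.
# y1_hat, y0_hat: regression predictions with T forced to 1 / 0; p: propensity scores.
import numpy as np

def doubly_robust(Y: np.ndarray, T: np.ndarray,
                  y1_hat: np.ndarray, y0_hat: np.ndarray, p: np.ndarray):
    """Return the doubly robust estimates of the ATE and the ATT."""
    treated_part = T * (Y - y1_hat) / p + y1_hat
    control_part = (1 - T) * (Y - y0_hat) / (1 - p) + y0_hat
    ate = np.mean(treated_part - control_part)

    residual = Y - y0_hat  # how much each customer beats the "no play" prediction
    att = (np.sum(T * residual)
           - np.sum((1 - T) * (p / (1 - p)) * residual)) / T.sum()
    return ate, att
```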

To control for any leftover bias, we use a control period as described in part 3.

Notations used in the equations:

Xi: confounding variables for i
Ti: treatment for i
Y(Xi): observed outcome for Xi
Yi = Y(Xi)
Y1, Y0: outcome for T=1, T=0
Ỹi: estimated outcome for i by the regression model
pi: propensity score for i by the classification model
N: total population
N1, N0: population in treated, control groups

The End

Throughout this blog series, we have looked at a few different ways to compute the incremental sales of our marketing campaign. But which one should you use? Well, it all depends on your use case and priorities, so here is a table to help you decide.

Recap of the different methods seen

Thank you for reading our series on computing incremental sales at Untie Nots. This is the end… for now.

Extra resources
For those of you who want to dive even further, here are a few helpful links:
- A paper on doubly robust estimators
- A pedagogical website
- An online lecture
