Ibotta is an app that allows users to save money on everyday purchases. We also invest heavily in marketing, and most of that spend goes to programs meant to increase user activity in the app. These programs are effective at increasing usage, but they are also expensive. To maximize usage while keeping costs down, we don’t give everyone the same programs. Instead, we assign programs and bonuses based on user-level behavior.
We use uplift modeling to make the assignment decision. This is a type of statistical learning that estimates the effect of a treatment on a particular user given their features. For a more general introduction see here.
The most challenging part of building uplift models is assessing model performance, which is crucial for model selection and for grid-searching over hyper-parameters. Unlike with traditional ML models, we don’t observe all of the outcomes, because of the fundamental problem of causal inference: we can only see one treatment and response pair per user. What we are actually interested in is the counterfactual, or what would have happened if we had given a different treatment to the same user. Since we cannot observe two treatments per user, this can be considered (with several assumptions) a missing data problem. One could run a new experiment with the proposed model’s treatments, but this is a costly and time-consuming process.
The Qini metric has traditionally been used as an offline metric and is analogous to AUC for classification. It has a few limitations, though. First, it does not tell you the expected value of the response metric under your model, so you cannot read off how much the model will actually improve the response. Second, it doesn’t generalize easily to multiple treatments.
There is a new metric that allows us to simulate an experiment and assess a model’s performance offline using randomly assigned data. We’ve called it ERUPT, or Expected Response Under Proposed Treatments. With it we can assess a model’s performance and find a better model without additional experimentation.
I came up with this metric and wrote a blog post about it here. It turns out I wasn’t the first to discuss it: two other papers (that I know of) cover the metric too, one by Zhao, Fang, and Simchi-Levi and another by Hitsch and Misra. They go into more depth and generalize the metric to non-random assignments and non-uniform treatments. I encourage you to read their work for a more mathematical and in-depth explanation. Here I’ll just go over the intuition and an example.
This post will go over a general problem formulation of uplift modeling, current metrics, the ERUPT metric, and some limitations of it.
Uplift Problem setup:
The general setup for an uplift model is:
y: Response variable of interest you’d like to maximize. Could be something like user retention.
X: User level covariates. Includes things like previous in-app behavior per user.
T: The randomly assigned treatment. In this case it is binary: whether or not to give a bonus to a particular user. Assume that treatments are assigned at random and uniformly, so each treatment is equally likely for every user.
With the data (y, X, T), the goal is to build a treatment assignment policy 𝜋(x) that uses X to assign the T that maximizes the value of y. In this case, we want to use user history to decide whether to give a user a bonus in order to maximize retention.
A frequent practice is to model the expected outcome y_i under different treatments and choose the T that maximizes y_i for each user.
There are several ways to do this, and it can be done with a run-of-the-mill ML algorithm that captures interactions, like a random forest. To get the counterfactual for each treatment, you predict with different values of t and, for each user, select the treatment with the highest expected value of y. Fun fact: this calculation is closely related to creating an ICE plot with the treatment variable.
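The predict-under-each-treatment step can be sketched as follows. This is a minimal illustration, not the model we use in production: it swaps the random forest for an ordinary least-squares fit with a treatment-by-covariate interaction so that it only needs NumPy, and the data, the `design` helper, and `propose_treatment` are all hypothetical names made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): one covariate, binary treatment.
n = 5000
X = rng.normal(size=(n, 1))
T = rng.integers(0, 2, size=n)  # uniform random assignment
# Assumed ground truth for the demo: the bonus helps only when x > 0.
y = 0.5 * X[:, 0] + T * X[:, 0] + rng.normal(scale=0.1, size=n)


def design(X, t):
    """Features with a treatment main effect and a treatment-by-covariate interaction."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    return np.hstack([np.ones((len(X), 1)), X, t, X * t])


# Fit a single model of y on (X, T).
coef, *_ = np.linalg.lstsq(design(X, T), y, rcond=None)


def propose_treatment(X, treatments=(0, 1)):
    """Predict y under every treatment and return the best treatment per user."""
    preds = np.column_stack(
        [design(X, np.full(len(X), t)) @ coef for t in treatments]
    )
    return np.asarray(treatments)[preds.argmax(axis=1)]


pi = propose_treatment(X)  # vector of proposed treatments, one per user
```

Any regressor that accepts the treatment as a feature could replace the least-squares fit here; the only part that matters for the policy is the loop that scores each user once per candidate treatment.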
So now we have a vector of proposed treatments on a subset of users. How can we tell if this is a good assignment policy?
The ERUPT Metric:
First let’s run through it intuitively. Suppose you have an observation where 𝜋(x) proposes not giving a bonus, but the randomly assigned treatment was to give a bonus. Since these do not align, we cannot say what the response would have been under the proposed treatment.
However, if the treatment the model proposes is equal to the assigned treatment, we can include that observation in our proposed treatment examples. We go through this exercise for all observations and calculate the mean response over only those where 𝜋(x) = assigned treatment. This is our estimated value of y under the model! Mathematically it is:
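Written out for the uniform random assignment assumed in the setup, the description above corresponds to the estimator below. The notation is ours: t_i is the treatment randomly assigned to observation i, n is the number of observations, and the indicator is 1 when the proposed and assigned treatments agree and 0 otherwise.

```latex
\widehat{\mathrm{ERUPT}}(\pi)
  = \frac{\sum_{i=1}^{n} y_i \,\mathbf{1}\!\left[\pi(x_i) = t_i\right]}
         {\sum_{i=1}^{n} \mathbf{1}\!\left[\pi(x_i) = t_i\right]}
```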
Below is a spreadsheet showing the proposed treatment 𝜋(x), along with randomly assigned treatment, response variable, and whether the proposed treatment is equal to observed treatment.
We subset the data to those observations where 𝜋(x) is equal to the randomly assigned treatment and take the mean of the response for those observations. In this case the expected response under proposed treatments (ERUPT) is (1.43 + (-1.02)) / 2 = 0.205.
We can use this as a metric to grid-search and assess model performance without additional experimentation.
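The calculation is short enough to sketch in a few lines. The function name `erupt` is ours, and it assumes uniform random assignment as in the setup above. Only the two matched responses (1.43 and -1.02) come from the example; the three unmatched responses and all treatment labels below are made-up placeholders, which is harmless because unmatched rows never enter the mean.

```python
import numpy as np


def erupt(y, assigned, proposed):
    """Mean response over observations where proposed == assigned treatment."""
    y = np.asarray(y, dtype=float)
    match = np.asarray(assigned) == np.asarray(proposed)
    if not match.any():
        raise ValueError("no observation has proposed == assigned treatment")
    return y[match].mean()


# Five observations, two of which match.
# Placeholder values: every number except 1.43 and -1.02 is invented.
y        = [1.43, 0.50, -1.02, 0.30, 2.00]
assigned = [1,    0,    0,     1,    0]
proposed = [1,    1,    0,     0,    1]

value = erupt(y, assigned, proposed)  # (1.43 + (-1.02)) / 2
```

To compare candidate models in a grid search, you would compute this value on held-out experiment data for each model’s proposed treatments and keep the model with the highest estimate.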
The biggest drawback of the ERUPT metric is that it does not use all observations when calculating the expected response. In the above example we effectively ‘throw out’ 3 / 5 observations. Fewer observations means a higher variance in the estimated counterfactual so you may see noisy results.
How many observations you’ll ‘lose’ depends on the distribution of assigned treatments and of proposed treatments. If the assigned treatments are uniformly distributed, the expected number of matched examples is n_obs/num_tmts. So if you have an experiment with 100,000 observations and 5 treatments, you’ll end up with ~20,000 observations to estimate the response under a new model.
This should be fine if you have a small number of treatments relative to observations. However, if you are trying to optimize several treatment arms within the same model, the number of unique treatment combinations grows exponentially with the number of arms and can become too large to estimate the response with a high degree of certainty.
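A quick simulation illustrates the n_obs/num_tmts rule of thumb. This is a sketch with made-up inputs: the proposed treatments are drawn from an arbitrary, deliberately uneven distribution to show that, when assignment is uniform and independent of the policy, the expected match count does not depend on how the policy distributes its proposals.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, num_tmts = 100_000, 5

# Uniform random experiment: each treatment is equally likely.
assigned = rng.integers(0, num_tmts, size=n_obs)

# Hypothetical policy output, drawn independently of the assignment
# and unevenly on purpose.
proposed = rng.choice(num_tmts, size=n_obs, p=[0.5, 0.2, 0.1, 0.1, 0.1])

n_matched = int((assigned == proposed).sum())
expected = n_obs / num_tmts  # 20,000 for this setup
```

The observed `n_matched` should land close to `expected`, with the usual binomial noise around it.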
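To see how quickly the effective sample shrinks, here is a small sketch that counts unique treatments when several arms are combined. The arm names and levels are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical setup: three treatment arms, each chosen independently.
arms = {
    "bonus_amount": [0, 1, 2, 5],
    "bonus_duration_days": [3, 7, 14],
    "send_push": [False, True],
}

# Every unique treatment is one combination of arm levels.
combos = list(product(*arms.values()))
num_tmts = len(combos)  # 4 * 3 * 2 = 24

n_obs = 100_000
# Expected matched observations under uniform assignment.
matched_per_eval = n_obs / num_tmts
```

With 24 unique treatments, the same 100,000-observation experiment leaves roughly 4,000 matched observations, and each additional arm multiplies the denominator again.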
This post went over a new metric called ERUPT for estimating the effect of uplift models. It is more general than current uplift metrics such as Qini, because it estimates the expected response directly and extends to multiple treatments. This will make model search and reporting more accurate for uplift models. However, there can be drawbacks if you have too many treatments relative to observations. In my next post, I’ll discuss how to optimize this metric directly using a loss function in keras.