Maximizing The ERUPT Metric for Uplift Models

Sam Weiss · Published in Building Ibotta · Aug 6, 2019 · 4 min read

Introduction:

My previous post described an offline metric for uplift models that estimates the Expected Response Under Proposed Treatments, or ERUPT. This metric gives us an estimate of the change in the response variable we would expect to see if we deployed the uplift model to production.

This post goes a step further and looks at ways to maximize this metric. I give a brief overview of current methodologies, then introduce a new loss function and compare it with a more traditional multi-output model. The main benefit of this approach is that it fits easily into existing Keras / TensorFlow models and does not require additional engineering. Finally, a simulation shows promising results.

Review of Proposed Methodologies:

There are several ways to build uplift models. In my previous blog post I noted that both Zhao, Fang, and Simchi-Levi and Hitsch and Misra discuss this metric; both also suggest ways to optimize it directly.

Zhao, Fang, and Simchi-Levi propose a modified tree technique. Instead of choosing splits that minimize mean squared error as a regular regression tree does, they use a custom splitting criterion that directly optimizes the ERUPT metric.

Hitsch and Misra try a “Causal KNN Regression”: for each observation, it finds the K nearest neighbors among both the treated and untreated observations and uses the difference in their observed responses as an estimate of that observation’s individual treatment effect.

In addition, there are Generalized Random Forests, which proceed in two steps: the first accounts for the variability of the response with respect to the explanatory variables, and the second models the treatment effect as a function of those variables (see section 6.1).

These all seem like good ideas, and one of them may have the “best” performance for your particular problem or dataset. However, they are tricky to implement for my use case. There is no production-quality code for the first two methods and, while there is a package (GRF) for the third, it does not extend to the multiple-treatment case and is only supported in R.

It would be great to implement any of these ideas for my production environment, but that can be very time consuming and the results may not justify the effort. I’ve found it’s best to try models in your existing pipeline as well as you can, then explore from there to see whether a heavier investment is worth it.

Implementing a new Loss

Enter TensorFlow / Keras. It’s portable in that it can be deployed in many different environments, and, importantly, it’s easy to adjust the model’s architecture to fit your needs. Here I’ll describe a new loss function that maximizes the ERUPT metric directly.

Recapping the general setup for an uplift model:

y: Response variable of interest you’d like to maximize. Could be something like user retention.

X: User level covariates. Includes things like previous in-app behavior per user.

T: The randomly assigned treatment. In this case it is binary: whether or not to give a bonus to a particular user. Assume that treatment assignment is uniform and random.

𝜋(x): Treatment assignment policy.

The ERUPT metric takes the average response over observations where the randomly assigned treatment equals the treatment assigned by the policy 𝜋(x).

The ERUPT Metric
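To make the definition concrete, here is a minimal NumPy sketch of the metric as described above (the function name and the toy data are my own, not from the original post):

```python
import numpy as np

def erupt(y, t, policy):
    # Average response over observations whose randomized treatment t
    # matches the treatment proposed by the policy.
    match = np.asarray(t) == np.asarray(policy)
    return np.asarray(y)[match].mean()

# Tiny illustration with made-up data:
y = np.array([1.0, 2.0, 3.0, 4.0])       # observed responses
t = np.array([0, 1, 0, 1])               # observed random assignment
policy = np.array([0, 0, 1, 1])          # model's proposed treatments
erupt(y, t, policy)                      # → 2.5 (mean of y at indices 0 and 3)
```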

It is not differentiable because it requires an indicator function. However, we can approximate it by replacing the indicator function with a probability estimate:

New Function to Maximize
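Reconstructing from the surrounding description (the original equation appeared as an image), the smoothed objective can be written as:

```latex
\max \; \frac{1}{N} \sum_{i=1}^{N} y_i \, p(t_i \mid x_i)
```

where $t_i$ is the treatment actually observed for observation $i$ and $p(k \mid x)$ is the softmax output described below.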

Where p(k|x) is a function approximated by a neural network with a softmax output and K is the number of treatments. This function will try to put the most weight on the treatment that maximizes y for observation i. Note that, as in the original ERUPT metric, only the observed treatment’s weight enters the function.
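A minimal sketch of such a loss in Keras / TensorFlow follows. The packing scheme for `y_true` (response in column 0, one-hot observed treatment in the remaining columns) is my own assumption for illustration, not necessarily the original implementation:

```python
import tensorflow as tf

def erupt_loss(y_true, y_pred):
    """Negated, smoothed ERUPT objective for use as a Keras loss.

    y_true: shape (batch, 1 + K). Column 0 holds the response y; columns
            1..K hold a one-hot encoding of the observed treatment.
            (This packing scheme is an assumption for the sketch.)
    y_pred: shape (batch, K). Softmax probabilities p(k | x) over treatments.
    """
    y = y_true[:, :1]            # observed response
    t_onehot = y_true[:, 1:]     # observed treatment, one-hot
    # Probability the network assigns to the treatment actually given
    p_observed = tf.reduce_sum(y_pred * t_onehot, axis=1, keepdims=True)
    # Keras minimizes, so negate: minimizing this maximizes mean(y * p(t|x))
    return -tf.reduce_mean(y * p_observed)
```

At prediction time, the proposed treatment for a user is simply the argmax of the softmax output.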

Does it work?

I simulated a problem with 2 treatments and one control, 100 explanatory variables, and 10,000 observations. Only 10 of the variables are predictive of the response, and these 10 are given interaction effects with the two treatments to simulate treatment-effect heterogeneity.
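A sketch of such a simulation is below; the coefficient scales and the exact response form are my assumptions, not necessarily the original setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_informative = 10_000, 100, 10

X = rng.normal(size=(n, p))
t = rng.integers(0, 3, size=n)             # 0 = control, 1 and 2 = treatments

beta = rng.normal(size=n_informative)      # main effects on the first 10 variables
tau = rng.normal(size=(2, n_informative))  # interaction effects, one row per treatment

# Response: main effects plus treatment-specific interaction effects plus noise
y = X[:, :n_informative] @ beta + rng.normal(size=n)
for k in (1, 2):
    y += np.where(t == k, X[:, :n_informative] @ tau[k - 1], 0.0)
```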

I am comparing this new loss with a multi-output model that minimizes mean squared error while allowing missing values (each observation only has a label for its observed treatment’s output). The network architecture itself is identical; there are only two differences: 1) the optim model uses a softmax activation for its final output while the multi-output model uses a linear activation, and 2) the loss functions.

The results show that the model optimizing the ERUPT metric is useful, showing gains relative to doing nothing, while the multi-output model performs worse than doing nothing.

Code can be found here.

Conclusion:

This method worked for this particular dataset, but your mileage may vary. I’m not claiming this will be the “best” model or approach for uplift modeling. The advantage of this approach is that it is simple to test and implement compared to the more theoretical methods proposed in the literature. After all, the most useful model is one that’s used.

IbottaML is hiring Machine Learning Engineers, so if you’re interested in working on challenging machine learning problems like the one described in this article, give us a shout. Ibotta’s career page is here.
