Multiple Responses in A/B Testing

Sam Weiss
Published in Building Ibotta
Oct 16, 2018

tl;dr

Our data scientists use previous experiments to build a prior on the relationships between treatment effects across several response variables. Using this prior information should 1) help make more accurate estimates and 2) mitigate multiple-comparison issues.

Introduction

At Ibotta the typical A/B test randomly assigns a treatment to a subset of the population with the goal of detecting an increase (or decrease) in a particular response variable, like downloads, redemptions, or clicks. If the resulting treatment effect is larger than the noise associated with it at a specific cutoff, we conclude the treatment has a nonzero (and hopefully good) effect.

Now suppose someone has the great idea to look to see if the treatment affects another response variable. This may be for one of several reasons: perhaps they want to look at different responses in a ‘funnel’ of a process, or maybe it’s just to make sure the treatment isn’t causing harm in another area.

However, it’s not so easy to look at other response variables using p-values and traditional cutoff points because of multiple-comparison issues. By looking at several response variables at the same time you increase the chance of coming to erroneous conclusions due to noise.

The traditional remedies for multiple comparisons are to decrease cutoffs (which decreases power) or to increase sample size (which may not be feasible and is potentially costly). Multiple comparisons from a Bayesian perspective have been discussed by Gelman et al., but they did not focus on the multiple-response case. Because of these constraints, A/B tests are often forced to have one response of interest.

Looking at one variable at a time makes sense under this framework. But I had a bit of a ‘not sure if’ moment: this method makes us look at less data to make a decision. Shouldn’t we look at all the data available?

This post goes over a technique we developed at Ibotta to help with these issues. First it creates a simulation in which including a prior on the treatment effects increases accuracy. It then discusses how to get this prior in practice. Finally, it describes a test on real-world data.

Data Generation Process

The data generation process is as follows. We have a binary treatment t and several variables of interest x_i with treatment effects b_i. The goal is to accurately estimate b_i for each response.

Both the treatment effects and the response variables themselves follow multivariate normal distributions:

b_1, …, b_n ~ N(0, Σ_b)

x_1, …, x_n ~ N(μ_x, Σ_x)

Note that in this analysis and example the treatment effects and the response variables are generated independently of each other. I’d wager these are correlated to some degree in the real world, but it’s generally not possible to know a priori the relationships between these variables.

Below is the plot of a randomly created Σ_b distribution along with a random draw that represents the coefficients of the treatments at point (-1.04, -1.30). Note that the two treatment effects are highly correlated.
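That draw can be sketched in Python (the post works in R; the 2×2 covariance below is made up for illustration, since the post’s Σ_b is itself randomly generated):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariance for two treatment effects; the 0.9 off-diagonal
# makes the effects highly correlated, as in the post's plot.
sigma_b = np.array([[1.0, 0.9],
                    [0.9, 1.0]])

# One draw of the true treatment effects: b_1, b_2 ~ N(0, Sigma_b)
b = rng.multivariate_normal(mean=np.zeros(2), cov=sigma_b)
print(b)

# Many draws confirm the induced correlation between the two effects
draws = rng.multivariate_normal(np.zeros(2), sigma_b, size=5000)
print(np.corrcoef(draws.T)[0, 1])
```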

Fake Data Experiments

To simulate an experiment I add the treatment effect b_i to those that received the randomly assigned treatment t for all response variables x_i, for 1,000 observations:

y_i = b_i * t + x_i + e, where e ~ N(0, 1) is a noise term and the randomly assigned treatment t ~ Bernoulli(0.5)

In real life we’d only observe y_i and t. Running a separate regression lm(y_i ~ t) in R for each response is what I’ll call the ‘standard’ approach. The resulting b_i estimates with confidence intervals are shown.
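A minimal version of this simulation and the per-response regressions, sketched in Python with plain least squares standing in for R’s lm() (the true effects, sample size, and iid baseline responses are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
b_true = np.array([-1.04, -1.30])        # assumed true treatment effects

t = rng.binomial(1, 0.5, size=n)         # randomly assigned treatment
x = rng.normal(size=(n, 2))              # baseline responses (iid here for simplicity)
e = rng.normal(size=(n, 2))              # noise
y = t[:, None] * b_true + x + e          # observed responses

# 'Standard' approach: a separate regression y_i ~ t for each response,
# the equivalent of R's lm(y_i ~ t) done with least squares.
X = np.column_stack([np.ones(n), t])
b_hat = np.array([np.linalg.lstsq(X, y[:, i], rcond=None)[0][1] for i in range(2)])
print(b_hat)   # one treatment-effect estimate per response
```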

It appears the estimated b_1 is slightly larger in magnitude than the true value. Stopping there, we might overestimate how negative the effect is.

However, since the treatment effects are highly correlated, we can incorporate that as prior information, which should give better estimates for both coefficients. This model is estimated using Stan, and the estimate is shown below.

One can see that the estimates move in the direction of the ‘line’ of true effects. Using a prior in this particular case decreases the error between the estimated and actual treatment effects. Another way of looking at the above chart is from a Bayesian updating perspective: the ‘Standard Estimate’ can be seen as the likelihood and the ‘Bayesian Estimate’ as the posterior.
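The post fits this model in Stan; as a rough stand-in, the same kind of shrinkage can be written as a closed-form normal-normal update on the standard estimates, treating them as a Gaussian likelihood with known sampling variance (all numbers below are illustrative, not from the post):

```python
import numpy as np

def shrink_estimates(b_hat, V, sigma_b):
    """Posterior mean of treatment effects under a N(0, Sigma_b) prior,
    treating b_hat ~ N(b, V) as the likelihood (normal-normal conjugacy)."""
    prior_prec = np.linalg.inv(sigma_b)
    lik_prec = np.linalg.inv(V)
    post_cov = np.linalg.inv(prior_prec + lik_prec)
    return post_cov @ lik_prec @ b_hat

# Illustrative numbers: noisy standard estimates plus a highly
# correlated prior on the pair of effects.
sigma_b = np.array([[1.0, 0.9],
                    [0.9, 1.0]])
V = np.diag([0.05, 0.05])            # squared standard errors of b_hat
b_hat = np.array([-1.45, -1.10])
print(shrink_estimates(b_hat, V, sigma_b))
```

With a 0.9 correlation in the prior, the two estimates are pulled toward each other (toward the ‘line’ of true effects) more strongly than toward zero.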

But will this reduction in error hold in general? Repeating this experiment with new data, randomly creating x_i, t, and e while keeping the same b_i, we can get an idea of how much more accurate this method is compared with independent regressions.

Below is a plot of this experiment where the original estimates and posterior estimates are shown. It shows how regularizing the estimates in this manner will push the estimates towards the line of the expected relationship.

For this particular example, the MSE of this Bayesian method’s estimates decreased by about 60% compared with the original approach. This decrease suggests the method is useful, but the actual improvement depends on several factors (how correlated the treatment effects are, the strength of the treatment effects, the noise in the model, etc.).
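An end-to-end repeated simulation along these lines, sketched in Python with an assumed Σ_b and fixed true effects, comparing the mean squared error of the standard estimates against closed-form normal-normal shrinkage toward a N(0, Σ_b) prior:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1000, 500
sigma_b = np.array([[1.0, 0.9],
                    [0.9, 1.0]])                 # assumed prior covariance
b_true = np.array([-1.04, -1.30])                # fixed true effects

mse_std = mse_bayes = 0.0
for _ in range(reps):
    t = rng.binomial(1, 0.5, size=n)
    y = t[:, None] * b_true + rng.normal(size=(n, 2)) + rng.normal(size=(n, 2))
    # standard per-response estimate: difference in group means (== slope of y ~ t)
    b_hat = y[t == 1].mean(axis=0) - y[t == 0].mean(axis=0)
    # sampling variance of each difference in means
    v = y[t == 1].var(axis=0) / (t == 1).sum() + y[t == 0].var(axis=0) / (t == 0).sum()
    # posterior mean under the N(0, Sigma_b) prior (normal-normal update)
    post = np.linalg.inv(np.linalg.inv(sigma_b) + np.diag(1 / v)) @ (b_hat / v)
    mse_std += ((b_hat - b_true) ** 2).sum() / reps
    mse_bayes += ((post - b_true) ** 2).sum() / reps

print(mse_std, mse_bayes)
```

The size of the gap between the two errors depends on the same factors listed above; the point of the sketch is only that the shrunken estimates come out ahead when the prior correlation is real.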

Where to get the prior?

Ah, the age-old question. Knowing a priori what the relationships between the treatment effects are is hard, especially in high dimensions. Fortunately, at Ibotta we run a lot of experiments and can leverage previous results to estimate these relationships. Going back over the last year, we can get treatment effects for each experiment over several response variables and look at the relationships. Below is a pairs plot of the treatment effects of four variables across a number of experiments.

The correlations vary between pairs of response variables, but there is certainly enough structure in this data to create a covariance prior for the treatment effects.
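Building that prior from history amounts to taking the empirical covariance of the past estimated effects. A sketch (the history below is simulated; in practice each row would be a past experiment’s estimates):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical history: one row per past experiment, one column per
# response variable's estimated treatment effect (30 experiments, 4 responses).
true_cov = 0.5 * np.eye(4) + 0.5         # some shared structure across responses
historical_effects = rng.multivariate_normal(np.zeros(4), true_cov, size=30)

# The empirical covariance across experiments becomes the prior Sigma_b
# for treatment effects in future tests (rowvar=False: columns are variables).
sigma_b_prior = np.cov(historical_effects, rowvar=False)
print(sigma_b_prior.shape)   # (4, 4)
```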

However, using this prior for future experiments assumes the relationships are consistent in both time and scope. There shouldn’t be any trends or clusters of treatments; if there were, that would imply certain domains of experiments affect the response variables differently.

Visual and statistical tests were performed to make sure that there was no obvious clustering based on domain of experimentation. Another test showed there was no general increase or decrease in treatment effects over time. Based on this I expect the prior to be useful in estimating the treatment effects in real life experiments.

Simulating with real data

But how can we test the effectiveness of a new testing framework? In some sense we’ll never know the ‘true’ treatment effects of an experiment so another simulation will have to do.

To test this with our own data, I chose an experiment with a particularly large number of observations: 500,000. The treatment effects estimated using all observations were considered ‘ground truth’ for 12 response variables (yes, 12!). The simulation takes a random subset of 20,000 users and compares the standard approach with the Bayesian framework discussed.

The results show a reduction in prediction error of about a third compared with the ‘ground truth’ estimates. This suggests the method works on real-life data and can be used in production.

Conclusion

This post discussed how to use historical tests as a prior for treatment effects in order to sharpen estimates for several response variables. Not only can this mitigate multiple comparison issues but also increase the accuracy of the raw estimates.

If these kinds of projects and challenges sound interesting to you, Ibotta is hiring! Check out our jobs page for more information.
