How to Use Machine Learning to Accelerate AB Testing
Best of Both Worlds, Part 2: Double Selection & Double ML for Covariate Selection
This post was co-authored with Emily Glassberg Sands and is Part 2 of our “Best of Both Worlds: An Applied Intro to ML For Causal Inference” series (Part 1 here). Sample code, along with basic simulation results, is available on GitHub.
We’re grateful to Evan Magnusson for his strong thought-partnership and excellent insights throughout.
Whether we’re AB testing, running more sophisticated experiments, or analyzing natural experiments, noise inevitably creeps in and slows us down. For any given effect size, this noise increases time to significance, and so time to decision-making.
Want to get to the winning answer faster? Adding in controls (covariates) in a principled way can increase the precision of our estimates and speed up our testing processes. In this post, we cover two machine learning methods for principled covariate selection, and show the impact on precision in the context of AB testing. We also walk through how these same methods can be applied in a range of broader causal inference methods outside of AB testing. Throughout we discuss common gotchas, and explain why implementing these methods incorrectly can be problematic at best and simply wrong at worst.
Suppose we want to compare the effectiveness of two different promo units on click-through rates. We spin up a simple AB test, randomly assigning half of users to one of the promo units, and half to the other. Because users are randomly assigned, the difference in mean click-through rate between the two groups should be an asymptotically unbiased estimate of the true difference — especially if we wait for enough users to come through. But in the shorter run, there are random differences between the users in the two groups that create some noise in our estimates.
We can reduce that noise by controlling for key user and behavioral characteristics that are likely to be correlated with the outcome of interest (click-through rates). We of course need these to be characteristics that preceded the treatment — so as not to induce included variable bias. But even among that set of potential controls, choosing wisely matters.
The challenge — and opportunity — in tech is that we generally have a very high-dimensional feature set of potential controls. We know where the user is geo-located, the device they are using, every activity they’ve ever done on the site at each point in the past (e.g., searches, click-throughs, pageviews, purchase events), and so much more. And since we often don’t know ex ante exactly how these demographic and historical behavioral features vary with our outcome of interest, we may want to consider transformations of these features like dummies or quadratics, plus interactions across features.
The feature set quickly becomes too large for the full model to provide valuable insight. In the extreme case, the number of features in our model, p, exceeds the number of observations in our data set, n, and an ordinary least squares (OLS) model fitting the outcome on the treatment with the full vector of potential controls is out of the question: OLS will perfectly fit the data — including the noise. Even when p < n, we run the risk of over-fitting and ending up with mush for estimates.
At the other extreme, we could hand-pick a subset of covariates to include. And we often do. However, this type of data mining can result in cherry-picked results that may well not capture the true effect, especially in cases where our intuition on key covariates is weak and the set of potential covariates is large.
What we need, then, is a principled way of reducing the covariates to the set that really matters. Lucky for us, machine learning has a lot to say on dimension reduction, or regularization. With some twists designed by Alexandre Belloni, Victor Chernozhukov, and Christian Hansen, regularization techniques for principled covariate selection can now be leveraged for more powerful inference.
Solution 1: Double Selection
It all starts with regularization. For this example we’ll use LASSO (least absolute shrinkage and selection operator), but there are a range of alternatives including elastic nets and regularized logits.
In LASSO, the objective function is

min over β of: Σᵢ (Yᵢ − Xᵢ′β)² + λ Σⱼ |βⱼ|
Intuitively, when λ is 0, the LASSO equation is equivalent to OLS. What LASSO adds to OLS is a penalty for model complexity, governed by λ. The higher the penalty parameter, the fewer variables from the covariate vector X appear in the fitted model (i.e., the fewer coefficients β that are not 0): LASSO sends to 0 the coefficients of variables that don’t sufficiently help reduce the model’s mean squared error. LASSO is quite appealing because of its flexibility; in principle it can handle common, more complex situations in which we have clustering, fixed effects, non-normality, and more.
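To make the role of λ concrete, here is a small illustration on simulated data. It’s a hypothetical sketch in Python with scikit-learn (where the penalty λ is called `alpha`), not the post’s R/glmnet setup:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 10))
# Only two of the ten features actually matter.
Y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# As the penalty grows, more coefficients are driven exactly to zero.
nonzero_counts = [
    np.count_nonzero(Lasso(alpha=a).fit(X, Y).coef_)
    for a in (0.01, 0.5, 5.0)
]
print(nonzero_counts)
```

With a tiny penalty most coefficients survive; with a large enough penalty everything, including the true signals, is zeroed out. That cost of penalization is exactly the shrinkage issue discussed below.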
Returning to our example, we use LASSO to regress the outcome Y on the vector of potential covariates X. In R,
library(glmnet)  # provides cv.glmnet
lasso.fit.outcome <- cv.glmnet(data.matrix(X), df$Y.ab, alpha=1)
Here, we have not actually specified the value of the penalty parameter λ. Instead, λ was chosen via cross-validation. This is a common way to choose λ for prediction, and is fine for inference in many cases as well. The references at the end of this post provide more guidance on choosing λ in a range of contexts.
Like all of us, LASSO has its limitations. First, since it’s managing the tradeoff between model fit and complexity, when covariates that should be included in the model are highly correlated with each other, LASSO will often include only one of them, letting that single covariate stand in for the whole group. Alternative techniques like elastic nets and PCA can alleviate this.
Second, and more important in our context, coefficients are penalized for their magnitudes, not just for being non-zero. This is known as shrinkage, and biases estimated coefficients toward 0.
In the prediction setting, we overcome this bias with post-LASSO: First, we implement LASSO to obtain a set of features H (from the full feature set X) that are to be used in predicting the outcome Y; then, we implement a standard OLS regression of Y on H, and the coefficients estimated on the variables in H will not be biased towards 0. (Note: For one reason or another, forgetting to do a post-LASSO step is a common mistake — don’t be that data scientist!) In the prediction context, post-LASSO is sufficient.
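In code, post-LASSO is just “select, then refit.” Here is a minimal sketch in Python with scikit-learn (the post itself works in R with glmnet; the data-generating process and coefficients here are simulated purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
n, p = 500, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]        # only three features drive Y
Y = X @ beta + rng.normal(size=n)

# Step 1: LASSO (penalty chosen by cross-validation) selects the feature set H.
lasso = LassoCV(cv=5).fit(X, Y)
H = np.flatnonzero(lasso.coef_)

# Step 2: refit plain OLS on only the selected features; these coefficient
# estimates are no longer shrunk toward zero.
post = LinearRegression().fit(X[:, H], Y)
```

The LASSO coefficients on the true features are biased toward 0 by the penalty; the refit OLS coefficients recover the true magnitudes.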
But when our goal is inference, we can’t stop there. With LASSO, features that affect the outcome Y only weakly and/or are correlated with other selected features won’t be selected. This means the post-LASSO regression can actually suffer from omitted variable bias. The omitted variables may matter because they affect the outcome Y directly, or because they are correlated with the treatment T (even if our treatment should be purely random). And unfortunately for inference purposes, variables in this latter group are actually more prone to being dropped by the LASSO of Y on T and X, precisely because the treatment feature T is itself included.
The solution is Double Selection, which proceeds in a few straightforward steps.
In Double Selection, we first obtain the set of features chosen by LASSO when regressing Y on the full set of potential covariates X; we’ll call the resulting vector of chosen features H. Next, we obtain the set of features chosen by LASSO when regressing T on X; we’ll call that K. In R,
# Step 1: LASSO of the outcome Y on X; H holds the selected feature names
lasso.fit.outcome <- cv.glmnet(data.matrix(X), df$Y.ab, alpha=1)
coef <- predict(lasso.fit.outcome, type = "nonzero")
H <- X.colnames[unlist(coef)]

# Step 2: LASSO of the treatment T on X; K holds the selected feature names
lasso.fit.propensity <- cv.glmnet(data.matrix(X), df$T, alpha=1)
coef <- predict(lasso.fit.propensity, type = "nonzero")
K <- X.colnames[unlist(coef)]
If you take a peek at the code, you’ll note that, in the results of this example, H does not actually include some variables which are inputs in the functional form of Y. This is because these excluded variables are highly correlated with other variables included in the LASSO results.
Additionally, both H and K include variables not in the functional forms of Y and T. T, in particular, was assigned randomly, so we might hope that K would be empty. It isn’t: even in reasonably sized samples, the realized AB assignment is never perfectly balanced, so some covariates are correlated with T by chance.
Finally, with both sets of chosen features in hand, we regress Y on T and the union of H and K. A simple OLS regression will suffice to estimate the causal effect of the feature change. In R,
# sum.H_union_K: the selected covariate names joined with "+" (see the sample code)
eq.H_union_K <- paste("Y.ab ~ T + ", sum.H_union_K)
fit.double <- lm(eq.H_union_K, data = df)
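Putting the three steps together, here is an end-to-end sketch of Double Selection on simulated data. This is a Python/scikit-learn analogue of the R workflow above, and the data-generating process (including the true effect of 0.5) is an assumption of this toy example:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 2000, 60
X = rng.normal(size=(n, p))
T = rng.integers(0, 2, size=n).astype(float)   # randomized treatment
tau = 0.5                                      # true treatment effect
Y = tau * T + X[:, :5] @ np.array([3.0, -2.0, 2.0, 1.0, -1.0]) + rng.normal(size=n)

# Step 1: LASSO of Y on X -> selected set H
H = set(np.flatnonzero(LassoCV(cv=5).fit(X, Y).coef_).tolist())
# Step 2: LASSO of T on X -> selected set K (near-empty when T is randomized)
K = set(np.flatnonzero(LassoCV(cv=5).fit(X, T).coef_).tolist())

# Step 3: OLS of Y on T and the union of H and K
sel = sorted(H | K)
Z = np.column_stack([np.ones(n), T, X[:, sel]])
tau_hat = float(np.linalg.lstsq(Z, Y, rcond=None)[0][1])
```

The coefficient on T in the final OLS, `tau_hat`, lands close to the true 0.5, with the selected covariates soaking up most of the outcome noise.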
Et voilà! We’ve leveraged techniques from machine learning to do more robust and principled causal inference — in this case for covariate selection.
How well does principled covariate selection stack up against other methods for causal inference? In the following plot, we show the true effect and compare the estimated effect of T using OLS models with a variety of different covariate specifications on simulated data:
- no covariates
- all possible covariates (this results in perfect multicollinearity)
- almost all possible covariates
- a limited set of covariates (where we guess some correctly, but include others which should not be in the model)
- the set of covariates chosen by Double Selection
As you can see, the competition isn’t even close: Compare each of the point estimates and the 95% confidence intervals around them to the true effect (the salmon-colored bar on the far left) of +2%.
First, consider the raw AB test results, represented by the gold bar; without controls, the test would be far from resolved as we’d estimate an effect size in the range [-1%, 12%].
Next consider the green bar where we throw in all controls: here we actually get the wrong answer, because the model perfectly fits the data, noise included. The turquoise bar (including the largest subset of covariates) and the blue bar (limited controls) do a little better than the raw (uncontrolled) case, but their estimates will be highly variable depending on the set of covariates chosen, and hand-picking covariates is subject to data mining.
By far the best performing is the Double Selection method (in pink), which has a point estimate almost exactly on par with the true effect, and narrow error bars around that estimate. With the exact same experiment but slightly different analysis methods we went from
I’ve little idea; the 95% confidence bands range from -1% to +12%??
The 95% confidence bands range between +1.9% and +2.2%
Recall the true effect is 2%, so that’s pretty good! Double Selection dramatically outperforms raw (uncontrolled) mean comparison and each of the unprincipled covariate selection methods.
Now, Double Selection admittedly won’t always look this good. In this case, our simulated data is quite noisy and we observe many of the underlying features driving that noise, so correctly controlling for those features makes a huge difference. But in our experience, that scenario isn’t unrealistic in many tech contexts.
Additional Example: RDD
Double Selection isn’t only relevant for analyzing AB testing — this same approach is powerful for analyzing a whole host of natural experiments using a range of more sophisticated econometric techniques for causal inference. Let’s take just one more example…
Suppose we want to measure the effect of a global change like newly launched videos on a user’s 1-day time spent onsite. This isn’t a change we can AB test: because users on our platform talk to each other, the control group would be contaminated by knowledge of the treatment group’s experience — not to mention frustrated about not having access to the new content themselves. So we roll the new videos out to all users at once.
To inform investment in future new video launches, we still want to understand the impact of the launch. And we can’t simply look at time spent on the new videos because of interactions across content: since videos can be substitutes (or complements), time spent on new videos may over-estimate (or under-estimate) the net effect on sitewide viewing time, given cannibalization (or positive spillovers).
What we can do is use a regression discontinuity design (RDD for short) to estimate the site-wide impact. In brief, the RDD design compares outcomes between users on the site in the day(s) just before the launch to users on the site in the day(s) just after.
We believe users are quasi-randomly distributed around the launch (we didn’t announce it, and no other events were co-timed), so differences in outcomes between the two groups plausibly capture the effect of the new content launch. Even so, as before we may want to control for key user and behavioral characteristics that are likely to be correlated with both the outcome of interest (time spent onsite) and potentially also the treatment (being just after vs. just before the launch). Adding these controls can reduce noise and increase the precision of our RDD estimate.
The process is the same as before. Let Y be the outcome, T an indicator variable for before/after the change, and W the running variable. We first obtain the sets H and K, as before. Then, instead of implementing an OLS, we do a standard RDD estimation, using the union of H and K as controls.
library(rdd)  # provides RDestimate
fit.rdd <- RDestimate(eq.H_union_K, data = df)
This should be considerably more precise than the naive RDD alone.
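To see the mechanics without the rdd package, here is a hypothetical local-linear RDD sketch in Python on simulated data. The bandwidth, the single control covariate, and the true effect of 2.0 are all assumptions of this toy example:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
W = rng.uniform(-1, 1, size=n)          # running variable: days from launch
T = (W >= 0).astype(float)              # treated = on site after launch
X1 = rng.normal(size=n)                 # a pre-launch user covariate
tau = 2.0                               # true launch effect
Y = tau * T + 1.0 * W + 1.5 * X1 + rng.normal(size=n)

# Local linear regression within a bandwidth h of the cutoff, allowing
# separate slopes on each side, with the selected control added.
h = 0.5
m = np.abs(W) <= h
Z = np.column_stack([np.ones(m.sum()), T[m], W[m], (W * T)[m], X1[m]])
tau_hat = float(np.linalg.lstsq(Z, Y[m], rcond=None)[0][1])
```

Dropping the `X1[m]` column from `Z` gives the naive RDD; including it shrinks the standard error on `tau_hat`, which is the whole point of adding the Double Selection controls.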
Solution 2: Double Machine Learning
Double Selection is simple to implement and quite intuitive, but these days there’s a new kid on the block known as Double Machine Learning, or Double ML. The main idea behind Double ML is to recast estimating the effect of T as a set of prediction problems, carefully set up and combined so that the bias in those prediction problems doesn’t show up in the estimated effect of T.
Relative to Double Selection, Double ML removes asymptotic bias on the estimated coefficient of T, so in the right setting it’s likely to be the clear winner. However, it’s a bit less intuitive, and can be more complicated to use depending on the method you’re employing for inference (such as an RDD). It also involves splitting the sample, which means yet another step in the estimation procedure, and if your sample is not sufficiently large then the cost of splitting the sample may not justify the improvement in bias reduction. We’ll cover Double ML only briefly here; interested readers should check out Chernozhukov et al. (2017) for more details.
How does Double ML work?
It starts with something that looks a lot like Double Selection — obtaining the vectors of important features H and K.
We then regress Y on H, and compute the difference between Y and the predicted values of Y from the model (i.e., the residuals) which we’ll call Y*. We similarly regress T on K, and compute T*, the difference between T and the predicted values of T from the model. Finally, we regress Y* on T*. The resulting coefficient on T* is the point estimate for the causal effect of T on Y.
This entire process should be run on subsamples of the data, after which we average out the estimated coefficient of interest to obtain our estimate of the causal effect. That’s it!
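The steps above can be sketched as follows, using a common variant (cross-fitting) in which each fold’s residuals come from models fit on the other fold, followed by a single final regression. This is a Python sketch using LASSO for both residualizations; the confounded data-generating process and the true effect of 0.5 are assumptions of this illustration:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n, p = 2000, 60
X = rng.normal(size=(n, p))
T = 0.5 * X[:, 0] + rng.normal(size=n)   # treatment depends on X
tau = 0.5                                # true treatment effect
Y = tau * T + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

Y_res = np.empty(n)
T_res = np.empty(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    # Residualize Y and T on X, predicting each fold with a model fit
    # on the other fold (this is the sample splitting).
    Y_res[test] = Y[test] - LassoCV(cv=5).fit(X[train], Y[train]).predict(X[test])
    T_res[test] = T[test] - LassoCV(cv=5).fit(X[train], T[train]).predict(X[test])

# Final step: regress Y* on T*; the slope is the estimated effect of T.
tau_hat = float(T_res @ Y_res / (T_res @ T_res))
```

Even though T here is confounded by X, the residual-on-residual regression recovers an estimate close to the true effect of 0.5.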
An important attribute of Double ML is that we need not necessarily use LASSO and OLS to compute residuals; other tools like random forests, boosted trees, and even deep neural nets can also be used in those first steps.
Making use of these more sophisticated models obviously has great potential and is cause for excitement. But as with any modeling choice, we recommend making your decision of what tool to use based on the problem in front of you — and not the shininess of the tool. In our experience, it’s best to start simple and then incrementally develop a perspective on where adding complexity will — and won’t — be worth it. Always keep the benefits of simplicity in mind, which include interpretability, explainability, maintainability, and just the ease of knowing when things have gone awry.
In this post, we’ve covered two machine learning techniques for principled covariate selection. Whether you’re AB testing, running more complicated experiments, or estimating causal relationships from natural experiments, the methods here can help you get the best controls to learn the right answer faster. Want to learn more? Check out the references below.
Coming soon in our next post, we’ll cover ML for instrumental variable selection with many potential instruments.
1. Belloni, Alexandre, Daniel Chen, Victor Chernozhukov, and Christian Hansen. 2012. “Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain.” Econometrica 80(6): 2369–2429.
2. Belloni, Alexandre, Victor Chernozhukov, Iván Fernández-Val, and Christian Hansen. 2013. “Program Evaluation with High-dimensional Data.” arXiv e-print, arXiv:1311.2645.
3. Belloni, Alexandre, Victor Chernozhukov, Christian Hansen, and Damian Kozbur. 2014. “Inference in High-Dimensional Panel Data.”
4. Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. 2014. “High-Dimensional Methods and Inference on Structural and Treatment Effects.” Journal of Economic Perspectives, 28(2): 29–50.
5. Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2017. “Double/Debiased Machine Learning for Treatment and Causal Parameters.”