Machine Learning Meets Instrumental Variables

Best of Both Worlds, Part 3: ML for Instrument Selection

Emily Glassberg Sands
Teconomics
13 min read · Dec 31, 2017


This post was co-authored with Duncan Gilchrist and is Part 3 of our “Best of Both Worlds: An Applied Intro to ML For Causal Inference” series (Part 1 here, Part 2 here). Sample code, along with basic simulation results, is available on GitHub.

We’re grateful to Evan Magnusson for his strong thought-partnership and excellent insights throughout.


This post covers an exciting set of methods at the intersection of statistics, econometrics, and machine learning that allow data scientists to leverage vast repositories of AB results to estimate the causal effect of specific user behaviors on outcomes.

You probably have intuition that certain user behaviors drive the downstream outcomes your company cares about. For example, you might be launching a new version of your mobile app because you believe mobile app usage drives overall engagement. Or you could be expanding chat support to your customer service workflow because you believe chat support usage increases user NPS. You may even have some observational evidence that these behaviors are meaningfully correlated with the downstream outcomes you care about.


But changing the behaviors will only change the outcomes that matter if the correlation is causal. And causality is notoriously hard to glean from observational data because of confounders. For example, app users may retain better on the platform, but this could be simply because more engaged users are more likely to use the app — not because app usage causes retention.

AB testing alone won’t suffice here because you generally can’t experiment on user behavior directly. For example, it’s infeasible to randomize users into downloading the mobile app, or into actually using chat support.

What you can do — and what we walk through at length below — is hunt down (or even create) AB tests that drive these behaviors, and then analyze them in smart ways to back out the causal effect of the behavior itself on downstream outcomes.

In this post, we’ll show you how you can think of your AB tests as Randomized Encouragement Trials, estimate the causal effect of user behaviors on downstream outcomes using Instrumental Variables, and leverage machine learning to make all of this as seamless and high-powered as possible.

The Basics of Instrumental Variables for Causal Inference

Instrumental Variables, or IV for short, lets you measure the effect of a behavior on the outcome you care about by identifying off of random variation in that behavior.

The essence of the magical IV method is straightforward: We find an instrument (or set of instruments), Z, such that

  1. Z meaningfully affects the behavior of interest, X, and
  2. Z only affects the outcome of interest, Y, through its effect on X.

The first condition is referred to as the strong first stage assumption, and the second as the exclusion restriction.

If these two conditions hold, we simply instrument for X with Z (regressing X on Z, the “first stage”), and then estimate the effect of the instrumented X on Y (regressing Y on predicted X, the “second stage”). We provide an overview of IV in Section 5 of this post.
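In equations, the simplest version, with a single instrument and no controls, is the standard two-stage least squares (2SLS) setup:

First stage: $X_i = \pi_0 + \pi_1 Z_i + \nu_i$

Second stage: $Y_i = \beta_0 + \beta_1 \hat{X}_i + \varepsilon_i$

where $\hat{X}_i$ is the fitted value from the first stage and $\beta_1$ is the causal effect of X on Y that we're after.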

Historically, economists leveraged as instruments the randomness created by policies or other quasi-random variation in the world:

  • Josh Angrist and Victor Lavy estimate the effect of class size on test scores by identifying off of random variation in class size generated by Maimonides' rule. (In Israeli public schools, class sizes increase one-for-one with enrollment up to 40 students, but when a 41st student enrolls, the class is split and average class size drops sharply to about 20.5 students.)
  • Angrist and Alan Krueger estimate the effect of years of schooling on earnings by identifying off of random variation in schooling generated by the Vietnam draft lottery. (Men whose birthdays drew unlucky lottery numbers faced a higher risk of being drafted, and so were more likely to stay in school longer to defer service.)
  • Steve Levitt estimates the effect of prison population size on crime by identifying off of random variation in prison populations generated by overcrowding litigation. (Overcrowding litigation creates quasi-random shocks that reduce prison population sizes.)
  • We estimate the effect of opening weekend movie-going on subsequent movie demand by identifying off of random variation in opening weekend movie-going generated by weather shocks that weekend. (When it’s unexpectedly nice out, fewer people attend the movies.)

But what about in tech?

Instrumenting with AB Test Bucketing

IV methods are particularly powerful in tech because AB tests can themselves be thought of — and used — as instruments. These tests often directly drive some behavior of interest, thus unlocking our ability to understand how those behaviors affect downstream outcomes the company cares about.

Experiments used in this way actually have a special name: Randomized Encouragement Trials. Donald Rubin — yes, of Rubin Causal Model fame — and others have written about encouragement designs in the social sciences (see, e.g., here for a discussion).

Let’s take an example in tech.

Suppose we run an e-commerce site that allows buyers to leave reviews for products. We want to know if leaving a review causes a buyer to be more likely to retain, for example because the buyer feels more connected to the site.

If we simply regress retention on leaving a review, we’ll very likely overestimate the causal effect: buyers that leave reviews are more dedicated users than those who do not, and this confounding variable will upwardly bias our estimate of the relationship between leaving a review and retaining.

Ideally, we would randomize users into leaving a review or not. But leaving a review is not a behavior we can force.


Suppose at some point in the past we ran an AB test that drove reviews. For example, we sent emails to all buyers — half received an email that included a link to leave a review (treatment), the other half received an otherwise identical email that included no such link (control). We can use this AB test as an instrumental variable to estimate the effect of reviewing on retention.

Let’s denote the treatment (vs. control) bucketing with a binary Z, leaving a review as X and retaining beyond 30 days as Y. We estimate the effect of (instrumented) X on Y, where X is instrumented for with Z. In R, we can simply use the AER package and the ivreg command:

library(AER)
fit <- ivreg(Y ~ X, ~ Z, data = df)  # second argument is a one-sided formula naming the instrument(s)
summary(fit, vcov = sandwich, df = Inf, diagnostics = TRUE)

This code does the two steps for us, but what’s happening under the hood? First, we are regressing X on Z using OLS; that’s the first stage. Then, using the estimated model, we are generating predicted values of X, let’s call them X*. In the second stage, we’re regressing Y on X*. And voilà, the resulting coefficient is our estimated causal effect of leaving a review on retention.
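To make that concrete, here’s a minimal sketch of the two stages run by hand, assuming the same df with columns Y, X, and Z (with the caveat, discussed below, that the second stage’s default standard errors are not quite right):

first.stage <- lm(X ~ Z, data = df)        # first stage: regress X on Z
df$X.hat <- fitted(first.stage)            # predicted values, our X*
second.stage <- lm(Y ~ X.hat, data = df)   # second stage: regress Y on X*
coef(second.stage)["X.hat"]                # matches the coefficient on X from ivreg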

Recall the two requirements for this approach to be valid:

  1. We have a strong first stage. Here, this means that the treatment group was meaningfully more likely to leave a review than the control group. We can see this directly in the AB test results, and we’ll show shortly how to test for this statistically.
  2. The exclusion restriction is satisfied. Here, this means that the AB test only affected retention through its effect on leaving reviews. Since the only difference between the two groups was that the treatment group’s email included a review link, this is likely satisfied. If, however, the control group had received no email at all, this condition would not be satisfied, since receipt of an email may itself affect retention.

Why does this work? Recall we couldn’t simply regress Y on X because an unobservable confounder, C (here, how dedicated the buyer is), was influencing both X and Y; that is, changes in X weren’t exogenous to Y. So we instead computed X*, which by construction is a function only of Z. If Z satisfies the exclusion restriction, variation in X* is exogenous to Y, and the effect of a change in X* on Y is the causal effect of X on Y.

And how well does this work? Very well if the two assumptions above are satisfied, and much less well if they are not. The plot below shows, on simulated data, both the true effect of X on Y and the estimated effect under different scenarios.

Corresponding code available on GitHub

The true effect (red bar) is 2%. A good instrument (gold bar) gives us an estimate of ~2.04% — and we’re 95% confident that the true effect is between +1.96% and +2.12%. This is far better than the naive OLS estimate (green bar), which gives us a biased estimate of the true effect, and with misleadingly narrow confidence bands.

However, IV done wrong can be just as misleading as OLS, if not more! When we use a weak instrument (blue bar), our estimate is both biased and imprecise. And using an instrument that doesn’t satisfy the exclusion restriction (pink bar) is even worse: the estimate is not only biased but also precise, leaving us dangerously confident in that biased estimate.

Clearly, good instruments matter. So how do we know if an instrument is good?

  • To check for a strong first stage, we look for a statistically significant (and not close to 0) relationship when we regress X on Z. A good rule of thumb is that if the first-stage F-statistic is at least 11, we can be reasonably confident that Z drives X. The ivreg function reports this directly as the weak instruments diagnostic (see the short snippet after this list). If Z is not a strong predictor of X, then we have a weak instrument and the resulting second-stage estimate will be biased (generally in the direction of the naive OLS estimate) and imprecise.
  • The exclusion restriction isn’t directly testable. It requires thinking through the possible relationships between Z and Y. If Z affects Y through any channel other than the endogenous regressor X, then X* is still an endogenous regressor and Z is not a valid instrument.
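For example, with the fit and df from the review example above, you can read off the weak-instruments diagnostic, or run the first stage yourself and inspect its F-statistic:

# The "Weak instruments" row of the diagnostics table reports the first-stage F-test
summary(fit, vcov = sandwich, df = Inf, diagnostics = TRUE)

# Or run the first stage by hand and check its F-statistic directly
summary(lm(X ~ Z, data = df))$fstatistic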

A quick note on the error bars: The R code to print the IV results isn’t just a basic call to summary(fit) because standard errors have to be corrected to account for the two-stage design, which generally makes them slightly larger. (We have to account for the fact that there is variability in both stages, not just the second.) If you’re manually running the two stages, you’ll want to similarly correct the second-stage standard errors; a minimal sketch of one such correction follows.
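Continuing the hand-rolled two-stage sketch from above, one way to apply that correction (an illustrative sketch, not how ivreg computes its robust errors) is to recompute the residual variance using the actual X rather than the fitted X*, then rescale the naive second-stage covariance matrix:

# Residuals using the actual X (not X.hat) together with the second-stage coefficients
resid.iv <- df$Y - cbind(1, df$X) %*% coef(second.stage)
sigma2.iv <- sum(resid.iv^2) / df.residual(second.stage)
# Rescale the naive covariance matrix by the ratio of corrected to naive residual variance
vcov.iv <- vcov(second.stage) * sigma2.iv / sigma(second.stage)^2
sqrt(diag(vcov.iv))   # corrected (non-robust) second-stage standard errors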

When There Are Many Possible Instruments: LASSO for Instrument Selection

Using a single instrument to compensate for an endogenous regressor is rather straightforward. But what about when the company has run not one but dozens or even hundreds of tests driving key behaviors?


We could include all the tests as instruments, but that would risk overfitting the first stage. (And if the number of instruments exceeds the number of observations N, the first stage won’t even run.) We could instead hand pick a few tests to use as instruments, but how will we know if we got the best ones? And if we find the ‘best’ ones, are we just data mining?

If you’ve read this post, you may be thinking, “or we could add in some machine learning,” and you’d be right! Belloni, Chernozhukov, and Hansen outline how to use LASSO — among other methods — to optimally instrument an endogenous regressor using the fewest possible instruments.

Let M represent the full set of possible instruments after removing any instruments that don’t satisfy the exclusion restriction (remember, there is no statistical test for this; you have to think through the validity of each experiment as an instrument). Then, we can proceed in two steps:

  • Run a LASSO of X on M. Let Z denote the resulting set of variables selected by LASSO.
library(glmnet)
# LASSO of X on the full candidate set M (alpha = 1 gives the LASSO penalty)
lasso.fit <- cv.glmnet(M.excl, df.excl$X, alpha = 1)
# Pull out the variables with nonzero coefficients at the selected lambda
coef <- predict(lasso.fit, type = "nonzero")
M.excl.names <- colnames(M.excl)
Z <- M.excl.names[unlist(coef)]
  • Run an IV of Y on X, instrumenting for X with Z. Be sure to check the strength of your first stage in ivreg (or by running your own post-LASSO first stage).
# Build a one-sided formula listing the LASSO-selected instruments
Z.list <- as.formula(paste("~", paste(Z, collapse = " + ")))
fit.lasso <- ivreg(Y ~ X, Z.list, data = df.excl)
summary(fit.lasso, vcov = sandwich, df = Inf, diagnostics = TRUE)

The ivreg output shows that we can reject the null hypothesis of weak instruments. We can also look at the output of the first stage directly by running a post-LASSO regression of X on Z. In R:

firststage <- as.formula(paste("X ~", paste(Z, collapse = " + ")))
fit.firststage <- lm(firststage, data = df.excl)
summary(fit.firststage)

Scrolling through the results on our simulated data, we see an F-statistic of over 150, certainly well above our rule of thumb of 11, and we reject the hypothesis that our LASSO-chosen instruments are collectively weak.

So how do these LASSO-chosen instruments stack up against other methods? The plot below shows, on simulated data, the true effect of X on Y and the estimated effect under different OLS and IV scenarios.

Corresponding code available on GitHub

The true effect of X on Y (red bar) is +2%. Using LASSO to select instruments (green bar) gives us the least biased estimate of the effect, +1.98%. In this example, using all available instruments (purple bar) isn’t actually that much better than just running naive OLS (teal bar).

Incorporating Controls with Instrumental Variables

An astute reader of the series may be asking, “What if we want to add in additional controls to reduce variance and/or bias in our estimates?” Well, you’ll need additional controls in your IV, too.

In a simple IV where you aren’t using LASSO, you can just include a set of controls, K, in both the first and second stages. But when running a LASSO to select instruments, it gets a little trickier. You have two main options:

  1. The first option, relevant if you have strong priors about the set of K to include, is to run a LASSO of X on the union of K and M, and simply force LASSO to include all of K in the final set of chosen variables (see the sketch after this list). An alternative way to operationalize this is to first residualize Y, X, and M by regressing each on the set of K, and then run the above instrument-selection process on the residualized variables. That’s what we do in our paper instrumenting for opening weekend movie-going with weather that weekend. (Weather isn’t exogenous to movie demand: movies are slotted for certain times of the year when general weather patterns are known and counted on. So we really do need additional controls in the IV, not just to increase precision but also to remove bias.)
  2. The second option, relevant if you have a large set of possible K and want to use LASSO to narrow down the set of controls, is to run a LASSO of Y on X and K, take the resulting set of controls, and proceed with the IV analysis from there.
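Here’s a minimal sketch of the first option, assuming a hypothetical matrix K.mat of controls to force in, alongside the M.excl candidate-instrument matrix and df.excl from above (and assuming the same control and instrument columns also live in df.excl):

library(glmnet)
library(AER)

# penalty.factor = 0 exempts the controls from the LASSO penalty so they are always retained;
# the candidate instruments keep the usual penalty of 1
KM <- cbind(K.mat, M.excl)
lasso.ctrl <- cv.glmnet(KM, df.excl$X, alpha = 1,
                        penalty.factor = c(rep(0, ncol(K.mat)), rep(1, ncol(M.excl))))
selected <- colnames(KM)[unlist(predict(lasso.ctrl, type = "nonzero"))]
Z.ctrl <- setdiff(selected, colnames(K.mat))   # keep only the selected instruments

# Controls enter both stages: they appear on both sides of the "|" in the ivreg formula
iv.form <- as.formula(paste("Y ~ X +", paste(colnames(K.mat), collapse = " + "),
                            "|", paste(c(Z.ctrl, colnames(K.mat)), collapse = " + ")))
fit.ctrl <- ivreg(iv.form, data = df.excl)
summary(fit.ctrl, vcov = sandwich, df = Inf, diagnostics = TRUE)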

Deep Learning for Instrumental Variables

A key limitation of the two stage IV method outlined above is that it estimates a local average treatment effect, averaging across the users who were moved to the behavior by the instruments. But we might imagine different groups are affected by leaving a review in different ways. For example, the effect may be different for first-time buyers than for repeat buyers. If we want to run a counterfactual experiment of changing the way our review system works, we would want to take this heterogeneity into account.

Nonparametric IV methods have been crafted for precisely this purpose, but are typically computationally infeasible with a large number of covariates, instruments, and/or observations.

Enter Deep Instrumental Variables, a frontier method developed by Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy (2017). Deep IV is a two stage procedure that utilizes deep neural networks:

  • In the first stage, we fit the conditional distribution of X given a set of instruments and controls with a deep neural net (this is not so different from the first stage outlined above).
  • In the second stage, we incorporate the fitted conditional distribution to minimize a loss function using a second deep neural net, and use an out-of-sample causal validation procedure to tune the deep network hyper-parameters.

A Python library for Deep IV built on top of Keras is available on GitHub.

Innovations like Deep IV are exciting. But remember to keep the benefits of simplicity in mind — interpretability, explainability, maintainability, and just plain knowing when things have gone awry. Start with the simplest possible solution and build from there!

With holiday code freezes still in place, now’s the perfect time to learn from the past. The methods here should empower you to estimate the effect of key user behavior on the downstream metrics that matter to your company.


Interest piqued by the discussion of effect heterogeneity? Subscribe to our Teconomics blog on Medium to get notified of our next post — we’ll be covering causal trees for estimating heterogeneous causal effects.
