5 Tricks When AB Testing Is Off The Table

An applied introduction to causal inference in tech

Published in Teconomics · 18 min read · Mar 24, 2016

This article was co-authored with Duncan Gilchrist. Sample code, along with basic simulation results, is available on GitHub.

At least a dozen times a day we ask, “Does X drive Y?” Y is generally some KPI our companies care about. X is some product, feature, or initiative.

The first and most intuitive place we look is the raw correlation. Are users engaging with X more likely to have outcome Y? Unfortunately, raw correlations alone are rarely actionable. The complicating factor here is a set of other features that might affect both X and Y. Economists call these confounding variables.

In ed-tech, for example, we want to spend our energy building products and features that help users complete courses. Does our mobile app meet that criterion? Certainly the raw correlation is there: users who engage more with the app are on average more likely to complete. But there are also important confounders at play. For one, users with stronger innate interest in the product are more likely to be multi-device users; they are also more likely to make the investments required to complete the course. So how can we estimate the causal effect of the app itself on completion?

The knee-jerk reaction in tech is to get at causal relationships by running randomized experiments, commonly referred to as AB tests. AB testing is powerful stuff: by randomly assigning some users and not others an experience, we can precisely estimate the causal impact of the experience on the outcome (or set of outcomes) we care about. It’s no wonder that for many experiences — different copy or color, a new email campaign, an adjustment to the onboarding flow — AB testing is the gold standard.

For some key experiences, though, AB tests can be costly to implement. Consider rolling back your mobile app’s full functionality from a random subset of users. It would be confusing for your users, and a hit to your business. On sensitive product dimensions — pricing, for example — AB testing can also hurt user trust. And if tests are perceived as unethical, you might be risking a full-on PR disaster.

Here’s the good news: just because we can’t always AB test a major experience doesn’t mean we have to fly blind when it matters most. A range of econometric methods can illuminate the causal relationships at play, providing actionable insights for the path forward.

Econometric Methods

First, a quick recap of the challenge: we want to know the effect of X on Y, but there exists some set of confounding factors, C, that affects both the input of interest, X, and the outcome of interest, Y. In Stats 101, you might have called this omitted variable bias.

The solution is a toolkit of five econometric methods we can apply to get around the confounding factors and credibly estimate the causal relationship:

Controlled Regression
Regression Discontinuity Design
Difference-in-Differences
Fixed-Effects Regression
Instrumental Variables (with or without a Randomized Encouragement Trial)

This post will be applied and succinct. For each method, we’ll open with a high-level overview, run through one or two applications in tech, and highlight major underlying assumptions or common pitfalls.

Some of these tools will work better in certain situations than others. Our goal is to get you the baseline knowledge you need to identify which method to use for the questions that matter to you, and to implement effectively.

Let’s get started.

Method 1: Controlled Regression

The idea behind controlled regression is that we might control directly for the confounding variables in a regression of Y on X. The statistical requirement for this to work is that the distribution of potential outcomes, Y, should be conditionally independent of the treatment, X, given the confounders, C.

Let’s say we want to know the impact of some existing product feature — e.g., live chat support — on an outcome, product sales. The “why we care” is hopefully self-evident: if the impact of live chat is large enough to cover the costs, we want to expand live chat support to boost profits; if it’s small, we’re unlikely to expand the feature, and may even deprecate it altogether to save costs.

We can easily see a positive correlation between use of chat support and user-level sales. We also probably have some intuition around likely confounders. For example, younger users are more likely to use chat because they are more comfortable with the chat technology; they also buy more because they have more slush money.

Since youth is positively correlated with chat usage and sales, the raw correlation between chat usage and sales would overstate the causal relationship. But we can make progress by estimating a regression of sales on chat usage controlling for age. In R:

fit <- lm(Y ~ X + C, data = ...)
summary(fit)
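To make the bias concrete, here is a minimal simulation sketch of the chat example. Everything in it is an assumption for illustration: the variable names (`young`, `chat`, `sales`), the sample size, and the effect sizes are invented, and the true effect of chat on sales is set to 2 by construction.

```r
# Simulated data: all variable names and effect sizes are illustrative assumptions.
set.seed(1)
n <- 10000
young <- rbinom(n, 1, 0.5)                      # C: confounder (1 = younger user)
chat  <- rbinom(n, 1, 0.2 + 0.4 * young)        # X: younger users chat more
sales <- 10 + 2 * chat + 5 * young + rnorm(n)   # Y: true effect of chat is 2
sim <- data.frame(sales, chat, young)

naive      <- lm(sales ~ chat, data = sim)          # omits the confounder
controlled <- lm(sales ~ chat + young, data = sim)  # controls for it

coef(naive)["chat"]       # overstates the effect (it absorbs the age gap)
coef(controlled)["chat"]  # close to the true value of 2
```

Because the confounder is observed here, adding it to the regression is enough to recover the true effect. The sketch says nothing about the harder case where the confounder is unobserved.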

The primary — and admittedly frequent — pitfall in controlled regression is that we often do not have the full set of confounders we’d want to control for. This is especially true when confounders are unobservable — either because the features are measurable in theory but are not available to the econometrician (e.g., household income), or because the features themselves are hard to quantify (e.g., inherent interest in the product).

In this case, often the best we can do within the controlled regression context is to proxy for unobservables with whatever data we do have. If we see that adding the proxy to the regression meaningfully impacts the coefficient on the primary regressor of interest, X, there’s a good chance regression won’t suffice. We need another tool.

But before moving on, let’s briefly cover the concept of bad controls, or why-we-shouldn’t-just-throw-in-the-kitchen-sink-and-see-what-sticks (statistically). Suppose we were concerned about general interest in the product as a confounder: the more interested the user is, the more she engages with our features, including live chat, and also the more she buys from us. We might think that controlling for attributes like the proportion of emails from us she opens could be used as a proxy for interest. But insofar as the treatment (engaging in live chat) could in itself impact this feature (e.g., because she wants to see follow-up responses from the agent), we would actually be inducing included variable bias. The take-away? Be wary of controlling for variables that are themselves not fixed at the time the “treatment” was determined.
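A small simulation can illustrate the bad-control problem. This is a sketch under simplifying assumptions, not the scenario above verbatim: chat usage is randomized here so that the plain regression is unbiased and any distortion comes purely from the added control. The variable names and coefficients are hypothetical.

```r
# Simulated data: names and coefficients are illustrative assumptions.
set.seed(2)
n <- 20000
chat     <- rbinom(n, 1, 0.5)                  # treatment, randomized to isolate the issue
interest <- rnorm(n)                           # unobserved interest in the product
opens    <- chat + interest + rnorm(n)         # email opens: moved by chat AND by interest
sales    <- 2 * chat + 3 * interest + rnorm(n) # true effect of chat is 2
sim <- data.frame(chat, opens, sales)

clean       <- lm(sales ~ chat, data = sim)          # no post-treatment control
bad_control <- lm(sales ~ chat + opens, data = sim)  # controls for an outcome of treatment

coef(clean)["chat"]        # close to 2
coef(bad_control)["chat"]  # pulled well below 2
```

Holding email opens fixed compares chat users with non-chat users who must have higher underlying interest to open the same share of emails, which is what distorts the estimate.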

Method 2: Regression Discontinuity Design

Regression discontinuity design, or RDD, is a statistical approach to causal inference that takes advantage of randomness in the world. In RDD, we focus in on a cut-off point that, within some narrow range, can be thought of as a local randomized experiment.

Suppose we want to estimate the effect of passing an ed-tech course on earned income. Randomly assigning some people to the “passing” treatment and failing others would be fundamentally unethical, so AB testing is out. Since several hard-to-measure things are correlated with both passing a course and income — innate ability and intrinsic motivation to name just two — we also know controlled regression won’t suffice.

Luckily, we have a passing cutoff that creates a natural experiment: users earning a grade of 70 or above pass while those with a grade just below do not. A student who earns a 69 and thus does not pass is plausibly very similar to a student earning a 70 who does pass. Provided we have enough users in some narrow band around the cutoff, we can use this discontinuity to estimate the causal effect of passing on income. In R:

library(rdd)
RDestimate(Y ~ D, data = ..., subset = ..., cutpoint = ...)
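For intuition about what such an estimate does under the hood, here is a base-R sketch on simulated data. It is one common way to estimate the jump (a local linear regression with separate slopes on each side of the cutoff), not necessarily what `RDestimate` does internally. The bandwidth of 5 grade points and all numbers are assumptions, and the true jump in income at the cutoff is set to 5.

```r
# Simulated data: names, bandwidth, and effect sizes are illustrative assumptions.
set.seed(3)
n <- 20000
grade  <- runif(n, 40, 100)                    # running variable
pass   <- as.numeric(grade >= 70)              # treatment determined by the cutoff
income <- 30 + 0.2 * grade + 5 * pass + rnorm(n, sd = 3)  # true jump at 70 is 5
sim <- data.frame(income, pass, centered = grade - 70)

bandwidth <- 5                                 # hypothetical; choose with care in practice
local <- subset(sim, abs(centered) <= bandwidth)

# Separate linear trends on each side; the coefficient on pass is the jump at the cutoff
rd_fit <- lm(income ~ pass * centered, data = local)
coef(rd_fit)["pass"]
```

Narrower bandwidths make the "locally random" argument more believable but leave fewer observations, so the estimate becomes noisier.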

Here’s what an RDD might look like in graphical form:

Let’s introduce the concepts of validity:

Results are internally valid if they are unbiased for the subpopulation studied.
Results are externally valid if they are unbiased for the full population.

In randomized experiments, the assumptions underlying internal and external validity are rather straightforward. Results are internally valid provided the randomization was executed correctly and the treatment and control samples are balanced. And they are externally valid so long as the impact on the experimental group was representative of the impact on the overall population.

In non-experimental settings, the underlying assumptions depend on the method used. Internal validity of RDD inferences rests on two important assumptions:

Assumption 1: Imprecise control of assignment around the cutoff. Individuals cannot control whether they are just above (versus just below) the cutoff.

Assumption 2: No confounding discontinuities. Being just above (versus just below) the cutoff should not influence other features.

Let’s see how well these assumptions hold in our example.

Assumption 1: Users cannot control their grade around the cutoff. If users could, for example, write in to complain to us for a re-grade or grade boost, this assumption would be violated. Not sure either way? Here are a couple ways to validate:

Check whether the mass just below the cutoff is similar to the mass just above the cutoff; if there is more mass on one side than the other, individuals may be exerting agency over assignment.
Check whether the composition of users in the two buckets looks otherwise similar along key observable dimensions. Do exogenous observables predict the bucket a user ends up in better than randomly? If so, RDD may be invalid.
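The two checks above can be sketched roughly as follows. This is a crude illustration on simulated data with no manipulation built in: the binomial test is a simple stand-in for a formal density test, and `prior_courses` is a hypothetical pre-determined observable.

```r
# Simulated data: column names and the band width are illustrative assumptions.
set.seed(4)
n <- 20000
sim <- data.frame(
  grade = round(runif(n, 40, 100)),   # integer grades; nobody manipulates them here
  prior_courses = rpois(n, 2)         # hypothetical pre-determined observable
)
band <- subset(sim, grade >= 67 & grade <= 72)   # three grade points on each side
band$above <- as.numeric(band$grade >= 70)

# Check 1: is the mass just above the cutoff similar to the mass just below?
mass_test <- binom.test(sum(band$above), nrow(band), p = 0.5)
mass_test$p.value

# Check 2: do pre-determined observables predict which side a user lands on?
balance <- glm(above ~ prior_courses, family = binomial, data = band)
coef(summary(balance))["prior_courses", ]
```

A very small p-value in the first check, or a clearly non-zero coefficient in the second, would be a warning sign that users are sorting around the cutoff.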

Assumption 2: Passing is the only differentiator between a 69 and a 70. If, for example, users who get a 70 or above also get reimbursed for their tuition by their employer, this would generate an income shock for those users (and not for users with lower scores) and violate the no confounding discontinuities assumption. The effect we estimate would then be the combined causal effect of passing the class and getting tuition reimbursed, and not simply the causal effect of passing the class alone.

What about external validity? In RDD, we’re estimating what is called a local average treatment effect (LATE). The effect pertains to users in some narrow, or “local”, range around the cutoff. If there are different treatment effects for different types of users (i.e., heterogeneous treatment effects), then the estimates may not be broadly applicable to the full group. The good news is the interventions we’d consider would often occur on the margin — passing marginally more or marginally fewer learners — so a LATE is often exactly what we want to measure.

Method 3: Difference-in-Differences

The simplest version of difference-in-differences (or DD) is a comparison of pre and post outcomes between treatment and control groups.

DD is similar in spirit to RDD in that both identify off of existing variation. But unlike RDD, DD relies on the existence of two groups — one that is served the treatment (after some cutoff) and one that never is. Because DD includes a control group in the identification strategy, it is generally more robust to confounders.

Let’s consider a pricing example. We all want to know whether we should raise or lower price to increase revenue. If price elasticity (in absolute value) is greater than 1, lowering price will increase purchases by enough to increase revenue; if it’s less than 1, raising prices will increase revenue. How can we learn where we are on the revenue curve?
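The elasticity rule can be checked with a toy calculation. The sketch below assumes a constant-elasticity demand curve (quantity proportional to price raised to minus the elasticity), which is a convenient functional form and not something we would expect to hold exactly in practice. The function name and the `scale` parameter are hypothetical.

```r
# Revenue under an assumed constant-elasticity demand curve.
revenue <- function(price, elasticity, scale = 1000) {
  quantity <- scale * price^(-elasticity)   # demand falls as price rises
  price * quantity
}

# Elastic demand (|elasticity| = 1.5): a price cut raises revenue
revenue(9, 1.5) > revenue(10, 1.5)    # TRUE

# Inelastic demand (|elasticity| = 0.5): a price increase raises revenue
revenue(11, 0.5) > revenue(10, 0.5)   # TRUE
```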

The most straightforward method would be a full randomization through AB testing price. Whether we’re comfortable running a pricing AB test probably depends on the nature of our platform, our stage of development, and the sensitivity of our users. If variation in price would be salient to our users, for example because they communicate to each other on the site or in real life, then price testing is potentially risky. Serving different users different prices can be perceived as unfair, even when random. Variation in pricing can also diminish user trust, create user confusion, and in some cases even result in a negative PR storm.

A nice alternative to AB testing is to combine a quasi-experimental design with causal inference methods.

We might consider just changing price and implementing RDD around the date of the price change. This could allow us to estimate the effect of the price on the revenue metric of interest, but has some risks. Recall that RDD assumes there’s nothing outside of the price change that would also change purchasing behavior over the same window. If there’s an internal shock (a new marketing campaign, a new feature launch) or, worse, an external shock (e.g., a slowing economy), we’ll be left trying to disentangle the relative effects. RDD results risk being inconclusive at best and misleading at worst.

A safer bet would be to set ourselves up with a DD design. For example, we might change price in some geos (e.g., states or countries), but not others. This gives a natural “control” in the geos where price did not change. To account for any other site or external changes that might have co-occurred, we can use the control markets — where price did not change — to calculate the counterfactual we would have expected in our treatment markets absent the price change. By varying at the geo (versus user) level, we also reduce the salience of the price variation relative to AB testing.

Below is a graphical representation of DD. You can see treatment and control geos in the periods pre and post the change. The delta between the actual revenue (here per session cookie) in the treatment group and the counterfactual revenue in that group provides an estimate of the treatment effect:

While the visualization is helpful (and hopefully intuitive), we can also implement a regression version of DD. At its essence, the DD regression has an indicator for treatment status (here being a treatment market), an indicator for post-change timing, and the interaction of those two indicators. The coefficient on the interaction term is the estimated effect of the price change on revenue. If there are general time trends, we can control for those in the regression, too. In R:

fit <- lm(Y ~ post + treat + I(post * treat), data = ...)
summary(fit)
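As a sanity check on the regression above, here is a sketch on a simulated geo-by-week panel. The panel dimensions, column names, and effect sizes are assumptions, the true effect of the price change is set to 2, and parallel trends hold by construction. The `post:treat` term is equivalent to `I(post * treat)`.

```r
# Simulated geo-by-week panel: all names and numbers are illustrative assumptions.
set.seed(5)
geos <- 40
weeks <- 20
panel <- expand.grid(geo = 1:geos, week = 1:weeks)
panel$treat <- as.numeric(panel$geo <= geos / 2)    # half the geos get the price change
panel$post  <- as.numeric(panel$week > weeks / 2)   # change happens halfway through
geo_level <- rnorm(geos)[panel$geo]                 # persistent differences across geos

panel$revenue <- 5 + geo_level + 0.1 * panel$week +   # common time trend
  1.0 * panel$treat + 0.5 * panel$post +
  2.0 * panel$treat * panel$post +                    # true treatment effect is 2
  rnorm(nrow(panel), sd = 0.5)

dd_fit <- lm(revenue ~ post + treat + post:treat, data = panel)
coef(dd_fit)["post:treat"]   # close to 2
```

The level difference between treatment and control geos and the shared time trend both drop out of the interaction coefficient; only a trend that differs between the two groups would bias it.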

The key assumption required for internal validity of the DD estimate is parallel trends: absent the treatment itself, the treatment and control markets would have followed the same trends. That is, any omitted variables affect treatment and control in the same way.

How can we validate the parallel trends assumption? There are a few ways to make progress, both before and after rolling out the test.

Before rolling out the test, we can do two things:

Make the treatment and control groups as similar as possible. In the experimental set-up, consider implementing stratified randomization. Although generally unnecessary when samples are large (e.g., in user-level randomization), stratified randomization can be valuable when the number of units (here geos) is relatively small. Where feasible, we might even generate “matched pairs” — in this case markets that historically have followed similar trends and/or that we intuitively expect to respond similarly to any internal product changes and to external shocks. In their 1994 paper estimating the effect of minimum wage increases on employment, David Card and Alan Krueger matched restaurants in New Jersey with comparable restaurants in Pennsylvania just across the border; the Pennsylvania restaurants provided a baseline for determining what would have happened in New Jersey if the minimum wage had remained constant.
After the stratified randomization (or matched pairing), check graphically and statistically that the trends are approximately parallel between the two groups pre-treatment. If they aren’t, we should redefine the treatment and control groups; if they are, we should be good to go.
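One simple way to run the statistical part of that pre-treatment check is to regress the outcome on time, treatment status, and their interaction, using pre-period data only. The sketch below uses simulated data in which trends are parallel by construction; the column names and numbers are assumptions.

```r
# Simulated pre-period panel: names and numbers are illustrative assumptions.
set.seed(6)
geos <- 40
pre_weeks <- 10
pre <- expand.grid(geo = 1:geos, week = 1:pre_weeks)
pre$treat <- as.numeric(pre$geo <= geos / 2)
pre$revenue <- 5 + 0.1 * pre$week + 1 * pre$treat +   # same slope in both groups
  rnorm(nrow(pre), sd = 0.5)

# week:treat captures any difference in pre-period slopes between the groups
trend_fit <- lm(revenue ~ week * treat, data = pre)
coef(summary(trend_fit))["week:treat", ]
```

An interaction coefficient that is both statistically and practically distinguishable from zero would suggest the groups were already diverging before the change.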

Ok, so we’ve designed and rolled out a good experiment, but with everyone moving fast, stuff inevitably happens. Common problems with DD often come in two forms:

Problem 1: Confounders pop up in particular treatment or control markets. Maybe mid-experiment our BD team launches a new partnership in some market. And then a different product team rolls out a localized payment processor in some other market. We expect both of these to affect our revenue metric of interest.

Solution: Assuming we have a bunch of treatment and control markets, we can simply exclude those markets — and their matches if it’s a matched design — from the analysis.

Problem 2: Confounders pop up across some subset of treatment and control markets. Here, there’s some change — internal or external — that we’re worried might impact a bunch of our markets, including some treatment and some control markets. For example, the Euro is taking a plunge and we think the fluctuating exchange rate in those markets might bias our results.

Solution: We can add additional differencing by that confounder as a robustness check in what’s called a difference-in-difference-in-differences estimation (DDD). DDD will generally be less precise than DD (i.e., the point estimates will have larger standard errors), but if the two point estimates themselves are similar, we can be relatively confident that that confounder is not meaningfully biasing our estimated effect.

Pricing is an important and complicated beast probably worthy of additional discussion. For example, the estimate above may not be the general equilibrium effect we should expect: in the short run, users may be responding to the change in price, not just to the new price itself; but in the long run, users likely respond only to the price itself (unless prices continue to change). There are several ways to make progress here. For example, we can estimate the effect only on new users who had not previously been served a price and so for whom the change would not be salient. But we’ll leave a more extended discussion of pricing to a subsequent post.

Method 4: Fixed Effects Regression

Fixed effects is a particular type of controlled regression, and is perhaps best illustrated by example.

A large body of academic research studies how individual investors respond (irrationally) to market fluctuations. One metric a fin-tech firm might care about is the extent to which it is able to convince users to (rationally) stay the course — and not panic — during market downturns.

Understanding what helps users stay the course is challenging. It requires separating what we cannot control — general market fluctuations and learning about those from friends or news sources — from what we can control — the way market movements are communicated in a user’s investment returns. To disentangle, we once again go hunting for a source of randomness that affects the input we control, but not the confounding external factors.

Our approach is to run a fixed effects regression of percent portfolio sold on portfolio return controlling (with fixed effects) for the week of account opening. Since the fixed effects capture the week a user opened their account, the coefficient on portfolio return is the effect of having a higher return relative to the average return of other users funding accounts in that same week. Assuming users who opened accounts the same week acquire similar tidbits from the news and friends, this allows us to isolate the way we display movements in the user’s actual portfolio from general market trends and coverage.

Sound familiar? Fixed effects regression is similar to RDD in that both take advantage of the fact that users are distributed quasi-randomly around some point. In RDD, there is a single point; with fixed effects regression, there are multiple points — in this case, one for each week of account opening.

In R:

fit <- lm(Y ~ X + factor(F), data = ...)
summary(fit)
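Here is a simulated sketch of why the fixed effects matter. It assumes each account-opening week carries a shared, unobserved shock (news, market mood) that moves both portfolio returns and selling behavior. The names, the 50 cohorts, and the coefficients are all invented, and the true effect of return on percent sold is set to -1.

```r
# Simulated data: names and coefficients are illustrative assumptions.
set.seed(7)
n <- 20000
cohort <- sample(1:50, n, replace = TRUE)          # F: week of account opening
cohort_shock <- rnorm(50)[cohort]                  # shared, unobserved news/market exposure
ret  <- cohort_shock + rnorm(n)                    # X: portfolio return
sold <- 10 - 1 * ret - 3 * cohort_shock + rnorm(n) # Y: true effect of return is -1
sim <- data.frame(sold, ret, cohort)

pooled <- lm(sold ~ ret, data = sim)                   # ignores cohort-level confounding
fe     <- lm(sold ~ ret + factor(cohort), data = sim)  # compares users within a cohort

coef(pooled)["ret"]   # contaminated by the cohort shocks
coef(fe)["ret"]       # close to -1
```

The fixed-effects estimate uses only the variation in returns among users who opened accounts in the same week, which is exactly the within-cohort comparison described above.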

The two assumptions required for internal validity in RDD apply here as well. First, after conditioning on the fixed effects, users are as good as randomly assigned to their X values (in this case, their portfolio returns). Second, there can be no confounding discontinuities, i.e., conditional on the fixed effects, users cannot otherwise be treated differently based on their X.

For the fixed effects method to be informative, we of course also need variation in the X of interest after controlling for the fixed effects. Here, we’re ok: users who opened accounts the same week do not necessarily have the same portfolio return; markets can rise or fall 1% or more in a single day. More generally, if there’s not adequate variation in X after controlling for fixed effects, we’ll know because the standard errors of the estimated coefficient on X will be untenably large.

Method 5: Instrumental Variables

Instrumental variable (IV) methods are perhaps our favorite method for causal inference. Recall our earlier notation: we are trying to estimate the causal effect of variable X on outcome Y, but cannot take the raw correlation as causal because there exists some omitted variable(s), C. An instrumental variable, or instrument for short, is a feature or set of features, Z, such that both of the following are true:

Strong first stage: Z meaningfully affects X.
Exclusion restriction: Z affects Y only through its effect on X.

Who doesn’t love a good picture? Schematically: Z → X → Y, with C affecting both X and Y.

If these conditions are satisfied, we can proceed in two steps:

First stage: Instrument for X with Z
Second stage: Estimate the effect of the (instrumented) X on Y

In R:

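library(AER)  # package name is case-sensitive; AER also loads sandwich
fit <- ivreg(Y ~ X | Z, data = df)
# Robust standard errors plus weak-instrument (first-stage F) diagnostics
summary(fit, vcov = sandwich, df = Inf, diagnostics = TRUE)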
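To see the two stages spelled out, here is a base-R sketch on simulated data. It is meant for intuition only: the point estimate matches two-stage least squares, but the standard errors printed by the second `lm` are not valid, so a dedicated IV routine is still the right tool for inference. The data-generating numbers are assumptions, and the true effect of X on Y is set to 2.

```r
# Simulated data: all coefficients are illustrative assumptions.
set.seed(8)
n <- 50000
Z <- rbinom(n, 1, 0.5)            # instrument: e.g., a random AB bucket
C <- rnorm(n)                     # unobserved confounder
X <- 1 * Z + C + rnorm(n)         # Z shifts X (first stage); C also shifts X
Y <- 2 * X - 3 * C + rnorm(n)     # true effect of X is 2; Z enters only through X

ols <- lm(Y ~ X)                  # biased by the omitted confounder

first   <- lm(X ~ Z)              # first stage: instrument for X with Z
X_hat   <- fitted(first)          # the part of X driven by Z alone
second  <- lm(Y ~ X_hat)          # second stage: effect of instrumented X on Y
first_F <- summary(first)$fstatistic["value"]

coef(ols)["X"]         # far from 2
coef(second)["X_hat"]  # close to 2
first_F                # first-stage F-statistic; large here by construction
```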

Ok, so where do we find these magical instruments?

Economists often find instruments in policies. Josh Angrist and Alan Krueger instrument for years of schooling with the Vietnam Draft lottery; Steve Levitt instruments for prison populations with prison overcrowding litigation. Although good instruments in the real world can generate incredible insights, they are notoriously hard to come by.

The good news is that instruments are everywhere in tech. As long as your company has an active AB testing culture, you almost certainly have a plethora of instruments at your fingertips. In fact, any AB test that drives a specific behavior is a contender for instrumenting ex post for the effect of that behavior on an outcome you care about.

Suppose we are interested in learning the causal effect of referring a friend on churn. We see that users who refer friends are less likely to churn, and hypothesize that getting users to refer more friends will increase their likelihood of sticking around. (One reason we might think this is true is what psychologists call the Ikea Effect: users care more about products that they have invested time contributing to.)

Looking at the correlation of churn with referrals will of course not give us the causal effect. Users who refer their friends are de facto more committed to our product.

But if our company has a strong referral program, it’s likely been running lots of AB tests pushing users to refer more: email tests, onsite banner ad tests, incentives tests, you name it. The IV strategy is to focus on a successful AB test, one that increased referrals, and use that experiment’s bucketing as an instrument for referring. (If IV sounds a little like RDD, that’s because the two are closely related! In fact, a “fuzzy” RDD, where crossing the cutoff shifts the probability of treatment rather than determining it outright, is estimated as an IV model with the cutoff indicator as the instrument.)

IV results are internally valid provided the strong first stage and exclusion restriction assumptions (above) are satisfied:

We’ll likely have a strong first stage as long as the experiment we chose was “successful” at driving referrals. (This is important because if Z is not a strong predictor of X, the resulting second stage estimate will be biased.) The R code above reports the F-statistic, so we can check the strength of our first stage directly. A good rule of thumb is that the F-statistic from the first stage should be at least 10.
What about the exclusion restriction? It’s important that the instrument, Z, affect the outcome, Y, only through its effect on the endogenous regressor, X. Suppose we are instrumenting with an AB email test pushing referrals. If the control group received no email, this assumption isn’t valid: the act of getting an email could in and of itself drive retention. But if the control group received an otherwise-similar email, just without any mention of the referral program, then the exclusion restriction likely holds.

A quick note on statistical significance in IV. You may have noticed that the R code to print the IV results isn’t just a simple call to summary(fit). That’s because we have to be careful about how we compute standard errors in IV models. In particular, standard errors have to be corrected to account for the two-stage design, which generally makes them slightly larger.

Want to estimate the causal effects of X on Y but don’t have any good historical AB tests on hand to leverage? AB tests can also be implemented specifically to facilitate IV estimation! These AB tests even come with a sexy name — Randomized Encouragement Trials.
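As a rough sketch of how an encouragement trial feeds into IV, consider the referral example with a randomized nudge. With a single binary instrument, the IV estimate reduces to a ratio: the effect of the nudge on the outcome divided by the effect of the nudge on the behavior. The simulation below is illustrative only: the variable names, the nudge strength, and the assumption that referring lowers a continuous churn-risk score by exactly 1 for everyone are all invented.

```r
# Simulated encouragement trial: names and effect sizes are illustrative assumptions.
set.seed(9)
n <- 100000
encouraged <- rbinom(n, 1, 0.5)        # Z: randomized nudge to refer
commitment <- rnorm(n)                 # unobserved commitment to the product
refers <- as.numeric(commitment + 0.8 * encouraged + rnorm(n) > 1)    # X
churn_risk <- 5 - 1 * refers - 0.5 * commitment + rnorm(n)            # Y: true effect is -1

# Effect of the nudge on the outcome (intent-to-treat)
itt <- mean(churn_risk[encouraged == 1]) - mean(churn_risk[encouraged == 0])
# Effect of the nudge on the behavior (first stage)
take_up <- mean(refers[encouraged == 1]) - mean(refers[encouraged == 0])

wald  <- itt / take_up                 # IV estimate with one binary instrument
naive <- mean(churn_risk[refers == 1]) - mean(churn_risk[refers == 0])

wald    # close to -1
naive   # more negative: referrers are more committed to begin with
```

The nudge only needs to move the behavior; it does not need to move it for everyone. The weaker the take-up difference, though, the noisier the ratio becomes.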

It’s no surprise that we invest in building predictive models to understand who will do what when. It’s always fun to predict the future.

But it’s even more fun to improve that future. Where, when, how, and with whom can we intervene for better outcomes? By shedding light on the mechanisms driving the outcomes we care about, causal inference gives us the insights to focus our efforts on investments that better serve our users and our business.

Today, we briefly covered a range of methods for causal inference when AB testing is off the table. We hope these methods will help you uncover some of the actionable insights that can move your company’s mission forward.

Comments? Suggestions? Reach out at emily@coursera.org and duncan@wealthfront.com. Just want to do fun stuff with us? We’d love to hear from you! Plus, we’re always hiring.

Written by Emily Glassberg Sands