5 Tricks When AB Testing Is Off The Table

An applied introduction to causal inference in tech

  1. Controlled Regression
  2. Regression Discontinuity Design
  3. Difference-in-Differences
  4. Fixed-Effects Regression
  5. Instrumental Variables
    (with or without a Randomized Encouragement Trial)
fit <- lm(Y ~ X + C, data = ...)
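To see why the control C matters, here is a minimal simulated sketch. The data-generating process, the true effect of 2, and the sample size are all assumptions made for illustration; nothing here comes from real data.

```r
# Simulated sketch: omitting a confounder biases the estimate on X.
set.seed(42)
n <- 10000
C <- rnorm(n)                     # confounder: drives both X and Y
X <- 0.8 * C + rnorm(n)           # exposure depends on the confounder
Y <- 2 * X + 3 * C + rnorm(n)     # assumed true effect of X on Y is 2
df <- data.frame(Y, X, C)

naive      <- lm(Y ~ X, data = df)       # leaves C out
controlled <- lm(Y ~ X + C, data = df)   # controls for C

naive_est      <- unname(coef(naive)["X"])
controlled_est <- unname(coef(controlled)["X"])
print(c(naive = naive_est, controlled = controlled_est))
```

In this setup the naive coefficient should overshoot the assumed effect, while the controlled one should land close to it. The fix only works because C is observed; an unobserved confounder would leave the bias in place.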
library(rdd)  # provides RDestimate()
RDestimate(Y ~ D, data = ..., subset = ..., cutpoint = ...)
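The idea behind a sharp regression discontinuity can be sketched in base R without the rdd package. This is not what RDestimate does internally: the sketch uses a hand-picked bandwidth `h` and unweighted local linear fits, and the simulated jump of 0.4 is an assumption.

```r
# Simulated sketch: local linear regression on each side of a cutoff.
set.seed(1)
n <- 5000
cutpoint <- 0
D <- runif(n, -1, 1)                    # running variable
treated <- as.numeric(D >= cutpoint)    # sharp assignment at the cutoff
Y <- 1 + 0.5 * D + 0.4 * treated + rnorm(n, sd = 0.2)  # assumed jump: 0.4
df <- data.frame(Y, D, treated)

h <- 0.25                               # bandwidth, chosen by hand here
local <- subset(df, abs(D - cutpoint) <= h)

# Separate intercepts and slopes on each side; the coefficient on
# `treated` is the estimated jump in Y at the cutoff.
fit <- lm(Y ~ treated * I(D - cutpoint), data = local)
rd_est <- unname(coef(fit)["treated"])
print(rd_est)
```

Re-running with a few different values of `h` is a quick sanity check: an estimate that swings widely with the bandwidth deserves suspicion.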
  • Results are internally valid if they are unbiased for the subpopulation studied.
  • Results are externally valid if they are unbiased for the full population.
  1. Check whether the mass just below the cutoff is similar to the mass just above the cutoff; if there is more mass on one side than the other, individuals may be exerting agency over assignment.
  2. Check whether the composition of users in the two buckets looks otherwise similar along key observable dimensions. Do exogenous observables predict the bucket a user ends up in better than randomly? If so, RDD may be invalid.
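Both checks can be roughed out in base R. The sketch below simulates data with no manipulation, so it only shows the mechanics; the covariate `tenure`, the window width `h`, and the choice of a binomial test and a logistic regression are illustrative assumptions, not the only way to run these checks.

```r
# Simulated sketch of the two RDD validity checks.
set.seed(7)
n <- 5000
cutpoint <- 0
D <- runif(n, -1, 1)                    # running variable; no manipulation simulated
tenure <- rnorm(n, mean = 12, sd = 4)   # hypothetical pre-treatment observable

h <- 0.1                                # narrow window around the cutoff
near <- data.frame(D = D, tenure = tenure)[abs(D - cutpoint) <= h, ]
near$above <- as.numeric(near$D >= cutpoint)

# Check 1: is the mass split roughly evenly just below vs. just above?
density_check <- binom.test(sum(near$above), nrow(near), p = 0.5)

# Check 2: does an exogenous observable predict which side a user lands on?
balance_fit <- glm(above ~ tenure, family = binomial, data = near)
balance_p <- coef(summary(balance_fit))["tenure", "Pr(>|z|)"]

print(c(density_p = density_check$p.value, balance_p = balance_p))
```

Small p-values on either check would suggest sorting around the cutoff. A more formal density test is the McCrary test, which the rdd package exposes as `DCdensity`.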
fit <- lm(Y ~ post + treat + I(post * treat), data = ...)
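A small simulation shows what the interaction coefficient measures. The group gap, time trend, and treatment effect of 2 below are assumed values chosen for illustration.

```r
# Simulated sketch: the interaction term is the difference-in-differences.
set.seed(123)
n <- 4000
treat <- rbinom(n, 1, 0.5)     # 1 = treated group
post  <- rbinom(n, 1, 0.5)     # 1 = observed after the change
# Assumed: baseline gap of 1, common time trend of 0.5, true effect of 2
Y <- 5 + 1 * treat + 0.5 * post + 2 * post * treat + rnorm(n)
df <- data.frame(Y, post, treat)

fit <- lm(Y ~ post + treat + I(post * treat), data = df)
did_est <- unname(coef(fit)["I(post * treat)"])

# The same number computed directly from the four cell means
m <- with(df, tapply(Y, list(treat = treat, post = post), mean))
did_by_hand <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

print(c(regression = did_est, by_hand = unname(did_by_hand)))
```

The regression and the cell-mean arithmetic agree because the model is saturated in `post` and `treat`. The regression form is still the more convenient one, since it extends to added controls and gives standard errors.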
  1. Make the treatment and control groups as similar as possible. In the experimental set-up, consider implementing stratified randomization. Although generally unnecessary when samples are large (e.g., in user-level randomization), stratified randomization can be valuable when the number of units (here geos) is relatively small. Where feasible, we might even generate “matched pairs” — in this case markets that historically have followed similar trends and/or that we intuitively expect to respond similarly to any internal product changes and to external shocks. In their 1994 paper estimating the effect of minimum wage increases on employment, David Card and Alan Krueger matched restaurants in New Jersey with comparable restaurants in Pennsylvania just across the border; the Pennsylvania restaurants provided a baseline for determining what would have happened in New Jersey if the minimum wage had remained constant.
  2. After the stratified randomization (or matched pairing), check graphically and statistically that the trends are approximately parallel between the two groups pre-treatment. If they aren’t, we should redefine the treatment and control groups; if they are, we should be good to go.
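The statistical half of the parallel-trends check can be sketched as a regression on pre-treatment data only. The geo panel, the assignment rule, and the shared slope below are simulated assumptions; the standard errors also ignore clustering within geo, which a real analysis would want to address.

```r
# Simulated sketch: test for diverging pre-treatment trends.
set.seed(2024)
geos  <- 20
weeks <- 12                                        # pre-treatment weeks only
panel <- expand.grid(geo = 1:geos, week = 1:weeks)
panel$treat <- as.numeric(panel$geo <= geos / 2)   # hypothetical assignment
# Both groups share the same weekly slope (0.3) by construction
panel$Y <- 10 + 2 * panel$treat + 0.3 * panel$week +
  rnorm(nrow(panel), sd = 0.5)

# week:treat is the difference in pre-period slopes between groups;
# an estimate near zero is consistent with parallel trends.
pre_fit <- lm(Y ~ week * treat, data = panel)
slope_gap <- coef(summary(pre_fit))["week:treat", c("Estimate", "Pr(>|t|)")]
print(slope_gap)

# Group-by-week means, ready to plot for the graphical check
group_means <- aggregate(Y ~ week + treat, data = panel, FUN = mean)
```

Plotting `group_means` with one line per group covers the graphical half. A level difference between the lines is fine; what matters is that they move together.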
fit <- lm(Y ~ X + factor(F), data = ...)
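The simulation below sketches what the `factor(F)` dummies buy us. It assumes an unobserved, time-invariant effect per unit that is correlated with X; the effect sizes and panel dimensions are made up for illustration.

```r
# Simulated sketch: fixed effects absorb time-invariant confounders.
set.seed(99)
n_units <- 200
n_obs   <- 10
unit_id <- rep(1:n_units, each = n_obs)
alpha <- rnorm(n_units, sd = 2)[unit_id]   # unobserved unit-level effect
X <- 0.7 * alpha + rnorm(n_units * n_obs)  # X is correlated with alpha
Y <- 1.5 * X + alpha + rnorm(n_units * n_obs)  # assumed true effect: 1.5
df <- data.frame(Y, X, F = unit_id)

pooled <- lm(Y ~ X, data = df)               # ignores the unit effect
fe     <- lm(Y ~ X + factor(F), data = df)   # one dummy per unit
pooled_est <- unname(coef(pooled)["X"])
fe_est     <- unname(coef(fe)["X"])

# Equivalent "within" estimator: demean Y and X inside each unit
df$Y_dm <- df$Y - ave(df$Y, df$F)
df$X_dm <- df$X - ave(df$X, df$F)
within_est <- unname(coef(lm(Y_dm ~ X_dm - 1, data = df))["X_dm"])

print(c(pooled = pooled_est, fixed_effects = fe_est, within = within_est))
```

The dummy-variable and demeaned versions give the same coefficient, which makes clear that identification comes only from variation within each unit over time. Anything that changes within a unit and also affects Y is not absorbed.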
  • Strong first stage: Z meaningfully affects X.
  • Exclusion restriction: Z affects Y only through its effect on X.
  • First stage: Instrument for X with Z
  • Second stage: Estimate the effect of the (instrumented) X on Y
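library(AER)       # provides ivreg()
library(sandwich)  # provides the sandwich() robust covariance estimator
fit <- ivreg(Y ~ X | Z, data = df)
summary(fit, vcov = sandwich, df = Inf, diagnostics = TRUE)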
  • We’ll likely have a strong first stage as long as the experiment we chose was “successful” at driving referrals. (This is important because if Z is not a strong predictor of X, the resulting second stage estimate will be biased.) The R code above reports the F-statistic, so we can check the strength of our first stage directly. A good rule of thumb is that the F-statistic from the first stage should be at least 11.
  • What about the exclusion restriction? It’s important that the instrument, Z, affect the outcome, Y, only through its effect on the endogenous regressor, X. Suppose we are instrumenting with an AB email test pushing referrals. If the control group received no email, this assumption isn’t valid: the act of getting an email could in and of itself drive retention. But if the control group received an otherwise-similar email, just without any mention of the referral program, then the exclusion restriction likely holds.
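To make the two stages concrete, here is a base-R sketch that runs them by hand on simulated data. The randomized encouragement Z, the unobserved confounder U, and the true effect of 1 are assumptions. The standard errors from a manual second stage are not valid, so `ivreg` remains the right tool for inference.

```r
# Simulated sketch: two-stage least squares by hand.
set.seed(2718)
n <- 20000
Z <- rbinom(n, 1, 0.5)           # randomized encouragement (instrument)
U <- rnorm(n)                    # unobserved confounder of X and Y
X <- 0.5 * Z + U + rnorm(n)      # Z shifts X: the first stage
Y <- 1 * X + 2 * U + rnorm(n)    # assumed true effect of X on Y is 1

naive_est <- unname(coef(lm(Y ~ X))["X"])   # biased by U

# First stage: instrument for X with Z, and check instrument strength
first_stage <- lm(X ~ Z)
first_F <- summary(first_stage)$fstatistic[["value"]]

# Second stage: regress Y on the fitted (instrumented) X.
# Point estimate only; these standard errors are not valid.
X_hat <- fitted(first_stage)
iv_est <- unname(coef(lm(Y ~ X_hat))["X_hat"])

# With one instrument, 2SLS equals the ratio of covariances
wald_est <- cov(Y, Z) / cov(X, Z)

print(c(naive = naive_est, iv = iv_est, wald = wald_est, first_F = first_F))
```

In this setup the naive regression should overstate the effect while the instrumented estimate should sit near the assumed value. The simulation satisfies the exclusion restriction by construction, since Z enters Y only through X; real data offers no such guarantee.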


