Accelerating Online Experiments that Target Quantile Treatment Effects
At Udemy, we run hundreds of A/B-style experiments per year on our experimentation platform. One of the biggest challenges we’ve faced is the need to run experiments faster, enabling our product and engineering teams to rapidly ship new product innovations to our customers.
One way we’ve accelerated experimentation at Udemy is by building variance reduction methods — such as CUPED — into our experimentation platform. Doing so has had a powerful impact, helping us reduce the average time to run some product A/B tests by 30% or more. However, variance reduction methods like CUPED are designed to work for experiments that target mean or average treatment effects. They are not designed to work with quantile treatment effects, such as the median or 90th percentile impact of an experiment.
In this short blog, we show how we’ve also reduced the variance of quantile treatment effects at Udemy, while still keeping the analysis practical and simple enough to build into our automated experiment analysis system.
Variance Reduction for Average Treatment Effects
Control function methods, like CUPED, are a type of variance reduction method used to increase the precision of average treatment effect estimates. They control for differences between the experimental units (e.g., site visitors, accounts) that could not have been the result of the change the experiment introduces. Removing this irrelevant variation in the outcome metric reduces noise in estimating the treatment effect.
Usually, these methods control for features that existed prior to the experiment being launched, such as total minutes a user spent learning in the 30 days prior to entering the experiment. Clearly, this feature is unrelated to the experiment because it predates the unit’s exposure to the experiment. But it likely explains a good portion of the variation in post-treatment minutes spent learning because, even in the absence of the experiment, people who learned more last month are likely to learn more this month. At Udemy, we use these methods to reduce experiment duration for average treatment effects.
CUPED-like methods use something like this regression model, usually with additional pre-experiment features as well:

$$Y = \alpha + \tau Z + \beta X + \varepsilon$$

where Y is the post-experiment outcome metric, Z is the treatment indicator, X is a pre-experiment feature (such as prior-30-day learning minutes), and τ is the average treatment effect.
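As a concrete sketch of this adjustment in Python (the helper name and single covariate are our simplification; a production version would include several pre-experiment features):

```python
import numpy as np
import statsmodels.api as sm

def cuped_ate(y, z, x):
    """Estimate the ATE with a CUPED-like regression adjustment.

    y: post-exposure outcome metric per unit
    z: 0/1 treatment assignment
    x: pre-experiment covariate, e.g., prior-30-day learning minutes
    """
    design = sm.add_constant(np.column_stack([z, x]))
    fit = sm.OLS(y, design).fit(cov_type="HC1")  # robust standard errors
    return fit.params[1], fit.bse[1]  # coefficient on z and its std. error
```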
Why Quantile Regression Doesn’t Work for Quantile Treatment Effects
But in this blog, we want to know how an experiment affects a certain quantile of the distribution of the outcome metric. Perhaps we suspect the median effect is very different from the average effect because the distribution of the metric is skewed, or we are interested in reducing the 90th percentile of page load times, or in identifying the effect of an intervention on low-usage subscribers.
We might think that we could do a natural analog of the CUPED regression: instead of an ordinary least squares regression, why not do a quantile regression with similar control variables? Sadly, this doesn’t work.
Unlike expectations, quantiles aren’t linear: there is no law of iterated quantiles, so an unconditional quantile is not an average of conditional quantiles. This means that the treatment effect we would estimate with a quantile regression is some weighting of the conditional quantile treatment effects, and the population parameter is not the same as if we did not use control variables. We will also get different answers depending on the functional form we choose for the control variables, even in large samples, unlike in average treatment effect estimation with CUPED.
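A quick numeric illustration of the non-linearity (the distributions are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=100_000)       # arbitrary illustrative distribution
b = rng.exponential(size=100_000)  # arbitrary illustrative distribution

# Means are linear: these two numbers agree up to sampling noise.
print(np.mean(a + b), np.mean(a) + np.mean(b))

# Quantiles are not: the 90th percentile of the sum is noticeably
# different from the sum of the 90th percentiles.
print(np.quantile(a + b, 0.9), np.quantile(a, 0.9) + np.quantile(b, 0.9))
```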
Variance Reduction for Quantile Treatment Effects
So, what do we do?
We start with the following intuition: the quantile is the inverse of the cumulative distribution function and, evaluated at a particular value, the distribution function is just a mean of indicators: F(y) = P(Y ≤ y) = E[1(Y ≤ y)]. If we can reduce the variance of the estimated distribution function, then we can reduce the variance of the estimated quantile.
Suppose we wanted to answer a related question: for a given value of the outcome metric Y, say y, how does the probability that the metric is less than y change as a result of the experiment? A standard average treatment effect answers this question, so we can use a CUPED-like regression on the indicator to reduce the variance of the estimate:

$$\mathbf{1}(Y \le y) = \alpha + \mathrm{ATE}(y)\, Z + \beta X + \varepsilon$$
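In code, this is just the hypothetical cuped_ate sketch above applied to an indicator (with y, z, x as before and some threshold value of the metric):

```python
# Variance-reduced estimate of ATE(threshold): the experiment's effect on
# P(metric <= threshold), controlling for the pre-experiment feature x.
ate_at_y, se_at_y = cuped_ate((y <= threshold).astype(float), z, x)
```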
The quantile treatment effect is the difference in the given quantile p between the treatment group and the control group. Call the quantile for the treatment group y₁ and for the control group y₀. By the definition of the quantile, we know:

$$P(Y \le y_1 \mid \text{Treatment}) = p$$
Subtract P(Y ≤ y₁ | Control) from both sides of the above to get:

$$P(Y \le y_1 \mid \text{Treatment}) - P(Y \le y_1 \mid \text{Control}) = p - P(Y \le y_1 \mid \text{Control})$$
The left-hand side above is ATE(y₁), which we will estimate via the above regression to get a variance-reduced version of it. Estimating P(Y ≤ y₁ | Control) with the control group’s empirical CDF, we have an equation to solve for y₁:

$$\mathrm{ATE}(y_1) + P(Y \le y_1 \mid \text{Control}) = p$$
If we had no control variables, the solution to this equation would be identical to the sample quantile of Y in the treatment group. With control variables, the estimate is different and, intuitively, because the ATE(y₁) term is estimated with a smaller variance following standard CUPED logic, we can estimate y₁ more precisely.
We repeat the same procedure for the Control part of the data, flipping Treatment and Control above, to get a variance-reduced estimate of y₀. We compute the variance-reduced quantile treatment effect (VR-QTE) as y₁ − y₀.
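To make the procedure concrete, here is a minimal sketch in Python. The function names, the plain OLS fit, and the use of brentq for the root search are illustrative choices on our part, not a description of our production implementation:

```python
import numpy as np
import statsmodels.api as sm
from scipy.optimize import brentq

def vr_quantile(y, z, x, p, arm=1):
    """Variance-reduced estimate of the p-th quantile of Y in the given arm.

    Solves ATE(q) + P(Y <= q | other arm) = p for q, where ATE(q) is the
    coefficient on the arm indicator in a CUPED-like regression of 1(Y <= q)
    on that indicator and the pre-experiment covariate x, and the second
    term is the other arm's empirical CDF.
    """
    other = y[z != arm]

    def equation(q):
        indicator = (y <= q).astype(float)
        design = sm.add_constant(
            np.column_stack([(z == arm).astype(float), x]))
        ate_q = sm.OLS(indicator, design).fit().params[1]
        return ate_q + np.mean(other <= q) - p

    # The sample objective is a step function that changes sign between
    # min(Y) and max(Y); brentq locates that sign change.
    return brentq(equation, y.min(), y.max())

def vr_qte(y, z, x, p):
    """Variance-reduced quantile treatment effect y1 - y0 at quantile p."""
    y1 = vr_quantile(y, z, x, p, arm=1)  # treatment-arm quantile
    y0 = vr_quantile(y, z, x, p, arm=0)  # control-arm quantile, roles flipped
    return y1 - y0
```

As a sanity check, dropping x from the regression collapses the solution to the ordinary sample quantile of the arm, matching the observation above.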
Confidence Intervals
We use an asymptotic approach to construct confidence intervals because it is faster to compute than bootstrap or subsampling methods. The key idea is to represent the problem as a “Generalized Method of Moments” (GMM) estimator. Writing τ₁ = ATE(y₁), the moment conditions corresponding to y₁ stack the regression’s orthogonality conditions with the equation that defines y₁:

$$E\left[\left(\mathbf{1}(Y \le y_1) - \alpha - \tau_1 Z - \beta X\right)(1, Z, X)'\right] = 0$$

$$E\left[\mathbf{1}(Z = 0)\left(\mathbf{1}(Y \le y_1) - (p - \tau_1)\right)\right] = 0$$

The second condition says P(Y ≤ y₁ | Control) = p − τ₁, i.e., the equation we solved above.
And similar moment conditions for y₀. It is important to stack the equations for y₁ and y₀ into one system because the two estimates are correlated.
We can write the full system in a generic form, with θ collecting the parameters for both y₁ and y₀:

$$E[m(W; \theta)] = 0, \quad W = (Y, Z, X)$$

and the estimator θ̂ solves the sample analog of this system.
Then, apply standard GMM formulas to get the asymptotic variance — more or less.
The one wrinkle is that in the standard formulation of the asymptotic variance formula for GMM, the expectation of the derivative of the moment expression shows up, i.e., G = E[Dm]:

$$\sqrt{n}\left(\hat{\theta} - \theta\right) \xrightarrow{d} N\left(0,\; G^{-1}\, \Omega \left(G^{-1}\right)'\right), \quad \Omega = E\left[m(\theta)\, m(\theta)'\right]$$
But we have expressions like 1(Y ≤ y₁) in the moment condition function m which are not differentiable.
It turns out that we can switch the order of differentiation and expectation in this case (under some regularity conditions; see Andrews, Chapter 37 of the Handbook of Econometrics, for details). So we compute the derivative of the expectation instead of the expectation of the derivative and use that for the “G” in the expression above, i.e., G = DE[m(θ)].
To compute this value, we need estimates of the density of Y and of the conditional CDF, both evaluated at y₁ and y₀. There are several ways to do this. We use a kernel density estimate with a rule-of-thumb bandwidth for the density estimates (for speed, and because we are not in a low number-of-observations environment) and logistic regression for the conditional CDF evaluated at the particular quantiles (again, for speed and convenience). At least in our experience, reasonable methods here tend to produce similar results.
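As one hedged sketch of this step (the helper name, SciPy’s rule-of-thumb kernel bandwidth, and scikit-learn defaults are all our assumptions; any reasonable density and conditional-CDF estimators would do):

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.linear_model import LogisticRegression

def plugin_pieces(y, z, x, q, arm):
    """Plug-in inputs for the G matrix at a candidate quantile q.

    Returns the density of Y at q within the arm (Gaussian KDE with
    Scott's rule-of-thumb bandwidth) and P(Y <= q | X) within the arm,
    fitted by logistic regression and evaluated at every unit's x.
    Assumes both Y <= q and Y > q occur within the arm.
    """
    mask = z == arm
    density_at_q = gaussian_kde(y[mask])(q)[0]
    clf = LogisticRegression().fit(x[mask].reshape(-1, 1),
                                   (y[mask] <= q).astype(int))
    cond_cdf_at_q = clf.predict_proba(x.reshape(-1, 1))[:, 1]
    return density_at_q, cond_cdf_at_q
```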
We can then construct standard confidence intervals from the standard error estimate and the asymptotic normality of the GMM estimator.
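Given the estimated covariance of the stacked system, the interval for the QTE is a standard contrast between the y₁ and y₀ components of θ. A small sketch (the layout of theta_hat and the name V_hat are assumptions of the sketch):

```python
import numpy as np
from scipy.stats import norm

def qte_confidence_interval(theta_hat, V_hat, idx_y1, idx_y0, n, alpha=0.05):
    """CI for y1 - y0 from the stacked GMM estimate.

    theta_hat: stacked parameter vector containing y1 and y0.
    V_hat: estimated asymptotic covariance of sqrt(n) * (theta_hat - theta).
    """
    a = np.zeros(len(theta_hat))
    a[idx_y1], a[idx_y0] = 1.0, -1.0  # contrast selecting y1 - y0
    qte = a @ theta_hat
    se = np.sqrt(a @ V_hat @ a / n)   # delta-method standard error
    half_width = norm.ppf(1 - alpha / 2) * se
    return qte - half_width, qte + half_width
```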
Simulation
We can see that the method reduces variance without introducing bias in practice by running a simple simulation.
In the data generating process, (Y, X, Z) are observed and E is unobserved heterogeneity in the treatment effect: Y is the outcome metric, Z is the treatment indicator, and X is a control variable that is independent of Z. The number of observations is 10,000.
We want to estimate the 25th quantile effect. The true value of the treatment effect at this quantile is about 1.85.
We consider three estimators: the simple quantile difference, the variance-reduced quantile estimator, and a quantile regression of Y on Z and X (to demonstrate that this estimator does not estimate the same quantile effect). We generate the data and re-estimate 1,000 times.
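For reference, here is a hedged sketch of one draw of such a simulation. The distributional choices and constants below are our own illustration consistent with the description above (under them, the true 25th-quantile effect is about 1.74 rather than the 1.85 of our actual setup), and vr_qte is the hypothetical helper sketched earlier:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p, tau = 10_000, 0.25, 2.0  # illustrative constants

# X is independent of the treatment Z; E is unobserved heterogeneity
# in the treatment effect; U is independent outcome noise.
x = rng.normal(size=n)
u = rng.normal(scale=0.5, size=n)
z = rng.integers(0, 2, size=n).astype(float)
e = rng.normal(size=n)
y = x + u + z * (tau + e)  # true 25th-quantile effect ~ 1.74 here

# Estimator 1: simple difference in sample quantiles.
simple_qte = np.quantile(y[z == 1], p) - np.quantile(y[z == 0], p)

# Estimator 2: variance-reduced QTE (hypothetical vr_qte sketch above).
vr_estimate = vr_qte(y, z, x, p)

# Estimator 3: quantile regression of Y on Z and X; it targets a
# different population parameter, so it is biased for the QTE.
qr_fit = sm.QuantReg(y, sm.add_constant(np.column_stack([z, x]))).fit(q=p)
qr_qte = qr_fit.params[1]

print(simple_qte, vr_estimate, qr_qte)
```

Wrapping this in a loop over fresh draws and comparing the mean and variance of each estimator across repetitions reproduces the style of comparison reported below.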
The simulation generated three main results:
- Variance-reduced estimator has no additional bias: the percentage difference in average estimate between the variance-reduced quantile estimator and the simple quantile difference is: 0.04%.
- Variance-reduced estimator has smaller variance: the percentage difference in variance across simulations between the variance-reduced quantile estimator and the simple quantile difference is: -21.60%!
- Quantile regression estimator is biased: the difference in the average estimate between the quantile regression estimator and the simple quantile difference is: -9.30%, i.e., quantile regression is not a viable variance reduction strategy here.
Conclusion
In this post, we’ve shown how we apply CUPED-style variance reduction to quantile treatment effects, such as the median impact. In our simulations, we found that the variance reduction from using these methods can be sizable without introducing bias. Because the number of units needed for an experiment is roughly inversely proportional to variance, a 20% reduction would knock about a week off an otherwise four-week-long experiment. Over a year, instead of 13 four-week-long experiments on a specific product feature, we could run 17! One of the benefits of maintaining our own experimentation platform at Udemy is the ability to implement new methods like this one and scale them out quickly to our product and engineering teams — helping accelerate the pace of learning and innovation at Udemy.