Variance reduction in experiments using covariate adjustment techniques

David Masip Bonet
The Glovo Tech Blog
13 min read · Jan 10, 2023

This article was written by Victor Bouzas and David Masip, Data Scientists at Glovo.

Abstract

In this article, we discuss different covariate adjustment techniques that are routinely used to reduce variance in controlled experiments. We compare four different techniques — Multivariate Regression, CUPED (Controlled-experiment Using Pre-Experiment Data), CUPAC (Control Using Predictors as Covariates), and Doubly Robust — in theory, as well as in practice, through an experiment simulation. We highlight some important pitfalls to avoid when selecting a covariate to adjust for and show how these techniques can help companies, like Glovo, iterate faster. Using Doubly Robust techniques with pre-experimental data is a safe and efficient path that allows for the reduction of the variance of the treatment effect estimate in an unbiased manner in different scenarios.

Introduction

Having the ability to anticipate the effects of one's decisions on company metrics is essential for any decision-maker. With that purpose, companies run thousands of controlled experiments per year (commonly known as AB tests). In order to draw conclusions from an experiment, the measured change in the target metric must be significantly larger than the variation the metric exhibits during normal company operations. The challenge is that, over time, as your metric gets optimized, large improvements become harder and harder to achieve, so you need to be able to detect smaller impacts.

A simple solution for detecting smaller changes is to increase the sample size, for example by exposing more users to the experiment. This can be risky, since you could be exposing them to suboptimal versions of your system. In addition, one of the most common ways of increasing the sample size is running longer experiments, and if your experiments take longer to run, there are fewer tests you can run in a given time frame. An alternative solution is to reduce the variance of your treatment effect estimate through covariate adjustment methods.

In this post, we will explore the differences in the most common covariate adjustment methods, including a comparison simulation where some are better than others. We will also give examples of how we use variance reduction in Glovo.

Covariate Adjustment Techniques

There are many techniques to reduce the variance of your treatment effect estimate in an experiment using covariate adjustment. In this section, we will go over the theory behind some of those methods and discuss the limitations of each one.

The first method covered is the simplest one: multivariate regression. The second is CUPED, which is, asymptotically, a particular case of regression where the covariate used is the pre-experiment mean of the outcome of interest. The third is CUPAC, also a particular case of multivariate regression and a generalization of CUPED, where a machine learning model is used instead of a simple pre-experimental average.

Finally, we show how all these methods belong to the same family of estimators. Within this family, it can be seen that the most efficient estimator is the Doubly Robust one, which is the final estimation method we study.

Recall that the usual, unadjusted, estimator of the treatment effect is the difference in means estimator, defined as the difference in the means of the outcome Y for treatment (T=1) and control (T=0) arms:
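$$\hat{\tau}_{\mathrm{DM}} = \frac{1}{n_1}\sum_{i:\,T_i = 1} Y_i \,-\, \frac{1}{n_0}\sum_{i:\,T_i = 0} Y_i$$

where n1 and n0 are the number of units in the treatment and control arms, respectively.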

Multivariate Regression

The first and simplest option that we have to include covariates in the treatment effect estimation is through multivariate regression.

We obtain an estimate of the average treatment effect (ATE) by regressing the outcome (Y) on an intercept, the treatment (T), and the baseline covariates (X):
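$$Y = \alpha + \tau T + \beta^{\top} X + \varepsilon$$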

The coefficient on the treatment indicator is asymptotically unbiased for the ATE as long as the covariates are independent of the treatment, even if the regression model is misspecified (i.e., even if the relationship between the outcome and the covariates is not linear). The idea behind adding covariates is "explaining away" variance that is not caused by the treatment. It can be shown that, as long as the treatment and control groups are of approximately the same size, the variance of the ATE estimate obtained by the OLS regression described above is, asymptotically, smaller than or equal to that of the usual difference in means estimator. The size of the variance reduction will depend on how much of the variance in the target metric is explained by the covariates included in the regression. Regression adjustments to reduce variance in controlled experiments go back to Fisher's work in the 1930s.

A simple modification of the above method produces ATE estimators that are guaranteed to have a variance smaller than or equal to that of the difference in means estimator, regardless of the treatment assignment probabilities. It is obtained by running the following regression:
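$$Y = \alpha + \tau T + \beta^{\top}(X - \bar{X}) + \gamma^{\top}\, T \,(X - \bar{X}) + \varepsilon$$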

and using the coefficient on the treatment indicator as an estimate of the ATE (here X̄ denotes the sample mean of the covariates, so they enter the regression centered). This method is also guaranteed to have a variance smaller than or equal to that of the first OLS regression estimator we described.

A typical choice of a covariate is using the pre-experiment mean of the outcome at some granularity level. Using a covariate built with pre-experimental data ensures that it is independent of the treatment. For instance, if an experiment tries to measure the conversion rate of users, X can be the user-level average conversion rate in a pre-experiment period.

We will see how CUPED and CUPAC are, in a sense, particular cases of multivariate regression for covariates built using pre-experimental data. One of the main risks of using multivariate regression, or any regression adjustment technique, is that we may use covariates that add bias to our average treatment effect estimator. The covariates that we use should not be affected by the treatment; if they are, we risk adding bias to our estimator.

CUPED

Another technique commonly used in the industry is CUPED (Controlled-experiment Using Pre-Experiment Data), first introduced in 2013. It uses pre-experiment data from the outcome as a covariate to construct an unbiased and adjusted outcome metric.

CUPED is based on the fact that, if the difference in means of the outcome is an unbiased estimator of the ATE, then the difference in means of the following adjusted outcome:
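$$Y^{\mathrm{cuped}} = Y - \theta\,(X - \bar{X})$$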

is also an unbiased estimator of the ATE, for any value of θ, as long as X is a covariate that is independent of the treatment. Among the estimators in this family, the one with the least variance is obtained by taking θ as the ratio of the sample covariance and the sample variance:
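$$\theta = \frac{\widehat{\mathrm{Cov}}(Y, X)}{\widehat{\mathrm{Var}}(X)}$$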

This is the slope of a linear regression of Y on X. It follows that the CUPED estimator that uses the difference in means of the adjusted outcome is asymptotically equivalent to estimating the ATE using multivariate regression, as described before.

Alternatively, CUPED has a variation in which you use two θ's, one per arm (one for control, one for treatment); this is equivalent to the regression with an interaction term that we defined earlier. For a deeper discussion and a proof of the equivalence, you can check Tsiatis (2018) or the blog of one of the authors of the CUPED method. A limitation of this method is that it uses only the pre-experiment data of the outcome as the covariate, without including any additional variables. On one hand, this protects against bias in your estimate; on the other, it limits the variance reduction, since it does not allow you to introduce a good covariate that could reduce the variance further without introducing bias.

One advantage of using CUPED instead of the other methods is that it is very simple; it doesn’t need to fit a regression model or a more complex machine learning model. It can be done without using any external library, and can even be implemented easily in SQL.
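For illustration, here is a minimal sketch of the adjustment in Python (numpy is used for brevity; the array names are hypothetical):

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED-adjusted outcome: y - theta * (x - mean(x)),
    with theta = Cov(y, x) / Var(x)."""
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# ATE estimate: difference in means of the adjusted outcome, where
# x_pre is the pre-experiment covariate and t the assignment indicator.
# y_adj = cuped_adjust(y, x_pre)
# ate = y_adj[t == 1].mean() - y_adj[t == 0].mean()
```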

CUPAC

Another recent technique explored in the industry leverages the predictive capabilities of Machine Learning models to reduce the variance of experiments. CUPAC (Control Using Predictors as Covariates) uses the output of an ML model to reduce estimator variance in comparison to using only pre-experiment values of the outcome, as in the original CUPED approach. This combines the possibility of adding many covariates with an arbitrarily complex functional form. If the relationship between the outcome and the covariates is non-linear, CUPAC can have higher efficiency than multivariate regression or CUPED.

CUPAC, too, is a particular case of linear regression, where the covariate is built using an ML model trained on pre-experimental data. We can also see CUPED as a particular case of CUPAC where the prediction model is a very simple one: just the average of the outcome at some granular level.

CUPAC is the most complex method so far, but if we have good non-linear predictors of the outcome, it can be far better than the two previous methods in terms of efficiency in estimating the ATE. This can be achieved by fitting a boosted-trees algorithm like LightGBM to predict the outcome. The main risks associated with it are the longer fitting/training time that comes with using a machine learning model, and the possibility of adding bias to the estimate if the features used in the model are affected by the treatment. A sketch of the procedure follows.
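Here is a minimal sketch, assuming a feature matrix built from pre-experiment data; the variable names are hypothetical, and it reuses the cuped_adjust helper from the CUPED sketch above:

```python
from sklearn.ensemble import HistGradientBoostingRegressor

# Train on pre-experiment data only, so that the prediction
# cannot be affected by the treatment.
model = HistGradientBoostingRegressor()
model.fit(X_pre, y_pre)          # hypothetical pre-experiment features/outcome

# The model's prediction for each experiment unit is the covariate.
x_cupac = model.predict(X_exp)   # hypothetical features of the experiment units

# Adjust exactly as in CUPED, with the prediction playing the role of X.
y_adj = cuped_adjust(y_exp, x_cupac)
ate = y_adj[t == 1].mean() - y_adj[t == 0].mean()
```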

Doubly Robust estimation

In the experimental setup, the Doubly Robust estimator (see the original paper) is a generalization of CUPAC that uses two regression models to predict the outcome (one under control, one under treatment). The derivation of the doubly robust estimation method is similar to the CUPED one. Consider the following family of estimators of the ATE, indexed by the functions h1 and h0:
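$$\hat{\tau}_{h_1, h_0} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{T_i\,\bigl(Y_i - h_1(X_i)\bigr)}{p} + h_1(X_i)\right] - \frac{1}{n}\sum_{i=1}^{n}\left[\frac{(1 - T_i)\,\bigl(Y_i - h_0(X_i)\bigr)}{1 - p} + h_0(X_i)\right]$$

where p is the probability of being assigned to the treatment arm.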

All the methods described above can be approximately written in this form for some specific pair of functions h1 and h0. Moreover, it can be shown that all reasonable estimators of the ATE can be expressed with the formula above and that all estimators belonging to this family are asymptotically unbiased.

The doubly robust method is obtained by finding the pair of functions h1 and h0 that gives the lowest asymptotic variance. Minimizing the asymptotic variance with respect to h1 and h0 gives the following optimal choices:
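$$h_1(x) = \mathbb{E}[\,Y \mid X = x,\, T = 1\,], \qquad h_0(x) = \mathbb{E}[\,Y \mid X = x,\, T = 0\,]$$

that is, the conditional mean of the outcome in each arm.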

In practice, we don’t really know either of the expected values. The usual way to estimate each of these terms is the following:

  • Split the treated data into two folds. Train one model on fold 1 and another model on fold 2.
  • Use the model trained on fold 1 to estimate the expected value of the outcome for all the data in fold 2, and use the model trained on fold 2 to do the same for fold 1.

This is how h1(X) is built: if x belongs to fold 1, we use the model trained on fold 2 to estimate the expected value of Y, and vice versa.

The same algorithm is run on the non-treated data, so we end up training 4 ML models in total. Any model may be used here, as long as it predicts the outcome accurately. A sketch of the cross-fitting step is shown below.
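Below is one way the cross-fitting step could look in Python, assuming numpy arrays X, y, and a binary treatment indicator t (all names hypothetical):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

def cross_fit_outcome_model(X, y, t, arm, seed=0):
    """Cross-fitted estimate of E[Y | X, T=arm] for every unit: models are
    trained on one fold's units of the given arm and used to predict the
    units of the other fold, so no unit is predicted by a model that saw it."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, 2, size=len(y))  # two folds, as described above
    h = np.empty(len(y))
    for k in (0, 1):
        train = (fold == k) & (t == arm)
        model = HistGradientBoostingRegressor()
        model.fit(X[train], y[train])
        h[fold != k] = model.predict(X[fold != k])
    return h

# h1 = cross_fit_outcome_model(X, y, t, arm=1)  # trained on treated units
# h0 = cross_fit_outcome_model(X, y, t, arm=0)  # trained on control units
# p = t.mean()
# ate = np.mean(t * (y - h1) / p + h1) - np.mean((1 - t) * (y - h0) / (1 - p) + h0)
```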

We can see that the doubly robust estimator is more complex than all the previous methods (we need to fit two outcome models, one per arm), but if it is only efficiency that matters, this is the best estimator.

Summary of adjustment techniques

To wrap up:

  • CUPED is, asymptotically, just linear regression using a pre-experimental covariate.
  • CUPAC is a generalization of CUPED where the covariate we use is built using an ML model instead of just doing an average.
  • All reasonable estimators of the ATE (like CUPED, CUPAC, and linear regression) belong to a family of estimators parametrized by two functions, h1 and h0, and the most efficient estimator of this family is the doubly robust one.

Comparisons in practice

We performed some comparisons of these approaches under different scenarios using simulated data so that we could observe their limitations and capabilities. The notebooks to reproduce the results obtained here are available on GitHub.

We used HistGradientBoostingRegressor as the ML model for CUPAC and Doubly Robust, but in practice any supervised model can be used. The data-generating process has four covariates that influence the outcome, the first of them representing the outcome in past periods.

In all of these scenarios, there are 100,000 samples and the treatment (T) is randomized between the two arms with a fixed and homogeneous treatment effect of 0.1.

We simulated each scenario 1,000 times and got estimates of the ATE with the different techniques used. We report density plots for all estimators and each scenario.
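Schematically, a single simulation draw looks something like the sketch below; the covariate distribution and names are illustrative, and the exact data-generating processes are in the linked notebooks:

```python
import numpy as np

rng = np.random.default_rng(0)
N, TAU, N_SIMS = 100_000, 0.1, 1_000

def one_draw(outcome_fn):
    """One simulated experiment: four covariates, a randomized treatment,
    and an outcome built by a scenario-specific function of X and T."""
    X = rng.normal(size=(N, 4))     # X[:, 0] plays the role of the
                                    # pre-experiment outcome
    t = rng.integers(0, 2, size=N)  # randomized assignment
    y = outcome_fn(X, t)
    return X, t, y

# For each scenario: repeat N_SIMS times, estimate the ATE with every
# technique on each draw, and plot the densities of the estimates.
```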

Simple scenario

In this scenario, the outcome Y is a linear function of the covariates and some residual, such as below:
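$$Y = 0.1\,T + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$$

(the exact coefficients are in the linked notebooks).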

The DAG that encodes how these data are generated is simple: the treatment T and the covariates X1, …, X4 all point directly into the outcome Y, and nothing points into T, since it is randomized.

In the following figure, we can see that almost all the methods perform similarly with respect to bias: all provide unbiased estimates of the ATE. However, some of these estimates have more variance than others. Using no covariates performs worst in terms of variance, while CUPAC, Doubly Robust, and Multivariate Regression performed similarly, with a much lower variance. The variance reduction for CUPED was limited because it only uses one covariate: the outcome before the treatment.

Non-linear scenario

In this scenario, unlike the previous one, the last and most impactful covariate is modified to have a non-linear effect on the outcome:
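$$Y = 0.1\,T + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + f(X_4) + \varepsilon$$

where f is a non-linear function; the specific transformation is in the linked notebooks.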

Just by doing that, we can see in the plot below that CUPAC and Doubly Robust perform much better, because they are the only ones capable of capturing the non-linearity:

Wrong covariate scenario

In this last case, the data are generated with the same linear model from the first scenario.

However, instead of using the impactful covariates, we will use a randomly generated variable as a covariate with mean 0. So, the model we are fitting is:
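$$Y = \alpha + \tau T + \beta Z + \varepsilon$$

where Z is the randomly generated, mean-zero covariate.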

The coefficient of the treatment, τ, is the estimate of the ATE. As you can observe in the results and plot below, in this scenario all methods perform equally poorly, which highlights the importance of not only selecting a good technique but also paying attention to the covariates you use:

Applications in Glovo

Recommendation system

When testing a new recommendation algorithm in the marketplace, we may look at conversion rate, from session to order creation, as one of the main metrics to optimize for. If we want to measure the average treatment effect of the new recommendation algorithm, the most naive thing would be to run a regression using the treatment as a predictor and conversion as an outcome. However, we may add some of the following predictors to reduce the variance:

  • Customer historical conversion rate (the average outcome per customer, computed on pre-experimental data); this should be a great predictor of the outcome.
  • Hour of the day, day of the week: some hours of the day and days of the week have naturally higher conversion rates than others.
DAG for the effect of a new recommendation algorithm on the conversion rate

Order assignment

When testing a new order assignment configuration, we may look at its average delivery time as one of the most important metrics of the test. Here are some covariates we can add to reduce the variance in this metric:

  • Average delivery time of the city (using pre-experiment data)
  • Distance from the pickup point to the delivery point. We should not use this covariate if the treatment can change the pickup point, since it would then be affected by the treatment.
DAG for the effect of a new order assignment configuration on the delivery time

Conclusions

Covariate adjustment can be used to decrease the error of your AB test estimates, or to keep a similar error while reducing the length of your experiment. Covariate adjustment shines when we have good predictors of the outcome we are interested in measuring. A particular case of that is when we have the history of a user and can use the user's past behavior to model their future behavior.

We’ve seen that CUPED and CUPAC are both particular cases of adding covariates to the ATE estimation method. On a practical level, they outperform a simpler difference in means estimator when some of the variance of the outcome can be explained by the covariates. We’ve also seen examples where CUPAC outperforms CUPED, mainly because of a non-linear dependence between the outcome and the covariates. The doubly robust method obtained the best results among all the estimators in all the scenarios.

We've also seen that not all covariates are good: if the covariates used are not related to the outcome, no method does better than a simple difference in means estimator.

We hope you enjoyed the reading! If you have any doubts or questions, feel free to write a comment below. And if you are interested in experiment design, causal inference, or any of the other interesting challenges we face at Glovo, feel free to browse our opportunities and apply!

Acknowledgments

We’d like to thank our colleagues that provided helpful comments on previous drafts of this post: Ezequiel Smucler, Manuel Bolivar, Pablo Barbero, Alex Goldhoorn, and Maria Busmayer.
