Cluster-randomized experiments at Glovo

Oscar Clavijo
The Glovo Tech Blog
Jan 20, 2022

Article written with @eduardotoce, Data Scientist at Glovo.

Abstract

We discuss the use of cluster-randomized experiments at Glovo, an experimental design useful for evaluating impact when randomization at the individual level is infeasible or when there are concerns about interference under individual-level randomization. We also discuss the analysis of these experiments, and specifically how their special treatment allocation strategy has to be taken into account when computing standard errors. We conclude that disregarding this structure in the analysis stage can lead to major overestimation of the precision of point estimators, and hence to an inflated type I error rate.

Introduction

One of the guiding principles of product development at Glovo is to start measuring the impact of product changes from the day the MVP is released. New features are designed with an eye toward positively impacting some KPIs and delivering a better experience for at least one of the three parties involved in our business: customers, partners, and couriers. The gold standard for impact evaluation is a randomized controlled trial, usually referred to as an A/B test in tech environments.

For example, suppose we have developed a new customer reactivation strategy that assigns a promo code to customers who have not used our app for at least four weeks (i.e., who have churned). We want to test the impact of the promo codes on customer reactivation, where a churned customer is said to be reactivated if they make at least one order using the promo code.

To evaluate this promo code reactivation strategy, we can run an A/B test that randomizes exposure to the promo code for churned customers. In this hypothetical experiment, both the exposure to treatment vs. control and the definition of the outcome metric are at the customer level. However, for many experiments of interest at Glovo, it is just not possible or practical to assign treatment at the same level of granularity at which we want to define the main outcomes of the experiment. In other words, the unit of randomization or diversion in our experiment does not match the unit of analysis.

Concretely, this situation arises when experimenting with city configuration parameters. These parameters include the times at which a city opens or closes; the saturation level (the ratio of active orders to available couriers) at which a city is blocked; the maximum travel distance allowed per courier vehicle type; and the available zones, among others. The treatment (changing one or more of these parameters) is allocated at the city level, so that all orders within a given city are assigned to treatment or control. However, the main outcomes we track in these experiments are defined at the order level: the delivery time and a binary indicator of whether the order was canceled.

This type of experiment, in which treatment is assigned at an aggregate level, is called a cluster-randomized experiment. In the example above regarding city configuration parameters, the clusters are the cities. See Fig. 1 for an illustration of the difference between a fully-randomized and a cluster-randomized design.

Fig. 1. Comparison of the randomization design in a fully-randomized vs. a cluster-randomized experiment. Units that belong to the same cluster (colored squares) are always assigned to the same group in the cluster-randomized case.

Cluster-randomized experiments

In the example of city-level parameters, treatment is defined solely at the city level. This is why we cannot allocate treatment at the same level of granularity at which we define the outcomes, namely the order level. There are, however, many other scenarios in which we cannot allocate treatment at the same level at which we will analyze the outcomes of the experiment. The following paragraphs describe two such scenarios.

Cluster-randomized experiments also come into play when analyzing the behavior of customers over time, where treatment is applied at the customer level but the outcome is measured at the order level. Consider, for example, customers contacting Glovo to ask where their order is, which happens frequently when an order is late. Suppose we have designed a new status page that continuously reassures customers that their order is being delivered, reducing their likelihood of contacting a support agent and thereby (in aggregate) lowering Glovo’s operational costs. This new page is meant to influence the overall behavior of customers, and its effect may take some time to build up. An experiment in which different orders placed by the same customer are randomly assigned to the old or the new page won’t do, since customers would not be consistently exposed to the same page. We therefore have to assign treatment or control at the customer level, even though the outcome, customer contacts per order, is defined at the order level. This is another example of a cluster-randomized experiment, in which the clusters are simply the customers.

Cluster-randomized experiments are also especially effective for mitigating network effects, such as those related to fraudulent behavior. Imagine we have developed a new feature that deters customers from engaging in promo code abuse, and we know that there exist networks of customers within the same city that jointly take part in it. If we randomly assign customers within a city to treatment or control, customers exposed to the new feature may communicate with control-group customers who belong to the same network of fraudsters. If the feature is effective at reducing fraudulent behavior, treated customers might warn fellow fraudsters in the control group that Glovo has tightened its control over promo code use. This in turn would bias our estimates of the average treatment effect (ATE) towards zero. Such undesirable interference can be at least mitigated by randomizing at a more aggregate level, such as the city.

Analyzing cluster-randomized experiments

The estimator of the ATE that is typically used in traditional (non-cluster) experiments, the difference in means, is still a reasonable estimator with good theoretical properties for clustered experiments. However, the usual estimator of the standard error of the difference in means, given by

$$\widehat{SE} = \sqrt{\frac{S_t^2}{N_t} + \frac{S_c^2}{N_c}},$$

where $S_t$ and $S_c$ are the sample standard deviations of the outcomes in treatment and control, and $N_t$ and $N_c$ are the numbers of outcomes observed in treatment and control respectively, will be too optimistic. In fact, in experiments with large clusters and/or large intra-cluster correlations, this estimator, which we henceforth refer to as naive, will heavily underestimate the variability of the difference-in-means estimator.
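As a point of reference, here is a minimal sketch of the naive estimator in Python (the function name and inputs are ours, for illustration only):

```python
import numpy as np

def naive_se(y_t: np.ndarray, y_c: np.ndarray) -> float:
    """Naive standard error of the difference in means.

    Treats every outcome as uncorrelated with every other one,
    i.e., it ignores any cluster structure in the data.
    """
    return np.sqrt(y_t.var(ddof=1) / len(y_t) + y_c.var(ddof=1) / len(y_c))
```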

As we discuss further in the next section, the reason naive standard errors are inadequate for cluster-randomized experiments is that differences in outcomes caused by differences between the clusters (or, equivalently, correlations among outcomes within a cluster) cannot be explained by changes in the treatment, since there is no variation of the treatment within a cluster. This extra variance shows up in the difference-in-means estimator, and it has to be taken into account when estimating standard errors.

Clustered standard errors

A key step in the estimation of the standard error of the difference in means is the estimation of the conditional variance-covariance matrix of the outcomes given the treatment, which we will denote with Ω. The naive estimator of the standard error of the difference-in-means that we reviewed in the last section is built on the assumption that outcomes are uncorrelated conditional on treatment, and thus that Ω is a diagonal matrix.
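In symbols, with $N$ outcomes and conditional variances $\sigma_i^2$, the naive estimator implicitly takes

$$\Omega_{\text{naive}} = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_N^2),$$

that is, all off-diagonal covariances are assumed to be zero.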

However, in cluster-randomized experiments, if observations coming from the same cluster are correlated (and hence Ω is actually non-diagonal), the variance of the difference-in-means estimator can be heavily inflated compared to what it would be if there were no intra-cluster correlations (and Ω were diagonal). Our estimators of the standard error of the difference in means have to take this into account. Otherwise, we run the risk of heavily overestimating the amount of information we are getting from our experiment, and thus of having a type I error rate much higher than the nominal one.
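To get a feel for the size of this inflation, consider the textbook simplification of equal cluster sizes $m$ and a common intra-cluster correlation $\rho$ (assumptions we adopt here purely for illustration). The variance of the difference in means is then inflated by the classical design effect:

$$\mathrm{Var}_{\text{cluster}} = \mathrm{Var}_{\text{iid}} \cdot \bigl(1 + (m - 1)\,\rho\bigr).$$

Even a modest $\rho = 0.05$ in clusters of 200 orders inflates the variance by a factor of roughly 11, so the naive standard error would be too small by a factor of more than 3.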

Clustered standard errors are estimators of the true standard error that address these issues by assuming a special structure for Ω. Specifically, they assume that Ω is a block-diagonal matrix, with blocks corresponding to the different clusters. Thus, the only assumption is that outcomes from different clusters are uncorrelated; the intra-cluster correlations themselves are estimated from the data. Implementations of clustered standard errors can be found in all major statistical software packages. We usually use the one provided in statsmodels.
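As a minimal, self-contained sketch (with synthetic data standing in for real orders; all variable names are ours), fitting the difference in means via OLS in statsmodels and requesting a clustered covariance looks like this:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic order-level data: 100 cities (clusters) with a city-level
# random effect, which induces intra-cluster correlation in the outcome.
n_cities, orders_per_city = 100, 200
city_id = np.repeat(np.arange(n_cities), orders_per_city)
treated_cities = rng.permutation(n_cities) < n_cities // 2
treatment = np.repeat(treated_cities.astype(int), orders_per_city)
city_effect = np.repeat(rng.normal(0, 5, n_cities), orders_per_city)
cdt = 30 + city_effect + rng.normal(0, 10, size=city_id.size)  # no true effect

df = pd.DataFrame({"cdt": cdt, "treatment": treatment, "city_id": city_id})

# The coefficient on `treatment` is exactly the difference-in-means estimator.
model = smf.ols("cdt ~ treatment", data=df)
naive = model.fit()  # assumes a diagonal Omega
clustered = model.fit(cov_type="cluster",
                      cov_kwds={"groups": df["city_id"]})  # block-diagonal Omega

print("naive SE:    ", naive.bse["treatment"])
print("clustered SE:", clustered.bse["treatment"])
```

On data like these, the clustered standard error comes out several times larger than the naive one.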

Moreover, clustered standard errors have desirable properties for cluster-randomized experiments, avoiding the pitfalls of the naive estimator that we discussed earlier. We illustrate this in the following section. The interested reader can find more technical details on the clustering of standard errors in [1].

Simulating an experiment

We ran a simulation study comparing the results of two possible analyses of a hypothetical cluster-randomized experiment. The first analysis used clustered standard errors to build a t-statistic and test the hypothesis of no average effect at a 5% nominal level; the second used the usual naive standard error to build the t-statistic and test the same null hypothesis. The simulation illustrates how non-zero intra-cluster correlations can lead to serious underestimation of the standard errors by the naive approach, and therefore to major increases in the false positive rate of the corresponding hypothesis tests.
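Concretely, both analyses build the same statistic and decision rule, differing only in which standard error estimate they plug in (we use the normal approximation for the critical value):

$$t = \frac{\bar{Y}_t - \bar{Y}_c}{\widehat{SE}}, \qquad \text{reject } H_0 \text{ if } |t| > z_{0.975} \approx 1.96.$$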

Our simulation builds on the problem of setting city-blocking saturation thresholds described in the introduction, and it uses real data. Using one month of orders made in 100 small and medium cities where Glovo is active, we repeated the following 10,000 times:

  • Half of the cities were randomly assigned to the treatment group, and the other half to the control group.
  • For cities in the treatment group, all orders that were made in them during the month in question had their customer delivery time (CDT) modified by adding to it a zero-mean normal random variable with a small variance. For cities in the control group, we did nothing.
  • We computed the difference-in-means estimator and two types of standard errors: the clustered ones and the usual naive one. We computed the clustered standard errors with statsmodels, fitting an OLS regression with a clustered covariance type (cov_type="cluster") and clusters given by the cities; see the sketch after this list. Using the two types of standard errors, we conducted two t-tests with a 5% nominal level and stored the results.
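A minimal sketch of one replication of this loop, assuming an orders DataFrame with cdt and city_id columns (names are illustrative, not our production code), might look like:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def one_replication(orders: pd.DataFrame, rng: np.random.Generator,
                    noise_sd: float = 1.0) -> tuple[float, float]:
    """Randomize cities 50/50, perturb treated orders' CDT with zero-mean
    noise (so the null of no average effect holds by construction), and
    return the p-values of the naive and clustered t-tests."""
    cities = orders["city_id"].unique()
    treated = set(rng.permutation(cities)[: len(cities) // 2])
    df = orders.copy()
    df["treatment"] = df["city_id"].isin(treated).astype(int)
    df["cdt"] += df["treatment"] * rng.normal(0, noise_sd, len(df))
    model = smf.ols("cdt ~ treatment", data=df)
    p_naive = model.fit().pvalues["treatment"]
    p_clustered = model.fit(cov_type="cluster",
                            cov_kwds={"groups": df["city_id"]}).pvalues["treatment"]
    return p_naive, p_clustered

# Repeating this 10,000 times and computing the share of p-values below
# 0.05 gives the empirical type I error rate of each test.
```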

This simulation is an illustration of a hypothetical experiment in which:

  • Half of the cities have their threshold changed according to some strategy or model.
  • The null hypothesis of no average treatment effect holds.

Side comment. This is not really how we ran saturation threshold experiments. We actually ran cluster switchback experiments, but that is beyond the scope of this post. Maybe next time.

Simulation results

We now present the results of the simulation. Recall that, because of how we generated the data, the null hypothesis of no average treatment effect holds. Thus, we would expect to reject the null hypothesis in approximately 5% of the 10,000 replications. Table 1 and Fig. 2 present the results.

Fig. 2. Histogram of t-statistics obtained with each method, with a standard normal density function overlaid in red.

Since the null hypothesis holds, the distribution of both t-statistics should be approximately standard normal. We see in Fig. 2 that the distribution of the t-statistics that use clustered standard errors is well approximated by a standard normal. However, even though the distribution of the t-statistics that use naive standard errors is bell-shaped, it is much more dispersed than would be expected were they N(0,1) distributed. This shows that the naive standard errors systematically underestimate the variance of the difference-in-means estimator.

Table 1. Type I error rates of the t-tests associated with each method to compute standard errors.

We see from Table 1 that the type I error rate of the t-test with naive standard errors is completely off, while that of the t-test with clustered standard errors matches the nominal level (5%). With the naive standard errors, we lose all control over the type I error rate of our procedure.

Conclusions

Cluster-randomized experiments are a powerful tool for impact evaluation in the following cases:

  1. when randomization at the individual level is infeasible,
  2. when there are concerns about interference under individual-level randomization, or
  3. when treatment is only defined at a very aggregate level.

The special treatment allocation strategy used in cluster-randomized experiments has to be taken into account when computing standard errors. We have discussed the use of clustered standard errors to this end, and shown through a simple simulation that disregarding the special structure of cluster-randomized experiments in the analysis stage can lead to major overestimation of the precision of point estimators, and hence to an inflated type I error rate.

We should mention that there are subtleties and issues associated with cluster experiments. For one, cluster-randomized experiments are typically statistically inefficient and require a large number of clusters: clustered standard errors only work well when the number of clusters is large.

One interesting aspect of cluster experiments that we did not discuss is how to do power analysis and sample size calculations. Closed-form formulas for computing the sample sizes needed to reach a target level of power in cluster experiments are available, but they usually rely on simplifying assumptions about the intra-cluster correlation structure. We prefer to base our sample size calculations on simulations, but this is beyond the scope of this post.
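For reference, the simplest such closed-form rule (again assuming equal cluster sizes $m$ and a common intra-cluster correlation $\rho$) scales the sample size required by an individually-randomized experiment by the design effect introduced earlier:

$$N_{\text{cluster}} = N_{\text{iid}} \cdot \bigl(1 + (m - 1)\,\rho\bigr).$$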

References

[1] Abadie, A., Athey, S., Imbens, G. W., & Wooldridge, J. (2017). When should you adjust standard errors for clustering? (No. w24003). National Bureau of Economic Research.

[2] Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1), 13–22.

Acknowledgements

We’d like to thank our colleagues who provided helpful suggestions and comments on previous drafts of this post: Ezequiel Smucler, Andy Kreek, Victor Bouzas, Daniel Canueto, Maxim Khalilov, Daniel Garrido.
