An intro to Synthetic Controls in real life

Gabriel Ristow
The Glovo Tech Blog
10 min read · Sep 6, 2023

Abstract

This article aims to be a friendly introduction to synthetic controls, a well-known technique in the causal inference world. We give the theoretical foundations from scratch and a real use case from Glovo, a delivery company based in Barcelona, Spain. We introduce the idea of a donor pool and how to replicate a city using a loss function. Then we dive deeper into how to evaluate your estimates and how to assess the significance of the results. We also share some tips and learnings from our own mistakes.

The article is aimed at data scientists and data analysts, but also at a broader audience, as there are few conceptual prerequisites. You only need to be familiar with A/B testing, hypothesis testing, and some basic machine learning concepts such as overfitting, regularization, and loss functions.

Introduction

At Glovo, we are always developing new features and algorithms or improving the current ones. But how do we measure the impact of every change? The simplest way is to compare before and after. Look at the example below, where we implemented a new algorithm to rank couriers and expected it to increase the share of slots booked by performant couriers. Comparing the average metric before and after the intervention, we see a reduction of 2.5pp (from 40.3% to 37.8%). So, if we take this at face value, as an estimate of the treatment effect, can we conclude that the new algorithm is worse than the previous one?

This is, of course, an overly simplistic way of looking at the problem. It does not take into account external factors or trends that coexisted with the intervention. Maybe all neighboring cities were also decreasing, or maybe there was a seasonal factor. Let’s compare Constanta to its neighboring cities:

This already gives us some perspective. We see that there’s a downward trend in most of them, and Constanta (CTA) managed to remain stable and even increase in the last weeks.

In a perfect world, this decision would be made after running an experiment, creating control and test groups, and comparing those who got the treatment with the ones who didn’t. Unfortunately, sometimes, it’s just impossible or impractical to run an experiment.

The metric we’re looking at is the % of slots booked by performant couriers. We want this metric to be as high as possible because the best couriers provide the best service to our customers. Running an A/B test with courier-level randomization inside the city is impossible: we can’t simply split the city in half, as one group would impact the other. A switchback test is not possible either, because changing the algorithm implied changes in the app, which would have been operational overkill. Neither could we compare one city with another (matching or clustered randomization), because we wouldn’t have had enough cities to achieve statistical power. So we were quite constrained in terms of options for experimentation.

Method

What we actually want is to estimate what would have happened had we not implemented this new algorithm. The synthetic control method consists of building a fake version of Constanta (the “synthetic” city) in which there was no intervention; this is known as the counterfactual. This fake city is basically a weighted average of other cities that are similar to Constanta.

The weekly evolution of our target city can be described by a vector: Constanta = [ y₀, y₁, …, yₖ, yₖ₊₁, …, yₙ ], where each component of the vector is the average metric for that week. The subscripts index the weeks, k is the transition point when the intervention happened, and n is the most recent week observed.

We can define this evolution analogously for every other city that did not implement the new algorithm. For example: Blagoevrad = [ y₀, y₁, …, yₖ, yₖ₊₁, …, yₙ ]. We’ll call the set of vectors representing cities that did not implement the strategy our donor pool.

Now we want to construct a fictitious city that looks like Constanta during the period before we changed the ranking. To build it, we use only the weeks before the change was implemented, so you can think of it as assembling the control group. Take the first time period, t = 0: using every city in our donor pool {BLG, FCS, SBU, BRV}, we want to pick weights w that make the following approximation as accurate as possible:
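In the notation above, the approximation for t = 0 reads (my reconstruction of the formula the text describes):

y₀(CTA) ≈ w_BLG · y₀(BLG) + w_FCS · y₀(FCS) + w_SBU · y₀(SBU) + w_BRV · y₀(BRV)

where each w is the weight given to the corresponding donor city.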

Of course, if we only use a single period, we can always make this approximation exact. What we really want is to use the same w for the entire pre-intervention history of the donor cities. If we compile all the donor information into an (n cities × n periods) matrix, then multiplying the weight vector (of length n cities) by this matrix yields a vector of length n periods: the trajectory of the synthetic city.
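As a minimal sketch of this representation (the city codes and values below are made up for illustration), the synthetic trajectory is just a weighted combination of the donor rows:

```python
import numpy as np

# Donor matrix Y: one row per donor city, one column per week (illustrative values).
donor_cities = ["BLG", "FCS", "SBU", "BRV"]
Y = np.array([
    [0.41, 0.40, 0.39, 0.38],  # BLG
    [0.36, 0.37, 0.35, 0.34],  # FCS
    [0.44, 0.43, 0.42, 0.41],  # SBU
    [0.39, 0.40, 0.38, 0.37],  # BRV
])

# A weight vector of length n_cities ...
w = np.array([0.3, 0.2, 0.4, 0.1])

# ... multiplied by the donor matrix gives a vector of length n_periods:
# the weekly trajectory of the synthetic city.
synthetic = w @ Y
print(synthetic.shape)  # (4,) -> one value per week
```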

Now we want to find the set of weights that minimizes the deviations between the real Constanta and the synthetic Constanta. In other words, we want the weight vector that makes the difference between the % of slots booked by performant couriers in CTA and in synthetic CTA as close to zero as possible. So it all comes down to fitting an OLS (ordinary least squares) regression to find this combination of weights. More formally, we will use the weights w that minimize:
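In symbols (my reconstruction of the objective described above, using the same notation as before):

w* = argmin over w of Σₜ [ yₜ(1) − Σⱼ wⱼ · yₜ(j) ]²

where yₜ(1) is Constanta’s metric at week t and yₜ(j) is the metric of donor city j.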

Note that t ranges from 1 to k, which is the last period without intervention. We don’t want to use the post-intervention period to pick the weights w, but we do use those weights to construct the counterfactual line after the intervention. Also, in the sum over the donor pool, we start at j = 2 because j = 1 is our target city.

Applying synthetic controls to our example

Multiplying these weights by the donor matrix over the full history results in another vector, plotted below as the orange line.

Looking at the graph above, we get a very different perspective: the new algorithm seems to have had an important effect. Now let’s look at the difference between the two lines:

Our conclusion is that the new algorithm did help the city to increase the % slots booked by performant couriers!

In this example, our synthetic CTA was built from the following cities and weights:

The sum of the weights is equal to 1. This is not a requirement, but it helps to avoid extrapolation. If you leave the weights completely free, the OLS might find a very extreme combination that fits the pre-intervention data well but does not predict the future well. In the end, you can think of this constraint as a regularization strategy.
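As a minimal sketch of this fitting step (my own illustration, not the exact code we used), the sum-to-one constraint can be imposed with a constrained least-squares fit in scipy. Here I also keep the weights non-negative, a common extra constraint in the synthetic control literature, and `Y_donors_pre` / `y_cta_pre` are assumed to contain only the pre-intervention weeks:

```python
import numpy as np
from scipy.optimize import minimize

def fit_synthetic_weights(Y_donors_pre: np.ndarray, y_cta_pre: np.ndarray) -> np.ndarray:
    """Find donor weights that reproduce the target's pre-intervention trajectory.

    Y_donors_pre: (n_cities, n_pre_periods) donor metric matrix.
    y_cta_pre:    (n_pre_periods,) target city metric.
    """
    n_cities = Y_donors_pre.shape[0]

    # Squared error between the real and the synthetic trajectory.
    def loss(w):
        return np.sum((y_cta_pre - w @ Y_donors_pre) ** 2)

    # Weights sum to 1 and stay non-negative, to avoid extreme extrapolation.
    constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
    bounds = [(0.0, 1.0)] * n_cities
    w0 = np.full(n_cities, 1.0 / n_cities)

    result = minimize(loss, w0, method="SLSQP", bounds=bounds, constraints=constraints)
    return result.x

# Usage: weights are learned on pre-intervention weeks only,
# then applied to the full history to build the counterfactual.
# w = fit_synthetic_weights(Y_donors_pre, y_cta_pre)
# synthetic_cta = w @ Y_donors_full  # includes post-intervention weeks
```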

How to evaluate synthetic controls

The first evaluation you can do is to check the differences between the real and the synthetic city in the pre-intervention period. If the differences are large, you probably can’t trust the synthetic estimate after the intervention.

But there are more techniques you can use to ensure reliable results. One of them is to treat this as a simple prediction problem: train different models (with different donor cities, for example) and test them on unseen data. It comes down to splitting the data into train and test sets, as in the example below:

Using this approach, we tested 3 different versions and selected the best one to predict the counterfactual.

In the case above, we varied the number of donor-pool cities from 10 to 30. We see that by far the best MAE (mean absolute error) comes from limiting the number of cities. This is the simplest way of testing, with a fixed train and test set. If you want to be more rigorous, you could use time series cross-validation, following this structure.
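As a minimal sketch of the fixed train/test evaluation described above (illustrative names, reusing the `fit_synthetic_weights` helper sketched earlier):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def evaluate_donor_pool(Y_donors: np.ndarray, y_cta: np.ndarray, n_train: int) -> float:
    """MAE of the synthetic city on held-out pre-intervention weeks.

    Y_donors: (n_cities, n_pre_periods) donor matrix, pre-intervention only.
    y_cta:    (n_pre_periods,) target city, pre-intervention only.
    n_train:  number of initial weeks used to fit the weights.
    """
    w = fit_synthetic_weights(Y_donors[:, :n_train], y_cta[:n_train])
    y_pred_test = w @ Y_donors[:, n_train:]
    return mean_absolute_error(y_cta[n_train:], y_pred_test)

# Usage: compare candidate donor pools (e.g. 10 vs 30 cities) on unseen weeks
# and keep the one with the lowest MAE before predicting the counterfactual.
# mae_small = evaluate_donor_pool(Y_small_pool, y_cta_pre, n_train=40)
# mae_large = evaluate_donor_pool(Y_large_pool, y_cta_pre, n_train=40)
```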

Significance

If you’re expecting some p-values before believing in this technique, here they are. We can use Fisher’s Exact Test in these cases. This is especially important to confirm that the results are statistically significant and not just due to random luck.

Fisher’s Exact Test

The idea behind Fisher’s Exact Test is to exhaustively permute the treated and control units. Since we only have one treated unit, this means that, for each donor city, we pretend it is the treated unit while the remaining cities act as the control. So each line in the plot represents a city from the donor pool. Here, we would love to see the blue line close to 0 until the intervention, and then higher than the brown lines after the intervention. The resulting p-value is simply the number of cities with a higher effect than Constanta, divided by the total donor size (9.7% of cases, p-value = 0.097).
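A minimal sketch of this permutation (again reusing the `fit_synthetic_weights` helper and illustrative variable names; not our production code):

```python
import numpy as np

def placebo_p_value(Y_all: np.ndarray, treated_idx: int, k: int) -> float:
    """Placebo test for a single treated unit.

    Y_all:       (n_units, n_periods) metric matrix, treated city included.
    treated_idx: row index of the actually treated city.
    k:           number of pre-intervention periods (weights fitted on columns < k).
    """
    effects = []
    for i in range(Y_all.shape[0]):
        target = Y_all[i]
        donors = np.delete(Y_all, i, axis=0)  # every other city acts as a donor
        w = fit_synthetic_weights(donors[:, :k], target[:k])
        synthetic = w @ donors
        # Average post-intervention gap between the real and synthetic trajectories.
        effects.append(np.mean(target[k:] - synthetic[k:]))
    effects = np.array(effects)

    treated_effect = effects[treated_idx]
    donor_effects = np.delete(effects, treated_idx)
    # Share of donor cities with an effect larger than the treated city's.
    return np.mean(donor_effects > treated_effect)
```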

Wrap up

These are the points covered in this article and some action points for your next causal study:

  1. Decide if synthetic control is the best method to use.
    - If you can run an experiment, do it: whenever possible, a randomized test such as an A/B test is a better option than synthetic controls.
  2. Synthetic control beats pre-post
    - Pre-post analysis normally compares two numbers that are not comparable: it does not account for external factors or trends that coexisted with the intervention.
  3. Decide on the donor pool
    - Pick units that are similar to the target one. In our case, we looked at neighboring cities with similar order volumes.
  4. Use cross-validation to pick the best combination of cities.
  5. Fit to get a point estimate
    - Add some constraints to help regularize the model.
    - The difference between the real and synthetic lines is the estimated treatment effect (ATE).
  6. Perform a hypothesis test
    - Fisher’s Exact Test gives an intuitive way of reading p-values.

Drawbacks

As nothing in life is perfect, synthetic controls also have some drawbacks. One of them is changes in donor units after the intervention period. In our example, imagine that customers in one of the donor cities couldn’t order due to a bug in the app. This would impact the estimate for our target city, even though the two events are completely unrelated. The same applies to special dates such as holidays. In summary, we should keep an eye on any non-parallel trend that occurs in the donor pool but not in the target (and vice versa).

Because you’re modeling the counterfactual, the synthetic control approach gives the data scientist or analyst a lot of parameters to tune. This can lead to p-hacking or data snooping, where you present results as statistically significant when, in reality, there is no underlying effect. Hence the importance of rigorous cross-validation and a proper significance test.

Lessons Learned

Synthetic Libraries

Whether you’re a Python or R user, you will find plenty of packages with built-in functions that make this whole process very smooth. However, I had the feeling that they work too much like black boxes, and it’s sometimes difficult to understand what the models are actually doing. That’s why it can be preferable to build the regression yourself, using sklearn for example, and create the synthetic data from it.
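For example, a bare-bones sklearn version might look like the sketch below (an assumption on my part that non-negative weights are enough regularization for your case; note it does not force the weights to sum to 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_synthetic_sklearn(Y_donors_pre: np.ndarray, y_cta_pre: np.ndarray) -> np.ndarray:
    """Donor weights via non-negative least squares, built with sklearn.

    Y_donors_pre: (n_cities, n_pre_periods) donor matrix, pre-intervention weeks.
    y_cta_pre:    (n_pre_periods,) target city, pre-intervention weeks.
    """
    # sklearn expects samples in rows, so weeks become samples and cities features.
    model = LinearRegression(fit_intercept=False, positive=True)
    model.fit(Y_donors_pre.T, y_cta_pre)
    return model.coef_  # one weight per donor city

# Usage (with the illustrative arrays from the earlier sketches):
# w = fit_synthetic_sklearn(Y_donors_pre, y_cta_pre)
# synthetic_cta = w @ Y_donors_full  # trajectory over the full history
```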

Building synthetics using more than 1 variable

So far, we’ve only shown how to create a synthetic control using a single variable. In reality, you could use more variables and optimize for all of them at the same time. In this case, we just need to add a term to our formula and find the w that minimizes:
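One natural form of the extended objective (my reconstruction of the idea described above, with a and b denoting the two metrics):

w* = argmin over w of Σₜ [ ( aₜ(1) − Σⱼ wⱼ · aₜ(j) )² + ( bₜ(1) − Σⱼ wⱼ · bₜ(j) )² ]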

This sounds like a nice idea, but in a few real cases I saw the model lose fit because it has to satisfy two criteria at once. So, in my case, I prefer to be strict in the donor pool selection and use only one variable. If you need to evaluate another metric, you can always generate another synthetic city based on a different donor pool.

Acknowledgments

This article was written with the help of David Masip and Ezequiel Smucler, both of whom have been mentoring me on the causal inference adventures for a long time now.

We hope you enjoyed the read. For any doubts or questions, feel free to write a comment below! If you are interested in experiment design, causal inference, or any of the other interesting challenges we face at Glovo, feel free to browse our opportunities and apply!

References

Abadie, A. (2021). Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects. Journal of Economic Literature, 59(2), 391–425.

Facure, M. Causal Inference for the Brave and True, Chapter 15: Synthetic Control. https://matheusfacure.github.io/python-causality-handbook/15-Synthetic-Control.html

https://mpra.ub.uni-muenchen.de/106390/14/MPRA_paper_106390.pdf

Hyndman, R. J., & Athanasopoulos, G. Forecasting: Principles and Practice (3rd ed.). https://otexts.com/fpp3/
