Causal inference using synthetic controls

How to estimate causal effects using Machine Learning instead of A/B testing

Andreas Stöffelbauer
Data Science at Microsoft
14 min read · Apr 11, 2023


Photo by Isaac Smith on Unsplash.

Last year, Xandr became part of Microsoft, whose mission it is “to empower every person and every organization on the planet to achieve more.” In order to do that, we need to understand what makes every person and organization successful in the first place. We know that every change we make (to our products or algorithms) and any change our customers make (to their setups or campaigns) can have an effect, be it positive or negative. This is where causal inference comes into play. In this article, I describe how we approach some of these problems using Machine Learning.

Unfortunately, conventional Machine Learning methods typically suffer from a critical shortcoming when it comes to causal inference: They are designed to exploit correlations rather than causal relationships. This is often good enough for making predictions, but it is insufficient when it comes to understanding causes and effects (i.e., the why or what if). This is simply because correlation does not imply causation.

In many real-world cases, however, we are actually more interested in understanding a causal effect than a correlation. Part of the reason is that causal effects are more robust, since they do not suffer as much from model drift, and they are a great tool for supporting data-driven decisions.

This article is about one such method, called the synthetic control method (SCM). In short, its main idea is to use a synthetic control group instead of a randomized control group, which is what is typically done in A/B testing. As a result, SCM is significantly easier to apply in various use cases.

Before diving into a real-world example from Xandr/Microsoft Advertising and the ad tech world, let’s review the motivation behind synthetic controls and how they relate to other approaches.

Why use synthetic controls?

The gold standard in causal inference is the randomized controlled trial (RCT) — also referred to as an online experiment or A/B test — where a random subgroup is exposed to a “treatment” or “intervention” while the remainder acts as the control group. This ensures that any observed difference between the two groups can be due only to the treatment and not to any other (unobserved) variable.

A/B tests are difficult and expensive to implement, however, or are sometimes even infeasible due to technical or ethical reasons. Therefore, alternative methods are often more realistic in actual practice, and certainly more convenient. The synthetic control method can be used even when the intervention is applied to an entire population or group — and without the need for random assignments.

Such situations often occur naturally or passively in practice, for example, when an external shock affects an entire group. But it is also possible to design an experiment around an active intervention, and such experiments are much simpler to realize than running a full-blown A/B test, as we will explore below.

A real-world use case from Xandr

Consider the following problem and its context.

Xandr runs a large exchange for selling and buying digital advertising space. As part of this process, we send millions (if not billions) of bid requests every day to potential buyers — you can think of them simply as the brands that want to buy ad space to serve their ad — who respond with their bids in real time.

We typically send out those bid requests using the OpenRTB (Real-Time Bidding) protocol, a standard in the advertising industry. The request contains information about the available ad space (we also call these placements) and what kind of response we are expecting. For example, it includes details like acceptable ad sizes (e.g., 250x300) and ad formats (e.g., banner or video), and it can also contain optional and custom fields.
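
For illustration, here is a minimal sketch of what such a request might look like, written in R with the jsonlite package. The structure and field names below are simplified placeholders rather than the full OpenRTB schema, which contains many more fields.

```r
# A minimal, simplified sketch of a bid request; the field names are
# illustrative placeholders and do not reproduce the full OpenRTB spec.
library(jsonlite)

bid_request <- list(
  id  = "request-123",                      # hypothetical request ID
  imp = list(list(                          # one impression (ad slot) on offer
    tagid  = "placement-42",                # hypothetical placement identifier
    banner = list(
      format = list(list(w = 250, h = 300)) # acceptable ad size(s)
    )
  ))
)

cat(toJSON(bid_request, auto_unbox = TRUE, pretty = TRUE))
```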

The point is that the buyer’s bidding behavior may change depending on the information in the OpenRTB bid request. Say we are thinking of changing the ad size for a placement; we might see higher or lower bids, but it is difficult to say what will happen without running an experiment.

Such changes occur quite often due to the dynamic nature of the industry, which is why we would like to have a general method to help us estimate the revenue impact. This is the use case for the remainder of this article. The aim is to provide some solid intuition for the method discussed to help you apply it in similar scenarios of your own.

How (and how not) to approach the problem

Let’s consider only one specific placement (i.e., ad slot) for now, which we call our target unit. This makes the problem more manageable. Say we would like to make some change to an OpenRTB request but want to know what effect this will have on our revenue. How can we make sure we estimate the true causal effect and not some spurious correlation or noise?

First, here are some naive and generally unreliable methods that show what can go wrong in the absence of causal inference techniques.

  • Before versus after
    A simple pre- versus post-intervention comparison is rarely a good idea. For example, we see A LOT of seasonality in our revenue, including weekly, monthly, quarterly — you name it — not to mention all the noise (see image below). We simply cannot say for sure that any observed change in revenue is due to the intervention — a drop or increase could be due just to seasonality. These are not easy conditions for causal inference.
Daily revenue for a specific placement (i.e., ad slot) over time. Weekly seasonality is clearly recognizable, and the spike marks the end of Q2, when buyers typically spend all their remaining budgets for H1.
  • Across groups
    We could compare the target placement to another, similar placement. If the target spikes or drops but not the control, this may be due to the intervention. However, comparisons across groups or units can be misleading for similar reasons — for example, different placements may follow different seasonalities, meaning they are not necessarily comparable. In other words, it is usually very difficult to determine which ad slot would make for a reliable control group.

Here are some better methods that are based on causal inference.

  • A/B testing
    In an ideal world, we would run an A/B test by sending the new request exactly half the time at random and measuring the difference between the two groups. Let’s assume that this is not an option for us, as it is quite tedious to implement, which is often the case in practice.
  • Natural experiment
    If we cannot design our own experiment, we can also look for “natural” experiments that may occur. Say, for example, that we can find two truly identical ad slots that always appear on the same webpage, one on the left side and one on the right side, so that they are equivalent in every other respect. We can then send the new request via one of those two ad slots and the original one via the other and thereby essentially run an A/B test.
  • Synthetic control method
    Even if there is no single ad slot that makes for a good control group, some are certain to share some similarities with the target such that we can learn from them. In fact, by finding a weighted combination of placements, we might be able to create a synthetic control group that resembles the target very closely during the pre-treatment period. Then, if the two groups diverge after the intervention is applied, the difference must be due to the intervention.

In other words, the main idea of the synthetic control method is that even if there is not one “good enough” control unit, as long as we have some similar units, we can learn from a synthetic control group that is better than each of those units by themselves.

How to design the experiment

Conducting the experiment involves the following four steps. These are quite universal and apply to other scenarios, too.

  1. Identify relevant units
    Identify the treatment/target unit as well as a number of “similar” units, which make up the so-called donor pool.
  2. Perform the intervention
    Perform the desired treatment/intervention on the target unit only.
  3. Fit the model
    Train the method on the pre-intervention period, which creates the synthetic control group.
  4. Interpret the results
    Observe the pre-intervention and post-intervention fits; the latter shows you the causal effect.

Here’s what this process looks like in practice, step by step.

Step 1: Identify relevant units

For the experiment, we partnered with one of our publishers (i.e., a seller of ad space). We chose a specific ad slot on one of their main pages that features a relatively high daily volume (on the order of several million views a day).

As a first step, out of all the other placements that the publisher has (which probably number in the tens or even hundreds), we identified a subset of around ten placements that are “similar” to our target (because tens or hundreds would be too many). These placements make up the donor pool.

What “similar” means is highly dependent on the problem, so domain knowledge is essential here. For example, we probably want to choose ad slots that appear on the same or on related websites and that serve the same ad type (e.g., image or video) as we would expect them to be more similar and thus better control units than completely unrelated ones. One thing to keep in mind, however, is that there should be no spill-over effects among the units.

In general terms, what you are looking for are units that show some co-movement or parallel behavior with the treated unit on the target metric, which in our case is revenue. It is this co-movement that the synthetic control method exploits, as you will see.
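
As a rough first screen for co-movement, you can correlate each candidate placement’s standardized daily revenue with the target’s over a historical (pre-treatment) window. The sketch below is just one way to shortlist candidates before applying domain knowledge; the data frame `revenue` and its column names are hypothetical placeholders.

```r
# Rough co-movement screen, assuming a hypothetical data frame `revenue` with
# one row per day and one numeric column of daily revenue per placement,
# restricted to the pre-treatment window.
target     <- "placement_target"              # hypothetical target column name
candidates <- setdiff(names(revenue), target)

# Standardize each series so that different volumes do not dominate the comparison.
std <- as.data.frame(scale(revenue))

# Correlation of each candidate with the target.
co_movement <- sapply(candidates, function(p) cor(std[[target]], std[[p]]))

# Shortlist the most strongly co-moving placements as donor pool candidates.
sort(co_movement, decreasing = TRUE)[1:10]
```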

Daily revenue for the chosen target placement (in blue) and for the placements in the donor pool (in grey). Revenue numbers were standardized to account for different volumes.

The placements all clearly share a similar seasonality (see the plot above), although they do not have to move strictly in parallel — this would only be a requirement for the Difference in Differences (DiD) method, which can be thought of as a predecessor of SCM.

Even now, however, it is still not easy to see which of the placements would make for the best control group, or how to combine them to create a good control group. Luckily, we don’t have to, because this is exactly what the synthetic control method is designed to learn from the data. It will find a weighted combination of the units in the donor pool to create an optimal synthetic control group.

Step 2: Perform the intervention

We implemented the actual intervention (i.e., sending a new bid request) on July 25. We used a couple of weeks prior to that as the learning period, and the two weeks afterward until August 7 as our experiment or post-treatment period. After this, we reversed the intervention and analyzed the results.

Note that the intervention does not have to be an active one, as I’ve already mentioned. It may happen (or may have already happened) naturally, but you can still analyze the causal effect as if it were your own experiment.

Step 3: Fit the model

To run the synthetic control model and analyze the results, we used R’s Synth library. Apologies to all Python purists, but the package is easy to use and very capable. Note that we are not in the world of big data here; rather, we are using a well-defined, small-to-medium-scale experiment to uncover a causal truth. And for this, R is a great tool.
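
Below is a sketch of what fitting the model with Synth can look like. The dataprep() and synth() calls follow the package’s standard workflow, but the data frame `panel`, its column names, the unit identifiers, and the period indices are hypothetical placeholders standing in for our actual data.

```r
# A sketch of the Synth workflow; `panel` is a hypothetical long-format data
# frame of daily placement data with numeric unit and time identifiers.
library(Synth)

dataprep.out <- dataprep(
  foo                   = panel,            # long-format panel data
  predictors            = c("impressions"), # example predictor(s)
  predictors.op         = "mean",
  dependent             = "revenue",        # the target metric
  unit.variable         = "unit_id",        # numeric placement identifier
  unit.names.variable   = "unit_name",
  time.variable         = "day",            # numeric time index
  treatment.identifier  = 1,                # the treated placement
  controls.identifier   = 2:11,             # the ~10 placements in the donor pool
  time.predictors.prior = 1:42,             # pre-treatment (learning) period
  time.optimize.ssr     = 1:42,
  time.plot             = 1:56              # pre- plus post-treatment period
)

synth.out <- synth(data.prep.obj = dataprep.out)

# Plot observed (treated) vs. synthetic revenue over the whole period.
path.plot(synth.res = synth.out, dataprep.res = dataprep.out,
          Ylab = "Revenue (standardized)", Xlab = "Day")
```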

A look under the hood

So how does the synthetic control method work? In this subsection, I provide some of the slightly more technical details, which you’re free to skip if you want.

This is the only math you are going to see in this article, and though minor it’s important as it defines the synthetic control method:

Ŷ_1t = Σ_i w_i Y_it        τ̂_t = Y_1t − Ŷ_1t

It says that the synthetic control estimate Ŷ_1t is a weighted combination of the units i in the donor pool, and that the estimated causal effect τ̂_t is the difference between the observed Y of the target (unit 1) and that synthetic control estimate.

The weights are usually restricted to be non-negative and sum to one. The reason behind this restriction is that the solution can then only be an interpolation of the units, unlike other ML methods such as linear regression, which can extrapolate. Interpolation means that the synthetic control group is prevented from deviating too much from the units in the donor pool into regions where we don’t have any data points. That said, it can sometimes make sense to relax this restriction and allow extrapolation (the keyword here is interpolation bias).
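
A tiny numeric example illustrates the difference. With non-negative weights that sum to one, the synthetic control always stays within the range spanned by the donors; with unrestricted regression-style weights, it can land far outside anything the donors ever exhibited. The two donor values below are made up purely for illustration.

```r
# Two hypothetical donor units observed on the same day.
donors <- c(unit_a = 10, unit_b = 20)

# Convex weights (non-negative, sum to one): the result stays within [10, 20].
w_convex <- c(0.3, 0.7)
sum(w_convex * donors)        # 17 -- an interpolation of the donor values

# Unrestricted weights (as in ordinary regression): the result can extrapolate
# outside the donors' observed range.
w_unrestricted <- c(1.5, -0.5)
sum(w_unrestricted * donors)  # 5 -- outside anything the donors ever showed
```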

Apart from that, the restriction also gives rise to one of the major advantages of the synthetic control method: It leads to a sparse solution, meaning that some weights will go to zero, somewhat similar to lasso regression if you are familiar with it. In contrast to the lasso, however, it will be sparse in two ways: with respect to the features and with respect to the units in the donor pool. Because of that, the synthetic control method is extremely easy to interpret, which is often incredibly useful in practice.

Note that the equation above does not include features, but the method can in fact use features in addition to the target metric. As in other ML tasks, features matter, but they are not strictly necessary for understanding the synthetic control method, and they serve a slightly different purpose than they do in classical ML models. Specifically, they are not only meant to be predictive of the target metric; they also help the method determine which units to rely on, based on how “similar” their features are to those of the treatment unit. For example, the number of ads shown would be a good feature as it is clearly predictive of revenue (more ads mean more revenue), but features such as ad sizes could also be helpful: they vary among placements, which helps the model find the ones that are most similar and relevant.

Step 4: Interpret the results

We can now fit the SCM on the training data (i.e., the period before the intervention), the results of which are shown in the image below. As a reminder, we are trying to find out whether making a specific change to outgoing bid requests has had an impact on revenue.

First, it is important to look at how well the synthetic control group represents the target unit during the pre-intervention period (see image below). A good fit indicates that the synthetic control represents the target well. If not, we cannot trust the control group, which means we should go back to improving the model and/or the data.

Revenue of the target unit and the synthetic control. The period to the left of the vertical line (representing July 25) is used as the learning period to find the synthetic control group. The fit is not perfect but acceptable. All data points to the right of the line were collected during the experiment. There is no noticeable deviation between the two groups, which indicates there is no causal effect.

You can see that the alignment between the target and the synthetic control group is rough but quite decent throughout the pre-treatment period. It follows the seasonality and trends of the target unit quite well. Given the large amount of variance we commonly see, this may be as good as it gets, although there is certainly still room for feature engineering too.

During the experiment (i.e., during the post-treatment period), the two groups stay well aligned and do not diverge, indicating that the change to the bid request had no measurable causal effect on revenue.

Finally, if we investigate the weights of the solution, we can see that the synthetic control group is a combination of only two of the units in the donor pool, while all other units have received a weight of zero.

Unit weights found by the synthetic control method. Only two of the units have weights that are non-zero. Note that they sum to one, which makes interpretation quite easy.
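
In code, this inspection is straightforward with the (hypothetical) objects from the earlier Synth sketch: the unit weights live in the fitted solution, and the treated-minus-synthetic gap can be computed directly from the prepared data.

```r
# Inspect the unit weights (most will be zero or near zero).
round(synth.out$solution.w, 3)

# Gap = observed revenue of the target minus the synthetic control estimate.
gaps <- dataprep.out$Y1plot - dataprep.out$Y0plot %*% synth.out$solution.w

# Average gap in the (hypothetical) pre-treatment window, which should be close
# to zero if the fit is good, and in the post-treatment window, which is the
# estimated causal effect.
mean(gaps[1:42])
mean(gaps[43:56])

# The package also provides a convenience plot of the same gap over time.
gaps.plot(synth.res = synth.out, dataprep.res = dataprep.out)
```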

Conclusion

The conclusion we draw from the experiment is that the adjustment we made to the bid request did not lead to a change in revenue. These results also match those of an additional natural A/B test that we ran in parallel as verification. While an increase would have been even better, the fact that we did not see a drop is a positive result for us too, as the change resulted in fewer invalid bids and thus less computation and fewer expended resources, which is also good for the environment.

In addition to this use case, we have come across countless scenarios at Xandr/Microsoft Advertising for which the synthetic control method could be a great solution. Many changes that we or our customers make on our platform naturally affect an entire unit — e.g., a placement, a publisher, or a buyer — and this method can help provide an estimate of the causal effect that these changes have on key metrics without having to run a full-blown A/B test. The same is true when it comes to passive or external events that affect one group, which happens a lot in the real world too (for example, a new law that is implemented in one country). It can also be used to roll out a new feature to only one client before including the rest. The key is to have some “similar” units and some historical data from which to learn as a synthetic control group.

Causal inference is a booming field for good reasons. Many tech companies — including Xandr and Microsoft — are successful in part due to their ongoing experimentation. A/B testing will always be the gold standard, but it is not always feasible or worth the effort in practice. The synthetic control method presents an alternative that requires substantially less effort.

Of course, the method has its limitations too. For example, it is sometimes recommended for use only if the pre-treatment fit is almost perfect. As a result, it has been used primarily to study long-term metrics such as GDP or unemployment, which fluctuate much less. In fact, it is unlikely that the method can detect effects that are small in comparison to the noise around them. So, whether daily placement-level data with all its noise represents an ideal condition is open to discussion, but so is the claim that the pre-treatment fit must be perfect. The point is that whether or not the method is the right tool also depends on the magnitude of the effect that you would like to detect and the amount of noise in the data, and I hope this article helps you determine that.

To summarize, this is how the synthetic control method works in four steps:

  1. Identify the treatment unit and a number of similar units that make up the donor pool.
  2. Perform the treatment/intervention (unless it occurs or occurred passively anyway).
  3. The method identifies a weighted combination of the units that acts as a synthetic control group.
  4. If the pre-treatment fit is good, we can trust that the control group is a good counterfactual even for the post-treatment period (but no shocks other than the treatment should occur, of course).

These are the basics of the synthetic control method, but there have also been many improvements and extensions more recently, for example a way of incorporating multiple treated units.

I hope you have enjoyed reading this article. Please leave a comment below or message me on LinkedIn (Andreas Stöffelbauer) if you have any thoughts or questions.

Special thank you to Casey Doyle, Romain Quéré, and Lizzie Elliot for their review and feedback on the article.

And finally, here are some helpful additional resources:

  • Alberto Abadie (2021): Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects (link)
  • Synth R Package: Synthetic Control Group Method for Comparative Case Studies (link)
  • Causal Inference for the Brave and True, Chapter 15 Synthetic Control (link)
