In today’s data-driven world, companies are constantly running experiments to improve their products and services. Traditional A/B testing is widely used to measure the impact of changes, but it can fall short in complex environments where interventions spill over between groups or external factors skew results. This is where synthetic control comes into play.

Synthetic control is an advanced statistical technique that allows for the precise estimation of causal effects, even in the presence of confounding factors like trends or seasonality. It’s a powerful tool actively used by leading companies such as Google, Microsoft, Meta, and Amazon to run experiments at scale. These tech giants leverage synthetic control to evaluate interventions when randomized controlled trials are either impractical or expensive, making it an essential strategy for optimizing product performance and making data-driven decisions.

This article will dive into what synthetic control is, why it’s needed, how it works, and the potential drawbacks to consider when applying it in real-world scenarios. Whether you’re a data scientist, product analyst, or product manager, understanding synthetic control can help you run more effective experiments and draw better insights from your data.

By the end, you’ll have a clear understanding of whether synthetic control is the right fit for your product experiments — and how it can elevate the quality of your insights.

Why is Synthetic Control Needed in A/B Testing?

A/B testing is widely used to measure the impact of a product change or intervention, but sometimes it falls short due to unintended consequences, especially when external factors affect the control group. A great example of this can be seen in ride-hailing companies like Uber, where an intervention in the driver app might not only affect the treatment group but also spill over into the control group — leading to biased results.

Imagine Uber wants to test whether “showing the destination of a ride request” to drivers before they accept the trip decreases post-acceptance cancellations and increases revenue per driver (assuming long rides pay better). In this test, the treatment group sees the destination, while the control group does not. Drivers in the treatment group, knowing where the rider is headed, might be more likely to accept longer trips, which typically yield better earnings. As a result, drivers in the control group may end up with more short-distance trips, offering fewer earning opportunities.

Here’s where things get tricky: because the drivers in the treatment group are cherry-picking long-distance rides, the control group is left with a skewed set of trips — mostly short ones. These drivers would have earned more had there been no intervention at all, leading to biased estimates of the intervention’s true impact. What’s worse, this spillover effect distorts the control group’s earnings potential, making it difficult to measure the real difference between the two groups. As a result, the A/B test fails to capture the true impact of the intervention.

Figure: Biased Results due to the Spillover Effect

Why “Before and After” Analysis Falls Short

One might be tempted to use a before-and-after analysis to measure the effect of showing ride destinations to drivers. However, this approach has its own set of pitfalls. In fast-moving markets like ride-hailing, external factors such as seasonality, market trends, or competitor actions can heavily influence driver behavior and rider demand. For instance, demand may naturally fluctuate depending on the time of year, ongoing promotions, or even traffic patterns. Because of this, comparing metrics from a period before the intervention to a period after won’t give an accurate measure of the effect: it’s impossible to know whether the change is due to the intervention or these external factors.

The Geo-Testing Approach

One common solution to address these issues is geo-testing, where the treatment and control groups are divided geographically. For example, Uber might roll out the change in City A (treatment group) and leave City B (control group) untouched to measure the difference in average earnings per driver. While this can avoid spillover effects, geo-testing introduces its own challenges.

For geo-testing to work, you need two cities that behave very similarly in terms of rider demand, trip lengths, and driver behavior — an extremely difficult task. Different cities have different traffic patterns, customer demographics, and economic conditions, making it challenging to find a true match. What works in City A might not apply to City B, and any differences in performance might be more about city characteristics than the intervention itself. This limits the accuracy of geo-testing as a reliable method.

Why Randomization Across Cities Isn’t Always Feasible

Another approach could be to randomize the intervention across multiple cities, spreading the treatment and control groups across diverse regions to minimize the risk of external influences. While this reduces bias, it can be costly and complex to run, especially for large-scale tests. Randomizing across cities requires extensive planning, a higher level of operational complexity, and significant financial investment, making it impractical for many companies to implement right from the start.

Thus, the need for a better approach becomes clear: a method that controls for these external factors without requiring costly and logistically challenging interventions. This is where synthetic control comes in, offering a more reliable, data-driven solution to many of the limitations mentioned above.

How Does Synthetic Control Work?

Building on the ride-hailing example, where Uber tests the impact of showing ride destinations to drivers in one city, synthetic control offers a method for constructing a reliable comparison group. Instead of relying on traditional A/B testing or geo-testing, a synthetic control is created for the city where the intervention occurs, allowing for a more accurate measurement of the intervention’s effect.

The Core Concept of Synthetic Control

In synthetic control, the objective is to build a synthetic version of the treated city using a weighted combination of other cities that did not receive the intervention. This synthetic control group is designed to closely mimic what would have happened in the treated city if the intervention had not taken place. By comparing the outcomes of the treated city with the synthetic control after the intervention, the effect of the intervention can be isolated more effectively.
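
In notation, this idea can be stated minimally (the symbols below are introduced here for illustration and mirror the regression formulation later in the article): the synthetic control predicts the counterfactual outcome of treated city A at time t as

$$\hat{Y}_{A,t} = \sum_{c \,\in\, \text{donor pool}} \beta_c \, Y_{c,t}, \qquad \widehat{\text{effect}}_t = Y_{A,t} - \hat{Y}_{A,t},$$

where $Y_{c,t}$ is the outcome of donor city $c$ at time $t$ and the weights $\beta_c$ are fit on pre-intervention data.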

Step-by-Step Explanation

Consider the intervention: Uber decides to show ride destinations to drivers in City A. Instead of comparing City A to just one control city, several cities are used to construct a synthetic control.

  1. Selecting the Donor Pool: A group of cities that did not receive the intervention is chosen, referred to as the donor pool. These cities (B, C, D, and others) serve as potential contributors to the synthetic control.
  2. Time Series Data: Data is collected for each city in the donor pool during the same pre-intervention period. Each city is treated as a column, and the rows represent time periods (e.g., weeks or months) before the intervention. The same data is gathered for City A to ensure that pre-intervention trends are captured.
  3. Constructing the Synthetic Control: The goal is to create a weighted combination of cities from the donor pool that closely matches City A’s behavior before the intervention. For example, the synthetic control might be: Synthetic Control = 0.5 * City B + 0.3 * City C + 0.2 * City D. Each city in the donor pool is assigned a weight based on its similarity to City A during the pre-intervention period. The idea is to replicate the pre-intervention behavior of City A as closely as possible.
  4. Post-Intervention Comparison: After constructing the synthetic control, the performance of City A after the intervention (e.g., average revenue per driver) is compared to that of the synthetic control. Any differences between the two are attributed to the intervention, as the synthetic control reflects what would have happened in City A without the treatment. (A minimal code sketch of these steps follows this list.)
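
To make the four steps concrete, here is a minimal end-to-end sketch in Python. Everything in it is illustrative: the data is randomly generated, the 0.5/0.3/0.2 city blend is made up, and plain least squares stands in for the weight-fitting step (restricted weights are discussed later in the article).

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: weekly revenue per driver for 20 pre-intervention weeks.
# Rows are weeks, columns are the donor-pool cities B, C, D.
weeks_pre = 20
donors_pre = rng.normal(loc=[100.0, 90.0, 110.0], scale=5.0,
                        size=(weeks_pre, 3))

# Pretend City A historically behaved like a 0.5/0.3/0.2 blend of B, C, D.
true_weights = np.array([0.5, 0.3, 0.2])
city_a_pre = donors_pre @ true_weights + rng.normal(0.0, 1.0, weeks_pre)

# Step 3: learn weights that reproduce City A's pre-intervention series
# (unrestricted least squares here, purely as a placeholder).
weights, *_ = np.linalg.lstsq(donors_pre, city_a_pre, rcond=None)
print("learned weights:", weights.round(2))

# Step 4: after the launch, predict the counterfactual and compare.
weeks_post = 8
donors_post = rng.normal(loc=[100.0, 90.0, 110.0], scale=5.0,
                         size=(weeks_post, 3))
city_a_post = donors_post @ true_weights + 6.0  # pretend a +6 true lift

synthetic_post = donors_post @ weights
estimated_lift = (city_a_post - synthetic_post).mean()
print(f"estimated lift in revenue per driver: {estimated_lift:.2f}")
```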

Difference from Traditional Linear Regression

While synthetic control may sound similar to traditional linear regression, there are key differences:

  • In linear regression, cities are typically treated as rows, and various characteristics (such as population, weather, or trip length) are treated as columns. The model assigns a coefficient to each variable to predict an outcome.
Figure: Traditional Linear Regression
  • In synthetic control, the cities themselves are treated as columns, and the time periods before the intervention serve as rows. The aim is to assign weights to cities (columns) to express the treated city as a linear combination of other cities over time. Instead of predicting an outcome based on variables, the method builds a synthetic version of the treated city based on the trends of the donor cities.
Figure: Regression for Synthetic Control

The goal of regression in this context is to create a combination of donor cities that closely matches the outcome metrics of the treated city before the intervention. This is done by finding the right weights for each city, similar to how regression finds coefficients for variables.

The objective is not just to combine cities but to minimize a loss function: the gap between the actual pre-intervention outcomes of the treated city and the weighted combination of outcomes from the donor cities. This loss measures how well the synthetic control replicates the treated city’s pre-intervention behavior; by minimizing it, the synthetic control becomes as close as possible to what the treated city would have looked like if there had been no intervention.

$$\hat{\beta} \;=\; \arg\min_{\beta}\; \sum_{t \,\in\, \text{pre}} \Big( Y_{t,\text{pre}} - \sum_{c} \beta_c \, X_{c,t,\text{pre}} \Big)^{2}$$

The above equation shows how to combine cities in the control group by taking a weighted average of their outcomes. Specifically, $Y_{t,\text{pre}}$ represents the outcome of the treated city during the pre-treatment period, $\beta_c$ refers to the coefficients (or weights) that need to be learned, and $X_{c,t,\text{pre}}$ represents the outcomes of the control cities during the pre-treatment period. The goal is to minimize the loss function so that the synthetic control closely resembles the treated city before the intervention. This ensures that the synthetic control behaves similarly to the treated unit, forming a valid counterfactual for comparison post-intervention.

Using Restricted Weights: Avoiding Overfitting

One common variation of synthetic control involves the use of restricted weights. Without restrictions, the model may overfit the data by assigning disproportionately high weights to one or two cities, leading to unrealistic or biased synthetic controls. For example, the model might heavily weight City B while underweighting others, making the synthetic control less accurate.

To avoid overfitting, constraints are applied to the weights. These constraints ensure that the synthetic control remains balanced and interpretable. For instance, non-negative weights may be enforced to prevent any city from receiving a negative contribution, and the sum of weights across different cities (donor units) must be 1. This ensures that the synthetic control reflects a realistic blend of cities, rather than being overly influenced by a single location.

$$\hat{\beta} \;=\; \arg\min_{\beta}\; \sum_{t \,\in\, \text{pre}} \Big( Y_{t,\text{pre}} - \sum_{c} \beta_c \, X_{c,t,\text{pre}} \Big)^{2} \quad \text{subject to} \quad \beta_c \ge 0 \;\;\text{for all } c, \qquad \sum_{c} \beta_c = 1$$

Using restricted weights helps produce a synthetic control that is more robust and easier to interpret, reducing the risk of overfitting while providing clearer insights into how the cities in the donor pool contribute to the outcome.
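
One way to implement these constraints is a small constrained least-squares problem. Below is a sketch using `scipy.optimize.minimize`; the weight vector `w` plays the role of β in the equation above, and the function name and reuse of the toy arrays from the earlier sketch are my own choices, not a standard API.

```python
import numpy as np
from scipy.optimize import minimize

def fit_restricted_weights(donors_pre, treated_pre):
    """Find non-negative weights summing to 1 that minimize the
    pre-intervention squared error between the treated city and the
    weighted donor combination."""
    n_donors = donors_pre.shape[1]

    def loss(w):
        return np.sum((treated_pre - donors_pre @ w) ** 2)

    result = minimize(
        loss,
        x0=np.full(n_donors, 1.0 / n_donors),            # start uniform
        bounds=[(0.0, 1.0)] * n_donors,                  # w_c >= 0
        constraints=[{"type": "eq",
                      "fun": lambda w: w.sum() - 1.0}],  # weights sum to 1
        method="SLSQP",
    )
    return result.x

# Reusing donors_pre / city_a_pre from the earlier sketch:
# w = fit_restricted_weights(donors_pre, city_a_pre)
# print("restricted weights:", w.round(2), "sum:", w.sum().round(2))
```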

This method offers a more accurate alternative to traditional A/B testing and geo-testing by creating a synthetic control that mimics the pre-intervention trends of the treated city, thus providing a clearer measure of the intervention’s effect.

Statistical Significance in Synthetic Control

After constructing the synthetic control and observing the difference between the treated unit (e.g., City A) and its synthetic counterpart post-intervention, it’s crucial to determine whether this difference is statistically significant. In other words, is the observed effect due to the intervention, or could it have occurred by chance? This is where randomization and p-value calculation come into play.

Randomization and the Placebo Test

To assess the statistical significance of the results in synthetic control, a common approach is the placebo test through randomization. The basic idea is to repeat the synthetic control process for other cities that did not receive the intervention, as if they had. This helps evaluate how often the observed effect could occur purely by chance.

Here’s how the process works:

  1. Synthetic Control for Non-Treated Units: After observing the difference between the treated city (City A) and its synthetic control, the same method is applied to other cities in the donor pool that didn’t receive the intervention. For each of these cities, a synthetic control is constructed using the remaining donor cities.
  2. Simulating the Intervention for Control Units: For each of these non-treated cities, a placebo intervention is introduced at the same time that the real intervention occurred in City A. The difference between these cities and their respective synthetic controls after the placebo intervention is calculated in the same way it was done for City A.
  3. Distribution of Placebo Effects: After calculating the post-intervention differences for the placebo cities, a distribution of these placebo effects is created. This distribution represents the range of differences that could have occurred by chance in cities that did not receive the actual intervention.
  4. Calculating the p-value: The p-value is then calculated by comparing the real effect observed in the treated city (City A) to the placebo distribution. Specifically, the p-value represents the proportion of placebo effects that are equal to or larger than the observed effect in City A.

p-value = (number of placebo effects ≥ observed effect) / (total number of placebo effects)

A small p-value (typically below 0.05) suggests that the observed difference is unlikely to have occurred by chance, meaning the intervention likely had a statistically significant impact. If the p-value is large, it indicates that the observed effect could easily have happened by chance, and the results are not statistically significant.
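
Here is a sketch of this placebo loop, assuming a `fit_weights(donors_pre, treated_pre)` helper such as either fitting routine above, and matrices `Y_pre` / `Y_post` with time as rows and cities as columns (the treated city in column 0). All names are illustrative, and comparing effect magnitudes is one common (two-sided) convention rather than the only one.

```python
import numpy as np

def estimated_effect(y_pre, y_post, donors_pre, donors_post, fit_weights):
    """Fit weights on the pre-period, then return the average post-period
    gap between a city and its synthetic control."""
    w = fit_weights(donors_pre, y_pre)
    return np.mean(y_post - donors_post @ w)

def placebo_p_value(Y_pre, Y_post, fit_weights, treated_col=0):
    """Run the synthetic-control procedure on every city as if it were
    treated, then compare the real effect against the placebo effects."""
    n_cities = Y_pre.shape[1]
    effects = []
    for c in range(n_cities):
        donor_cols = [j for j in range(n_cities) if j != c]
        effects.append(estimated_effect(
            Y_pre[:, c], Y_post[:, c],
            Y_pre[:, donor_cols], Y_post[:, donor_cols],
            fit_weights,
        ))
    observed = abs(effects[treated_col])
    placebos = [abs(e) for i, e in enumerate(effects) if i != treated_col]
    # p-value: share of placebo effects at least as large as the observed one.
    return sum(e >= observed for e in placebos) / len(placebos)
```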


How Iterating Over Different Cities Helps Calculate p-values

This process of iterating over different cities is what makes the p-value calculation concrete. By repeating the synthetic control process for cities that did not receive the intervention, it’s possible to simulate what could have happened in a world where no intervention occurred. These placebo cities create a baseline for comparison, showing the natural variation in the outcome metric across cities without the treatment.

When the true effect for the treated city (City A) is compared against this baseline distribution, it becomes clear how exceptional or common the observed effect is. If the treated city’s effect stands out significantly from the placebo effects, it indicates that the intervention likely had a real impact. This method grounds the p-value calculation in a real-world context, making it easier to see whether the observed result is truly meaningful or could simply arise by chance.

Interpreting the Results

The randomization-based placebo test provides a robust way to assess whether the intervention truly caused the observed effect. By simulating the intervention across multiple non-treated cities, it becomes possible to account for natural variations in the data and establish a threshold for significance.

For example, if City A shows a large difference between its actual post-intervention metrics and the synthetic control, and only a few placebo cities exhibit differences of similar magnitude, the p-value will be low, supporting the hypothesis that the intervention was effective. Conversely, if many placebo cities show large differences similar to the treated city, the p-value will be high, indicating that the observed effect could easily have occurred by chance.

Advantages of Randomization for Significance Testing

The randomization-based approach offers several advantages:

  • Non-parametric: This method doesn’t require strong assumptions about the distribution of the data, unlike many traditional statistical tests. It works directly with the observed data and provides an empirical measure of significance.
  • Flexibility: The placebo test can be applied in various contexts, regardless of the specific metrics being measured, making it a versatile tool for evaluating the significance of synthetic control results.
  • Intuitive Interpretation: The comparison between the treated city and the placebo cities makes it easier to understand whether the observed effect is truly exceptional or something that could occur by chance.

Potential Drawbacks of Synthetic Control

While synthetic control is a powerful and innovative method for estimating causal effects in A/B testing, it comes with certain limitations that should be considered. These drawbacks can affect both the practicality and accuracy of the approach in real-world scenarios.

  • Availability of Suitable Donor Units: One of the primary challenges with synthetic control is the need for a robust donor pool of control units (e.g., cities, regions, or other test groups). If the donor cities do not closely resemble the treated city in terms of pre-intervention trends and other relevant characteristics, it becomes difficult to construct an accurate synthetic control. This can result in biased or unreliable estimates. In cases where there are not enough similar cities to create a balanced synthetic control, the method may fail to produce meaningful results.
  • Overfitting: While synthetic control typically uses weights to combine cities from the donor pool, there is a risk of overfitting — especially if the algorithm assigns disproportionately high weights to a small subset of cities that happen to match the treated city’s pre-intervention behavior. This can lead to a synthetic control that is overly tailored to the pre-intervention data but does not generalize well to the post-intervention period. Although constraints on weights (such as limiting weights to be non-negative) can reduce overfitting, this issue still poses a challenge, particularly in complex cases where the number of control units is limited.
  • Sensitivity to Changes in the Donor Pool: Another limitation of synthetic control is its sensitivity to shifts or changes in the donor pool over time. The method relies on the assumption that the relationship between the treated city and the donor cities remains consistent from the pre-treatment to the post-treatment period. However, any significant changes in the donor cities — such as economic shifts, policy changes, or external shocks — can lead to incorrect estimates. If the donor cities start behaving differently compared to the pre-treatment period, the synthetic control constructed from their outcomes may no longer accurately represent the counterfactual scenario, resulting in biased or unreliable treatment effect estimates.
  • Inability to Capture Individual-Level Treatment Effects: Another limitation of synthetic control is that it provides only an average treatment effect across the treated unit (such as a city or region) rather than at the individual level. This means that while the method can estimate the overall impact of an intervention on a group (e.g., City A in the ride-hailing example), it cannot capture how different individual users or drivers are affected. For businesses that require granular insights, such as the effects on specific segments of users or regions within the treated city, this lack of individual-level data may limit the depth of analysis. This contrasts with randomized controlled trials, which can estimate treatment effects at both the group and individual level.

Conclusion

Synthetic control offers a sophisticated and reliable approach for estimating causal impacts in A/B testing, especially in scenarios where traditional methods like randomization are difficult or costly to implement. By constructing a weighted combination of control units, synthetic control provides a counterfactual that allows for accurate estimation of the treatment effect, even in complex environments with multiple confounding factors, such as trends and seasonality.

This method is particularly useful for cases where interventions might spill over between groups or when data from randomized trials are hard to obtain. It allows organizations to gain insights into the effects of interventions at a larger scale while addressing potential biases inherent in traditional A/B testing methods.

However, synthetic control has its limitations. Its effectiveness relies on the availability of high-quality data and similar control units. It also provides an average treatment effect across a population rather than individual-level insights, making it less suitable for granular user analysis. Additionally, computational complexity and susceptibility to overfitting are challenges that need to be managed carefully.

Despite these limitations, synthetic control remains a powerful tool in the data scientist’s toolkit, offering a way to derive meaningful, statistically significant insights in challenging experimental environments. With careful application and consideration of its drawbacks, it can significantly enhance decision-making in fields ranging from ride-hailing to online platforms and beyond.

If you enjoyed this blog and want to read more about data science, experimentation, and causal inference, feel free to follow my work here. You can also connect with me on LinkedIn to stay updated on the latest insights and discussions!
