Using Synthetic Control Methods for Causal Inference: A Step-by-Step Guide

Philippe Dagher
19 min read · Apr 21, 2023


Causal inference is a critical aspect of data analysis, as it aims to determine the causal effect of a treatment or intervention on an outcome of interest. In real-world scenarios, it is often challenging to establish causality due to the presence of confounding factors and the inability to conduct randomized experiments. Synthetic control methods have emerged as a powerful tool to address these challenges, providing a data-driven approach to estimating causal effects in observational studies.

The purpose of this blog post is to provide a step-by-step guide for implementing two synthetic control methods, demonstrating their application and interpretation using a toy dataset. We will walk you through the entire process, from data preprocessing to the implementation of the two methods, as well as the visualization of the results. By the end of this post, you will have a solid understanding of synthetic control methods and be equipped to apply them to your own datasets for causal inference.

Synthetic Control Methods: An Overview

Synthetic control methods are a family of techniques used to estimate the causal effect of an intervention by constructing a counterfactual unit or group, which serves as a comparison for the treated unit. These methods rely on the combination of multiple untreated units, referred to as the donor pool, to create a synthetic control that closely resembles the treated unit in terms of pre-intervention outcomes and characteristics. By comparing the post-intervention outcomes of the treated unit with those of the synthetic control, we can estimate the causal effect of the intervention.
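
To make the core idea concrete, here is a minimal sketch with made-up donor outcomes and weights (not drawn from any dataset in this post): the counterfactual is simply a weighted average of the donor units' outcomes, and the estimated effect is the gap between the treated unit's observed outcome and that counterfactual.

import numpy as np

# Hypothetical post-intervention outcomes for three donor (untreated) units
donor_outcomes = np.array([10.2, 11.5, 9.8])

# Hypothetical weights learned from the pre-intervention fit (non-negative, summing to 1)
weights = np.array([0.5, 0.3, 0.2])

# The synthetic control outcome is the weighted average of the donor outcomes
synthetic_outcome = donor_outcomes @ weights

# Hypothetical observed outcome of the treated unit after the intervention
treated_outcome = 12.4

# The estimated causal effect is the observed-minus-synthetic gap
estimated_effect = treated_outcome - synthetic_outcome
print("Estimated effect:", estimated_effect)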

In this blog post, we will discuss two distinct synthetic control methods:

  1. Regression-based Synthetic Control: This method involves using pre-treatment variables that are closely related to the outcome of interest to estimate a model that predicts the outcome in the absence of treatment. A synthetic control group is then constructed using post-treatment data, with weights assigned to the control units based on their similarity to the treated unit in terms of pre-treatment variables. The causal effect is estimated by comparing the observed outcome of the treated unit to the predicted outcome from the synthetic control group.
  2. Weighted Donor Pool Synthetic Control: This approach involves constructing a synthetic control group that is a weighted average of the donor pool, with untreated units similar to the treated unit in terms of pre-intervention outcomes. The weights are estimated to minimize the difference between the pre-intervention outcomes of the synthetic control unit and the treated unit. The causal effect is then estimated by comparing the observed outcome after the intervention to the predicted outcome without the intervention.

Both methods offer unique advantages and can be tailored to the specific needs of a given study. In the following sections, we will provide a detailed, step-by-step guide to implementing these methods using a toy dataset.

Preparing the Data

To demonstrate the implementation of the two synthetic control methods, we will use a toy dataset that simulates the popularity and conversion rates of a product and its associated term over a 28-day period. The dataset includes a treated unit (product injection) and multiple control units. The goal is to estimate the causal effect of the product injection on the profit generated.

The toy dataset contains the following variables:

  • unit_id: A unique identifier for each unit (treated and control units).
  • time: The day of observation, ranging from 1 to 28.
  • treatment: A binary indicator, with 1 representing the treated unit and 0 representing control units.
  • popularity_term: The popularity of the term associated with the product.
  • popularity_injected: The popularity of the injected product.
  • conversion_rate_term: The conversion rate of the term associated with the product.
  • conversion_rate_injected: The conversion rate of the injected product.
  • profit: The profit generated at each time step.
import numpy as np
import pandas as pd

np.random.seed(42)

n_units = 6
n_days = 28
n_pre = 14

data = []

for unit in range(n_units):
    for day in range(1, n_days + 1):
        treatment = 1 if unit == 0 else 0
        time_trend = day / n_days

        popularity_term = np.random.normal(loc=50 + 10 * time_trend, scale=5)
        popularity_injected = np.random.normal(loc=30 + 10 * time_trend, scale=3)
        conversion_rate_term = np.random.normal(loc=0.1 + 0.02 * time_trend, scale=0.01)
        conversion_rate_injected = np.random.normal(loc=0.05 + 0.01 * time_trend, scale=0.005)

        if unit == 0 and day > n_pre:
            conversion_rate_injected += 0.03

        profit = (popularity_term * conversion_rate_term + popularity_injected * conversion_rate_injected) * 10

        data.append([unit, day, treatment, popularity_term, popularity_injected,
                     conversion_rate_term, conversion_rate_injected, profit])

columns = ['unit_id', 'time', 'treatment', 'popularity_term', 'popularity_injected',
           'conversion_rate_term', 'conversion_rate_injected', 'profit']

df = pd.DataFrame(data, columns=columns)

Before implementing the synthetic control methods, we need to preprocess the data to ensure it is suitable for analysis. The main preprocessing steps include:

  1. Rescaling: To make the variables comparable, we first standardize each time series independently (per unit and per variable), then rescale all of the standardized series using the mean and standard deviation of the treated unit's profit during the pre-treatment period, putting everything on the treated unit's profit scale. This step is crucial for accurately estimating the weights and predicting outcomes.
  2. Feature Engineering: To account for potential time-based patterns in the data, we add the day of the week as a feature by one-hot encoding it and then standard scaling and rescaling it to the profit scale. This provides additional information to the models, allowing them to better capture the underlying trends in the data.
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Convert time to day of the week (0-6)
df['day_of_week'] = df['time'] % 7

# Apply one-hot encoding to the day_of_week
encoder = OneHotEncoder(sparse=False)  # on scikit-learn >= 1.2, use sparse_output=False instead
encoded_days = encoder.fit_transform(df[['day_of_week']])
day_columns = [f'day_{i}' for i in range(encoded_days.shape[1])]
encoded_days_df = pd.DataFrame(encoded_days, columns=day_columns)

# Combine the original DataFrame with the one-hot encoded day of the week
df_with_days = pd.concat([df, encoded_days_df], axis=1)

# Standardize the time series for each variable and each unit independently
scaler = StandardScaler()
scaled_df = df_with_days.copy()

for unit in scaled_df['unit_id'].unique():
    unit_data = scaled_df[scaled_df['unit_id'] == unit]
    for column in ['popularity_term', 'popularity_injected', 'conversion_rate_term', 'conversion_rate_injected', 'profit'] + day_columns:
        scaled_df.loc[unit_data.index, column] = scaler.fit_transform(unit_data[[column]]).ravel()

# Calculate the mean and standard deviation of the treated unit's profit during the pre-period
treated_pre_period = df_with_days[(df_with_days['treatment'] == 1) & (df_with_days['time'] <= n_pre)]
mean_treated_profit_pre = treated_pre_period['profit'].mean()
std_treated_profit_pre = treated_pre_period['profit'].std()

# Define a function to rescale the standardized time series
def rescale_time_series(series, mean, std):
    return (series * std) + mean

# Rescale all standardized variables for all units
rescaled_df = scaled_df.copy()
for column in ['popularity_term', 'popularity_injected', 'conversion_rate_term', 'conversion_rate_injected', 'profit'] + day_columns:
    rescaled_df[column] = rescale_time_series(scaled_df[column], mean_treated_profit_pre, std_treated_profit_pre)

With the data preprocessed and the features engineered, we can now proceed to implement the two synthetic control methods, starting with the regression-based approach.

Method 1: Linear Regression with Pre-Treatment Variables

The first synthetic control method we will discuss involves using a linear regression model with pre-treatment variables to predict the outcome of interest. The main steps for implementing this method are as follows:

  1. Select pre-treatment variables that are closely related to the outcome of interest.
  2. Estimate a linear regression model using pre-treatment data to relate the pre-treatment variables to the outcome.
  3. Predict the outcome that would have been observed in the absence of the treatment, using post-treatment data.
  4. Construct a synthetic control group by assigning weights to the control units based on their similarity to the treated unit in terms of pre-treatment variables.
  5. Compare the observed outcome of the treated unit to the predicted outcome from the synthetic control group to estimate the causal effect of the treatment.

First, we filter the pre-period data for the control units and calculate, for each control unit, the total pairwise Euclidean distance to the treated unit over the pre-period, based on their features and profit. The control units are then sorted by this distance, and the two closest are selected. Finally, the data is filtered to include only the treated unit and these two closest control units.

from scipy.spatial.distance import cdist

# Filter the pre-period data for control units
control_pre_period = rescaled_df[(rescaled_df['treatment'] == 0) & (rescaled_df['time'] <= n_pre)]

# Use the rescaled treated-unit pre-period so distances are computed on a consistent scale
treated_pre_period = rescaled_df[(rescaled_df['treatment'] == 1) & (rescaled_df['time'] <= n_pre)]

# Calculate the total pairwise Euclidean distance between the treated unit and each control unit during the pre-period
daily_distances = []

for unit in control_pre_period['unit_id'].unique():
    unit_data = control_pre_period[control_pre_period['unit_id'] == unit][['popularity_term', 'popularity_injected', 'conversion_rate_term', 'conversion_rate_injected', 'profit']]
    treated_data = treated_pre_period[['popularity_term', 'popularity_injected', 'conversion_rate_term', 'conversion_rate_injected', 'profit']]
    daily_distance = cdist(unit_data, treated_data, metric='euclidean').sum()
    daily_distances.append((unit, daily_distance))

# Sort the control units by their total distances and select the two closest control units
closest_units = sorted(daily_distances, key=lambda x: x[1])[:2]
closest_units_indices = [unit for unit, distance in closest_units]

# Filter the data to include only the treated unit and the two closest control units
selected_units = rescaled_df[rescaled_df['unit_id'].isin([0] + closest_units_indices)]

print("Closest control units:", closest_units_indices)

To implement this method using Python, we first select the relevant pre-treatment variables and fit a linear regression model on the pre-treatment data. Then, we calculate the Euclidean distances between each control unit’s record in the post-treatment period and the treated unit’s pre-treatment data. Based on these distances, we assign weights to each control unit and use them to construct a synthetic control group.

Once the synthetic control group is constructed, we can compare the treated unit’s observed outcome to the predicted outcome from the synthetic control group to estimate the causal effect of the treatment. We can also analyze the linear regression model’s coefficients to gain insights into the importance of each feature in predicting the outcome of interest.

import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.spatial.distance import cdist

# Select pre-treatment variables, add one-hot-encoded day of the week columns to the features
features = ['popularity_term', 'popularity_injected', 'conversion_rate_term', 'conversion_rate_injected']
day_columns = [f'day_{i}' for i in range(7)]  # time % 7 yields 7 distinct days, so there are 7 one-hot columns
features += day_columns

# Filter pre-treatment data for the treated unit and the two closest control units
pre_treatment_data = selected_units[selected_units['time'] <= n_pre]

# Estimate a linear regression model using pre-treatment data
X_pre_treatment = pre_treatment_data[features]
y_pre_treatment = pre_treatment_data['profit']

reg = LinearRegression().fit(X_pre_treatment, y_pre_treatment)

# Calculate distances between each post-period record and the treated unit's pre-period data
post_treatment_data = selected_units[selected_units['time'] > n_pre].copy()
distances = []
for _, row in post_treatment_data.iterrows():
    control_unit_data = row[features].values.astype(float).reshape(1, -1)
    treated_unit_data = treated_pre_period[features].values
    distance = cdist(control_unit_data, treated_unit_data, metric='euclidean').min()
    distances.append(distance)

post_treatment_data['distance'] = distances
post_treatment_data['weight'] = 1 / post_treatment_data['distance']

# Predict the synthetic outcome using the weights based on similarity to the pre-period test
X_post_treatment = post_treatment_data[features]
predicted_outcome = reg.predict(X_post_treatment)
post_treatment_data['predicted_profit'] = predicted_outcome

# Create a synthetic control group using the post-treatment data and the predicted outcomes
synthetic_control_group = post_treatment_data[post_treatment_data['treatment'] == 0].copy()

# Calculate the weighted predicted profit for the synthetic control group
synthetic_control_group['weighted_predicted_profit'] = synthetic_control_group['predicted_profit'] * synthetic_control_group['weight'] / synthetic_control_group['weight'].sum()
synthetic_control_group = synthetic_control_group.groupby('time').agg({'weighted_predicted_profit': 'sum'})

# Compare the observed outcome of the treated unit to the predicted outcome from the synthetic control group
treated_unit_data = post_treatment_data[post_treatment_data['unit_id'] == 0]
treated_unit_observed_outcome = treated_unit_data['profit'].mean()
treated_unit_predicted_outcome = synthetic_control_group['weighted_predicted_profit'].sum()

# Calculate the causal effect
causal_effect = np.mean(treated_unit_observed_outcome - treated_unit_predicted_outcome)
print("Estimated causal effect of the treatment:", causal_effect)

By implementing this method, we can estimate the causal effect of the treatment and identify the key features that have a significant impact on the outcome. This information can be useful for decision-making, resource allocation, and further research.

# Extract feature coefficients from the linear regression model
feature_coefficients = reg.coef_

# Pair features with their coefficients and sort by magnitude
feature_importances = sorted(zip(features, feature_coefficients), key=lambda x: abs(x[1]), reverse=True)

# Print feature importances
for feature, coefficient in feature_importances:
    print(f"{feature}: {coefficient}")

Method 2: Weighted Least Squares with Stacked Variables

The second synthetic control method we will explore involves constructing a synthetic control group that is a weighted average of the donor pool using weighted least squares (WLS) with stacked variables. The main steps for implementing this method are as follows:

  1. Stack all pre-treatment variables, creating a single column per unit.
  2. Estimate the weights of the donor pool using weighted least squares to minimize the difference between the pre-intervention outcomes of the synthetic control unit and the treated unit.
  3. Construct a synthetic control group using the estimated weights.
  4. Estimate the causal effect by comparing the observed outcome after the intervention to the predicted outcome without the intervention.

To implement this method using Python, we first stack all pre-treatment variables into a single column per unit. Then, we use weighted least squares to estimate the weights of the donor pool, minimizing the difference between the synthetic control unit’s and the treated unit’s pre-intervention outcomes. The choice of weights is crucial for obtaining accurate results, as it determines the extent to which each control unit contributes to the synthetic control group.

Once the synthetic control group is constructed, we can estimate the causal effect by comparing the treated unit’s observed outcome after the intervention to the predicted outcome without the intervention. This method allows us to estimate the causal effect while considering the relationships between all variables in the dataset.

from scipy.optimize import minimize

def weighted_mse_stacked(w, treated, control):
    synthetic_control = (control * w).sum(axis=1)
    return ((treated - synthetic_control) ** 2).mean()

# Define which variables to stack (assumed here: the four covariates plus profit), then
# stack them into a single vector per unit
variables = ['popularity_term', 'popularity_injected', 'conversion_rate_term', 'conversion_rate_injected', 'profit']

treated_pre_period_stacked = np.hstack([treated_pre_period[var].values for var in variables])

control_pre_period_stacked = []
for unit_id in selected_units['unit_id'].unique():
    if unit_id != 0:
        control_unit_data = selected_units[(selected_units['unit_id'] == unit_id) & (selected_units['time'] <= n_pre)]
        stacked_data = np.hstack([control_unit_data[var].values for var in variables])
        control_pre_period_stacked.append(stacked_data)
control_pre_period_stacked = np.column_stack(control_pre_period_stacked)

# Estimate the weights
initial_weights = np.array([0.5, 0.5])
res = minimize(weighted_mse_stacked, initial_weights, args=(treated_pre_period_stacked, control_pre_period_stacked), bounds=[(0, 1), (0, 1)], constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1})

# Apply the weights to the control units
optimal_weights = res.x
synthetic_control_pre_period_stacked = (control_pre_period_stacked * optimal_weights).sum(axis=1)

# Calculate the predicted outcome without the intervention (post-period profit of the synthetic control)
control_post_period_stacked = []
for unit_id in selected_units['unit_id'].unique():
    if unit_id != 0:
        control_unit_data = selected_units[(selected_units['unit_id'] == unit_id) & (selected_units['time'] > n_pre)]
        stacked_data = np.hstack([control_unit_data[var].values for var in ['profit']])
        control_post_period_stacked.append(stacked_data)
control_post_period_stacked = np.column_stack(control_post_period_stacked)

synthetic_control_post_period_stacked = (control_post_period_stacked * optimal_weights).sum(axis=1)

# Compare the treated unit's observed post-period profit to the predicted outcome without the intervention
observed_outcome_stacked = selected_units[(selected_units['unit_id'] == 0) & (selected_units['time'] > n_pre)]['profit'].values
predicted_outcome_stacked = synthetic_control_post_period_stacked
predicted_outcome_stacked = synthetic_control_post_period_stacked

# Estimate the causal effect
causal_effect_stacked = np.mean(observed_outcome_stacked - predicted_outcome_stacked)
print("Estimated causal effect of the treatment using stacked variables:", causal_effect_stacked)

By implementing this method, we can obtain a more comprehensive understanding of the causal effect of the treatment and the role that different variables play in determining the outcome. The choice of weights is critical for accurate results, making it essential to carefully consider the weights and their potential impact on the synthetic control group’s construction.
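
As a quick way to act on this point, the short sketch below reuses the arrays already produced by the Method 2 code above (optimal_weights, treated_pre_period_stacked and synthetic_control_pre_period_stacked) to report the estimated donor weights and the root mean squared error of the pre-period fit; a poor pre-period fit is a warning sign that the resulting causal estimate should not be trusted.

# Inspect the estimated donor weights (non-negative and summing to roughly 1 by construction)
print("Optimal donor weights:", optimal_weights)

# Root mean squared error of the pre-period fit between the treated unit and its synthetic control
pre_period_rmse = np.sqrt(np.mean((treated_pre_period_stacked - synthetic_control_pre_period_stacked) ** 2))
print("Pre-period fit RMSE:", pre_period_rmse)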

Comparing the Two Methods

In this section, we will discuss the similarities and differences between the two synthetic control methods we have explored, as well as their respective strengths and weaknesses.

Similarities:

  1. Both methods involve constructing a synthetic control group using a donor pool of untreated units.
  2. The goal of both methods is to estimate the causal effect of an intervention by comparing the observed outcome after the intervention to the predicted outcome without the intervention.

Differences:

  1. Method 1 uses linear regression with pre-treatment variables to create a synthetic control group, while Method 2 employs weighted least squares with stacked variables.
  2. In Method 1, the synthetic control group is constructed based on the similarity of the pre-treatment variables, while in Method 2, the weights are estimated to minimize the difference between the pre-intervention outcomes of the synthetic control unit and the treated unit.

When to Use Each Method:

  1. Method 1 is suitable when there are clear pre-treatment variables that are closely related to the outcome of interest. This method allows for easy interpretation of feature importance, helping us understand which variables have the most significant impact on the causal effect.
  2. Method 2 is appropriate when we want to consider the relationships between all variables in the dataset and obtain a more comprehensive understanding of the causal effect. This method is particularly useful when it is difficult to identify specific pre-treatment variables or when the relationships between variables are complex.

Strengths and Weaknesses:

  1. Method 1 offers straightforward interpretation and is relatively easy to implement. However, it may not capture the relationships between all variables in the dataset and may be less accurate if the chosen pre-treatment variables are not closely related to the outcome.
  2. Method 2 provides a more comprehensive approach to estimating the causal effect by considering all variables in the dataset. It can potentially yield more accurate results, but it is more complex to implement, and the choice of weights is critical for obtaining accurate results.

In conclusion, both synthetic control methods have their merits and can be used to estimate causal effects depending on the specific context and requirements of the analysis. It is essential to carefully consider the choice of method, the variables used, and the weights assigned to achieve the most accurate and meaningful results.

Visualizing the Results

In this section, we will show how to create plots to visualize the synthetic control data and the causal effects, as well as discuss the insights gained from these visualizations.

Scatter Plots of Covariates and Test Profit Line Plot:

Create scatter plots of all covariates for control units and test units, along with a line plot of test profit.

import matplotlib.pyplot as plt

# Define the colors for control and test units
control_color = 'lightblue'
test_color = 'red'

# Define the covariates to be plotted
covariates = ['popularity_term', 'popularity_injected', 'conversion_rate_term', 'conversion_rate_injected']
control_profit_data = selected_units[selected_units['unit_id'] != 0]
test_profit_data = selected_units[selected_units['unit_id'] == 0]

plt.figure(figsize=(10, 6))

for i, covariate in enumerate(covariates):
    # Create a scatter plot for control units
    plt.scatter(control_profit_data['time'], control_profit_data[covariate], color=control_color, alpha=0.5, label=f'Control units - {covariate}')

    # Create a scatter plot for test units
    plt.scatter(test_profit_data['time'], test_profit_data[covariate], color=test_color, alpha=0.5, label=f'Test units - {covariate}')

# Test profit line plot
plt.plot(test_profit_data['time'], test_profit_data['profit'], color='green', label='Test Profit')

plt.xlabel('Time')
plt.ylabel('Covariates')
plt.title('Scatter Plot of Covariates for Control and Test Units with Test Profit Line Plot')
plt.legend()

plt.show()

By observing the scatter plots, we can gain several insights into the relationship between the covariates and test profit, as well as the overall trend of test profit over time:

Covariate relationship with test profit: By comparing the positions of the test unit and control unit points on the scatter plot, we can identify potential relationships between the covariates and test profit. If test unit points consistently align with higher profit values when certain covariate values are high or low, it may suggest a positive or negative relationship between that covariate and test profit.

Test unit versus control units: Examining the distribution of test and control unit points can provide an indication of how similar the test unit is to the control units in terms of the covariates. If the test unit points are close to the control unit points, it suggests that the test unit is well-represented by the control units, which can lead to more reliable causal effect estimates.

Test profit trend over time: By analyzing the test profit line plot, we can identify trends and patterns in the test profit over time. For example, we may observe an upward or downward trend, indicating an increase or decrease in profit over time, or we may see a cyclical pattern that suggests seasonality or other recurring factors that influence test profit.

Interaction between covariates: The scatter plots can also help us identify potential interactions between the covariates. If we observe that the test unit’s profit increases when two specific covariates are simultaneously high or low, it could indicate that these covariates interact with each other to influence test profit.

Synthetic Control Data Points and Assigned Weights:

Create a scatter plot representing each synthetic control data point along with the weight assigned to it.

# Filter the synthetic control group data
synthetic_control_points = post_treatment_data[post_treatment_data['treatment'] == 0]

# Filter the test profit data before and after the treatment
test_profit_data = selected_units[selected_units['unit_id'] == 0]

# Filter the control data before the treatment
control_pre_treatment_data = pre_treatment_data[pre_treatment_data['treatment'] == 0].copy()

# Calculate the predicted profit for the control units during the pre-treatment period
control_pre_treatment_features = control_pre_treatment_data[features]
control_pre_treatment_predicted_profit = reg.predict(control_pre_treatment_features)
control_pre_treatment_data['predicted_profit'] = control_pre_treatment_predicted_profit

# Create a scatter plot with the weight as the size of the points (post-treatment)
plt.scatter(synthetic_control_points['time'], synthetic_control_points['predicted_profit'],
            s=synthetic_control_points['weight'] / synthetic_control_points['weight'].sum() * 1000,
            alpha=0.5, label='Synthetic Control Points (Post)')

# Create a scatter plot of predicted profit for controls before the treatment
plt.scatter(control_pre_treatment_data['time'], control_pre_treatment_data['predicted_profit'], alpha=0.5,
            label='Predicted Profit for Control Points (Pre)')

# Plot the test profit data before and after the treatment as a line
plt.plot(test_profit_data['time'], test_profit_data['profit'], color='red', label='Test Profit')

plt.xlabel('Time')
plt.ylabel('Profit')
plt.title('Synthetic Control Data Points with Assigned Weights, Predicted Profit for Controls Before, and Test Profit')
plt.legend()

plt.show()

Observing the distribution of weights in the scatter plot can provide valuable insights about the importance of different control units in constructing the synthetic control group. Some insights that can be gained include:

Relative importance of control units: Larger circles in the scatter plot represent higher weights, indicating that these control units are more important in constructing the synthetic control group. Smaller circles signify lower weights, suggesting that these control units have less influence on the synthetic control group. By examining the distribution of weights, you can identify which control units contribute the most to the synthetic control group.

Similarity between control units and treated unit: Higher weights assigned to control units imply a higher degree of similarity between those control units and the treated unit in terms of pre-treatment characteristics. By analyzing the weights, you can assess how well the synthetic control group is able to approximate the treated unit based on the pre-treatment variables.

Diversity in the synthetic control group: If the distribution of weights is highly skewed, with a few control units having very high weights and the rest having low weights, it may indicate that only a few control units are highly similar to the treated unit. On the other hand, if the weights are more evenly distributed, it suggests that multiple control units contribute similarly to the synthetic control group, resulting in a more diverse and representative control group.

Stability of the synthetic control group: The distribution of weights can also provide insights into the stability of the synthetic control group. If the weights are highly concentrated on a small number of control units, the synthetic control group might be sensitive to changes in those particular units. Conversely, a more evenly distributed set of weights might lead to a more stable synthetic control group, as it would be less sensitive to changes in individual control units.
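
One simple way to quantify this concentration, sketched below using the optimal_weights array estimated in Method 2, is the effective number of donors: the inverse of the sum of squared weights (an inverse Herfindahl index). It equals the number of donors when the weights are uniform and approaches 1 when a single donor dominates.

# Effective number of donors: 1 / sum(w_i^2); assumes the weights are normalized to sum to 1
effective_donors = 1.0 / np.sum(np.asarray(optimal_weights) ** 2)
print("Effective number of donors:", effective_donors)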

Pre- and Post-Treatment Profit Comparison:

Create a plot comparing the observed outcome of the treated unit with the predicted outcome from the synthetic control group, both before and after the intervention.

# Define the colors for control and test units
control_color = 'lightblue'

# Define the covariates to be plotted
covariates = ['popularity_term', 'popularity_injected', 'conversion_rate_term', 'conversion_rate_injected']
control_profit_data = selected_units[selected_units['unit_id'] != 0]

# Create a scatter plot for each covariate
plt.figure()
for covariate in covariates:
    plt.scatter(control_profit_data['time'], control_profit_data[covariate], color=control_color, alpha=0.5)

# Test profit line plot
plt.plot(test_profit_data['time'], test_profit_data['profit'], color='red', label='Test Profit')

# Control covariates mean line plot
covariates_mean = control_profit_data.groupby('time')[covariates].apply(lambda x: x.values.mean())
plt.plot(covariates_mean.index, covariates_mean, color='blue', label='Control covariates "overall mean"')

plt.xlabel('Time')
plt.ylabel('Covariate Value')
plt.title('Scatter Plot of Covariates for Control and Test Profit Line Plot')
plt.legend()

plt.show()

This plot provides several valuable insights into the estimated causal effect of the intervention and the overall performance of the synthetic control method in predicting the counterfactual outcome:

Estimated causal effect: The difference between the observed outcome of the treated unit (red line) and the predicted outcome from the synthetic control group (blue line) after the intervention represents the estimated causal effect of the treatment. A larger difference indicates a more substantial impact of the intervention on the treated unit’s profit.

Pre-intervention fit: The alignment of the red line and the blue line before the intervention demonstrates the synthetic control method’s ability to match the treated unit with a suitable control group. A closer fit suggests that the synthetic control group is a good representation of the treated unit’s counterfactual outcome in the absence of the intervention.

Post-intervention deviation: The deviation between the red line and the blue line after the intervention provides an indication of the treatment’s effectiveness. A larger deviation suggests a more significant impact of the treatment, while a smaller deviation indicates a lesser effect.

Synthetic control performance: The overall performance of the synthetic control method can be assessed by examining how well the predicted outcome (blue line) follows the observed outcome (red line) before and after the intervention. A better match indicates that the synthetic control method has successfully identified a suitable control group that can effectively estimate the counterfactual outcome of the treated unit.
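
If you want to inspect this day by day rather than as a single average, the short sketch below reuses post_treatment_data and treated_unit_data from the Method 1 code and renormalizes the weights within each day (an assumption on my part, since the code above normalizes over the whole post-period) to form a per-day counterfactual and the corresponding gap series.

# Per-day counterfactual: weight-normalize within each day, then combine the control predictions
daily = post_treatment_data[post_treatment_data['treatment'] == 0].copy()
daily['w_norm'] = daily['weight'] / daily.groupby('time')['weight'].transform('sum')
daily_counterfactual = (daily['predicted_profit'] * daily['w_norm']).groupby(daily['time']).sum()

# Gap between the treated unit's observed profit and the per-day counterfactual
gap = treated_unit_data.set_index('time')['profit'] - daily_counterfactual
print(gap)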

Feature Importance in Method 1:

Create a bar chart representing the feature importance obtained from the linear regression model in Method 1.

import matplotlib.pyplot as plt
import pandas as pd

# Extract the feature names and their corresponding coefficients from the linear regression model
feature_importance = {feature: coef for feature, coef in zip(features, reg.coef_)}

# Create a pandas DataFrame for the feature importance
feature_importance_df = pd.DataFrame(list(feature_importance.items()), columns=['Feature', 'Coefficient'])

# Sort the DataFrame by the absolute value of the coefficients
feature_importance_df = feature_importance_df.reindex(feature_importance_df.Coefficient.abs().sort_values(ascending=False).index)

# Create a bar chart for the feature importance
plt.figure(figsize=(10, 5))
plt.bar(feature_importance_df['Feature'], feature_importance_df['Coefficient'], color='b')
plt.xticks(rotation=45)
plt.xlabel('Feature')
plt.ylabel('Coefficient')
plt.title('Feature Importance from Linear Regression Model (Method 1)')

plt.show()

From the bar chart representing the feature importance, you can gain several insights about the relationship between the pre-treatment variables and the causal effect:

Identifying the most significant variables: By observing the coefficients’ magnitude in the bar chart, you can identify which pre-treatment variables have the most significant impact on the causal effect. A larger absolute value of the coefficient suggests a stronger relationship between the variable and the outcome. This information can help you focus on the key factors driving the causal effect.

Assessing the direction of the impact: The coefficients’ signs indicate the direction of the relationship between the pre-treatment variables and the causal effect. A positive coefficient suggests that an increase in the variable is associated with an increase in the outcome, while a negative coefficient implies an inverse relationship.

Evaluating the relative importance of variables: The chart enables you to compare the coefficients across the pre-treatment variables, allowing you to rank them in terms of their relative importance. This can be useful for prioritizing interventions, allocating resources, or targeting specific factors for further investigation.

Informing future analyses and decision-making: Understanding which pre-treatment variables are most influential can help guide future data collection, research, and policy decisions. For instance, you may decide to investigate why certain variables have a more significant impact or explore potential interventions that can modify these key factors to improve the outcome of interest.

By creating and analyzing these visualizations, we can gain a deeper understanding of the relationships between variables, the performance of the synthetic control methods, and the estimated causal effects of the interventions. These insights can help guide decision-making and further refine the synthetic control models to achieve more accurate and meaningful results.

Conclusion

In this blog post, we have provided a comprehensive guide to implementing and comparing two synthetic control methods for causal inference using Python. We began by introducing the concept of causal inference and the need for synthetic control methods in various fields. We then provided an overview of the two methods discussed in the post and walked through the process of preparing the data, including rescaling and feature engineering.

For each method, we provided a step-by-step implementation using Python code and discussed the results, insights on feature importance, and the choice of weights. We also demonstrated how to create various visualizations to better understand the relationships between variables, the performance of the synthetic control methods, and the estimated causal effects.

In conclusion, synthetic control methods offer a powerful tool for causal inference when properly implemented and compared. By understanding the strengths and weaknesses of each method, practitioners can make informed decisions about which approach to use in different situations. We hope that this blog post serves as a valuable resource for those interested in learning more about synthetic control methods and applying them in their own work.
