Difference-in-Differences

Learn another method to determine treatment effect via non-experimental data

Figarri Keisha
Bukalapak Data
11 min read · Mar 25, 2022


Photo by Artyom Kabajev on Unsplash

The Difference-in-Differences (DiD) method is a statistical technique to calculate the treatment effect by studying the differential effect between a “treatment group” and “control group”. In terms of method hierarchy, DiD belongs to causal inference methods for non-experimental data. In case you missed it, we have written an introductory article on the topic below.

In this article, we will show how the DiD method helps our analysis of our latest business case and how we execute it in Python.

The Story: Implementation of New Business Strategy

Bukalapak is the market leader in Indonesia’s O2O (online to offline) space (through Mitra Bukalapak), and we are constantly driven to innovate rapidly. We use a variety of business strategies to continually strive for customer satisfaction. As data scientists, it is our responsibility to ensure that the actual impact of our business strategies is not overestimated. So here’s one of our stories on the subject.

One day, our team implemented a new business strategy designed to increase the number of transacting users. To reduce risk, the strategy was implemented only in several cities, such as Bandung, Dumai, Medan, and Sidoarjo, starting in the last two weeks of July.

A natural question follows: how do we check whether the new strategy actually increased the number of transacting users, and how can we be sure that any increase is the result of the new strategy rather than a trend in the data or some other factor?

Early in my career, I found the before-after study to be a common solution for this problem: we simply compare the number of transacting users before and after the treatment period (in this case, the implementation of the new strategy). However, there are many variables that this analysis cannot control for, and a before-after study cannot rule out that something other than the treatment caused the change [1]. We will review why this approach is problematic later in the article.

This is where difference-in-differences (DiD) plays a part. The DiD technique originated in the field of econometrics, but the logic underlying it has been used as early as the 1850s by John Snow and is called the “controlled before-and-after study” in some social sciences [2]. For the rest of the article, we will gradually reveal the DiD concepts through a working example.

The Question: Did the new business strategy actually improve the number of transacting users?

Timeline: the new business strategy started on 19th July and ran for 2 weeks. To reduce the effect of weekly trends across different weeks, we compare the metrics with a similar window in June, starting on 21st June, also for 2 weeks.

Illustration of the Timeline

Comparing Before — After Metrics

Metric Comparison Before and After the Treatment

Comparing the before and after metrics is quite simple. The table above shows that the two targeted cities were not performing well: the WTU (weekly transacting users) dropped from the June period (before we implemented the new strategy) to the July period (after we started implementing it). By looking at these numbers alone, could we state that our new strategy did not improve the number of transacting users?

Not yet! What if, in reality, the metrics in all cities were declining in July? By only comparing before-after metrics in the targeted cities, we are implicitly assuming that the new strategy was the only thing that could have affected the number of transacting users.

Difference-in-Differences

To address the previous concern, we should also check the WTU trend of cities that were not directly impacted by the new business strategy. The idea is to create a control group (as in an A/B test) for our targeted cities. Once we have suitable comparison cities, we can calculate the difference-in-differences by subtracting the difference in the non-targeted cities from the difference in the targeted cities.
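To make the calculation concrete, here is a minimal pandas sketch of that raw difference-in-differences, using the campaign.csv data, city groups, and week_index cutoff that appear in the code later in this article (the per-city-week averaging is an assumption about how the comparison table was built):

import pandas as pd

df = pd.read_csv('csv/campaign.csv')

treated_cities = ['medan', 'bandung']
untreated_cities = ['payakumbuh', 'lampung', 'mojokerto']

cities_df = df[df['city'].isin(treated_cities + untreated_cities)].copy()
cities_df['variant'] = cities_df['city'].isin(treated_cities)
cities_df['after'] = cities_df['week_index'] > 3  # same cutoff as the regression later

# Average WTU for each (group, period) cell
cell = cities_df.groupby(['variant', 'after'])['wtu'].mean()

# (treated after - treated before) - (untreated after - untreated before)
did_estimate = (cell.loc[(True, True)] - cell.loc[(True, False)]) \
             - (cell.loc[(False, True)] - cell.loc[(False, False)])
print(did_estimate)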

The Difference-in-Differences Estimation

It turns out that the declining trend happened not only in the targeted cities but also in the non-targeted cities, even though the decline there was not as large. The difference-in-differences estimate is therefore negative, indicating that the targeted cities had a steeper declining trend than the non-targeted cities.

Parallel-Trend Assumption

Now, the question is how can we choose the comparison group? Can we randomly select the cities? Or is there any specific method that we need to follow?

This is the tricky part of the DiD method. DiD rests on one key assumption, called the parallel-trend assumption: if no treatment had occurred, the difference between the treated group and the untreated group would have stayed the same in the post-treatment period as it was in the pre-treatment period [3].

Parallel-Trend Assumption

One way to check the parallel-trend assumption is to plot the pre-treatment trend and see whether the changes in weekly transacting users move in parallel between the “control” and “variant” cities. In the context of our example (see the figure below), the trend of Bandung and Medan (orange line) is parallel to the trend of Payakumbuh, Lampung, and Mojokerto (blue line), hence we assume the cities would have kept a constant difference over time had no treatment occurred. The intervention effect is shown as the difference between the orange line (variant, before) and the red line (variant, after), and it aligns with the negative DiD estimate in the previous calculation.

Weekly Trend of Transacting Users between “Control” vs “Variant” Group
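As a rough sketch, a plot like the one above can be produced from the same campaign.csv data used later in the article; the exact aggregation (here, the average WTU per group per week) is an assumption:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('csv/campaign.csv')

treated_cities = ['medan', 'bandung']
untreated_cities = ['payakumbuh', 'lampung', 'mojokerto']

# Label each city as "variant" (treated) or "control" (untreated)
plot_df = df[df['city'].isin(treated_cities + untreated_cities)].copy()
plot_df['group'] = plot_df['city'].isin(treated_cities).map({True: 'variant', False: 'control'})

# Average WTU per group per week and plot the two trends
trend = plot_df.groupby(['week_index', 'group'])['wtu'].mean().unstack()
trend.plot(figsize=(10, 5))
plt.ylabel('weekly transacting users')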

Difference-in-Differences with Python

To properly calculate the DiD estimate, we can use a simple and highly intuitive approach that leverages regression. Besides giving the same result as directly calculating (treated group after — treated group before) — (untreated group after — untreated group before), it can also handle multi-group designs: in our context, several cities are assigned to the control group and several others to the treated group, instead of only one city per group. So let’s give the method a try!

The method is called the “two-way fixed effects difference-in-differences estimator” since it has two sets of fixed effects, one for the group (“control” and “variant”) and one for the time period (before and after treatment). This can be achieved with an OLS regression, as shown below.
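Written out (a sketch of the specification referenced above, with Y denoting the outcome, i.e. WTU):

$$Y_{gt} = \alpha_g + \alpha_t + \beta \,\mathrm{Treated}_{gt} + \varepsilon_{gt}$$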

where α_g is a set of fixed effects for the group you are in (“treated”, i.e. the “variant” group, or “untreated”, i.e. the “control” group) and α_t is a set of fixed effects for the time period you are in (before treatment and after treatment). Treated is then a binary variable indicating that you are being treated right now (i.e. you are in a treated city in the after-treatment period). The coefficient on Treated is your difference-in-differences effect [3].

Initialization

First, let’s import the necessary libraries and our base data. The data contains the number of transacting users in a particular city and week.

import pandas as pd
import matplotlib.pyplot as plt
import linearmodels as lm

df = pd.read_csv('csv/campaign.csv')
df.head()

Output:

We make some adjustments to the data frame to match the OLS equation: we set the city and week_index columns as the index of the data frame (these are the fixed effects in the OLS equation) and generate a binary treated column, with 1 indicating treated cities in the after-treatment period.

# Divide the cities based on the treatment (whether they are treated or not)
treated_cities = ['medan', 'bandung']
untreated_cities = ['payakumbuh', 'lampung', 'mojokerto']

# Transform the dataframe to match the equation
cities_df = df[df['city'].isin(treated_cities + untreated_cities)].copy()
cities_df['variant_cities'] = cities_df['city'].isin(treated_cities)
cities_df['after'] = cities_df['week_index'] > 3
cities_df['treated'] = 1*(cities_df['variant_cities'] & cities_df['after'])

# Set city and week as the index of our data
ols_df = cities_df.set_index(['city', 'week_index'])
ols_df.head()

Output:

Calculate DiD Estimation

Using the PanelOLS function from the linearmodels package, the DiD estimate appears as a parameter in the output. With a p-value < .01 (small enough to be considered statistically significant), we can reject the null hypothesis that the DiD estimate is 0. The estimate is around -132, which indicates that the campaign did not manage to increase the number of transacting users.

# Set the formula for the OLS regression
mod = lm.PanelOLS.from_formula(
    'wtu ~ treated + EntityEffects + TimeEffects', ols_df)

# Specify clustering when we fit the model
clfe = mod.fit(cov_type='clustered', cluster_entity=True)
print(clfe)

Output:

Check the Parallel Trend Using a Placebo Test

The idea of the placebo test is to find out whether the treated and untreated groups already had differing trends in the lead-up to the period when the treatment occurred. The steps to do this are:

  1. Use only the data that came before the treatment went into effect.
  2. Pick a fake treatment period.
  3. Estimate the same difference-in-differences model using the fake treatment period.
  4. If you find an “effect” for that treatment date where there really shouldn’t be one, that’s evidence that there’s something wrong with your design, which may imply a violation of parallel trends.

# Keep only the pre-treatment data
placebo_df = cities_df[cities_df['week_index'] <= 4].copy()

# Transform dataframe to match the equation
placebo_df['variant_cities'] = placebo_df['city'].isin(treated_cities)
placebo_df['fake_after1'] = placebo_df['week_index'] > 2
placebo_df['fake_after2'] = placebo_df['week_index'] > 3
placebo_df['fake_treated1'] = 1*(placebo_df['variant_cities'] & placebo_df['fake_after1'])
placebo_df['fake_treated2'] = 1*(placebo_df['variant_cities'] & placebo_df['fake_after2'])

# Set our individual and time (index) for our data
placebo_ols = placebo_df.set_index(['city','week_index'])

# Run the same model as before,
# but with our fake treatment variables
mod1 = lm.PanelOLS.from_formula(
    'wtu ~ fake_treated1 + EntityEffects + TimeEffects', placebo_ols)
mod2 = lm.PanelOLS.from_formula(
    'wtu ~ fake_treated2 + EntityEffects + TimeEffects', placebo_ols)

clfe1 = mod1.fit(cov_type='clustered', cluster_entity=True)
clfe2 = mod2.fit(cov_type='clustered', cluster_entity=True)

print(clfe1)
print(clfe2)

Output:

The DiD estimates are near zero and the p-values > 0.05 indicate there is no DiD effect, which is exactly as it should be, since no campaign was running in those periods.

Dynamic Treatment Effect

So far we have assumed that we are only dealing with two periods, before and after the treatment. Although we can include as many periods as we need, we still only estimate a single effect for the entire “after” period. Suppose we want to know the effect on a weekly basis, to see whether the treatment becomes more or less effective over time. To answer this, we can modify the DiD method to allow for dynamic treatment effects, which lets us see the effect of a treatment in specific periods (e.g. daily, weekly, monthly, yearly, etc.).

viz_df = df[df['city'].isin(treated_cities + untreated_cities)].copy()
viz_df['variant_cities'] = viz_df['city'].isin(treated_cities)

# Create our interactions by hand,
# skipping week 4, the last one before treatment
for i in [1, 2, 3, 5, 6]:
    name = 'INX' + str(i)
    viz_df[name] = 1*viz_df['variant_cities']
    viz_df.loc[viz_df['week_index'] != i, name] = 0

viz_df = viz_df.set_index(['city', 'week_index'])
mod = lm.PanelOLS.from_formula(
    'wtu ~ INX1 + INX2 + INX3 + INX5 + INX6 + EntityEffects + TimeEffects',
    viz_df)

# Specify clustering when we fit the model
clfe = mod.fit(cov_type='clustered', cluster_entity=True)

# Get coefficients and CIs
res = pd.concat([clfe.params, clfe.std_errors], axis=1)
# Scale standard error to CI
res['ci'] = res['std_error']*1.96

# Add our week values
res['week_index'] = [1, 2, 3, 5, 6]
# And add our reference period (week 4) back in
reference = pd.DataFrame([[0, 0, 0, 4]],
                         columns=['parameter', 'std_error', 'ci', 'week_index'])
res = pd.concat([res, reference])

# For plotting, sort and add labels
res = res.sort_values('week_index')
res['week'] = list(df.sort_values('week')['week'].unique())

# Plot the estimates as connected lines with error bars
plt.figure(figsize=(10, 5))
plt.errorbar(x='week', y='parameter', yerr='ci', data=res)
# Add a horizontal line at 0
plt.axhline(0, linestyle='dashed')
plt.axvline(3, linestyle='dashed', color='red')
plt.ylim(-600, 600)

Output:

As expected, the estimates are near zero in the three pre-treatment periods (to the left of the red dashed line), although the confidence intervals are quite wide. The impact of the campaign shows a consistently negative value in the first and second post-treatment weeks. We need to re-evaluate the campaign we ran, which may not be a good fit for our target cities.

How do we utilize the DiD?

To make our analysis more effective, one of our members (Fahmi Amir) created a DiD Python library. The concept is similar to what we have done before: there are several inputs we need to provide, i.e. the data frame, the lists of treated and untreated groups, the treatment period, and the metric we want to calculate. What is interesting about this library is that it is able to find the best combination of treated and untreated groups based on the results of the placebo test. Here’s an example of how we use it.

from DiD import DiD

# State the inputs
outcome = 'wtu'
treated_group = ['medan', 'bandung']
untreated_group = ['payakumbuh', 'lampung', 'mojokerto']

control_treated_group = treated_group + untreated_group
df = pd.read_csv('csv/campaign.csv')
data_filter = df[df['city'].isin(control_treated_group)]
data_filter.columns = ['week', 'entity_index', 'time_index', 'wtu']

# Start the DiD calculation
did = DiD(
    treated_group=treated_group,
    treatment_period=4,
    outcome=outcome,
    df=data_filter
)

did.fit()

Output:

By calling fit(), the library evaluates all possible combinations of treated and untreated groups. It turns out that our combination of treated and untreated groups is the best one, so we are on the right track! We can also add the dynamic treatment effect calculation with a single function, calc_did_each_period().

# calculate dynamic treatment effect
did.calc_did_each_period()

Takeaway

By only comparing the average number of transacting users before and after we ran the campaign (the treatment), we found that the campaign was not able to increase the number of transacting users. Can we trust that conclusion? Not yet! By using the Difference-in-Differences method (and checking the parallel-trend assumption), we can get a more accurate estimate of how the campaign really affected our targeted cities.

Remember: it is not enough to find the causal effect of an event by only comparing before and after metrics. Correlation != causation, and by using a more controlled method you can avoid under- or over-estimating causal effects.

Credit

Special thanks to Pararawendy Indarjo who helped to proofread this article.

Reference

The storytelling is inspired by this Medium article, and the DiD calculation in Python is based on this book. Some other references we used in the article:

  1. Before-and-after study: comparative studies — GOV.UK
  2. Difference-in-Difference Estimation | Columbia Public Health
  3. Chapter 18 — Difference-in-Differences | The Effect
