Inferring Causality from Observational Data: Hands-On Introduction

Eddo Putradipura
Bukalapak Data
Oct 28, 2021

In the previous articles in the Bukalapak Data blog series, we talked a lot about A/B test analysis and how to use it to assess the results of a feature change on a subset of sample users. Based on the result, we then decide whether to roll the change out to users at large.

However, we sometimes face circumstances where an A/B test is not applicable because we cannot create this “simulation environment” for some features of interest. Or perhaps even worse, imagine someone from the product team asked you to measure the impact of a new product they released just a week ago. How would you analyse this?

In this article, we are going to touch gently on the topic of causal inference, focusing on the analysis of treatment effects on observational data. In contrast to A/B test data, observational data is something over which we have no control at all, whereas an A/B test requires you to carefully assign your sample audience to control and treatment groups that are as comparable as possible. Later on, we will discuss matching as one of the methods to approximate the treatment effect.

Study Case: Thematic Campaign

Say that the business team of an e-commerce company is planning an upcoming campaign on motorcycle and bicycle products for a short period of time. During the campaign, they put the campaign ads on a primary banner on the app homepage, as well as sending marketing emails and push notifications to make users aware of the campaign.

Once a user clicks on one of these channels, they are redirected to the campaign landing page where all featured and selected products are showcased, some of them priced lower than usual.

We have access to the data of users who clicked these marketing channels and their transaction records after they visited the landing page within the campaign period (let’s call them the treated group). We also have similar data for a subset of users who did not interact with the campaign materials at all (and we call this one the untreated group). The data roughly looks like this:

Figure 1. Campaign visitor raw data illustration.

Now, to measure the impact, we might be tempted to do one or more of the following things:

  1. Subtract the average spending of the users who didn’t click the campaign from that of the users who did, easy-peasy right?
  2. Or run a parametric t-test or its non-parametric counterpart, depending on how the data looks,
  3. Or, even better, kill two birds with one stone: regress total spending on a dummy variable for group membership via least squares. The estimated intercept and slope then give us the average difference as well as a p-value to assess its significance (a quick sketch of all three options follows below).
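
To make these options concrete, here is a minimal sketch of what they could look like in Python. It assumes a DataFrame df with a binary clicked column and a spending column, such as the synthetic data we generate later in this article, and that scipy and statsmodels are available:

from scipy import stats
import statsmodels.api as sm
# Option 1: plain difference in group means
naive_diff = (df.loc[df.clicked == 1, 'spending'].mean()
              - df.loc[df.clicked == 0, 'spending'].mean())
# Option 2: a two-sample t-test (Welch's version, not assuming equal variances)
t_stat, p_val = stats.ttest_ind(df.loc[df.clicked == 1, 'spending'],
                                df.loc[df.clicked == 0, 'spending'], equal_var=False)
# Option 3: least squares with the group dummy as regressor; the slope equals
# the difference in means and comes with its own p-value
ols = sm.OLS(df['spending'], sm.add_constant(df['clicked'])).fit()
print(naive_diff, p_val, ols.params['clicked'], ols.pvalues['clicked'])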

While these methods are valid ways to evaluate the impact of the campaign, our intuition should tell us there is a missing step before computing the treatment effect: are the treated and the untreated comparable to begin with?

Randomisation and Selection Bias

When running an A/B test, one of the most important steps is to assign our audience randomly to the control and treatment groups. Otherwise, the result will be an unreliable estimate due to bias that might arise within the groups.

Figure 2. Causal diagram of the thematic campaign. Explanation: having an interest in particular categories will make users attracted to the campaign (A ➝ B) or make them purchase products in the associated categories directly, regardless of the campaign (A ➝ C), while visiting the campaign will likely cause users to purchase some products they’ve seen on the landing page (B ➝ C).

Now, during the campaign, users who were attracted to the ads might be users who had a strong affinity for bicycle/motorcycle products in the first place, whereas users who didn’t pay any attention to the ads might simply have no interest in the products offered in the campaign at all. This is illustrated in Figure 2 above:

  1. We expect a user who clicked the campaign marketing channels to proceed to make a purchase afterwards, since they should be tempted to make one once they visit the landing page, as represented by the flow from point B → C.
  2. However, if a person already has that interest, they will either have a strong tendency to click on the campaign channels (A → B) or even make a purchase directly, with or without considering the campaign (A → C).

This gives rise to the selection bias problem: under this scenario, a randomised sample of users across the two groups possibly does not exist. Our goal is to eliminate this potential bias when analysing the campaign impact so that we can get an accurate estimate, and this is where the three direct methods mentioned before (comparing means, a t-test, or least squares estimation) fall short.

Fortunately, causal inference techniques are available and we can make a good use out of them on top of classical statistical techniques. For this occasion, we’ll introduce matching to tackle this problem.

Matching principles

Matching, as the name suggests, aims to find the closest match for each member of the treated group within the untreated group, based on the similarity of their attributes. This way, we can make the treated and untreated groups comparable. However, we need to determine these attributes first.

Consider again the previous data, but now it also includes several attributes that might affect users’ interest with the campaign: total number of motorcycle and bicycle transactions for the last 6 months, gender, and account age.

These attributes are what we call confounding variables. In real life, finding suitable confounders that fit both the treatment type and the metrics we are looking into is a challenging task, and there can be hundreds of them. For illustrative purposes, I only provide three for now. Then, let us generate hypothetical data as follows:

# numpy and pandas are used throughout the code in this article
import numpy as np
import pandas as pd
# Set seed for reproducible outputs
np.random.seed(897)
# Create hypothetical data with 100 observations
df = pd.DataFrame({'user_id': np.random.randint(111111, 999999, 100),
                   'clicked': [1]*20 + [0]*80,
                   'past_trx': np.hstack((np.random.randint(30, 60, 20), np.random.randint(0, 150, 80))),
                   'gender': np.random.choice(2, 100),
                   'age': np.random.randint(10, 650, 100),
                   'spending': np.random.choice(np.arange(0, 100, 0.01)*10000, 100)})
Figure 3. Closer look into the synthetic campaign data.
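
As a quick sanity check of the split between the two groups, we can simply count the rows per value of clicked:

# 20 treated (clicked = 1) versus 80 untreated (clicked = 0)
print(df.groupby('clicked').size())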

We have 100 observations, 20 of which form the treated group. Now, our objective is to find the best match among the 80 untreated members for each of the treated, and after that we can estimate the average effect of the treatment. But before that, let’s get to know some mathematical notation and terminology that is widely used in causality studies.

Mathematical Terms

One measure of interest that we are after is the average treatment effect (ATE), which is the average outcome difference between the treated and the untreated. The estimator is denoted as

ATE = E[Y¹ − Y⁰] = E[Y¹] − E[Y⁰],

where:

  1. E[…] is the expectation operator, which takes the average of a random variable,
  2. Y¹ and Y⁰ are the potential outcomes under treatment and no treatment, which in this case are the total spending during the campaign period.

We already know that the untreated might have different characteristics from the treated, so computing the above quantity directly is somewhat invalid.

However, one alternative is to compute the average treatment effect on the treated (denoted ATT), which is formally stated as (where T = 1 represents the treatment group)

ATT = E[Y¹ − Y⁰ | T = 1],

which implies that we only make the calculation over the treated group. Keep in mind the following terms whenever we talk about causal inference:

  1. Y¹ | T = 1, which is called the factual; it is the spending of the treated,
  2. Y⁰ | T = 1, which is called the counterfactual; it is the spending of the treated had they not clicked the campaign channels.

However, we cannot quantify the counterfactuals in real life, because they never happen, hence the name counterfactual. This is what is commonly called the fundamental problem of causal inference, which roughly says: we will never be able to observe both Y¹ and Y⁰ for the same unit, as one of them only ever exists in a parallel universe. For instance, for a treated user who spent, say, 50,000 IDR during the campaign, their factual Y¹ is the observed 50,000 IDR, while their counterfactual Y⁰, what they would have spent had they never clicked, can never be observed.

Now, the matching technique comes in to solve this problem, specifically to approximate Y⁰ | T = 1. Intuitively, the ATT estimator can be expressed as simply as

ATT̂ = (1 / N_treated) Σᵢ (Y_i − Y_j(i)), with the sum taken over the treated units,

where

  1. N_treated is the number of treated observations,
  2. Y_i is the spending of the i-th treated observation,
  3. Y_j(i) is the spending of the untreated observation j(i) that matches the i-th treated one (a toy numeric example follows).
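
As a toy illustration with made-up numbers (not the data generated above): suppose three treated users spent 70,000, 50,000 and 80,000 IDR, while their matched untreated twins spent 60,000, 55,000 and 60,000 IDR.

# Toy illustration with made-up numbers: spending (IDR) of three treated
# users and of their matched untreated "twins"
y_treated = np.array([70_000, 50_000, 80_000])
y_matched = np.array([60_000, 55_000, 60_000])
# The ATT estimate is the average of the pairwise differences
print(np.mean(y_treated - y_matched))  # (10000 - 5000 + 20000) / 3 = 8333.33 IDR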

Finding the matches itself can be quite daunting. One common way is to find the nearest neighbour of each unit using the Euclidean norm as a distance measure. This means we need to scale the covariates before applying the norm, as variables like account age would otherwise carry more weight than the number of past transactions, simply because of their larger scale. A simple way to do this is sketched below.
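
For instance, a minimal standardisation step could look like the following (the names X, y, and df_norm introduced here are reused in the code later on):

# Confounder columns and the outcome column
X = ['past_trx', 'gender', 'age']
y = 'spending'
# Standardise the confounders to zero mean and unit variance so the
# Euclidean distance treats each of them on an equal footing
df_norm = df.copy()
df_norm[X] = (df_norm[X] - df_norm[X].mean()) / df_norm[X].std()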

After we find the “twin” of each of the treated and compute the ATT estimator, we also need to present the confidence interval. It can be shown that the variance estimator is

Var̂(ATT̂) = (1 / N_treated) Σ_{i: T_i = 1} (Y_i − Y_j(i) − ATT̂)² + (1 / N_treated) Σ_{i: T_i = 0} K_i (K_i − 1) σ̂²(X_i, T_i = 0),

where the second term only contributes when some untreated members are used more than once as a match to the treated; otherwise K_i (K_i − 1) = 0 and we are left with the first term alone. Here K_i is the number of times that observation i in the untreated group is used as a match. In turn, σ̂²(X_i, T_i = 0) can itself be estimated via matching, using

σ̂²(X_i, T_i = 0) = (Y_i − Y_j(i))² / 2,

where j(i) is the untreated observation whose characteristics are most similar to those of untreated observation i. This quantity is an unbiased estimator of the conditional variance, since E[(Y_i − Y_j(i))²] = 2σ² when the two observations share the same conditional distribution.

Results

Let’s dive into the data now! Firstly, we find the match of each of the treated units using the nearest neighbour algorithm with Euclidean norm by having the confounders to be scaled beforehand.

from sklearn.neighbors import KNeighborsRegressor
treated = df_norm[df_norm.clicked == 1]
untreated = df_norm[df_norm.clicked == 0]
# Fit the data using a KNN regressor where X are the confounders and y is the spending
nn = KNeighborsRegressor(n_neighbors=1).fit(untreated[X], untreated[y])
# Find the match of each treated unit using the fitted KNN model
match_idx = nn.kneighbors(treated[X])[1].flatten()
matched = df[df.clicked == 1].assign(user_id_untreated=untreated.iloc[match_idx]['user_id'].values,
                                     spending_untreated=nn.predict(treated[X]))
# Normally you'll only need the predict method, but we also bring in the matched
# confounders so we can compare the groups visually
matched = matched.merge(df, left_on=['user_id_untreated', 'spending_untreated'],
                        right_on=['user_id', 'spending'], suffixes=('', '_match'))
matched = matched.drop(['user_id_untreated', 'spending_untreated'], axis=1)
cols = sorted([c for c in matched.columns if 'user_id' not in c and 'clicked' not in c])
matched = matched[['user_id', 'user_id_match'] + cols]
matched.sample(10)

And the matching results should look like the following data frame. Do you think they match quite well?

Figure 4. Matching results using nearest-neighbour algorithm.

Rather than inspecting the matching results one by one, we can check the confounders’ distributions before and after the matching through the density plots below:

Figure 5. Past transactions distribution of the treated and untreated groups before and after the matching process.
Figure 6. Account age distribution of the treated and untreated groups before and after the matching process.
Figure 7. Gender distribution of the treated and untreated groups before and after the matching process.
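
For reference, here is a minimal sketch of how one of these before/after density plots could be drawn, shown for past_trx and assuming matplotlib and seaborn are available:

import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
# Before matching: all treated versus all untreated
sns.kdeplot(df[df.clicked == 1]['past_trx'], ax=axes[0], label='treated')
sns.kdeplot(df[df.clicked == 0]['past_trx'], ax=axes[0], label='untreated')
axes[0].set_title('Before matching')
axes[0].legend()
# After matching: the treated versus their matched untreated twins
sns.kdeplot(matched['past_trx'], ax=axes[1], label='treated')
sns.kdeplot(matched['past_trx_match'], ax=axes[1], label='untreated (matched)')
axes[1].set_title('After matching')
axes[1].legend()
plt.show()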

While the gender distribution looks barely changed, both the account age and past transactions distributions of the untreated now appear to mimic those of the treated. Finally, we can compute the estimated ATT along with its confidence interval.

# Compute the estimated ATT
est_att = np.mean(matched['spending'] - matched['spending_match'])
# To compute the variance of the estimate, first list down the untreated
# matches that were used more than once
multiple_matches = (matched[matched.duplicated(subset=['user_id_match'], keep=False)]
                    .groupby('user_id_match', as_index=False).size()
                    .rename(columns={'user_id_match': 'user_id'})
                    .merge(untreated, on='user_id'))
# Then find the twin of these "used more than once" untreated units among
# the remaining untreated group
unused = untreated[~untreated.user_id.isin(multiple_matches['user_id'])]
nn_var = KNeighborsRegressor(n_neighbors=1).fit(unused[X], unused[y])
multiple_matches['spending_match'] = nn_var.predict(multiple_matches[X])
# Compute the variance, breaking it down into the two terms
first_term_var = np.mean((matched['spending'] - matched['spending_match'] - est_att)**2)
second_term_var = sum(multiple_matches['size']*(multiple_matches['size'] - 1)
                      *(multiple_matches['spending'] - multiple_matches['spending_match'])**2/2)/len(treated)
est_var = first_term_var + second_term_var
# Finally, compute the 95% confidence interval with n1 + n2 - 2 degrees of freedom
from scipy.stats import t
lb = est_att + t.ppf(0.025, 2 * len(treated) - 2) * np.sqrt(est_var)
ub = est_att - t.ppf(0.025, 2 * len(treated) - 2) * np.sqrt(est_var)
print("The estimated ATT is {} IDR".format(est_att))
print("The 95% confidence interval is between {0} and {1}".format(lb, ub))

which should output an estimated ATT of 32,595 IDR with 95% confidence interval between -1,067,277 IDR and 1,132,894 IDR.

Then we can report: “we estimate the causal effect of the thematic campaign to be 32,595 IDR higher, but we cannot be certain, since it might be due to a statistical fluke”.

Still, we have done much better than using the three tempting options earlier, given the nature of the observational data we have, and the point estimate suggests the campaign brought a beneficial effect to the company.

However, notice that the confidence interval is quite wide, both because of the small synthetic dataset we used and because the spending of the treated and the untreated groups was generated from the same distribution, which makes the difference indistinguishable. We would expect a narrower confidence interval (and hence a more reliable ATT estimate) on a larger dataset in which the two groups have genuinely contrasting spending distributions.

Conclusion

Congratulations, you made it to the end! You have finally familiarised yourself with an observational data case example that can be solved with a causal inference technique!

There is actually one more part to discuss: the ATT estimator above is still biased due to the matching discrepancies we had, and it needs a correction. I leave it to you to read through the references linked below and edit the code above, in case you are keen to learn further about it.

Furthermore, matching has a limitation: it suffers from the curse of dimensionality as the number of confounding variables grows. Later on, we will discuss another method to overcome this issue. Stay tuned!

References

Scott Cunningham. Causal Inference: The Mixtape. https://mixtape.scunning.com/index.html

Matheus Facure Alves. Causal Inference for the Brave and True. https://matheusfacure.github.io/python-causality-handbook/landing-page.html

Miguel A. Hernán, James M. Robins. Causal Inference: What If. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

Credit

Special thanks to Pararawendy Indarjo who helped to proofread this article.
