Understanding downstream impact calculation as part of causal inference analysis
Using propensity score matching (PSM) or Mahalanobis distance matching (MDM)
What is the downstream impact (DSI) calculation?
Downstream impact (DSI) calculation refers to estimating the effect of a decision we made: a benefit if the effect is positive, a loss if it is negative. Usually we run experiments such as A/B tests to learn the effect a new treatment would have, so we know beforehand whether it brings benefits or losses. But we cannot always run an A/B test before a product is launched at 100%: it can be expensive, there is no guarantee of preventing contamination between the groups (control vs. treatment), and it can disturb the equilibrium, meaning that improving one product on our platform might negatively affect another, similar product.
Under these conditions, we turn to downstream impact (DSI) calculation instead. There are many ways to perform this analysis; in this post I want to share three methods that my team uses to solve problems in our company. We will explore them through an example: investigating the effectiveness of a premium subscription program for a retail company with online and offline stores.
Online and offline retail stores example
Let’s say you work at the company described above and you want to test the effect of a premium subscription program on your business.
First of all, it’s hard to randomize which individual users are chosen for the treatment group. Second, testing this is very expensive: the subscription program itself is not free, and it gives users benefits such as free shipping on online purchases and additional cashback as a percentage of the purchase amount, which would send the cost through the ceiling.
To make the explanation easier to follow, I generated a dummy dataset of the company’s daily transactions.
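For illustration, here is a minimal sketch of how such a dummy dataset could be generated in R. All column names, dates, and distributions here are my own assumptions, not the exact data behind the figures:

```r
# Sketch: dummy users and daily transactions (all names/values hypothetical)
set.seed(42)

n_users <- 1000
users <- data.frame(
  user_id     = 1:n_users,
  subscriber  = rbinom(n_users, 1, 0.3),      # 1 = joined the program after launch
  account_age = round(runif(n_users, 1, 48))  # months since registration
)

n_txn <- 20000
transactions <- data.frame(
  user_id = sample(users$user_id, n_txn, replace = TRUE),
  date    = sample(seq(as.Date("2022-01-01"), as.Date("2022-12-31"), by = "day"),
                   n_txn, replace = TRUE)
)
transactions$purchase_amount <- rlnorm(n_txn, meanlog = 3, sdlog = 0.5)
transactions$period <- ifelse(transactions$date < as.Date("2022-07-01"),
                              "before-launch", "after-launch")
```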
As we can see from Image 3, the difference in daily revenue before and after the subscription program launched is barely noticeable. Image 4 uses boxplots to visualize the spread of revenue by consumer type (a sketch of this plot follows the list):
- `before-launch` vs `after-launch` vs `all-time`: the median is higher before launch, but it increased a little for all-time. Can we say the launch of the subscription program lowered our revenue? At the same time, the all-time view says it increased. Hmm~ 🤔
- `non-subscribers-after-launch` vs `subscribers`: the median is a little higher for subscribers, but based on the previous comparison it is still lower than before launch. Being more specific now: can we say the launch lowered our revenue? The all-time view says it increased, yet splitting into non-subscribers and subscribers suggests revenue decreased. Hmmmm~ 🤔 🤔
- `non-subscribers-all-time` vs `all-time`: the median is higher for all-time, which includes the subscribers. So does the subscription program make us more money or not, and why??? Hmmmmmm~ 🤔 🤔 🤔
I know it’s confusing; my team was confused by this too. So we went looking for a better method to calculate the effect of the subscription program’s launch: does it benefit us, or is it a loss?
Limitations on the traditional approach
Before vs after comparison
- Easily biased due to seasonality
- There are almost NO metrics that are constant over time
- The difference is not always noticeable
- Subscribers’ performance in the ‘After’ period might already have increased/decreased compared to the ‘Before’ period even without the subscription program
Subscribers vs Non-Subscribers
- Selection Bias
- Users who chose to subscribe can already be substantially different from other users who chose NOT to subscribe.
Twin pairing using propensity score matching (PSM)
- Find users from the non-subscriber group (control group) who have characteristics similar to users from the subscriber group (treatment group), a.k.a. the twin pairs
- Similar characteristics are defined by having similar baseline variables/covariates
- Only after finding the twin pairs can we proceed to calculate the impact by comparing the performance of the control and treatment groups.
After doing some research, we decided to use PSM to make a “twin pair” from each user group, `non-subscribers` and `subscribers`. The simple explanation is that we need to:
- Look for twin pairs of `subscribers` and `non-subscribers` that have similar characteristics `before-launch`, such as `purchase_count`, `purchase_amount`, `account_age`, `product_category_affinity`, and many more, depending on the features you have and their relevance to the problem you are trying to solve.
- Let’s say we get A and A’, where A is a `non-subscriber` and A’ is a `subscriber` with characteristics similar to A’s at `before-launch`. We collect as many twin pairs as we can, with the goal of better representing the effect of the subscription program.
- The assumption is that, had there been no subscription program, A’ would have behaved similarly to A. Based on that, we can attribute the difference between A and A’ `after-launch` to the launch of the subscription program. Sounds reasonable, right?
- For the last step, we calculate the average difference between all the pairs (A, C, E, G, I, …) we can find, and voilà, we get the impact estimate of the subscription program’s launch, as sketched below.
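That last step is simple once the pairs exist. A toy sketch, with made-up after-launch revenue numbers purely for illustration:

```r
# Toy twin pairs: after-launch revenue of each subscriber (A', C', ...) next to
# its matched non-subscriber (A, C, ...); all numbers invented for illustration
pairs <- data.frame(
  subscriber_revenue     = c(120, 95, 140, 110),
  non_subscriber_revenue = c(100, 90, 125, 115)
)
impact <- mean(pairs$subscriber_revenue - pairs$non_subscriber_revenue)
impact  # average per-pair difference = estimated impact of the launch
```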
Baseline variables: Covariates
- Covariates are variables that affect users to receive treatment and also impact the target metrics.
- Covariates selection is driven by domain knowledge.
- After finding twin pairs with similar values on their covariates, we can focus on the impact of joining PLUS (the subscription) or not on user spend.
The next question is: how can we decide which users make the best twin pair? One method is propensity score matching (PSM). The idea is to generate a score for each `non-subscriber` and `subscriber`; we then set two users as a twin pair if their scores are close enough, while balancing the matched result on each variable used to generate the score.
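Under the hood, a propensity score is just the predicted probability of receiving the treatment given the covariates, usually from a logistic regression. A minimal sketch, assuming a hypothetical `user_features` table with the covariates from earlier and a 0/1 `subscriber` column:

```r
# Propensity score: P(subscriber = 1 | covariates) via logistic regression
ps_model <- glm(subscriber ~ purchase_count + purchase_amount +
                  account_age + product_category_affinity,
                data = user_features, family = binomial())
user_features$pscore <- predict(ps_model, type = "response")
```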
How to measure distance
After choosing the appropriate covariates, we need to calculate the ‘distance’ between a subscriber and the set of non-subscribers, and find the closest one to become the twin.
When measuring the distance, one usually uses the Euclidean distance, which is really intuitive to understand. But it suffers from at least two problems:
- It calculates the distance in arbitrary units
- It doesn’t compensate for correlation between the variables
It’s really easy to do this with the MatchIt package in R, a nonparametric preprocessing tool for parametric causal inference. The package and how to use it are explained very clearly in its documentation. In short, it uses logistic regression to generate the propensity score, which is then used to decide the twin pairs across the `non-subscribers` and `subscribers` groups.
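A minimal sketch of the MatchIt call, again assuming the hypothetical `user_features` table and covariates from earlier:

```r
library(MatchIt)

# Nearest-neighbor matching on a logistic-regression propensity score
m_psm <- matchit(subscriber ~ purchase_count + purchase_amount +
                   account_age + product_category_affinity,
                 data     = user_features,
                 method   = "nearest",
                 distance = "glm")  # propensity score from logistic regression

summary(m_psm)                   # balance diagnostics, including SMDs
matched_df <- match.data(m_psm)  # the twin pairs, ready for impact calculation
```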
However, the PSM approach has limitations as a matching method: it has been criticized many times and has become less recommended for inferring causality. The main reason we looked for other methods is that our balancing results were still not good enough: most of the baseline variables had a Standardized Mean Difference (SMD) near or beyond ±0.1, which indicates that imbalance remained in our data. One possible reason for the poor matching result is that our propensity score model was not good enough.
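For reference, the SMD of a covariate is the difference in group means divided by a pooled standard deviation. A quick sketch of the common formula:

```r
# Standardized mean difference for one covariate between treatment and control
smd <- function(x_treat, x_control) {
  pooled_sd <- sqrt((var(x_treat) + var(x_control)) / 2)
  (mean(x_treat) - mean(x_control)) / pooled_sd
}

# |SMD| near or above 0.1 signals that imbalance remains
smd(rnorm(100, mean = 1.2), rnorm(100, mean = 1.0))
```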
Twin pairing using the Mahalanobis distance method (MDM)
To overcome the previously mentioned problem, we use another method called Mahalanobis distance matching (MDM). The main difference between MDM and PSM is that we no longer use a proxy variable (the propensity score); instead, we use all the baseline variables directly and calculate the distance between each `subscriber` and the `non-subscriber` users. The distance calculated is not a regular Cartesian distance but a Mahalanobis distance.
Instead of the Euclidean distance, we can use the Mahalanobis distance, which has the following advantages:
- It calculates distance as a unitless measure by normalizing the distance with the covariance matrix. Similar to calculating Z-score but for multi-dimensional space.
- Since it has been normalized with the covariance matrix, the same Euclidean distance will have a smaller Mahalanobis distance if it is in line with the correlation direction, compared to the same Euclidean distance but perpendicular to the correlation direction.
In short, the Mahalanobis distance is a Z-score generalized to two or more dimensions. It is particularly useful when there is strong covariance between the variables, i.e., when they are strongly correlated. If a high correlation exists, the same Cartesian distance between two pairs of points can correspond to different Mahalanobis distances: two points separated along the direction of the correlation have a smaller Mahalanobis distance than two points separated perpendicular to it. This yields a fairer distance between points, which in our implementation resulted in a much better twin-pair estimation. A `subscriber`’s twin is the `non-subscriber` with the closest Mahalanobis distance.
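In R, `stats::mahalanobis` computes the distance directly, and MatchIt can match on it instead of a propensity score. A sketch, with the same hypothetical table and covariates as before:

```r
# The distance itself: normalized by the covariance matrix of the covariates
X  <- user_features[, c("purchase_count", "purchase_amount",
                        "account_age", "product_category_affinity")]
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))  # squared distances

# MDM: same MatchIt call as before, but pairing on Mahalanobis distance
library(MatchIt)
m_mdm <- matchit(subscriber ~ purchase_count + purchase_amount +
                   account_age + product_category_affinity,
                 data     = user_features,
                 method   = "nearest",
                 distance = "mahalanobis")
```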
Using MDM, we were able to generate a more accurate matching, indicated by a minimal SMD for every baseline variable. As can be seen in the love plot below, all of the baseline variables have an SMD close to 0.
Evaluate the matching results
After finding the twin pairs, we can evaluate how good our matching process is in several ways:
- Statistical graphs such as density/box plots showing the before- and after-matching conditions
- Comparing the Standardized Mean Difference (SMD) before and after matching
A successful matching process is indicated by similar density plots for control and treatment on each covariate, and by post-matching SMDs ranging from -0.1 to 0.1. A sketch of these checks follows.
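The cobalt package (a companion to MatchIt) produces these diagnostics directly; a sketch using the hypothetical `m_mdm` object from the previous section:

```r
library(cobalt)

# Balance table: mean differences before vs. after matching, flagged at |0.1|
bal.tab(m_mdm, stats = "mean.diffs", thresholds = c(m = 0.1))

# Love plot: one SMD per covariate, unmatched vs. matched
love.plot(m_mdm, stats = "mean.diffs", abs = TRUE, thresholds = c(m = 0.1))

# Density plot of a single covariate before and after matching
bal.plot(m_mdm, var.name = "purchase_amount", which = "both")
```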
Calculating Impact
Finally, with a good matching result in hand, we can proceed to calculate the impact of the subscription program (the treatment) on the outcome, without interference from the baseline variables.
To estimate the impact, we could run one of the following (both sketched below):
- t-test for difference in means
- g-computation
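A sketch of both options on the matched sample, continuing with the hypothetical objects from earlier (a fuller analysis would also use the matching weights returned by `match.data`):

```r
matched_df <- match.data(m_mdm)

# 1) t-test for the difference in means between subscribers and their twins
t.test(revenue_amount ~ subscriber, data = matched_df)

# 2) g-computation: fit an outcome model, then contrast predictions with
#    everyone set to treated vs. everyone set to untreated
out_model <- lm(revenue_amount ~ subscriber + purchase_count + purchase_amount +
                  account_age + product_category_affinity,
                data = matched_df)
y1 <- predict(out_model, transform(matched_df, subscriber = 1))
y0 <- predict(out_model, transform(matched_df, subscriber = 0))
mean(y1 - y0)  # estimated average treatment effect
```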
Additional sharing: the synthetic control method (SCM), another part of causal inference analysis
SCM is a statistical method used to estimate causal effects from binary treatments on observational panel (longitudinal) data. SCM is a technique to create an artificial control group by taking a weighted average of untreated units in such a way that it reproduces the characteristics of the treated units before the intervention (treatment).
The synthetic control acts as the counterfactual for a treated unit, and the treatment effect estimate is the difference between the observed outcome in the post-treatment period and the synthetic control’s outcome. SCM allows us to do causal inference analysis when we have as few as one treated unit and many control units observed over time. The untreated units are combined to create a synthetic unit, or synthetic control.
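To make the weighted-average idea concrete, here is a toy sketch of finding the synthetic control weights in base R; in practice you would use a dedicated package (see the Synth/SparseSC references below) rather than this bare-bones version:

```r
# Find non-negative weights summing to 1 over control units such that their
# weighted average reproduces the treated unit's pre-treatment trajectory.
# X0: (pre-treatment periods x control units) matrix; x1: treated unit's vector.
scm_weights <- function(X0, x1) {
  J <- ncol(X0)
  loss <- function(theta) {
    w <- exp(theta) / sum(exp(theta))  # softmax keeps w >= 0 and sum(w) = 1
    sum((x1 - X0 %*% w)^2)
  }
  fit <- optim(rep(0, J), loss, method = "BFGS")
  exp(fit$par) / sum(exp(fit$par))
}

# Toy usage: 10 pre-treatment periods, 5 control units
set.seed(1)
X0 <- matrix(rnorm(50), nrow = 10)
x1 <- X0 %*% c(0.6, 0.4, 0, 0, 0)  # treated unit is a mix of controls 1 and 2
w  <- scm_weights(X0, x1)
synthetic <- X0 %*% w              # counterfactual pre-treatment trajectory
```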
For the best explanation of this method, read https://towardsdatascience.com/understanding-synthetic-control-methods-dd9a291885a1. I learned a lot from it, and I don’t think there is a better explanation of SCM.
Conclusion
In this article, we have explored better approaches to calculating downstream impact:
- Propensity score matching (PSM), which we then improved by using
- Mahalanobis distance matching (MDM)
- Additionally, the synthetic control method (SCM)
The advantage of these methods for calculating the effect of changes in your product is that they carefully select which users to compare, pairing them as twins before calculating the effect. And yes, of course, they can be your primary option when A/B testing is not feasible for measuring the effect of a change you made to your product.
Thank you for reading!
Also, I want to thank Abdul Rachim Winata, Ahmad Yusuf Albadri, Philip Thomas, Rajeev NCSTR, Gaurav Khanna, and many others who helped my team learn and use these methods to solve many problems in our company, and who helped review and give feedback on this article.
I am learning to write, mistakes are unavoidable, even when I try my best. If you find any problems/mistakes, please let me know!
References
- Matteo Courthoud, Understanding Synthetic Control Methods, 2022, published in Towards Data Science.
- Ahmad Yusuf Albadri, The Idea of Synthetic Control for Assessing Causal Effect on a Temporal Data with One Treated Unit, 2023.
- https://bookdown.org/mike/data_analysis/synthetic-control.html
- Estimating the Treatment Effect
- Identifying the estimand
- Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples
- Why don’t people report the accuracy, PPV, or NPV of their propensity score models?
- Agrawal, A. 2022. Causal Inference with Synthetic Control Using Python and SparseSC.
- Alves, M. F. 2022. Causal Inference for the Brave and True, Chapter 15: Synthetic Control.