Understanding downstream impact calculation as part of causal inference analysis

Using propensity score matching (PSM) or Mahalanobis distance matching (MDM)

Mochamad Kautzar Ichramsyah

Photo by Alina Grubnyak on Unsplash

What is the downstream impact (DSI) calculation?

Downstream impact (DSI) calculation refers to measuring the effect that results from a decision we made: a benefit if the effect is positive, or a loss if it is negative. Usually, we use experiments such as A/B testing to learn the effect a new treatment will have, so we know beforehand whether it will bring a benefit or a loss. However, we cannot always run an A/B test before launching a product to 100% of users, for reasons such as cost, no guarantee against contamination between the groups (control vs treatment), and the risk of disturbing the equilibrium: if we improve a specific product on our platform, it might negatively affect other, similar products.

In these conditions, we fall back on the downstream impact (DSI) calculation described above, and there are many methods to perform this analysis. In this post, I want to share three methods my team uses to solve problems in our company. We will explore them through an example: investigating the effectiveness of a premium subscription program for a retail company with online and offline stores.

Online and offline retail stores example

Photo by Austin Distel on Unsplash

Let’s say you are working at the company mentioned above and you want to measure the effect of a premium subscription program on your company.

First of all, it’s hard to randomize which individual users are chosen for the treatment group. Second, it’s very expensive to test: the subscription program itself is not free, and it gives users extra benefits such as free shipping on online purchases and additional cashback as a percentage of the amount purchased, which sends the cost through the ceiling.

I generated a dummy dataset of daily transactions in the company to make the explanation easier to follow.
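
If you want to follow along, here is a minimal sketch of how such a dummy dataset could be simulated in R. The column names (transaction_date, user_id, is_subscriber, revenue_amount) are assumptions for illustration, not the exact schema used in this post.

```r
set.seed(2023)
n <- 5000
dates <- seq(as.Date("2023-03-01"), as.Date("2023-08-31"), by = "day")

# one row per transaction, with a random date, user, and revenue amount
transactions <- data.frame(
  transaction_date = sample(dates, n, replace = TRUE),
  user_id          = sample(1:1000, n, replace = TRUE),
  revenue_amount   = round(rlnorm(n, meanlog = 4, sdlog = 0.5), 2)
)

# flag roughly 30% of users as subscribers after the 1 June 2023 launch
subscriber_ids <- sample(1:1000, 300)
transactions$is_subscriber <- as.integer(transactions$user_id %in% subscriber_ids)

head(transactions)
```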

Image 1 by author. PS: In real-world cases, the dataset will not be as simple as this. 😅
Image 3 by author. Is there any difference before and after the subscription program launched on 1 June 2023?
Image 4 by author. The revenue amount boxplot grouped by consumer type

As we can see from Image 3, the difference in daily revenue before and after the subscription program launched is barely noticeable. From Image 4, using boxplots to visualize the spread of the revenue amount for each consumer type:

  1. Before-launch vs after-launch vs all-time: the median is higher before launch, yet the all-time median increased a little. Can we say the launch of the subscription program lowered our revenue, when at the same time the all-time view says it increased? Hmm~ 🤔
  2. Non-subscribers-after-launch vs subscribers: the median is a little higher for subscribers, but based on the previous point it is still lower than before launch. With this more specific view, can we say the launch of the subscription program lowered our revenue? The all-time view says it increased, yet when split into non-subscribers and subscribers it looks like revenue decreased. Hmmmm~ 🤔 🤔
  3. Non-subscribers-all-time vs all-time: the median is higher for all-time, which includes the subscribers. So did the launch of the subscription program make us more money or not, and why??? Hmmmmmm~ 🤔 🤔 🤔

I know it’s confusing. My team got confused by this too, so we went looking for a better method to calculate the effect of the subscription program launch: is it a benefit or a loss?

Limitations on the traditional approach

Before vs after comparison

  1. Easily biased due to seasonality
  2. There are almost NO metrics that are constant over time
  3. The difference is not always noticeable
  4. Subscribers' performance in the ‘After’ period might already have increased/decreased compared to the ‘Before’ period even without the subscription program
Image 5. Before and after program launched compared.

Subscribers vs Non-Subscribers

  1. Selection Bias
  2. Users who chose to subscribe can already be substantially different from other users who chose NOT to subscribe.
Image 6. Bias occurred if we directly compare subscribers and non-subscribers.

Twin pairing using propensity score matching (PSM)

  1. Find users from the non-subscriber group (control group) who have similar characteristics to users from the subscriber group (treatment group), a.k.a. the twin pairs
  2. Similar characteristics are defined by having similar baseline variables/covariates
  3. Only after finding the twin pairs can we proceed to calculate the impact by comparing the performance of the control vs treatment group.
Image 7. Treated vs control, looking for the nearest point.

After doing some research, we decided to use PSM to create a “twin pair” for each user across the two groups, non-subscribers and subscribers. In simple terms, we need to:

  1. Look for twin pairs of subscribers and non-subscribers that have similar characteristics before launch, such as purchase_count, purchase_amount, account_age, product_category_affinity, and so on, depending on the features you have and their relevance to the problem you are trying to solve.
  2. Let’s say we get A and A’, where A is a non-subscriber and A’ is a subscriber with characteristics similar to A before launch. We have to find as many twin pairs as we can, so that we have a better representation of the effect of the subscription program.
  3. The assumption is that, had there been no subscription program, A’ would have behaved similarly to A. Based on that, we can attribute the difference between A and A’ after launch to the launch of the subscription program. Sounds reasonable, right?
  4. As the last step, we calculate the average difference across all the twin pairs we can find (A, C, E, G, I, …), and voila, we get an estimate of the impact of the subscription program launch.

Baseline variables: Covariates

  1. Covariates are variables that affect whether users receive the treatment and also impact the target metrics.
  2. Covariate selection is driven by domain knowledge.
  3. After finding twin pairs with similar values on their covariates, we can focus on the impact of joining PLUS (the subscription) or not on users' spending.
Image 8. Baseline variables (covariates) example to be used

The next question is: how do we decide which users make the best twin pair? This is where propensity score matching (PSM) comes in. The idea is to generate a score for each non-subscriber and subscriber, and then set two users as a twin pair if their scores are close enough, checking that the matched result is balanced on each variable used to generate the propensity score.

How to measure Distance

After choosing the appropriate covariates, we need to calculate the ‘distance’ between a subscriber and the set of non-subscribers and find the closest one to become the twin.

When measuring the distance, one usually uses Euclidean distance, which is really intuitive to understand. But it suffers from at least two problems:

  1. It calculates the distance in arbitrary units
  2. It does not compensate when the variables are correlated with each other
Image 9. Example of how to calculate Euclidean distance

It’s really easy to do this with the MatchIt package in R, which provides nonparametric preprocessing for parametric causal inference. The package and how to use it are explained very clearly in its documentation. In short, it uses logistic regression to generate the propensity scores, which are then used to decide which non-subscriber and subscriber become a twin pair.
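
As a sketch, assuming a user-level data frame df with a subscriber flag and the before-launch covariates mentioned earlier (these names are hypothetical), the PSM step could look like this:

```r
library(MatchIt)

# 1:1 nearest-neighbor matching on a propensity score estimated with
# logistic regression (distance = "glm" is MatchIt's default)
m_psm <- matchit(
  subscriber ~ purchase_count + purchase_amount + account_age +
    product_category_affinity,
  data     = df,
  method   = "nearest",
  distance = "glm"
)

summary(m_psm)                    # balance table: SMDs before vs. after matching
matched_psm <- match.data(m_psm)  # the twin pairs, ready for the impact calculation
```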

But there is a limitation to PSM as a matching approach: it has been criticized many times and has become less recommended for implying causality. The main reason we looked at other methods is that the balancing results were still not good enough. Most of the baseline variables still had a Standardized Mean Difference (SMD) near or beyond +/- 0.1, which indicates that some imbalance remained in our data. One possible reason for the poor matching result is that our propensity score model was not good enough.

Image 10 from here. Explaining the distribution of the propensity scores.
Image 11 from here. Assessing the balance using absolute SMD.

Twin pairing using the Mahalanobis distance method (MDM)

To overcome the problem mentioned above, we use another method called Mahalanobis distance matching (MDM). The main difference between MDM and PSM is that we no longer use a proxy variable (the propensity score); instead, we use all the baseline variables directly and calculate the distance between each subscriber and the non-subscribers. The distance is not a regular Cartesian distance but a Mahalanobis distance.

Image 12. The formula to calculate Mahalanobis distance
Image 13. Visualization of how to calculate Mahalanobis distance
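
Since the formula only appears as an image here, the standard definition for reference: for a subscriber x, a non-subscriber y, and S the covariance matrix of the baseline variables, the Mahalanobis distance is d(x, y) = sqrt( (x − y)ᵀ S⁻¹ (x − y) ).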

Instead of Euclidean distance, we can use Mahalanobis distance, which has the following advantages:

  1. It calculates distance as a unitless measure by normalizing with the covariance matrix, similar to calculating a Z-score but for multi-dimensional space.
  2. Since it is normalized with the covariance matrix, the same Euclidean distance corresponds to a smaller Mahalanobis distance when it lies along the correlation direction than when it is perpendicular to it.

In short, the Mahalanobis distance is like a Z-score for two or more dimensions. This is particularly useful when there is strong covariance between the variables, i.e. they are strongly correlated. When such a correlation exists, the same Cartesian distance between two pairs of points can correspond to different Mahalanobis distances: two points separated along the direction of the correlation have a smaller Mahalanobis distance than two points separated perpendicular to it. This gives a fairer distance between points, which in our implementation resulted in much better twin-pair estimation. A subscriber's twin is the non-subscriber with the closest Mahalanobis distance.
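
To make the correlation-direction point concrete, here is a small simulated illustration using base R's mahalanobis() function (which returns the squared distance); the numbers are made up for the example, not taken from our data:

```r
library(MASS)   # for mvrnorm
set.seed(42)

# two strongly correlated covariates
S <- matrix(c(1, 0.9, 0.9, 1), nrow = 2)
X <- mvrnorm(n = 10000, mu = c(0, 0), Sigma = S)
ctr <- colMeans(X)
cv  <- cov(X)

# two points at the SAME Euclidean distance (1 unit) from the center
p_along <- ctr + c(1, 1) / sqrt(2)    # step along the correlation direction
p_perp  <- ctr + c(1, -1) / sqrt(2)   # step perpendicular to it

sqrt(mahalanobis(rbind(p_along, p_perp), ctr, cv))
# the step along the correlation direction gets a much smaller Mahalanobis distance
```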

Using MDM, we were able to generate a more accurate matching, indicated by minimal SMDs for every baseline variable. As can be seen in the love plot below, all of the baseline variables have an SMD close to 0.
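
As a sketch, assuming the same df and covariates as in the PSM snippet above, switching MatchIt to Mahalanobis distance matching is a one-argument change:

```r
m_mdm <- matchit(
  subscriber ~ purchase_count + purchase_amount + account_age +
    product_category_affinity,
  data     = df,
  method   = "nearest",
  distance = "mahalanobis"   # match directly on the covariates, no propensity score
)

summary(m_mdm)               # the SMDs should now sit much closer to 0
matched <- match.data(m_mdm)
```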

Evaluate the matching results

After finding the twin pairs, we can evaluate how good our matching process is in several ways:

  1. Statistical graphs such as density/box plots to show before and after Matching condition
  2. Comparing the Standardized Mean Difference (SMD) before and after Matching

A successful matching process is indicated by similar density plots between the control and treatment groups on each covariate, and by SMDs after matching ranging from -0.1 to 0.1.

Image 14. SMD comparison before and after matching.
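
With MatchIt, both checks can be read from the fitted object; here is a sketch continuing from the MDM fit above (assuming a recent MatchIt version, where plot(summary(...)) draws a love plot like Image 14):

```r
s <- summary(m_mdm)
s                               # SMD for each covariate, before vs. after matching

plot(s)                         # love plot of absolute SMDs, as in Image 14
plot(m_mdm, type = "density")   # density plots per covariate, treated vs. control
```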

Calculating Impact

Finally, after getting a good matching result, we can proceed to calculate the impact of the subscription program (the treatment) on the outcome, without interference from the baseline variables.

To estimate the impact we could run one of the following:

  1. t-test for difference in means
  2. g-computation
Image 15. Source: Github, kathoffman/causal-inference-visual-guides. How to calculate g-computation.
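
On the matched data, both estimators take only a few lines; the outcome name (revenue_amount) and covariates below follow the hypothetical schema used earlier, and the g-computation is a minimal by-hand version rather than a packaged one:

```r
# 1. t-test on the difference in mean outcomes between the matched groups
t.test(revenue_amount ~ subscriber, data = matched)

# 2. g-computation: fit an outcome model, predict every matched user under both
#    treatment levels, and average the difference
fit <- lm(revenue_amount ~ subscriber + purchase_count + purchase_amount +
            account_age + product_category_affinity,
          data = matched, weights = weights)   # match.data() adds a weights column

pred_treated <- predict(fit, newdata = transform(matched, subscriber = 1))
pred_control <- predict(fit, newdata = transform(matched, subscriber = 0))
mean(pred_treated - pred_control)              # estimated average treatment effect
```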

Additional sharing: Synthetic control method (SCM) also as part of the causal inference analysis

SCM is a statistical method used to estimate the causal effect of a binary treatment on observational panel (longitudinal) data. It creates an artificial control group by taking a weighted average of untreated units in such a way that it reproduces the characteristics of the treated unit before the intervention (treatment).

The synthetic control acts as the counterfactual for the treated unit, and the estimate of the treatment effect is the difference between the observed outcome in the post-treatment period and the synthetic control's outcome. SCM allows us to do causal inference when we have as few as one treated unit and many control units observed over time. These untreated units, combined, create a synthetic unit or synthetic control unit.
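
As a rough sketch of that idea (not a full SCM implementation), the weights can be found by minimizing the pre-treatment gap between the treated unit and a convex combination of the control units; y1_pre and Y0_pre below are hypothetical inputs (the treated unit's pre-period outcomes and a pre-period-by-controls matrix of control outcomes):

```r
# Find non-negative weights that sum to 1 and make the weighted average of the
# control units track the treated unit before the intervention.
synth_weights <- function(y1_pre, Y0_pre) {
  k <- ncol(Y0_pre)
  loss <- function(theta) {
    w <- exp(theta) / sum(exp(theta))        # softmax keeps the weights valid
    sum((y1_pre - Y0_pre %*% w)^2)
  }
  fit <- optim(rep(0, k), loss, method = "BFGS")
  exp(fit$par) / sum(exp(fit$par))
}

# Usage: with weights w, the synthetic control's post-period outcome is
# Y0_post %*% w, and the estimated effect is y1_post - Y0_post %*% w.
```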

For the best explanation of this method, you can read https://towardsdatascience.com/understanding-synthetic-control-methods-dd9a291885a1. I learned a lot from it, and I think there is no better explanation for understanding SCM.

Conclusion

Photo by Priscilla Du Preez on Unsplash

In this article, we have explored how to calculate downstream impact with a better approach:

  1. Propensity score matching (PSM)
  2. Mahalanobis distance matching (MDM), which improves on PSM's balance
  3. Additionally, the synthetic control method (SCM)

The advantage of these methods for calculating the effect of changes in your product is that they carefully choose which users to compare, pairing them as twins before calculating the effect. And yes, of course, they can be your primary option when A/B testing is not feasible for measuring the effect of any change you make to your product.

Thank you for reading!

Also, I want to say thanks to Abdul Rachim Winata, Ahmad Yusuf Albadri, Philip Thomas, Rajeev NCSTR, Gaurav Khanna, and many others who helped my team learn and use these methods to solve a lot of problems in our company, and for their help, review, and feedback on this article.

I am learning to write, and mistakes are unavoidable even when I try my best. If you find any problems/mistakes, please let me know!
