A comprehensive guide on Causal Inference in Retail — Part I

Shivangi Choudhary
9 min read · Nov 12, 2023

--

In this article I will discuss the topics listed below, which form the base for Part 2, where I talk about different causal inference techniques for panel data. Panel data is relevant for real-life retail scenarios where we have daily/weekly sales across various products. Let the learning begin!

Topics

  • What is Causal Inference
  • Why Causal Inference
  • Challenges
  • Assumptions while implementing Causal Inference Techniques
  • Different Causal inference techniques with examples from offline and online retail

What is Causal Inference

Causal inference is one of the ways to establish an accurate relationship between a cause and an effect, since correlation doesn't guarantee causation. Traditionally, causal inference models have been used in fields like health care, political science, biostatistics and social media. Knowing these relationships between cause and effect is important because many business, medical and socio-political decisions are made based on these inferences: whether to give drug A or drug B to a set of patients with similar history, whether to launch campaign A or campaign B in order to optimize profit, whether to amend state taxation laws or not. In this article, I will focus on different causal inference techniques for measuring the attribution of marketing and operational initiatives across retail and e-commerce, industries which use panel data in most scenarios and answer questions like:

  • what would have been the effect on margin if the price markdown recommendations had not been implemented
  • what is the incremental revenue due to a new feature on the website
  • was this initiative able to improve the customer satisfaction score (or any other metric of interest like conversion, issue resolution time, average trip duration)

According to Judea Pearl's ladder of causation, we start our analysis by modeling associations. This is where most machine learning models come into the picture. The next step is to intervene in the system to get incremental benefits with the same budget/resources/effort. And the final stage is creating what-if scenarios, or what I prefer to call synthetic scenarios, in the model, even for cases the system will never observe, in order to uplift revenues further. Causal inference techniques help accurately measure the effect of interventions in the second and third stages of the ladder.

Judea Pearl’s Ladder of causation

Let's dive in!

Why Causal Inference

If we have the luxury of conducting randomized experiments, making sure that we control for all the variables, then we can use A/B tests to draw conclusions about cause and effect. But unfortunately, many times we can't use randomized control trials (RCTs), mostly in situations where:

  • Setting up the experiment might not be possible, e.g., promotional billboards and marketing banners
  • Experiments might take too long to reach conclusions
  • Sometimes it is unethical to set up RCTs, especially in health care
  • State laws, such as taxation, keep changing
  • It might be costly to run an experiment

Some examples from the above categories include offering differential services to a few customers, showing additional features to some visitors on a website based on their demographics or past behavior, or testing a new query-resolution algorithm on some buyers/sellers of an e-commerce site. Hence, alternative techniques are used to draw conclusions about the effect of a treatment. In these scenarios, we want to make conclusions based on observational data.

Causal inference refers to the set of statistical techniques used to determine the true treatment effect when randomized trials are not possible to conduct, i.e., it is used to capture the effect of a treatment in observational studies or clinical trials. Put simply, it tries to capture the effect of an intervention, or treatment, on the outcome.

We can use causal graphs to explain the relationship. For example, in the potential outcomes framework, according to the Rubin–Neyman causal model:

Each unit (Xi) has two potential outcomes:

  • the observed, factual outcome
  • the unobserved, counterfactual outcome

The challenge is that we only have one observation for each individual/unit: each unit either gets the treatment or does not, so we never have an identical set of Xi receiving different treatments. We therefore need to impute the counterfactual (the identical twin's outcome) in order to get the actual treatment effect. This brings us to some of the challenges in implementing causal techniques, which I will cover next.
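To make the missing counterfactual concrete, here is a tiny sketch with a purely hypothetical toy dataset (the units, treatments and numbers are made up for illustration only): each unit has two potential outcome columns, but one of them is always missing.

```python
# A minimal sketch of the potential-outcomes setup on hypothetical toy data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unit":      [1, 2, 3, 4],
    "treated":   [1, 0, 1, 0],                      # W_i: did the unit receive the treatment
    "y_treated": [120.0, np.nan, 95.0, np.nan],     # Y_i(1): observed only if treated
    "y_control": [np.nan, 80.0, np.nan, 60.0],      # Y_i(0): observed only if untreated
})

# The factual outcome is whichever column was observed; the other column is the
# missing counterfactual that causal inference techniques try to impute.
df["factual"] = np.where(df["treated"] == 1, df["y_treated"], df["y_control"])
print(df)
```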

Challenges related to Causal Inference Techniques

  • Confounders — variables that impact both the treatment and the outcome variable, and are not easily visible or intuitive
  • Selection bias — when a unit/individual selected to be part of the treatment group isn't a good representation of the whole population, which can happen due to segments, demography, socio-economic factors, etc.
  • Finding the counterfactual — since we only ever observe one of the two outcomes for any unit/individual, we need to find units as similar to Xi as possible, to make sure that we are comparing apples to apples

Assumptions while implementing causal inference techniques

In order to overcome some of the challenges mentioned above and accurately measure the causal effect, some assumptions are made. This is an attempt to tailor the observational data so that it is as representative as a randomized control sample. There can always be some confounders with unintended effects on the outcome that we can never control for. The assumptions which make inference from observational data possible include:

  • Causal Markov condition — also known as the robustness assumption, it states that a variable X is independent of every other variable (except X's effects) conditional on all of its direct causes (parents)
  • SUTVA (stable unit treatment value assumption) — the potential outcome of a unit depends only on the treatment that unit itself receives (no interference between units), and there is a single version of each treatment, so each unit has exactly as many potential outcomes as there are treatment values (usually two, for a binary treatment)
  • Ignorability — also referred to as the controllability assumption, it states that all the variables (Xi) affecting both the treatment (Ti) and the outcome (Yi) are observed and can be controlled for
  • Common support — also known as the positivity assumption, it states that for every segment or group in the data, all treatment values are present, i.e., every unit has a non-zero chance of receiving each treatment given its covariates

Types of Causal Inference Techniques

In retail, estimating the effect of marketing promotions on sales, a customer's average transaction value or conversion is very important. We have event and non-event periods in the data. During promotional events, targeting selected customer bases through marketing email/SMS can be considered an intervention. In all such scenarios, causal inference techniques aim to calculate the ATE and the ITE.

ATE = Average Treatment Effect (the difference in the average outcome between the group that received the treatment and the group that did not)

ITE = Individual Treatment Effect (the effect estimated for each individual unit)

Let's discuss different scenarios in which we can calculate these metrics.

1. Meta Learning Techniques

This approach refers to taking the difference in the outcome when the unit/individual is exposed to the treatment vs. when it is not exposed to the treatment:

ITE = P(Outcome | Treated) − P(Outcome | Not Treated)

= P(Yi = 1 | Xi, Wi = 1) − P(Yi = 1 | Xi, Wi = 0), a value between −1 and 1

where,

Yi ∈ {0, 1} = did this person make a purchase

Xi = lead vector = characteristics of the person

Wi ∈ {0, 1} = did the person receive the marketing email or not

In reality, a person will either receive the treatment or not, but never both, so the goal is to estimate the ITE.
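As a rough illustration, here is a minimal sketch of a single-model way to compute this difference (sometimes called an S-learner): fit one classifier on (X, W), then score every lead twice, once as if treated and once as if not. The synthetic leads data, the size of the uplift and the scikit-learn model choice are all illustrative assumptions, not part of the article.

```python
# A minimal sketch: one model over (X, W), scored with W forced to 1 and to 0.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 4))                      # lead characteristics X_i
W = rng.integers(0, 2, size=n)                   # treatment W_i: email sent or not
# Synthetic purchase probability with a small uplift for treated leads
p = 1 / (1 + np.exp(-(0.5 * X[:, 0] + 0.3 * W - 1.0)))
Y = rng.binomial(1, p)                           # outcome Y_i: purchase made or not

# Fit one model on features plus treatment indicator
model = GradientBoostingClassifier().fit(np.column_stack([X, W]), Y)

# Score every lead twice: once as if treated, once as if not treated
p1 = model.predict_proba(np.column_stack([X, np.ones(n)]))[:, 1]
p0 = model.predict_proba(np.column_stack([X, np.zeros(n)]))[:, 1]

ite = p1 - p0          # estimated individual treatment effect per lead
print(f"Estimated ATE (mean of ITEs): {ite.mean():.3f}")
```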

2. Two Model Approach

Model 1 — Probability of leads converting if given a treatment

Model 2 — Probability of leads converting if not given a treatment

ATE = Model 1 − Model 2

The expected causal effect of T on Y: ATE = E[Y1 − Y0]

Calibration is then applied to reduce the imbalance in the data for these models.
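Here is a minimal sketch of the two-model idea (often called a T-learner) on synthetic leads data; the data generation, the LogisticRegression choice and the uplift size are illustrative assumptions, and the calibration step mentioned above is omitted for brevity.

```python
# A minimal sketch of the two-model approach: one model per treatment arm.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=(n, 4))                   # lead characteristics X_i
W = rng.integers(0, 2, size=n)                # treatment W_i: email sent or not
p = 1 / (1 + np.exp(-(0.5 * X[:, 0] + 0.3 * W - 1.0)))
Y = rng.binomial(1, p)                        # outcome Y_i: purchase made or not

# Model 1: probability of converting if given the treatment (fit on treated leads)
model_1 = LogisticRegression(max_iter=1000).fit(X[W == 1], Y[W == 1])
# Model 2: probability of converting if not given the treatment (fit on control leads)
model_2 = LogisticRegression(max_iter=1000).fit(X[W == 0], Y[W == 0])

# Score every lead with both models; the difference estimates the uplift
ite = model_1.predict_proba(X)[:, 1] - model_2.predict_proba(X)[:, 1]
print(f"Two-model ATE estimate: {ite.mean():.3f}")   # ATE = E[Y1 - Y0]
```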

3. Class Transformation Approach

In this approach, another target variable is created in order to target persuadables (a target audience whose behavior can be influenced towards a positive outcome, as opposed to other segments like sleeping dogs, lost causes or sure things).

Persuadables: Zi = Yi*Wi + (1 − Yi)(1 − Wi)

P(Zi = 1 | Xi) = probability of being persuadable

ITE = 2*P(Zi = 1 | Xi) − 1

where,

Yi ∈ {0, 1} = did this person make a purchase

Wi ∈ {0, 1} = did the person receive the marketing email or not
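A minimal sketch of the class-transformation approach follows, on synthetic leads data; it assumes a roughly 50/50 treatment split (which the transformed target relies on), and the model choice and simulated uplift are illustrative assumptions only.

```python
# A minimal sketch of the class-transformation (transformed outcome) approach.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 4))                         # lead characteristics
W = rng.integers(0, 2, size=n)                      # email sent, P(W=1) ~ 0.5
p = 1 / (1 + np.exp(-(0.5 * X[:, 0] + 0.3 * W - 1.0)))
Y = rng.binomial(1, p)                              # purchase made or not

# Transformed target: Z = 1 for treated converters and untreated non-converters
Z = Y * W + (1 - Y) * (1 - W)

model = LogisticRegression(max_iter=1000).fit(X, Z)
p_z = model.predict_proba(X)[:, 1]                  # P(Z = 1 | X)

ite = 2 * p_z - 1                                   # uplift estimate per lead
print(f"Average uplift estimate: {ite.mean():.3f}")
```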

4. Covariate Adjustment

In this method, we extrapolate the observations to answer what would have happened if a particular individual/unit had received the treatment.

Response surface modelling predicts Y given X and T by framing it as a regression problem, and is used to measure both the ITE and the ATE. T is given as an input/intervention, and we find the difference in Y between f(x, t) when t = 0 and t = 1 (the raw data doesn't show up here). A minimal sketch follows the list below.

  • Alternatively, using the observed Y for the factuals and the imputed Y (from f) for the counterfactuals would also be a consistent estimator of the ATE
  • Since the function f might not have learned properly, especially with high-dimensional data, the ML algorithm might never pick up the actual dependence on T if we regularize or do early stopping. But we don't actually care about the quality of predicting Y; T is what matters here. We are interested in T so that, when we change T, we can observe the differential effect on Y (we don't want the model to completely ignore T and model only on the Xs). This difference in goals becomes more important as the number of dimensions increases
  • In this approach overlap is important: there should be similar samples in both the treated and untreated groups
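Here is a minimal sketch of response-surface modelling on simulated data: fit f(X, T) as a regression, then difference the predictions at T = 1 and T = 0. The random-forest choice, the confounded treatment assignment and the true effect of 2.0 (baked in so the output can be sanity-checked) are all illustrative assumptions.

```python
# A minimal sketch of covariate adjustment via a fitted response surface f(X, T).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n = 4000
X = rng.normal(size=(n, 5))                                   # covariates / confounders
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))               # treatment depends on X
Y = 2.0 * T + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)    # simulated true effect = 2.0

# Fit the response surface f(X, T) as an ordinary regression
f = RandomForestRegressor(n_estimators=100, random_state=0).fit(
    np.column_stack([X, T]), Y)

y1 = f.predict(np.column_stack([X, np.ones(n)]))    # predicted Y with T forced to 1
y0 = f.predict(np.column_stack([X, np.zeros(n)]))   # predicted Y with T forced to 0

print(f"Response-surface ATE estimate: {(y1 - y0).mean():.2f}  (true effect is 2.0)")
```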

5. Inverse Propensity Score Re-weighting

The main idea here is to turn an observational study into a pseudo-randomized trial by re-weighting the samples. The propensity score gives the probability of receiving the treatment given the confounders.

The propensity score models P(T = 1 | X), typically using ML techniques.

Samples are then re-weighted by the inverse of the propensity score of the treatment they actually received. We should inspect the distribution of propensity scores in the treatment and control groups: if there is not much overlap, the propensity scores become non-informative and can easily be miscalibrated. Weighting by the inverse can also create large variance and large errors when propensity scores are small.
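A minimal sketch of inverse propensity score weighting on simulated data follows; the logistic propensity model, the clipping of extreme scores (one practical way to tame the large weights mentioned above), and the true effect of 2.0 are illustrative assumptions.

```python
# A minimal sketch of inverse propensity score weighting (IPW).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 4000
X = rng.normal(size=(n, 5))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))     # confounded treatment assignment
Y = 2.0 * T + X[:, 0] + rng.normal(size=n)          # simulated true effect = 2.0

# Propensity model: P(T = 1 | X)
ps = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
ps = np.clip(ps, 0.05, 0.95)     # clip to avoid huge weights from tiny scores

# Re-weight each sample by the inverse of the propensity of its own treatment
ate = np.mean(T * Y / ps) - np.mean((1 - T) * Y / (1 - ps))
print(f"IPW ATE estimate: {ate:.2f}  (true effect is 2.0)")
```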

6. Propensity Score Matching

We first calculate the propensity score, which gives the probability of receiving the treatment given the confounders. The idea is to match a unit/individual to its nearest neighbor from the opposite group. Finding similarity via propensity scores is much easier than matching directly on the various confounders (multiple dimensions).

Matching is equivalent to covariate adjustment with two 1-nearest-neighbor models. There are different matching methods, such as k-nearest neighbors or Mahalanobis distance. The propensity score itself is typically estimated with logistic regression or a random forest with cross-validation. Overlap in the propensity score across its range is good for matching. Once we find the closest untreated match for each treated unit, the difference in the outcome variable can be attributed to the treatment or intervention.

ATE = ATT (Average Treatment effect on the Treated)

This is a non-parametric technique, applicable in small-sample scenarios. But in situations where we can't find a counterfactual, we can use the synthetic control method, which is explained later under panel data measurement techniques.
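Below is a minimal sketch of 1-nearest-neighbor propensity score matching on simulated data: each treated unit is paired with the control unit whose propensity score is closest, and the outcome differences are averaged. The logistic propensity model, the matching-with-replacement choice and the true effect of 2.0 are illustrative assumptions.

```python
# A minimal sketch of 1-nearest-neighbour propensity score matching.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)
n = 4000
X = rng.normal(size=(n, 5))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))     # confounded treatment assignment
Y = 2.0 * T + X[:, 0] + rng.normal(size=n)          # simulated true effect = 2.0

# Propensity scores P(T = 1 | X)
ps = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]

treated, control = np.where(T == 1)[0], np.where(T == 0)[0]
# For every treated unit, find the control unit with the closest propensity score
nn = NearestNeighbors(n_neighbors=1).fit(ps[control].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
matched_controls = control[idx.ravel()]

att = np.mean(Y[treated] - Y[matched_controls])
print(f"Matched estimate of the treatment effect: {att:.2f}  (true effect is 2.0)")
```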

7. Regression Discontinuity

It is a method for estimating causal effects when the relationship between the independent and dependent variables is not linear and there is a discontinuity, i.e., when a threshold value or cut-off differentiates treatment from control units/individuals. An example is customers making a purchase when the discount is more than 30%; it can also be used to model demand elasticities at certain demand points for certain items.
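Here is a minimal sketch of a sharp regression discontinuity design around a 30% discount cutoff, on simulated data: fit a separate linear trend on each side of the cutoff within a bandwidth and read the jump at the threshold. The bandwidth, the smooth trend and the size of the jump are illustrative assumptions.

```python
# A minimal sketch of a sharp regression discontinuity at a 30% discount cutoff.
import numpy as np

rng = np.random.default_rng(6)
n = 5000
discount = rng.uniform(0, 60, size=n)                 # running variable (%)
cutoff = 30.0
treated = (discount >= cutoff).astype(float)
# Purchase rate rises smoothly with discount, plus a jump of 0.15 at the cutoff
purchase_rate = 0.2 + 0.004 * discount + 0.15 * treated + rng.normal(0, 0.05, n)

# Fit linear trends separately below and above the cutoff, within a bandwidth
bw = 10.0
left = (discount >= cutoff - bw) & (discount < cutoff)
right = (discount >= cutoff) & (discount <= cutoff + bw)
b_left = np.polyfit(discount[left], purchase_rate[left], 1)
b_right = np.polyfit(discount[right], purchase_rate[right], 1)

# Treatment effect = difference between the two fitted lines at the cutoff
effect = np.polyval(b_right, cutoff) - np.polyval(b_left, cutoff)
print(f"RDD estimate of the jump at the 30% cutoff: {effect:.3f}  (true jump 0.15)")
```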

As we have seen above, causal inference is challenging with observational data. Panel data solves this to a great extent. A panel is when we have repeated observations of the same units over multiple periods of time: it has samples of the same cross-sectional units, observed at multiple time points. It is different from pooled data, where we do have a time series of cross-sections, but the observations in each cross-section need not refer to the same unit. In my next article, I will discuss in detail the various ways in which we can use causal inference techniques on panel data, which is more relevant for the daily/weekly item-wise sales that we have in a retail set-up. Some techniques include pre-post, difference-in-differences, fixed/random effects, and synthetic control methods.

--

Shivangi Choudhary

Senior Manager, Data Science - Retail and E-commerce Domain, aspiring to motivate others towards personal development and life skills