When and how to apply causal inference with RDD

Intuition, step-by-step script, and assumptions needed for the RDD

Sep 24, 2024

Several business or public policies determine the assignment of a treatment based on the value of a specific variable. Examples include credit risk scores for bank limits, minimum age to vote, minimum grades to obtain a certification, etc.

In these cases, the Regression Discontinuity Design (RDD) emerges as an intuitive and robust method for estimating causal effects, offering advantages such as:

Milder assumptions compared to other observational methods.
Inference that can be as credible as that from an experimental design, although only under specific conditions.

Today, I will discuss the intuition behind the sharp RDD case, the assumptions required for its application, and a hands-on example. There is also the fuzzy case, which is an application of instrumental variables and will be covered in a future post.

Intuition

The essence of RDD is to compare individuals who are immediately around a critical value of a variable of interest. For example, imagine that an e-commerce company wants to measure how much, on average, customers increase their future purchases once they receive free shipping. The rule to receive free shipping is having an average loyalty score of 50 or higher in the last 3 months.

The idea is that these individuals around this critical value, which we call the cutoff or threshold, are similar in almost all characteristics except for being on one side or the other of the cutoff point.

This minimizes omitted variable bias (OVB), which occurs when unobserved factors influence both the likelihood of receiving the treatment and the outcome.

In the context of RDD, some terms are too important to be left undefined, so let’s define them before continuing:

Assignment rule: Defines who receives the treatment based on whether the running variable reaches the cutoff.
Running or forcing variable: The variable according to which the assignment rule is established (in this case, it’s the average loyalty score in the last 3 months).
Cutoff or threshold: The value of the running variable that determines the assignment of the treatment (in this example, the value is 50: the average loyalty score must be ≥ 50 for the customer to receive free shipping).
Bandwidth: The interval around the cutoff considered in the analysis. For instance, in the figure below, the purple vertical lines denote a possible bandwidth around the cutoff of 50.
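These terms can be made concrete with a minimal Python sketch. The scores and the bandwidth of 5 points are hypothetical values chosen only for illustration, and the rule uses score >= cutoff to match the simulated example later in the post (x ≥ 50).

```python
CUTOFF = 50.0  # loyalty-score threshold from the free-shipping example


def is_treated(score, cutoff=CUTOFF):
    """Sharp assignment rule: treated if and only if the score reaches the cutoff."""
    return score >= cutoff


def in_bandwidth(score, h, cutoff=CUTOFF):
    """True when the running variable lies within h points of the cutoff."""
    return abs(score - cutoff) <= h


scores = [38.0, 47.5, 49.9, 50.0, 52.3, 61.0]  # hypothetical loyalty scores
h = 5.0  # hypothetical bandwidth, for illustration only

# Keep only customers inside the bandwidth, with their treatment status.
window = [(s, is_treated(s)) for s in scores if in_bandwidth(s, h)]
print(window)
```

Only the four customers scoring between 45 and 55 enter the comparison; the customers at 38 and 61 are dropped as too far from the cutoff.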

To avoid biases from comparing clients that are too different from each other or from incorrect model specifications (such as mistaking a nonlinear relationship for a discontinuity), it is common to use only observations close to the cutoff. This approach is known as local or non-parametric RDD.

The choice of bandwidth involves a trade-off:

Smaller bandwidths reduce bias but increase the variance of estimates due to a smaller number of observations.
Larger bandwidths increase the risk of bias but reduce the variance of estimates due to a larger number of observations.
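The trade-off can be illustrated with a small Monte Carlo sketch in plain Python. Everything about the data-generating process here is an assumption made for illustration: the true jump of 250, the noise level, and in particular the cubic curvature term, which is added so that a straight-line fit becomes visibly wrong as the window widens. The estimator is a simple local linear fit with equal weights on each side, not the rdrobust procedure.

```python
import random
from statistics import fmean, stdev

TRUE_EFFECT = 250.0  # assumed true jump at the cutoff
CUTOFF = 50.0


def intercept_at_zero(us, ys):
    """OLS intercept of y on u, where u is already centered at the cutoff."""
    mu, my = fmean(us), fmean(ys)
    sxx = sum((u - mu) ** 2 for u in us)
    slope = sum((u - mu) * (y - my) for u, y in zip(us, ys)) / sxx
    return my - slope * mu


def rdd_jump(x, y, h):
    """Fit one line per side within h of the cutoff; return the gap at the cutoff."""
    left = [(xi - CUTOFF, yi) for xi, yi in zip(x, y) if CUTOFF - h <= xi < CUTOFF]
    right = [(xi - CUTOFF, yi) for xi, yi in zip(x, y) if CUTOFF <= xi <= CUTOFF + h]
    return intercept_at_zero(*zip(*right)) - intercept_at_zero(*zip(*left))


def one_sample(rng, n=1000):
    """Hypothetical data: linear trend, mild cubic curvature, jump, noise."""
    x = [rng.gauss(50, 10) for _ in range(n)]
    y = [1000 + 8 * (xi - CUTOFF) + 0.02 * (xi - CUTOFF) ** 3
         + TRUE_EFFECT * (xi >= CUTOFF) + rng.gauss(0, 100) for xi in x]
    return x, y


def monte_carlo(h, reps=100, seed=0):
    """Return (bias, standard deviation) of the estimate over repeated samples."""
    rng = random.Random(seed)
    estimates = [rdd_jump(*one_sample(rng), h) for _ in range(reps)]
    return fmean(estimates) - TRUE_EFFECT, stdev(estimates)


results = {h: monte_carlo(h) for h in (2, 5, 20)}
for h, (bias, spread) in results.items():
    print(f"h={h:>2}: bias={bias:7.1f}, sd={spread:6.1f}")
```

Under these assumptions, the narrow window gives estimates that are centered near the true jump but vary a lot from sample to sample, while the wide window gives stable estimates that are pulled away from the true value by the curvature.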

There are statistical methods to assist in the optimal choice of bandwidth, such as those proposed by Calonico, Cattaneo, and Titiunik (2014) and Calonico, Cattaneo, and Farrell (2020).

Key assumptions and limitations

The unbiasedness of the causal estimate via RDD depends on some assumptions:

No manipulation of the running variable: Individuals should not be able to deliberately influence their position relative to the cutoff. If they can, this introduces bias into the estimate.

Example: If customers can easily increase their loyalty score to obtain free shipping, the comparability of both sides of the cutoff would be compromised.

Therefore, checking for a balanced distribution of observations around the cutoff is important. An abnormal concentration of individuals on one side of the cutoff may indicate manipulation.

Continuity of observable variables: the observable characteristics of individuals should vary continuously around the cutoff. This means there should be no abrupt jumps/discontinuities in the covariates across the threshold.

Covariate continuity serves as an indirect check on the core identifying assumption, continuity of the potential outcomes: the outcome that would be observed in the absence of the treatment should vary smoothly around the cutoff. This assumption is not directly testable, since we do not observe the counterfactual outcome, but indirect tests can increase our confidence in the results.
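In potential-outcomes notation, the assumption and the quantity that sharp RDD targets can be written as follows. The symbols are introduced here for illustration and do not come from the post: X is the running variable, c the cutoff, D the treatment indicator, and Y(1), Y(0) the outcomes with and without treatment. The usual formal statement requires continuity of both conditional means at the cutoff.

```latex
% Sharp assignment rule
D_i = \mathbf{1}\{X_i \ge c\}

% Continuity of potential outcomes at the cutoff, for d \in \{0, 1\}
\lim_{x \uparrow c} \mathbb{E}[Y_i(d) \mid X_i = x]
  = \lim_{x \downarrow c} \mathbb{E}[Y_i(d) \mid X_i = x]

% Under continuity, the jump in the observed outcome identifies the effect at the cutoff
\tau = \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i = c]
  = \lim_{x \downarrow c} \mathbb{E}[Y_i \mid X_i = x]
  - \lim_{x \uparrow c} \mathbb{E}[Y_i \mid X_i = x]
```

In the free-shipping example, c = 50 and the effect is identified only for customers whose loyalty score is at the cutoff.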

Limitation: The local RDD has high internal validity, as the effect is well estimated for customers around the loyalty score of 50 (cutoff). However, external validity is limited because the observed effect may not apply to customers with loyalty scores that differ significantly from 50.

For customers with loyalty scores of 30 or 70, for example, the effect of free shipping could be quite different, and the local RDD does not allow for a reliable inference of this.

Hands-on with Sharp RDD using rdrobust

I provide here an R script for this application. Let’s apply sharp RDD in a practical example: measuring the impact on future purchases of offering free shipping to customers with a loyalty score above a certain cutoff.

An e-commerce company wants to evaluate whether offering free shipping to customers with an average past-loyalty-score equal to or greater than 50 increases their future spending on the platform. The loyalty score varies continuously among customers and is calculated based on past interactions, such as purchase frequency and engagement.

I simulated a dataset representing the described situation:

Running variable (x): Loyalty score, normally distributed around 50.
Outcome (y): Total future spending in R$. I introduced a jump of R$250 in y for customers with x ≥ 50 as an effect of receiving free shipping.
Covariates: Age, tenure as a user, and frequency of past purchases.
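The post's simulation lives in the R script and is not reproduced in the text. As a rough stand-in, a data-generating process matching the description could look like the Python sketch below. Only the score centered at 50, the cutoff at 50, and the jump of 250 come from the description; the standard deviation, baseline spending, slope, noise, and covariate distributions are all assumptions.

```python
import random


def simulate_customers(n=5000, cutoff=50.0, effect=250.0, seed=42):
    """Simulate customers for a sharp RDD; all distributions are illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(50, 10)            # loyalty score (sd of 10 is assumed)
        treated = x >= cutoff            # sharp rule: free shipping iff x >= 50
        # Covariates drawn independently of the score, so they cannot jump at the cutoff.
        age = rng.gauss(35, 8)
        tenure_months = rng.gauss(24, 6)
        past_frequency = rng.gauss(5, 1.5)
        # Future spending: smooth trend in the score, a jump for treated, plus noise.
        y = 1000 + 8 * (x - cutoff) + effect * treated + rng.gauss(0, 100)
        rows.append({"x": x, "y": y, "treated": treated, "age": age,
                     "tenure_months": tenure_months,
                     "past_frequency": past_frequency})
    return rows


customers = simulate_customers()
share_treated = sum(r["treated"] for r in customers) / len(customers)
print(len(customers), round(share_treated, 3))
```

Because the score is centered on the cutoff, roughly half of the simulated customers receive free shipping.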

Test of manipulation of the running variable

Let's start by validating the key assumptions we described. First, let's check whether customers appear to manipulate their loyalty score to get free shipping. The figure below shows that the density of the loyalty score is continuous around the cutoff, which does not indicate manipulation.

Interpretation: If customers could easily increase their score to surpass the cutoff, we would have observed an abnormal concentration of scores just above 50. The continuity of the density suggests that this is not occurring.
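The logic of the density check above can be approximated with a simple count comparison. This Python sketch is a crude stand-in, not the formal density test: it counts observations in a narrow window on each side of the cutoff and asks whether the split is compatible with 50/50, using a normal approximation to the binomial. The window of 2 points and the simulated scores are assumptions.

```python
import math
import random


def density_balance_check(scores, cutoff=50.0, window=2.0):
    """Compare counts just below and just above the cutoff.

    Returns (below, above, z, p_value). If the density is smooth and the
    window is narrow, each observation is about equally likely to fall on
    either side.
    """
    below = sum(1 for s in scores if cutoff - window <= s < cutoff)
    above = sum(1 for s in scores if cutoff <= s < cutoff + window)
    n = below + above
    if n == 0:
        raise ValueError("no observations inside the window")
    z = (above - n / 2) / math.sqrt(n / 4)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal approximation
    return below, above, z, p_value


rng = random.Random(1)
scores = [rng.gauss(50, 10) for _ in range(5000)]
print(density_balance_check(scores))

# Hypothetical manipulation: customers just below the cutoff push their score over it.
bumped = [s + 2.0 if 48.0 <= s < 50.0 else s for s in scores]
print(density_balance_check(bumped))
```

In the manipulated sample, the window just below 50 empties out and the one just above fills up, which is the abnormal concentration the test is designed to flag.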

Test of continuity of covariates

Analogously, we should analyze whether customer characteristics vary smoothly/continuously around the cutoff point. The results below show no significant discontinuities in the covariates, reinforcing the assumption that the groups are comparable.

Interpretation: This means that characteristics such as age, tenure, and frequency of past purchases do not differ significantly between customers just above and just below the cutoff, which supports attributing the observed effect to the treatment (free shipping) rather than to differences in these characteristics.
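A common way to run this check is to treat each covariate as if it were the outcome and look for a jump at the cutoff. The Python sketch below uses a cruder version of that idea, assuming a narrow window of 3 points: it compares the covariate's mean just above and just below the cutoff with a two-sample z-test. The simulated age variable and all of its parameters are hypothetical.

```python
import math
import random
from statistics import mean, variance


def covariate_jump(x, cov, cutoff=50.0, h=3.0):
    """Difference in covariate means just above vs. just below the cutoff.

    Returns (difference, standard error, two-sided p-value).
    """
    below = [c for xi, c in zip(x, cov) if cutoff - h <= xi < cutoff]
    above = [c for xi, c in zip(x, cov) if cutoff <= xi <= cutoff + h]
    diff = mean(above) - mean(below)
    se = math.sqrt(variance(above) / len(above) + variance(below) / len(below))
    if se == 0:
        raise ValueError("covariate has no variation inside the window")
    p_value = math.erfc(abs(diff / se) / math.sqrt(2))  # normal approximation
    return diff, se, p_value


rng = random.Random(7)
x = [rng.gauss(50, 10) for _ in range(5000)]
age = [rng.gauss(35, 8) for _ in x]  # drawn independently of the score
diff, se, p_value = covariate_jump(x, age)
print(f"age jump = {diff:.2f} (se {se:.2f}), p = {p_value:.3f}")
```

A comparison of means is only reasonable when the covariate has little trend inside the window; a local regression on each side handles trends better.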

Estimation of the Causal Effect

Now that we have validated the main assumptions, we can apply the RDD to estimate the impact of free shipping on future spending for customers with loyalty scores around the cutoff.

Interpretation: The estimated coefficient in the figure below represents the jump in future spending at the cutoff, i.e., the causal effect of free shipping on future purchases. A robust p-value less than 0.05 suggests that the observed effect is statistically significant, assuming the model assumptions hold.

Therefore, this estimation indicates a statistically significant increase of about R$240 in future spending for customers who received free shipping, close to the R$250 jump built into the simulation, suggesting that the policy was effective in encouraging additional purchases.
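To show what a local estimate of the jump involves, here is a minimal Python sketch of a local linear regression with triangular kernel weights, fitted separately on each side of the cutoff. It is not the rdrobust procedure: the bandwidth of 5 is fixed by hand rather than chosen by a data-driven rule, and no robust bias-corrected confidence interval is computed. The simulated data and its parameters are assumptions.

```python
import random


def weighted_intercept(us, ys, ws):
    """Weighted least-squares intercept of y on u (u centered at the cutoff)."""
    sw = sum(ws)
    mu = sum(w * u for w, u in zip(ws, us)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (u - mu) ** 2 for w, u in zip(ws, us))
    slope = sum(w * (u - mu) * (y - my) for w, u, y in zip(ws, us, ys)) / sxx
    return my - slope * mu


def sharp_rdd(x, y, cutoff=50.0, h=5.0):
    """Estimated jump at the cutoff from two kernel-weighted linear fits."""
    sides = {"left": ([], [], []), "right": ([], [], [])}
    for xi, yi in zip(x, y):
        u = xi - cutoff
        w = 1.0 - abs(u) / h  # triangular kernel: closer points weigh more
        if w <= 0:
            continue          # outside the bandwidth
        us, ys, ws = sides["right" if u >= 0 else "left"]
        us.append(u)
        ys.append(yi)
        ws.append(w)
    return weighted_intercept(*sides["right"]) - weighted_intercept(*sides["left"])


rng = random.Random(42)
x = [rng.gauss(50, 10) for _ in range(5000)]
y = [1000 + 8 * (xi - 50) + 250 * (xi >= 50) + rng.gauss(0, 100) for xi in x]
print(f"estimated jump at the cutoff: {sharp_rdd(x, y):.1f}")
```

The estimate is the vertical gap between the two fitted lines at a score of exactly 50, which is why it speaks only to customers near the cutoff.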

Formal visualization of RDD: we use the rdplot function for a graphical representation of this discontinuity. The graph shows the discontinuity in future spending exactly at the cutoff, visualizing the positive effect of free shipping.
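Plots of this kind are typically built from binned averages of the outcome on each side of the cutoff. The Python sketch below computes such averages with evenly spaced bins that never straddle the cutoff. The bin width of 5 and the toy data are assumptions, and this is not how rdplot chooses its bins.

```python
import math
from collections import defaultdict


def binned_means(x, y, cutoff=50.0, width=5.0):
    """Average outcome per bin; bin edges are aligned so none crosses the cutoff.

    Returns a list of (bin center, mean outcome), sorted by bin center.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for xi, yi in zip(x, y):
        k = math.floor((xi - cutoff) / width)  # k < 0 below the cutoff, k >= 0 above
        totals[k][0] += yi
        totals[k][1] += 1
    return [(cutoff + (k + 0.5) * width, s / n) for k, (s, n) in sorted(totals.items())]


# Toy data with an obvious jump at 50, for illustration only.
x = [41, 43, 47, 49, 51, 53, 57, 59]
y = [10, 20, 30, 40, 300, 310, 320, 330]
for center, avg in binned_means(x, y):
    print(f"{center:5.1f}  {avg:6.1f}")
```

Plotting these points against the score, with a fitted curve on each side, makes the jump at the cutoff visible at a glance.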

Now that you’re familiar with sharp RDD, let’s stay in touch for more discussions with coding.

Thank you for reading. Follow me for more in this series :)

Would you like to support me? Just share this with those who may be interested!

PS: If you spot any errors or have suggestions, I’d be happy to hear from you.

Here are some (more theoretical) materials I recommend:

Calonico, Cattaneo and Titiunik (2014): Robust Nonparametric Confidence Intervals for Regression-Discontinuity Designs. Econometrica 82(6): 2295–2326.
Calonico, Cattaneo, Farrell and Titiunik (2019): Regression Discontinuity Designs Using Covariates. Review of Economics and Statistics 101(3): 442–451.
Calonico, Cattaneo and Farrell (2020): Optimal Bandwidth Choice for Robust Bias Corrected Inference in Regression Discontinuity Designs. Econometrics Journal 23(2): 192–210.