How long should an online article title be? A blog post here cites an old post from 2013 that shows a nice plot of average click-through rate (CTR) against title length.
Looking at this plot, we might suggest a policy intervention that all titles should be 16–18 words long! But wait, there’s a nuance here. Was this data from observation or from a randomized experiment? Did someone randomly assign title lengths, or did the authors who prefer 16–18 word titles simply write 16–18 word titles? Maybe those authors also happen to be good at writing articles?
As it turns out, this plot is based on observational data. If so, then we really should control for the author when we make a plot like this, because all of the observed effect could be due to the authors’ skill rather than the title length. How do we account for that? To start, note that the above is a plot of E[Y|X=x], the average value of Y at x, where Y is the CTR and X is the title length. What we really want is an estimator for E[Y|do(X=x)], where do(X=x) refers to the action of actively setting X to a specific value. That is, we generate a similar plot, but one where we intervene to make all authors write titles of length x (and repeat the same hypothetical intervention at each x). With the same authors at every length, any trend that we see in the plot has to be due to title length and is therefore a causal trend.
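To make the distinction concrete, here is a sketch of the two quantities side by side. It assumes the author, written Z here, is the only confounder and that every author has some chance of writing every title length; under those assumptions the interventional average is given by the standard backdoor adjustment formula.

```latex
\begin{aligned}
E[Y \mid X = x] &= \sum_z E[Y \mid X = x, Z = z]\, P(Z = z \mid X = x) \\
E[Y \mid do(X = x)] &= \sum_z E[Y \mid X = x, Z = z]\, P(Z = z)
\end{aligned}
```

The only difference is the weighting over authors: the observational average uses the mix of authors who happened to write titles of length x, while the interventional average uses the overall mix of authors at every x.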
Our question, therefore, lies in the realm of causal inference. However, most techniques for causal inference are designed to estimate more specific things, such as the effect of making the title one word longer, E[Y|do(X=1)] - E[Y|do(X=0)]. Why can’t we just make a causal version of the plot?! Wouldn’t it be great if we could generate the same kind of data we used for this plot from our observational data, but make it causal? With modern causal inference approaches, we can!
This post is about a toolkit that makes it easy to draw causal plots. There’s a reason why we do not see such causal plots in typical data science posts: causal inference is hard! First, it requires immense statistical knowledge and expertise to estimate a single causal effect, let alone a causal version of a plot like the one above. Second, the software for estimating causal effects typically requires specialized frameworks that do not integrate well with common data science practices. For the first, we utilize a recently released Python library for causal inference, DoWhy, which abstracts causal inference into four steps and guides non-experts towards deriving the desired causal estimate. For the second, we introduce a new API for causal inference that integrates directly with pandas.DataFrame, one of the most popular tools for data analysis. Our motivation is that you shouldn’t have to move outside of your standard data science workflow to do causal inference. Instead of calling df.plot(x="X", y="Y"), you can call df.causal.do(…).plot(x="X", y="Y")!
We call this the causal data frame. We’ll run through a quick example, but first let us give a little more context on the problem of causal inference.
The promise of Pearlian causal inference
Pearlian causal graphs have fundamentally changed the way we frame causal inference problems, and have lately been changing the process of causal inference itself. Especially when it comes to software, they give an essential simplicity that lends itself to good abstraction.
DoWhy, a Python package authored by Amit Sharma and Emre Kıcıman from Microsoft, aims to realize that potential. It’s built with causal models as a fundamental data structure. It uses these to explicitly specify our assumptions (“what we know”), build an inference plan (“what to estimate”), and construct estimation methods (“how to estimate”). Afterwards, the framework provides natural methods for checking the robustness of the estimated effect and for model criticism.

Thus, DoWhy provides a great environment in which to take causal inference a step further. All of the hard parts of causal inference are abstracted, and act as tools with which we can build higher-level features. These include a causal graph, on which we can use logical rules to decide whether we’ve controlled for all confounders, and refuter methods, which test the assumptions we make in our inference approach. With these in hand, it’s easy to alert the user through Python’s logging capabilities when assumptions break down, and so prevent mistakes in estimating the causal effect.

Back to our example, we can illustrate the difference between observational and interventional data on title length using the two causal graphs shown in Figure 1.
Building on DoWhy and an early implementation in Adam’s causality package, we decided to build a more general class of causal effect estimators. While classic causal inference often focuses on binary causal states, and estimation of contrasts of the form
E[Y|do(X=1)] - E[Y|do(X=0)],
Pearlian causal inference focuses on estimating far more general quantities, like the distribution P(Y|do(X=x)). This works, in theory, even when X and Y are multivariate, and with mixed data types! This was the starting place for the do-sampler. If we could generate samples from this distribution, we could compute statistics of those samples, even plots like the one we started the article with!
The Do-Sampler: An example
Let’s keep the motivation going with a simple example, and see the do-sampler in action. Let’s solve our problem above by constructing a data set where the observational result is similar to the plot above (full notebook here, if you’d rather get to the point). Specifically, let’s have one author who prefers a narrow range of title lengths, about 12 to 18 words, and who is twice as good as the average person at writing titles. Other authors tend to write titles of random length, ranging from around 3 to 25 words. Naturally, the 12 to 18 word titles will perform better on average, and we get a graph that looks like the figure below.
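A data set like this can be sketched in a few lines of plain Python. This is not the notebook's code: the function names, the 20% share of articles by the skilled author, the base CTR of 0.05, the noise level, and the 10% of the skilled author's titles that fall outside the preferred range are all assumptions chosen for illustration. Click-through rate depends only on the author, never on title length.

```python
import random
from collections import defaultdict


def simulate_articles(n=50000, seed=0):
    """Simulate articles where CTR depends on the author, not on title length.

    All numeric settings are illustrative assumptions, not values from the post.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        if rng.random() < 0.2:
            # The skilled author: twice the base CTR, mostly 12 to 18 word titles.
            author, base_ctr = "skilled", 0.10
            in_range = rng.random() < 0.9
            title_length = rng.randint(12, 18) if in_range else rng.randint(3, 25)
        else:
            # Everyone else: any length from 3 to 25 words, base CTR.
            author, base_ctr = "other", 0.05
            title_length = rng.randint(3, 25)
        ctr = max(0.0, rng.gauss(base_ctr, 0.01))
        rows.append({"author": author,
                     "title_length": title_length,
                     "click_through_rate": ctr})
    return rows


def mean_ctr_by_length(rows):
    """Observational E[Y | X = x]: plain average CTR at each title length."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        sums[row["title_length"]] += row["click_through_rate"]
        counts[row["title_length"]] += 1
    return {x: sums[x] / counts[x] for x in sorted(sums)}


observational = mean_ctr_by_length(simulate_articles())
```

Plotting `observational` against title length should show a bump over 12 to 18 words, even though no title length has any effect on CTR in this simulation.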
Even though there’s no causal relationship between title length and click-through rate, there’s statistical dependence! Again, that’s because a person who writes 12 to 18 word titles tends to also write titles that are clicked on more. Intuitively, if we control for the author, the relationship between click-through rate and title length should go away.
We can do that by calling the do-sampler to produce a random sample from the interventional distribution, that is, the distribution of click-through rates under a policy that sets the title length to a specific value for all authors.
To do so, the do-sampler requires us to specify the cause (‘title_length’), the outcome (‘click_through_rate’), and a list of common causes that we believe can confound the relationship between the cause and outcome. In our case, let us assume that the ‘author’ is the only confounder for the effect of titles. Then, we pick a method (‘weighting’ or importance sampling as described below), and specify the variable types (here “d”, meaning “discrete”, and “c”, meaning “continuous”). The package does the rest! The result of this is a new pandas.DataFrame.
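The call described above might look like the sketch below. It assumes dowhy and pandas are installed and that `df` is a pandas.DataFrame with columns `title_length`, `click_through_rate` and `author`. The wrapper function name is hypothetical, and the exact signature of `do` may differ between DoWhy versions, so treat the keyword arguments as a best reading of the description rather than a definitive reference.

```python
def sample_title_length_intervention(df):
    """Sketch: sample from P(click_through_rate | do(title_length)).

    `df` is assumed to be a pandas.DataFrame with the three columns named
    below. The function name is hypothetical; only `df.causal.do` comes
    from DoWhy.
    """
    # Importing dowhy.api registers the `causal` accessor on DataFrames.
    import dowhy.api  # noqa: F401

    return df.causal.do(
        x="title_length",                 # the cause we intervene on
        method="weighting",               # inverse propensity weighting
        variable_types={
            "title_length": "d",          # discrete
            "click_through_rate": "c",    # continuous
            "author": "d",                # discrete
        },
        outcome="click_through_rate",
        common_causes=["author"],         # assumed to be the only confounder
    )
```

Since the result is an ordinary DataFrame, something like `causal_df.groupby("title_length")["click_through_rate"].mean().plot()` should then give the causal version of the original plot.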
The beauty of the do-sampler is that you can manipulate the interventional sample just as you would any other dataframe! To make a plot similar to the one above, you can use whatever plotting method you like, such as the pandas native version or the seaborn version.
Whatever your preferred method, you will get a plot that looks like the one below.
The relationship between click-through rate and title length is gone! In particular, the big bump around 12–18 words due to confounding by “author” went away. The expected click-through rate doesn’t change with the length of the title, which encourages us to look for other ways of improving CTR.
This is a powerful approach!
In contrast, most causal inference methods are built around estimating a parameter of a model. The simplest version of this is estimating the coefficient of a linear regression model. The coefficient, through a happy accident of the model specification, ends up being an estimator for E[Y|do(X=1)] - E[Y|do(X=0)]. This approach can be very limiting: you might be able to identify a difference in average outcomes between a control and a test group, but not a difference in median outcomes, or a difference in variances, etc. The do-sampler presents a completely different and extensible approach to causal inference. We take advantage of a fundamental realization of Pearlian causal inference: that we can identify the interventional distribution, P(Y|do(X=x)), and from that we can compute any statistic we like!
The Do-Sampler: Why it works
The core problem is that it’s really hard to estimate probability distributions. This makes good sense: conditional distributions are multivariate functions whereas contrasts, the focus of many classic estimators, are just a single parameter.
Instead, we turn to generating samples from these distributions, which allows us to compute any quantity from those samples just as we could with our original data set! That’s the key idea that lets us do things like plot E[Y|do(x)] vs. x in the example above.
As it happens, there are some cool tricks for doing this sampling process! One simple trick is inverse propensity weighting. The intuition is that we weight each input data point inversely to its probability of receiving a particular treatment x. We can calculate a new average CTR at a fixed title length with this weighting scheme, where under-represented points (ones with lower propensity) get up-weighted (by the inverse propensity!).
In the case of our article titles, we’d look at how likely different title lengths (X) are for each author (Z). For a fixed title length, we’d count an author’s CTRs (Y) more times toward the average CTR at that title length if they produced fewer titles at that length. This over-counting balances the overall average out to what it would have been if we had forced everyone to write titles of that length from the beginning!
If you weight your data using these inverse propensity weights, you can compute the average of the outcomes E[Y|do(X)], as if the outcomes were generated by P(Y|do(X=x)). The above is true whenever we observe all confounders for the effect of X on Y.
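The weighting idea can be sketched in plain Python. This is an illustration rather than DoWhy's implementation: the simulated data, function names and numeric settings are assumptions, the propensities P(X=x|Z=z) are estimated from simple counts, and the weighted average is normalized by the sum of the weights. The skilled author occasionally writes titles outside 12 to 18 words, so that every author has a nonzero propensity for every length.

```python
import random
from collections import Counter, defaultdict


def make_rows(n=50000, seed=1):
    """Simulated (author z, title_length x, ctr y); all numbers are assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        if rng.random() < 0.2:  # the skilled author, CTR around 0.10
            in_range = rng.random() < 0.9
            x = rng.randint(12, 18) if in_range else rng.randint(3, 25)
            rows.append(("skilled", x, max(0.0, rng.gauss(0.10, 0.01))))
        else:                   # everyone else, CTR around 0.05
            rows.append(("other", rng.randint(3, 25),
                         max(0.0, rng.gauss(0.05, 0.01))))
    return rows


def naive_mean_ctr(rows):
    """Observational average CTR at each title length."""
    total, count = defaultdict(float), Counter()
    for _, x, y in rows:
        total[x] += y
        count[x] += 1
    return {x: total[x] / count[x] for x in sorted(total)}


def ipw_mean_ctr(rows):
    """Inverse-propensity-weighted average CTR at each title length."""
    n_z = Counter(z for z, _, _ in rows)
    n_xz = Counter((x, z) for z, x, _ in rows)
    num, den = defaultdict(float), defaultdict(float)
    for z, x, y in rows:
        propensity = n_xz[(x, z)] / n_z[z]  # empirical P(X=x | Z=z)
        num[x] += y / propensity            # up-weight rare (x, z) pairs
        den[x] += 1.0 / propensity
    return {x: num[x] / den[x] for x in sorted(num)}


rows = make_rows()
naive = naive_mean_ctr(rows)
adjusted = ipw_mean_ctr(rows)
```

In this simulation `naive` shows the bump over 12 to 18 words, while `adjusted` stays close to the population-wide average CTR at every length.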
The math works like this: we can write the expectation of some function f of Y (for us, Y is the CTR) under the interventional distribution as
$$E[f(Y) \mid do(X=x)] = \sum_{y,z} f(y)\, P(y \mid x, z)\, P(z) = \sum_{y,z} f(y)\, \frac{P(y, x, z)}{P(x \mid z)},$$
giving the familiar propensity score in the denominator. Now, we can plug in an estimator for P(Y, X, Z), using the count of data points with values Y, X, Z, written N(Y, X, Z), and the overall count of data points, N. We get
$$E[f(Y) \mid do(X=x)] \approx \sum_{y,z} f(y)\, \frac{N(y, x, z)}{N\, P(x \mid z)},$$
and, in a slight abuse of notation, we can rewrite this sum over counts as a sum over the data points i whose title length x_i equals x,
$$E[f(Y) \mid do(X=x)] \approx \frac{1}{N} \sum_{i:\, x_i = x} \frac{f(y_i)}{P(x \mid z_i)},$$
which is our final estimator for E[f(Y)|do(X)]. This is just a weighted average of f(Y) over the data! You can get the same result with a resampling process instead, where each data point is sampled with probability proportional to its inverse propensity weight, 1/P(X=x_i|Z=z_i). That means we can compute the expectations of arbitrary functions by computing them over a weighted random sample of the data, and taking a simple average of the resulting values. This is just the same as sampling from P(Y|do(X))! We can extend the logic to compute arbitrary aggregations on these weighted random samples.
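The resampling version can be sketched on a tiny hand-made data set. Everything here is invented for illustration, including the two authors "A" and "B", the "short" and "long" length buckets, and the function names; it is not the DoWhy implementation. Author A accounts for half of the rows, so under an intervention the average CTR at either length should be 0.5 × 0.10 + 0.5 × 0.05 = 0.075.

```python
import random
from collections import Counter

# Toy rows (author z, title length bucket x, CTR y), invented for illustration.
# Author A has CTR 0.10 and prefers long titles; author B has CTR 0.05.
ROWS = (
    [("A", "short", 0.10)] * 1 + [("A", "long", 0.10)] * 3
    + [("B", "short", 0.05)] * 2 + [("B", "long", 0.05)] * 2
)


def do_resample(rows, k, seed=0):
    """Resample rows with probability proportional to 1 / P(X=x_i | Z=z_i)."""
    n_z = Counter(z for z, _, _ in rows)
    n_xz = Counter((x, z) for z, x, _ in rows)
    # Inverse of the empirical propensity n_xz / n_z for each row.
    weights = [n_z[z] / n_xz[(x, z)] for z, x, _ in rows]
    return random.Random(seed).choices(rows, weights=weights, k=k)


def mean_y_at(rows, x_value):
    """Simple (unweighted) average of y over rows with the given x."""
    ys = [y for _, x, y in rows if x == x_value]
    return sum(ys) / len(ys)


sample = do_resample(ROWS, k=40000)
observed_long = mean_y_at(ROWS, "long")        # observational average, 0.08
resampled_long = mean_y_at(sample, "long")     # should be near 0.075
resampled_short = mean_y_at(sample, "short")   # should be near 0.075
```

Because `sample` is just a list of rows, any other statistic (a median, a variance, a quantile) can be computed on it with the same unweighted code used for the original data.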
In the DoWhy package, we implement the do-sampler using three different methods: simple weighting, kernel density estimation, and Monte Carlo methods. The do-sampler supports both continuous and discrete variables.
The Usual Caveats
As with any causal inference method, the do-sampler relies on modeling assumptions. Estimating the above equation requires a model for computing the propensity scores, P(X|Z), for each desired causal state X. Further, we assume that we observe all common causes that may confound the effect of X on Y. If the propensity score model is mis-specified or if there are unobserved confounders, you’ll still get biased estimates of whichever statistic you’re estimating.

Using non-parametric density estimates does not get around this problem. The weighting sampler, for instance, uses kernel density estimation for continuous causal states. In the tails of empirical distributions, this approach can over-estimate propensities, and so weights that should be very large might end up much smaller than their correct values. Just as you might constrain weights in a weighting approach to causal inference (to reduce variance), it might make sense to ignore unlikely observed causal states and focus on a conditional average treatment effect (CATE) for a subset of the data, or an analogous estimand.
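As a small illustration of constraining weights, the sketch below caps inverse propensity weights at a chosen maximum and compares Kish's effective sample size before and after. The function names, example weights and cap are hypothetical, and capping is only one of several possible approaches; it reduces variance at the cost of some bias.

```python
def clip_weights(weights, max_weight):
    """Cap each inverse propensity weight at max_weight."""
    return [min(w, max_weight) for w in weights]


def effective_sample_size(weights):
    """Kish's effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Hypothetical weights: one data point with a very low estimated propensity.
raw_weights = [1.0, 1.0, 1.0, 50.0]
clipped_weights = clip_weights(raw_weights, max_weight=5.0)

ess_raw = effective_sample_size(raw_weights)
ess_clipped = effective_sample_size(clipped_weights)
```

A single extreme weight dominates the raw estimate, so the effective sample size is barely above one data point; after capping, more of the data contributes to the weighted average.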
The good news is that DoWhy provides robustness tests to check many of these assumptions. Our next step will be to integrate refutations in the do-sampler to make it easier to catch errors in modeling or estimation.
- For an overview of causal inference techniques, check out a tutorial by Emre and Amit: https://causalinference.gitlab.io/kdd-tutorial/
- For an introduction to Pearlian causal graphs, check out The Book of Why. Or dive into the Causality book if you are brave enough!
- The logical rules applied to the causal graph are based on the do-calculus by Judea Pearl.
- To be precise, where the text says we control for the author, we are conditioning on the author variable. Controlling, strictly speaking, refers to intervening so that an author writes articles with titles of different lengths, as in a randomized experiment.
- The source for the do-sampler is available at https://www.github.com/Microsoft/dowhy.