Linear Regression for Causal Inference

A deeper dive into correlation vs causation.

Alison Yuhan Yao
CodeX
Feb 6, 2022


Photo by Sunder Muthukumaran on Unsplash

Correlation does not imply causation.

Yes, you’ve probably heard this many times. I’ve heard it so often throughout my undergraduate training in Data Science that it has left a mark on my soul for eternity. But it turns out that linear regression can do more than just prediction (using correlation). It can also help us make causal inferences! Maybe the relationship between correlation and causation is more complicated than we originally thought.

In this blog, let’s take a look at:

  • how to interpret causal graphs,
  • what confounders and spurious relationships are, and
  • how to conduct causal inference with linear regression through a simple example in R.

Causal Graph

We can represent a causal relationship with a causal directed acyclic graph (DAG), or causal graph for short. For example, this is a causal DAG.

Image by Author

A and B are 2 variables. The arrow from A to B indicates that A causes B. That’s pretty much it. DAGs paint a clear picture of your assumptions of the causal relationship between variables. And before you know it, the causal graph can get pretty complicated.

# a bit of R code
library(dagitty)
library(ggdag)
dag <- dagitty("dag{ A -> B; C -> B; A -> C; D -> C; E -> C; F -> E; F -> D; G -> A; G -> C; E -> A; }")
ggdag(dag)
Image by Author

However, as the name directed acyclic graph (DAG) suggests, no matter how complicated the causal graph gets, there are 2 rules:

  1. Every edge is unidirectional, meaning A and B cannot cause each other, because time only flows one way: there is always a temporal order between A and B. (Some diagrams do include bidirectional arcs, but those indicate that the two variables are dependent while the direction of causation is unclear [1].)
  2. There are no loops in the graph, meaning it is impossible for A to cause B, B to cause C, and then C to cause A (or any longer cycle involving more variables). For the same reason as above, if A happens first and C happens last, there cannot be an edge pointing from C to A [2].
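
To make rule 2 concrete, here is a minimal base-R sketch (no packages assumed, and `is_acyclic` is a name I made up for illustration) that checks whether a set of directed edges contains a cycle, using Kahn's topological-sort idea: repeatedly remove nodes with no incoming edges, and if every node can be removed, there is no loop.

```r
# Check whether a directed graph (a list of c(from, to) edges) is acyclic,
# by repeatedly removing nodes with zero in-degree (Kahn's algorithm).
is_acyclic <- function(edges) {
  nodes <- unique(unlist(edges))
  indeg <- setNames(rep(0L, length(nodes)), nodes)
  for (e in edges) indeg[e[2]] <- indeg[e[2]] + 1L
  queue <- names(indeg)[indeg == 0L]
  removed <- 0L
  while (length(queue) > 0) {
    n <- queue[1]; queue <- queue[-1]; removed <- removed + 1L
    for (e in edges) {
      if (e[1] == n) {
        indeg[e[2]] <- indeg[e[2]] - 1L
        if (indeg[e[2]] == 0L) queue <- c(queue, e[2])
      }
    }
  }
  removed == length(nodes)  # all nodes removed means no cycle
}

# A -> B, B -> C is a valid DAG; adding C -> A creates a forbidden loop
is_acyclic(list(c("A", "B"), c("B", "C")))              # TRUE
is_acyclic(list(c("A", "B"), c("B", "C"), c("C", "A"))) # FALSE
```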

Therefore, for variables that are correlated but not necessarily causally related, the edge in the causal graph may or may not exist, and if it does, its direction is not determined by the correlation alone. When we do believe a correlation points to a possible causation, we usually have a strong theoretical foundation to back up our assumption.

Now, we can use causal graphs to understand confounders and spurious relationships.

Confounder & Spurious Relationship

A confounder, or confounding variable, is a third variable that influences both the independent variable and the dependent variable. The apparent relationship it induces between the independent and dependent variables, one that does not actually exist, is called a spurious relationship.

Here, we introduce variable C as the sole confounder for independent variable A and dependent variable B. If C causes both A and B, and there is no direct causal link between them, we can get rid of the edge between A and B.

Image by Author

However, most of the time the edge between A and B is not so easy to cross out. There is rarely a single confounder; more often there is a long list of confounders C1 through Cn, and we can hardly hope to find them all.

A classic example is A = ice cream sales and B = crime rates. Statistics show that ice cream sales are highly correlated with crime rates, but does that mean selling more ice cream causes more crime? Your intuition probably tells you no. So, what could be the confounder C in this scenario?

Answer: Heat in the hot summer weather.

Image by Author

It is easy to see how heat can cause ice cream sales to rise. As for crime rates, perhaps people are more impulsive in hot weather and therefore more likely to commit crimes. We can speculate, but it is hard to prove. That is why a causal graph is a representation of our assumptions: we cannot say much for certain from observational data without doing experiments. Nevertheless, causal graphs are a simple and useful way to make our causal assumptions explicit.
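
We can watch this mechanism play out in a quick simulation (a made-up sketch, not real data): heat drives both ice cream sales and crime, with no direct link between the two, yet a naive regression still finds a strong association, which disappears once we control for heat.

```r
set.seed(123)
n <- 1000
heat <- rnorm(n, mean = 25, sd = 5)   # confounder: summer temperature
ice_cream <- 2 * heat + rnorm(n)      # heat drives ice cream sales
crime <- 0.5 * heat + rnorm(n)        # heat drives crime; no ice_cream term!

naive    <- lm(crime ~ ice_cream)        # omits the confounder
adjusted <- lm(crime ~ ice_cream + heat) # controls for it

coef(naive)["ice_cream"]     # ~0.25: a spurious association
coef(adjusted)["ice_cream"]  # ~0: vanishes once heat is controlled for
```

The spurious slope exists only because `ice_cream` carries information about `heat`; once `heat` is in the model, `ice_cream` has nothing left to explain.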

Example with R

Finally, we can start coding. In this project, we investigated the relationship between alcohol consumption and the prevalence of violence against women. The datasets involved here are:

  1. Prevalence of violence against women from 2019 OECD study
  2. Alcohol consumption data from The World Bank
  3. Poverty headcount data from The World Bank
Image by Author

The plot is quite strange. We would expect more alcohol consumption to be associated with, or even contribute to, a higher prevalence of violence against women, but the slope of the line is negative, indicating the exact opposite.

To be more precise, we can run a univariate linear regression.

library(brms)  # Bayesian regression via Stan
reg_uni <- brm(formula = Prevalence ~ Alcohol_pc,
               data = df,
               refresh = 0,  # suppress sampler progress output
               seed = 123)   # for reproducibility
summary(reg_uni)
Image by Author

We would expect alcohol to cause more violence against women, but the data shows us otherwise. This led us to question if there might have been a confounding variable causing this spurious relationship.
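
As a quick sanity check on the mechanics (using a small made-up data frame `df_demo`, not the real OECD/World Bank data), the same kind of negative univariate slope can be reproduced with base R's `lm()`:

```r
# Made-up stand-in for df, only to illustrate the univariate fit
df_demo <- data.frame(
  Prevalence = c(30, 25, 28, 20, 22, 18),
  Alcohol_pc = c(4, 6, 5, 9, 8, 10)
)
fit <- lm(Prevalence ~ Alcohol_pc, data = df_demo)
coef(fit)["Alcohol_pc"]  # ~ -1.96: a negative slope, like the real data
```

`lm()` gives point estimates rather than the posterior intervals `brm()` reports, but it is a fast way to eyeball the direction of an association.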

After doing some literature review, we chose poverty as our candidate confounder and used the poverty headcount ratio as the indicator for poverty. This indicator shows the percentage of each country’s population living below the national poverty line. We thought an indicator specific to each country was better than a global one because it takes into account each country’s own definition of poverty. Now, we can run a multiple linear regression to see whether the poverty headcount ratio might be a confounder.

library(brms)
reg_multi <- brm(formula = Prevalence ~ Alcohol_pc +
                           poverty_headcount_ratio,
                 data = df,
                 refresh = 0,  # suppress sampler progress output
                 seed = 123)   # for reproducibility
summary(reg_multi)
Image by Author

This time, our best estimate showed that a one-liter increase in total alcohol consumption was associated with a 0.96-percentage-point decrease in prevalence, with an uncertainty interval ranging from -1.65 to -0.22.

Compared to the earlier estimate, including the poverty headcount ratio reduced the size of the association by roughly a third (from 1.52 to 0.96). Therefore, part of the original association is explained away once poverty_headcount_ratio is included as a control variable.
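
This coefficient-shrinkage pattern can also be simulated (purely hypothetical numbers, not the study data): when a confounder drives both the exposure and the outcome, the naive coefficient is inflated, and adding the confounder as a control shrinks it toward the true direct effect.

```r
set.seed(42)
n <- 2000
poverty <- rnorm(n)                               # hypothetical confounder
alcohol <- -0.8 * poverty + rnorm(n)              # poverty -> alcohol
prevalence <- poverty - 0.5 * alcohol + rnorm(n)  # both affect prevalence

fit_naive    <- lm(prevalence ~ alcohol)           # omits the confounder
fit_adjusted <- lm(prevalence ~ alcohol + poverty) # controls for it

coef(fit_naive)["alcohol"]     # ~ -0.99: inflated by confounding
coef(fit_adjusted)["alcohol"]  # ~ -0.50: the true direct effect
```

Unlike the ice cream example, here the exposure has a genuine direct effect, so controlling for the confounder shrinks the coefficient without eliminating it, which mirrors what happened with poverty_headcount_ratio above.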

To draw a causal graph, we can run the following code.

library(dagitty)
library(ggdag)
vaw_dag <- dagify(Prevalence ~ ph + ac,
                  ac ~ ph,
                  labels = c("Prevalence" = "Prevalence",
                             "ph" = "Poverty Headcount",
                             "ac" = "Alcohol Consumption"))
# ggdag(vaw_dag, text = FALSE, use_labels = "label") # left graph
ggdag(vaw_dag, text = TRUE) # right graph
Image by Author

Every time you run the code, the graphs will look slightly different. I am not fully satisfied with either version: the labels on the left cover up the arrowheads, and the text on the right is cut off. You can run it a few times and pick the best-looking version.

References

[1] https://en.wikipedia.org/wiki/Causal_graph

[2] https://en.wikipedia.org/wiki/Directed_acyclic_graph

All code for this project can be found in this GitHub Repo.

Special thanks to Professor Kubinec for introducing me to causation and my wonderful teammate Oscar for helping me with project implementation.

Thank you for reading! I hope this has been helpful to you.
