Introduction to Causal Inference.

Dr Peter James Winn
10 min read · Jan 9, 2023

Article 4. Getting a hook into the statistics of causality. Techniques for the analysis of causal networks.

Old School Tools for Catching a Fish. Photo by Zab Consulting on Unsplash

Causal inference involves two fundamental questions: is there a causal relationship between our observables, and if so, how strong is it? The do-operator and causal diagrams allow us to define the problem, and do-calculus may allow us to transform it into one that we can address with standard statistical tools; tools that have been in use for a hundred years. These tools are discussed here, readying us to see whether we can recover causal information from data with known causal relationships, which we shall do in the next two articles. Those articles will thus give us a deeper understanding of how the causal inference process works. What follows here is an introduction to the statistical ideas of correlation, regression and stratification, with enough detail to understand what they are, what their results mean, why regression and stratification are similar to each other, and why they nonetheless differ. How does this all fit together in the context of causality, the prediction of observations, and the prediction of the effects of interventions?

Linear Regression and Causality.

Simple correlation and regression techniques can tackle the linear problems that we shall look at, including examples where confounding is important. Including confounders in a multiple regression can correct for confounding, provided that we don't also include a collider, as discussed in previous articles and exemplified in Article 6. The simplest linear regression determines an algebraic relationship between observations of two variables: the dependent (response) variable, usually labelled y, and the independent (explanatory) variable, x. The aim is to predict y from x. From a scientific perspective, we would like y to depend on x via a direct or indirect causal link, but sometimes it is sufficient simply to predict y, irrespective of any causal connection.

Variables y and x could be two properties of an object that change together with time, observed at different time points, or they might be two properties of a group of objects, e.g., the height and weight of a number of different people. If we wish to predict categories rather than values, e.g., whether a patient recovers from a disease, logistic regression is more appropriate, but we shall not engage with that here. Our focus will be the prediction of numerical data, for which linear regression is appropriate.

At its most basic, the relationship between y and x would be ŷ = mx + c, which is a straight line with m the gradient, c the intercept, and ŷ the predicted value of y. In statistics we add a further term, giving y = mx + c + r, where r is the residual, which represents how much each prediction, ŷ, deviates from the true value of y. The residuals may indicate that an important variable is missing from the model, e.g. a sinusoidal distribution around the line of best fit suggests that a sinusoidally varying explanatory variable is missing. For a good model the residuals represent the random scatter of the true values, y, around the predicted values, ŷ = mx + c. The residuals might still represent missing components of the model, but nonetheless the probability of a particular value of y, given a value of x, can be predicted. That is, for a given value of x, ŷ can be viewed as the mean of a Gaussian distribution from which the probability P(Y|X=x) can be calculated (Fig. 1). The idea of the regression line representing the conditional distribution of Y is important for understanding the relationship between stratification and regression, and we return to it later.
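As a rough illustration of the above, here is a minimal Python sketch, on invented data, that fits ŷ = mx + c by least squares and inspects the residuals; the variable names and numbers are my own assumptions, not anything from a real data set.

```python
# A minimal sketch of fitting y_hat = m*x + c and inspecting the residuals.
# The data here are synthetic and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=200)  # true m = 2, c = 1, Gaussian noise

# Least-squares fit of a straight line.
m, c = np.polyfit(x, y, deg=1)
y_hat = m * x + c
residuals = y - y_hat

print(f"m = {m:.2f}, c = {c:.2f}")
print(f"residual mean = {residuals.mean():.3f} (should be ~0)")
print(f"residual std  = {residuals.std():.3f} (spread of the Gaussian around the line)")
```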

Figure 1. Regression of y against x gives a line of best fit, ŷ = mx + c, where ŷ is the mean of the Gaussian distribution for the probability of seeing y given that value of x, P(y|x). Therefore, the actual data values of y should be distributed around the line of best fit according to the Gaussian distribution for that value of x. The variance around the line should be the same for every value of x.

The simplest extension of this regression is to add further explanatory (independent) variables to the model, so that y is now predicted using the values of multiple descriptors, x1, x2, x3, etc., where, for example, y might be the weight of a person, x1 their height, x2 their age, and x3 the amount of exercise they do per week. This gives a multiple regression line, y = m1 x1 + m2 x2 + … + mn xn + c + r. If the explanatory variables x1, x2, etc. are independent of each other, i.e. uncorrelated, the coefficients m1 … mn and the intercept c can be determined uniquely, either by linear algebra or via an optimisation technique such as steepest gradient descent. We shall not dig into the mechanics underlying these techniques; for us the important thing is that the gradients, m1, m2, …, tell us the “influence” that each variable x has over the value of y, sometimes referred to as the sensitivity of y to a change in x. However, the “influence” represented by m may not be causal, depending on the presence of confounders. The gradients, m, give the expected value of y for an observed x, but not necessarily how y changes if we actively change x.
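A hedged sketch of such a multiple regression, using made-up height, age and exercise data and scikit-learn (my choice of tool, not one mandated by the text): the fitted coefficients play the role of the gradients m1, m2, m3 described above.

```python
# Multiple regression sketch on invented data: weight predicted from
# height, age and weekly exercise.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
height = rng.normal(170, 10, n)        # x1, cm
age = rng.uniform(20, 70, n)           # x2, years
exercise = rng.uniform(0, 10, n)       # x3, hours/week
weight = 0.6 * height + 0.2 * age - 1.5 * exercise + rng.normal(0, 5, n)

X = np.column_stack([height, age, exercise])
model = LinearRegression().fit(X, weight)

# The fitted gradients m1, m2, m3 are the sensitivities of weight to each x.
print("gradients m1..m3:", model.coef_)
print("intercept c     :", model.intercept_)
```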

How complex a model needs to be depends on how accurate the prediction needs to be, and on whether the aim is to narrow in on possible causal details. A complex regression model might include squared and higher-order terms, cross terms (i.e. the multiplication of two different variables together), and so on. For many purposes, an argument can be made for a model that simply predicts observations well, without concern as to whether it encapsulates the causality of the system. However, focusing solely on prediction can be problematic, as seen with neural networks: prediction can rest on some obscure detail of the training set that does not transfer to similar data sets and situations, e.g. an artefact of the machines used to collect the data. This is a subtly different issue from the bias-variance trade-off that some readers may be familiar with. A model that represents direct causality, in some sense, is likely to be transferable between situations.

It may be possible, and sufficient for purpose, to make predictions from a spurious (i.e. non-causal) correlation, leaving confounders out of the model, perhaps because they are difficult to measure. To find the causal relationships, however, we need to include all confounders in the model, but no colliders; only then does the regression estimate the true influence of the model variables on each other. Whether a variable is a confounder requires insight into the system being modelled and can be difficult to determine objectively, as discussed in earlier articles. One way to identify confounders is to perform the analysis with and without putative confounders included, to see whether they change the predicted relationship. However, including a collider will create correlation between variables where no causality exists, so this is a problematic strategy.
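The following sketch, on simulated data where the true causal effect is known by construction, illustrates why omitting a confounder matters: the coefficient of x is biased when the confounder z is left out, and close to the true value when z is included. The data-generating numbers are arbitrary choices.

```python
# Illustration (with made-up data) of how leaving a confounder out of the
# regression biases the estimated effect of x on y.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 2000
z = rng.normal(size=n)                       # confounder: z -> x and z -> y
x = z + rng.normal(size=n)                   # x is partly driven by z
y = 0.5 * x + 2.0 * z + rng.normal(size=n)   # true direct effect of x on y is 0.5

without_z = LinearRegression().fit(x.reshape(-1, 1), y)
with_z = LinearRegression().fit(np.column_stack([x, z]), y)

print("coefficient of x, confounder omitted :", without_z.coef_[0])  # biased upwards
print("coefficient of x, confounder included:", with_z.coef_[0])     # close to 0.5
```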

Sensitivity, as represented by the model gradients, m1, m2, etc., may allow us to predict the effect of an intervention on the system of interest, but care is needed in interpreting it. If a confounding variable is left out of the analysis (or a collider included) then the predicted sensitivities in the model may include a contribution from the missing confounder, and thus won't represent direct causal effects. It is also important to normalise the descriptor variables, e.g. by dividing each by its standard deviation, so that the regression coefficients are on a common scale and the variables cover comparable ranges. Without normalisation the relative contributions of the different explanatory variables (X) to the prediction of the response variable (Y), as determined by their gradients/sensitivities, are not comparable.
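Below is a small sketch of that normalisation step, standardising each explanatory variable (subtracting its mean and dividing by its standard deviation) before regression so that the coefficients can be compared on a common, per-standard-deviation scale; the data and variable names are invented.

```python
# Sketch of standardising the explanatory variables so their coefficients
# are comparable; variable names and values are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 1000
height_cm = rng.normal(170, 10, n)           # spread ~10
exercise_hours = rng.uniform(0, 10, n)       # spread ~3
weight = 0.5 * height_cm - 2.0 * exercise_hours + rng.normal(0, 5, n)

X = np.column_stack([height_cm, exercise_hours])
X_std = StandardScaler().fit_transform(X)    # zero mean, unit standard deviation

raw = LinearRegression().fit(X, weight)
std = LinearRegression().fit(X_std, weight)

print("raw coefficients         :", raw.coef_)  # different units, not comparable
print("standardised coefficients:", std.coef_)  # effect per one standard deviation
```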

Including descriptive features (i.e., independent variables) that are highly correlated with each other also introduces problems, since the regression problem then has no unique solution. This can be seen readily from a linear algebra treatment of the problem, or by repeatedly solving it numerically with gradient descent from random starting values: the regression coefficients of the correlated variables can take any of an infinite number of values. This is discussed further in the articles that follow this one.
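The instability caused by highly correlated descriptors can be seen in a few lines of simulation: refitting on resampled data gives very different individual coefficients for two nearly identical variables, even though their sum stays roughly constant. Again, the numbers below are illustrative assumptions.

```python
# Sketch of how highly correlated explanatory variables make the individual
# coefficients unstable (simulated data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # x2 almost identical to x1
y = 3.0 * x1 + rng.normal(scale=1.0, size=n)

# Refit on bootstrap resamples: m1 and m2 swing wildly,
# while their sum stays close to 3.
for seed in range(3):
    idx = np.random.default_rng(seed).integers(0, n, n)
    coefs = LinearRegression().fit(np.column_stack([x1[idx], x2[idx]]), y[idx]).coef_
    print(f"resample {seed}: m1 = {coefs[0]:7.2f}, m2 = {coefs[1]:7.2f}, "
          f"m1 + m2 = {coefs.sum():.2f}")
```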

Since one can rarely be sure that all confounders, and no colliders, have been included in a model, we are left with uncertainty about the validity of the estimated causal contributions. Some confidence in the predictive value of the model can be gained by perturbing its functional form, e.g. changing a linear term to a quadratic term, or removing or adding a variable, and seeing whether the model is robust to those modifications, i.e. it still gives a good prediction. However, robustness of predictions does not confirm a good causal model, and even a model with a correct causal diagram may suffer residual confounding due to inaccurate measurements, which may not fully capture the contribution of every variable in the model.
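As a rough example of such a robustness check (one of several possible), the sketch below refits a simulated model with a perturbed functional form and compares the coefficient of interest and the quality of prediction.

```python
# A small robustness sketch (synthetic data): refit with an extra quadratic
# term and check whether the coefficient of interest and the fit quality
# stay roughly the same.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.5 * x1 + 0.8 * x2 + rng.normal(scale=1.0, size=n)

X_lin = np.column_stack([x1, x2])
X_quad = np.column_stack([x1, x2, x1**2])
linear = LinearRegression().fit(X_lin, y)
quadratic = LinearRegression().fit(X_quad, y)

print("coef of x1, linear form     :", round(linear.coef_[0], 2))
print("coef of x1, with x1^2 added :", round(quadratic.coef_[0], 2))
print("R^2, linear vs perturbed    :",
      round(linear.score(X_lin, y), 2), round(quadratic.score(X_quad, y), 2))
```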

Measuring Correlation.

As well as calculating a predictive relationship between the variables, we usually calculate the correlation between them. Most commonly this is done pairwise, via the Pearson or Spearman rank correlation coefficients. These give a value of 1 for perfect correlation, 0 for no correlation, and -1 for perfect correlation with a negative gradient. Alongside these is the partial correlation, which involves multiple variables and aims to remove spurious correlation between them. Many related metrics exist, e.g. mutual information, but in the next article we shall only use Pearson's pairwise correlation.
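For reference, the pairwise measures mentioned above can be computed with scipy, as in this small sketch on synthetic data.

```python
# Minimal sketch of the pairwise correlation measures, using scipy
# on made-up data.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(5)
x = rng.normal(size=300)
y = 0.8 * x + rng.normal(scale=0.5, size=300)

r_pearson, _ = pearsonr(x, y)      # linear correlation, between -1 and 1
r_spearman, _ = spearmanr(x, y)    # rank (monotonic) correlation
print(f"Pearson  r = {r_pearson:.2f}")
print(f"Spearman r = {r_spearman:.2f}")
```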

As well as measuring whether two variables, x and y, are correlated, Pearson's correlation (r) can be used to measure how good a linear regression model is, since the predictions from a “perfect” model should correlate perfectly with the actual values of the data used for training. Here, perfect means zero residual error, which might be a result of overfitting the model to its training data, but overfitting is not an issue that will arise in our examples. So, a perfect least-squares regression should have a correlation of 1 between the predictions and the dependent data. In practice, the square of the correlation coefficient, r squared, is used, since this is equivalent to the coefficient of determination, which measures the fraction of the variance in the data that is explained by the regression. A value of 1 implies that all the variance is explained, and 0 implies that the model is no better than estimating y from the mean value of the data (Fig. 2). For completeness, I note that for very poor models the mean value can be a better predictor than the model, in which case the coefficient of determination is negative, but further explanation of that is not needed here.
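The sketch below, again on invented data, checks the equivalence described above: for a least-squares fit, the squared Pearson correlation between the predictions and the observed values matches the coefficient of determination reported by scikit-learn.

```python
# For least-squares linear regression, the coefficient of determination
# equals the squared Pearson correlation between predictions and observed y.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 300)
y = 2.0 * x + rng.normal(scale=3.0, size=300)

model = LinearRegression().fit(x.reshape(-1, 1), y)
y_hat = model.predict(x.reshape(-1, 1))

r, _ = pearsonr(y, y_hat)
print(f"r squared                    = {r**2:.3f}")
print(f"coefficient of determination = {r2_score(y, y_hat):.3f}")  # same value
```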

Figure 2. The change in the value of the coefficient of determination for different intervariable relationships. Left: Y = X; Middle: Y = 0.5 X + 0.5 B; Right: Y and X are independent, pseudo-random integers between 1 and 15, inclusive.

Stratification for confounding.

Rather than calculating a full regression model, the effect of confounding can be estimated by selecting a subset of the data in which one or more potentially confounding variables are fixed at a single value (or a very narrow range of values). This process is known as stratification, and any spurious correlations driven by the stratified variable(s) will be removed, provided they are not colliders. For example, in medicine weight is often controlled for, to ensure that any difference between a treatment group and a non-treatment group is not due to differences in weight. Similarly, age is often controlled for, to ensure that one group is not much older than another. By stratifying the data and only comparing, e.g., like weight with like weight, we don't have to concern ourselves with the functional form that weight might take in the question being asked, unlike regression, which does require us to have the correct functional form. An example illuminating this point can be found at the end of the article by McNamee, given in the further learning section.

Linear regression can be seen as a form of stratification. When we stratify, we are calculating the probability of the dependent variable given the value(s) of the stratified variables, i.e. P(Y|X=x), where X represents all stratified variables, which we assume are confounding. We do the same when we read a value from a linear regression: Y = mX + c tells us the value of Y given that we know the value of X, i.e. for every value of X it gives us P(Y|X=x). Indeed, regression can be considered as stratification in the limit of infinitesimally narrow strata and is thus usually preferred to stratification, but it has the disadvantage that the functional form needs to be known to fit the regression properly. The pros and cons of both methods are documented elsewhere (e.g., McNamee).
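To make the stratification idea concrete, the following sketch (synthetic data, arbitrary bin count) bins a confounder z into narrow strata and compares the overall x-y correlation with the within-stratum correlations, which drop towards zero because x has no direct effect on y in the simulated system.

```python
# Stratification sketch on invented data: the overall x-y correlation is
# spurious (driven by the confounder z) and largely vanishes within strata
# where z is held nearly constant.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 5000
z = rng.normal(size=n)                       # confounder
x = z + rng.normal(scale=0.5, size=n)        # no direct x -> y effect
y = 2.0 * z + rng.normal(scale=0.5, size=n)

df = pd.DataFrame({"x": x, "y": y, "z": z})
df["z_stratum"] = pd.cut(df["z"], bins=20)   # narrow bands of z

within = df.groupby("z_stratum", observed=True)[["x", "y"]].apply(
    lambda g: g["x"].corr(g["y"]))

print("overall x-y correlation        :", round(df["x"].corr(df["y"]), 2))
print("mean within-stratum correlation:", round(within.mean(), 2))
```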

Conclusions

Corrections for confounding/colliders are only necessary where the data are observations from the population. If a randomised trial is used to test an intervention, then the randomisation process acts as a control for all confounders/colliders and, most importantly, it allows estimation of our confidence in the result. However, randomised controlled trials are complex and expensive to run. Do-calculus and causal diagrams allow us to deal with colliders and confounding, and when harnessed to observational studies and classical statistics they provide a potentially quicker and cheaper route to causal inference than randomised controlled trials. Such classical tools include linear regression, measures of correlation such as Pearson's correlation coefficient, and stratification. However, caution is warranted, since causal inference needs an understanding of the underlying causal network, and experimental confirmation of any inference is imperative.

This article has ignored issues such as logistic regression, estimating model quality using test and training sets, methods for estimating statistical significance, and how to deal with dummy variables and collinearity in multiple linear regression, amongst many other important topics. However, it provides enough detail to prepare the reader for the next articles in the series, and an overview that I hope motivates readers to dig deeper into the topics presented.

Further Learning

R. McNamee, Regression modelling and other methods to control confounding, Occupational and Environmental Medicine, 62(7); http://dx.doi.org/10.1136/oem.2002.001115

https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html

https://en.wikipedia.org/wiki/Coefficient_of_determination
