To avoid bias caused by confounding, you must control or condition on the right variables. But how do you know which ones?

Causal Wizard app
7 min read · Jun 15, 2023


Observational studies look at existing, historical data and try to understand and model the interactions between variables — for example, does a drug cure a disease, or do people simply recover?

Since observational studies do not randomize the assignment of subjects to “treated” or “control” groups, there is always the risk that confounding variables affect the results of the study — for example, by making it appear that the drug has a beneficial effect, when in reality most who received the drug recovered due to better overall health. In this case “better overall health” is a confounding variable.

To correctly estimate the effect of the drug, we must control for the effect of “better overall health”. Controlling for a variable means including it as a covariate in a statistical model to account for its potential influence on the outcome of interest.

Controlling for and Conditioning on Variables

Controlling for and conditioning on a variable are two statistical techniques used to address confounding variables, giving more accurate estimates of causal effects. They are sometimes used interchangeably, which creates some confusion.

Controlling for a variable involves including the variable as a covariate in the statistical model. The model adjusts for the potential influence of the confounding variable on the outcome of interest. For example, if we want to estimate the effect of a new drug on blood pressure, we may need to control for age, as older people tend to have higher blood pressure regardless of the drug they take.
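To make this concrete, here is a minimal sketch of controlling for a confounder by including it as a covariate in a least-squares model. The data is simulated and all variable names and effect sizes are invented for illustration; only NumPy is used, with a hand-rolled regression rather than any particular statistics library:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical, simulated data: age drives both treatment uptake and blood pressure.
age = rng.normal(50, 10, n)
drug = (0.1 * age + rng.normal(0, 5, n) > 5).astype(float)   # older people treated more often
bp = 120 + 0.5 * age - 3.0 * drug + rng.normal(0, 5, n)      # true drug effect: -3.0

def ols(y, *cols):
    """Least-squares fit with an intercept; returns the slope coefficients."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

naive = float(ols(bp, drug)[0])          # ignores age: biased
adjusted = float(ols(bp, drug, age)[0])  # controls for age
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```

Because older subjects are both more likely to receive the drug and have higher blood pressure, the naive estimate understates the drug's benefit; adding age as a covariate recovers an estimate close to the true effect of −3.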

Conditioning on a variable involves stratifying the data according to the values of the variable and analyzing each stratum separately. This method effectively eliminates the confounding variable by restricting each model to a subset of the data. For example, we could condition on sex and estimate the effect of the drug separately for men and women. However, note that conditioning on variables reduces the size of the dataset in each stratum, and as the number of conditioning variables increases, each subset shrinks very rapidly.
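The stratified approach can be sketched as follows, again on simulated, hypothetical data (the uptake rates and effect sizes are invented): the effect is estimated separately within each stratum of the conditioning variable.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000

# Hypothetical, simulated data: sex (0/1) influences both treatment uptake and outcome.
sex = rng.integers(0, 2, n)
treated = (rng.random(n) < 0.3 + 0.4 * sex).astype(int)           # uptake: 30% vs 70%
outcome = 10.0 + 4.0 * sex + 2.0 * treated + rng.normal(0, 1, n)  # true effect: +2.0

# Condition on sex: estimate the effect separately within each stratum.
effects = {}
for s in (0, 1):
    m = sex == s
    effects[s] = outcome[m & (treated == 1)].mean() - outcome[m & (treated == 0)].mean()
    print(f"stratum sex={s}: estimated effect = {effects[s]:.2f}")
```

Within each stratum the confounder is constant, so the simple difference in means recovers the true effect of +2 — at the cost of splitting the sample into smaller subsets.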

Both controlling for and conditioning on variables can be useful in different scenarios, depending on the research question and the available data. However, it is important to note that these methods assume that confounding variables have been correctly identified and measured. If there are unmeasured confounders, then these techniques may not fully address the issue of confounding.

Controlling and conditioning on too many variables

But there’s another problem: controlling or conditioning on too many variables, or on the wrong ones (such as colliders, explained below), also creates bias and reduces accuracy! So how can you determine which variables should be controlled or conditioned on?

The solution is to capture your understanding of the relationships — or suspected relationships — between variables. The most common format is known as a Causal Diagram, a Directed Acyclic Graph (DAG) in which the presence of an edge implies a direct causal relationship between two variables, and the absence of an edge means that no direct causal relationship exists. This is an important assumption.

Example DAG from Wikipedia

Numerous tools are available to create or draw Causal Diagrams. Two of the more popular are Causal Wizard app and DAGitty. Of the two, we find Causal Wizard more convenient and user-friendly, although DAGitty offers more general DAG / graph-drawing functionality.

Alternatives to Causal Diagrams

Popular alternatives to Causal Diagrams include Bayesian Networks, Structural Equation Models (SEMs) and Structural Causal Models (SCMs).

Bayesian Networks (BN) and Causal Diagrams are both graphical models used to represent probabilistic relationships between variables. However, there are some differences between them:

  • Bayesian Networks represent relationships as conditional probability distributions, which may or may not accurately capture causal effects, whereas Causal Diagrams only capture the existence and direction of explicitly causal relationships;
  • Bayesian Networks are primarily used for probabilistic inference, whereas Causal Diagrams are used to make causal inferences;
  • Additional data, modelling and/or assumptions are necessary to build a Bayesian Network; typically, the conditional distributions are created from data, whereas Causal Diagrams are easy to create from prior knowledge;
  • You can “do” more with a Bayesian Network, such as generating quantitative inferences, but the downside is that you have to do more work to construct it.

SCMs consist of a set of Endogenous (V) and Exogenous (U) variables connected by a set of functions F, one per variable in V, that determine the value of each endogenous variable given the values of its parents in U and V. SCMs capture causal relationships. The functions in F are very general, and usually require data to define.
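A toy example may help. The sketch below (invented functions and noise, NumPy only) defines a two-variable SCM and shows how an intervention do(x := 1) is modelled: the function for x is replaced by a constant while the other functions and the exogenous noise are kept unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Exogenous variables U = {u1, u2}: determined outside the model.
u1 = rng.normal(size=n)
u2 = rng.normal(size=n)

# Endogenous variables V = {x, y}, each assigned by a function in F.
x = u1              # f_x(u1)
y = 2.0 * x + u2    # f_y(x, u2): x causes y

# Intervention do(x := 1): replace f_x with the constant 1,
# keeping f_y and the same exogenous noise.
y_do = 2.0 * 1.0 + u2
print(f"mean of y under do(x := 1): {y_do.mean():.2f}")
```

Under the intervention, the mean of y is close to 2 — exactly what f_y predicts when x is forced to 1, regardless of how x would have arisen naturally.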

SEMs are typically used to analyse multivariate causal relationships, given assumed causal relationships (i.e. you implicitly create a causal diagram to define the equations in a SEM).

Both SEMs and SCMs can be transformed into Causal Diagrams; BNs do not necessarily capture causal relations, so direct transformation may not be possible. A related area of research, known as Causal Discovery, aims to recover the topology implied by a SCM, Bayesian Network or Causal Diagram from data.

Finding the right variables to control and condition

So how does a Causal Diagram identify which variables should be controlled or conditioned?

The image below shows three common situations in a Causal Diagram: Confounding variables, Collider variables and Mediator variables. Note the direction of the red arrows, which varies between the three graphs:

Definitions of key variable types in a Causal Diagram: Confounders, Colliders and Mediator variables.

These three scenarios are the most common situations that determine whether you should control or condition on a variable.

In all these examples, there are also two special variables: Treatment, and Outcome. A treatment variable is like an independent variable — one you want to change. For example, you might want to change the treatment variable from “no drug given — control subject” to “drug given — test subject”. The outcome variable is the one you want to observe the effect on (aka dependent variable).

The remainder of this article will explain the thinking and intuition behind choosing which variables to condition or control using these variable types (treatment, outcome, colliders, confounders and mediators). We will also cover some more subjective factors (e.g. rare variable values).
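As a rough sketch of how these roles can be read off a DAG, the following plain-Python example classifies each node relative to the Treatment (T) and Outcome (Y) using simple reachability. The five-node graph is invented for illustration, not from any real study:

```python
# A hypothetical DAG: parent -> list of children.
edges = {
    "Z": ["T", "Y"],      # Z causes both T and Y
    "T": ["M", "C", "Y"],
    "M": ["Y"],           # T -> M -> Y
    "Y": ["C"],           # T -> C <- Y
    "C": [],
}

def descendants(node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), list(edges[node])
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(edges[n])
    return seen

d_t, d_y = descendants("T"), descendants("Y")
roles = {}
for v in edges:
    if v in ("T", "Y"):
        continue
    d_v = descendants(v)
    if "T" in d_v and "Y" in d_v:
        roles[v] = "confounder"   # causes both: control for it
    elif v in d_t and v in d_y:
        roles[v] = "collider"     # caused by both: do NOT control
    elif v in d_t and "Y" in d_v:
        roles[v] = "mediator"     # on the causal path: do NOT control
    else:
        roles[v] = "other"

print(roles)
```

This is only a heuristic for the three textbook cases; tools like DAGitty and Causal Wizard apply the full graphical criteria (e.g. the back-door criterion) to arbitrary diagrams.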

Causal Wizard app will identify these variable types for you automatically when you draw a Causal Diagram. If you upload your data, it will also estimate causal effects.

You might also want to check out this two-part video which describes how to choose which variables to control / condition on using DAGitty:

How to choose variables to condition on (part 1 of 2)
How to choose variables to condition on (part 2 of 2)

Confounding variables

Do control for confounders.

Confounding variables affect both the Treatment and Outcome, directly or indirectly.

It is generally important to control for confounding variables to reduce bias and estimate effects accurately. However, there are situations where controlling for other types of variable actually creates or increases bias!

Collider variables

Do not control for colliders.

A collider variable is affected by both Treatment and Outcome, directly or indirectly, and is not on the causal path between them. Conditioning on or controlling for a collider opens a spurious, non-causal association between Treatment and Outcome, and thereby biases the estimated causal effect.
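A small simulation (hypothetical data, not from any study) shows why. Below, the treatment and outcome are truly independent, yet restricting the sample by the collider's value induces a spurious association:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000

# Simulated data: treatment t has NO effect on outcome y (true effect = 0),
# but both feed into a collider c.
t = rng.normal(size=n)
y = rng.normal(size=n)
c = t + y + rng.normal(size=n)   # collider: caused by both t and y

def slope(yv, xv):
    """Simple regression slope of yv on xv."""
    return float(np.cov(xv, yv)[0, 1] / np.var(xv))

b_all = slope(y, t)               # close to 0: unbiased
keep = c > 0                      # condition on the collider (keep only high-c records)
b_cond = slope(y[keep], t[keep])  # a spurious negative association appears
print(f"full sample: {b_all:.2f}, conditioned on c > 0: {b_cond:.2f}")
```

Intuitively, among records with high c, a low t must be "compensated" by a high y (and vice versa), which is exactly the induced negative association.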

Mediator variables

Do not control for mediators.

A Mediator variable lies on the causal path between the Treatment and Outcome: it transmits (part of) the causal effect of the Treatment to the Outcome. Controlling for a mediator blocks the very effect you want to estimate, which at best biases the estimate and at worst removes it entirely.
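The following sketch, on simulated data with invented coefficients, shows the estimated effect disappearing when the mediator is controlled for:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10000

# Simulated data: the treatment's effect flows entirely through a mediator, t -> m -> y.
t = rng.normal(size=n)
m = 2.0 * t + rng.normal(size=n)
y = 3.0 * m + rng.normal(size=n)   # total effect of t on y: 2 * 3 = 6

def ols(yv, *cols):
    """Least-squares fit with an intercept; returns the slope coefficients."""
    X = np.column_stack([np.ones(len(yv)), *cols])
    beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
    return beta[1:]

total = float(ols(y, t)[0])        # ~6: the total effect we want
blocked = float(ols(y, t, m)[0])   # ~0: controlling for m blocks the causal path
print(f"total effect: {total:.2f}, after controlling for mediator: {blocked:.2f}")
```

Once m is held fixed, t carries no additional information about y, so its coefficient collapses to zero even though the true total effect is 6.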

Over-controlling

Do not over-control.

Over-controlling occurs when a variable is controlled for unnecessarily, even though it is neither a confounder nor otherwise related to the causal relationship of interest. Over-controlling can create bias by artificially reducing variation in the treatment variable and forcing the model to learn a less general form of the relationship.

The desire to avoid over-controlling can conflict with the need to control for every suspected confounder in your causal diagram.

Rare variable values

Controlling for variables with rare values has pros and cons.

If the variable you want to control for has some rare values, or limited variation, controlling for it may yield models that overfit, generalise poorly, or produce biased estimates.

However, if the variable is a genuine confounder, it is usually still preferable to include it despite its rare values.

In cases like this, or when trying to avoid over-controlling, there is no single right answer to whether you should control for the variable. It may be worth performing multiple analyses, with and without controlling for it, and comparing the results.

Unobserved confounders

It is not possible to control for unobserved confounders — because they are not observed!

These variables are not available, i.e. not present in your data, but they are in your Causal Diagram and they are confounders due to their relationships with Treatment and Outcome variables.

Unobserved confounders can prevent successful inference of causal effects. You can remove them from the graph, but then any estimated effect may in part be due to the unobserved confounders rather than the observed Treatment.

One solution is to document unobserved confounders which you suspect may exist but which you have not included in the Causal Diagram. You might exclude them from the diagram because:

  • You want to keep the diagram simple, or
  • Including them prevents successful identification of the causal effect you want to measure

Either way, it’s crucial you consider any inference results with the knowledge that the effects of these unobserved confounding variables have not been accounted for. In some cases, you may have prior knowledge that their effect is limited and you’re willing to ignore them.

Summary

Hopefully this article helped you to understand what controlling and conditioning on a variable means, how a Causal Diagram can help you identify which variables to control or condition on, and the factors that go into that decision. We hope you enjoy continuing to learn about causality and data science!


Causal Wizard app

https://causalwizard.app Blogging about Causal Inference case studies, and attacking the ARC challenge in 2023. Two things somewhat related through Machine Learning.