Understanding Confounding Variables: A Comprehensive Guide
Confounding variables pose a significant challenge in research and decision-making by obscuring the true causal relationships between variables of interest. This article explores the concept of confounding variables, how to identify them, and methods for adjusting for them, through theoretical explanations and practical coding examples.
Understanding Variables in Causal Inference
What is Causal Inference?
Causal inference is the process of drawing conclusions about causal relationships between variables based on observational data or experimental studies. It aims to understand how changes in one variable influence changes in another variable, elucidating cause-and-effect relationships.
Introduction to Variables in Causal Inference
In causal inference, understanding variables is crucial for accurately determining causal relationships between phenomena. Let’s explore the fundamental concepts of independent, dependent, and confounding variables.
Independent Variable: The Cause
- The independent variable is the variable manipulated or controlled by the researcher to observe its effect on the outcome.
- Example: Consider a study evaluating the effect of a new drug on patient recovery time after surgery. The independent variable here is the dosage of the new drug administered to patients.
- In statistical models, the independent variable represents the treatment or intervention being tested for its causal effect.
Dependent Variable: The Effect
- The dependent variable is the outcome variable influenced by changes in the independent variable.
- Example: Continuing with the drug study, the dependent variable would be the patients' recovery time after surgery.
- In statistical models, the dependent variable is the response variable affected by variations in the independent variable.
Confounding Variable: The Hidden Influence
Confounding variables are external factors that affect both the independent and dependent variables, leading to erroneous conclusions about causality.
Example: In a study evaluating the effectiveness of a new drug, a confounding variable could be the patients’ age. Age may affect both the metabolism of the drug and the recovery time, thus influencing the observed relationship between drug dosage and recovery time.
Confounding variables introduce bias into causal inference models, making it challenging to isolate the true effect of the independent variable. Failure to account for confounding variables can lead to inaccurate conclusions about the relationship between the independent and dependent variables.
Analyzing Relationships Between Variables
This code example demonstrates a scenario involving three variables (X, Z, and Y) whose observed relationships are shaped by an underlying causal structure that includes a confounding variable. The goal is to visualize and understand how the relationships between these variables arise from that structure.
We hypothesize that X influences Y. In the data-generating process below, however, a confounding variable Z drives both X and Y, which will bias our analysis if not properly addressed.
By generating random data based on this causal structure and visualizing the relationships using scatter plots, we aim to gain insights into the interplay between variables and the impact of confounding on causal inference.
import numpy as np
import matplotlib.pyplot as plt
# Define the causal structure and generate random data
np.random.seed(42)
# Z influences X
Z = np.random.rand(100)
X = Z + 0.2 * np.random.rand(100)
# Z influences Y
Y = Z + 0.3 * np.random.rand(100)
# Visualize the relationships between X, Z, and Y
plt.figure(figsize=(12, 8))
# Scatter plot of X and Z
plt.subplot(2, 2, 1)
plt.scatter(Z, X, color='blue')
plt.xlabel('Z')
plt.ylabel('X')
plt.title('Relationship between Z and X')
# Scatter plot of Z and Y
plt.subplot(2, 2, 2)
plt.scatter(Z, Y, color='green')
plt.xlabel('Z')
plt.ylabel('Y')
plt.title('Relationship between Z and Y')
# Scatter plot of X and Y
plt.subplot(2, 2, 3)
plt.scatter(X, Y, color='red')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Relationship between X and Y')
plt.tight_layout()
plt.show()
Interpretation
Relationship between Z and X:
- Linear Relationship: The scatter plot between Z and X exhibits a clear linear trend, where increases in Z are associated with corresponding increases in X, and vice versa.
- Strong Association: The points in the scatter plot closely align around the linear trendline, indicating a strong association between Z and X.
- Little Scatter: The points cluster tightly around the trendline, suggesting minimal variability or scatter in the relationship between Z and X.
Relationship between Z and Y:
- Linear Relationship: Similar to the Z-X scatter plot, the Z-Y scatter plot also shows a linear trend, indicating that increases in Z are associated with increases in Y, and vice versa.
- Moderate Association: While the points in the scatter plot follow a linear pattern, there is slightly more variability compared to the Z-X plot, suggesting a somewhat weaker association between Z and Y.
- Moderate Scatter: The points in the scatter plot exhibit some dispersion around the trendline, indicating moderate variability in the relationship between Z and Y.
Relationship between X and Y:
- Linear Relationship: The scatter plot between X and Y also shows a linear trend: increases in X are associated with increases in Y.
- Weaker Association: Compared to the Z-X and Z-Y scatter plots, the X-Y scatter plot exhibits more variability and scatter around the trendline, indicating a weaker association between X and Y.
- Increased Scatter: The points are more dispersed around the trendline, indicating greater variability in the relationship between X and Y. This is what confounding looks like in practice: in our simulation, the X-Y association exists only because Z drives both variables, with no direct causal link from X to Y.
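To quantify these visual impressions, we can compute the pairwise Pearson correlations among the three simulated variables. This is a quick check, assuming the arrays X, Y, and Z from the code block above are still in scope:
import numpy as np
# Pairwise Pearson correlations among the simulated variables
corr_matrix = np.corrcoef([X, Y, Z])
print("corr(X, Z):", corr_matrix[0, 2])
print("corr(Y, Z):", corr_matrix[1, 2])
print("corr(X, Y):", corr_matrix[0, 1])
Given the noise scales used above (0.2 for X and 0.3 for Y), we would expect corr(X, Z) to be the strongest, corr(Y, Z) somewhat weaker, and corr(X, Y) the weakest of the three, mirroring the scatter plots.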
from mpl_toolkits.mplot3d import Axes3D
# Create a 3D scatter plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
# Plot the data points for X, Y, and Z
ax.scatter(X, Y, Z, c='b', marker='o')
# Set labels and title
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.set_title('3D Scatter Plot of X, Y, and Z')
plt.show()
- This plot indicates an approximately linear relationship among our variables X, Y, and Z.
- The points also cluster tightly around a plane, showing a strong association among X, Y, and Z.
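We can make the impact of the confounder concrete with a quick regression comparison. The sketch below assumes the simulated arrays X, Y, and Z are still in scope and uses statsmodels (as do the adjustment examples later in this article), regressing Y on X alone and then on X and Z together:
import numpy as np
import statsmodels.api as sm
# Naive model: regress Y on X only, ignoring the confounder Z
naive_model = sm.OLS(Y, sm.add_constant(X)).fit()
print("Naive coefficient on X:", naive_model.params[1])
# Adjusted model: include the confounder Z as a covariate
adjusted_model = sm.OLS(Y, sm.add_constant(np.column_stack([X, Z]))).fit()
print("Adjusted coefficient on X:", adjusted_model.params[1])
print("Coefficient on Z:", adjusted_model.params[2])
Because Z drives both X and Y in our simulation and there is no direct X-to-Y effect, the naive coefficient on X is large, while the adjusted coefficient shrinks toward zero once Z is included. That gap is precisely the confounding bias the rest of this article is concerned with.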
Requirements for Confounding Variables
1. Correlation with the Independent Variable
- A confounding variable must be correlated with the independent variable (the variable of interest).
- Confounding variables often share a relationship with the independent variable, meaning changes in one variable are associated with changes in the other. This correlation can lead to associations between the independent variable and the dependent variable being falsely attributed to the confounding variable instead of the true causal relationship.
2. Causal Relationship with the Dependent Variable
- A confounding variable must have a causal relationship with the dependent variable (the outcome of interest).
- Confounding variables influence the outcome being studied. They can directly affect the dependent variable, leading to changes in its values independent of the independent variable. If this causal relationship is not accounted for, it can distort the observed associations and lead to incorrect conclusions about causality.
Why Confounding Variables Are Problematic
Misinterpretation of Cause and Effect:
Confounding variables can create the illusion of a cause-and-effect relationship between the independent and dependent variables when none exists. This misinterpretation can lead to incorrect conclusions about the true causal mechanisms underlying the observed associations.
Masking True Relationships:
Confounding variables can mask or obscure the true relationships between the independent and dependent variables. By introducing bias into the analysis, confounders may distort the observed associations, making it challenging to discern the genuine effects of the independent variable on the dependent variable.
Threat to Internal Validity:
Confounding variables pose a threat to the internal validity of a study, which refers to the extent to which observed changes in the dependent variable can be attributed to changes in the independent variable. When confounding variables are present and not adequately controlled for, it becomes difficult to establish a causal link between the independent and dependent variables with confidence.
Difficulty in Causal Inference:
Identifying and adjusting for confounding variables is essential for accurate causal inference. Failure to account for confounding can lead to biased estimates of causal effects and undermine the validity of research findings. Addressing confounding variables requires careful study design, data collection, and statistical analysis techniques to minimize their impact on the results.
Identifying Confounding Variables
(A) Constraints for Identifying Confounding Variables:
- Association with Risk Factor (Y): A variable, X, is considered a confounding variable if it is associated with the risk factor (Y) in the control group. For example, in a study investigating the association between smoking (Y) and lung cancer, if a confounding variable like air pollution (X) is also associated with smoking prevalence among non-cancer patients, it satisfies this constraint.
- Association with Outcome (Z): X should also be associated with the outcome (Z) in the absence of Y. For instance, if air pollution (X) is independently associated with lung cancer (Z) among non-smokers, it meets this criterion.
- Not an Intermediate Step: X should not lie on the causal pathway between the risk factor (Y) and the outcome (Z). In our example, if smoking increased individuals' exposure to air pollution, which in turn caused lung cancer, then air pollution would be a mediator rather than a confounder, violating this constraint.
(B) Test of Association by Stratifying on X:
- This method involves stratifying the data based on the potential confounding variable X and then comparing the odds ratios (ORs) between the risk factor (Y) and the outcome (Z) within each stratum.
- For instance, if we stratify data based on air pollution levels (low, medium, high) and calculate the OR for lung cancer (Z) associated with smoking (Y) within each stratum, we can assess if the strength of association remains consistent across strata.
- If the ORs are similar within strata and significantly different from the overall crude OR, it suggests that air pollution is a confounding variable.
import pandas as pd
import numpy as np
# Function to assign smoking status based on age
def assign_smoking_status(age):
    if age < 40:
        return np.random.choice([0, 1], p=[0.8, 0.2])
    else:
        return np.random.choice([0, 1], p=[0.2, 0.8])
# Generate synthetic data
n = 1000
age = np.random.randint(20, 81, size=n)
smoking = np.array([assign_smoking_status(a) for a in age])
# Create DataFrame
df = pd.DataFrame({'Age': age, 'Smoking': smoking})
# Define age groups
age_bins = [20, 40, 60, 80]
df['AgeGroup'] = pd.cut(df['Age'], bins=age_bins)
# Define a function to calculate the odds (cases / non-cases) within each stratum
def calculate_odds_ratio(df, stratifying_variable, outcome_variable):
    odds_ratios = {}
    for group, group_data in df.groupby(stratifying_variable):
        num_cases = group_data[outcome_variable].sum()
        num_non_cases = len(group_data) - num_cases
        odds_ratio = num_cases / num_non_cases if num_non_cases > 0 else np.inf
        odds_ratios[group] = odds_ratio
    return odds_ratios
# Calculate the odds of smoking within each age group
odds_ratios = calculate_odds_ratio(df, 'AgeGroup', 'Smoking')
# Print results
for age_group, odds_ratio in odds_ratios.items():
    print(f"Age group {age_group}: Odds Ratio = {odds_ratio}")
The output reports, for each age group, the odds of smoking (strictly speaking, these are stratum-specific odds, the number of smokers divided by the number of non-smokers, rather than odds ratios).
- For the age group (20, 40], the odds are 0.244, indicating that smoking is relatively uncommon in this age range.
- For the age group (40, 60], the odds are 3.689, indicating that smokers substantially outnumber non-smokers among individuals aged between 40 and 60.
- For the age group (60, 80], the odds are 3.333, indicating a similarly high prevalence of smoking among individuals aged between 60 and 80.
These values quantify the association between smoking and age within each group, providing insight into how smoking behavior varies across age ranges.
What this tells us about confounding:
- If the odds of smoking were similar across all age groups, age would not be associated with the exposure and could not confound the association between smoking and the outcome (e.g., lung cancer).
- Since the odds vary substantially across age groups, age is strongly associated with smoking, satisfying the first requirement for a confounder; if age is also causally related to the outcome, as is plausible for lung cancer, it will bias the crude smoking-outcome association.
This suggests that age is a potential confounding variable in the relationship between smoking and the outcome.
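To formally test whether smoking status and age group are associated, we can run a chi-square test of independence on their cross-tabulation. This is a quick check assuming df from the code block above is still in scope:
from scipy.stats import chi2_contingency
import pandas as pd
# Cross-tabulate age group against smoking status
contingency_table = pd.crosstab(df['AgeGroup'], df['Smoking'])
# Chi-square test of independence between age group and smoking
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-square statistic: {chi2:.2f}, p-value: {p_value:.4g}")
A very small p-value confirms that age and smoking are associated, consistent with the first requirement for age to act as a confounder.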
(C) Causal Diagrams (Directed Acyclic Graphs, DAGs):
- Causal diagrams visually represent the hypothesized causal relationships between variables in a study. DAGs depict the causal pathways and help researchers identify potential confounding pathways and variables.
- DAGs provide a graphical representation of the relationships between variables, allowing researchers to identify which variables should be adjusted for in statistical models to minimize confounding bias.
Consider a DAG with three variables (X, Y, Z), recreated in the code sketch below. Each directed edge represents a causal relationship: X causes Y, X causes Z, and Y causes Z. Here, Y is our independent variable, Z is our dependent variable, and X is the confounding variable, since it has arrows into both Y and Z.
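Below is a minimal sketch using networkx (an extra dependency not used elsewhere in this article) to draw this three-node DAG:
import networkx as nx
import matplotlib.pyplot as plt
# Build the DAG: X is a common cause of Y and Z, and Y also causes Z
dag = nx.DiGraph()
dag.add_edges_from([('X', 'Y'), ('X', 'Z'), ('Y', 'Z')])
# Fix node positions so the confounder X sits above Y and Z
pos = {'X': (0, 1), 'Y': (-1, 0), 'Z': (1, 0)}
nx.draw(dag, pos, with_labels=True, node_color='lightblue', node_size=2000, arrowsize=20)
plt.title('DAG: X confounds the Y-Z relationship')
plt.show()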
Adjusting for Confounding Variables: Methods and Code Examples
Adjusting for a confounding variable means considering its impact when analyzing the relationship between the independent and dependent variables. It involves accounting for the confounder’s influence to isolate the true effect of the independent variable on the dependent variable. This adjustment helps researchers obtain more accurate estimates of the association between variables by controlling for potential biases introduced by the confounding variable.
1. Stratification:
- Stratification involves dividing the data into homogeneous groups based on the confounding variable.
- Within each stratum, the association between the independent variable and the outcome variable is assessed separately.
- The stratum-specific effect estimates are then combined to obtain an overall effect estimate, adjusted for the confounding variable.
Example: Suppose we have a dataset with variables Age, Medication, and Outcome. We want to assess the association between Medication and Outcome, adjusting for Age as a potential confounder.
With the code below, we simulate a dataset representing individuals' age, medication status, and corresponding outcomes. We assume that older individuals and those receiving medication have higher probabilities of a positive outcome. This dataset will be used to demonstrate methods for adjusting for confounding variables.
import numpy as np
import pandas as pd
# Define sample size
n = 100
# Generate random data
age = np.random.randint(20, 80, size=n)
medication = np.random.randint(0, 2, size=n)
# Initialize an empty list to store outcomes
outcomes = []
# Iterate over each individual
for i in range(n):
    # Define outcome probability based on age and medication for this individual
    outcome_prob = 0.2  # Default probability
    if age[i] > 50:
        outcome_prob += 0.6
    if medication[i] == 1:
        outcome_prob += 0.1
    # Generate outcome for this individual based on the computed probability
    outcome = np.random.choice([0, 1], p=[1 - outcome_prob, outcome_prob])
    outcomes.append(outcome)
# Create DataFrame
df = pd.DataFrame({'Age': age, 'Medication': medication, 'Outcome': outcomes})
# Display the first few rows of the DataFrame
print(df.head())
# Let's create age groups: 18-35, 36-50, 51-65, 66+
df['Age_Group'] = pd.cut(df['Age'], bins=[18, 35, 50, 65, np.inf], labels=['18-35', '36-50', '51-65', '66+'])
# Stratify the data based on age groups
strata = df.groupby('Age_Group')
# Conduct analysis within each stratum
for group, subgroup in strata:
    treatment_effect = subgroup[subgroup['Medication'] == 1]['Outcome'].mean() - subgroup[subgroup['Medication'] == 0]['Outcome'].mean()
    print(f"Treatment effect in {group} stratum: {treatment_effect}")
In this code, we’re stratifying the dataset based on age groups and then analyzing the treatment effect within each age group separately. This allows us to control for the potential confounding variable (age) and assess the treatment effect within homogeneous subgroups.
The output suggests that the treatment effect varies across different age groups. In the 18–35 and 36–50 age groups, the treatment led to a decrease in the outcome compared to the control group, while in the 51–65 and 66+ age groups, the treatment resulted in an increase in the outcome compared to the control group. This variation underscores the importance of considering age as a potential confounding variable and adjusting for it in the analysis to obtain more accurate estimates of the treatment effect.
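The bullets above also mention combining the stratum-specific estimates into a single adjusted estimate. A simple way to do this, sketched below using stratum sizes as weights (rather than a formal Mantel-Haenszel estimator), is a weighted average:
# Pool the stratum-specific treatment effects, weighting by stratum size
weighted_sum = 0.0
total_n = 0
for group, subgroup in df.groupby('Age_Group'):
    treated = subgroup[subgroup['Medication'] == 1]['Outcome']
    control = subgroup[subgroup['Medication'] == 0]['Outcome']
    if len(treated) == 0 or len(control) == 0:
        continue  # Skip strata with no treated or no control individuals
    effect = treated.mean() - control.mean()
    weighted_sum += effect * len(subgroup)
    total_n += len(subgroup)
adjusted_effect = weighted_sum / total_n
print("Age-adjusted treatment effect:", adjusted_effect)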
2. Matching:
- Matching involves pairing individuals who are similar with respect to the confounding variable(s).
- Each treated individual (exposed) is matched with one or more control individuals (unexposed) who have similar values of the confounding variable(s).
- The association between the independent variable and the outcome is then assessed within the matched pairs.
Example: Continuing from the previous example, we want to match individuals based on their Age and then assess the association between Medication and Outcome.
from sklearn.neighbors import NearestNeighbors
# Separate treated (medicated) and control (unmedicated) individuals
treated = df[df['Medication'] == 1]
control = df[df['Medication'] == 0]
# For each treated individual, find the control individual closest in age
# (fitting on the controls avoids matching each individual to itself)
knn = NearestNeighbors(n_neighbors=1).fit(control[['Age']])
distances, indices = knn.kneighbors(treated[['Age']])
matched_control_outcomes = control.iloc[indices.flatten()]['Outcome'].values
# Calculate the odds ratio comparing treated individuals to their age-matched controls
treated_rate = treated['Outcome'].mean()
control_rate = matched_control_outcomes.mean()
matched_odds_ratio = (treated_rate / (1 - treated_rate)) / (control_rate / (1 - control_rate))
print("Adjusted Odds Ratio (Matching):", matched_odds_ratio)
An adjusted odds ratio above 1 indicates that, after matching individuals on age, the odds of the outcome (e.g., recovery) are higher for those receiving medication than for their age-matched controls. For example, an adjusted odds ratio of roughly 1.2 would correspond to about 20% higher odds. Since the simulation is not seeded, the exact value will vary from run to run.
3. Regression Adjustment:
- Regression adjustment involves including the confounding variable(s) as covariate(s) in a regression model.
- The regression model estimates the association between the independent variable and the outcome while controlling for the confounding variable(s).
Example: Continuing from the previous examples, we want to fit a logistic regression model to assess the association between Medication and Outcome, adjusting for Age.
import statsmodels.api as sm
# Create design matrix X (including intercept)
X = sm.add_constant(df[['Medication', 'Age']])
# Fit logistic regression model
logit_model = sm.Logit(df['Outcome'], X)
logit_result = logit_model.fit()
# Print summary of logistic regression model
print(logit_result.summary())
The output suggests that age is a statistically significant predictor of the outcome, with each additional year associated with an increase in the log odds of the outcome of approximately 0.0771. The effect of medication, however, is not statistically significant once age is accounted for, meaning there is insufficient evidence that medication affects the outcome after adjusting for the confounder. The model as a whole fits the data better than an intercept-only model, as indicated by the low p-value of the likelihood-ratio (LLR) test.
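Since logistic regression coefficients are on the log-odds scale, it is often convenient to exponentiate them into odds ratios. A small follow-up, assuming logit_result from the model above is still in scope:
import numpy as np
# Exponentiate coefficients to express them as odds ratios
odds_ratios = np.exp(logit_result.params)
conf_int = np.exp(logit_result.conf_int())
print("Odds ratios:\n", odds_ratios)
print("95% confidence intervals:\n", conf_int)
For instance, a coefficient of 0.0771 on Age corresponds to an odds ratio of exp(0.0771) ≈ 1.08, i.e., roughly 8% higher odds of the outcome per additional year of age.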
Conclusion
Confounding variables play a crucial role in shaping research outcomes and influencing decision-making processes across various disciplines. Through the exploration of theoretical concepts and practical examples, this article has shed light on the complexities of confounding variables and the challenges they present in causal inference. By employing rigorous methods for identifying and adjusting confounders, researchers and policymakers can mitigate bias and obtain more accurate insights into causal relationships. Moving forward, continued efforts to address confounding variables will contribute to advancing the reliability and validity of research findings, ultimately leading to more effective interventions and policies in diverse fields.