Causal Inference with Linear Regression — Foundation

The essential foundations of causal inference: the OLS 5+1 assumptions, endogeneity, omitted variable bias, and irrelevant variables.

TING DS
6 min read · Jan 27, 2024

Introduction

Linear regression is an essential foundation for learning causal inference. However, I have found that many people have a limited understanding of the assumptions of linear regression and are often confused about how the bias and variance of OLS estimators change in different situations. Therefore, I wrote this article to clear up confusion on the following topics:

  • What does it mean that the OLS estimator is unbiased, consistent, and BLUE (Best Linear Unbiased Estimator) under 5 assumptions, and how do we check these assumptions?
  • What are omitted variable bias and the effect of including an irrelevant variable?
  • What is endogeneity?

The OLS estimator is unbiased, consistent, and BLUE (Best Linear Unbiased Estimator) under 5 assumptions

β: true parameter, constant, unknown

β^: OLS estimate, a random variable (which means β^ has a probability distribution); as the sample size (n) increases, β^ converges to β, and by the Central Limit Theorem its distribution becomes approximately normal

Assumption 1 — Linearity in parameters

  • The model is linear in the parameters (Y = β₀ + β₁X₁ + … + ε), so the OLS estimate β^ is also a linear function of Y. See the sketch below.
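As a quick illustration, a model can be nonlinear in X yet still "linear" for OLS purposes, as long as it is linear in the β's. Here is a minimal sketch using numpy and statsmodels (all numbers and variable names are illustrative, not from the article):

```python
import numpy as np
import statsmodels.api as sm

# Y = b0 + b1*X + b2*X^2 + e is nonlinear in X but linear in the parameters,
# so it is still a "linear" model for OLS.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = 1.0 + 2.0 * x - 0.3 * x**2 + rng.normal(0, 1, size=500)

# Design matrix: intercept, x, x^2 -- linear in the betas
X = sm.add_constant(np.column_stack([x, x**2]))
fit = sm.OLS(y, X).fit()
print(fit.params)  # approximately [1.0, 2.0, -0.3]
```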

Assumption 2 — Random sampling

  • The observed data is an i.i.d. (independent and identically distributed) random sample from the population, so it is representative.
  • In theory, if we repeatedly and infinitely draw random samples from the population, each sample's β^ together forms the probability distribution of β^, and Ε(β^) = β.
  • Although we usually have only one observed dataset, if this data comes from random sampling, then β^ ≈ Ε(β^) = β.

Assumption 3 — No perfect multicollinearity

  • X must have full column rank, so that (XᵀX)⁻¹ exists and the closed-form solution of the OLS estimates, β^ = (XᵀX)⁻¹XᵀY, exists

If assumptions 1, 2, and 3 are satisfied, then the OLS estimate β^ is unbiased (Ε(β^) = β).
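To make the closed-form solution and the full-rank requirement concrete, here is a minimal numpy sketch (the true coefficients and sample size are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 regressors
beta = np.array([0.5, 2.0, -1.0])                           # true parameters (unknown in practice)
y = X @ beta + rng.normal(size=n)                           # linear model (Assumption 1)

# Assumption 3: X has full column rank, so (X'X) is invertible
assert np.linalg.matrix_rank(X) == X.shape[1]

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # closed-form OLS: (X'X)^{-1} X'y
print(beta_hat)  # close to [0.5, 2.0, -1.0]
```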

Assumption 4 — Zero conditional mean

  • Ε(ε|Χ) = 0, where ε is the error term
  • If Ε(ε|Χ) = 0, then Cov(X, ε) = 0; as the sample size increases to infinity, β^ → β (β^ converges to β), which means the OLS estimate β^ is consistent.

If assumptions 1, 2, 3, and 4 are satisfied, then the OLS estimate β^ is unbiased and consistent (as the sample size increases to infinity, β^ → Ε(β^) = β).

Assumption 5 — Homoscedasticity and no autocorrelation

  • The error terms ε have constant variance and are i.i.d. (independent and identically distributed)
  • In matrix terms, the variance-covariance matrix of ε has equal diagonal values and all off-diagonal values equal to 0: Var(ε|X) = σ²I
  • According to the matrix derivation of the Gauss-Markov theorem, under homoscedasticity and no autocorrelation the variance of the OLS estimates is smaller than that of any other linear unbiased estimator, so the OLS estimator is also efficient (the most efficient, the best, the lowest variance).

If assumptions 1, 2, 3, 4, and 5 are satisfied, then the OLS estimate β^ is unbiased, consistent, and efficient (best). Therefore, we can say the OLS estimator is BLUE (Best Linear Unbiased Estimator).

Assumption 6 (Optional) — Normality of error

  • Normality of ε → normality of β^
  • Without this assumption, the OLS estimator is still BLUE, so why do we need it?
  • Normality of the errors gives the OLS estimator reliable standard errors, which means we have not only accuracy (point estimates) but also precision (variability). Statistical inference then becomes meaningful: 1) we can calculate p-values in hypothesis testing; 2) we can build reliable confidence intervals.

But why is "normality of errors" optional?

  • Based on the Central Limit Theorem, regardless of whether the error term or the data of each variable follow a normal distribution, if the sample size is large enough, the distribution of β^ approximately follows a multivariate normal distribution, as the simulation below illustrates.
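A small simulation can make this concrete. The sketch below (illustrative numbers only) draws deliberately skewed, non-normal errors and shows that the sampling distribution of the OLS slope still concentrates normally around the true value:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 2000
slopes = np.empty(reps)

for i in range(reps):
    x = rng.normal(size=n)
    eps = rng.exponential(1.0, size=n) - 1.0  # skewed, mean-zero errors (not normal)
    y = 1.0 + 2.0 * x + eps
    # OLS slope for a single regressor: Cov(x, y) / Var(x)
    slopes[i] = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Despite non-normal errors, the sampling distribution of the slope
# is approximately normal around the true value 2.0 (CLT)
print(slopes.mean(), slopes.std())
```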

How to check OLS 5+1 assumptions?

Residual Analysis

Var(ε) = σ² is unknown (a population quantity), so we approximate it with the variance of the residuals, Var(e) = s² (a sample quantity). s² is an unbiased estimator of σ².

  • Residuals vs. predicted response (plot): check for non-linearity, non-constant variance, and outliers
  • Histogram of residuals & Q-Q plot (normal probability plot of residuals): check whether the residuals follow a normal distribution. Both plots are sketched below.
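Here is a minimal sketch of both diagnostic plots with statsmodels and matplotlib (the simulated data is purely illustrative):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = 1.0 + 2.0 * x + rng.normal(size=300)

fit = sm.OLS(y, sm.add_constant(x)).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. fitted values: look for curvature (non-linearity),
# funnel shapes (non-constant variance), and outliers
ax1.scatter(fit.fittedvalues, fit.resid, s=10)
ax1.axhline(0, color="red")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")

# Q-Q plot: residuals should fall on the line if approximately normal
sm.qqplot(fit.resid, line="45", fit=True, ax=ax2)
plt.show()
```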

VIF (variance inflation factor)

  • Sometimes, to avoid omitted variable bias, we may want to include all possible covariates. However, this can lead to severe multicollinearity. Common rules of thumb for the VIF:

  • VIF = 1: no multicollinearity
  • VIF > 1: some multicollinearity
  • VIF > 10: severe multicollinearity
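Here is a sketch of how one might compute VIFs with statsmodels' variance_inflation_factor (the three regressors are simulated for illustration; x1 and x2 are built to be nearly collinear):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(0, 0.1, size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)                # independent
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF for each regressor (skip the constant at column 0)
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
# x1 and x2 show very large VIFs; x3 stays near 1
```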

Omitted variable bias & irrelevant variable

Recall: β is the true parameter (constant, unknown), and β^ is the OLS estimate (a random variable with its own probability distribution).

Bias: the distance between Ε(β^) and β

Variance (the inverse of precision): the variability of the β^ across samples around Ε(β^)

T: treatment variable

Y: response variable

Scenarios when omitting a variable Z

Omitted Z is related to both T and Y

  • Z is called a confounding variable. If Z is omitted, the correlation between Z and T leaks into the error term ε, so T becomes correlated with ε. T is then called an endogenous variable, and the β^ of T suffers omitted variable bias (see the simulation below).
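The following simulation sketch (all coefficients are made up for illustration) shows the confounder case: the true effect of T is 2.0, and omitting Z pushes the estimate away from it:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5000
z = rng.normal(size=n)                       # confounder
t = 0.8 * z + rng.normal(size=n)             # Z affects treatment T
y = 2.0 * t + 1.5 * z + rng.normal(size=n)   # true effect of T is 2.0

# Correct model: control for the confounder Z
full = sm.OLS(y, sm.add_constant(np.column_stack([t, z]))).fit()
# Misspecified model: Z omitted, its effect leaks into the error term
short = sm.OLS(y, sm.add_constant(t)).fit()

print("with Z:   ", full.params[1])   # close to 2.0
print("without Z:", short.params[1])  # biased upward (well above 2.0)
```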

Omitted Z is related to Y, but not to T

  • The β^ of T is still unbiased
  • Z's effect (the unexplained part) leaks into the error term ε, so Var(ε) increases; the variance of every β^ increases, p-values tend to be larger, and results tend toward false negatives.

Scenarios when including an irrelevant variable X

Irrelevant means X is not related to the response variable Y.

Included X is not related to T, Y, or the other covariates

  • Tiny effect on the bias and variance of the β^
  • However, if we include too many such X, the degrees of freedom shrink and the variance of every β^ increases.

Included X is related to T, but not to Y or the other covariates

  • X competes with T for the same variation (for example, when X is a post-treatment variable driven by T), and the β^ of T becomes biased as an estimate of T's effect.

Included X is related to the other covariates, but not to Y or T

  • The β^ of T is still unbiased, and Var(β^ of T) does not change.
  • The β^ of the correlated covariates remain unbiased too, but their variance increases. This is multicollinearity (see the simulation below).
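A simulation sketch of this scenario (illustrative numbers): adding an irrelevant X that is correlated with the covariate w leaves the coefficients unbiased but inflates the standard error of w, while the standard error of T is essentially unchanged:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 2000
t = rng.normal(size=n)                 # treatment, independent of X
w = rng.normal(size=n)                 # relevant covariate
x = w + rng.normal(0, 0.3, size=n)     # irrelevant (no effect on Y), correlated with w
y = 2.0 * t + 1.0 * w + rng.normal(size=n)

base = sm.OLS(y, sm.add_constant(np.column_stack([t, w]))).fit()
with_x = sm.OLS(y, sm.add_constant(np.column_stack([t, w, x]))).fit()

print("SE of t:", base.bse[1], "->", with_x.bse[1])  # essentially unchanged
print("SE of w:", base.bse[2], "->", with_x.bse[2])  # inflated by multicollinearity
```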

Trade-off: X is related to Y and the other covariates, but not to T

  • If we include X, the variance of the β^ may increase or decrease (it is unstable) because of multicollinearity (the benefit is that the β^ of the other covariates are less biased).
  • If we drop X, the β^ of the other covariates are more biased (the benefit is that there is no multicollinearity).

If we omit a variable that is related to both T and Y (a confounder), or include a variable that is related to T but not to Y:

the β^ of T is biased.

If we include a variable that is related only to the other covariates:

Var(β^) increases for the covariates correlated with it.

Endogeneity

Endogeneity: in a linear regression model, if T or another covariate X is correlated with the error term ε (through the three types of bias below), then that variable is called an endogenous variable, and the β^ of an endogenous variable is biased.

Throughout, T refers to the independent variable of interest in the study.

Omitted variable bias

  • If we omit a confounding variable Z (Z is related to both T and Y), then T becomes correlated with the error term ε, T becomes an endogenous variable, and the β^ of T has omitted variable bias.

Simultaneity bias

  • T affects Y, and Y also affects T (reverse causality), so T is correlated with the error term ε and the β^ of T has simultaneity bias.
  • Example: education level and income influence each other.

Attenuation bias (measurement error)

  • If Y has measurement error, the overall error = ε + the measurement error of Y; T remains uncorrelated with the error term, so the β^ of T is still unbiased.
  • If T has measurement error, T becomes correlated with the error term ε and becomes an endogenous variable; the β^ of T has attenuation bias (the β^ of T is always underestimated, because measurement error weakens the observed correlation between T and Y). The sketch below simulates this shrinkage.
  • Some variables are difficult to quantify and can only be replaced by proxy variables, so measurement error always exists.
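A minimal simulation of attenuation bias (illustrative numbers): the true slope is 2.0, and with noise of equal variance added to T, the estimate shrinks to roughly half:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 5000
t_true = rng.normal(size=n)
y = 2.0 * t_true + rng.normal(size=n)          # true slope is 2.0
t_noisy = t_true + rng.normal(0, 1.0, size=n)  # T measured with error

clean = sm.OLS(y, sm.add_constant(t_true)).fit()
noisy = sm.OLS(y, sm.add_constant(t_noisy)).fit()

print("clean T:", clean.params[1])  # close to 2.0
print("noisy T:", noisy.params[1])  # attenuated toward 0, about 1.0 here
```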

Measurement error has two types:

Random error: when the noise is in Y, β^ is still unbiased (Ε(β^) = β); random error in T, however, produces the attenuation bias described above.

Systematic error: β^ is biased, because the error is always too high or always too low rather than random.

Summary

  • What it means for the OLS estimator to be unbiased, consistent, and BLUE (Best Linear Unbiased Estimator) under the 5 assumptions, and how to check those assumptions
  • What omitted variable bias is, and what happens when we include an irrelevant variable
  • What endogeneity is

If you find this article helpful, please clap and follow to inspire me. I publish blogs on data science and statistical analysis regularly!

Thanks for reading, and feel free to leave comments and discuss!
