Understanding Linear Regression Assumptions

Qingchuan Lyu · Published in Analytics Vidhya · Nov 28, 2020

In this post, I’ll show you the assumptions that are necessary for linear regression coefficient estimates to be unbiased, and discuss other “nice to have” properties. There are many versions of the linear regression assumptions on the internet; hopefully, this post will make them clear.

“Must have” Assumption 1. conditional mean of residuals being zero

E(ε | X) = 0 means that, given the observed data, the errors of our regression average out to zero. This is very straightforward if you think of the definition of being unbiased: the expected value of an estimator equals the true value of the parameter it estimates. In other words, when E(ε | X) ≠ 0, the expected value of the coefficient estimate of β is no longer β, so the estimate is biased.

Visually, a scatter plot of the residuals should spread evenly around the zero horizontal line. If the residuals show a systematic pattern instead (for example, mostly positive, or curved), that is a case where E(ε | X) ≠ 0.
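As a minimal sketch (the data and model here are simulated for illustration, not taken from the post), here is how one might draw that residual plot; fitting a straight line to data with a quadratic trend leaves a clear systematic pattern in the residuals, i.e., E(ε | X) ≠ 0:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Toy data: the true relationship is quadratic, but we fit a straight line,
# so the residuals will NOT center on zero everywhere (E(e | X) != 0).
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 1, 200)

# Fit a simple linear regression by least squares
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

plt.scatter(X @ beta, residuals, alpha=0.5)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals should spread evenly around the zero line")
plt.show()
```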

“Must have” Assumption 2. conditional variance of residuals being constant

This assumption is more about our ability to perform statistical tests than about unbiasedness. In other words, if this assumption is not satisfied, we may still get coefficient estimates, but we cannot trust their standard errors or the tests built on them. How so?

Remember, the variance of the residuals, σ², appears in the variance of the coefficient estimates: Var(β̂) = σ²(X′X)⁻¹, where ′ denotes the transpose and ⁻¹ the inverse. Therefore, when σ² is not constant, we do not have a solid estimate of the variance of the β estimates. In that case, many hypothesis-testing statistics become invalid, because they usually involve the standard error (the square root of Var(β̂)). For example, we cannot check the significance of a coefficient estimate, because the t-test and p-value need its standard error.
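As a hedged sketch with simulated data (not from the post), the homoscedastic covariance formula above can be computed directly; if σ² were not constant, the single σ̂² plugged in below would no longer be a valid summary of the error variance:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Toy design matrix with an intercept and two predictors
x1, x2 = rng.normal(size=(2, n))
X = np.column_stack([np.ones(n), x1, x2])
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(0, 1, n)

# OLS estimates: beta_hat = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residual variance estimate: sigma_hat^2 = RSS / (n - k)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - X.shape[1])

# Var(beta_hat) = sigma^2 (X'X)^{-1}; standard errors are the sqrt of the diagonal
cov_beta = sigma2_hat * XtX_inv
std_errors = np.sqrt(np.diag(cov_beta))
print(beta_hat, std_errors)
```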

“Must have” Assumption 3. No omitted variable bias

In practice, it’s almost impossible to catch all the contributing variables in a complicated case study, but we still want to avoid omitted variable bias. Why?

Remember the definition of being unbiased again: the expected value of the coefficient estimate equals the true coefficient, E(β̂) = β.

When a contributing predictor, call it x, is omitted, its effect will be absorbed into the constant (β₀), the error term (ε), or both. This can lead to a biased estimate of β₀ or a nonzero conditional mean of the residuals, violating Assumption 1.

How do you check it? A common way is to add the suspicious omitted variable back into your regression, then observe whether its p-value is below a threshold (e.g., 0.05) and whether the other coefficient estimates change a lot when you add it back. Either signal suggests you omitted a relevant variable. Another less common but powerful way is to plot Y against the residuals and look for a pattern, since the residuals are supposed to absorb at least part of the explanatory power of the omitted variable.
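Here is a minimal sketch of that check using statsmodels (the data is simulated and the variable names x, y, z are hypothetical): fit the regression with and without the suspect variable, then compare the coefficient on x and the p-value of z.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 500

# Simulated data where z drives both x and y, so omitting z biases the coefficient on x
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 1.0 + 2.0 * x + 1.5 * z + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x": x, "z": z})

short = smf.ols("y ~ x", data=df).fit()     # z omitted
long = smf.ols("y ~ x + z", data=df).fit()  # z added back

print(short.params["x"], long.params["x"])  # coefficient on x shifts noticeably
print(long.pvalues["z"])                    # z is significant, e.g. p < 0.05
```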

In general, omitted variable bias arises when a left-out variable affects both another predictor and the target variable. When the omitted variable is unobserved, the bias can be addressed with causal inference techniques, such as instrumental variable methods. I’ll write a post to explain them later 😃

These three assumptions are the only ones needed to ensure the unbiasedness of coefficient estimates in linear regression. Below I explain one assumption that is unnecessary (we don’t need it at all) and two that are nice to have.

“Unnecessary” Assumption. Linear relationship between Y and X

Do we need an approximately linear relationship between Y and X to begin our linear regression? No! Linear regression still works when the data has a more complicated shape, because “linear” refers to the coefficients: we can add powers of the predictors as extra terms. For example, any polynomial function can be modeled by linear regression: Y = β₀ + β₁X + β₂X² + … + βₖXᵏ + ε.

Besides polynomials, we can also add interaction terms, such as Y = c + XZ + ε, which also corresponds to a nonlinear graph.
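As a small numpy sketch with simulated data (the coefficients are chosen purely for illustration), a squared term and an interaction term are just extra columns in the design matrix, and the fit is still ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.uniform(-3, 3, n)
z = rng.uniform(-3, 3, n)

# Nonlinear ground truth: quadratic in x plus an x*z interaction
y = 1.0 + 2.0 * x - 0.5 * x**2 + 1.2 * x * z + rng.normal(0, 1, n)

# The model stays linear in the coefficients; we just add columns for x^2 and x*z
X = np.column_stack([np.ones(n), x, x**2, x * z])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # roughly recovers [1.0, 2.0, -0.5, 1.2]
```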

“Nice to have” assumption 1. Zero or small multicollinearity among predictors

Whether multicollinearity between predictors is a problem is still debated in academia. However, I believe it’s not a problem as long as the multicollinearity is not perfect, meaning there is no exact linear relationship between predictors. In reality, it’s hard to find either perfect multicollinearity or perfectly zero multicollinearity; most of the time we are in a grey area in the middle. The hope is to keep multicollinearity as low as possible. Why?

First, remember how we compute the coefficient estimates of β and their variance (“their” because β is a vector): β̂ = (X′X)⁻¹X′Y and Var(β̂) = σ²(X′X)⁻¹.

Obviously, we need the inverse of X′X. But is it invertible, and how do we check? Going back to your linear algebra class, remember that X′X is invertible only if the columns of X are linearly independent. When there is perfect collinearity between X₁ and X₂, X does not have full column rank, and thus X′X is not invertible. Taking a step back, when there is high (but not perfect) collinearity between X₁ and X₂, X′X is nearly singular, and computing the β estimates becomes numerically unstable. In addition, the variance of the β estimates blows up. How are we going to perform all the statistical tests with a huge standard error?
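A common diagnostic (sketched below with simulated data, assuming statsmodels is installed) is the variance inflation factor (VIF) of each predictor, along with the condition number of X; both grow as collinearity gets worse:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 300

# x2 is nearly a linear function of x1, so X'X is close to singular
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

# VIF well above 10 is often taken as a sign of problematic multicollinearity
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)

# The condition number of X tells a similar story: larger means closer to singular
print(np.linalg.cond(X))
```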

“Nice to have” assumption 2. Normality of residuals

This assumption is nice to have mainly for inference: when the residuals are normal, the usual t- and F-tests are exactly valid, and the OLS estimates coincide with the maximum likelihood estimates. Note that the Gauss–Markov theorem, which says the β estimates are the best linear unbiased estimates (“BLUE”, where “best” means smallest variance), already holds under the assumptions above and does not require normality. We omit the math proof here.
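If you want to eyeball this assumption, a quick sketch (again with simulated data) is a Q-Q plot of the residuals against the normal distribution, optionally backed by a Shapiro-Wilk test:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(5)
n = 200

# Toy regression with normal errors
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Q-Q plot: points close to the 45-degree line suggest approximately normal residuals
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()

# Shapiro-Wilk test: a large p-value means no evidence against normality
print(stats.shapiro(residuals))
```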

P.S. if you enjoy my post or learn a bit from it, please remember to clap!😀
