How does R solve linear regression?
Let’s first think of an interesting example.
What if X1 and X2 are highly correlated in a regression?
This is a multicollinearity problem: with perfect correlation, (X^TX) is singular (non-invertible), and even near-perfect correlation leaves it badly conditioned. Usually we would solve it with ridge regression. But how does R handle it?
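The original code for the example is not reproduced here, so below is a minimal sketch of the kind of setup being described, assuming two predictors named a and b where b is an exact multiple of a (the variable names and numbers are my guesses):

```r
# Hypothetical reconstruction of the example: b is an exact multiple of a,
# so the two predictors are perfectly collinear.
set.seed(1)
n <- 100
a <- rnorm(n)
b <- 2 * a
y <- 1 + 3 * a + rnorm(n)

fit <- lm(y ~ a + b)
coef(fit)
# b gets no contribution of its own: lm() drops the aliased column and
# reports its coefficient as NA ("not defined because of singularities").
```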
The coefficient of b is effectively zero: R gives b no contribution of its own. Why is that?
We all know the closed-form formula for the regression coefficients:

b = (X^TX)^(-1) X^T Y
However, forming the product X^TX costs O(n*p²) and inverting it costs another O(p³). When n and p are large, that is a lot of computation.
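As an illustration only (this is not what lm() does internally), the closed-form solution can be computed directly on a small simulated data set:

```r
# Direct use of the closed-form solution b = (X^TX)^(-1) X^T Y,
# shown on simulated, well-conditioned data.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

X <- cbind(1, x1, x2)                       # model matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solves (X^TX) beta = X^T Y
beta_hat
coef(lm(y ~ x1 + x2))                       # same estimates from lm()
```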
Instead, R leans on a powerful theorem in linear regression, the Frisch-Waugh-Lovell theorem, to reduce the computational burden.
Frisch-Waugh-Lovell Theorem
The theorem goes as follows. Suppose we regress Y ~ 1 + X1 + X2 and let the coefficient on X1 in this long regression be b1. Next, run two more short regressions:
- X1 ~ 1 + X2
- Y ~ 1 + X2
Denote the residual from the first short regression as e1 and the residual from the second as e2.
Now run one more regression, e2 ~ e1 (the Y residual on the X1 residual), and call its coefficient b1'. It always holds that b1' = b1.
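Here is a quick numerical check of the theorem on simulated data (the data-generating numbers are arbitrary):

```r
# Numerical check of the Frisch-Waugh-Lovell theorem on simulated data.
set.seed(42)
n  <- 200
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)             # correlated with x2, but not perfectly
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

b1 <- coef(lm(y ~ x1 + x2))["x1"]     # coefficient of X1 in the long regression

e1 <- resid(lm(x1 ~ x2))              # residual of X1 ~ 1 + X2
e2 <- resid(lm(y  ~ x2))              # residual of Y  ~ 1 + X2
b1_prime <- coef(lm(e2 ~ e1))["e1"]   # regress the Y residual on the X1 residual

c(b1, b1_prime)                       # identical up to floating-point error
```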
What does it mean?
It means that each coefficient in a linear regression measures that variable's independent contribution to the model. The residual from the first short regression strips X2's effect out of X1; the residual from the second strips X2's effect out of Y. The theorem then says the coefficient on X1 is the same whether or not X2's effect has been removed first.
How does R Solve Linear Regression?
Okay, now that we have the theorem in hand, what's next? Instead of running Y ~ 1 + X1 + X2 directly, we can run Y ~ 1 + X1 and X2 ~ 1 + X1, take the residuals of each, and regress the first residual on the second; the slope is exactly the coefficient of X2 in the full model. That is not a big deal when we only have two features.
Assume now that we have 100 features: Y~1+X1+X2+…+X100.
Then we can obtain each coefficient through a sequence of short regressions, reusing residuals from earlier steps in a dynamic-programming fashion. Recall that the direct solve costs roughly O(n*p²) + O(p³); if every regression involves only a couple of predictors at a time, the cubic dependence on p disappears and the cost drops dramatically.
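Whether or not this is literally how lm() is implemented, the theorem does scale to many predictors. Here is a sketch with five simulated predictors, recovering the coefficient of x5 by residualizing both y and x5 on the other four:

```r
# FWL with many controls: recover the coefficient of x5 in the regression
# y ~ x1 + ... + x5 by residualizing y and x5 on the other four predictors.
set.seed(7)
n <- 500
dat <- as.data.frame(matrix(rnorm(n * 5), n, 5))
names(dat) <- paste0("x", 1:5)
dat$y <- 1 + 2 * dat$x1 - dat$x2 + 0.5 * dat$x3 + 3 * dat$x5 + rnorm(n)

b5 <- coef(lm(y ~ ., data = dat))["x5"]                   # full regression

others <- paste0("x", 1:4)
e_x5 <- resid(lm(reformulate(others, "x5"), data = dat))  # x5 purged of x1..x4
e_y  <- resid(lm(reformulate(others, "y"),  data = dat))  # y  purged of x1..x4
b5_fwl <- coef(lm(e_y ~ e_x5))["e_x5"]

c(b5, b5_fwl)                                             # they agree
```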
Back to the Example
Let's apply the Frisch-Waugh-Lovell theorem here.
Suppose we first run Y ~ 1 + X1 and X2 ~ 1 + X1. The residuals of the second regression are all zero, so X2 has no independent variation left and gets no coefficient of its own, effectively zero. This is exactly what R gives us.
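Continuing the hypothetical setup from the beginning (b = 2 * a), the second short regression indeed leaves nothing behind:

```r
# Continuing the hypothetical collinear example (b = 2 * a):
set.seed(1)
n <- 100
a <- rnorm(n)
b <- 2 * a
y <- 1 + 3 * a + rnorm(n)

e_b <- resid(lm(b ~ a))   # residual of X2 ~ 1 + X1: numerically all zeros
e_y <- resid(lm(y ~ a))   # residual of Y  ~ 1 + X1
max(abs(e_b))             # ~ 0: b has no variation left once a is accounted for
```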
Likewise, if we first ran Y ~ 1 + X2 and X1 ~ 1 + X2, the zero coefficient would land on X1 instead. In R, though, these short regressions seem to be carried out in a fixed (perhaps alphabetical) order, so even if we reverse the order, b still ends up with the zero coefficient.