What are the assumptions of linear regression, and how do we handle misspecification?
The first assumption made in linear models is that the relation is linear. The well-known Gauss-Markov model assumes that

Y = Xβ + ε, with E[ε] = 0 and Cov(ε) = σ²I.

This means that Y is a linear combination of the columns of X plus an error term.
How to Check Linearity
If the number of features is small, we can plot the estimated residuals against each feature and look for any non-random pattern. If there are many features, we can plot the estimated residuals against the fitted values instead.
The rationale is that, if the linear model is correctly specified, the residuals should be uncorrelated with the features and with the fitted values. If a clear pattern shows up in these plots, the linearity assumption is violated.
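A minimal sketch of this check, assuming NumPy and synthetic data with a deliberately quadratic truth. Instead of eyeballing a plot, we quantify the non-random pattern by correlating the residuals with a candidate non-linear term:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0.0, 1.0, size=n)  # truth is quadratic

# Fit y ~ 1 + x by ordinary least squares
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Under a correctly specified linear model the residuals should look like noise
# against any function of the features; here they correlate strongly with x**2,
# exposing the missing curvature (the plot version: residuals vs. fitted values).
r = np.corrcoef(residuals, x**2)[0, 1]
print(round(r, 2))
```

By construction the residuals are orthogonal to x itself, so only the omitted curvature shows up in this diagnostic.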
Problems and Solutions to the Violation of Linearity
If the linearity assumption is violated, the coefficients estimated by the normal equation are no longer BLUE. BLUE stands for the best linear unbiased estimator: among all linear unbiased estimators, it has the least variance, assuming the model is indeed linear. Some statisticians argue that linear regression still yields reasonable results in mildly non-linear cases. Therefore, the first solution is simple: ignore it. For clearly non-linear cases, however, we should carefully revise the model.
The second solution is to include polynomial terms. For example, if Y ~ X is not linear, then let’s try Y ~ X + X² + X³. This can partially solve the non-linearity problem, but it has an obvious downside: it increases the number of features drastically, which worsens the curse of dimensionality and the risk of overfitting.
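A small sketch of the polynomial fix, assuming NumPy and a synthetic cubic relation; the degree-3 design matrix should shrink the residual error dramatically compared with the straight-line fit:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(-2, 2, size=n)
y = x**3 - 2.0 * x + rng.normal(0.0, 0.3, size=n)  # truth is cubic, not linear

def design(x, degree):
    """Polynomial design matrix [1, x, x**2, ..., x**degree]."""
    return np.column_stack([x**d for d in range(degree + 1)])

mse = {}
for degree in (1, 3):
    X = design(x, degree)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    mse[degree] = float(np.mean((y - X @ beta) ** 2))
print(mse)
```

With one feature this is cheap; with many features the full set of cross-terms is what blows up the dimension.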
In most cases, the linear model assumption used in practice is a bit stronger than the Gauss-Markov model. We further assume that the errors are normally distributed:

ε ~ N(0, σ²I).
With this normality assumption, we can construct confidence intervals and use a t-test to test whether a linear combination of the coefficients is different from 0. We may also use an F-test to check whether any of the coefficients is significantly different from 0. Specifically, the covariance of the estimated beta is

Cov(β̂) = σ²(XᵀX)⁻¹,

where σ² is estimated by σ̂² = RSS/(n − p).
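These formulas can be sketched in NumPy on synthetic data; the last coefficient is set to zero on purpose, so its t-statistic should stay small while the others are large:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, 0.0])          # last coefficient is truly zero
y = X @ beta_true + rng.normal(0.0, 1.0, size=n)

# Normal-equation estimate and the unbiased estimate of sigma^2
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)

# Cov(beta_hat) = sigma^2 (X'X)^{-1}, with sigma2_hat plugged in
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_beta))
t_stats = beta_hat / se                        # compare against t_{n-p} quantiles
print(np.round(t_stats, 2))
```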
If normality does not hold, we lose the exact theoretical justification for these hypothesis tests. Empirically, however, it does not matter that much: even if the error term is not normal, the estimated coefficients are still approximately normal in large samples, by the central limit theorem.
Looking at the Gauss-Markov assumptions, we can see that the error terms are uncorrelated with each other and share the same variance σ². This is homoscedasticity (for time-series data, the related notion is stationarity). That is, the variance of the noise remains constant across all data points.
If homoscedasticity is violated, the Gauss-Markov assumptions no longer hold. Since BLUE is proved under the Gauss-Markov assumptions, the regression estimator is no longer BLUE… right? Actually, we can show that the estimated coefficients are still unbiased, and with a large number of data points they are also consistent. What we lose is efficiency: under heteroscedasticity, weighted (generalized) least squares would have a smaller variance than the plain normal-equation estimator. The more immediate casualty, though, is the classical variance formula used for inference.
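A quick simulation of the unbiasedness point, assuming NumPy: even when the error standard deviation grows with the feature, the normal-equation estimates average out to the true coefficients across many replications.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 2000
beta_true = np.array([1.0, 2.0])
x = rng.uniform(0, 2, size=n)
X = np.column_stack([np.ones(n), x])
proj = np.linalg.solve(X.T @ X, X.T)   # the fixed OLS projection (X'X)^{-1} X'

estimates = np.empty((reps, 2))
for i in range(reps):
    # heteroscedastic noise: standard deviation grows with x
    eps = rng.normal(0.0, 1.0 + 2.0 * x)
    estimates[i] = proj @ (X @ beta_true + eps)

# The average of the estimates should sit close to beta_true
print(np.round(estimates.mean(axis=0), 2))
```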
Problems and Solution to Heteroscedasticity
Okay, now we know the estimator is still unbiased and consistent. But how do we conduct t-tests and F-tests? The variance estimator is no longer the σ̂²(XᵀX)⁻¹ from the Normality section. Instead, we need the EHW estimator (Eicker-Huber-White, a.k.a. the sandwich estimator) for the variance of the estimated beta. The trade-off is that, although EHW is asymptotically valid under heteroscedasticity, the resulting standard errors are typically larger than the classical ones, so it is harder to reject the null hypothesis.
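A sketch of the HC0 variant of the sandwich estimator, assuming NumPy and synthetic heteroscedastic data; the robust standard error for the slope comes out larger than the classical one:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
x = rng.uniform(-2, 2, size=n)
X = np.column_stack([np.ones(n), x])
# heteroscedastic errors: standard deviation 0.5 + x^2 grows toward the edges
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5 + x**2)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
XtX_inv = np.linalg.inv(X.T @ X)

# Classical estimator: sigma2_hat * (X'X)^{-1}
classical = (e @ e / (n - 2)) * XtX_inv
# EHW / HC0 sandwich: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
meat = X.T @ (X * (e**2)[:, None])
sandwich = XtX_inv @ meat @ XtX_inv

se_classical = np.sqrt(np.diag(classical))
se_sandwich = np.sqrt(np.diag(sandwich))
print(np.round(se_classical, 4), np.round(se_sandwich, 4))
```

In practice one would use a library implementation (e.g. robust covariance options in statsmodels) rather than hand-rolling the bread-and-meat products.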
Linearly Independent Features
Linear independence of the features is not strictly an assumption, because under the frequentist perspective X is treated as a fixed observation. There is no randomness in X, so the notion of statistical independence does not apply here.
Nonetheless, if the correlation among the features is high, there will be a problem: the entries of (XᵀX)⁻¹ will be very large, and the inverse may not even exist.
This is because XᵀX is (nearly) singular when the correlation between columns of X is large. Why? Consider two columns that are identical: the column space of X then has dimension p − 1, so the rank of XᵀX is p − 1, but the matrix is of size p × p. Hence it is non-invertible.
Now plug this into Cov(β̂) = σ²(XᵀX)⁻¹: the covariance of the estimated beta will be huge. This causes two problems: 1. it is extremely difficult to reject zero for some of the coefficients; 2. the estimated beta, though unbiased, has a very large variance.
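The blow-up is easy to demonstrate, assuming NumPy: duplicate a feature up to small noise, and the diagonal of (XᵀX)⁻¹, which drives the coefficient variances, explodes.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                 # independent second feature
x2_collin = x1 + 0.01 * rng.normal(size=n)    # near-duplicate of x1

max_var = []
for x2 in (x2_indep, x2_collin):
    X = np.column_stack([np.ones(n), x1, x2])
    XtX_inv = np.linalg.inv(X.T @ X)
    # largest diagonal entry of (X'X)^{-1}: coefficient variance up to sigma^2
    max_var.append(float(np.diag(XtX_inv).max()))
print(max_var)
```

With independent features the diagonal entries shrink like 1/n; with the near-duplicate column they are orders of magnitude larger.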
Solution to Collinearity
Applying ridge regression is a simple solution to collinearity. Ridge regression is equivalent to adding a small constant λ to the diagonal (the "ridge") of XᵀX, which makes it non-singular and keeps the entries of the inverse within a reasonable range: β̂_ridge = (XᵀX + λI)⁻¹XᵀY. The price is a small bias in the estimates in exchange for a large reduction in variance.
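A sketch of the ridge fix, assuming NumPy; λ = 1 is an arbitrary illustrative penalty (in practice it is chosen by cross-validation), and for simplicity the intercept is penalized too:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, 1.5, 1.5]) + rng.normal(size=n)

lam = 1.0                              # illustrative ridge penalty
p = X.shape[1]
# Ridge normal equations: (X'X + lam I) beta = X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_ols, 2), np.round(beta_ridge, 2))
```

OLS can split the shared signal between the two near-duplicate columns almost arbitrarily, while ridge keeps both coefficients near each other and their sum near the identifiable total of 3.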