Simple Linear Regression and ANOVA
Linear regression analysis is one of the earliest models used in pattern recognition and one of the most commonly used algorithms in statistics. The goal of this article is to help with interpreting the results of linear regression, rather than to concentrate on the technicalities of regression analysis and ANOVA.
Background — linear regression:
The purpose of linear regression is to study a variable Y as a function of a variable X. In simple linear regression there is only one independent variable. It is called ‘linear’ because we represent the estimate of the dependent variable (Y) as a linear function of the independent variable (X). The general form is:
$$Y = \beta_0 + \beta_1 X + \epsilon$$
where \beta_0 is the intercept, \beta_1 is the slope and \epsilon is the error term.
A regression analysis of the form
$$Y = \beta_0 + \beta_1 X^2 + \epsilon$$
is also linear in X². The coefficients are estimated by minimizing the sum of the squared residuals, as shown in this article. This formulation is called ‘ordinary least squares’ (OLS). Several conditions must be satisfied to obtain a well-behaved, unique solution to the OLS problem. For simplicity, let us assume that the errors are ‘well behaved’, that the solution converges to a unique point, etc.
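As a minimal sketch of the least-squares estimate described above, the snippet below fits a simple linear regression in plain Python using the standard closed-form solution (slope = S_xy / S_xx, intercept = mean of y minus slope times mean of x). The function name `fit_simple_ols` and the five data points are made up for illustration; they are not taken from any analysis in this article.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered cross-product and centered sum of squares of x
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_simple_ols(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")  # intercept = 2.200, slope = 0.600
```

The closed form only works when the x values are not all identical (otherwise S_xx is zero), which is one of the conditions hidden behind the ‘well behaved’ assumption.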
Background — ANOVA:
Analysis of variance (ANOVA) is used to study the differences between the means of groups. Since variation is observed both within each group and between different groups, studying variance and covariance also falls within the scope of ANOVA. In general, the terms encountered are the within sum of squares (error), the between sum of squares (regression model) and the total sum of squares (null/no model). Dividing each term by its degrees of freedom yields the corresponding mean square. Since the error variance is estimated rather than known, the ratio of mean squares is taken to obtain an F statistic.
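To make the terms above concrete, here is a small sketch that computes the between, within and total sums of squares, the mean squares and the F ratio for a one-way layout. The function name `one_way_anova` and the three groups of numbers are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
def one_way_anova(groups):
    """Compute sums of squares, mean squares and the F ratio for a list of groups."""
    all_values = [v for g in groups for v in g]
    n = len(all_values)          # total number of observations
    k = len(groups)              # number of groups
    grand_mean = sum(all_values) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Between: spread of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within: spread of the observations around their own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)

    ms_between = ss_between / (k - 1)   # degrees of freedom: k - 1
    ms_within = ss_within / (n - k)     # degrees of freedom: n - k
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ss_total": ss_total,
        "f": ms_between / ms_within,
    }


# Hypothetical groups, for illustration only
result = one_way_anova([[4, 5, 6], [6, 7, 8], [8, 9, 10]])
print(result)
```

For these numbers the between and within sums of squares are 24 and 6, which add up to the total of 30, and the F ratio is 12.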
Simple linear regression and ANOVA:
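A not-so-obvious fact is that simple (ordinary least squares) linear regression under standard conditions is a special case of ANOVA. The general form of ANOVA can be written as:
$$SS_{total} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad (1)$$
$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}) \qquad (2)$$
Substituting equation 2 in equation 1, we get:
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + 2 \sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})$$
The last sum is zero for an OLS fit, so
$$SS_{total} = SS_{error} + SS_{regression}$$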
Notice that the property from equation 1 of this article on the OLS solution was used in the derivation. Effectively, the total sum of squares has been decomposed into the sums of squares of two components. We obtain the mean squares by dividing these sums by the corresponding degrees of freedom. This is the relationship between simple linear regression and ANOVA (recall two-factor ANOVA).
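The decomposition of the total sum of squares can be checked numerically. The sketch below fits a simple OLS line with the standard closed-form estimates and then computes the total, error and regression sums of squares, along with the sum of products of residuals and centered fitted values, which should be zero for an OLS fit. The function name `decompose_sum_of_squares` and the data are assumptions made for illustration.

```python
def decompose_sum_of_squares(x, y):
    """Return (ss_total, ss_error, ss_regression, cross_term) for a simple OLS fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]

    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_regression = sum((fi - y_bar) ** 2 for fi in fitted)
    # Sum of residual times centered fitted value; zero (up to rounding) under OLS
    cross_term = sum((yi - fi) * (fi - y_bar) for yi, fi in zip(y, fitted))
    return ss_total, ss_error, ss_regression, cross_term


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

ss_total, ss_error, ss_regression, cross_term = decompose_sum_of_squares(x, y)
print(f"total = {ss_total:.3f}, error = {ss_error:.3f}, "
      f"regression = {ss_regression:.3f}, cross term = {cross_term:.3f}")
```

For this data the total of 6 splits into an error part of 2.4 and a regression part of 3.6. If the fitted line were replaced by any other line, the cross term would generally not vanish and the two parts would no longer add up to the total.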
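The variance of the error is unknown and is estimated as
$$\hat{\sigma}^2 = MS_{error} = \frac{SS_{error}}{n - 2} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}$$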
Therefore, a chi-square test cannot be applied to establish the statistical significance of the model. Statistical significance is instead established using an F test, which examines the evidence against the null hypothesis by comparing the sample F statistic with the critical value. The sample F statistic is given by:
$$F = \frac{MS_{regression}}{MS_{error}} = \frac{SS_{regression} / 1}{SS_{error} / (n - 2)}$$
For a simple linear regression model, we use the following hypotheses:
$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$
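Notice that for a model with p variables and n data points
$$df_{regression} = p, \qquad df_{error} = n - p - 1, \qquad df_{total} = n - 1$$
The test can be rewritten as:
$$F = \frac{SS_{regression} / p}{SS_{error} / (n - p - 1)} \sim F_{p,\, n - p - 1} \quad \text{under } H_0$$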
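As a rough illustration of the F test for a simple linear regression, the sketch below computes the sample F statistic as the ratio of the regression mean square (1 degree of freedom) to the error mean square (n - 2 degrees of freedom) and compares it with a critical value. The function name, the data and the 5% significance level are assumptions made for the example; the critical value 10.13 for 1 and 3 degrees of freedom is copied from a standard F table rather than computed.

```python
def regression_f_statistic(x, y):
    """Return the F statistic MS_regression / MS_error for a simple OLS fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]

    ss_regression = sum((fi - y_bar) ** 2 for fi in fitted)
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ms_regression = ss_regression / 1      # one predictor
    ms_error = ss_error / (n - 2)          # estimated error variance
    return ms_regression / ms_error


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

f_stat = regression_f_statistic(x, y)

# Assumed critical value: F table, alpha = 0.05, df = (1, 3)
F_CRITICAL = 10.13
reject_null = f_stat > F_CRITICAL
print(f"F = {f_stat:.3f}, reject H0 at 5%: {reject_null}")  # F = 4.500, reject H0 at 5%: False
```

With only five points the error degrees of freedom are very small, so the critical value is large and a slope that looks clearly positive still fails to reach significance.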
Closing note (multiple regression):
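Let us examine the F test for multiple regression:
$$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0 \qquad H_1: \beta_j \neq 0 \text{ for at least one } j$$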
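Let us assume that the full model has p variables and that a reduced model with q variables (q < p) has been identified. ANOVA can be used to test whether the reduced model is sufficient to explain the variance in the outcome variable. Let the reduced model have coefficients
$$\beta_1, \dots, \beta_q$$
so that the null hypothesis is that the remaining coefficients are zero:
$$H_0: \beta_{q+1} = \dots = \beta_p = 0$$
The corresponding F statistic compares the error sums of squares of the two models:
$$F = \frac{(SS_{error,\, reduced} - SS_{error,\, full}) / (p - q)}{SS_{error,\, full} / (n - p - 1)}$$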
Therefore, the F test also acts as a model/variable selection criterion. However, we made strong assumptions (such as homoskedasticity and no multicollinearity) that should be tested. This article provides a few details.
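The comparison of a full model against a reduced model can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names (`solve_linear_system`, `sse_of_ols_fit`, `partial_f_statistic`) and the data are hypothetical, the fit solves the normal equations by Gaussian elimination (so it assumes the predictors are not collinear), and the statistic is the usual nested-model form, the drop in error sum of squares per dropped variable divided by the error mean square of the full model.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular (no collinear predictors).
    """
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def sse_of_ols_fit(columns, y):
    """Fit y on an intercept plus the given predictor columns; return the error SS."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[r] * row[c] for row in design) for c in range(k)]
           for r in range(k)]
    xty = [sum(row[r] * y_i for row, y_i in zip(design, y)) for r in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b_j * v for b_j, v in zip(beta, row)) for row in design]
    return sum((y_i - f_i) ** 2 for y_i, f_i in zip(y, fitted))


def partial_f_statistic(y, full_columns, reduced_columns):
    """F statistic comparing a reduced model (q variables) with a full model (p)."""
    n, p, q = len(y), len(full_columns), len(reduced_columns)
    sse_full = sse_of_ols_fit(full_columns, y)
    sse_reduced = sse_of_ols_fit(reduced_columns, y)
    return ((sse_reduced - sse_full) / (p - q)) / (sse_full / (n - p - 1))


# Hypothetical data, for illustration only
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [0.0, 1.0, 0.0, 0.0, 1.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

f_stat = partial_f_statistic(y, full_columns=[x1, x2], reduced_columns=[x1])
print(f"partial F = {f_stat:.3f}")  # partial F = 0.129
```

A small F value such as this one suggests that, for this made-up data, adding x2 explains very little beyond what x1 already explains, so the reduced model would be retained. In practice a linear algebra library would normally replace the hand-written solver.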