This article is a section of Linear Regression in a Nutshell.
Ordinary Least Squares (OLS) regression, more commonly known simply as linear regression, is a linear least-squares method for estimating the unknown parameters of a linear regression model.
In the case of a model with n explanatory variables, the OLS regression equation is given as:

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε

where:
- y is the dependent variable
- β₀ is the intercept of the model
- xᵢ corresponds to the iᵗʰ explanatory variable of the model
- ε is the random error with zero mean and variance σ²
In OLS, "least squares" refers to minimizing the sum of squared errors, or SSE.
The lower the error of the model, the better its explanatory power.
So this method aims to find the line which minimizes the sum of squared errors.
Many lines can fit the data, but OLS determines the one with the smallest error.
Graphically, it is the line closest to all the points simultaneously.
Such a system usually has no exact solution, so the goal is instead to find the coefficients β which fit the equations “best”.
Simple linear regression
For the simple linear regression model, the computation is straightforward. Consider the equation of simple linear regression:

y = α + βx + ε

To calculate the values of α and β, OLS minimizes the error term using the equations:

β = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
α = ȳ − βx̄
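As a sketch of these closed-form estimates, the slope and intercept can be computed directly with NumPy (the data below is made up for illustration):

```python
import numpy as np

# Hypothetical data: y is roughly 2 + 3x plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.0])

# OLS estimates for simple linear regression:
#   beta  = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
#   alpha = y_mean - beta * x_mean
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()

print(alpha, beta)
```

The result can be cross-checked against `np.polyfit(x, y, 1)`, which fits the same least-squares line.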
Multiple Linear Regression
For multiple linear regression, the computation becomes more involved. Since multiple regression has more than two dimensions, the fitted model is a high-dimensional hyperplane rather than a line.
This is a minimization problem, so we make use of calculus and linear algebra to determine the coefficients of the best-fitting hyperplane.
The expression used to find the best-fitting coefficients is:

b = (XᵀX)⁻¹Xᵀy

where:
- T denotes the matrix transpose
- X is the matrix whose iᵗʰ row Xᵢ = xᵢᵗ contains the values of all the independent variables associated with the iᵗʰ value of the dependent variable
- y denotes the dependent variable
- The value of b that minimizes the sum of squared errors is called the OLS estimator for β.
Suppose b is a "candidate" value for β. The quantity (yᵢ − xᵢᵗb), called the residual for the iᵗʰ observation, measures the vertical distance between the data point (xᵢ, yᵢ) and the hyperplane y = xᵗb, and thus assesses the degree of fit between the actual data and the model.
The residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest.
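As a minimal sketch of the OLS estimator above, assuming NumPy and a small made-up dataset with an intercept column and two explanatory variables:

```python
import numpy as np

# Hypothetical data: a column of ones for the intercept plus two explanatory variables
X = np.array([
    [1.0, 1.0, 2.0],
    [1.0, 2.0, 1.0],
    [1.0, 3.0, 4.0],
    [1.0, 4.0, 3.0],
    [1.0, 5.0, 5.0],
])
y = np.array([6.0, 6.5, 13.0, 13.5, 18.0])

# OLS estimator: b = (X^T X)^(-1) X^T y
b = np.linalg.inv(X.T @ X) @ X.T @ y

# Residuals: vertical distances between each observation and the fitted hyperplane
residuals = y - X @ b
print(b, residuals)
```

In practice `np.linalg.lstsq` (or a dedicated statistics library) is preferred over forming the inverse explicitly, since it is numerically more stable.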
We could minimize the sum of squared errors by hand, but for larger datasets this quickly becomes impractical.
Nowadays, regression analysis is performed through software and programming languages like SAS, Excel, Python, and R.
There are other methods for determining the regression line. They are usually preferred in different contexts.
Some of them are :
- Generalized least squares
- Maximum likelihood estimation
- Bayesian regression
- Kernel regression
- Gaussian process regression
However, OLS is powerful enough for many, if not most, linear problems.
There are five assumptions of OLS to consider before performing regression analysis:
- Linearity
- No endogeneity
- Normality and Homoscedasticity
- No Autocorrelation
- No Multicollinearity
Linearity
Linear regression assumes linearity: each independent variable is multiplied by a coefficient, and the products are summed to predict the value. It is the simplest non-trivial relationship, and it is called linear because the equation is linear in the coefficients.
Linearity means there must be a linear relationship between dependent and independent variables.
Check for Linearity
One way is to make a scatter plot of the independent variable against the dependent variable. If the data points form a pattern that looks like a straight line, then a linear regression model is suitable.
Fixes for linearity
- Run a non-linear regression
- Exponential transformation
- Logarithmic transformation
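As a sketch of the logarithmic fix, assuming synthetic data with an exponential relationship:

```python
import numpy as np

# Synthetic non-linear relationship: y = 2 * exp(0.5 * x)
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.5 * x)

# y vs x is curved, but log(y) vs x is a straight line:
#   log(y) = log(2) + 0.5 * x
slope, intercept = np.polyfit(x, np.log(y), 1)
print(slope, intercept)
```

After the transformation, an ordinary linear fit recovers the underlying parameters.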
No Endogeneity
No endogeneity refers to the prohibition of a link between the independent variables and the errors.
Mathematically, this is expressed as:

Cov(xᵢ, ε) = 0, for every explanatory variable xᵢ
When this assumption is violated, the error term (the difference between the observed values and the predicted values) is correlated with the independent variables. This problem is referred to as omitted variable bias.
Omitted variable bias is introduced when a relevant variable is not included in the analysis.
Basically, everything which is not explained by the model goes into the error.
- The incorrect exclusion of a variable leads to biased and counterintuitive estimates that are toxic to regression analysis.
- The incorrect inclusion of a variable leads to inefficient estimates; these do not bias the regression, and the extra variables can simply be dropped.
Fixes for Endogeneity
Omitted variable bias varies from problem to problem. It is always sneaky, and overcoming it takes experience and advanced domain knowledge.
Normality and Homoscedasticity
1. Normality - We assume that the error term is normally distributed.
What if the error term is not normally distributed?
The solution to the problem is the central limit theorem.
The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample’s means will be approximately normally distributed.
Thanks to this theorem, the error term can be treated as approximately normal in sufficiently large samples.
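A quick simulation illustrates the theorem; the population here (an exponential distribution) is deliberately non-normal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: exponential distribution with mean 1 (clearly not normal)
population_mean = 1.0

# 10,000 random samples of size 50 each; take the mean of each sample
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# The sample means cluster around the population mean in a roughly
# normal, bell-shaped distribution
print(sample_means.mean(), sample_means.std())
```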
2. Homoscedasticity - Homoscedasticity means equal variance: the error terms should all have the same variance.
Consider an example:
If a person is poor, he or she will spend a roughly constant amount of money on food and other necessities. But the wealthier an individual is, the higher the variability of his or her expenditure. Therefore heteroscedasticity exists.
Heteroscedasticity refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it. This is often due to the presence of outliers in the data.
An outlier here is an observation that is either small or large relative to the other observations in the sample.
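The income example can be simulated; in this sketch (with fabricated numbers), the noise in spending grows with income:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: spending noise grows with income (heteroscedasticity)
income = np.linspace(10, 100, 500)
spending = 0.5 * income + rng.normal(scale=0.05 * income)

# Residuals around the true line
residuals = spending - 0.5 * income

# Compare the residual spread for the lower-income and higher-income halves
low_spread = residuals[:250].std()
high_spread = residuals[250:].std()
print(low_spread, high_spread)
```

The residual spread visibly widens with income, which is exactly what a residual plot would reveal.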
Fixes for heteroscedasticity
- Check for Omitted variable bias
- Look for outliers and try to remove them
- Perform log transformation on the explanatory variable
No Autocorrelation
No autocorrelation is also known as no serial correlation. According to this assumption, the errors should be uncorrelated with each other.
Check for Autocorrelation
- Plot the residuals on a graph and check for patterns. If no pattern is visible, there is no autocorrelation.
- Durbin-Watson test - Its value falls between 0 and 4. A value of 2 indicates no autocorrelation. Values below 1 and above 3 indicate the presence of autocorrelation.
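The Durbin-Watson statistic is straightforward to compute from the residuals; this sketch mirrors the calculation offered by `statsmodels.stats.stattools.durbin_watson`:

```python
import numpy as np

def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    diff = np.diff(residuals)
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(1)

# Independent residuals give a value near 2
independent = rng.normal(size=1000)
print(durbin_watson(independent))

# A random walk is strongly positively autocorrelated: value near 0
correlated = np.cumsum(rng.normal(size=1000))
print(durbin_watson(correlated))
```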
Fixes for Autocorrelation
There is no simple fix for autocorrelation within linear regression itself; the usual remedy is to avoid plain linear regression and use a model designed for serially dependent data, such as an autoregressive model.
Time series analysis is a classic setting where the autocorrelation problem arises.
No Multicollinearity
Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.
We observe multicollinearity when two or more variables have a high correlation.
Consider an example with the equation a = 2 + 5 * b.
This equation can be rearranged as b = (a - 2) / 5.
- ‘a’ and ‘b’ are two variables related by an exact linear combination
- ‘b’ can be represented using ‘a’ and vice versa
A model containing ‘a’ and ‘b’ as explanatory variables would have perfect multicollinearity. This imposes a big problem on our regression model as the coefficients will be wrongly estimated.
The reasoning is that if ‘a’ can be represented with ‘b’, there is no point in using both; we can keep just one of them.
Check for Multicollinearity
- Multicollinearity is a big problem but is also the easiest to notice.
- Before fitting the regression, find the correlation between each pair of independent variables.
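A quick sketch of that check with NumPy, using fabricated variables where x3 is (almost) the exact linear combination 2 + 5·x1 from the example above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Three hypothetical explanatory variables; x3 is nearly a linear function of x1
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 2.0 + 5.0 * x1 + rng.normal(scale=0.01, size=200)

# Pairwise correlation matrix of the independent variables
corr = np.corrcoef([x1, x2, x3])
print(corr.round(3))
```

A near-1 off-diagonal entry (here, between x1 and x3) flags the multicollinear pair; one of the two features should then be dropped or the two combined.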
Fixes for Multicollinearity
- Drop one of the two features
- Transform two features into a single feature