Learning Python Regression Analysis — part 3 : Ordinary Least Squares

In the previous section we covered the basics of simple linear regression.

In this section we use the Ordinary Least Squares (OLS) method to estimate the best-fit model for simple linear regression. OLS is a linear modeling technique: it fits an equation with certain parameters to the observed data. Understanding how the least squares method works internally makes it easier to understand the practical model-fitting issues that may arise in real-world problems.

In the linear case, the least squares problem has a closed-form solution, whereas non-linear least squares problems are generally solved iteratively.

For the simple linear case, the OLS regression model can be written as:

Yi = a * Xi + b + ei

Where Yi is the value of the response variable for case i, Xi is the value of the predictor variable for case i, a is the slope, b is the intercept, and ei is the error of prediction for case i. We want to minimize the sum of the squared errors of prediction.

The Sum of Squared Errors (SSE) can be written as:

SSE = Σ (Yi - (a * Xi + b))^2, summed over all cases i

Specifically, we want to find the values of a and b that minimize the SSE above. To do so, we express SSE in terms of a and b, take its partial derivatives with respect to a and b, set these derivatives to zero, and solve for a and b. We will skip the mathematical details of solving these equations and jump directly to the solution.
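To make the quantity being minimized concrete, here is a minimal sketch of SSE as a Python function. It assumes a is the slope and b is the intercept, as in the solution formulas of this article, and it reuses the data from the NumPy example further down. The function name sse and the two candidate lines are illustrative choices, not part of the original text.

```python
import numpy as np

# Data reused from the NumPy example later in this article.
X = np.array([1, 1, 2, 2, 2.3, 3, 3, 3.5, 4, 4.3])
Y = np.array([6.9, 6.7, 13.8, 14.7, 16.5, 18.7, 17.4, 22, 29.4, 34.5])


def sse(a, b, x=X, y=Y):
    """Sum of squared errors for the line y = a*x + b (a: slope, b: intercept)."""
    residuals = y - (a * x + b)
    return float(np.sum(residuals ** 2))


# Two arbitrary candidate lines, chosen only for illustration.
sse_guess_1 = sse(5.0, 0.0)
sse_guess_2 = sse(7.0, 0.0)

# The second candidate fits this data better, so its SSE is smaller.
print(sse_guess_1, sse_guess_2)
```

OLS looks for the single pair (a, b) for which this function is as small as possible.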

a = Cov(X,Y) / Var(X)

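Where,

Cov(X,Y) = Σ (Xi - mean(X)) * (Yi - mean(Y)) / (n - 1)

And

Var(X) = Σ (Xi - mean(X))^2 / (n - 1)

The coefficient of the predictor variable (the slope a) is the ratio of the sum of the cross products of the deviations of each Xi and Yi from their means (the covariance, up to the factor 1/(n - 1)) to the sum of the squared deviations of each Xi from its mean (the variance, up to the same factor). The common factor cancels in the ratio. The solution for the intercept coefficient is:

b = mean(Y) - a * mean(X)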
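The covariance and variance in these formulas can also be computed directly from sums of deviations from the means. The sketch below assumes the sample (n - 1) versions and reuses the data from the NumPy example later in this article. The variable names (dx, dy, slope, intercept) are illustrative.

```python
import numpy as np

# Data reused from the NumPy example later in this article.
X = np.array([1, 1, 2, 2, 2.3, 3, 3, 3.5, 4, 4.3])
Y = np.array([6.9, 6.7, 13.8, 14.7, 16.5, 18.7, 17.4, 22, 29.4, 34.5])

n = len(X)
dx = X - X.mean()  # deviations of each Xi from mean(X)
dy = Y - Y.mean()  # deviations of each Yi from mean(Y)

cov_xy = np.sum(dx * dy) / (n - 1)  # sample covariance of X and Y
var_x = np.sum(dx ** 2) / (n - 1)   # sample variance of X

slope = cov_xy / var_x                   # a = Cov(X,Y) / Var(X)
intercept = Y.mean() - slope * X.mean()  # b = mean(Y) - a * mean(X)

# The (n - 1) factors cancel, so the ratio of the raw sums gives the same slope.
slope_from_sums = np.sum(dx * dy) / np.sum(dx ** 2)

print(slope, intercept)
```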

Now we will implement ordinary least squares in Python using just NumPy, without any ready-made OLS implementation.

>>> import numpy as np
>>> X=[1,1,2,2,2.3,3,3,3.5,4,4.3]
>>> Y=[6.9,6.7,13.8,14.7,16.5,18.7,17.4,22,29.4,34.5]
>>> var=np.var(X,ddof=1) #ddof parameter is used to set Bessel’s correction
>>> var
1.3232222222222223
>>> cov=np.cov(X,Y)
>>> cov[0][1]
9.8260000000000005

The above code snippet prints the sample variance of X and the sample covariance of X and Y. Now we will use these values to compute the OLS regression coefficients.
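One detail worth checking before dividing: np.var divides by n unless ddof=1 is passed, while np.cov divides by n - 1 by default. The sketch below uses the same data to show that the slope is unaffected as long as both quantities use the same divisor, and that mixing the two inflates it by n / (n - 1). The variable names are illustrative.

```python
import numpy as np

X = [1, 1, 2, 2, 2.3, 3, 3, 3.5, 4, 4.3]
Y = [6.9, 6.7, 13.8, 14.7, 16.5, 18.7, 17.4, 22, 29.4, 34.5]

var_population = np.var(X)                   # default ddof=0: divides by n
var_sample = np.var(X, ddof=1)               # Bessel's correction: divides by n - 1
cov_sample = np.cov(X, Y)[0][1]              # default: divides by n - 1
cov_population = np.cov(X, Y, ddof=0)[0][1]  # divides by n

slope_sample = cov_sample / var_sample
slope_population = cov_population / var_population

# Inconsistent divisors: this ratio is too large by a factor of n / (n - 1).
slope_mixed = cov_sample / var_population

print(slope_sample, slope_population, slope_mixed)
```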

>>> a = cov[0][1] / var  # slope: a = Cov(X,Y) / Var(X)
>>> a
7.425812410781762
>>> b = np.mean(Y) - a * np.mean(X)  # intercept: b = mean(Y) - a * mean(X)
>>> b
-1.3213703921404019

In the code segment above we have computed the values of the regression coefficients, which match the values computed by the statsmodels OLS method used in the previous section.
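The coefficients can also be cross-checked using NumPy alone. The sketch below swaps in np.polyfit with degree 1 in place of the statsmodels call, so it is an independent check rather than the method used in the previous section. The variable names are illustrative.

```python
import numpy as np

X = [1, 1, 2, 2, 2.3, 3, 3, 3.5, 4, 4.3]
Y = [6.9, 6.7, 13.8, 14.7, 16.5, 18.7, 17.4, 22, 29.4, 34.5]

# Closed-form OLS coefficients, as computed in this article.
slope = np.cov(X, Y)[0][1] / np.var(X, ddof=1)
intercept = np.mean(Y) - slope * np.mean(X)

# np.polyfit returns coefficients from the highest degree down,
# so a degree-1 fit gives [slope, intercept].
fit_slope, fit_intercept = np.polyfit(X, Y, 1)

coefficients_match = np.allclose([slope, intercept], [fit_slope, fit_intercept])
print(coefficients_match)
```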

Similarly, least squares coefficients can be computed for linear models with more than one predictor variable. In such cases we can use matrix algebra to compute the coefficients.
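As a rough sketch of the matrix approach, the same single-predictor data can be written as a design matrix with a column of ones for the intercept, and the coefficients obtained by solving the normal equations. With more predictors, each one would become an additional column. The names A, beta and beta_lstsq are illustrative, and np.linalg.lstsq is included only as a cross-check.

```python
import numpy as np

X = np.array([1, 1, 2, 2, 2.3, 3, 3, 3.5, 4, 4.3])
Y = np.array([6.9, 6.7, 13.8, 14.7, 16.5, 18.7, 17.4, 22, 29.4, 34.5])

# Design matrix: a column of ones (intercept) followed by the predictor column.
A = np.column_stack([np.ones(len(X)), X])

# Normal equations: (A^T A) beta = A^T Y, solved for beta = [intercept, slope].
beta = np.linalg.solve(A.T @ A, A.T @ Y)

# Cross-check with NumPy's least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(A, Y, rcond=None)

print(beta)
```

For this data the result should agree with the intercept and slope obtained from the covariance and variance formulas.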

We will look at the basics of multiple linear regression in the next part.