It’s All About Regression — Ordinary Least Squares (OLS)

Vachan Anand
Feb 4 · 5 min read

In this series of blogs we are going to look into regression models. It is going to be different from a lot of other regression tutorials and lectures you might have seen online, as we will cover the topic from multiple angles.

What sets this series apart from the rest out there:
- Real-world examples and applications
- Predictive analysis
- Inferential analysis (something almost always forgotten)
- Regression beyond Ordinary Least Squares (OLS)

So without any further ado, let us get started.

Linear Regression

A linear regression is a statistical model that maps a relationship between the predictors/features and the response variable. With p features it is represented as follows:

Y = beta_{0} + beta_{1} * X_{1} + ... + beta_{p} * X_{p} + epsilon

For the purpose of this tutorial, treat the error term epsilon as the same as the residual E. Strictly speaking, the two differ: E is observed, since it is computed from the sample and can be reduced by fitting a better model, whereas epsilon is unobserved, belongs to the population, and is often called the irreducible error.

For a more concrete understanding click on the link below.

A linear regression model tries to find optimal values for the betas by minimising a cost function. The cost function may vary between methods and algorithms. To get a detailed explanation, check out this blog on most commonly used cost functions.

Ordinary Least Squares

Ordinary least squares, or OLS, is a method for estimating the parameters of a regression model. It estimates the betas by minimising the cost function, i.e. the sum of squared differences between the response variable and the predictions made by the model.

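The cost function for OLS is defined as

J(beta_{0}, ..., beta_{p}) = sum_{i=1}^{n} ( y_{i} - beta_{0} - sum_{j=1}^{p} beta_{j} * x_{ij} )^2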

n = number of observations

p = number of features
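
The sum of squared differences described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `ols_cost`, the tiny dataset, and the candidate betas are hypothetical and not taken from the original post.

```python
def ols_cost(betas, rows, ys):
    """Sum of squared differences between responses and predictions.

    betas[0] is the intercept; betas[1:] pair up with the p features of each row.
    """
    total = 0.0
    for features, y in zip(rows, ys):
        prediction = betas[0] + sum(b * x for b, x in zip(betas[1:], features))
        total += (y - prediction) ** 2
    return total


# Hypothetical toy data: n = 3 observations, p = 1 feature
rows = [[1.0], [2.0], [3.0]]
ys = [3.0, 5.0, 7.0]

cost_good = ols_cost([1.0, 2.0], rows, ys)  # betas that reproduce ys exactly
cost_bad = ols_cost([0.0, 2.0], rows, ys)   # intercept off by one for every row
print(cost_good, cost_bad)
```

The better the betas describe the data, the smaller this number gets, which is why OLS looks for the betas that make it as small as possible.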

For the purpose of illustration let us assume that we have a dataset as follows

X = [1,2,3,4,5,6,7,8,9]
Input Data

We need to find a relationship between X and Y. So let’s see if we can do that using OLS.

Since linear regression belongs to what is called a parametric family, we need to define a set of parameters (the form of the relationship between X and Y) that we want the OLS process to estimate. Since we have one feature X and one response Y, we want a relationship that lets us estimate the value of Y given X. So the equation takes the form:

Y = beta_{0} + beta_{1} * X + epsilon

Here, we want OLS to estimate the value of beta_{0} and beta_{1} so that we could map a relationship between X and Y.
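
For a single feature, the least-squares estimates of beta_{0} and beta_{1} have a well-known closed form: the slope is the co-variation of X and Y divided by the variation of X, and the intercept follows from the means. Below is a minimal sketch of that calculation. The helper name `fit_simple_ols` is hypothetical, and the Y values are an assumption (generated as 1 + 2 * X to match the mapping this post reports), since the original Y data is not reproduced here.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least-squares estimates for Y = beta_0 + beta_1 * X."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta_1 = s_xy / s_xx                 # slope
    beta_0 = y_mean - beta_1 * x_mean    # intercept
    return beta_0, beta_1


X = [1, 2, 3, 4, 5, 6, 7, 8, 9]
Y = [1 + 2 * x for x in X]  # assumed responses, not the post's actual data

beta_0, beta_1 = fit_simple_ols(X, Y)
print(beta_0, beta_1)
```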

One way to estimate the betas is a method called gradient descent. (For OLS specifically, the minimum can also be computed directly with linear algebra, which is what statsmodels does by default.) For the scope of this blog, we focus on the application side of linear regression and hence won't dive deep into the gradient descent algorithm. For now, think of gradient descent as an algorithm where you give in the equation (cost function) and it spits out the values of the betas that reduce the cost function to its minimum. To get a better understanding of gradient descent, check out this blog.
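
To make the gradient descent idea a little more concrete, here is a minimal sketch for the one-feature case. It repeatedly nudges beta_{0} and beta_{1} against the gradient of the sum of squared differences. The function name, learning rate, and step count are illustrative choices, and the Y values are an assumption (1 + 2 * X), not the post's actual data.

```python
def gradient_descent(xs, ys, learning_rate=0.001, n_steps=20000):
    """Minimise the sum of squared residuals for Y = beta_0 + beta_1 * X."""
    beta_0, beta_1 = 0.0, 0.0
    for _ in range(n_steps):
        residuals = [y - (beta_0 + beta_1 * x) for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to each beta
        grad_0 = -2.0 * sum(residuals)
        grad_1 = -2.0 * sum(r * x for r, x in zip(residuals, xs))
        # Step downhill
        beta_0 -= learning_rate * grad_0
        beta_1 -= learning_rate * grad_1
    return beta_0, beta_1


X = [1, 2, 3, 4, 5, 6, 7, 8, 9]
Y = [1 + 2 * x for x in X]  # assumed responses, not the post's actual data

beta_0, beta_1 = gradient_descent(X, Y)
print(round(beta_0, 4), round(beta_1, 4))
```

The learning rate matters here: with unscaled features like these, a much larger step would make the updates diverge instead of settling on the minimum.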

In Python, we use the statsmodels package to fit OLS. It takes care of minimising the cost function and the math behind it; all we need to give it are the input features and the response variable. The OLS function then fits a model and estimates the parameters.

import statsmodels.api as sm
model = sm.OLS(Y, sm.add_constant(X)).fit()

Now that we have fitted an OLS model, let's check out the summary.


Here we can see in the coef (coefficients) column that the model estimated beta_{0} = 1 and beta_{1} = 2.

So the mapping is as follows :

Y = 1 + ( 2 * X )
Predictions vs input data

It seems like the model did a really good job of predicting the value of Y given X. So if we now give the model a value of X = 67, which it has never seen before, it should spit out a value of
1 + (67 * 2) = 135


Suppose you want to get a prediction for Y when X is [67,45,555,123]

X_pred = [67,45,555,123]
Y_pred = model.predict(sm.add_constant(X_pred))

The output of the command above is the array [135, 91, 1111, 247].
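
The same numbers can be reproduced by hand, which shows what adding a constant column amounts to: each input x becomes a row [1, x], and the prediction is that row multiplied element-wise by [beta_{0}, beta_{1}] and summed. A small sketch, using the coefficients from the mapping Y = 1 + (2 * X) above (the variable names `betas` and `design` are just illustrative):

```python
betas = [1, 2]  # beta_0 and beta_1 from the mapping Y = 1 + (2 * X)

X_pred = [67, 45, 555, 123]

# Prepend a constant 1 to every input so the intercept has something to multiply
design = [[1, x] for x in X_pred]

# Each prediction is the dot product of a row with the betas
Y_pred = [sum(b * v for b, v in zip(betas, row)) for row in design]
print(Y_pred)  # [135, 91, 1111, 247]
```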

Congratulations, that’s your first machine learning model.
Well, even though linear regression is a very simplistic model, it becomes really powerful if the assumptions under which it operates hold true.


The model, however simple, needs to satisfy the following assumptions for it to work as desired:
- Linearity: the model assumes that there is a linear relationship between the features X and the response Y
- Normality of residuals: the residuals/errors are independent and normally distributed
- Constant variability: the variance of the residuals is constant (homoscedasticity)
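
A rough first look at these assumptions usually starts from the residuals. The sketch below fits a one-feature model with the closed-form least-squares formulas, computes the residuals, and compares their spread for small versus large X as a very crude constant-variance check. The noisy Y values are made up for illustration, and a real analysis would rely on residual plots and formal tests rather than this shortcut.

```python
from statistics import mean, pvariance

X = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Hypothetical noise added to 1 + 2 * X, purely for illustration
noise = [0.2, -0.1, 0.3, -0.3, 0.1, -0.2, 0.2, -0.1, -0.1]
Y = [1 + 2 * x + e for x, e in zip(X, noise)]

# Closed-form least-squares fit for one feature
x_mean, y_mean = mean(X), mean(Y)
beta_1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(X, Y))
          / sum((x - x_mean) ** 2 for x in X))
beta_0 = y_mean - beta_1 * x_mean

# Residuals: observed response minus the model's prediction
residuals = [y - (beta_0 + beta_1 * x) for x, y in zip(X, Y)]

# With an intercept in the model, the residuals average to zero by construction
residual_mean = mean(residuals)

# Crude constant-variance check: spread for small X versus large X
half = len(X) // 2
var_low = pvariance(residuals[:half])
var_high = pvariance(residuals[half:])
print(residual_mean, var_low, var_high)
```

If the two variances differ by a large factor, or the residuals show a clear trend against X, that is a hint that the constant-variability or linearity assumption may not hold.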

Next, we’ll dive deep into the cost function and the gradient descent algorithm. Check it out here.

Until then, keep rocking!

Analytics Vidhya
