Machine Learning 101 — Linear Regression using the OLS Method

Dhruv Kapoor · Published in Analytics Vidhya · 8 min read · Jul 8, 2020

Linear Regression is one of the most basic Machine Learning algorithms and is used to predict real values. It involves using one or more independent variables to predict a dependent variable. Although it is one of the simplest algorithms we will encounter, it is extremely powerful and robust, making it an essential tool for aspiring data professionals.

In this blog post, we’ll cover the types of linear regression, its implementation using the Ordinary least squares (OLS) method, and certain underlying assumptions made by linear regression models. Throughout this post, we’ll also be referencing an example to predict an employee’s salary on the basis of their experience.


Types of Linear Regression

Simple Linear Regression

In a simple linear regression model, there exists only one independent variable which determines the dependent variable. So for our salary-experience example, the independent variable is an employee’s experience while salary is the dependent variable.

Image Source: https://medium.com/@manjabogicevic/multiple-linear-regression-using-python-b99754591ac0

In our simple linear regression equation, y = b₀ + b₁x₁:

  • y is the dependent variable
  • b₀ is our bias term, and
  • x₁ is an independent variable whose weight is b₁

Multiple Linear Regression

This type of regression is just an extension of simple linear regression. Here our dependent variable y is predicted using two or more independent variables as part of the set of input features. Simply stated, if we add more input features such as daily working hours, age, position, etc. to our salary-experience example, then we obtain a multiple linear regression model of the form y = b₀ + b₁x₁ + b₂x₂ + … + bₖxₖ.
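To make this concrete, here is a rough sketch (not from the original post) of fitting both a simple and a multiple linear regression with scikit-learn; the salary figures, working hours and variable names below are made up purely for illustration.

```python
# Sketch: simple vs multiple linear regression with scikit-learn on hypothetical data.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience, daily working hours, salary (in $1000s)
experience = np.array([1, 2, 3, 5, 7, 10])
hours = np.array([7, 8, 8, 9, 8, 9])
salary = np.array([35, 42, 50, 65, 78, 100])

# Simple linear regression: salary = b0 + b1 * experience
simple = LinearRegression().fit(experience.reshape(-1, 1), salary)
print("b0:", simple.intercept_, "b1:", simple.coef_[0])

# Multiple linear regression: salary = b0 + b1 * experience + b2 * hours
X = np.column_stack([experience, hours])
multiple = LinearRegression().fit(X, salary)
print("b0:", multiple.intercept_, "b1, b2:", multiple.coef_)
```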

Going Through the Basics

So, let’s assume we finally collected some data which contains the experience and salary of a number of employees. To understand our data better, we plot it and get the following graph:

The basic idea behind linear regression is to fit a straight line to our data. We can do so by using the Ordinary least squares (OLS) method. In this method, we draw a line through the data, measure the vertical distance of each point from the line (the residual), square each distance, and then add them all up. The best fit line is the line for which this sum of squared distances is as small as possible; trying out different slopes and intercepts (or, in practice, solving for them directly) leads us to it. This in turn minimizes the error of our predictions. Plotting this best fit line over our data shows how well it captures the overall trend.
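As a rough illustration of this idea, the sketch below (using the same hypothetical data as before) measures the sum of squared distances for any candidate line and then computes the slope and intercept that minimize it with the standard closed-form OLS formulas.

```python
# Sketch of the OLS idea with NumPy: the best fit line is the one that
# minimizes the sum of squared vertical distances (residuals).
import numpy as np

x = np.array([1, 2, 3, 5, 7, 10], dtype=float)        # experience (years), hypothetical
y = np.array([35, 42, 50, 65, 78, 100], dtype=float)  # salary ($1000s), hypothetical

def sum_of_squares(b0, b1):
    """Sum of squared distances between the data and the line y = b0 + b1*x."""
    residuals = y - (b0 + b1 * x)
    return np.sum(residuals ** 2)

print("error for a guessed line:", sum_of_squares(30.0, 5.0))

# The slope and intercept that minimize this sum can be written down directly
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print("best fit line: salary =", round(b0, 2), "+", round(b1, 2), "* experience")
print("minimum sum of squares:", round(sum_of_squares(b0, b1), 2))
```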

The next step in our process is to determine how good or useful our regression model actually is, by calculating its R² value. To do so, we first find the mean salary, calculate the difference between the mean and the salary at each data point, square it, and then add up all of these values. We call this SS(mean), i.e. the sum of squares around the mean. Mathematically, we may represent this calculation as follows (where n is our sample size, yᵢ is the salary of the i-th employee and ȳ is the mean salary):

SS(mean) = Σ (yᵢ − ȳ)²   (summing over all n data points)

Let’s go back to our original salary vs experience plot, which depicts the best fit line for our data. Just as we did before, we calculate SS(fit), i.e. the sum of squares around the best fit line, where ŷᵢ is the salary predicted by the line for the i-th employee:

SS(fit) = Σ (yᵢ − ŷᵢ)²

In general, we can view the variance of some data in a more abstract form as the average sum of squares:

Variance = SS / n

so that Var(mean) = SS(mean) / n and Var(fit) = SS(fit) / n.

Looking at these formulas, there is a pattern to notice: SS(fit) is never larger than SS(mean). That should not be surprising, considering that SS(fit) is measured around the best fit line, i.e. the line that minimizes the sum of squares. Thus, it would be appropriate to say that the R² value tells us how much of the variation in salary can be explained by taking an employee’s experience into account. Mathematically, we obtain the following:

R² = (Var(mean) − Var(fit)) / Var(mean)

Since both variances are divided by the same n, this formula may also be written as follows by eliminating n (the sample size):

R² = (SS(mean) − SS(fit)) / SS(mean)
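Continuing the hypothetical example from earlier, the following sketch computes SS(mean), SS(fit) and R² exactly as defined above.

```python
# Sketch: computing SS(mean), SS(fit) and R² for the hypothetical data above.
import numpy as np

x = np.array([1, 2, 3, 5, 7, 10], dtype=float)
y = np.array([35, 42, 50, 65, 78, 100], dtype=float)

# Best fit line via the closed-form OLS solution
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_mean = np.sum((y - y.mean()) ** 2)  # sum of squares around the mean
ss_fit = np.sum((y - y_hat) ** 2)      # sum of squares around the best fit line

r_squared = (ss_mean - ss_fit) / ss_mean
print("SS(mean):", ss_mean, "SS(fit):", round(ss_fit, 2), "R²:", round(r_squared, 3))
```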

For an OLS model with an intercept, evaluated on the data it was fit to, the R² value lies between 0 and 1 (it can only turn negative when a model is scored on data it wasn’t fit to, or is fit without an intercept). The closer our value is to 1, the better our independent variables are at explaining the variance of the dependent variable.

For example, let’s assume that we obtain an R² value of 0.75 for our data. This means that there is a 75% reduction in variance when we take into account an employee’s experience. Alternatively, we can say that an employee’s experience can explain 75% of the variation in salaries.

Note: One important fact to take into consideration while analyzing the R² value of your regression model is that it will never decrease, and will usually increase, as you keep adding more features to your model, even if those features add no real information. So if you ever find yourself obtaining an R² value of 0.95 or higher without any tinkering whatsoever, then you should probably take your results with a pinch of salt.

There are many other metrics to evaluate our regression models such as the Mean Absolute Error, Mean Squared Error, and the Adjusted R² value. We’ll discuss those in detail in another blog post.

Now, we’ve obtained an R² value that seems great to us, but how do we know that it isn’t just a fluke of our particular sample? To determine whether or not our R² value is statistically significant, we need to calculate a p-value. The p-value is calculated using a statistic called F, which conceptually looks like this:

F = (variation in salary explained by experience) / (variation in salary not explained by experience)

Although the equation may seem confusing at first, the numerator simply denotes the reduction in variance once we take experience into account, and the denominator represents the variation in the residuals, i.e. the vertical distances between each data point and the best fit line.

Mathematically, F is calculated using the following equation:

F = [ (SS(mean) − SS(fit)) / (p_fit − p_mean) ] / [ SS(fit) / (n − p_fit) ]

where

  • (p_fit − p_mean) and (n − p_fit) are the numerator and denominator degrees of freedom, respectively
  • p_fit is the number of parameters in the fit line (two in our simple regression: b₀ and b₁)
  • p_mean is the number of parameters in the mean line (just one, the mean itself)

In essence, the numerator is the variance explained by the extra parameter(s) in the fit line, and the denominator is the variance left in the residuals once we find the best fit line. Thus, if the fit is good then F turns out to be a really large number. Now, to turn this value of F into a p-value we take the following steps:

  1. Generate a set of random data
  2. Calculate the mean and SS(mean)
  3. Calculate the fit and SS(fit)
  4. Plug all of these values into the equation to find F
  5. Plot this value in a histogram
  6. Repeat lots and lots of times

Once we repeat this process thousands (or even millions) of times, we calculate the value of F for our best fit line. The p-value is then obtained as the count of more extreme values divided by the total number of values.

For example, if the value of F for our best fit line is 5 and we have 6 instances out of 100 total instances which are greater than or equal to 5, then our p-value will be 6/100 = 0.06
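A small simulation sketch of the six steps above might look like the following; the sample size and the observed F value of 5 are hypothetical, carried over from the example.

```python
# Sketch of the simulation: build a distribution of F values from random,
# unrelated data and compare our observed F against it.
import numpy as np

rng = np.random.default_rng(0)
n, p_fit, p_mean = 20, 2, 1  # sample size, parameters in the fit and mean lines

def f_statistic(x, y):
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    ss_mean = np.sum((y - y.mean()) ** 2)
    ss_fit = np.sum((y - (b0 + b1 * x)) ** 2)
    return ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))

# Steps 1-6: compute F for many random data sets (y unrelated to x on purpose)
random_fs = np.array([f_statistic(rng.normal(size=n), rng.normal(size=n))
                      for _ in range(10_000)])

observed_f = 5.0  # hypothetical F for our best fit line
p_value = np.mean(random_fs >= observed_f)
print("simulated p-value:", p_value)
```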

In reality, we do not often follow this process to generate the p-value as it is very time-consuming. Instead, we approximate the histogram with a line by using F-distributions.
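For example, with SciPy the simulated histogram can be replaced by the F-distribution’s survival function; the F value and sample size below are again hypothetical.

```python
# Sketch: replacing the simulated histogram with the F-distribution itself.
from scipy import stats

observed_f = 5.0          # hypothetical F value for our best fit line
n, p_fit, p_mean = 20, 2, 1
df_num = p_fit - p_mean   # numerator degrees of freedom
df_den = n - p_fit        # denominator degrees of freedom

# Survival function = probability of an F at least this large arising by chance
p_value = stats.f.sf(observed_f, df_num, df_den)
print("p-value:", p_value)
```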

Assumptions of the OLS Method

Now that we’ve learned how the OLS method works, we should also know what underlying assumptions are made in this method. There are seven classical OLS assumptions for Linear Regression. Out of these, the first six are necessary to produce a good model, whereas the last assumption is mostly used for analysis.

  1. The regression model is linear — This means that the terms in the model are either constant or a parameter multiplied by an independent variable and our models are limited to the generic equations we discussed earlier.
  2. The error term has a population mean of zero — The error term describes the variation in the dependent variable that our independent variables aren’t able to explain. We only want random error left for our error term, i.e. the error term should be unpredictable.
  3. All independent variables are uncorrelated to the error term — If an independent variable is correlated to the error term then we can use the independent variable to predict the error term. This should not be true for our regression model because it violates the notion that the error term is unpredictable in nature. This assumption is often referred to as exogeneity.
  4. Observations of the error term are uncorrelated — There should be a randomness in our error terms such that one observation of the error term does not predict the next observation.
  5. The error term has a constant variance — The variance of the errors should be the same for all observations. When this holds, the errors are said to be homoscedastic, which is desirable for our regression model. On the other hand, heteroscedasticity reduces the precision of our estimates in OLS linear regression.
  6. No independent variable is a perfect linear function of another independent variable — Perfect correlation exists when two variables have a Pearson’s correlation coefficient of +1 or -1. This means that if we increase one variable then the other variable will also increase (when the correlation is +1), or will decrease proportionally (when the correlation is -1). Ordinary least squares cannot distinguish between two variables if they are perfectly correlated, and this will cause an error in our model. A violation of this assumption is referred to as perfect multicollinearity.
  7. The error term is normally distributed — Although this is not a necessary condition, if satisfied this can help us generate reliable confidence intervals and prediction intervals. This assumption is also extremely helpful if we need to calculate the p-values for our coefficient estimates.
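As a practical aside, several of these assumptions can be checked with standard diagnostics from statsmodels and SciPy. The sketch below is only an illustration on synthetic data, not a recipe from the original post; the feature names and values are made up.

```python
# Sketch: a few common assumption checks on a fitted OLS model (synthetic data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats

rng = np.random.default_rng(0)
experience = rng.uniform(0, 10, 100)
hours = rng.uniform(6, 10, 100)
salary = 30 + 7 * experience + 2 * hours + rng.normal(0, 3, 100)

X = sm.add_constant(np.column_stack([experience, hours]))
model = sm.OLS(salary, X).fit()
residuals = model.resid

# Assumption 4: uncorrelated errors (Durbin-Watson close to 2 is reassuring)
print("Durbin-Watson:", durbin_watson(residuals))

# Assumption 5: constant variance (small Breusch-Pagan p-value hints at heteroscedasticity)
print("Breusch-Pagan p-value:", het_breuschpagan(residuals, X)[1])

# Assumption 6: no perfect multicollinearity (very large VIFs are a warning sign)
print("VIFs:", [variance_inflation_factor(X, i) for i in range(1, X.shape[1])])

# Assumption 7: normally distributed errors (Shapiro-Wilk test)
print("Shapiro-Wilk p-value:", stats.shapiro(residuals)[1])
```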

Wrapping Up…

In this blog post, we’ve learned about the different types of linear regression and how linear regression is implemented using the ordinary least squares (OLS) method. Along with this, we covered the assumptions made by our regression models when using OLS.

Thanks for reading and stay tuned for more!
