Linear Regression

Jorge Leonel
5 min read · Jul 10, 2018


What is Linear Regression?

Linear regression is a statistical method of finding the relationship between independent and dependent variables.

For example: associating years of professional experience with remuneration.

In this case, “Years of Experience” is the independent variable (i.e., it is not something the model determines; it is taken as given) and “Compensation” is the dependent variable (the goal is to determine / predict salary, the dependent variable, based on years of experience).

Fitting the best intercept line

“Sum of Squared Errors” (SSE) is a simple, straightforward way to fit a line through the data points and to compare candidate lines in order to find the best fit through error reduction. The errors are the differences between the actual values and the predicted values, which are squared and summed.

The formula outlined below quantifies the total error over all dependent values (i.e., the sum of squared errors is the sum of the squared differences between each actual value of the dependent variable and the value predicted by the line):
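For a dataset with n points, where yᵢ is the actual value and ŷᵢ the value predicted by the line for point i, this can be written as:

SSE = Σᵢ (yᵢ − ŷᵢ)²   (summing over i = 1 … n)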

The role of OLS (Ordinary Least Squares)

Next, the “Ordinary Least Squares” (OLS) method is used to find the best intercept (b) and slope (m) for the line. [in y = mx + b, m is the slope and b the intercept]

The OLS method is reflected in the following equations:
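For a single independent variable x, the OLS estimates of the slope and intercept (the values that minimize the SSE above) can be written as:

m = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
b = ȳ − m·x̄

where x̄ and ȳ are the means of the independent and dependent variables.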

In other words → with OLS Linear Regression the goal is to find the line (or hyperplane) that minimizes the vertical offsets. We define the best-fitting line as the line that minimizes the sum of squared errors (SSE) or mean squared error (MSE) between our target variable (y) and our predicted output over all samples i in our dataset of size n.

It is important to point out, though, that the closed-form OLS solution above covers the univariate case (a single independent variable and a single dependent variable). When a dataset contains multiple independent variables (features), writing out a closed form becomes more cumbersome, and in machine learning the coefficients are commonly estimated instead with an iterative optimisation algorithm called “Gradient Descent”.

Gradient Descent Algorithm (GDA)

GDA’s main objective is to minimise the cost function.
The cost function, usually written J(𝜽0, 𝜽1), measures the error of the hypothesis h𝜽 on the data; minimising it yields the best possible values for 𝜽0 and 𝜽1, which provide the best-fit line for the data points.

It is one of the most widely used optimisation algorithms for minimising error (the difference between actual and predicted values). Using GDA, we search for the minimum of the cost function by trying successive values for 𝜽0 and 𝜽1 (the intercept and slope) and updating them until the process reaches convergence.

In other words → we start with some values for 𝜽0 and 𝜽1 and change these values iteratively to reduce the cost. Gradient descent tells us how to change the values.

It works roughly like this:

  • take a step in the downhill direction;
  • after each step, check the slope again and head in whichever direction goes downhill fastest, repeating until the bottom (the minimum) is reached.

Mathematically speaking:

  • 1. To update 𝜽0 and 𝜽1, we need the gradients of the cost function. To find these gradients, we take partial derivatives with respect to 𝜽0 and 𝜽1.

The partial derivatives are the gradients, and they are used to update the values of 𝜽0 and 𝜽1.

  • 2. The size of each step is set by the learning rate (𝛼 below). This decides how fast the algorithm converges to the minimum.

Alpha is the learning rate, a hyperparameter that you must specify. A smaller learning rate gets you closer to the minimum but takes more time to reach it; a larger learning rate converges sooner, but there is a chance that you overshoot the minimum.
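Putting the two pieces together, each iteration of gradient descent applies the update below (with J(𝜽0, 𝜽1) denoting the cost function, and both parameters updated simultaneously):

𝜽0 := 𝜽0 − 𝛼 · ∂J(𝜽0, 𝜽1)/∂𝜽0
𝜽1 := 𝜽1 − 𝛼 · ∂J(𝜽0, 𝜽1)/∂𝜽1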

***

IN SUMMARY

Start with the hypothesis:
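For a single feature x, the hypothesis of simple linear regression can be written as:

h𝜽(x) = 𝜽0 + 𝜽1·x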

which we need to fit to the training data.

We can use a cost function such as the Mean Squared Error:
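For n training samples (xᵢ, yᵢ), one common form is:

J(𝜽0, 𝜽1) = (1/n) · Σᵢ (h𝜽(xᵢ) − yᵢ)²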

which we can minimize using Gradient Descent:
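Repeated until convergence, updating both parameters simultaneously:

𝜽j := 𝜽j − 𝛼 · ∂J(𝜽0, 𝜽1)/∂𝜽j   for j = 0, 1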

The last piece of the puzzle needed for a working linear regression model is the partial derivatives of the cost function:
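That is (for j = 0, 1):

∂/∂𝜽j J(𝜽0, 𝜽1) = ∂/∂𝜽j [ (1/n) · Σᵢ (𝜽0 + 𝜽1·xᵢ − yᵢ)² ]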

Which turns out to be:
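∂J/∂𝜽0 = (2/n) · Σᵢ (h𝜽(xᵢ) − yᵢ)
∂J/∂𝜽1 = (2/n) · Σᵢ (h𝜽(xᵢ) − yᵢ) · xᵢ

(the factor of 2 disappears if the cost is defined with 1/2n instead of 1/n, which is another common convention)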

Which gives us linear regression.

Implementation in Python

Here's a sample OLS implementation in Python, using the popular Boston Housing dataset.
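The Boston Housing loader has been removed from recent scikit-learn releases, so the self-contained sketch below (a sketch of the same idea, not the original gist) uses synthetic experience-vs-salary data instead. It fits the line twice, with the closed-form OLS formulas and with gradient descent on the MSE cost, so the two results can be compared.

```python
# Minimal sketch: univariate linear regression on synthetic
# "years of experience vs. salary" data, fitted two ways --
# closed-form OLS and gradient descent on the MSE cost.
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: salary grows roughly linearly with experience, plus noise.
x = rng.uniform(0, 20, size=100)                         # years of experience
y = 40_000 + 2_500 * x + rng.normal(0, 5_000, size=100)  # salary

# Closed-form OLS: m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²,  b = ȳ − m·x̄
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

# Gradient descent on J(θ0, θ1) = (1/n) · Σ (θ0 + θ1·x − y)²
theta0, theta1 = 0.0, 0.0   # intercept and slope, starting at zero
alpha = 0.002               # learning rate
n = len(x)
for _ in range(50_000):
    error = theta0 + theta1 * x - y                 # h(x) − y
    theta0 -= alpha * (2 / n) * error.sum()         # step along ∂J/∂θ0
    theta1 -= alpha * (2 / n) * (error * x).sum()   # step along ∂J/∂θ1

print(f"closed-form OLS:  b = {b:.1f}, m = {m:.1f}")
print(f"gradient descent: b = {theta0:.1f}, m = {theta1:.1f}")
```

Both approaches should recover an intercept near 40,000 and a slope near 2,500, which are the parameters used to generate the data.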

Last but not least → there are extensions of the training of the linear model, called regularization methods, whose goal is both to minimize the sum of squared errors of the model on the training data (as in ordinary least squares) and to reduce the complexity of the model (for example, the number of coefficients or their total absolute size).

Two popular examples of regularization procedures for linear regression are:

  • Lasso Regression: where Ordinary Least Squares is modified to also minimize the sum of the absolute values of the coefficients (called L1 regularization).
  • Ridge Regression: where Ordinary Least Squares is modified to also minimize the sum of the squared coefficients (called L2 regularization).

These methods are effective to use when there is collinearity in your input values, a situation in which plain ordinary least squares tends to overfit the training data.
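As a quick illustration (a minimal sketch assuming scikit-learn; the alpha values below are arbitrary examples, not tuned settings), both penalized variants can be used as drop-in replacements for plain OLS:

```python
# Sketch: plain OLS vs. L1 (Lasso) and L2 (Ridge) penalties on data
# with a nearly collinear input column. Assumes scikit-learn and NumPy.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Add a fourth column that is almost identical to the first (collinearity).
X = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=100)])
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

for name, model in [("OLS", LinearRegression()),
                    ("Lasso (L1)", Lasso(alpha=0.1)),
                    ("Ridge (L2)", Ridge(alpha=1.0))]:
    model.fit(X, y)
    print(f"{name:10s} coefficients: {np.round(model.coef_, 2)}")
```

The penalized models shrink the unstable coefficients on the two collinear columns, which is the behavior the penalty terms are meant to produce.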
