Linear Regression: A Tale of a Transform

Safwan Ahmad Siddiqi
100 Shades of Machine Learning
Jul 29, 2018

I wrote a blog with a demo a few months back about how linear regression is best understood as a minimization problem in the weight (parameter) space. This is a summary of the original blog, without much math.

I have learned quite a lot about different machine learning algorithms, and found that to get a better understanding of the complex ones you should first try to understand one of the simplest: Linear Regression. These are the highlights of the blog:

  1. Linear regression is not a new technique: it has been around for roughly 200 years.
  2. Understanding Linear Regression boils down to minimizing the cost function.
  3. The cost function is always quadratic (parabola/paraboloid) for linear as well as for polynomial regression.
  4. Gradient descent is not a fundamental requirement for minimization.
  5. An alternative to gradient descent is to set the gradients of the cost to zero to get Normal equations.
  6. Normal equations can be solved with or without any matrix algebra.
  7. Geometrically, minimization amounts to finding the lowest point of a hyper-dimensional paraboloid.

What is Linear Regression

The idea of being able to predict the “unknown” based on what is already known is the genesis of many machine learning algorithms, and Linear Regression is one of them. In simple language linear regression means “line fitting”, an idea that dates back to the early 19th century.

But how can the notion of line fitting be helpful? In the blog I tried to find a line to predict a person’s body weight from their abdomen size. This is a sample of the data we used.

When the weight column is plotted against the abdomen column the data seems to lie nearly on a line.

The Weight-Abdomen Data seems to lie on a line. The green and orange lines are just two of the many candidate lines.

Out of the many candidate lines, linear regression finds the best one.

How to Find the Best Line Using a Cost Function

Any of the candidate lines can be represented by the equation

y = m * x + c 

where m and c will be different for every candidate.

In order to find the best fit line for the existing data we need a metric for the “goodness of fit”. One of the most commonly used metrics is the “mean squared error”, wherein we average the squares of the differences between the observed and predicted y values.

For a sequence of data in the form of (xᵢ,yᵢ), where the subscript i stands for the i-th data point, the mean squared error is:

(1/n)⋅Σ (yᵢ - m⋅xᵢ - c)^2

where the sum is over i and n is the number of data points. Because xᵢ and yᵢ are known values (abdomen size and weight), the sum becomes a quadratic expression in m and c. For our dataset the quadratic expression is:

c^2  + 183.20⋅c⋅m - 354.56⋅c + 8499.24⋅m^2  - 32996.13⋅m + 32233.76
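As a rough sketch of where a quadratic like this comes from, the snippet below expands the mean squared error into its coefficients in m and c. The abdomen and weight values are made-up placeholders rather than the blog’s dataset, so the printed coefficients will differ from the ones above.

    import numpy as np

    # Expand MSE(m, c) = mean((y - m*x - c)^2) into a quadratic in m and c:
    #   mean(x^2)*m^2 + 2*mean(x)*m*c + 1*c^2
    #   - 2*mean(x*y)*m - 2*mean(y)*c + mean(y^2)
    # The data below is a made-up placeholder, not the blog's dataset.
    x = np.array([85.2, 83.0, 87.9, 86.4, 100.0])           # abdomen size
    y = np.array([154.25, 173.25, 154.00, 184.75, 210.25])  # body weight

    coefficients = {
        "m^2": np.mean(x**2),
        "m*c": 2 * np.mean(x),
        "c^2": 1.0,
        "m":   -2 * np.mean(x * y),
        "c":   -2 * np.mean(y),
        "1":   np.mean(y**2),
    }
    for term, value in coefficients.items():
        print(f"{term}: {value:.2f}")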

Minimizing the Cost Function Using Normal Equations

The cost function is an expression in two variables: m and c. To minimize the cost function we can differentiate it with respect to each of the variables and set the gradients to zero. This is how we learnt to minimize expressions in calculus. For the quadratic expression above this process gives us two equations:

183.20⋅c + 16998.49⋅m - 32996.13 = 0
2⋅c + 183.20⋅m - 354.56 = 0
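If you prefer to let the computer do the calculus, a small sketch using sympy reproduces these two equations (up to rounding) by differentiating the quadratic cost from above:

    import sympy as sp

    # Differentiate the quadratic cost (from the text) with respect to m and c;
    # setting each result to zero gives the two equations shown above.
    m, c = sp.symbols("m c")
    cost = c**2 + 183.20*c*m - 354.56*c + 8499.24*m**2 - 32996.13*m + 32233.76

    print(sp.diff(cost, m))   # gradient with respect to m
    print(sp.diff(cost, c))   # gradient with respect to c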

Solving these equations gives the m and c that minimize the cost. In geometrical terms it gives us the unique line that minimizes the cost function. The equations can also be written in matrix form:

[ 183.20   16998.49 ] [ c ]   [ 32996.13 ]
[   2.00     183.20 ] [ m ] = [   354.56 ]

and can be solved by multiplying both sides by the inverse of the matrix:

[ c ]   [ 183.20   16998.49 ]⁻¹ [ 32996.13 ]
[ m ] = [   2.00     183.20 ]   [   354.56 ]

This is roughly the idea behind the so-called Normal equation method.
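In code this is a one-liner with numpy. A minimal sketch, using the rounded coefficients from the equations above (so the result is approximate):

    import numpy as np

    # Solve the normal equations written in matrix form above: A @ [c, m] = b.
    A = np.array([[183.20, 16998.49],
                  [  2.00,   183.20]])
    b = np.array([32996.13, 354.56])

    c, m = np.linalg.solve(A, b)   # safer than forming the inverse explicitly
    print(f"best line: weight = {m:.2f} * abdomen + {c:.2f}")

Note that np.linalg.solve factorizes the matrix rather than inverting it, which is the usual way to “multiply by the inverse” in practice.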

Normal Equations in the case of Multivariate Linear Regression

When there are two or more input variables, linear regression becomes the problem of fitting a plane (or hyperplane) instead of a line. The equation of a candidate plane for two input variables is of the form:

z = a*x + b*y + c

Once again the xᵢ, yᵢ and zᵢ are known, so the cost function:

(1/n)⋅Σ (zᵢ - a⋅xᵢ - b⋅yᵢ - c)^2

is again a quadratic, but in three variables: a, b and c. So we calculate three gradients and set them to zero. In this case it is harder to visualize the shape of the cost function because it lives in four dimensions (a function of three variables). We can use the same method to find the best fit when there are more than two input variables. In such cases a candidate hyper-plane has the equation:

h = a*w + b*x + c*y + d*z + ...

The cost function is still a quadratic, although in many more variables: a, b, c, d, … Its gradients with respect to each of the variables, when set to zero, still form a system of linear equations.
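As a sketch of what this looks like for an arbitrary number of input variables (the data below is randomly generated purely for illustration):

    import numpy as np

    # X has one row per data point and one column per input variable;
    # the appended column of ones accounts for the intercept term.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                 # placeholder inputs (w, x, y)
    h = X @ np.array([1.5, -2.0, 0.5]) + 3.0      # placeholder targets

    X1 = np.column_stack([X, np.ones(len(X))])

    # Setting all gradients of the squared-error cost to zero gives the
    # linear system (X1^T X1) @ params = X1^T h.
    params = np.linalg.solve(X1.T @ X1, X1.T @ h)
    print(params)   # coefficients a, b, c, ... followed by the intercept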

Why is Gradient Descent Needed at All

As you may have noticed, the number of variables in the cost function is one more than the number of input variables. When there is a single input variable, abdomen size (x), the cost is a function of two variables: m and c. When there are two input variables, x and y, the cost is a function of three variables: a, b and c. The number of gradients grows the same way: two gradients of the cost function in the first case, three in the second, and so on.

When the number of input variables is large, the number of gradient equations to set to zero and solve is correspondingly huge. The time complexity of solving a system of n linear equations is O(n³), which becomes prohibitive when the number of gradient equations reaches roughly 10,000 or more. In such cases it is not practical to use the closed-form normal equation method; instead an iterative process called “Gradient Descent” is used.
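For completeness, here is a minimal gradient-descent sketch for the same squared-error cost. The learning rate and number of steps are arbitrary choices, and on real data the inputs would typically need scaling for this to converge.

    import numpy as np

    # Iteratively step opposite to the gradient of the mean squared error
    # instead of solving the normal equations directly.
    def gradient_descent(X, y, lr=0.01, steps=1000):
        X1 = np.column_stack([X, np.ones(len(X))])   # add intercept column
        params = np.zeros(X1.shape[1])
        n = len(y)
        for _ in range(steps):
            residual = X1 @ params - y
            grad = (2.0 / n) * (X1.T @ residual)     # gradient of the MSE
            params -= lr * grad
        return params                                # coefficients ..., intercept

    # Example usage: params = gradient_descent(X, h, lr=0.01, steps=5000)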

Conclusion

If you have found this post useful, do check out my detailed blog at https://safwanahmad.github.io/2018/01/21/Linear-Regression-A-Tale-of-a-Transform.html

I’ll cover gradient descent, backpropagation, regularization and a derivation of the Normal equations in blogs to come.
