[ML from scratch] Linear Regression

Normal Equation and Gradient Descent

Giang Tran
Analytics Vidhya
4 min read · Sep 28, 2019

--

Linear Regression is usually the very first Machine Learning model we learn.

In this article, I will explain in depth two popular approaches to solving a linear regression problem: the Normal Equation and Gradient Descent.

To work well, linear regression relies on these assumptions:

  • Linear relationship between the features and the target.
  • Little or no multicollinearity between the features.
  • Normal distribution of the error terms.
  • Homoscedasticity (a.k.a. the constant variance assumption).
  • Little or no autocorrelation in the residuals.

I won’t cover the details of these assumptions here. For more, please check this post.

The idea of Linear Regression is simple. We are given data with a set of features and a continuous target. Our task is to find the best line (a plane in 3D, a hyperplane in higher dimensions) such that the error between the line's predictions and the targets is as small as possible.

Suppose we have a matrix of features and a vector of corresponding targets:
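$$X \in \mathbb{R}^{N \times D}, \qquad y \in \mathbb{R}^{N}$$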

where N is the number of data points and D is the number of dimensions (features) of each data point.

In its simplest form, linear regression is a linear transformation h mapping X to y through the parameter vector w:
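$$h(X) = Xw$$

(a bias term can be absorbed into X as an extra column of ones).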

1. Normal Equation:

Using a little linear algebra, we can find w in closed form:
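$$Xw = y \quad\Rightarrow\quad w = X^{-1}y$$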

However, in practice X is rarely a square matrix, so we usually cannot invert it directly. The general approach is to multiply both sides by the transpose of X, which produces a square matrix that we can then invert:
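$$X^{T}Xw = X^{T}y \quad\Rightarrow\quad w = (X^{T}X)^{-1}X^{T}y$$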

This approach has 2 drawbacks:

  • First, computing X transpose X is really expensive. Suppose X has shape (60000, 20); then computing X transpose X takes on the order of 20² × 60000 operations. It gets even worse when X is high-dimensional, where the computer memory can blow up.
  • Second, not every X transpose X is invertible, and matrix inversion itself is also an expensive computation.
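As a quick illustration, here is a minimal NumPy sketch of the normal equation (the toy data is my own example; it uses np.linalg.solve rather than an explicit inverse, which is cheaper and more numerically stable, though the cost of forming X transpose X remains either way):

```python
import numpy as np

def normal_equation(X, y):
    # Solve (X^T X) w = X^T y; np.linalg.solve avoids forming an explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy usage: recover y = 1 + 2x from noisy samples.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
X = np.hstack([np.ones((100, 1)), x])      # bias column of ones, then the feature
y = 1 + 2 * x[:, 0] + 0.01 * rng.standard_normal(100)
print(normal_equation(X, y))               # approximately [1. 2.]
```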

This is where the gradient descent algorithm comes in.

2. Gradient Descent:

The idea of the algorithm is that, at any point, we move a small step in the direction opposite to the derivative. With an appropriate learning rate, the algorithm is guaranteed to approach the global minimum (this is true for linear regression, whose loss is convex, but not for other, more advanced models).

The most popular loss for regression is Mean Squared Error (MSE), which has the form:
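$$J(w) = \frac{1}{N}\sum_{i=1}^{N}\bigl(h(x_i) - y_i\bigr)^2 = \frac{1}{N}\lVert Xw - y\rVert^2$$

(this is one common convention; some texts add a factor of ½ for convenience when differentiating).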

Our task now is to find the parameter w that minimizes the error between y and h(w). So how do we find the best w? As we learned in high school, a point x where the derivative f′(x) is zero is either a minimum or a maximum. Since the MSE is shaped like a bowl (it is a convex function), we can be sure that such a point is the minimum.

The derivative of J with respect to w is:
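$$\frac{\partial J}{\partial w} = \frac{2}{N}\,X^{T}(Xw - y)$$

(using the 1/N convention above; with the ½ factor the leading constant becomes 1/N).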

At each step, the gradient descent algorithm updates w as follows:
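$$w \leftarrow w - \alpha\,\frac{\partial J}{\partial w}$$

where α is the learning rate.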

If the learning rate is too large, the updates will overshoot and blow up, and the algorithm never converges.

If the learning rate is too small, convergence is very slow, or the algorithm may appear stuck at one point forever.
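Putting the update rule together with the gradient above, here is a minimal NumPy sketch of batch gradient descent for linear regression (the learning rate, step count, and toy data are my own illustrative choices):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_steps=1000):
    # Minimize J(w) = (1/N) * ||Xw - y||^2 with full-batch gradient descent.
    N, D = X.shape
    w = np.zeros(D)                            # start from all zeros
    for _ in range(n_steps):
        grad = (2.0 / N) * X.T @ (X @ w - y)   # dJ/dw derived above
        w -= lr * grad                         # small step against the gradient
    return w

# Same toy data as the normal-equation example: recovers roughly [1. 2.].
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
X = np.hstack([np.ones((100, 1)), x])
y = 1 + 2 * x[:, 0] + 0.01 * rng.standard_normal(100)
print(gradient_descent(X, y))
```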

[Figure: gradient descent with linear regression.]

Below is polynomial regression: although the fitted function is a curve, the optimization is still with respect to w, which enters the model linearly.
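For example, with a single feature x, a cubic model is

$$h(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$$

which is still linear in the parameters w; we simply use 1, x, x², x³ as the feature columns of X (e.g. np.vander(x, N=4, increasing=True) in NumPy), and both the normal equation and gradient descent apply unchanged.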

For the linear regression implementation, check it out here.
