Linear Regression: The Normal Equation

Prithvi Prakash
Apr 27, 2019 · 9 min read


Linear Regression is probably one of the simplest, yet incredibly powerful, Machine Learning Algorithms in use today. As the name suggests, we use this algorithm to solve regression problems, where the “target” we’re training our learning algorithm to predict takes continuous values.

Like other Supervised Machine Learning Algorithms, Linear Regression follows a familiar flow: we feed our Learning Algorithm a set of training examples, each containing a set of features, traditionally denoted “X”, and an outcome, denoted “y”. The Algorithm “learns” from these examples and generates a model, or hypothesis, that, if trained properly, should be able to make accurate predictions on unseen data.

Unlike traditional Algorithms, Learning Algorithms carry out their instructions from being shown, rather than being told.

In this piece, I will attempt to break down the underlying Philosophy and Mathematics of Linear Regression, and then implement it using The Normal Equation. We will be making use of the Portland Housing Dataset as our “toy data”.

The Philosophy And Mathematics

Below is a snapshot of our housing dataset.

| Area (Square Feet) | Number of Rooms | Price ($)  |
| ------------------ | --------------- | ---------- |
| 2104 | 3 | 399900 |
| 1600 | 3 | 329900 |
| 4478 | 5 | 699900 |

From the training data, we observe 2 features, Area and Number of Rooms, and an associated outcome, Price. The entire dataset has 47 such instances. The goal is to leverage the 2 features and their respective outcomes across the 47 examples to build a predictor that can effectively “guess” the price of a house when given those features for a “new house” that isn’t in this dataset.

In X, we will be storing our features, and in y, their corresponding prices. That leaves us with the following Matrices:

1.1 Data Matrix and Target Matrix; where m = 47, and n = 2;
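Concretely, with each row of X holding one house’s features and y holding the corresponding prices, the two matrices take a form along these lines:

```latex
X =
\begin{bmatrix}
x^{(1)}_1 & x^{(1)}_2 \\
x^{(2)}_1 & x^{(2)}_2 \\
\vdots    & \vdots    \\
x^{(m)}_1 & x^{(m)}_2
\end{bmatrix}
\in \mathbb{R}^{m \times n},
\qquad
y =
\begin{bmatrix}
y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)}
\end{bmatrix}
\in \mathbb{R}^{m \times 1},
\qquad m = 47,\; n = 2
```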

Our goal is to create a function that can map X to y, such that:

1.2 Approximation Function

We define our “approximation function” or “hypothesis” with a linear equation, as follows:

1.3 Linear Equation Mapping Data Matrix to the Hypothesis
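In standard notation, this hypothesis is a weighted sum of the features plus an intercept, along the lines of:

```latex
\hat{f}(X) = \theta_0 + \theta_1 X_1 + \theta_2 X_2
```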

In the above equation, 1.3, we’ve introduced a new set of terms, θ₀, θ₁, and θ₂. In order to accomplish a hypothesis as defined in 1.2, we need to tune θ₀, θ₁, and θ₂ to best fit the equation. These θ terms are commonly referred to as “weights” or “parameters”, and like X and y we store them in a matrix as follows:

1.4 Weight Matrix

Our Linear Equation in 1.3 can be rewritten as the Matrix dot product of X and Θ, but we encounter a dimensionality mismatch; X has m rows and n columns, whereas Θ is a row vector with (n+1) columns. To overcome this dimensionality mismatch, and accommodate the nought term θ₀, we “pad” X with an extra column of 1s. In essence, all we’re doing is:

1.5 Padding X with a column of 1s; where m = 47;
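In code, this padding step is a one-liner with NumPy. A minimal sketch (the variable names are illustrative, not taken from the accompanying notebook):

```python
import numpy as np

# X has shape (m, n): one row per house, columns = [area, number of rooms]
X = np.array([[2104, 3],
              [1600, 3],
              [4478, 5]], dtype=float)

# Prepend a column of 1s so that theta_0, the intercept, is "multiplied" by 1
X_padded = np.hstack([np.ones((X.shape[0], 1)), X])
print(X_padded.shape)  # (m, n + 1)
```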

Now, the question might arise, why are we going through all this trouble to accommodate θ₀? Why can’t we just drop the term altogether?
To answer that, recall the equation of a line:

1.6 Equation of a line
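That is, the familiar form:

```latex
y = mx + c
```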

The above equation, 1.6, has two components: m, the slope, and c, the intercept (this m is the slope of the line, not to be confused with the m we use for the number of training examples). The value of m drives the gradient of the line, while c determines the point at which the line intercepts the y-axis. With our equation in 1.3, we’re essentially trying to accomplish the same thing: tune θ₁ and θ₂ to control our gradients across the features X₁ and X₂, and tune θ₀, which is analogous to c, to find the ideal intercept value to best fit our data. Following our manipulation of X, we rewrite 1.3 as follows:

1.7 Linear Equation Mapping Data to the Hypothesis Function after “padding” X

And since X₀ is always 1, θ₀X₀ is essentially θ₀. Now that the dimensions of X and Θ are compatible, we rewrite 1.7 as the following Matrix Operation:

1.8 Linear Equation of Hypothesis as a Matrix Operation
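Assuming Θ is arranged as an (n+1)×1 column vector (if it were kept as a row vector, the product would be written XΘᵀ), the matrix form is simply:

```latex
\hat{f}(X) = X\Theta,
\qquad
X \in \mathbb{R}^{m \times (n+1)},
\qquad
\Theta =
\begin{bmatrix}
\theta_0 \\ \theta_1 \\ \theta_2
\end{bmatrix}
```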

Now that we have our Hypothesis in place, it’s time to make use of it to tune Θ to map to y. Recall from 1.2 that we want this hypothesis, f̂(X), to be as close to y as possible. While we know tuning our weights, Θ, is necessary to accomplish this approximation of y, we need some performance metric to gauge how far a particular choice of Θ is from doing so. To put things simply, all we’re trying to do is:

1.9 Hypothesis Performance Metric’s Goal

Hence, we define a “Loss Function” which gives us a “score” for how well or poorly a particular instance of Θ’s values performs at getting close to y. The most commonly used Loss Function is the Mean Squared Error (MSE), and that’s what we’ll be using to define our Loss Function as well. All MSE is, is the average of the individual squared distances between the hypothesis and the respective outcomes (whether we sum or average makes no difference to the minimiser, since the scaling factor is a constant). The following equation defines the MSE, which we denote by J:

1.11 MSE Loss Function
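The exact scaling constant varies between treatments (1/m, 1/2m, or none) and has no effect on which Θ minimises the loss; one common form is:

```latex
J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{f}\!\left(x^{(i)}\right) - y^{(i)} \right)^{2}
```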

It is important to understand that our Loss Function is parametrised by the value of our weights, Θ. In other words, changing Θ changes J(Θ). As stated earlier, the above equation, 1.11, is nothing but the sum of the squared distances between y and f̂(X). Given that, we can define the “Design Matrix” (here, simply the column vector of residuals, f̂(X) − y) as follows:

1.12 Design Matrix

On substituting 1.8 into the above equation, 1.12, we get the following:

1.13 Design Matrix with substituted Hypothesis

And on substituting the above-obtained Design Matrix, 1.13, into the loss function from 1.11, we get:

1.14 MSE Loss Function with Design Matrix

With our loss function in place, we can now go ahead and simplify the above equation as follows:

1.15 Simplifying Loss Function

In the above equation 1.15, pay close attention to the Second and Third terms; both terms produce scalars.

1.16 Dimensionality of Second and Third terms

And since both terms are scalars, and one is simply the transpose of the other (the transpose of a scalar is the scalar itself), they produce the same result. Hence, we can simplify our Loss Function further:

1.17 Loss Function further simplified
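Written out (and dropping any constant scaling factor, since it does not change where the minimum lies), the expansion and simplification look like this:

```latex
J(\Theta) = (X\Theta - y)^{\top}(X\Theta - y)
          = \Theta^{\top}X^{\top}X\Theta - \Theta^{\top}X^{\top}y - y^{\top}X\Theta + y^{\top}y
          = \Theta^{\top}X^{\top}X\Theta - 2\,\Theta^{\top}X^{\top}y + y^{\top}y
```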

With a simplified Loss Function, 1.17, in place, and the goal of reducing this “loss” as much as possible from 1.9, we can now take our next step of finding an instance of Θ such that J(Θ) becomes minimum. We do so by calculating the Derivative of our Loss Function with respect to Θ.

1.18 Differentiating Loss Function wrt Θ

Starting with the first term:

1.19 Differentiating against a term with no Θ

For the second term, we first factor out the Non-Θ terms:

1.20 Factor Out Non-Θ terms in Second Term

And then differentiate:

1.21 Differentiating the Second Term wrt Θ

Finally, that leaves us with our last term, and just like we did with our second term, we factor out Non-Θ terms first:

1.22 Factor Out Non-Θ terms in Third Term

Differentiating, we get:

1.23 Differentiating the Third Term wrt Θ

On plugging the derivatives of each term back into 1.18, we get:

1.24 First Derivative of Loss Function
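Using the standard matrix-calculus identities ∂(ΘᵀAΘ)/∂Θ = 2AΘ for symmetric A and ∂(Θᵀb)/∂Θ = b, the gradient comes out to (again, up to a constant factor):

```latex
\nabla_{\Theta} J(\Theta) = 2\,X^{\top}X\,\Theta \;-\; 2\,X^{\top}y
```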

Now that we have the first derivative of our Loss Function, we set it to 0 in order to calculate its minimum (because XᵀX is positive semi-definite, J(Θ) is convex, so this critical point is indeed a minimum):

1.25 Setting First Derivative to 0 to calculate minima

On getting rid of the constants, that leaves us with:

1.26 Removing Constants from First Derivative

Then, we shift the first term to the right-hand side of the equation as follows:

1.27 On shifting the First Term to RHS

Finally, bearing in mind that it’s Θ we’re trying to calculate, we multiply both sides by (XᵀX)⁻¹ to isolate Θ, which leaves us with:

1.28 The Normal Equation
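Spelled out, those last few steps are:

```latex
2\,X^{\top}X\,\Theta - 2\,X^{\top}y = 0
\;\Longrightarrow\;
X^{\top}X\,\Theta = X^{\top}y
\;\Longrightarrow\;
\Theta = \left(X^{\top}X\right)^{-1} X^{\top} y
```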

The Normal Equation!
So the value of Θ we get from the above equation, 1.28, gives us THAT instance of Θ for which the loss of our Loss Function, J(Θ) from 1.11, is minimum. This value can then be plugged back into the Hypothesis function that we formulated in 1.8, which was:

1.29 Hypothesis Function
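Putting the whole procedure together in NumPy might look something like the minimal sketch below (this is an illustration, not the accompanying notebook; function and variable names are my own):

```python
import numpy as np

def fit_normal_equation(X, y):
    """Pad X with a column of 1s and solve Theta = (X^T X)^(-1) X^T y."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.inv(X.T @ X) @ (X.T @ y)

def predict(X, theta):
    """Evaluate the hypothesis f_hat(X) = X . Theta on unpadded feature rows."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    return X @ theta

# Toy usage with the three rows shown earlier
X = np.array([[2104, 3], [1600, 3], [4478, 5]], dtype=float)
y = np.array([399900, 329900, 699900], dtype=float)
theta = fit_normal_equation(X, y)
print(predict(np.array([[2000, 3]], dtype=float), theta))
```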

Visualising Θ

1.30 Scatterplot of Area versus Price

While we’ve done our bit in building up a Mathematical intuition of Linear Regression through The Normal Equation, it’s important to visualise what exactly we just did.
In the graph above, 1.30, we visualise our dataset as a scatter plot of the area of the house against its price. At first glance, it seems fairly obvious that the relationship between area and price is linear, i.e., the greater the area, the more expensive the house. Hence, the choice of fitting a line using Linear Regression looks justified.

In the graph below, 1.31, we have 3 different lines where Θ has been set to 3 different values. Common sense dictates that the green line fits the data best amongst the three. The lines in black and blue clearly seem well off the mark.

1.31 Comparison of different Linear Models with various Θ values

Recall the Cost Function from 1.11:

1.32 MSE Cost Function
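As a quick sketch, the cost for a given padded X, y, and Θ could be computed like this (illustrative names again, with a 1/2m scaling that any constant could replace):

```python
import numpy as np

def mse_cost(X_padded, y, theta):
    """Mean squared error J(Theta) for a padded data matrix X_padded."""
    residuals = X_padded @ theta - y
    return (residuals @ residuals) / (2 * len(y))
```

Sweeping θ across a range of values and plotting mse_cost against it produces a curve like the one shown in 1.33.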

Below, in 1.33, is a visualisation of different weight values, Θ, in the interval −70 to 400, against their respective costs, J(Θ). Also marked with an “x” are the losses of the above lines for their respective values of Θ.

1.33 Weight(s), Θ Versus Cost, J(Θ)

From the visualisations in 1.31 and 1.33, it’s evident that what seems to be the best fit line (in green) takes a value of Θ for which its cost, J(Θ), is more or less at its minimum.
And The Normal Equation helps us calculate this particular value of Θ.

1.34 The Normal Equation

The Catch

Calculating the optimal weights, Θ, is at the crux of most Supervised Machine Learning. Given how tedious implementing Gradient Descent (and its variants) can be, the prospect of doing so with a single equation sounds a little too good to be true, and unfortunately, it is.

1.35 Dimensionality of Matrix to Invert

The catch with using The Normal Equation on real-world datasets is the need to calculate the inverse of a matrix, which is computationally very expensive. The complexity of inverting the (n+1)×(n+1) matrix XᵀX is O(n³), meaning that as the number of features in our dataset increases, the cost of inverting the matrix grows cubically! So whilst The Normal Equation is great for working with small datasets with relatively few features, it’s likely to take obscenely long on real-world datasets with many features.
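If you do reach for The Normal Equation on a larger problem, it is generally preferable, numerically, to solve the linear system XᵀXΘ = Xᵀy rather than form an explicit inverse; a small sketch of that variation:

```python
import numpy as np

def fit_normal_equation_solve(X_padded, y):
    """Solve (X^T X) Theta = X^T y directly, avoiding an explicit matrix inverse."""
    return np.linalg.solve(X_padded.T @ X_padded, X_padded.T @ y)
```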

You can find my Notebook implementation of the Normal Equations here.
