Linear Regression: The Normal Equation
Linear Regression is probably one of the simplest, yet incredibly powerful, Machine Learning Algorithms in use today. As the name suggests, we use it to solve regression problems, where the “target” we’re training our learning algorithm to predict takes continuous values.
Like other Supervised Machine Learning Algorithms, Linear Regression follows the familiar flow: we feed our Learning Algorithm a set of training examples, each containing a set of features, traditionally denoted “X”, and an outcome, denoted “y”. The Algorithm “learns” from these examples and generates a model, or hypothesis, that, if trained properly, should be able to make accurate predictions on unseen data.
In this piece, I will attempt to break down the underlying Philosophy and Mathematics of Linear Regression, and then implement it using The Normal Equation. We will be making use of the Portland Housing Dataset as our “toy data”.
The Philosophy And Mathematics
Below is a snapshot of our housing dataset.
| Area (Square Feet) | Number of Rooms | Price ($) |
| ------------------ | --------------- | ---------- |
| 2104 | 3 | 399900 |
| 1600 | 3 | 329900 |
| 4478 | 5 | 699900 |
From the training data, we observe 2 features, Area and Number of Rooms, and an associated outcome, Price. The entire dataset has 47 such instances. The goal here is to leverage the 2 features, and the outcomes we are afforded across the 47 examples, to build a predictor that can effectively “guess” the price of a house when given those same features for a “new house” that isn’t in this dataset.
In X, we will be storing our features, and in y, their corresponding prices. That leaves us with the following Matrices:
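Using the rows from the snapshot above (the full dataset has m = 47 such rows and n = 2 features), the matrices look something like:

$$
X = \begin{bmatrix} 2104 & 3 \\ 1600 & 3 \\ 4478 & 5 \\ \vdots & \vdots \end{bmatrix}, \qquad
y = \begin{bmatrix} 399900 \\ 329900 \\ 699900 \\ \vdots \end{bmatrix}
$$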
Our goal is to create a function that can map X to y, such that:
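In other words, we want a hypothesis f̂ whose outputs land as close as possible to the observed prices:

$$
\hat{f}(X) \approx y
$$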
We define our “approximation function” or “hypothesis” with a linear equation, as follows:
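With our two features, Area (X₁) and Number of Rooms (X₂), that linear equation takes a form along these lines:

$$
\hat{f}(X) = \theta_0 + \theta_1 X_1 + \theta_2 X_2
$$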
In the above equation, 1.3, we’ve introduced a new set of terms, θ₀, θ₁, and θ₂. In order to achieve the approximation described in 1.2, we need to tune θ₀, θ₁, and θ₂ to best fit the data. These θ terms are commonly referred to as “weights” or “parameters”, and like X and y we store them in a matrix as follows:
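Treating Θ as a single column of parameters:

$$
\Theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}
$$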
Our Linear Equation in 1.3 can be rewritten as the Matrix product of X and Θ, but we encounter a dimensionality mismatch; X has m rows and n columns, whereas Θ is a column vector with (n+1) entries. To overcome this dimensionality mismatch, and accommodate the nought term θ₀, we “pad” X with an extra column of 1s. In essence, all we’re doing is:
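So the padded X, again written out for the snapshot rows, becomes something like:

$$
X = \begin{bmatrix} 1 & 2104 & 3 \\ 1 & 1600 & 3 \\ 1 & 4478 & 5 \\ \vdots & \vdots & \vdots \end{bmatrix}
$$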
Now, the question might arise, why are we going through all this trouble to accommodate θ₀? Why can’t we just drop the term altogether?
To answer that, recall the equation of a line:
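In its familiar slope-intercept form:

$$
y = mx + c
$$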
The above equation in 1.6 has two components, m the slope, and c the intercept. The value of m drives the gradient of the line, while c determines the point at which the line intercepts the y-axis. From our equation in 1.3, we’re essentially trying to accomplish the same thing: tune θ₁ and θ₂, which play the role of m, to control the gradients along the features X₁ and X₂, and tune θ₀, which is analogous to c, to find the ideal intercept to best fit our data. Following our manipulation of X, we rewrite 1.3 as follows:
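With the dummy feature X₀ in place, the hypothesis reads:

$$
\hat{f}(X) = \theta_0 X_0 + \theta_1 X_1 + \theta_2 X_2
$$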
And since X₀ is always 1, θ₀X₀ is essentially θ₀. Now that we have the dimensions of X and Θ in compliance, we rewrite 1.7 as the following Matrix Operation:
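That is, for all m examples at once:

$$
\hat{f}(X) = X\Theta
$$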
Now that we have our Hypothesis in place, it’s time to tune Θ so that the hypothesis maps X to y. Recall from 1.2, we want this hypothesis, f̂(X), to be as close to y as possible. While we know tuning our weights, Θ, is necessary to accomplish this approximation of y, we need some performance metric to gauge how far off a particular instance of Θ is from doing so. To put things simply, all we’re trying to do is:
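Put loosely, find the Θ that makes the gap between predictions and outcomes as small as possible:

$$
\min_{\Theta}\; \big\lVert \hat{f}(X) - y \big\rVert
$$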
Hence, we define a “Loss Function” which gives us a “score” for how well or poorly a particular instance of Θ performed in getting close to y. The most commonly used Loss Function is the Mean Squared Error (MSE), and that’s what we’ll be using here as well. All the MSE is, is the mean of the individual squared distances between the hypothesis and the respective outcomes. The MSE, which we denote by J, is given by the following equation:
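Writing x⁽ⁱ⁾ and y⁽ⁱ⁾ for the i-th training example, averaging over the m examples, and folding in the conventional extra factor of ½ (which doesn’t change where the minimum lies), the loss is:

$$
J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \Big( \hat{f}\big(x^{(i)}\big) - y^{(i)} \Big)^2
$$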
It is important to understand that our Loss Function is parametrised by the value of our weights, Θ. In other words, changing Θ changes J(Θ). As stated earlier, the above equation, 1.11, is nothing but the sum of the squared distances between y and f̂(X), scaled by a constant. Given that, we can define the error (residual) vector as follows:
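Stacking all m differences between prediction and outcome into a single column:

$$
e = \hat{f}(X) - y
$$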
On substituting 1.8 into the above equation, 1.12, we get the following:
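Since f̂(X) = XΘ from 1.8:

$$
e = X\Theta - y
$$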
And on substituting the above-obtained error vector, 1.13, into the loss function from 1.11, we get:
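The sum of squares in 1.11 is exactly eᵀe, so in matrix form the loss becomes:

$$
J(\Theta) = \frac{1}{2m}\,(X\Theta - y)^T (X\Theta - y)
$$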
With our loss function in matrix form, we can now go ahead and expand the above expression as follows:
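Multiplying out the product term by term:

$$
J(\Theta) = \frac{1}{2m}\Big( \Theta^T X^T X \Theta \;-\; \Theta^T X^T y \;-\; y^T X \Theta \;+\; y^T y \Big)
$$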
In the above equation 1.15, pay close attention to the Second and Third terms; both terms produce scalars.
And since each of these terms is the transpose of the other, and the transpose of a scalar is the scalar itself, they produce the same result. Hence, we can simplify our Loss Function further:
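Merging the two middle terms into one:

$$
J(\Theta) = \frac{1}{2m}\Big( \Theta^T X^T X \Theta \;-\; 2\,\Theta^T X^T y \;+\; y^T y \Big)
$$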
With a simplified Loss Function, 1.17, in place, and the goal of reducing this “loss” as much as possible from 1.9, we can now take our next step of finding an instance of Θ such that J(Θ) becomes minimum. We do so by calculating the Derivative of our Loss Function with respect to Θ.
Starting with the first term:
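Using the matrix-calculus identity ∂(ΘᵀAΘ)/∂Θ = 2AΘ for a symmetric matrix A, with A = XᵀX here:

$$
\frac{\partial}{\partial \Theta}\left[\frac{1}{2m}\,\Theta^T X^T X \Theta\right] = \frac{1}{m}\, X^T X \Theta
$$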
For the second term, we first factor out the non-Θ terms and then differentiate.
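Here Xᵀy contains no Θ, so it behaves like a constant vector, and ∂(Θᵀc)/∂Θ = c:

$$
\frac{\partial}{\partial \Theta}\left[-\frac{1}{m}\,\Theta^T X^T y\right]
= -\frac{1}{m}\,\frac{\partial}{\partial \Theta}\Big[\Theta^T \big(X^T y\big)\Big]
= -\frac{1}{m}\, X^T y
$$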
Finally, that leaves us with our last term which, containing no Θ at all, behaves as a constant.
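Its derivative with respect to Θ is therefore zero:

$$
\frac{\partial}{\partial \Theta}\left[\frac{1}{2m}\, y^T y\right] = 0
$$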
On plugging the derivatives of each term back into 1.18, we get:
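Collecting the three pieces:

$$
\frac{\partial J(\Theta)}{\partial \Theta} = \frac{1}{m}\Big( X^T X \Theta - X^T y \Big)
$$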
Now that we have the first derivative of our Loss Function, we set it to 0 in order to find its minimum.
On getting rid of the constants, that leaves us with:
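The 1/m factor out front can never be zero, so it simply drops out:

$$
\frac{1}{m}\Big( X^T X \Theta - X^T y \Big) = 0
\;\;\Longrightarrow\;\;
X^T X \Theta - X^T y = 0
$$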
Then, we move the Xᵀy term to the right-hand side of the equation:
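Which gives:

$$
X^T X \Theta = X^T y
$$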
Finally, bearing in mind that it’s Θ we’re trying to calculate, we multiply both sides by (XᵀX)⁻¹, assuming XᵀX is invertible, which leaves us with:
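And there it is:

$$
\Theta = \big(X^T X\big)^{-1} X^T y
$$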
The Normal Equation!
So the value of Θ we get from the above equation, 1.28, gives us THAT instance of Θ for which the loss from our Loss Function, J(Θ) in 1.11, is at its minimum. This value can then be plugged into the Hypothesis function that we formulated back in 1.8.
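Which, recall, was simply the matrix product of the padded features and the tuned weights:

$$
\hat{f}(X) = X\Theta
$$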
Visualising Θ
While we’ve done our bit in building up a Mathematical intuition of Linear Regression through The Normal Equation, it’s important to visualise what exactly we just did.
In the graph above, 1.30, we visualise our dataset as a scatter plot of the area of the house against its price. At first glance, it seems fairly obvious that the relationship between area and price is roughly linear, i.e., the greater the area, the more expensive the house. Hence, the choice of fitting a line using Linear Regression looks justified.
In graph 1.31 below, we have three different lines, where Θ has been set to three different values. Common sense dictates that the green line fits the data best amongst the three; the lines in black and blue are clearly off the mark.
Recall the Cost Function from 1.11:
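As before, with the 1/(2m) scaling folded in:

$$
J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \Big( \hat{f}\big(x^{(i)}\big) - y^{(i)} \Big)^2
$$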
Below, in 1.33, is a visualisation of different weight values, Θ, in the interval -70 to 400, against their respective costs, J(Θ). Also marked with an “x” are the costs of the above lines for their respective values of Θ.
From the visualisations in 1.31 and 1.33, it’s evident that what appears to be the best-fit line (in green) takes a value of Θ for which its cost, J(Θ), is more or less at its minimum.
And The Normal Equation helps to calculate this particular value of Θ.
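To get a feel for what a plot like 1.33 involves, here is a rough sketch of the kind of sweep that could produce it. It uses only the three snapshot rows, and, for the sake of a one-dimensional picture, treats Θ as a single weight on area with the intercept dropped; none of this is taken from the Notebook:

```python
import numpy as np

# Three rows from the snapshot above; the full dataset has 47 examples.
area = np.array([2104, 1600, 4478], dtype=float)
price = np.array([399900, 329900, 699900], dtype=float)

# Sweep a single weight theta over the interval used in graph 1.33,
# modelling price ≈ theta * area, and record the cost J(theta) for each.
thetas = np.linspace(-70, 400, 500)
costs = np.array([((t * area - price) ** 2).sum() / (2 * len(area)) for t in thetas])

best = thetas[np.argmin(costs)]
print(f"theta with the lowest cost in this sweep: {best:.1f}")
```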
The Catch
Calculating the optimal weights, Θ, is at the crux of Supervised Machine Learning for the most part. Given how tedious implementing Gradient Descent (and its variants) can be, the prospect of getting the same result from one simple equation sounds a little too good to be true, which unfortunately it is.
The catch with using The Normal Equation on real-world datasets is the need to calculate the inverse of a matrix, which is computationally very expensive. The complexity of inverting a matrix is O(n³), meaning that as the number of features in our dataset increases, the cost of inverting XᵀX grows cubically! So while The Normal Equation is great for small datasets with relatively few features, it’s likely to take obscenely long on large, real-world datasets.
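For a sense of what this looks like in code, here is a minimal NumPy sketch of the closed-form solution; the variable names are illustrative rather than taken from the Notebook, and np.linalg.solve is used in place of an explicit inverse, since solving the linear system XᵀXΘ = Xᵀy is cheaper and numerically safer than forming (XᵀX)⁻¹ directly:

```python
import numpy as np

def normal_equation(X, y):
    """Return the weights Theta minimising the squared error of X_padded @ Theta ≈ y."""
    m = X.shape[0]
    X_padded = np.column_stack([np.ones(m), X])   # prepend the column of 1s for theta_0
    # Solve (X^T X) Theta = X^T y rather than inverting X^T X explicitly.
    return np.linalg.solve(X_padded.T @ X_padded, X_padded.T @ y)

# The three snapshot rows; on the full 47-example dataset this becomes a least-squares fit.
X = np.array([[2104, 3], [1600, 3], [4478, 5]], dtype=float)
y = np.array([399900, 329900, 699900], dtype=float)

theta = normal_equation(X, y)
predicted_price = np.array([1.0, 2000.0, 3.0]) @ theta   # estimate for a 2000 sq ft, 3-room house
```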
You can find my Notebook implementation of the Normal Equations here.