An Introduction to Machine Learning in Python: The Normal Equation for Regression in Python

Hunter Phillips
4 min read · May 22, 2023


The Normal Equation is a closed-form solution for minimizing a cost function and identifying the coefficients for regression.

Background

In the previous article, An Introduction to Machine Learning in Python: Simple Linear Regression, the gradient descent approach was used to minimize the MSE cost function. However, that approach required a large number of epochs and a small learning rate, and good values for both can be difficult to find without trial and error.

An alternative approach is a closed-form solution that does not require a learning rate or epochs. The closed-form solution for regression is known as the Normal Equation. It can be used to directly determine the weights of a line of best fit. It will be derived in this article and then implemented in Python.

Deriving the Normal Equation

In A Simple Introduction to Gradient Descent, the matrix derivative of the MSE was calculated: ∂MSE/∂w = (2/n) (Xw − Y)^T X, where X holds the inputs (with a bias column), Y the expected outputs, and w the weights.

This partial derivative can be set equal to 0, which indicates where the cost function is at a minimum for each weight. By solving for w, a direct equation to calculate these values can be identified.

set equal to 0: 0 = (2/n) (Xw − Y)^T X
multiply by n/2: 0 = (Xw − Y)^T X
place each term on its own side: w^T X^T X = Y^T X
transpose both sides: (w^T X^T X)^T = (Y^T X)^T
simplify: X^T X w = X^T Y
use the inverse of X^T X to isolate w: (X^T X)^-1 X^T X w = (X^T X)^-1 X^T Y
simplify: w = (X^T X)^-1 X^T Y

To prove this returns weights of the anticipated size, the shape of each component can be examined. X is (n samples, num features), so X^T X and its inverse are (num features, num features), and X^T Y is (num features, 1). Their product is therefore a vector with a size of (num features, 1). This is the same size as the original weight vector from the previous article, An Introduction to Machine Learning in Python: Simple Linear Regression.
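As a quick sanity check, these shapes can be verified with random tensors (a minimal sketch; the sizes simply mirror the example used later in this article, where the bias column counts as a feature):

import torch

n_samples, num_features = 20, 2   # example sizes only; 2 features = bias column + input

X = torch.randn(n_samples, num_features)
Y = torch.randn(n_samples, 1)

print((X.T @ X).shape)                           # (num features, num features)
print((X.T @ Y).shape)                           # (num features, 1)
print((torch.inverse(X.T @ X) @ X.T @ Y).shape)  # (num features, 1)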

Implementing the Normal Equation in Python

This equation can be implemented in Python, and the same example from the previous article can be used.

import torch

def NormalEquation(X, Y):
    """
    Inputs:
      X: array of input values      | (n samples, num features)
      Y: array of expected outputs  | (n samples, 1)

    Output:
      returns the optimized weights | (num features, 1)
    """
    # w = (X^T X)^-1 X^T Y
    return torch.inverse(X.T @ X) @ X.T @ Y
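Explicitly inverting X^T X can be slow or numerically unstable, and it fails outright when the matrix is singular. As an optional alternative (a sketch, not part of the original example; the helper names below are just for illustration), the same weights can be obtained by solving the linear system directly or via least squares:

import torch

def NormalEquationSolve(X, Y):
    # solves X^T X w = X^T Y directly, avoiding an explicit inverse
    return torch.linalg.solve(X.T @ X, X.T @ Y)

def NormalEquationLstsq(X, Y):
    # least-squares solution; can cope with rank-deficient X depending on the driver
    return torch.linalg.lstsq(X, Y).solution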

With the function created, all that is necessary is some input data, which is generated below:

import torch

torch.manual_seed(5)
torch.set_printoptions(precision=2)

# (n samples, features)
X = torch.randint(low=0, high=11, size=(20, 1))

# normal distribution with a mean of 0 and std of 1
normal = torch.distributions.Normal(loc=0, scale=1)

# generate output
Y = (1.5*X + 2) + normal.sample(X.shape)

# add bias column
X = torch.hstack((torch.ones(X.shape), X))

These can be plugged into the Normal Equation to generate the optimized weights:

w = NormalEquation(X, Y)
print(w)

tensor([[1.97],
        [1.52]])

These weights are nearly identical to the blueprint function. Instead of 2 and 1.5, the equation output 1.97 and 1.52. They aren’t exact because of the noise added to the outputs. Furthermore, these values are more accurate than those from the previous article, since a learning rate and a specific number of epochs did not have to be selected.
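To quantify how close the fit is, the recovered weights can be plugged back into the model and the MSE computed against the noisy outputs (an optional check that reuses X, Y, and w from above; with unit-variance noise the error should come out close to 1):

Y_hat = X @ w                       # predictions from the fitted weights
mse = torch.mean((Y_hat - Y)**2)    # mean squared error against the noisy outputs
print(mse)                          # roughly the variance of the added noise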

When to Use It

While this approach seems preferable to gradient descent, both have their use cases. For simple problems with small datasets, the Normal Equation will suffice. As the number of features grows, so does the size of the matrix that must be inverted, which is (num features, num features); inverting it takes roughly cubic time in the number of features, and forming X^T X still requires a pass over every sample. This can be expensive to compute.
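A rough illustration of that growth is below; the timings depend entirely on hardware, so treat this as a sketch of the trend rather than a benchmark:

import time
import torch

n_samples = 5000
for num_features in (10, 100, 1000):
    X = torch.randn(n_samples, num_features)
    start = time.perf_counter()
    torch.inverse(X.T @ X)          # the (num features, num features) inverse
    print(f"{num_features} features: {time.perf_counter() - start:.4f} s")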

When the number of features is large, gradient descent should be used instead. Gradient descent also pairs well with techniques, such as regularization, that help a model generalize rather than overfit to the training data.
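For comparison, here is a minimal gradient descent loop on the same X and Y generated above, in the spirit of the previous article (the learning rate and epoch count are assumed values, not carefully tuned ones):

w_gd = torch.zeros(2, 1)    # start from zero weights: bias and slope
lr = 0.01                   # assumed learning rate
epochs = 5000               # assumed number of epochs

n = X.shape[0]
for _ in range(epochs):
    grad = (2 / n) * X.T @ (X @ w_gd - Y)   # gradient of the MSE
    w_gd -= lr * grad                       # step opposite the gradient

print(w_gd)   # approaches the Normal Equation weights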

For the next two articles, both approaches will be used. The next article is An Introduction to Machine Learning in Python: Multiple Linear Regression.

Please don’t forget to like and follow! :)

