An Introduction to Machine Learning in Python: Polynomial Regression

Hunter Phillips
5 min read · May 22, 2023


Polynomial regression can identify a nonlinear relationship between an independent variable and a dependent variable.

Background

This article is the third in a series on regression, gradient descent, and MSE. The previous articles cover Simple Linear Regression, The Normal Equation for Regression, and Multiple Linear Regression.

Polynomial Regression

Polynomial regression is used on data that is best fit by a curve rather than a straight line. It can be treated as a special case of multiple linear regression in which every feature is a power of a single input:

Ŷ = w₀X₀ + w₁X₁ + w₂X₁² + … + wₙX₁ⁿ

Note that X₀ is a column of ones for the bias; this allows for the generalized formula discussed in the first article. In the equation above, each “independent” variable is simply an exponentiated version of X₁.

This allows the same model from multiple linear regression to be reused, since only the coefficients of the variables need to be identified. As an example, a simple third-degree polynomial model can be created. Its equation follows:

Ŷ = w₀X₀ + w₁X₁ + w₂X₁² + w₃X₁³

The generalized functions for the model, gradient descent, and the MSE can be used from the previous articles:

# line of best fit
def model(w, X):
    """
    Inputs:
      w: array of weights | (num features, 1)
      X: array of inputs  | (n samples, num features)

    Output:
      returns the output of X@w | (n samples, 1)
    """
    return torch.matmul(X, w)


# mean squared error (MSE)
def MSE(Yhat, Y):
    """
    Inputs:
      Yhat: array of predictions   | (n samples, 1)
      Y: array of expected outputs | (n samples, 1)

    Output:
      returns the loss of the model, which is a scalar
    """
    return torch.mean((Yhat - Y)**2)  # mean((error)^2)


# optimizer
def gradient_descent(w):
    """
    Inputs:
      w: array of weights | (num features, 1)

    Global Variables / Constants:
      X: array of inputs           | (n samples, num features)
      Y: array of expected outputs | (n samples, 1)
      lr: learning rate to scale the gradient

    Output:
      returns the updated weights
    """
    n = X.shape[0]

    return w - (lr * 2/n) * (torch.matmul(-Y.T, X) + torch.matmul(torch.matmul(w.T, X.T), X)).reshape(w.shape)
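
As an optional sanity check (not part of the original article), the closed-form update in gradient_descent can be compared against PyTorch's autograd. The expression above is algebraically equivalent to the textbook gradient (2/n)·Xᵀ(Xw − Y), so the two should agree up to floating-point error. The small dummy tensors below are made up purely for this check.

# optional check: the analytic MSE gradient should match autograd on dummy data
Xc = torch.randn(10, 4)
Yc = torch.randn(10, 1)
wc = torch.randn(4, 1, requires_grad=True)

loss = torch.mean((Xc @ wc - Yc)**2)
loss.backward()

n = Xc.shape[0]
manual_grad = (2/n) * (Xc.T @ (Xc @ wc.detach() - Yc))
print(torch.allclose(wc.grad, manual_grad, atol=1e-5))  # should print True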

Creating the Data

Now, all that is required is some data to train the model with. A “blueprint” function can be used, and randomness can be added on top of it. This follows the same approach as the previous articles. The blueprint can be seen below:

Y = 3X₁³ + 2X₁² + X₁ + 5 + ε,  where ε is sampled from a normal distribution with a mean of 0 and a standard deviation of 8

A train set with a size of (800, 4) and a test set with a size of (200, 4) can be created. Note that each feature, except the bias, is an exponentiated version of the first.

import torch

torch.manual_seed(5)
torch.set_printoptions(precision=2)

# features
X0 = torch.ones((1000,1))
X1 = (100*(torch.rand(1000) - 0.5)).reshape(-1,1) # generates 1000 random numbers from -50 to 50
X2, X3 = X1**2, X1**3
X = torch.hstack((X0,X1,X2,X3))

# normal distribution with a mean of 0 and std of 8
normal = torch.distributions.Normal(loc=0, scale=8)

# targets
Y = (3*X[:,3] + 2*X[:,2] + 1*X[:,1] + 5 + normal.sample(torch.ones(1000).shape)).reshape(-1,1)

# train, test
Xtrain, Xtest = X[:800], X[800:]
Ytrain, Ytest = Y[:800], Y[800:]
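
As a small aside, the same design matrix can be generated for any polynomial degree with a helper function. The poly_features function below is not part of the original article; it is just a hypothetical convenience that reproduces the hstack above when the degree is 3.

# hypothetical helper: builds [1, x, x^2, ..., x^degree] for an (n, 1) input
def poly_features(x, degree):
    return torch.hstack([x**d for d in range(degree + 1)])

print(torch.equal(X, poly_features(X1, 3)))  # should print True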

After defining the initial weights, the data can be plotted with the line of best fit.

torch.manual_seed(5)
w = torch.rand(size=(4, 1))
w
tensor([[0.83],
        [0.13],
        [0.91],
        [0.82]])
import matplotlib.pyplot as plt

def plot_lbf():
    """
    Output:
      prints the line of best fit in comparison to the train and test data
    """
    # plot the train and test sets
    plt.scatter(Xtrain[:,1], Ytrain, label="train")
    plt.scatter(Xtest[:,1], Ytest, label="test")

    # plot the line of best fit
    X1_plot = torch.arange(-50, 50.1, .1).reshape(-1,1)
    X2_plot, X3_plot = X1_plot**2, X1_plot**3
    X0_plot = torch.ones(X1_plot.shape)
    X_plot = torch.hstack((X0_plot, X1_plot, X2_plot, X3_plot))

    plt.plot(X1_plot.flatten(), model(w, X_plot).flatten(), color="red", zorder=4)

    plt.xlim(-50, 50)
    plt.xlabel("$X$")
    plt.ylabel("$Y$")
    plt.legend()
    plt.show()

plot_lbf()
[Figure: the train and test data with the initial, untrained line of best fit. Image by Author]

Training the Model

To partially minimize the cost function, a learning rate of 5e-11 and 500,000 epochs can be used with gradient descent.

lr = 5e-11
epochs = 500000

# update the weights 500,000 times
for i in range(0, epochs):
    # update the weights
    w = gradient_descent(w)

    # print the new values every 100,000 epochs
    if (i+1) % 100000 == 0:
        print("epoch:", i+1)
        print("weights:", w)
        print("Train MSE:", MSE(model(w, Xtrain), Ytrain))
        print("Test MSE:", MSE(model(w, Xtest), Ytest))
        print("="*10)

plot_lbf()
epoch: 100000
weights: tensor([[0.83],
        [0.13],
        [2.00],
        [3.00]])
Train MSE: tensor(163.87)
Test MSE: tensor(162.55)
==========
epoch: 200000
weights: tensor([[0.83],
        [0.13],
        [2.00],
        [3.00]])
Train MSE: tensor(163.52)
Test MSE: tensor(162.22)
==========
epoch: 300000
weights: tensor([[0.83],
        [0.13],
        [2.00],
        [3.00]])
Train MSE: tensor(163.19)
Test MSE: tensor(161.89)
==========
epoch: 400000
weights: tensor([[0.83],
        [0.13],
        [2.00],
        [3.00]])
Train MSE: tensor(162.85)
Test MSE: tensor(161.57)
==========
epoch: 500000
weights: tensor([[0.83],
        [0.13],
        [2.00],
        [3.00]])
Train MSE: tensor(162.51)
Test MSE: tensor(161.24)
==========
[Figure: the train and test data with the line of best fit after 500,000 epochs of gradient descent. Image by Author]

Even with 500,000 epochs and an extremely small learning rate, the model fails to identify the first two weights (the bias and the coefficient of X₁). While the fit looks close on the plot, with a test MSE of 161.24, it would likely take millions more epochs to fully minimize the cost. The underlying issue is the scale of the features: X₁³ reaches values of roughly ±125,000 while the bias column is a constant 1, so the learning rate has to be tiny to keep the largest weight from diverging, which leaves the smaller weights updating extremely slowly. This is one of the limitations of gradient descent for polynomial regression.
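
One common remedy, which this series has not used so far, is to standardize the features before running gradient descent. The sketch below is my own illustration with assumed settings (5,000 iterations and a learning rate of 0.1): rescaling every non-bias column to zero mean and unit variance makes the loss surface far better conditioned, so a much larger learning rate converges quickly. The resulting weights live in the standardized feature space, so they are not directly comparable to the raw ones.

# illustrative sketch: gradient descent on standardized features (assumed setup)
Xs = Xtrain.clone()
mean, std = Xs[:, 1:].mean(dim=0), Xs[:, 1:].std(dim=0)
Xs[:, 1:] = (Xs[:, 1:] - mean) / std

torch.manual_seed(5)
ws = torch.rand(size=(4, 1))

for i in range(5000):
    grad = (2 / Xs.shape[0]) * (Xs.T @ (Xs @ ws - Ytrain))  # gradient of the MSE
    ws = ws - 0.1 * grad  # a learning rate of 0.1 is now stable

print("Train MSE:", MSE(Xs @ ws, Ytrain))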

The Normal Equation

As an alternative, the Normal Equation from the second article can be used to directly compute the optimized weights in a single step: w = (XᵀX)⁻¹XᵀY.

def NormalEquation(X, Y):
    """
    Inputs:
      X: array of input values     | (n samples, num features)
      Y: array of expected outputs | (n samples, 1)

    Output:
      returns the optimized weights | (num features, 1)
    """
    return torch.inverse(X.T @ X) @ X.T @ Y

w = NormalEquation(Xtrain, Ytrain)
w
tensor([[4.57],
        [0.98],
        [2.00],
        [3.00]])

The Normal Equation immediately identifies nearly the correct value for each weight (the small differences from the blueprint come from the added noise), and the MSE for each set is about 100 points lower than with gradient descent:

MSE(model(w,Xtrain), Ytrain), MSE(model(w,Xtest), Ytest)
(tensor(60.64), tensor(63.84))
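
As a side note (not from the original article), explicitly inverting XᵀX can become numerically unstable when the features span scales this different. A least-squares solver such as torch.linalg.lstsq reaches the same weights more robustly:

# more numerically stable alternative to the explicit inverse
w_lstsq = torch.linalg.lstsq(Xtrain, Ytrain).solution
print(w_lstsq)  # should closely match the Normal Equation weights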

Conclusion

With simple linear, multiple linear, and polynomial regression implemented, the next two articles will cover Lasso and Ridge regression. These types of regression introduce two important concepts in machine learning: overfitting and regularization.

Please don’t forget to like and follow! :)
