Polynomial regression in Machine Learning: A mathematical guide

Chamuditha Kekulawala
7 min read · Jun 7, 2024


Up to part 3, we discussed Linear Regression models. But what if your data is actually more complex than a simple straight line? Surprisingly, you can still use a linear model to fit nonlinear data. A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended set of features. This technique is called Polynomial Regression.

Polynomial Regression

Let’s generate some nonlinear data, based on a simple quadratic equation (plus some random noise):

import numpy as np

m = 100
X = 6 * np.random.rand(m, 1) - 3                # 100 random values in [-3, 3)
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)  # the quadratic equation, plus Gaussian noise

Clearly a straight line will never fit this data properly. So let’s use Scikit-Learn’s PolynomialFeatures class to transform our training data, adding the square of each feature in the training set as new features (in this case there is just one feature):

from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X_poly[0]

X_poly[0] now contains the original feature value of the first training instance, plus the square of this feature:

array([-0.75275929, 0.56664654])

Now you can fit a LinearRegression model to this extended training data:

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)

The model estimates y = 0.56x₁² + 0.93x₁ + 1.78 when in fact the original function was y = 0.5x₁² + 1.0x₁ + 2.0 + Gaussian noise.
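To see where those numbers come from, you can inspect the fitted model's parameters directly; a minimal check (using the lin_reg fitted above) looks like this:

# the learned intercept, and the weights for the extended features [x1, x1^2]
print(lin_reg.intercept_, lin_reg.coef_)
# roughly [1.78] and [[0.93 0.56]] for the run above (your exact values will differ)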

Note that when there are multiple features, Polynomial Regression is capable of finding relationships between features (which is something a plain Linear Regression model cannot do). This is made possible by the fact that PolynomialFeatures also adds all combinations of features up to the given degree.

For example, if there were two features a and b, PolynomialFeatures with degree=3 would not only add the features a², a³, b², and b³, but also the combinations ab, a²b, and ab².
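You can check which combinations get added without fitting anything, by asking PolynomialFeatures for its output feature names; a minimal sketch (assuming a recent Scikit-Learn version that provides get_feature_names_out) is:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

ab = np.array([[2.0, 3.0]])  # one instance with two features, a and b (illustrative values)
poly = PolynomialFeatures(degree=3, include_bias=False)
poly.fit(ab)
print(poly.get_feature_names_out(["a", "b"]))
# ['a' 'b' 'a^2' 'a b' 'b^2' 'a^3' 'a^2 b' 'a b^2' 'b^3']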

Evaluating your model

If you perform high-degree Polynomial Regression, you will likely fit the training data much better than with plain Linear Regression. For example, the following figure applies a 300-degree polynomial model to our training data above, and compares the result with a pure linear model and a quadratic model:
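A minimal sketch that produces this comparison (reusing the X and y generated earlier; the StandardScaler step is an extra assumption added here just to keep the degree-300 fit numerically stable):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X_new = np.linspace(-3, 3, 100).reshape(100, 1)  # points at which to plot the predictions

for style, width, degree in (("g-", 1, 300), ("b--", 2, 2), ("r-+", 2, 1)):
    model = Pipeline([
        ("poly_features", PolynomialFeatures(degree=degree, include_bias=False)),
        ("std_scaler", StandardScaler()),
        ("lin_reg", LinearRegression()),
    ])
    model.fit(X, y)
    plt.plot(X_new, model.predict(X_new), style, linewidth=width, label=f"degree {degree}")

plt.plot(X, y, "b.")  # the training instances
plt.legend()
plt.show()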

Notice how the 300-degree polynomial model wiggles around to get as close as possible to the training instances. Of course, this high-degree Polynomial Regression model is severely overfitting the training data, while the linear model is underfitting it. The model that will generalize best in this case is the quadratic model.

This makes sense, since the data was generated using a quadratic model. In general, though, you won't know what function generated the data. So how can you tell whether your model is overfitting or underfitting, and how do you decide how complex it should be?

Cross validation

One approach is to use cross-validation to get an estimate of a model's generalization performance. The idea is to split the training set into a smaller training set and a validation set (a fold), train your models on the smaller training set, and evaluate them on the validation set. For this, we can use Scikit-Learn's K-fold cross-validation features (for example the cross_val_score() function).

For example, with K=10 this randomly splits the training set into 10 distinct subsets called folds, then trains and evaluates the model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores.
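As a hedged sketch of what this looks like in code, here is 10-fold cross-validation of the plain linear model on the data generated earlier, using Scikit-Learn's cross_val_score() helper (the scoring name returns the negative MSE, so we flip the sign before taking the square root):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

lin_reg = LinearRegression()
scores = cross_val_score(lin_reg, X, y, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)  # one RMSE per fold

print("RMSE per fold:", rmse_scores)
print("Mean:", rmse_scores.mean(), "Std:", rmse_scores.std())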

If a model performs well on the training data but generalizes poorly according to the cross-validation metrics, then your model is overfitting.

If it performs poorly on both, then it is underfitting.

This is one way to tell when a model is too simple or too complex.

Learning curves

Another approach is to look at the learning curves: these are plots of the model’s performance on the training set and the validation set as a function of the training set size (or the training iteration). To generate the plots, simply train the model several times on differently sized subsets of the training set.

The following code defines a function that plots the learning curves of a linear model given some training data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])           # train on the first m instances only
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")  # RMSE on the training set
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")       # RMSE on the validation set
    plt.legend()

lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)

Let me explain what’s going on here:

First, let’s look at the performance on the training data:

When there are just one or two instances in the training set, the model can fit them perfectly, which is why the curve starts at zero. But as new instances are added to the training set, it becomes impossible for the model to fit the training data perfectly, both because the data is noisy and because it is not linear at all. So the error on the training data goes up until it reaches a plateau, at which point adding new instances to the training set doesn’t make the average error much better or worse.

Now let’s look at the performance on the validation data:

When the model is trained on very few training instances, it is incapable of generalizing properly, which is why the validation error is initially quite big. Then as the model is shown more training examples, it learns and thus the validation error slowly goes down. However, once again a straight line cannot do a good job modeling the data, so the error ends up at a plateau, very close to the other curve.

These learning curves are typical of an underfitting model. Both curves have reached a plateau; they are close together and fairly high.

One very important observation here is that if your model is underfitting the training data, adding more training examples will not help. You need to use a more complex model or come up with better features.

Now let’s look at the learning curves of a 10th-degree polynomial model on the same data:

from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
plot_learning_curves(polynomial_regression, X, y)

These learning curves look a bit like the previous ones, but there are 2 very important differences:

  • The error on the training data is much lower than with the Linear Regression model.
  • There is a gap between the curves. This means that the model performs significantly better on the training data than on the validation data, which is the hallmark of an overfitting model. However, if you used a much larger training set, the two curves would continue to get closer.

The Bias/Variance Tradeoff

An important theoretical result of statistics and Machine Learning is the fact that a model’s generalization error (how accurately an algorithm is able to predict outcome values for previously unseen data) can be expressed as the sum of three very different errors:

Bias

This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data.

Variance

This part is due to the model’s excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus to overfit the training data.

Irreducible error

This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove outliers).

Increasing a model’s complexity will typically increase its variance and reduce its bias. This is why it is called a tradeoff.
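For squared error, this decomposition can be written out explicitly. Writing f for the true function, ŷ for the model’s prediction, and σ² for the variance of the noise, the standard textbook formulation (stated here as an assumption about squared-error loss) is:

E[(y − ŷ(x))²] = (f(x) − E[ŷ(x)])²  +  E[(ŷ(x) − E[ŷ(x)])²]  +  σ²
               =        Bias²       +        Variance        +  Irreducible error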

Overcoming underfitting

Generally, if your model underfits the data, you can:

  • Increase model complexity (see the sketch after this list)
  • Increase the number of features / Perform Feature Engineering
  • Remove noise from data
  • Increase the number of Epochs or the training time
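As a minimal sketch of the first remedy, here is one way to pick a more complex model by searching over the polynomial degree with cross-validation (the pipeline and grid below are illustrative, reusing the X and y from earlier):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

pipeline = Pipeline([
    ("poly_features", PolynomialFeatures(include_bias=False)),
    ("lin_reg", LinearRegression()),
])

param_grid = {"poly_features__degree": [1, 2, 3, 5, 10]}  # candidate model complexities
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_squared_error")
grid_search.fit(X, y.ravel())

print(grid_search.best_params_)  # for the quadratic data above, degree 2 should win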

In the next part we’ll talk extensively about how to overcome overfitting and underfitting! Thanks for reading 🎉
