Polynomial Regression — The “curves” of a linear model

Dipanshu Prasad · Published in Analytics Vidhya · 4 min read · Sep 14, 2020
The overlooked truth

The most glamorous part of a data analytics project or report is, as many would agree, the one where the Machine Learning algorithms work their magic on the data. However, one of the most overlooked parts of the process is the preprocessing of the data.

Significantly more effort goes into preparing the data for a model than into tuning the model to fit the data better. One such preprocessing technique that we intend to disentangle is Polynomial Regression.

A deeper dive

The primary assumption of Polynomial Regression is that there may exist a non-linear relationship between the features (independent variables) and the target (dependent variable). It is used when a linear model is unable to capture the trend in the data and gives a poor R² score. In this case, Polynomial Regression increases the model complexity by deriving “new” features from existing ones using their higher powers and combinations.

Polynomial Regression exposes interactions between the features and the target, and interactions among the features themselves, if any. Linear models such as Linear Regression and Logistic Regression can be made much more powerful and expressive this way.

One downside of Polynomial Regression is that it requires a lot of experimentation with its parameters, as there is no hard-and-fast rule for choosing them.

Visualizing Polynomial Regression

Let’s start with some data points distributed normally around the cubic curve y = 3x³ − 2x² + x, and prepare the data for fitting a linear model with polynomial features.
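A minimal sketch of generating such data (the sample size, noise scale, and range of x are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample x and scatter y normally around the cubic 3x³ − 2x² + x
x = rng.uniform(-2, 2, size=(100, 1))                                # assumed range and sample size
y = (3 * x**3 - 2 * x**2 + x).ravel() + rng.normal(scale=2.0, size=100)  # assumed noise scale
```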

At first glance, it seems obvious that a simple linear model would miss the complex cubic trend in the data and result in an underfit model. Hence, a few tweaks are needed, which follow next.

The void between the simple linear model and the complex trend in the data can be filled using the PolynomialFeatures class in sklearn.preprocessing.

Consider a feature matrix X containing three features X1, X2, X3. Creating polynomial features of degree 2 presents us with these new features:

1, X1², X2², X3², X1*X2, X1*X3, X2*X3

These features are then used alongside the original features for prediction, and our linear model learns coefficients for the new features accordingly.
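A minimal sketch of this expansion on a single toy sample (the feature values are arbitrary; get_feature_names_out requires scikit-learn ≥ 1.0):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0, 4.0]])  # one sample with features X1, X2, X3

poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))
# [[ 1.  2.  3.  4.  4.  6.  8.  9. 12. 16.]]
print(poly.get_feature_names_out(["X1", "X2", "X3"]))
# ['1' 'X1' 'X2' 'X3' 'X1^2' 'X1 X2' 'X1 X3' 'X2^2' 'X2 X3' 'X3^2']
```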

sklearn’s PolynomialFeatures has three main parameters:

  • degree: determines the highest power of the new polynomial features
  • include_bias: when set to True, includes a constant term in the set of polynomial features. It is True by default.
  • interaction_only: when set to True, includes only the interaction terms between distinct features, not the higher powers of a single feature. It is False by default.
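A short sketch of how the two flags change the output, assuming the same toy matrix as above:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0, 4.0]])

# include_bias=False drops the constant '1' column
no_bias = PolynomialFeatures(degree=2, include_bias=False)
print(no_bias.fit_transform(X))
# [[ 2.  3.  4.  4.  6.  8.  9. 12. 16.]]

# interaction_only=True keeps X1*X2, X1*X3, X2*X3 but drops the squares
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(inter.fit_transform(X))
# [[ 2.  3.  4.  6.  8. 12.]]
```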

Applying what we’ve learnt so far to the above synthetic data set:
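A sketch of one way to do this, reusing x and y from the data-generation sketch above; make_pipeline chains the feature expansion with the linear fit:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Expand x into [x, x², x³] (plus a bias column), then fit least squares
cubic_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
cubic_model.fit(x, y)
print(cubic_model.score(x, y))  # R² on the training data
```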

We introduce polynomial features of degree 3 to improve the performance, and the model now fits the data beautifully.

The degree of the polynomial features strongly drives the model’s overall complexity. Hence, the value of degree must be chosen through careful experimentation.

Now let’s look at the effects of an improper choice of degree:

Low value of degree: too simple a model (underfit)
High value of degree: too complex a model (overfit)
There is a huge jump in the score between degrees 2 and 3. Going anywhere beyond 3 simply overfits the data: we get a higher training score at the cost of poor generalization.
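One way to see this jump is to sweep degree and compare train and test R²; a sketch reusing x and y from above (the split and the degree range are assumptions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

for degree in range(1, 8):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x_train, y_train)
    print(f"degree={degree}: "
          f"train R²={model.score(x_train, y_train):.3f}, "
          f"test R²={model.score(x_test, y_test):.3f}")
```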

Before using Polynomial Features

As PolynomialFeatures uses existing features, it is important to ensure the correctness of the original features.

Missing values in the data pose a big problem when using PolynomialFeatures: it will throw an error when it encounters missing data. Hence, it is important to handle missing values before applying it.

The strategy of filling the missing data also plays a key role.

Consider the case where all the missing values are replaced by a constant, and that constant is 0. In such a case, every interaction term involving an imputed entry becomes zero (if X1 is imputed to 0, then X1*X2 is 0 regardless of X2), and we get unexpected results.

Hence, proper choice of strategy for handling missing values is very important.

KNN, mean, and median imputation are usually appropriate choices for handling missing values when Polynomial Regression is to be performed next.
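A minimal sketch of these strategies using sklearn’s imputers (the toy matrix is an assumption):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, 6.0]])

# Median imputation: a constant fill that avoids the artificial zeros
# that would wipe out interaction terms
X_median = SimpleImputer(strategy="median").fit_transform(X)

# KNN imputation: fill each gap from the nearest complete neighbours
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```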

Leave a clap if you found it useful!
