Polynomial Regression in Python

Shuvrajyoti Debroy
Feb 5, 2023 (6 min read)

Machine Learning Regression Algorithm

Introduction

Polynomial Regression is a type of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. It is an extension of linear regression, which only models linear relationships between variables.

In Polynomial Regression, the relationship between x and y is represented by a polynomial equation, which can capture more complex relationships, including curvature and other non-linear patterns. The degree of the polynomial determines how complex a relationship the model can represent. The simplest non-linear case is a second-degree polynomial, or quadratic equation, which models the relationship as a parabolic curve.

Equation of Polynomial Regression

The general form of a Polynomial Regression equation is:

y = β0 + β1x + β2x^2 + β3x^3 + … + βnx^n

where y is the dependent variable, x is the independent variable, β0, β1, β2, β3, …, βn are the coefficients of the polynomial equation, and n is the degree of the polynomial.

The coefficients β0, β1, β2, β3, …, βn are estimated from the data using regression analysis methods such as least squares or maximum likelihood. The goal is to find the values of these coefficients that provide the best fit to the observed data. The resulting polynomial equation can then be used to make predictions for new values of x.
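
As a minimal sketch of least-squares estimation, the snippet below recovers the coefficients of a known quadratic with NumPy. The data is synthetic and noise-free, and the names `x_demo`, `y_demo`, and `A` are illustrative assumptions rather than part of the salary example.

```python
import numpy as np

# Synthetic, noise-free data from a known quadratic (illustrative only)
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = 1.0 + 2.0 * x_demo + 3.0 * x_demo**2  # beta0 = 1, beta1 = 2, beta2 = 3

# Design matrix with columns [1, x, x^2]
A = np.vander(x_demo, N=3, increasing=True)

# Ordinary least squares: choose beta to minimise ||A @ beta - y||^2
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
print(np.round(beta, 6))
```

Because the data contains no noise, the estimated coefficients match the ones used to generate it, up to floating-point error.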

Implement Polynomial Regression in Python

To perform Polynomial Regression, the data is first plotted and analyzed to determine the best-fitting polynomial equation. The polynomial regression model is then trained by adjusting the coefficients of the polynomial terms to minimize the difference between the observed and predicted values of the dependent variable. The resulting equation can then be used to make predictions for new data.

In this example, we will use the Position_Salaries dataset, which records the position and salary of employees. It has three columns: Position, Level, and Salary.

Step 1: Import the required python packages

We need Pandas for data manipulation, NumPy for mathematical calculations, and Matplotlib and Seaborn for visualizations. The scikit-learn (sklearn) modules are used for the machine learning operations.

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

Step 2: Load the dataset

Download the dataset from here, upload it to your notebook, and read it into a pandas DataFrame.

# Get dataset
df_sal = pd.read_csv('/content/Position_Salaries.csv')
df_sal.head()

Step 3: Data analysis

Now that we have our data ready, let’s analyze it and understand its trend in detail. We start with summary statistics:

# Describe data
df_sal.describe()

Here, we can see that Salary ranges from 45,000 to 1,000,000, with a median of 130,000.

We can also see how the data is distributed visually using Seaborn's histplot (distplot, which older tutorials use, is deprecated in recent Seaborn releases).

# Data distribution
plt.title('Salary Distribution Plot')
sns.histplot(df_sal['Salary'], kde=True)
plt.show()

With kde=True, the plot overlays a smoothed density line on a histogram, showing how the Salary values are distributed.

Then we check the relationship between Salary and Level:

# Relationship between Salary and Level
plt.scatter(df_sal['Level'], df_sal['Salary'], color = 'lightcoral')
plt.title('Salary vs Level')
plt.xlabel('Level')
plt.ylabel('Salary')
plt.box(False)
plt.show()

It is now clearly visible that Salary follows a curve rather than a straight line: it rises faster and faster as Level increases, which is the kind of non-linear trend a polynomial can approximate.

Step 4: Split the dataset into dependent/independent variables

Level (X) is the independent variable
Salary (y) is the dependent variable
The Position column is left out because Level already encodes it numerically.

# Splitting variables
X = df_sal.iloc[:, 1:-1].values # independent
y = df_sal.iloc[:, -1].values # dependent
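
The two iloc calls return arrays of different shapes, which matters because scikit-learn expects a 2-D feature matrix and a 1-D target. A small sketch, using a hypothetical three-row DataFrame `df_demo` in the same column layout (these rows are made up for illustration and are not the real dataset):

```python
import pandas as pd

# Hypothetical rows in the same three-column layout (not the real dataset)
df_demo = pd.DataFrame({
    'Position': ['Analyst', 'Manager', 'Director'],
    'Level': [1, 2, 3],
    'Salary': [40000, 60000, 90000],
})

X_demo = df_demo.iloc[:, 1:-1].values  # column slice -> 2-D array, shape (3, 1)
y_demo = df_demo.iloc[:, -1].values    # single column -> 1-D array, shape (3,)

print(X_demo.shape, y_demo.shape)
```

Slicing with 1:-1 keeps the result two-dimensional even though only one column (Level) is selected.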

Step 5: Train the regression model

We are going to train the Polynomial Regression model along with the Linear Regression model to compare both results in the end. Pass the X and y data into the regressor models.

# Train linear regression model on whole dataset
lr = LinearRegression()
lr.fit(X, y)

# Train polynomial regression model on the whole dataset
pr = PolynomialFeatures(degree = 4)
X_poly = pr.fit_transform(X)
lr_2 = LinearRegression()
lr_2.fit(X_poly, y)
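
To see what PolynomialFeatures does before the linear model is fitted, here is a small sketch on two illustrative values. The names `X_small`, `poly_demo`, and `X_expanded` are assumptions made for this example only.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_small = np.array([[2], [3]])  # two illustrative Level values

poly_demo = PolynomialFeatures(degree=4)
X_expanded = poly_demo.fit_transform(X_small)

# Each row becomes [1, x, x^2, x^3, x^4]
print(X_expanded)
```

The polynomial model is therefore still an ordinary linear regression, only fitted on these expanded columns instead of on x alone.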

Step 6: Predict the result

Here comes the interesting part: with both models trained, we can predict y (Salary) for any X (Level) using each model's predict method.

# Predict results
y_pred_lr = lr.predict(X) # Linear Regression
y_pred_poly = lr_2.predict(X_poly) # Polynomial Regression
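
One way to put a number on how well each model fits is the R² score. The sketch below is not part of the original walkthrough: it uses synthetic cubic data and hypothetical names (`X_s`, `y_s`, `lin`, `lin_poly`) so that it runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: levels 1..10 with a cubic response (illustrative only)
X_s = np.arange(1, 11).reshape(-1, 1)
y_s = (X_s.ravel() ** 3).astype(float)

# Straight-line fit
lin = LinearRegression().fit(X_s, y_s)

# Degree-4 polynomial fit
poly_s = PolynomialFeatures(degree=4)
X_s_poly = poly_s.fit_transform(X_s)
lin_poly = LinearRegression().fit(X_s_poly, y_s)

r2_lin = r2_score(y_s, lin.predict(X_s))
r2_poly = r2_score(y_s, lin_poly.predict(X_s_poly))
print(f'R2 linear: {r2_lin:.4f}, R2 polynomial: {r2_poly:.4f}')
```

Keep in mind that R² measured on the training data always favours the more flexible model, so it shows quality of fit rather than how well the model generalizes.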

Visualize predictions

It's time to check our predicted results by plotting graphs

  • Prediction with Linear Regression
    First, we plot the actual data together with the regression line predicted by Linear Regression.
# Visualize real data with linear regression
plt.scatter(X, y, color = 'lightcoral', label = 'Actual data')
plt.plot(X, lr.predict(X), color = 'firebrick', label = 'Linear Regression prediction')
plt.title('Real data (Linear Regression)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
# Labels are attached to each artist above, so the legend cannot mix them up
plt.legend(title = 'Salary/Level', loc='best', facecolor='white')
plt.box(False)
plt.show()
  • Prediction with Polynomial Regression
    Then we plot the same data against the values predicted by Polynomial Regression and get a curve that follows the actual data points closely.
# Visualize real data with polynomial regression
# Evaluate the model on a fine grid of Levels so the curve is drawn smoothly
X_grid = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
plt.scatter(X, y, color = 'lightcoral', label = 'Actual data')
plt.plot(X_grid, lr_2.predict(pr.transform(X_grid)), color = 'firebrick', label = 'Polynomial Regression prediction')
plt.title('Real data (Polynomial Regression)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.legend(title = 'Salary/Level', loc='best', facecolor='white')
plt.box(False)
plt.show()

Example

Let’s check with an example of Level 7.5 and see what Salary our models predict and how accurate it is.

# Predict a new result with linear regression
print(f'Linear Regression result : {lr.predict([[7.5]])}')

# Predict a new result with polynomial regression
# Use transform (not fit_transform): the feature expansion was already fitted in Step 5
print(f'Polynomial Regression result : {lr_2.predict(pr.transform([[7.5]]))}')

It is now clear that, for non-linear data like this, our Polynomial Regression model predicts the result more accurately than Linear Regression.
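
As an alternative to calling the feature transformer by hand before every prediction, scikit-learn's make_pipeline can chain the expansion and the regression into one estimator. This is a sketch on synthetic quadratic data, with hypothetical names (`X_q`, `y_q`, `model`), not the code used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data from y = 2x^2 + 1 (illustrative only)
X_q = np.arange(1, 11).reshape(-1, 1)
y_q = 2.0 * X_q.ravel() ** 2 + 1.0

# The pipeline expands the features and fits the linear model in one step
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_q, y_q)

# New inputs are expanded automatically before prediction
pred = model.predict([[7.5]])
print(pred)
```

With a pipeline there is no separate transformer to remember at prediction time, which removes the risk of accidentally refitting it on new inputs.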

Full Code at GitHub

You can get the full code in my GitHub repository

Conclusion

In conclusion, polynomial regression is a useful tool for modeling complex relationships between variables. It can provide more accurate predictions than linear regression for certain types of data.

One of the main advantages of polynomial regression is that it can fit non-linear relationships between variables, which can provide a more accurate representation of the underlying relationship in the data. However, a high-degree polynomial can also lead to overfitting, where the model fits the noise in the data instead of the underlying relationship. This can result in poor predictions for new data. To mitigate this issue, techniques such as cross-validation, regularization, and model selection can be used to choose an appropriate degree for the polynomial.
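
A minimal sketch of choosing the degree by cross-validation follows. It assumes synthetic noisy quadratic data and 5-fold splitting; the names `X_cv`, `y_cv`, `cv_mse`, and `best_degree` are illustrative and not part of the salary example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy quadratic data (illustrative only)
rng = np.random.default_rng(0)
X_cv = rng.uniform(-5, 5, size=(60, 1))
y_cv = 0.5 * X_cv.ravel() ** 2 - X_cv.ravel() + 2.0 + rng.normal(0.0, 1.0, size=60)

folds = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mse = {}
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    # scikit-learn reports negated MSE, so flip the sign back
    scores = cross_val_score(model, X_cv, y_cv, cv=folds,
                             scoring='neg_mean_squared_error')
    cv_mse[degree] = -scores.mean()

# Pick the degree with the lowest average held-out error
best_degree = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print('Best degree:', best_degree)
```

Because each score is computed on data the model did not see during fitting, an overly flexible polynomial is penalized instead of rewarded, unlike an error measured on the training set.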

I hope this post served as a good introduction to Polynomial Regression.
