Polynomial Regression in Machine Learning: Understanding the Process

Hassane Skikri
6 min read · Feb 3, 2024

Let’s discover how polynomial regression works and how we can implement it. 😊🌟

Polynomial regression | GIF: Towards Data Science

🔍📚 Outline 📚🔍

  • 🌟 Introduction to Polynomial Regression 🌟
  • 🔢 Polynomial Regression Formula 🔢
  • 🚀 The difference between linear and polynomial regression 🚀
  • 📚 Determining the degree of the polynomial 📚
  • 🛠️ Implementation of Polynomial Regression 🛠️
  • 🌍 Application and Examples 🌍
  • 🎯 Conclusion 🎯

🌟 Introduction to Polynomial Regression 🌟

Polynomial regression is a form of linear regression in which the relationship between the input variable (x) and the output variable (y) is expressed as a polynomial. In simpler terms, it fits a curved line instead of a straight line to the data points. This curve represents how y changes as x is raised to different powers, like x, x², x³, and so on. The approach is especially useful when the data shows a pattern that isn’t a straight line, indicating that the relationship between x and y is more complex than increasing or decreasing at a constant rate.

🔻🔢 Polynomial Regression Formula 🔢🔻

Polynomial regression is a statistical technique that models the relationship between a dependent variable y and an independent variable x as an nth-degree polynomial. It’s an extension of linear regression, used when the data shows a non-linear relationship. The model takes the form

y = β₀ + β₁x + β₂x² + … + βₙxⁿ + ε

where β₀, β₁, …, βₙ are the coefficients to be estimated and ε is the error term. Although the curve is non-linear in x, the model is still linear in its coefficients, which is why the ordinary linear regression machinery can fit it.
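
To make the feature expansion concrete, here is a minimal sketch using scikit-learn’s PolynomialFeatures (the same transformer the implementation below relies on). It shows how a single column x becomes the columns 1, x, x², x³ before an ordinary linear model is fitted:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0], [2.0], [3.0]])  # one feature, three samples

poly = PolynomialFeatures(degree=3)  # expand up to x^3
X_poly = poly.fit_transform(x)       # columns: 1, x, x^2, x^3

print(X_poly)
# [[ 1.  1.  1.  1.]
#  [ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]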

🔻 The difference between linear and polynomial regression 🔻


Linear Regression and Polynomial Regression are statistical methods designed to understand the relationship between variables. However, the way they model this relationship is fundamentally different, catering to various patterns observed in data.

Linear Regression

Consider a scenario where you’re analyzing how salaries increase with years of experience. By plotting years of experience (X-axis) against salary (Y-axis), you observe a trend: as experience grows, so does salary. Linear regression captures this by drawing a straight line through the data points. This line suggests a constant rate of salary increase per year of experience, embodying a simple and direct relationship between the two variables.

GIF credit: https://medium.com/@kabab/linear-regression-with-python-d4e10887ca43

Polynomial Regression

Now, imagine a more nuanced scenario where salary growth is not constant: salary rises quickly in the early years of a career, but after a certain point the growth rate begins to slow. This pattern cannot be represented by a straight line. Polynomial regression addresses it by fitting a curved line, such as a parabola, through the data points. This curve better reflects the real-world situation where the growth rate changes over time, with quick increases in the early years followed by a gradual deceleration.

Linear Regression | GIF: Towards Data Science
Polynomial regression | GIF: Towards Data Science
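
To see this contrast in code, here is a minimal sketch on made-up salary data (the numbers are purely illustrative): a degree-1 line and a degree-2 parabola fitted to the same points, compared by their R² scores.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: years of experience vs. salary in $1000s
years = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
salary = np.array([40, 44, 52, 64, 80, 95, 104, 108])

# Degree 1: assumes a constant raise per year
line = LinearRegression().fit(years, salary)

# Degree 2: lets the growth rate itself change over time
poly = PolynomialFeatures(degree=2)
curve = LinearRegression().fit(poly.fit_transform(years), salary)

print('Line R^2: ', line.score(years, salary))
print('Curve R^2:', curve.score(poly.transform(years), salary))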

🔻 Determining the degree of the polynomial 🔻

Choosing the right level of complexity for a curve (or “degree” in math speak) when we’re trying to understand data with polynomial regression is a bit like Goldilocks finding the bed that’s just right. Not too simple, not too complicated, but just perfect.

First off, just look at your data plotted on a graph. Sometimes it’s pretty obvious whether the data looks more like a gentle hill or has lots of ups and downs. This can give you a good starting point.

Use cross-validation techniques to evaluate how well models of different degrees generalize to unseen data. K-fold cross-validation is commonly used, where the data is split into K subsets. The model is trained on K-1 of these subsets and validated on the remaining subset, with the process repeated K times so that each subset is used as the validation set once. The degree that results in the lowest average validation error is typically chosen.

Combined with cross-validation, perform a grid search over a predefined range of polynomial degrees. Evaluate each model using a suitable error metric (such as Mean Squared Error for regression tasks) and select the degree that minimizes this error on the validation set, as sketched below.
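
A minimal sketch of that procedure (the data and the degree range 1–5 here are arbitrary choices for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data for illustration
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 + 3 * X.ravel() - 0.5 * X.ravel() ** 2 + rng.normal(0, 1, 50)

cv_mse = {}
for degree in range(1, 6):
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    # neg_mean_squared_error: higher is better, so negate it back to an MSE
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    cv_mse[degree] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
print('Best degree:', best_degree)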

🔻🛠️ Implementation of Polynomial Regression 🛠️🔻

Implementing polynomial regression involves several key steps, from preparing your data to selecting the right polynomial degree and finally evaluating the model’s performance.

Step 1: Data Preparation

Gather and clean your data, remembering to remove or handle any outliers and missing values. Keep in mind that polynomial regression is very sensitive to outliers.
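
For example, a minimal cleaning sketch with pandas (the DataFrame and column names here are hypothetical):

import pandas as pd

# Hypothetical dataset: an 'experience' feature and a 'salary' target
df = pd.DataFrame({
    'experience': [1, 2, 3, 4, 5, None, 7],
    'salary': [40, 44, 52, 300, 80, 95, 104]  # 300 looks like an outlier
})

df = df.dropna()  # drop rows with missing values

# Flag outliers on the target with the 1.5 * IQR rule
q1, q3 = df['salary'].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df['salary'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]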

Step 2: Selecting the Polynomial Degree

Use visual inspection or cross-validation to choose the right degree for the polynomial, balancing simplicity and accuracy.

Step 3: Model Training

Train a linear regression model on the transformed polynomial features to fit the non-linear relationship (a combined sketch of this step and the next follows Step 4).

Step 4: Model Evaluation

Assess the model’s performance using metrics like R-squared or Mean Squared Error on a validation set.
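
Putting Steps 3 and 4 together, here is a minimal sketch (degree 2 is hard-coded for brevity; in practice it comes from Step 2, and the data is synthetic):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data for illustration
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 60).reshape(-1, 1)
y = 1 + 2 * X.ravel() + 0.8 * X.ravel() ** 2 + rng.normal(0, 2, 60)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 3: transform the features, then fit an ordinary linear model
poly = PolynomialFeatures(degree=2)
model = LinearRegression().fit(poly.fit_transform(X_train), y_train)

# Step 4: evaluate on the held-out validation set
y_pred = model.predict(poly.transform(X_val))
print('MSE:', mean_squared_error(y_val, y_pred))
print('R^2:', r2_score(y_val, y_pred))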

🔻 Application and Examples 🔻

The walk-through below fits a polynomial to a small example dataset: it uses grid search with cross-validation to pick the degree, then evaluates the model and plots the fitted curve.

import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(0, 40)
y = np.array([1, 4, 5, 7, 8, 6, 9, 10, 10, 23, 25, 44, 50, 63, 67, 64, 62, 70,
              75, 88, 90, 92, 95, 100, 108, 135, 151, 160, 169, 172, 173, 176,
              175, 175, 176, 178, 179, 180, 190, 201])
# First, let's plot the original data to see the relationship between X and y

plt.scatter(X, y, color='blue', label='Actual Data')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Scatter Plot of Original Data')
plt.legend()
plt.show()


# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Setting up a pipeline
pipeline = Pipeline([
    ('poly', PolynomialFeatures()),
    ('linear', LinearRegression())
])

# Parameters for grid search

parameters = {'poly__degree': np.arange(1, 5)}

# Grid search with 4-fold cross-validation over the candidate degrees
grid_search = GridSearchCV(pipeline, parameters, cv=4, scoring='neg_mean_squared_error')
grid_search.fit(X_train.reshape(-1, 1), y_train)

best_degree = grid_search.best_params_['poly__degree']


# Re-fit the polynomial transform with the best degree found by the grid search
poly_best = PolynomialFeatures(degree=best_degree)
X_poly_train_best = poly_best.fit_transform(X_train.reshape(-1, 1))
X_poly_test_best = poly_best.transform(X_test.reshape(-1, 1))

model_best = LinearRegression()
model_best.fit(X_poly_train_best, y_train)

# Making predictions with the best model
predictions_train = model_best.predict(X_poly_train_best)
predictions_test = model_best.predict(X_poly_test_best)

# Evaluating the model
train_error = mean_squared_error(y_train, predictions_train)
test_error = mean_squared_error(y_test, predictions_test)
train_accuracy = r2_score(y_train, predictions_train)
test_accuracy = r2_score(y_test, predictions_test)


best_degree, train_error, test_error, train_accuracy, test_accuracy
Scatter plot of original data | Image: Jupyter notebook
(3,
 82.12057919191898,
 93.19555930834686,
 0.9833308707514882,
 0.9716276725394738)

The grid search settles on degree 3, with a training MSE of about 82 and a test MSE of about 93; the R² score is above 0.97 on both splits.

plt.scatter(X_train, y_train, color='blue', label='Training Data')
plt.scatter(X_test, y_test, color='green', label='Test Data')

# Evaluate the fitted curve over the full input range for a smooth line
X_range = np.arange(0, 40).reshape(-1, 1)
X_range_poly = poly_best.transform(X_range)
predictions_range = model_best.predict(X_range_poly)

# Plotting the polynomial regression fit
plt.plot(X_range, predictions_range, color='red', label='Polynomial Regression Fit')

plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression Fit with Training and Test Data')
plt.legend()
plt.show()
Polynomial regression fit | Image: Jupyter notebook

🔻 Conclusion 🔻

Polynomial regression is a versatile tool with applications in diverse domains. When addressing non-linear relationships, it requires careful consideration of overfitting and model complexity.

Finally, if you found this article useful, you can check out my repository on GitHub, which covers the skills you need to become a data scientist: roadmap here

You can download the code here.


Hassane Skikri

A technology enthusiast who likes writing about different technologies, including Python, data science, machine learning, data analysis, Java, and more.