Linear Regression, Binning and Polynomial Linear Regression

Rishabh Jain
4 min read · Dec 9, 2019


Can Linear Regression fit non-linear data?

As we have learnt from the beginning, linear regression is one that fits a straight line of the form y = wx + b to every (x, y) pair in the best possible way.

But what if I show you a non-linear curve fitting the data points below?

Let's find out.

I will not be explaining linear regression itself, since there are already plenty of resources on that. I will run the code and quickly jump to binning and polynomial regression.

!pip install mglearn -q

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import mglearn
from sklearn.linear_model import LinearRegression

Create the Dataset

X, y = mglearn.datasets.make_wave(n_samples = 100)
plt.scatter(X[:, 0], y)
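
Before fitting anything, a quick look at the shapes helps make the later reshapes make sense (a small check of my own, not in the original post):

# X is a 2-D column of inputs, y is the 1-D target
print(X.shape, y.shape)  # (100, 1) (100,)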

So this is how our data looks. Now let's fit a linear model and plot it.

reg = LinearRegression().fit(X, y)

# The data ranges from -3 to 3. Let's create 1000 points which
# will be used for prediction
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
line_predict = reg.predict(line)
plt.plot(X[:, 0], y, 'o')
plt.plot(line, line_predict)
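
To quantify how poor this fit is, a minimal check (my addition, reusing the reg fitted above) is to print the learned slope, intercept, and training R² score:

# inspect the fitted straight line and its goodness of fit
print("w:", reg.coef_, "b:", reg.intercept_)
print("R^2 on training data:", reg.score(X, y))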

Binning

So, as we can see, linear regression doesn't fit the data. In other words, the model wasn't able to understand the data well. What if we transform the data in a way that helps the model understand it better?

Binning means dividing the data into intervals to create bins. We then replace each data value with the index of the bin it falls into. Basically, we are digitizing the data.

from sklearn.preprocessing import OneHotEncoder

# create 10 bins (11 edges from -3 to 3)
bins = np.linspace(-3, 3, 11)
X_binned = np.digitize(X, bins = bins)
# X_binned now has values from 1 to 10
# One hot encode the data
encoder = OneHotEncoder(sparse=False)  # dense output; newer scikit-learn versions use sparse_output=False
encoder.fit(X_binned)
X_binned = encoder.transform(X_binned)
# Lets fit linear model now
reg = LinearRegression().fit(X_binned, y)
# transform the line on which model will predict
line_binned = encoder.transform(np.digitize(line, bins = bins))
#plot
plt.plot(line, reg.predict(line_binned), c = 'r', label = 'linear regression binned')
plt.plot(X[:, 0], y, 'o')
plt.title('Binning')
plt.legend(loc = 'best')

Polynomial Linear Regression

Binning digitizes the data, but this might not be the best fit either. So what do we do? We create features such as X**2, X**3, etc. from X. Let's see what happens.

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=10, include_bias=False)
X_poly = poly.fit_transform(X)
reg = LinearRegression()
reg.fit(X_poly, y)
line_poly = poly.transform(line)
plt.plot(X[:,0], y, 'o')
plt.plot(line, reg.predict(line_poly), c = 'r', label = 'polynomial regression' )
plt.legend(loc = 'best')
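
As a quick sanity check (my addition, not from the original post), the training R² of this model can be compared with the plain linear fit's score printed earlier:

# R^2 of the degree-10 polynomial model on the training data
print("R^2 (polynomial):", reg.score(X_poly, y))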

Inference

So how does a linear model fit non-linear data? And why is it still called linear regression?

Let's write the equation of linear regression once again:

y = w1*x1 + w2*x2 + … + wn*xn + b

(Equation image credits: @manjabogicevic)

As we can see, for multiple/polynomial regression we create new polynomial features from X. From our point of view it is the same data squared, cubed, and so on, but for the model each power is just a new feature.

E.g.: X1 = X, X2 = X**2, X3 = X**3
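
A minimal sketch of this (using a hypothetical input value of 2.0) shows PolynomialFeatures producing exactly these columns:

# transform a single value x = 2.0 into [x, x^2, x^3]
from sklearn.preprocessing import PolynomialFeatures
poly3 = PolynomialFeatures(degree=3, include_bias=False)
print(poly3.fit_transform([[2.0]]))  # [[2. 4. 8.]]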

This way we create n new dimensions. In one dimension, linear regression is a straight line; in two dimensions it is a plane; and as the number of dimensions increases we can no longer plot it and it becomes difficult to imagine. But mathematically it is still linear regression.

The same happens when we do binning: we create 10 new features from each data point by placing it in a bin and one-hot encoding the bin index. In the higher-dimensional space, the regression is able to map the input to the output better.
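
To make this concrete, here is a small trace of a single input through the binning pipeline built above (the value 0.5 is a hypothetical example; bins and encoder are the objects defined earlier):

# a single value falls into exactly one of the 10 bins
x = np.array([[0.5]])
bin_index = np.digitize(x, bins=bins)  # -> [[6]]: 0.5 lies in the 6th bin
print(encoder.transform(bin_index))    # one row, ten columns, a single 1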

So, when we plot the original feature against the output, the non-linear shape of the output makes us feel that it's not linear regression. But it is indeed!

Also, this doesn't mean that linear regression will be able to fit every kind of non-linear data. It can only perform better on such data if it is able to map the input to the output in the higher-dimensional space.

You can find the complete code on my GitHub here.

Let me know your thoughts and corrections if any.

Reference

The book “Introduction to Machine Learning with Python”, one of whose authors is a core contributor to the scikit-learn library. It's a wonderful book for beginners who want to understand machine learning concepts intuitively.

That’s all guys. Happy Learning.
