Understanding Linear and Polynomial Regression in Few Steps

Snigdha Sen
Published in The Startup · 6 min read · Oct 6, 2020

Artificial Intelligence (AI) and Machine Learning (ML) are recent buzzwords. The popularity and robustness of ML lie in its many algorithms, each designed for a particular data-analysis task. In this blog I try to explain linear and polynomial regression, two popular ML algorithms, in just a few steps.

Ways to execute an ML program

  • Download and install Anaconda Navigator and use Jupyter Notebook.
  • Write your program in Notepad, TextPad, or any text editor, save it with a .py extension, and run it from the command prompt with python <filename>.py.
  • Use Google Colaboratory, Google's cloud service, which offers GPUs (Graphics Processing Units), TPUs (Tensor Processing Units), and a Jupyter-Notebook-like interface for running ML programs.

Broadly, the major tasks in data science fall into two categories. Figure 1 illustrates the difference between the two.

Figure 1: Difference between regression and classification problem

Linear Regression

The term linear implies a straight line, without much complexity. Linear regression is one of the simplest and easiest algorithms in machine learning. Whenever you need to predict a future value, a regression algorithm is a natural first choice.

Need for linear regression:

Figure 2

From Figure 2, it is obvious that price is determined based on the area in square feet. Here we want to predict the estimated price for a 2400 sq ft area. To handle this kind of problem we choose a regression model, which tries to capture the relationship between the independent and dependent variables.

The variable we are predicting is termed Y (target/dependent variable): price in this case.

The variable we use as input is termed X (input/independent variable): area in this case.

Figure 3: Linear Regression

Figure 3 illustrates the equation of a straight line and the meaning of the independent and dependent variables. Error is calculated as the difference between the actual value and the predicted value. The objective of linear regression is to find the straight line that best fits the data points scattered on the X-Y plane.
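As a quick illustration of this idea, the sketch below fits a straight line to a few made-up area/price points with NumPy's polyfit and then predicts the price for 2400 sq ft. The numbers are hypothetical, not taken from Figure 2:

```python
import numpy as np

# Hypothetical data: area in sq ft (X) and price (y); not the figure's actual values
area = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
price = np.array([200.0, 290.0, 410.0, 500.0, 590.0])

# Fit y = a*x + b by least squares (a degree-1 polynomial)
a, b = np.polyfit(area, price, deg=1)

# Errors are the differences between actual and predicted values
errors = price - (a * area + b)

# Predict the price for a 2400 sq ft house
predicted = a * 2400 + b
print(predicted)
```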

Solved Example

Sample problem

Solution: For the least-squares line y = ax + b, the slope a and intercept b are given by a = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) and b = (Σy − aΣx) / n. From these we find the values of a and b, and finally the equation of the straight line.

To solve this problem easily, we will create a table and calculate the following values.

Solution to Linear regression problem

Putting these values into the above equations, we get a = 0.9 and b = 2.2.

a) Now we have the least-squares regression line y = 0.9x + 2.2.

b) Substitute x by 10 to find the value of the corresponding y.
y = 0.9 * 10 + 2.2 = 11.2
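The arithmetic above can be sketched in a few lines of Python. The helper implements the standard least-squares formulas for a and b; the data points fed to it below are made-up values chosen to lie exactly on y = 0.9x + 2.2, not the original table (which appears only as an image):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b using the closed-form formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)  # slope
    b = (sum_y - a * sum_x) / n                                   # intercept
    return a, b

# Hypothetical points lying on y = 0.9x + 2.2
a, b = fit_line([0, 1, 2, 3], [2.2, 3.1, 4.0, 4.9])
print(a, b)          # recovers the slope and intercept
print(a * 10 + b)    # prediction at x = 10
```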

Polynomial Regression

When the data samples cannot be fitted by a straight line with linear regression, that is, when the dataset is more complex and the relationships between data points are non-linear, we need to look towards polynomial regression.

x is the input, whereas y is the output.

To represent or capture complex relationships between data samples, we take powers of the input features.
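For example, with degree 3 each input x is expanded into the features x, x², x³. The article's sample program does this with scikit-learn's PolynomialFeatures; the sketch below builds the same design matrix by hand with NumPy, on made-up data generated from a known cubic rule:

```python
import numpy as np

# Made-up data generated from a known cubic: y = x^3 - 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x**3 - 2*x + 1

# Design matrix with columns [1, x, x^2, x^3]: degree-3 polynomial features
X_poly = np.vander(x, N=4, increasing=True)

# Ordinary least squares on the expanded features is polynomial regression
coef, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(coef)  # recovers roughly [1, -2, 0, 1]
```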

Sample Program

Here I give two sample programs, one for linear regression and one for polynomial regression. The dataset used is a COVID-19 dataset for Karnataka state, collected from Kaggle, where the number of days is X (independent variable) and confirmed cases is Y (dependent variable). The objective of the program is to predict the number of confirmed cases after a certain number of days.

#Linear Regression model on Covid data set

#importing python libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

#Read dataset
dataset = pd.read_csv("/content/drive/My Drive/Workshop_ML_GAT/Karnataka.csv")

#describing dataset
dataset.describe()

X = dataset['Day'].values.reshape(-1, 1)  #reshape() converts from 1-D to 2-D
y = dataset['Confirmed'].values.reshape(-1, 1)

#Calling linear regression function
regressor = LinearRegression()

#Training model using fit()
regressor.fit(X, y)

#Predicting with the model
y_pred = regressor.predict(X)

#Visualizing the output
plt.scatter(X, y, color='magenta')
plt.plot(X, y_pred, color='green')
plt.title('Daywise confirmed cases in Karnataka')
plt.xlabel('No of Days')
plt.ylabel('Confirmed cases')
plt.show()

#Calculating error metrics
rmse = np.sqrt(mean_squared_error(y, y_pred))
r2 = r2_score(y, y_pred)
print('RMSE is ' + str(rmse))
print('r2 is ' + str(r2))
Linear Regression Model

From the above plot it is clearly visible that linear regression is not suitable for this dataset, so the need for polynomial regression arises.

#Polynomial Regression model on Covid data set

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression  #needed for the fit below
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

dataset = pd.read_csv("/content/drive/My Drive/Workshop_ML_GAT/Karnataka.csv")
dataset.describe()  #describing dataset

X = dataset['Day'].values.reshape(-1, 1)
y = dataset['Confirmed'].values.reshape(-1, 1)

#Converting input features into their higher-order terms
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)

regressor = LinearRegression()
regressor.fit(X_poly, y)
y_poly_pred = regressor.predict(X_poly)

#Visualizing output
plt.scatter(X, y, color='magenta')
plt.plot(X, y_poly_pred, color='green')
plt.title('Daywise confirmed cases in Karnataka')
plt.xlabel('No of Days')
plt.ylabel('Confirmed Cases')
plt.show()

rmse = np.sqrt(mean_squared_error(y, y_poly_pred))
r2 = r2_score(y, y_poly_pred)
print('RMSE is ' + str(rmse))
print('r2 is ' + str(r2))
Polynomial Regression Model

Comparing the two outputs, in the case of polynomial regression the RMSE has reduced drastically while the r2 score has increased, which indicates that polynomial regression fits this dataset reasonably well.

Regression model performance evaluation metrics: the most commonly used metrics are

Mean Squared Error (MSE)
Mean Absolute Error (MAE)
Root Mean Squared Error (RMSE)

In addition, the r2 score is also considered an important metric; its best possible score is 1.0, but negative values are also possible.
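Each of these metrics is only a few lines of NumPy. The sketch below mirrors what the sklearn.metrics functions used in the sample programs compute (the function names here are my own):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of squared differences
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: average of absolute differences
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Root Mean Squared Error: square root of MSE
    return np.sqrt(mse(y_true, y_pred))

def r2(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares); best score is 1.0,
    # and a model worse than predicting the mean gives a negative value
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```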

Application of Regression in Real Life

• What will the stock price be next year?

• What will the temperature in Bangalore be tomorrow?

• What will my salary be after 20 years?

• How many confirmed Covid cases will Bangalore have on 25th October?

Conclusion: Although linear regression is simple and easy to understand, in practice datasets are often not linear enough to be predicted well with linear regression. Overfitting of the model can be controlled to some extent through regularization techniques. Although Artificial Intelligence and Machine Learning are trending all over the world, people should consider whether ML is actually required for their work before applying it, since some data analysis can be done without ML too.

Thanks for reading:)


I am currently pursuing a PhD in Machine Learning and Big Data Analytics from IIIT, Allahabad. I am a Data Science enthusiast.