Linear Regression (Simple, Multiple and Polynomial)

Aditya Kumar
Published in The Startup · 8 min read · May 11, 2020

Linear regression is a model that builds a relationship between a dependent variable and one or more independent variables. It can be simple, multiple, or polynomial. In simple linear regression we have just one independent variable, while in multiple linear regression there are two or more.

A simple linear regression has the following equation:

Yhat = a + bX

A sample graph representing the relation between an independent and dependent variable.
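The equation can be evaluated directly: a is the intercept and b is the slope. A minimal sketch with hypothetical values of a and b (these numbers are purely illustrative):

```python
import numpy as np

# Hypothetical intercept a and slope b, chosen just for illustration
a, b = 2.0, 0.5
X = np.array([1.0, 2.0, 3.0])

# Yhat = a + bX, evaluated element-wise over the inputs
Yhat = a + b * X
print(Yhat)  # [2.5 3.  3.5]
```

Fitting a model amounts to finding the a and b that make Yhat as close as possible to the observed Y.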

In data science, linear regression is one of the most commonly used models for prediction.

As an example, let's try to predict the price of a car using linear regression.

Let's start with importing the libraries needed.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print('Imported pandas, numpy and matplotlib')

Out:

Imported pandas, numpy and matplotlib

Pandas and NumPy will be used for data handling and the mathematical operations, while matplotlib will be used for plotting.

Let's read our data and visualize it.

# path of data 
path = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'
df = pd.read_csv(path)
df.head()

df.head() gives us the first 5 rows of every column. We can use df.tail() to get the last 5 rows and df.head(10) to get the first 10 rows.
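To see how head() and tail() behave, here is a tiny sketch on a made-up frame (the demo DataFrame is an assumption, not the car data):

```python
import pandas as pd

# A small illustrative frame with one column of ten rows
demo = pd.DataFrame({"a": range(10)})

print(demo.head(3))  # first 3 rows
print(demo.tail(2))  # last 2 rows
```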

The data is about cars, and we need to predict the price of a car using the above data.

We will use simple linear regression to get the price. Simple linear regression uses one independent variable to predict the value of the dependent variable. In this case, the independent variable can be any numeric column, while the predicted value should be price.

Simple Linear Regression: Yhat = a + bX

In this case, a is the intercept (intercept_) and b is the slope (coef_).

Let's load the module for linear regression.

from sklearn.linear_model import LinearRegression

# Create a linear regression object
lm = LinearRegression()

We will take highway-mpg to check how it affects the price of the car.

X = df[['highway-mpg']]
Y = df['price']
lm.fit(X,Y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

We will predict the price of the first 5 cars.

Yhat=lm.predict(X)
Yhat[0:5]

Out:

array([16236.50464347, 16236.50464347, 17058.23802179, 13771.3045085 ,
20345.17153508])

Let's find the value of the intercept (intercept_) and the slope (coef_).

lm.intercept_

Out:

38423.305858157386

lm.coef_

Out:

array([-821.73337832])

Putting this into the equation:

price = 38423.31 - 821.73 × highway-mpg
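A prediction from the fitted line is just this arithmetic: intercept plus slope times the input. A minimal sketch on synthetic data (the data here is generated to resemble the fit above and is not the actual dataset):

```python
import numpy as np

# Synthetic stand-in for highway-mpg vs price, built around the fitted line
rng = np.random.default_rng(42)
mpg = rng.uniform(15, 55, 50)
price = 38423.31 - 821.73 * mpg + rng.normal(0, 500, 50)

# Ordinary least squares via polyfit: returns slope b and intercept a
b, a = np.polyfit(mpg, price, 1)

# A prediction is a + b * x — the same arithmetic predict() performs
yhat = a + b * mpg
assert np.allclose(yhat, np.polyval([b, a], mpg))
print(round(a, 1), round(b, 1))
```

The recovered slope is negative, matching the intuition that higher mpg cars tend to be cheaper in this data.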

Now let's check how well the predicted values match the actual values. How the model is performing will be clear from the graph.

import seaborn as sns
plt.figure(figsize=(5, 7))


ax = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")
sns.distplot(Yhat, hist=False, color="b", label="Fitted Values" , ax=ax)


plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')

plt.show()
plt.close()

The above graph shows the model is not a great fit.

width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)

Let's try linear regression with another variable, city-mpg.

lm1 = LinearRegression()
X1 = df[['city-mpg']]
Y1 = df['price']
lm1.fit(X1,Y1)
Yhat1 = lm1.predict(X1)
Yhat1[0:5]

Out:

array([16757.08312743, 16757.08312743, 18455.98957651, 14208.72345381,
       19305.44280105])

# Intercept value and coef value
lm1.intercept_

Out:

34595.600842778265

lm1.coef_

Out:

array([-849.45322454])

Let's see the graph of the values

width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="city-mpg", y="price", data=df)
plt.ylim(0,)

Now we have both values, actual as well as predicted. Let's see how much difference there is between the two by plotting a graph.

import seaborn as sns
plt.figure(figsize=(5, 7))


ax1 = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")
sns.distplot(Yhat1, hist=False, color="b", label="Fitted Values" , ax=ax1)


plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')

plt.show()
plt.close()

The above graph shows city-mpg and highway-mpg give almost similar results.

Let's see which variable is most strongly related to the price.

df[["city-mpg","horsepower","highway-mpg","price"]].corr()
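What corr() computes for each pair of columns is the Pearson correlation: covariance normalized by both standard deviations, ranging from -1 to 1. A minimal sketch on a made-up frame (the demo values are illustrative, not from the car data):

```python
import pandas as pd

# Tiny synthetic frame: price rises by a fixed amount as mpg falls
demo = pd.DataFrame({
    "mpg": [30, 25, 20, 15],
    "price": [8000, 12000, 16000, 20000],
})

# Pearson correlation between the two columns
r = demo["mpg"].corr(demo["price"])
print(round(r, 4))  # -1.0 (perfectly linear, negative relationship)
```

Values near ±1 indicate a strong linear relationship; values near 0 indicate little linear relationship.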

As per the table, horsepower is most strongly correlated with price. Let's try our model with horsepower.

lm2 = LinearRegression()
X2 = df[['horsepower']]
Y2 = df['price']
lm2.fit(X2,Y2)
Yhat2 = lm2.predict(X2)
Yhat2[0:5]

Out:

array([14514.76823442, 14514.76823442, 21918.64247666, 12965.1201372 ,
       15203.50072207])

lm2.coef_

Out:

array([172.18312191])

lm2.intercept_

Out:

-4597.558297892912

Let's plot a graph to visualize the fit.

width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="horsepower", y="price", data=df)
plt.ylim(0,)
plt.figure(figsize=(5, 7))


ax2 = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")
sns.distplot(Yhat2, hist=False, color="b", label="Fitted Values" , ax=ax2)


plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')

plt.show()
plt.close()

The above graph shows horsepower has a stronger correlation with the price.

In real-life examples there will be multiple factors that can influence the price, like the age of the vehicle, its mileage, etc. In such cases the price depends on more than one factor.

Multiple linear regression is similar to simple linear regression, except that more independent factors are used to predict the final result:

Yhat = a + b1X1 + b2X2 + b3X3 + b4X4 + …
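The equation is an intercept plus a dot product between the coefficient vector and the feature vector. A minimal sketch with hypothetical coefficients (a and b here are made up for illustration, not fitted values):

```python
import numpy as np

# Hypothetical intercept a and coefficients b1..b3, purely for illustration
a = -4000.0
b = np.array([50.0, 5.0, 90.0])          # one coefficient per feature
X = np.array([[111.0, 2548.0, 130.0]])   # one car: e.g. horsepower, curb-weight, engine-size

# Yhat = a + b1*X1 + b2*X2 + b3*X3, i.e. intercept plus a dot product
yhat = a + X @ b
print(yhat)  # [25990.]
```

This is exactly what LinearRegression.predict does with its fitted intercept_ and coef_.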

Let's take the following columns to predict the final price:

horsepower

curb-weight

engine-size

highway-mpg

peak-rpm

city-L/100km

In simple linear regression, we took 1 factor but here we have 6.

Z1 = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg','peak-rpm','city-L/100km']]

plt.figure(figsize=(7, 5))
sns.residplot(df['highway-mpg'], df['price'])
plt.show()

Let's create the regression object and fit the model.

# Create the linear regression object
lm_multi = LinearRegression()
lm_multi.fit(Z1, df['price'])

Let's get the coef and intercept value

lm_multi.coef_

Out:

array([3.75013913e-01, 5.74003541e+00, 9.17662742e+01, 3.70350151e+02,
       1.58733026e+00, 1.32242578e+03])

lm_multi.intercept_

Out

-45782.76360478182

Let's predict the prices.

Y_hat = lm_multi.predict(Z1)
Y_hat[0:5]

Out:

array([13548.76833369, 13548.76833369, 18349.65620071, 10462.04778866,
       17093.45921256])

Let's get the graph between our predicted value and actual value.

plt.figure(figsize=(5,7))


ax_multi = sns.distplot(df['price'], hist=True, color="r", label="Actual Value")
sns.distplot(Y_hat, hist=True, color="b", label="Fitted Values" , ax=ax_multi)


plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')

plt.show()
plt.close()

Let's calculate the R square of the model

# Find the R^2 (the model is already fitted above)
print('The R-square is: ', lm_multi.score(Z1, df['price']))

Out:

The R-square is:  0.8313349042564419

The R-square value normally lies between 0 and 1, with 1 being a perfect fit (it can even be negative when a model fits worse than a horizontal line). In our case, 0.83 is a good prediction with scope for improvement.
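R-square is defined as 1 minus the ratio of residual variance to total variance. A minimal sketch on made-up numbers (y and yhat here are illustrative, not the car data):

```python
import numpy as np

# Illustrative actual and predicted values
y = np.array([3.0, 5.0, 7.0, 9.0])
yhat = np.array([3.5, 4.5, 7.5, 8.5])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - yhat) ** 2)        # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)  # 0.95
```

This is the same quantity that lm_multi.score and sklearn's r2_score return.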

Graph for the actual and the predicted value.

import seaborn as sns

ax2 = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")
sns.distplot(Y_hat, hist=False, color="b", label="Fitted Values" , ax=ax2)

plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')

plt.show()
plt.close()

The above graph shows the difference between the actual value and the predicted values

Let's try to evaluate the same with a polynomial regression model. Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial.
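The idea can be seen on noiseless synthetic data: polyfit recovers the coefficients of the generating polynomial exactly. A minimal sketch (the quadratic below is invented for illustration):

```python
import numpy as np

# Synthetic quadratic data: y = 2x^2 - 3x + 1, no noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x**2 - 3 * x + 1

# polyfit returns coefficients from highest degree down; poly1d wraps
# them as a callable polynomial
coeffs = np.polyfit(x, y, 2)
p = np.poly1d(coeffs)

print(np.round(coeffs, 6))
assert np.allclose(p(x), y)
```

On real, noisy data the recovered coefficients only approximate the data, and choosing the degree too high risks overfitting.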

We will use the following function to plot the data:

def PlotPolly(model, independent_variable, dependent_variable, Name):
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ Length')
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()

We will assign highway-mpg as x and price as y

x = df['highway-mpg']
y = df['price']

Let’s fit the polynomial using the function polyfit, then use the function poly1d to display the polynomial function.

# Here we use a polynomial of the 4th order
f = np.polyfit(x, y, 4)
p = np.poly1d(f)
print(p)

Out:

         4        3       2
0.02651 x - 5.17 x + 382 x - 1.267e+04 x + 1.657e+05

PlotPolly(p, x, y, 'highway-mpg')

Since we got a good correlation with horsepower, let's try the same here.

x1 = df['horsepower']
y1 = df['price']
f1 = np.polyfit(x1, y1, 4)
p1 = np.poly1d(f1)
print(p1)
PlotPolly(p1, x1, y1, 'horsepower')

Out:

4           3         2
-7.952e-05 x + 0.04269 x - 7.619 x + 698.6 x - 1.632e+04

Let's calculate the R-square.

from sklearn.metrics import r2_score
r_squared = r2_score(y, p(x))
print('The R-square value is: ', r_squared)

Out:

The R-square value is:  0.6748405169870639

r_squared = r2_score(y1, p(x1))
print('The R-square value is: ', r_squared)

Out:

The R-square value is:  -385107.41247912706

The huge negative R-square is a bug, not a property of the data: p is the polynomial fitted on highway-mpg, so evaluating it on horsepower data gives meaningless predictions. The call should have been r2_score(y1, p1(x1)). With the correct polynomials, the multiple linear regression (R-square ≈ 0.83) remains the best fit among the models we tried, with the highway-mpg polynomial (≈ 0.67) behind it.

For reference: The output and the code can be checked on https://github.com/adityakumar529/Coursera_Capstone/blob/master/Regression(Linear%2Cmultiple%20and%20Polynomial).ipynb

Data Scientist with 6 years of experience. To find out more connect with me on https://www.linkedin.com/in/adityakumar529/