Linear Regression (Simple, Multiple and Polynomial)
Linear regression is a model that builds a relationship between a dependent variable and one or more independent variables. It can be Simple, Multiple, or Polynomial. In Simple Linear Regression we have just one independent variable, while in Multiple Linear Regression there are two or more.
A simple linear regression has the following equation:
Yhat = a + bX
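As a quick, self-contained sketch (toy numbers, not the car data), the intercept a and slope b of this equation can be recovered with NumPy's polyfit:

```python
import numpy as np

# Toy data that follows Yhat = a + bX exactly, with a = 2 and b = 3
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2 + 3 * X

# np.polyfit with degree 1 returns [slope, intercept]
b, a = np.polyfit(X, Y, 1)
Yhat = a + b * X
print(a, b)
```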
In data science, linear regression is one of the most commonly used models for prediction.
As an example, let's try to predict the price of a car using linear regression.
Let's start by importing the libraries we need.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print('Imported pandas, numpy and matplotlib')
Out:
Imported pandas, numpy and matplotlib
Pandas and NumPy will handle the data and the numerical work, while matplotlib will be used for plotting.
Let's read our data and visualize it.
# path of data
path = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'
df = pd.read_csv(path)
df.head()
df.head() shows the first 5 rows of every column. We can use df.tail() to get the last 5 rows, and df.head(10) to get the first 10 rows.
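A tiny illustration of head() and tail() on a throwaway DataFrame (the column name here is made up):

```python
import pandas as pd

# 12-row toy frame, just to show the slicing behaviour
toy = pd.DataFrame({'price': range(12)})

print(toy.head())    # first 5 rows (rows 0-4)
print(toy.tail())    # last 5 rows (rows 7-11)
print(toy.head(10))  # first 10 rows
```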
The data describes cars, and our task is to predict the price.
We will start with Simple Linear Regression, which uses a single independent variable to predict the value of the dependent variable. Here the independent variable can be any numeric column, while the predicted variable is price.
Simple Linear Regression: Yhat = a + bX
Here, a is the intercept (intercept_) and b is the slope (coef_).
Let's load the modules for linear regression
from sklearn.linear_model import LinearRegression

# Create a linear regression object
lm = LinearRegression()
lm
We will take highway-mpg to check how it affects the price of the car.
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X,Y)
Out:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
We will predict the price of the first 5 cars.
Yhat=lm.predict(X)
Yhat[0:5]
Out:
array([16236.50464347, 16236.50464347, 17058.23802179, 13771.3045085 ,
20345.17153508])
Let's find the values of the intercept (intercept_) and slope (coef_).
lm.intercept_
Out:
38423.305858157386
lm.coef_
Out:
array([-821.73337832])
Putting this into the equation:
price = 38423.31 - 821.73 x highway-mpg
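We can verify that predict() really is just intercept_ + coef_ * X. Here is a sketch on synthetic numbers that mimic the fitted equation above (the data is generated, not the real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the highway-mpg/price relationship
rng = np.random.default_rng(0)
X = rng.uniform(15, 50, size=(100, 1))
Y = 38423.31 - 821.73 * X[:, 0] + rng.normal(0, 500, size=100)

lm = LinearRegression().fit(X, Y)

# Yhat = a + bX, with a = intercept_ and b = coef_[0]
manual = lm.intercept_ + lm.coef_[0] * X[:, 0]
print(np.allclose(manual, lm.predict(X)))  # the two agree
```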
Now let's check how well the predicted values match the actual ones. A plot will make the model's fit clear.
import seaborn as sns
plt.figure(figsize=(5, 7))
ax = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")
sns.distplot(Yhat, hist=False, color="b", label="Fitted Values" , ax=ax)
plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')
plt.show()
plt.close()
The above graph shows the model is not a great fit.
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
Let's try linear regression with another variable, city-mpg.
lm1 = LinearRegression()
X1 = df[['city-mpg']]
Y1 = df['price']
lm1.fit(X1,Y1)
Yhat1=lm1.predict(X1)
Yhat1[0:5]
Out:
array([16757.08312743, 16757.08312743, 18455.98957651, 14208.72345381,
       19305.44280105])
# Intercept value and coef value
lm1.intercept_
Out:
34595.600842778265
lm1.coef_
Out:
array([-849.45322454])
Let's see the graph of the values
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="city-mpg", y="price", data=df)
plt.ylim(0,)
Now we have both the actual and the predicted values. Let's plot them together to see how closely they match.
import seaborn as sns
plt.figure(figsize=(5, 7))
ax1 = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")
sns.distplot(Yhat1, hist=False, color="b", label="Fitted Values" , ax=ax1)
plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')
plt.show()
plt.close()
The above graph shows that city-mpg and highway-mpg give almost identical results.
Let's check which variable is most strongly correlated with the price.
df[["city-mpg","horsepower","highway-mpg","price"]].corr()
As per the table, horsepower is the most strongly correlated with price. Let's try our model with horsepower.
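A minimal sketch of using .corr() to pick the feature most strongly related to the target (toy data, made-up column names):

```python
import pandas as pd

# 'a' rises perfectly with 'target'; 'b' falls with it, less cleanly
toy = pd.DataFrame({'a': [1, 2, 3, 4],
                    'b': [8, 6, 5, 2],
                    'target': [10, 20, 30, 40]})

corr = toy.corr()
# Strength of a relationship is the absolute correlation
best = corr['target'].drop('target').abs().idxmax()
print(best)
```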
lm2 = LinearRegression()
lm2
Out
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
X2 = df[['horsepower']]
Y2 = df['price']
lm2.fit(X2,Y2)
Yhat2=lm2.predict(X2)
Yhat2[0:5]
Out:
array([14514.76823442, 14514.76823442, 21918.64247666, 12965.1201372 ,
       15203.50072207])
lm2.coef_
Out:
array([172.18312191])
lm2.intercept_
Out:
-4597.558297892912
Let's plot the regression line to visualize the relationship with price.
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="horsepower", y="price", data=df)
plt.ylim(0,)
import seaborn as sns
plt.figure(figsize=(5, 7))
ax2 = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")
sns.distplot(Yhat2, hist=False, color="b", label="Fitted Values" , ax=ax2)
plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')
plt.show()
plt.close()
The above graph shows that horsepower has a stronger correlation with price.
In real-life examples there will be multiple factors that can influence the price, such as the age of the vehicle, its mileage, and so on. In that case the price depends on more than one factor.
Multiple Linear Regression is similar to Simple Linear Regression, except that several independent variables are used to predict the final result: Yhat = a + b1X1 + b2X2 + b3X3 + b4X4 + ...
Let's take the following data to consider the final price
Horsepower
Curb-weight
Engine-size
Highway-mpg
peak-rpm
city-L/100km
In Simple Linear Regression we used one factor; here we have six.
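Before running this on all six features, here is a minimal sketch of multiple linear regression on toy data with two made-up factors, showing that the model learns one coefficient per feature plus a single intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy target built as y = 5 + 2*x1 - 3*x2 (no noise)
rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 2))
y = 5 + 2 * Z[:, 0] - 3 * Z[:, 1]

mlr = LinearRegression().fit(Z, y)
print(mlr.intercept_)  # ~5
print(mlr.coef_)       # ~[2, -3]
```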
Z1 = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg', 'peak-rpm', 'city-L/100km']]

First, a residual plot of the simple highway-mpg model:
plt.figure(figsize=(7, 5))
sns.residplot(x=df['highway-mpg'], y=df['price'])
plt.show()
# Create the linear regression object and fit it
lm_multi = LinearRegression()
lm_multi.fit(Z1, df['price'])
Let's get the coef and intercept value
lm_multi.coef_
Out:
array([3.75013913e-01, 5.74003541e+00, 9.17662742e+01, 3.70350151e+02,
       1.58733026e+00, 1.32242578e+03])
lm_multi.intercept_
Out
-45782.76360478182
Let's predict the prices.
Y_hat = lm_multi.predict(Z1)
Y_hat[0:5]
Out:
array([13548.76833369, 13548.76833369, 18349.65620071, 10462.04778866,
       17093.45921256])
Let's get the graph between our predicted value and actual value.
plt.figure(figsize=(5,7))
ax_multi = sns.distplot(df['price'], hist=True, color="r", label="Actual Value")
sns.distplot(Y_hat, hist=True, color="b", label="Fitted Values" , ax=ax_multi)
plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')
plt.show()
plt.close()
Let's calculate the R square of the model
# fit the model
lm_multi.fit(Z1, df['price'])
# Find the R^2
print('The R-square is: ', lm_multi.score(Z1, df['price']))
Out:
The R-square is: 0.8313349042564419
The R-square value ranges from 0 to 1 for a reasonable model, with 1 being a perfect fit (it can even be negative when a model performs worse than simply predicting the mean). In our case, 0.83 is a good prediction, with scope for improvement.
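The score() value is the same R-square that can be computed by hand as 1 - SS_res/SS_tot. A toy check (numbers invented for the example):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                 # 0.975
print(r2_score(y_true, y_pred))  # same value
```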
Graph for the actual and the predicted value.
import seaborn as sns
ax2 = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")
sns.distplot(Y_hat, hist=False, color="b", label="Fitted Values" , ax=ax2)
plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')
plt.show()
plt.close()
The above graph shows the difference between the actual value and the predicted values
Let's try to evaluate the same result with the Polynomial regression model. Polynomial Regression is a form of linear regression in which the relationship between the independent variable x and dependent variable y is modeled as an nth degree polynomial.
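On toy data that is exactly quadratic, polyfit recovers the polynomial's coefficients (highest degree first) and poly1d turns them into a callable model:

```python
import numpy as np

# Exact quadratic: y = 1 + 2x + 3x^2
x = np.linspace(-2, 2, 30)
y = 1 + 2 * x + 3 * x ** 2

coeffs = np.polyfit(x, y, 2)  # [3, 2, 1]: highest power first
model = np.poly1d(coeffs)
print(model(1.0))             # 3 + 2 + 1 = 6
```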
We will use the following function to plot the data:
def PlotPolly(model, independent_variable, dependent_variable, Name):
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ ' + Name)
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')
    plt.show()
    plt.close()
We will assign highway-mpg as x and price as y
x = df['highway-mpg']
y = df['price']
Let’s fit the polynomial using the function polyfit, then use the function poly1d to display the polynomial function.
# Here we use a polynomial of the 4th order (quartic)
f = np.polyfit(x, y, 4)
p = np.poly1d(f)
print(p)
Out:
0.02651 x^4 - 5.17 x^3 + 382 x^2 - 1.267e+04 x + 1.657e+05

PlotPolly(p, x, y, 'highway-mpg')
Since we got a good correlation with horsepower lets try the same here.
x1 = df['horsepower']
y1 = df['price']
f1 = np.polyfit(x1, y1, 4)
p1 = np.poly1d(f1)
print(p1)
PlotPolly(p1, x1, y1, 'horsepower')
Out:
-7.952e-05 x^4 + 0.04269 x^3 - 7.619 x^2 + 698.6 x - 1.632e+04
Let's calculate R2
from sklearn.metrics import r2_score
r_squared = r2_score(y, p(x))
print('The R-square value is: ', r_squared)
Out:
The R-square value is: 0.6748405169870639

r_squared1 = r2_score(y1, p1(x1))
print('The R-square value is: ', r_squared1)

Be careful to evaluate each polynomial on its own variable: calling p(x1) here, i.e. applying the highway-mpg model to the horsepower data, produces a meaningless R-square of about -385107.
The degree-4 polynomial on highway-mpg explains about 67% of the variance in price, still below the 0.83 R-square of Multiple Linear Regression. Among the models we tried, Multiple Linear Regression gives the best fit.
For reference: The output and the code can be checked on https://github.com/adityakumar529/Coursera_Capstone/blob/master/Regression(Linear%2Cmultiple%20and%20Polynomial).ipynb