In my last article https://firstname.lastname@example.org/linear-regression-normally-vs-with-seaborn-fff23c8f58f8 , I gave a brief comparison of implementing linear regression with either sklearn or seaborn. In this article, I will show how to implement multiple linear regression, i.e. when there is more than one explanatory variable.
Let’s directly delve into multiple linear regression using python via Jupyter.
Importing the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt #for plotting purposes
from sklearn.linear_model import LinearRegression #for implementing multiple linear regression
Let’s read the dataset, which contains the daily stock information of Carriage Services, Inc. from Yahoo Finance for the period May 29, 2018 to May 29, 2019.
parse_dates tells pandas to parse the ‘Date’ column into datetime objects rather than leaving it as plain strings.
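A minimal sketch of the read step. The rows below are illustrative stand-ins for the Yahoo Finance export (the real article reads from a CSV file on disk; the values here are made up so the snippet is self-contained):

```python
import io
import pandas as pd

# A few rows in the Yahoo Finance CSV layout (illustrative values, not real quotes)
csv_data = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2018-05-29,25.0,25.5,24.8,25.2,25.2,120300\n"
    "2018-05-30,25.2,25.9,25.1,25.7,25.7,98200\n"
)
# parse_dates=["Date"] parses that column into datetime64 values
df = pd.read_csv(csv_data, parse_dates=["Date"])
print(df.dtypes)
```

With a file on disk, the same call would be `pd.read_csv("carriage_services.csv", parse_dates=["Date"])` (the filename is an assumption, not from the article).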
Let’s see the dataset in short.
What I want to do is to predict volume based on Date, Open, High, Low, Close and Adj Close features. Therefore, I have:
Independent Variables: Date, Open, High, Low, Close, Adj Close
Dependent Variables: Volume (To be predicted)
All variables are in numerical format except ‘Date’, which is a string. Since linear regression doesn’t work directly on date data, we need to convert the dates into numerical values. Let’s do that.
import datetime as ddt
Now we have a new dataset where the ‘Date’ column has been converted into numerical format.
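The conversion step itself can be done in several ways; one common approach (a sketch, not necessarily the article’s exact code) maps each date to its proleptic Gregorian ordinal, i.e. a plain integer day count:

```python
import datetime as ddt
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2018-05-29", "2018-05-30"])})
# Map each datetime to its ordinal day number (an integer); consecutive
# trading days differ by 1 here because the sample dates are adjacent
df["Date"] = df["Date"].map(ddt.datetime.toordinal)
print(df["Date"])
```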
Now, we can segregate the data into two components, X and Y, where X holds the independent variables and Y the dependent variable.
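The split itself is a simple column selection. A self-contained sketch (the two rows are illustrative values, not the article’s data):

```python
import pandas as pd

# Illustrative rows; in the article these come from the CSV read earlier
df = pd.DataFrame({
    "Date": [736843, 736844],
    "Open": [25.0, 25.2], "High": [25.5, 25.9],
    "Low": [24.8, 25.1], "Close": [25.2, 25.7],
    "Adj Close": [25.2, 25.7], "Volume": [120300, 98200],
})
X = df[["Date", "Open", "High", "Low", "Close", "Adj Close"]]  # independent variables
Y = df["Volume"]  # dependent variable (to be predicted)
```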
Finally, we have created two variables. Now, it’s time to perform Linear regression.
reg = LinearRegression() #instantiating LinearRegression
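Fitting is a single call to reg.fit(X, Y). A runnable sketch with synthetic stand-in data (random numbers, not the stock dataset) showing the fit and the attributes we inspect next:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((10, 6))                 # 10 rows, 6 features (stand-ins for Date..Adj Close)
Y = rng.integers(50_000, 200_000, 10)   # stand-in for Volume

reg = LinearRegression()
reg.fit(X, Y)                           # estimate intercept b0 and coefficients b1..b6
print(reg.intercept_)                   # scalar intercept
print(reg.coef_)                        # array of six coefficients, one per feature
```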
Now, let’s find the intercept (b0) and coefficients (b1, b2, … bn).
Note: there is only one intercept, but the number of coefficients depends on the number of independent variables. Since we have six independent variables, we will have six coefficients.
So, when we print the intercept on the command line, it shows 247271983.66429374. This is the y-intercept, i.e. the predicted value when all features are 0. Similarly, when we print the coefficients, they are given as an array.
Output: array([ -335.18533165, -65074.710619 , 215821.28061436,
-169032.31885477, -186620.30386934, 196503.71526234])
Hence, our regression equation becomes:
Y = b0 + b1·x1 + b2·x2 + b3·x3 + b4·x4 + b5·x5 + b6·x6
where x1, x2, x3, x4, x5, x6 are the feature values used for prediction, in column order. For e.g.: x1 is for Date, x2 is for Open, x4 is for Low, x6 is for Adj Close …
That’s it. We have completed our multiple linear regression model.
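To check the equation above against sklearn, a quick self-contained sketch (synthetic numbers, not the article’s data) confirms that reg.predict is exactly the intercept plus the dot product of the coefficients with the feature row:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.random((10, 6))
Y = rng.random(10)
reg = LinearRegression().fit(X, Y)

x_new = rng.random((1, 6))   # one new row: [date, open, high, low, close, adj close]
y_hat = reg.predict(x_new)   # same as b0 + b1*x1 + ... + b6*x6
assert np.isclose(y_hat[0], reg.intercept_ + reg.coef_ @ x_new[0])
```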
If we want more detail, we can perform the multiple linear regression analysis using statsmodels. Statsmodels is a Python module that provides classes and functions for estimating many different statistical models, as well as for conducting statistical tests.
First of all, let’s import the package.
import statsmodels.api as ssm #for a detailed description of linear coefficients, intercepts, deviations, and much more
Let’s work on it.
X = ssm.add_constant(X) #to add a constant (intercept) term to the model
model = ssm.OLS(Y, X).fit() #fitting the model
predictions = model.summary() #summary of the model
predictions
When I print predictions, it shows the OLS regression results summary table.
From that summary, we can see that the coefficient and intercept values we found earlier are consistent with the statsmodels output.
And that finishes our work. We have successfully implemented a multiple linear regression model using both sklearn.linear_model and statsmodels.