Multiple Linear Regression: Sklearn and Statsmodels

Subarna Lamsal

In my last article, I gave a brief comparison of implementing linear regression using either sklearn or seaborn. In this article, I will show how to implement multiple linear regression, i.e. when there is more than one explanatory variable.

Linear Regression Equations

Let’s directly delve into multiple linear regression using python via Jupyter.

Importing the necessary packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt #for plotting purposes
from sklearn.linear_model import LinearRegression #for implementing multiple linear regression

Let’s read the dataset, which contains the stock information of Carriage Services, Inc. from Yahoo Finance for the period May 29, 2018 to May 29, 2019 on a daily basis.


parse_dates=True tells pandas to parse the date column into datetime values (displayed in ISO 8601 format) instead of leaving it as strings.
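The read step itself was lost in formatting. Here is a minimal, self-contained sketch of it; since the downloaded CSV isn’t available here, the two sample rows below are invented stand-ins in Yahoo Finance’s column layout:

```python
import io
import pandas as pd

# Invented sample rows mimicking the Yahoo Finance CSV layout;
# in the article, this would be pd.read_csv on the downloaded file.
csv_data = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2018-05-29,25.20,25.40,24.95,25.10,24.80,45300\n"
    "2018-05-30,25.15,25.60,25.00,25.55,25.24,51200\n"
)

# parse_dates makes pandas parse the Date column into datetime64 values
df = pd.read_csv(csv_data, parse_dates=['Date'])
print(df.dtypes)
```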

Let’s take a quick look at the dataset.


What I want to do is predict Volume based on the Date, Open, High, Low, Close and Adj Close features. Therefore, I have:

Independent Variables: Date, Open, High, Low, Close, Adj Close

Dependent Variable: Volume (to be predicted)

All variables are in numerical format except ‘Date’, which is a string. Since linear regression doesn’t work on date data, we need to convert the dates into numerical values. Let’s do that.

import datetime as ddt

Now we have a new dataset where the ‘Date’ column is converted into numerical format.
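The conversion code itself did not survive formatting; a minimal sketch, assuming the common toordinal recipe (the two sample dates are invented):

```python
import datetime as ddt
import pandas as pd

# Stand-in for the stock DataFrame's parsed Date column
df = pd.DataFrame({'Date': pd.to_datetime(['2018-05-29', '2018-05-30'])})

# Map each timestamp to its proleptic Gregorian ordinal: an integer
# count of days, which linear regression can work with
df['Date'] = df['Date'].map(ddt.datetime.toordinal)
print(df['Date'].tolist())
```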

Now, we can segregate the data into two components, X and Y, where X is the set of independent variables and Y is the dependent variable.

X=df[['Date','Open','High','Low','Close','Adj Close']]
Y=df['Volume']

Finally, we have created two variables. Now, it’s time to perform Linear regression.

reg=LinearRegression()     #initiating linear regression,Y)                    #fitting the model

Now, let’s find the intercept (b0) and coefficients ( b1,b2, …bn).

Note: There is only one intercept, but the number of coefficients depends on the number of independent variables. Since we have six independent variables, we will have six coefficients.
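To see this concretely, here is a self-contained sketch on made-up data: a regression with six predictors yields one intercept and six coefficients, read off via sklearn’s `intercept_` and `coef_` attributes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                    # six independent variables
true_coef = np.array([3.0, -2.0, 1.0, 0.5, -1.0, 4.0])
Y = 10.0 + X @ true_coef                         # known intercept b0 = 10

reg = LinearRegression().fit(X, Y)
print(reg.intercept_)   # b0: a single intercept
print(reg.coef_)        # b1..b6: one coefficient per predictor
```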


So, when we print the intercept in the command line, it shows 247271983.66429374. This is the y-intercept, i.e. the predicted value when all independent variables are 0. Similarly, when we print the coefficients, it gives them as an array.

Output: array([ -335.18533165, -65074.710619 , 215821.28061436,
-169032.31885477, -186620.30386934, 196503.71526234])

Hence, our regression equation becomes:

Y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 + b5*x5 + b6*x6

where x1, x2, x3, x4, x5, x6 are the values we can use for prediction, matched to the columns in order: x1 is Date, x2 is Open, x3 is High, x4 is Low, x5 is Close, and x6 is Adj Close.
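A quick sketch on made-up data showing that plugging values into this equation by hand gives the same answer as the model’s own predict method:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
Y = rng.normal(size=50)
reg = LinearRegression().fit(X, Y)

x_new = rng.normal(size=6)                            # one new row of predictors
manual = reg.intercept_ + np.dot(reg.coef_, x_new)    # b0 + b1*x1 + ... + b6*x6
model = reg.predict(x_new.reshape(1, -1))[0]
print(manual, model)
```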

That’s it. We have completed our multiple linear regression model.

If we want more detail, we can perform multiple linear regression analysis using statsmodels. Statsmodels is a Python module that provides classes and functions for estimating many different statistical models, as well as for conducting statistical tests.

First of all, let’s import the package.

import statsmodels.api as ssm #for a detailed description of linear coefficients, intercepts, deviations, and many more

Let’s work on it.

X=ssm.add_constant(X)             #to add a constant value to the model
model= ssm.OLS(Y,X).fit()         #fitting the model
predictions= model.summary()      #summary of the model
predictions

When I print the predictions, it shows the following output:

From the summary, we can see that the coefficients and intercept we found earlier agree with the output from statsmodels.

Hence, our work is finished. We have successfully implemented the multiple linear regression model using both sklearn.linear_model and statsmodels.

Happy coding!
