Learning Python Regression Analysis — part 4 : Multiple Linear Regression

In the previous posts we saw setups, OLS and simple linear regressions. Multiple Linear Regression is similar to simple linear regression but the major difference being that we try to establish linear relationship between one response variable and more than one predictor variables. For example suppose that a researcher is studying that how the housing prices are affected by area of the apartment and the demand supply gap in that region. The response variable is apartment price (Y) and the predictor variables are area of apartment (X1) and demand supply gap (X2).

The general structure of linear regression model in this case would be:

Y = a.X1 + b.X2 + c

And instead of a line, our linear model would be in the form of a plane. In cases of more than two predictor variables, we may imagine the model as a more complex shape.

Starting with an example

We will perform multiple linear regression on the sample data set of sales of a particular segment of mobile handsets of various brands with almost similar specifications, in a particular year. In the sample dataset mentioned below in table 3, we have data for 15 handsets of similar technical specifications and we will model the sales (response variable Y) with respect to handset unit price (predictor variable X1) and advertising spends (predictor variable X2). In real world scenario, there could be several other factors affecting the sales of a handset but for the sake of simplicity in this example, we are considering only two major predictor variables.

Table 3: Input dataset for handset sales in a year.

+----------------+--------------+---------------------------+-------
| Handset S.No. | Unit price$ | Advrt spends in mn | Sales (mn) |
+----------------+--------------+---------------------------+-------
| 1 | 150 | 100 | 0.73
| 2 | 159 | 200 | 1.39
| 3 | 170 | 350 | 2.03
| 4 | 175 | 400 | 1.45
| 5 | 179 | 500 | 1.82
| 6 | 180 | 180 | 1.32
| 7 | 189 | 159 | 0.83
| 8 | 199 | 110 | 0.53
| 9 | 199 | 400 | 1.95
| 10 | 199 | 230 | 1.27
| 11 | 235 | 120 | 0.49
| 12 | 239 | 340 | 1.03
| 13 | 239 | 360 | 1.24
| 14 | 249 | 145 | 0.55
| 15 | 249 | 400 | 1.3
+----------------+--------------+---------------------------+-------

We will use the same OLS method of statsmodels to perform linear regression analysis with one response and two predictor variables.

>>> import numpy as np
>>> import statsmodels.api as sm
>>> X= [[150,100],[159,200],[170,350],[175,400],[179,500],[180,180],[189,159],[199,110],[199,400],[199,230],[235,120],[239,340],[239,360],[249,145],[249,400]]
>>> Y= [0.73,1.39,2.03,1.45,1.82,1.32,0.83,0.53,1.95,1.27,0.49,1.03,1.24,0.55,1.3]

We can perform linear regression without using formula api by adding a constant term for intercept, as shown already in the crop yield example.

>>> X_1=sm.add_constant(X)
>>> results=sm.OLS(Y,X_1).fit()
>>> results.params
array([ 1.63384178, -0.00637932,  0.00316077])

Here we have obtained the regression coefficients for equation, Y= c + a.X1 + b.X2 as c=1.63384178, a=-0.00637932 and b=0.00316077.

I have an opinion that as the number of predictor variables increase, use of data frames and formulas is much more convenient but there may be individual choices as well. We will perform the same exercise using the formula api in statsmodels.

>>> import pandas as pd
>>> df2=pd.DataFrame(X,columns=['Price','AdSpends'])
>>> df2['Sales']=pd.Series(Y)
>>> df2
     Price  AdSpends  Sales
0 150 100 0.73
1 159 200 1.39
2 170 350 2.03
3 175 400 1.45
4 179 500 1.82
5 180 180 1.32
6 189 159 0.83
7 199 110 0.53
8 199 400 1.95
9 199 230 1.27
10 235 120 0.49
11 239 340 1.03
12 239 360 1.24
13 249 145 0.55
14 249 400 1.30

We have created the DataFrame with our input data. And now in the steps below, we will invoke ols method of formula api.

>>> from mpl_toolkits.mplot3d import Axes3D
>>> import matplotlib.pyplot as plt
>>> import statsmodels.formula.api as smf
>>> model = smf.ols(formula='Sales ~ Price + AdSpends', data=df2)
>>> results_formula = model.fit()
>>> results_formula.params
Intercept    1.633842
Price -0.006379
AdSpends 0.003161

Till now we have learned the regression coefficients. But to plot regression model we need to draw a plane in 3D instead of a line. Please note that to plot 3d imagery, we will be using standard mpl_toolkit.

>>> x_surf, y_surf = np.meshgrid(np.linspace(df2.Price.min(), df2.Price.max(), 100),np.linspace(df2.AdSpends.min(), df2.AdSpends.max(), 100))
>>> onlyX = pd.DataFrame({'Price': x_surf.ravel(), 'AdSpends': y_surf.ravel()})
>>> fittedY=results_formula.predict(exog=onlyX)

We have computed the fitted values of response variable for given predictor variables. Now we will use actual values of Y to draw a scatter plot of data points and predicted values of Y to draw the learned model plane.

>>> fig = plt.figure()
>>> ax = fig.add_subplot(111, projection='3d')
>>> ax.scatter(df2['Price'],df2['AdSpends'],df2['Sales'],c='blue', marker='o', alpha=0.5)
>>> ax.plot_surface(x_surf,y_surf,fittedY.reshape(x_surf.shape), color='None', alpha=0.01)
>>> ax.set_xlabel('Price')
>>> ax.set_ylabel('AdSpends')
>>> ax.set_zlabel('Sales')
>>> plt.show()

With the code snippet above, we have generated a 3D plot, as shown in figure 5, for learned model for mobile handset sales data. The generated plot can be rotated easily to visualize from different angles to clearly understand the model’s fit for given data. The figure 6 below shows a different rotation of our model plot.

Figure 5: 3D plot of fitted model (plane) for mobile handsets sales data
Figure 6: Side view of 3D plot of fitted model (plane) for mobile handsets sales data

Prediction using Multiple Linear Regression Model

We can use the learned model to predict the sales volume for unseen sample of Prices and Advertising Spends.

>>> X_new= [[180,100],[199,200],[170,370],[195,400],[279,400],[280,280]]
>>> df2_new= pd.DataFrame(X_new,columns=['Price','AdSpends'])
>>> results_formula.predict(df2_new)
array([ 0.80164183,  0.99651183,  1.71884293,  1.65418312,  1.11832054,0.7326488 ])

The predicted values for new sample have been listed in table 5 below.

Table 5: Predicted sales in number of units for a given sample of unit prices and ad spend.

+-------------+--------------------+------------------+
| Unit Prices | Advertising Spends | Predicted Sales |
+-------------+--------------------+------------------+
| 180 | 100 | 0.80164183 |
| 199 | 200 | 0.99651183 |
| 170 | 370 | 1.71884293 |
| 195 | 400 | 1.65418312 |
| 279 | 400 | 1.11832054, |
| 280 | 280 | 0.7326488 |
+-------------+--------------------+------------------+

Do It Yourself

You may try linear regression on following open datasets to play around with statsmodels in python.

Census data

· http://censusindia.gov.in/2011-common/CensusDataSummary.html

· http://dataferrett.census.gov/TheDataWeb/

· http://www.census.gov/apsd/www/statbrief/

Web Intelligence

· Movie rating behaviors of the users: http://www.cs.uic.edu/~liub/Netflix-KDD-Cup-2007.html

· Book ratings provided by different users: http://www2.informatik.uni-freiburg.de/~cziegler/BX/

· Internet traffic information on different servers: http://ita.ee.lbl.gov/index.html

Housing / Real Estate

· Housing prices and locality demographics https://archive.ics.uci.edu/ml/datasets/Housing

· Housing data: http://www.econ.yale.edu/~shiller/data/csreadme.html

Sports

· Horse racing data: http://data.betfair.com/

· Cricket Matches data: http://cricsheet.org/

· Baseball data: http://www.seanlahman.com/baseball-archive/statistics/

· Football data: http://api.football-data.org/index

Finance

· NASDAQ stock prices: https://data.nasdaq.com/

· OSU financial data: http://fisher.osu.edu/fin/fdf/osudata.htm

· Yahoo finance: http://finance.yahoo.com

Healthcare

· Medicare data: http://data.medicare.gov

· Medicare coverage data: http://www.cms.gov/medicare-coverage-database/

· EHDP health datasets: http://www.ehdp.com/vitalnet/datasets.htm