# Learning Python Regression Analysis — part 4 : Multiple Linear Regression

In the previous posts we covered setup, OLS and simple linear regression. Multiple linear regression is similar to simple linear regression, the major difference being that we try to establish a linear relationship between one response variable and more than one predictor variable. For example, suppose a researcher is studying how housing prices are affected by the area of an apartment and the demand-supply gap in that region. The response variable is the apartment price (Y), and the predictor variables are the area of the apartment (X1) and the demand-supply gap (X2).

The general structure of linear regression model in this case would be:

Y = a.X1 + b.X2 + c

And instead of a line, our linear model takes the form of a plane. With more than two predictor variables, the model becomes a hyperplane in a higher-dimensional space.
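As a quick illustration of the plane equation above, the following sketch evaluates Y = a.X1 + b.X2 + c at a point. The coefficient values here are made up for demonstration, not fitted from any data:

```python
# Evaluate the plane Y = a*X1 + b*X2 + c at a point.
# The coefficients a, b, c below are illustrative, not fitted values.
def plane(x1, x2, a=2.0, b=0.5, c=1.0):
    """Predicted response for predictor values x1 and x2."""
    return a * x1 + b * x2 + c

print(plane(3.0, 4.0))  # 2.0*3.0 + 0.5*4.0 + 1.0 = 9.0
```

A fitted regression replaces these placeholder coefficients with values learned from data, as we do next.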

### Starting with an example

We will perform multiple linear regression on a sample dataset of yearly sales for a particular segment of mobile handsets of various brands with broadly similar specifications. In the dataset shown below in table 3, we have data for 15 handsets, and we will model the sales (response variable Y) with respect to the handset unit price (predictor variable X1) and advertising spends (predictor variable X2). In a real-world scenario there could be several other factors affecting handset sales, but for the sake of simplicity we consider only these two major predictor variables.

Table 3: Input dataset for handset sales in a year.

```
+---------------+----------------+----------------+------------+
| Handset S.No. | Unit price ($) | Ad spends (mn) | Sales (mn) |
+---------------+----------------+----------------+------------+
|             1 |            150 |            100 |       0.73 |
|             2 |            159 |            200 |       1.39 |
|             3 |            170 |            350 |       2.03 |
|             4 |            175 |            400 |       1.45 |
|             5 |            179 |            500 |       1.82 |
|             6 |            180 |            180 |       1.32 |
|             7 |            189 |            159 |       0.83 |
|             8 |            199 |            110 |       0.53 |
|             9 |            199 |            400 |       1.95 |
|            10 |            199 |            230 |       1.27 |
|            11 |            235 |            120 |       0.49 |
|            12 |            239 |            340 |       1.03 |
|            13 |            239 |            360 |       1.24 |
|            14 |            249 |            145 |       0.55 |
|            15 |            249 |            400 |       1.30 |
+---------------+----------------+----------------+------------+
```

We will use the same OLS method of statsmodels to perform linear regression analysis with one response and two predictor variables.

```python
>>> import numpy as np
>>> import statsmodels.api as sm
>>> X = [[150,100],[159,200],[170,350],[175,400],[179,500],
...      [180,180],[189,159],[199,110],[199,400],[199,230],
...      [235,120],[239,340],[239,360],[249,145],[249,400]]
>>> Y = [0.73,1.39,2.03,1.45,1.82,1.32,0.83,0.53,1.95,1.27,0.49,1.03,1.24,0.55,1.3]
```

We can perform linear regression without the formula API by adding a constant term for the intercept, as shown earlier in the crop yield example.

```python
>>> X_1 = sm.add_constant(X)
>>> results = sm.OLS(Y, X_1).fit()
>>> results.params
array([ 1.63384178, -0.00637932,  0.00316077])
```

Here we have obtained the regression coefficients for the equation Y = c + a.X1 + b.X2: c = 1.63384178, a = -0.00637932 and b = 0.00316077.
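As a cross-check (not part of the original workflow), the same coefficients can be recovered with plain NumPy least squares, since OLS is exactly the least-squares solution. The sketch below assumes the dataset from table 3:

```python
import numpy as np

# Dataset from table 3: [unit price, ad spends] and sales.
X = np.array([[150,100],[159,200],[170,350],[175,400],[179,500],
              [180,180],[189,159],[199,110],[199,400],[199,230],
              [235,120],[239,340],[239,360],[249,145],[249,400]], dtype=float)
Y = np.array([0.73,1.39,2.03,1.45,1.82,1.32,0.83,0.53,1.95,1.27,
              0.49,1.03,1.24,0.55,1.3])

# Prepend a column of ones for the intercept, then solve min ||A @ theta - Y||.
A = np.column_stack([np.ones(len(X)), X])
theta, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(theta)  # approximately [1.6338, -0.0064, 0.0032], matching sm.OLS
```

This agreement is expected: `sm.OLS(...).fit()` minimizes the same sum of squared residuals that `np.linalg.lstsq` solves directly.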

In my opinion, as the number of predictor variables increases, using data frames and formulas becomes much more convenient, though this is a matter of personal preference. We will now perform the same exercise using the formula API in statsmodels.

```python
>>> import pandas as pd
>>> df2 = pd.DataFrame(X, columns=['Price','AdSpends'])
>>> df2['Sales'] = pd.Series(Y)
>>> df2
    Price  AdSpends  Sales
0     150       100   0.73
1     159       200   1.39
2     170       350   2.03
3     175       400   1.45
4     179       500   1.82
5     180       180   1.32
6     189       159   0.83
7     199       110   0.53
8     199       400   1.95
9     199       230   1.27
10    235       120   0.49
11    239       340   1.03
12    239       360   1.24
13    249       145   0.55
14    249       400   1.30
```

We have created the DataFrame with our input data. Now, in the steps below, we will invoke the ols method of the formula API.

```python
>>> from mpl_toolkits.mplot3d import Axes3D
>>> import matplotlib.pyplot as plt
>>> import statsmodels.formula.api as smf
>>> model = smf.ols(formula='Sales ~ Price + AdSpends', data=df2)
>>> results_formula = model.fit()
>>> results_formula.params
Intercept    1.633842
Price       -0.006379
AdSpends     0.003161
dtype: float64
```

So far we have learned the regression coefficients. But to plot this regression model we need to draw a plane in 3D instead of a line. Note that to plot 3D imagery we will use the standard mpl_toolkits package.

```python
>>> x_surf, y_surf = np.meshgrid(np.linspace(df2.Price.min(), df2.Price.max(), 100),
...                              np.linspace(df2.AdSpends.min(), df2.AdSpends.max(), 100))
>>> onlyX = pd.DataFrame({'Price': x_surf.ravel(), 'AdSpends': y_surf.ravel()})
>>> fittedY = results_formula.predict(exog=onlyX)
```

We have computed the fitted values of the response variable over the grid of predictor values. Now we will use the actual values of Y to draw a scatter plot of the data points, and the predicted values of Y to draw the learned model plane.

```python
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111, projection='3d')
>>> ax.scatter(df2['Price'], df2['AdSpends'], df2['Sales'], c='blue', marker='o', alpha=0.5)
>>> # predict() returns a pandas Series, so convert to an array before reshaping
>>> ax.plot_surface(x_surf, y_surf, fittedY.values.reshape(x_surf.shape),
...                 color='None', alpha=0.01)
>>> ax.set_xlabel('Price')
>>> ax.set_ylabel('AdSpends')
>>> ax.set_zlabel('Sales')
>>> plt.show()
```

With the code snippet above we have generated a 3D plot, shown in figure 5, of the learned model for the mobile handset sales data. The plot can easily be rotated to view it from different angles and better understand the model's fit to the data. Figure 6 below shows a different rotation of the model plot.

### Prediction using Multiple Linear Regression Model

We can use the learned model to predict the sales volume for unseen samples of prices and advertising spends.

```python
>>> X_new = [[180,100],[199,200],[170,370],[195,400],[279,400],[280,280]]
>>> df2_new = pd.DataFrame(X_new, columns=['Price','AdSpends'])
>>> results_formula.predict(df2_new)
array([ 0.80164183,  0.99651183,  1.71884293,  1.65418312,  1.11832054,
        0.7326488 ])
```

The predicted values for new sample have been listed in table 5 below.

Table 5: Predicted sales in number of units for a given sample of unit prices and ad spend.

```
+-------------+--------------------+-----------------+
| Unit Prices | Advertising Spends | Predicted Sales |
+-------------+--------------------+-----------------+
|         180 |                100 |      0.80164183 |
|         199 |                200 |      0.99651183 |
|         170 |                370 |      1.71884293 |
|         195 |                400 |      1.65418312 |
|         279 |                400 |      1.11832054 |
|         280 |                280 |      0.7326488  |
+-------------+--------------------+-----------------+
```
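As a sanity check, any row of this table can be reproduced by hand from the fitted plane Sales = c + a.Price + b.AdSpends with the coefficients obtained earlier. The sketch below verifies the first new sample; the small discrepancy against the table comes only from the printed coefficients being rounded:

```python
# Fitted coefficients from the regression above (rounded as printed).
c, a, b = 1.63384178, -0.00637932, 0.00316077

# First unseen sample: price 180, ad spends 100.
price, adspends = 180, 100
sales = c + a * price + b * adspends
print(round(sales, 6))  # close to the tabled value 0.80164183
```

Working through one prediction by hand like this is a useful way to confirm that the model object and the reported coefficients agree.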

### Do It Yourself

You may try linear regression on the following open datasets to play around with statsmodels in Python.

Census data

Web Intelligence

· Movie rating behaviors of the users: http://www.cs.uic.edu/~liub/Netflix-KDD-Cup-2007.html

· Book ratings provided by different users: http://www2.informatik.uni-freiburg.de/~cziegler/BX/

· Internet traffic information on different servers: http://ita.ee.lbl.gov/index.html

Housing / Real Estate

· Housing prices and locality demographics https://archive.ics.uci.edu/ml/datasets/Housing

Sports

· Horse racing data: http://data.betfair.com/

· Cricket Matches data: http://cricsheet.org/

· Baseball data: http://www.seanlahman.com/baseball-archive/statistics/

· Football data: http://api.football-data.org/index

Finance

· NASDAQ stock prices: https://data.nasdaq.com/

· OSU financial data: http://fisher.osu.edu/fin/fdf/osudata.htm

· Yahoo finance: http://finance.yahoo.com

Healthcare

· Medicare data: http://data.medicare.gov

· Medicare coverage data: http://www.cms.gov/medicare-coverage-database/

· EHDP health datasets: http://www.ehdp.com/vitalnet/datasets.htm
