Learning Python Regression Analysis — part 4 : Multiple Linear Regression

5 min readDec 7, 2016

In the previous posts we saw setups, OLS and simple linear regressions. Multiple Linear Regression is similar to simple linear regression but the major difference being that we try to establish linear relationship between one response variable and more than one predictor variables. For example suppose that a researcher is studying that how the housing prices are affected by area of the apartment and the demand supply gap in that region. The response variable is apartment price (Y) and the predictor variables are area of apartment (X1) and demand supply gap (X2).

The general structure of linear regression model in this case would be:

Y = a.X1 + b.X2 + c

And instead of a line, our linear model would be in the form of a plane. In cases of more than two predictor variables, we may imagine the model as a more complex shape.

Starting with an example

We will perform multiple linear regression on the sample data set of sales of a particular segment of mobile handsets of various brands with almost similar specifications, in a particular year. In the sample dataset mentioned below in table 3, we have data for 15 handsets of similar technical specifications and we will model the sales (response variable Y) with respect to handset unit price (predictor variable X1) and advertising spends (predictor variable X2). In real world scenario, there could be several other factors affecting the sales of a handset but for the sake of simplicity in this example, we are considering only two major predictor variables.

Table 3: Input dataset for handset sales in a year.

+----------------+--------------+---------------------------+-------
| Handset S.No.  | Unit price$  | Advrt spends in mn | Sales (mn) |
+----------------+--------------+---------------------------+-------
|             1  |         150  |               100  | 0.73       
|             2  |         159  |               200  | 1.39       
|             3  |         170  |               350  | 2.03       
|             4  |         175  |               400  | 1.45       
|             5  |         179  |               500  | 1.82       
|             6  |         180  |               180  | 1.32       
|             7  |         189  |               159  | 0.83       
|             8  |         199  |               110  | 0.53       
|             9  |         199  |               400  | 1.95       
|            10  |         199  |               230  | 1.27       
|            11  |         235  |               120  | 0.49       
|            12  |         239  |               340  | 1.03       
|            13  |         239  |               360  | 1.24       
|            14  |         249  |               145  | 0.55       
|            15  |         249  |               400  | 1.3        
+----------------+--------------+---------------------------+-------

We will use the same OLS method of statsmodels to perform linear regression analysis with one response and two predictor variables.

>>> import numpy as np
>>> import statsmodels.api as sm
>>> X= [[150,100],[159,200],[170,350],[175,400],[179,500],[180,180],[189,159],[199,110],[199,400],[199,230],[235,120],[239,340],[239,360],[249,145],[249,400]]>>> Y= [0.73,1.39,2.03,1.45,1.82,1.32,0.83,0.53,1.95,1.27,0.49,1.03,1.24,0.55,1.3]

We can perform linear regression without using formula api by adding a constant term for intercept, as shown already in the crop yield example.

>>> X_1=sm.add_constant(X)
>>> results=sm.OLS(Y,X_1).fit()
>>> results.paramsarray([ 1.63384178, -0.00637932,  0.00316077])

Here we have obtained the regression coefficients for equation, Y= c + a.X1 + b.X2 as c=1.63384178, a=-0.00637932 and b=0.00316077.

I have an opinion that as the number of predictor variables increase, use of data frames and formulas is much more convenient but there may be individual choices as well. We will perform the same exercise using the formula api in statsmodels.

>>> import pandas as pd
>>> df2=pd.DataFrame(X,columns=['Price','AdSpends'])
>>> df2['Sales']=pd.Series(Y)
>>> df2     Price  AdSpends  Sales
0     150       100   0.73
1     159       200   1.39
2     170       350   2.03
3     175       400   1.45
4     179       500   1.82
5     180       180   1.32
6     189       159   0.83
7     199       110   0.53
8     199       400   1.95
9     199       230   1.27
10    235       120   0.49
11    239       340   1.03
12    239       360   1.24
13    249       145   0.55
14    249       400   1.30

We have created the DataFrame with our input data. And now in the steps below, we will invoke ols method of formula api.

>>> from mpl_toolkits.mplot3d import Axes3D
>>> import matplotlib.pyplot as plt
>>> import statsmodels.formula.api as smf
>>> model = smf.ols(formula='Sales ~ Price + AdSpends', data=df2)
>>> results_formula = model.fit()
>>> results_formula.paramsIntercept    1.633842
Price       -0.006379
AdSpends     0.003161

Till now we have learned the regression coefficients. But to plot regression model we need to draw a plane in 3D instead of a line. Please note that to plot 3d imagery, we will be using standard mpl_toolkit.

>>> x_surf, y_surf = np.meshgrid(np.linspace(df2.Price.min(), df2.Price.max(), 100),np.linspace(df2.AdSpends.min(), df2.AdSpends.max(), 100))
>>> onlyX = pd.DataFrame({'Price': x_surf.ravel(), 'AdSpends': y_surf.ravel()})
>>> fittedY=results_formula.predict(exog=onlyX)

We have computed the fitted values of response variable for given predictor variables. Now we will use actual values of Y to draw a scatter plot of data points and predicted values of Y to draw the learned model plane.

>>> fig = plt.figure()
>>> ax = fig.add_subplot(111, projection='3d')
>>> ax.scatter(df2['Price'],df2['AdSpends'],df2['Sales'],c='blue', marker='o', alpha=0.5)
>>> ax.plot_surface(x_surf,y_surf,fittedY.reshape(x_surf.shape), color='None', alpha=0.01)
>>> ax.set_xlabel('Price')
>>> ax.set_ylabel('AdSpends')
>>> ax.set_zlabel('Sales')
>>> plt.show()

With the code snippet above, we have generated a 3D plot, as shown in figure 5, for learned model for mobile handset sales data. The generated plot can be rotated easily to visualize from different angles to clearly understand the model’s fit for given data. The figure 6 below shows a different rotation of our model plot.

Figure 5: 3D plot of fitted model (plane) for mobile handsets sales data

Figure 6: Side view of 3D plot of fitted model (plane) for mobile handsets sales data

Prediction using Multiple Linear Regression Model

We can use the learned model to predict the sales volume for unseen sample of Prices and Advertising Spends.

>>> X_new= [[180,100],[199,200],[170,370],[195,400],[279,400],[280,280]]
>>> df2_new= pd.DataFrame(X_new,columns=['Price','AdSpends'])
>>> results_formula.predict(df2_new)array([ 0.80164183,  0.99651183,  1.71884293,  1.65418312,  1.11832054,0.7326488 ])

The predicted values for new sample have been listed in table 5 below.

Table 5: Predicted sales in number of units for a given sample of unit prices and ad spend.

+-------------+--------------------+------------------+
| Unit Prices | Advertising Spends | Predicted Sales  |
+-------------+--------------------+------------------+
|         180 |                100 | 0.80164183       |
|         199 |                200 | 0.99651183       |
|         170 |                370 | 1.71884293       |
|         195 |                400 | 1.65418312       |
|         279 |                400 | 1.11832054,      |
|         280 |                280 | 0.7326488        |
+-------------+--------------------+------------------+