Multi-Linear Regression Using Python

Rafi Atha · Published in The Startup · Nov 21, 2020

Hi! It’s been a while since I last wrote an article here. In today’s article I want to talk about how to do a multi-linear regression analysis using Python. Most of the writing in this article is taken directly from my assignment at Telkom Digital Talent Incubator 2020 a few weeks ago. You can check the notebook here and try to follow along. So without further ado, let’s start!

Introduction

Regression analysis is a tool for building statistical models that characterize relationships between a dependent variable and one or more independent variables. Simple Linear Regression refers to the method used when there is only one independent variable, while Multi-Linear Regression refers to the method used when there is more than one independent variable. Multi-Linear Regression can be written as below:
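
y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε

where y is the dependent variable, x₁ … xₖ are the independent variables, β₀ is the intercept, β₁ … βₖ are the coefficients, and ε is the error term.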

In this example we will use multi-linear regression to analyze the relationship between a product’s price, its advertisement cost, and the number of products sold. We will also try to predict how many products will be sold given a specific price and advertisement cost.

Preparation

In the first code cell we will load the Python libraries we will be using, such as Pandas, NumPy, Matplotlib, and scikit-learn. We will also load our dataset from my GitHub repository into a dataframe called df_pie using Pandas.
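
A minimal sketch of that cell could look like this (the CSV filename below is just a placeholder for the file in the repository):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# load the pie sales dataset into a dataframe
# ("pie_sales.csv" is a placeholder for the actual file in my GitHub repository)
df_pie = pd.read_csv("pie_sales.csv")
df_pie.head()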

As seen above, our dataset consists of 3 columns (pie_sales, price, and advertising) and 15 rows. We will try to predict how many pies will be sold depending on the price and advertisement cost.

Descriptive Analysis

Before going deeper into using multi-linear regression, it’s always a good idea to simply visualize our data to understand it better and see if there are any relationships between the variables. To do this we will use the pairplot() function from the Seaborn library. The function will output a figure containing a histogram for each variable and a scatter plot for each pair of variables.
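
A minimal sketch of that cell:

# pairwise histograms and scatter plots for pie_sales, price, and advertising
sns.pairplot(df_pie)
plt.show()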

Looking at the first row of the figure, we can see that there might be relationships between price, advertising, and pie_sales. The scatter plot between pie sales and price displays a pattern of negative relation, which means the higher the price, the lower the sales. On the other hand, the scatter plot between advertising and pie sales displays a positive relation: the more money we spend on advertising, the more pies we sell.

Building Regression Model

Since we have already seen that there might be relationships between our independent and dependent variables, let’s continue to building our regression model. We will use the LinearRegression() class from the sklearn library to build our model.
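
A sketch of this cell (the names X, y, and model are just illustrative):

from sklearn.linear_model import LinearRegression

# independent variables (predictors) and dependent variable (target)
X = df_pie[["price", "advertising"]]
y = df_pie["pie_sales"]

# fit the multi-linear regression model
model = LinearRegression()
model.fit(X, y)

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)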

Intercept: 306.5261932837436
Coefficients: [-24.97508952 74.13095749]

The code above printed a few important values from our model: the intercept and the coefficients, which can be put into a mathematical equation as below:
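
pie_sales = 306.526 − 24.975 · price + 74.131 · advertising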

Let’s break down what each of those numbers means:

  • The intercept is the estimated average value of our dependent variable when all of our independent variables are 0. In our case this means that if we sell our pie at a price of 0 and spend 0 on advertising, we would sell about 306 pies.
  • For the coefficients we have two values, for the price and advertising variables respectively. Each coefficient represents the relation of its independent variable to the dependent variable: a change of exactly 1 in the independent variable changes the dependent variable by the value of the coefficient, holding the other variables constant. For example, if we increase our advertising expense by 10, we will increase our sales by about 741 pies (74.1309 * 10).

Now, let’s try to predict our pie sales by inputting our own data below…
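
A sketch of that cell, reusing the fitted model from above (the prompt strings mirror the output shown below):

# ask the user for a price and an advertising budget, then predict sales
price = float(input("What is the price of the pie? "))
advertising = float(input("How much money are you going to spend for advertising? "))

new_data = pd.DataFrame([[price, advertising]], columns=["price", "advertising"])
predicted = model.predict(new_data)[0]

print(f"We predict {predicted:.0f} pies will be sold "
      f"if we sold the pie at ${price:g} and spend ${advertising:g} at advertising.")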

What is the price of the pie? 
3.4
How much money are you going to spend for advertising?
5
We predict 592 pies will be sold if we sold the pie at $3.4 and spend $5 at advertising.

Before going into the next step we will visualize our model in a 3D graph with the code cell below. We will draw the linear model as a blue plane and plot our data points as grey dots.
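
A sketch of what that cell could look like, assuming the fitted sklearn model from before (the grid construction is just one way to draw the plane):

from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection (needed on older Matplotlib)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")

# regression plane over the observed range of price and advertising
xx, yy = np.meshgrid(
    np.linspace(df_pie["price"].min(), df_pie["price"].max(), 20),
    np.linspace(df_pie["advertising"].min(), df_pie["advertising"].max(), 20),
)
zz = model.intercept_ + model.coef_[0] * xx + model.coef_[1] * yy
ax.plot_surface(xx, yy, zz, color="blue", alpha=0.3)

# observed data points
ax.scatter(df_pie["price"], df_pie["advertising"], df_pie["pie_sales"], color="grey")
ax.set_xlabel("price")
ax.set_ylabel("advertising")
ax.set_zlabel("pie_sales")
plt.show()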

Here is the full 360° view of the model visualization:


Model Validation

After building the model it is important for us to validate its performance. We can evaluate a model by looking at its coefficient of determination (R²), F-test, t-test, and residuals. Before we continue we will rebuild our model using the statsmodels library with the OLS() function, then print the model summary using the summary() method on the model. The summary contains many of the important values we can use to evaluate our model.
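
A sketch of this cell (ols_model is an illustrative name):

import statsmodels.api as sm

# statsmodels does not add an intercept automatically, so add a constant column
X_sm = sm.add_constant(df_pie[["price", "advertising"]])
ols_model = sm.OLS(df_pie["pie_sales"], X_sm).fit()

print(ols_model.summary())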

Coefficient of Determination (R²)

The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variables. The R² score is calculated as below:
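
R² = 1 − SSE / SST = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²

where ŷᵢ is the predicted value for observation i and ȳ is the mean of the observed values.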

In statsmodels we can obtain the R² value of our model by accessing its .rsquared attribute.
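
For example, using the ols_model from the sketch above:

print("R2 score:", ols_model.rsquared)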

R2 score: 0.5214779360292285

R² ranges between 0 and 1, where R² = 0 means there is no linear relationship between the variables and R² = 1 shows a perfect linear relationship. In our case we got an R² score of about 0.5214, which means about 52.14% of the variation in our dependent variable can be explained by our independent variables.

F-Test (ANOVA)

The F-test or ANOVA (analysis of variance) in multi-linear regression can be used to determine whether our model performs better than a simpler model (e.g. a model with no independent variables, only an intercept). With the F-test we can evaluate the significance of our model by calculating the probability of observing an F-statistic at least as high as the value our model obtained. Similar to the R² score, we can easily get the F-statistic and its probability by accessing the .fvalue and .f_pvalue attributes of our model as below.
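
For example, using the ols_model from the sketch above:

print("F-statistic:", ols_model.fvalue)
print("Probability of observing value at least as high as F-statistic:", ols_model.f_pvalue)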

F-statistic: 6.538606789020464
Probability of observing value at least as high as F-statistic: 0.01200637223318641

Because our f_pvalue is lower than 0.05, we can conclude that our model is statistically significant, i.e. it performs better than the simpler intercept-only model.

T-test

The t-statistic is the coefficient divided by its standard error. The standard error is an estimate of the standard deviation of the coefficient, i.e. the amount it varies across cases. It can be thought of as a measure of the precision with which the regression coefficient is measured. As with the F-test, the p-value shows the probability of seeing a result as extreme as the one our model obtained. We can get the p-values for all of our variables by accessing the .pvalues attribute of the model.
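
For example, using the ols_model from the sketch above:

print(ols_model.pvalues)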

const          0.019932
price          0.039788
advertising    0.014494
dtype: float64

Both of our independent variables, price and advertising, have p-values less than 0.05, which shows that there is sufficient evidence that price and advertising affect our pie sales.

Assumption Testing

Next, we will validate our model by doing residual analysis. Below is the list of tests and assumptions we will check to assess our model’s validity:

  • Linearity
  • Normality
  • Multicollinearity
  • Autocorrelation
  • Homoscedasticity

A residual is the difference between the observed value and the predicted value. With statsmodels we can easily get the residuals of our model by accessing the .resid attribute of the model, and then keep them in a new column called 'residual' in our df_pie dataframe.
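
A sketch of that step:

# store the residuals (observed minus predicted) in a new column
df_pie["residual"] = ols_model.resid
df_pie.head()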

Linearity

This assumes that there is a linear relationship between the independent variables and the dependent variable. In our case, since we have multiple independent variables, we can check this by using a scatter plot of our predicted values versus the actual values.
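
A sketch of such a plot, using the model’s fittedvalues as the predicted values:

# predicted vs. actual values; points should fall around the diagonal line
plt.scatter(ols_model.fittedvalues, df_pie["pie_sales"], color="grey")
line = [df_pie["pie_sales"].min(), df_pie["pie_sales"].max()]
plt.plot(line, line, color="red")   # diagonal reference line
plt.xlabel("Predicted pie_sales")
plt.ylabel("Actual pie_sales")
plt.show()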

The scatter plot shows the points evenly spread around the diagonal line, so we can assume that there is a linear relationship between our independent and dependent variables.

Normality

This assumes that the error terms of the model are normally distributed. We will examine the normality of the residuals by plotting them in a histogram and looking at the p-value from the Anderson-Darling test for normality. We will use the normal_ad() function from statsmodels to calculate our p-value and then compare it to a threshold of 0.05; if the p-value is higher than the threshold, we can assume that our residuals are normally distributed.
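
A sketch of that check:

from statsmodels.stats.diagnostic import normal_ad

# Anderson-Darling test on the residuals (returns the statistic and the p-value)
p_value = normal_ad(df_pie["residual"])[1]
print("p-value from the Anderson-Darling test below 0.05 generally means non-normal:", p_value)
if p_value < 0.05:
    print("Residuals are not normally distributed")
else:
    print("Residuals are normally distributed")

# histogram of the residuals
df_pie["residual"].hist()
plt.show()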

p-value from the Anderson-Darling test below 0.05 generally means non-normal: 0.6655438857701688
Residuals are normally distributed

From the code above we got a p-value of 0.6655, which can be considered normal because it’s above the 0.05 threshold. The histogram also shows a roughly normal distribution (although it might look a little skewed because we only have 15 observations in our dataset). From both of those results we can assume that our residuals are normally distributed.

Multicollinearity

This assumes that the predictors used in the regression are not correlated with each other. To identify whether there is any correlation between our predictors, we can calculate the Pearson correlation coefficient between each column in our data using the corr() function of the Pandas dataframe. Then we can display it as a heatmap using the heatmap() function from Seaborn.
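
A sketch of that cell:

# Pearson correlation matrix of the three original columns
corr = df_pie[["pie_sales", "price", "advertising"]].corr()
print("Pearson correlation coefficient matrix of the variables:")
print(corr)

# visualize the matrix as a heatmap
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()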

Pearson correlation coefficient matrix of the variables:
             pie_sales     price  advertising
pie_sales     1.000000 -0.443273     0.556320
price        -0.443273  1.000000     0.030438
advertising   0.556320  0.030438     1.000000

The matrix shows that there is a positive relationship between advertising and pie_sales and a negative relationship between price and pie_sales. Both of these results support our model from before. Most importantly, notice that price and advertising have a correlation coefficient of almost 0. This means our independent variables are not affecting each other and that there is no multicollinearity in our data.

Autocorrelation

Autocorrelation is correlation of the errors (residuals) over time; it is checked when data are collected over time. Autocorrelation exists if residuals in one time period are related to residuals in another period. We can detect autocorrelation by performing the Durbin-Watson test to determine whether positive or negative correlation is present. In this step we will use the durbin_watson() function from statsmodels to calculate our Durbin-Watson score and then assess the value with the following conditions (see the sketch after this list):

  • If the Durbin-Watson score is less than 1.5 then there is a positive autocorrelation and the assumption is not satisfied
  • If the Durbin-Watson score is between 1.5 and 2.5 then there is no autocorrelation and the assumption is satisfied
  • If the Durbin-Watson score is more than 2.5 then there is a negative autocorrelation and the assumption is not satisfied
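
A sketch of that check:

from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(df_pie["residual"])
print("Durbin-Watson:", dw)

if dw < 1.5:
    print("Positive autocorrelation", "Assumption not satisfied", sep="\n")
elif dw > 2.5:
    print("Negative autocorrelation", "Assumption not satisfied", sep="\n")
else:
    print("Little to no autocorrelation", "Assumption satisfied", sep="\n")
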
Durbin-Watson: 1.6831203020921253
Little to no autocorrelation

Assumption satisfied

Our model got a Durbin-Watson score of about 1.6831, which is between 1.5 and 2.5, so we can assume that there is no autocorrelation in our residuals.

Homoscedasticity

This assumes homoscedasticity, that is, constant variance of our error terms. Heteroscedasticity, the violation of homoscedasticity, occurs when we don’t have even variance across the error terms. To check for homoscedasticity, we can plot our residuals against the predicted values and see if the variance appears to be uniform.
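
A sketch of such a plot, using the residuals computed earlier and the model’s fittedvalues:

# residuals vs. predicted values; the spread should look roughly constant
plt.scatter(ols_model.fittedvalues, df_pie["residual"], color="grey")
plt.axhline(0, color="red")
plt.xlabel("Predicted pie_sales")
plt.ylabel("Residual")
plt.show()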

Despite having only 15 data points, our residuals seem to have constant and uniform variance, so we can assume that the homoscedasticity assumption is satisfied.

Conclusion

Our model successfully passed all the tests in the model validation steps, so we can conclude that it can be used to predict future pie sales from the two independent variables, price and advertising. Still, our model only has an R² score of 52.14%, which means that about 48% of the variation in pie sales is explained by factors outside the model.

References

  1. Telkom Digital Talent Incubator — Data Scientist Module 4 (Regression)
  2. Multiple Linear Regression and Visualization in Python
  3. Testing Linear Regression Assumptions in Python
