Interpret R Linear/Multiple Regression output

(lm output point by point), also with Python

Published in

Analytics Vidhya

5 min read · Feb 17, 2018

Linear regression is a simple, basic, yet very powerful approach to supervised learning. It works well for predictive analysis and gives a generic baseline for any data set before moving on to more complex machine learning algorithms.

Linear regression has already been discussed a lot: almost every book that teaches data analysis describes it, and much more material is available on the internet. So I am skipping most of the detail and keeping only the basic idea: it is all about predicting a quantitative response Y from a predictor X (or several predictors, in multiple regression), on the assumption that the relationship between them is linear. Of course the coefficients and the intercept play a deciding role, and don’t forget the random error, which makes everything more real and earthly, almost everywhere! More detail is available at https://en.wikipedia.org/wiki/Linear_regression

Assuming we know enough about the concept, let’s try our hand at the real thing, i.e. writing code in R/Python. We start with R and a standard data set that is already clean. Why? Because getting and cleaning data, followed by data wrangling, is almost 60–70% of any data science or machine learning assignment.

Know your data

library(alr3)
## Loading required package: car
library(corrplot)
data(water) ## load the data
head(water) ## view the data
 
  Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594

filter.water <- water[,-1] ## Remove unwanted year 

# Visualize the data 
library(GGally)
ggpairs(filter.water) ## It's a multivariable regression

LM magic begins, thanks to R

The model has the form y_i = b0 + b1*x_i1 + b2*x_i2 + … + bp*x_ip + e_i for i = 1, 2, …, n. Here y = BSAAM and x1 … xp are all the other variables.

mlr <- lm(BSAAM~., data = filter.water)
summary(mlr)

# Output 

Call:
lm(formula = BSAAM ~ ., data = filter.water)

Residuals:
   Min     1Q Median     3Q    Max 
-12690  -4936  -1424   4173  18542 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15944.67    4099.80   3.889 0.000416 ***
APMAM         -12.77     708.89  -0.018 0.985725    
APSAB        -664.41    1522.89  -0.436 0.665237    
APSLAKE      2270.68    1341.29   1.693 0.099112 .  
OPBPC          69.70     461.69   0.151 0.880839    
OPRC         1916.45     641.36   2.988 0.005031 ** 
OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7557 on 36 degrees of freedom
Multiple R-squared:  0.9248,	Adjusted R-squared:  0.9123 
F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16

Output Explained

Residuals

This gives a basic idea of the difference between the observed value of the dependent variable (Y) and the value predicted by the model (Ŷ). The summary reports the minimum, first quartile, median, third quartile and maximum of the residuals. It is not usually the focus of the analysis, although a median near zero and roughly symmetric quartiles suggest the residuals are balanced around zero.
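As a minimal sketch of what this block summarizes, the Python snippet below computes residuals as observed minus predicted and returns the same five numbers. The observed and predicted values are made up for illustration (they are not from the water data), and the helper name residual_summary is hypothetical.

```python
import statistics

def residual_summary(observed, predicted):
    """Return (min, Q1, median, Q3, max) of the residuals observed - predicted."""
    residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]
    # method="inclusive" interpolates between the sample minimum and maximum;
    # which quartile convention to use is an assumption of this sketch
    q1, median, q3 = statistics.quantiles(residuals, n=4, method="inclusive")
    return min(residuals), q1, median, q3, max(residuals)

# Hypothetical values, for illustration only
observed = [10, 12, 15, 11, 18]
predicted = [11, 11, 14, 13, 16]
print(residual_summary(observed, predicted))
```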

Coefficients-Intercept

Besides the predictor variables, the table has one more row, ‘Intercept’. The intercept is the expected value of Y when all the predictors are 0, so it is what the model predicts without the contribution of any variable. It is rarely interpreted on its own in normal cases.

#            Estimate    Std. Error t value Pr(>|t|)    
# (Intercept) 15944.67    4099.80   3.889 0.000416 ***

Coefficient-Estimate

The estimate is the expected change in Y for a one-unit increase in X, holding the other predictors constant. In this case, a one-unit increase in OPSLAKE corresponds to an expected increase of 2211.58 units in BSAAM.
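To make the intercept and the estimates concrete, here is a small Python sketch that plugs the fitted coefficients from the output above into the regression equation. The function name predict_bsaam is hypothetical; the input row is the 1948 row shown by head(water).

```python
# Coefficients copied from the summary(mlr) output above
INTERCEPT = 15944.67
COEFFICIENTS = {
    "APMAM": -12.77,
    "APSAB": -664.41,
    "APSLAKE": 2270.68,
    "OPBPC": 69.70,
    "OPRC": 1916.45,
    "OPSLAKE": 2211.58,
}

def predict_bsaam(row):
    """Predicted BSAAM = intercept + sum of (coefficient * predictor value)."""
    return INTERCEPT + sum(COEFFICIENTS[name] * row[name] for name in COEFFICIENTS)

# First row of the water data (year 1948), as shown by head(water)
row_1948 = {"APMAM": 9.13, "APSAB": 3.58, "APSLAKE": 3.91,
            "OPBPC": 4.10, "OPRC": 7.43, "OPSLAKE": 6.47}

# Same row with one more unit of OPSLAKE, everything else unchanged
bumped = dict(row_1948, OPSLAKE=row_1948["OPSLAKE"] + 1)

print(predict_bsaam(row_1948))
print(predict_bsaam(bumped) - predict_bsaam(row_1948))  # approximately 2211.58
```

With every predictor set to 0, the same function returns just the intercept, 15944.67.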

Coefficient-Std. Error

The standard deviation of an estimate is called the standard error. The standard error of the coefficient measures how precisely the model estimates the coefficient’s unknown value. The standard error of the coefficient is always positive.

A low standard error relative to the estimate is helpful for our analysis. It is also used to build the confidence interval for the coefficient.
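A rough sketch of how the standard error turns into a confidence interval: estimate ± critical value × standard error. The critical value 2.028 is an assumption taken from a t table (approximately the 97.5th percentile of the t distribution with 36 degrees of freedom), and the function name is hypothetical.

```python
# Assumed value: approx. 97.5th percentile of t with 36 df, read from a t table
T_CRIT_95 = 2.028

def confidence_interval(estimate, std_error, t_crit=T_CRIT_95):
    """Approximate 95% confidence interval: estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin

# Estimate and Std. Error columns from the summary(mlr) output
print("OPRC ", confidence_interval(1916.45, 641.36))
print("APMAM", confidence_interval(-12.77, 708.89))
```

The OPRC interval stays above 0, which agrees with its small p value, while the APMAM interval straddles 0, which agrees with its large one.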

Coefficient-t value

t value = estimate/std error

A high absolute t value is helpful for our analysis, as it indicates that we can reject the null hypothesis that the coefficient is zero. It is used to calculate the p value.
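The formula is easy to check against the table. This short Python sketch recomputes each t value from the Estimate and Std. Error columns and prints it next to the value R reported; small differences in the last digit come from the rounded inputs.

```python
# (Estimate, Std. Error, reported t value) copied from the summary(mlr) output
rows = {
    "(Intercept)": (15944.67, 4099.80, 3.889),
    "APMAM": (-12.77, 708.89, -0.018),
    "APSAB": (-664.41, 1522.89, -0.436),
    "APSLAKE": (2270.68, 1341.29, 1.693),
    "OPBPC": (69.70, 461.69, 0.151),
    "OPRC": (1916.45, 641.36, 2.988),
    "OPSLAKE": (2211.58, 752.69, 2.938),
}

def t_value(estimate, std_error):
    """t value = estimate / standard error."""
    return estimate / std_error

for name, (est, se, reported) in rows.items():
    print(f"{name:12s} computed t = {t_value(est, se):7.3f}   reported t = {reported:7.3f}")
```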

Coefficient-Pr(>|t|)

This is the individual p value for each parameter, used to decide whether to reject the null hypothesis that the coefficient is zero, i.e. that X has no linear relationship with Y once the other predictors are accounted for. The lower the p value, the stronger the evidence against the null hypothesis. Misreading the p value leads to wrong conclusions: false positives (Type I errors) and false negatives (Type II errors).

The asterisks beside the p value mark its significance level: the lower the p value, the higher the significance.

# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
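For the curious, here is a sketch of how a t value becomes a two-sided p value and a significance code, using only the Python standard library. The numerical integration (Simpson's rule over the t density) is my own illustrative choice, not how R computes it internally, and the function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t, df, steps=2000):
    """P(|T| > |t|), integrating the density from 0 to |t| with Simpson's rule."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * area

def signif_code(p):
    """Map a p value to the codes in R's legend line."""
    for cutoff, code in ((0.001, "***"), (0.01, "**"), (0.05, "*"), (0.1, ".")):
        if p < cutoff:
            return code
    return " "

# t = estimate / std error, with the 36 residual degrees of freedom of the model
for name, est, se in (("(Intercept)", 15944.67, 4099.80),
                      ("APSLAKE", 2270.68, 1341.29),
                      ("OPRC", 1916.45, 641.36)):
    p = two_sided_p_value(est / se, 36)
    print(f"{name:12s} p = {p:.6f} {signif_code(p)}")
```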

Residual standard error

Residual standard error: 7557 on 36 degrees of freedom

In plain words, this is the average error of the model: how far, on average, the observed values fall from the predicted values. It is computed as the square root of the residual sum of squares divided by the degrees of freedom.

The degrees of freedom are the number of data points minus the number of estimated parameters. In this case we have 43 data points and 7 estimated parameters (6 coefficients plus the intercept), so 43 − 7 = 36 degrees of freedom.
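A minimal Python sketch of both ideas: the degrees of freedom as observations minus estimated parameters, and the residual standard error as sqrt(RSS / df). The residuals in the example are hypothetical, and the function name is my own.

```python
import math

def residual_standard_error(residuals, n_parameters):
    """Return (sqrt(RSS / df), df), where df = observations - estimated parameters."""
    df = len(residuals) - n_parameters
    rss = sum(r * r for r in residuals)  # residual sum of squares
    return math.sqrt(rss / df), df

# Degrees of freedom for the water model: 43 observations, 6 slopes + 1 intercept
print(43 - (6 + 1))

# Hypothetical residuals from a model with 2 estimated parameters
rse, df = residual_standard_error([-1.0, 2.0, -2.0, 1.0, 0.0], n_parameters=2)
print(rse, df)
```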

Multiple R-squared and Adjusted R-squared

Multiple R-squared:  0.9248,	Adjusted R-squared:  0.9123

R-squared is always between 0 and 1, and higher values are better. It is the percentage of variation in the response variable that is explained by the explanatory variables, so it measures how well the model explains the data. It never decreases when we add more variables, even irrelevant ones, so on its own it cannot tell us when to stop adding them.

This is why we look at the adjusted value: adjusted R-squared penalizes the number of predictors, so it increases only when a new variable improves the model more than would be expected by chance.
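The usual adjustment formula is 1 − (1 − R²)(n − 1)/(n − p − 1), with n observations and p predictors. A quick Python check with the numbers from this model (n = 43, p = 6); the function name is hypothetical.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Multiple R-squared from the output, 43 observations, 6 predictors
print(round(adjusted_r_squared(0.9248, 43, 6), 4))  # 0.9123, as in the output
```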

F-statistic

F-statistic: 73.82 on 6 and 36 DF

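This tests the relationship between the predictors and the response: the null hypothesis is that all coefficients (except the intercept) are zero. The higher the value, the more reason to reject the null hypothesis. It measures the significance of the overall model, not of any specific parameter.

DF: degrees of freedom. Here 6 is the number of predictors and 36 is the residual degrees of freedom.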
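For a model with an intercept, the F-statistic can be written in terms of R-squared: F = (R²/p) / ((1 − R²)/(n − p − 1)). The Python sketch below applies this to the rounded R-squared from the output, so the result is close to, but not exactly, the reported 73.82. The function name is hypothetical.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F = (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    explained = r_squared / n_predictors
    unexplained = (1 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained

# Rounded Multiple R-squared from the output, 43 observations, 6 predictors
print(f_statistic(0.9248, 43, 6))
```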

p-value

p-value: < 2.2e-16

This is the overall p value based on the F-statistic. Normally a p value less than 0.05 indicates that the overall model is significant.

So the Python

I am using the OLS (ordinary least squares) approach from statsmodels, but the same fit can also be produced using SciPy.

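import pandas as pd
import scipy.stats as stats
from statsmodels.formula.api import ols

df = pd.read_csv("water.csv")  # assumption: the water data exported from R (hypothetical file name)
mlr = ols("BSAAM~OPSLAKE+OPRC+OPBPC+APSLAKE+APSAB+APMAM", data=df).fit()
print(mlr.summary())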