Interpret R Linear/Multiple Regression output
(lm output point by point), also with Python
Linear regression is a simple, basic, yet very powerful approach to supervised learning. It works well for predictive analysis and gives a generic baseline for any data set before moving on to more complex machine learning algorithms.
Linear regression has been discussed at length. Almost every book on data analysis describes it, and far more material is available on the internet, so I am skipping most of the detail and keeping only the basic idea: it predicts a quantitative response Y from a single predictor X (or, in multiple regression, from several predictors), assuming the relationship between them is linear. The coefficients and the intercept play the deciding role, and don't forget the random error term, which makes everything more real and earthly, almost everywhere! More detail is available at https://en.wikipedia.org/wiki/Linear_regression
Assuming we know enough about the concept, let us try our hand at the real thing, i.e. writing code in R and Python. We start with R and a standard data set that is already clean. Why? Because getting, cleaning and wrangling data is almost 60–70% of any data science or machine learning assignment.
Know your data
library(alr3)
Loading required package: car
library(corrplot)
data(water) ## load the data
head(water) ## view the data
Year APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE BSAAM
1 1948 9.13 3.58 3.91 4.10 7.43 6.47 54235
2 1949 5.28 4.82 5.20 7.55 11.11 10.26 67567
3 1950 4.20 3.77 3.67 9.52 12.20 11.35 66161
4 1951 4.60 4.46 3.93 11.14 15.15 11.13 68094
5 1952 7.15 4.99 4.88 16.34 20.05 22.81 107080
6 1953 9.70 5.65 4.91 8.88 8.15 7.41 67594
filter.water <- water[,-1] ## Remove unwanted year
# Visualize the data
library(GGally)
ggpairs(filter.water) ## Pairwise plots; this is a multivariable regression
LM magic begins, thanks to R
The model is y_i = b_0 + b_1 x_i1 + b_2 x_i2 + … + b_p x_ip + e_i for i = 1, 2, …, n. Here y is BSAAM and x_1, …, x_p are all the other variables (p = 6).
mlr <- lm(BSAAM~., data = filter.water)
summary(mlr)
# Output
Call:
lm(formula = BSAAM ~ ., data = filter.water)
Residuals:
Min 1Q Median 3Q Max
-12690 -4936 -1424 4173 18542
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15944.67 4099.80 3.889 0.000416 ***
APMAM -12.77 708.89 -0.018 0.985725
APSAB -664.41 1522.89 -0.436 0.665237
APSLAKE 2270.68 1341.29 1.693 0.099112 .
OPBPC 69.70 461.69 0.151 0.880839
OPRC 1916.45 641.36 2.988 0.005031 **
OPSLAKE 2211.58 752.69 2.938 0.005729 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7557 on 36 degrees of freedom
Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
Output Explained
Residuals
The residuals are the differences between the observed values of the dependent variable (Y) and the values predicted by the model. The summary gives their minimum, first quartile, median, third quartile and maximum. This block is not normally used directly in the analysis.
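To make the five numbers concrete, here is a minimal Python sketch that computes the same kind of summary. The observed and predicted values are made up purely for illustration (they are not from the water data), and the sketch assumes that the `inclusive` method of `statistics.quantiles` corresponds to R's default quantile rule.

```python
import statistics

# Hypothetical observed and predicted values, for illustration only
observed = [10.0, 12.0, 15.0, 19.0, 25.0]
predicted = [11.0, 12.5, 14.0, 20.0, 23.0]

# Residual = observed value minus predicted value
residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]

# Quartiles computed with linear interpolation between data points
q1, median, q3 = statistics.quantiles(residuals, n=4, method="inclusive")

five_numbers = {
    "Min": min(residuals),
    "1Q": q1,
    "Median": median,
    "3Q": q3,
    "Max": max(residuals),
}
print(five_numbers)
```

A median near 0 and quartiles of similar size on both sides are usually read as a sign that the residuals are roughly symmetric.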
Coefficients-Intercept
Along with all the variables, the table has one more row, ‘(Intercept)’. The intercept is the expected value of y when all the predictors are 0. It is not much used in normal cases, especially when a value of 0 for every predictor is not realistic.
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 15944.67 4099.80 3.889 0.000416 ***
Coefficient-Estimate
The estimate is the expected change in Y for a one-unit increase in X, holding the other predictors fixed. In this case, a one-unit increase in OPSLAKE is associated with an increase of 2211.58 units in BSAAM.
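The reading of an estimate can be checked by hand. The sketch below plugs the (rounded) estimates from the coefficient table into the regression equation for the 1948 row shown in head(water), then raises OPSLAKE by one unit. The `predict` helper is a hypothetical name introduced only for this illustration; because the printed estimates are rounded, the prediction is approximate.

```python
# Coefficient estimates copied from the summary(mlr) output (rounded as printed)
coef = {
    "(Intercept)": 15944.67,
    "APMAM": -12.77,
    "APSAB": -664.41,
    "APSLAKE": 2270.68,
    "OPBPC": 69.70,
    "OPRC": 1916.45,
    "OPSLAKE": 2211.58,
}

def predict(row):
    """Intercept plus the sum of (coefficient * predictor value)."""
    return coef["(Intercept)"] + sum(coef[name] * value for name, value in row.items())

# Predictor values of the first row of head(water), year 1948
row_1948 = {"APMAM": 9.13, "APSAB": 3.58, "APSLAKE": 3.91,
            "OPBPC": 4.10, "OPRC": 7.43, "OPSLAKE": 6.47}

base = predict(row_1948)

# Same row with OPSLAKE raised by one unit, everything else unchanged
bumped = predict({**row_1948, "OPSLAKE": row_1948["OPSLAKE"] + 1})

print(round(base, 1))           # approximate predicted BSAAM for 1948
print(round(bumped - base, 2))  # the change equals the OPSLAKE estimate
```

The observed BSAAM for 1948 is 54235, so the gap between it and the prediction is that row's residual.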
Coefficient-Std. Error
The standard deviation of an estimate is called the standard error. The standard error of the coefficient measures how precisely the model estimates the coefficient’s unknown value. The standard error of the coefficient is always positive.
A low standard error, relative to the size of the estimate, is helpful for our analysis. It is also used to build confidence intervals.
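As a rough illustration of how the standard error feeds a confidence interval, the sketch below builds an approximate 95% interval for OPSLAKE as estimate ± critical value × standard error. The critical value 2.028 is an assumption taken from a standard t table for 36 degrees of freedom; in R, `confint(mlr)` would be the usual way to get these intervals.

```python
# Estimate and standard error of OPSLAKE, copied from the coefficient table
estimate = 2211.58
std_error = 752.69

# Assumption: 0.975 quantile of the t distribution with 36 degrees of freedom,
# read from a standard t table (about 2.028)
t_crit = 2.028

lower = estimate - t_crit * std_error
upper = estimate + t_crit * std_error
print(f"Approximate 95% CI for OPSLAKE: ({lower:.1f}, {upper:.1f})")
```

The interval does not contain 0, which agrees with OPSLAKE being marked as significant in the table.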
Coefficient-t value
t value = estimate/std error
A high absolute t value is helpful for our analysis, as it indicates we can reject the null hypothesis that the coefficient is 0. It is used to calculate the p value.
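The formula is easy to verify against the output. The short sketch below recomputes each t value from the estimate and standard error copied from the coefficient table and prints it next to the value R reported; small differences come only from rounding in the printed table.

```python
# (estimate, std. error, printed t value) per row, copied from the coefficient table
table = {
    "(Intercept)": (15944.67, 4099.80, 3.889),
    "APMAM": (-12.77, 708.89, -0.018),
    "APSAB": (-664.41, 1522.89, -0.436),
    "APSLAKE": (2270.68, 1341.29, 1.693),
    "OPBPC": (69.70, 461.69, 0.151),
    "OPRC": (1916.45, 641.36, 2.988),
    "OPSLAKE": (2211.58, 752.69, 2.938),
}

# t value = estimate / standard error
computed = {name: est / se for name, (est, se, _) in table.items()}

for name, (_, _, printed_t) in table.items():
    print(f"{name:12s} computed {computed[name]:7.3f}   printed {printed_t:7.3f}")
```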
Coefficient Pr(>|t|)
This is the individual p value for each parameter, used to decide whether to reject the null hypothesis that the coefficient is 0, i.e. that x has no linear relationship with y once the other predictors are in the model. The lower the p value, the stronger the evidence against the null hypothesis. Reading p values wrongly leads to both kinds of error: false positives (Type I) and false negatives (Type II).
The asterisks next to the p value mark its significance level: the lower the p value, the higher the significance.
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
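To show where Pr(>|t|) comes from, here is a minimal standard-library sketch that turns a t value and the residual degrees of freedom into a two-sided p value, then maps it to the significance code. It integrates the t density numerically with Simpson's rule, which is a choice made here for self-containment; in practice `scipy.stats.t.sf` would be the usual tool. The helper names `t_pdf`, `two_sided_p` and `stars` are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t_value, df, steps=2000):
    """P(|T| > |t|): 1 minus the area under the density on [-|t|, |t|].

    The area is approximated with Simpson's rule; steps must be even.
    """
    a = abs(t_value)
    h = 2 * a / steps
    area = t_pdf(-a, df) + t_pdf(a, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(-a + i * h, df)
    return 1 - area * h / 3

def stars(p):
    """Significance code following the legend printed by R."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    if p < 0.1:
        return "."
    return " "

# t values copied from the coefficient table; 36 residual degrees of freedom
for name, t_value in [("(Intercept)", 3.889), ("APSLAKE", 1.693), ("OPSLAKE", 2.938)]:
    p = two_sided_p(t_value, 36)
    print(f"{name:12s} t = {t_value:6.3f}   p = {p:.6f} {stars(p)}")
```

The results should agree with the printed column up to the rounding of the t values.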
Residual standard error
Residual standard error: 7557 on 36 degrees of freedom
In plain words, this is the average error of the model, i.e. how far the model's predictions are from the data on average.
The degrees of freedom are the number of data points left for estimating the error after accounting for the estimated parameters. In this case we have 43 data points and 7 estimated parameters (6 predictors plus the intercept), so 43 - 7 = 36 degrees of freedom.
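The two quantities on this line can be sketched in a few lines of Python. The first part redoes the degrees-of-freedom arithmetic for the water data; the second part applies the usual formula, the square root of the residual sum of squares divided by the degrees of freedom, to a few made-up residuals. The helper `residual_standard_error` and the toy numbers are assumptions for illustration only.

```python
import math

# Degrees of freedom for the water model
n_obs = 43                     # rows in the water data
n_predictors = 6               # APMAM, APSAB, APSLAKE, OPBPC, OPRC, OPSLAKE
n_params = n_predictors + 1    # predictors plus the intercept
df_resid = n_obs - n_params
print(df_resid)

def residual_standard_error(residuals, n_params):
    """sqrt(RSS / (n - number of estimated parameters))."""
    df = len(residuals) - n_params
    rss = sum(r * r for r in residuals)
    return math.sqrt(rss / df)

# Hypothetical residuals from a model with 2 parameters (intercept and one slope)
toy_residuals = [-1.0, -0.5, 1.0, -1.0, 2.0]
toy_rse = residual_standard_error(toy_residuals, n_params=2)
print(round(toy_rse, 4))
```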
Multiple R-squared and Adjusted R-squared
Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
R-squared is always between 0 and 1, and higher values are better. It is the percentage of variation in the response variable that is explained by the explanatory variables, so it measures how well the model explains the data. Adding more variables never decreases it, even when the new variables are not useful, so on its own it gives no limit on how many variables to add.
That is why we look at the adjusted value: adjusted R-squared penalizes the number of predictors, so it only goes up when a new variable improves the model more than would be expected by chance.
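The adjusted value can be reproduced from the numbers in the output using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A small sketch, with the inputs copied from the summary:

```python
r_squared = 0.9248   # Multiple R-squared from the output
n_obs = 43           # observations in the water data
n_predictors = 6     # predictors in the model (intercept not counted)

# Adjusted R-squared shrinks R-squared according to the number of predictors
adjusted = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
print(round(adjusted, 4))
```

The result agrees with the Adjusted R-squared printed by R.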
F-statistic
F-statistic: 73.82 on 6 and 36 DF
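This tests the relationship between the predictors and the response as a whole: the higher the value, the more reason to reject the null hypothesis that all coefficients (other than the intercept) are 0. It is the significance of the overall model, not of any specific parameter.
DF stands for degrees of freedom: 6 for the model (one per predictor) and 36 for the residuals.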
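The F-statistic is tied to R-squared through the standard identity F = (R²/p) / ((1 - R²)/(n - p - 1)). The sketch below recomputes it from the printed R-squared; since that input is rounded to four decimals, the result is only expected to land close to the printed 73.82, not match it exactly.

```python
r_squared = 0.9248   # Multiple R-squared from the output (rounded)
df_model = 6         # number of predictors
df_resid = 36        # residual degrees of freedom

# Explained variation per model df, divided by unexplained variation per residual df
f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)
print(round(f_stat, 2))
```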
p-value
p-value: < 2.2e-16
This is the overall p value based on the F-statistic. Normally a p value below 0.05 indicates that the overall model is significant.
So the Python
I am using the OLS (ordinary least squares) implementation from statsmodels, whose formula interface mirrors R's lm. SciPy can also fit a regression, though scipy.stats.linregress handles only a single predictor.
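import pandas as pd
from statsmodels.formula.api import ols

# Assumption: the water data was exported from R beforehand, e.g. with
# write.csv(water, "water.csv", row.names = FALSE); the file name is hypothetical
df = pd.read_csv("water.csv")
mlr = ols("BSAAM ~ OPSLAKE + OPRC + OPBPC + APSLAKE + APSAB + APMAM", data=df).fit()
print(mlr.summary())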
Most of the parameters match the R output, and the remaining ones can be used for further research work :)
All the descriptions are based on my general understanding. Please let me know if something is wrong; your feedback is highly welcome.
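For readers who want the individual numbers rather than the printed table, the sketch below shows which attributes of a fitted statsmodels result correspond to each part of the R output. It runs on synthetic data generated on the spot (the column names `x1`, `x2`, `y` and the coefficients are made up), so the values themselves mean nothing; only the mapping is the point.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data, for illustration only: 43 rows, two predictors
rng = np.random.default_rng(0)
n = 43
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["y"] = 3.0 + 2.0 * toy["x1"] - 1.0 * toy["x2"] + rng.normal(scale=0.5, size=n)

fit = ols("y ~ x1 + x2", data=toy).fit()

print(fit.params)                       # Estimate
print(fit.bse)                          # Std. Error
print(fit.tvalues)                      # t value
print(fit.pvalues)                      # Pr(>|t|)
print(np.sqrt(fit.mse_resid))           # Residual standard error
print(fit.df_resid)                     # residual degrees of freedom
print(fit.rsquared, fit.rsquared_adj)   # Multiple and Adjusted R-squared
print(fit.fvalue, fit.f_pvalue)         # F-statistic and its p-value
```

The same attributes are available on the `mlr` object fitted to the water data.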