# Interpret R Linear/Multiple Regression output (lm output point by point), also with Python

Linear regression is a simple, basic, yet very powerful approach to supervised learning. It is well suited to predictive analysis and gives you a solid baseline on any dataset before moving to more complex machine learning algorithms.

Linear regression is already discussed a lot: almost every book that teaches analysis describes it, and much more material is available on the internet. So I will skip the details except for the basic idea: it is all about predicting a quantitative response Y from a predictor X, under the assumption that there is a linear relationship between them. Of course, coefficients and an intercept also play a deciding role, and don't forget the random error term, which makes everything more real and earthly, almost everywhere! More detail is available at https://en.wikipedia.org/wiki/Linear_regression

Assuming we know enough about the concept, let's try our hand at the real thing, i.e. writing code in R/Python. First we work in R, on an already clean, standard dataset. Why? Because getting and cleaning data, then wrangling it, is almost 60-70% of any data science or machine learning assignment.

# Know your data

```r
library(alr3)     ## Loading required package: car
library(corrplot)

data(water)       ## load the data
head(water)       ## view the data
```

```
  Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594
```

```r
filter.water <- water[,-1]  ## remove the unwanted Year column

## Visualize the data (it's a multivariable regression)
library(GGally)
ggpairs(filter.water)
```

# LM magic begins, thanks to R

The model has the form yi = b0 + b1·xi1 + b2·xi2 + … + bp·xip + ei for i = 1, 2, …, n. Here y is BSAAM and x1…xp are all the other variables.

```r
mlr <- lm(BSAAM ~ ., data = filter.water)
summary(mlr)
```

Output:

```
Call:
lm(formula = BSAAM ~ ., data = filter.water)

Residuals:
   Min     1Q Median     3Q    Max
-12690  -4936  -1424   4173  18542

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 15944.67    4099.80   3.889 0.000416 ***
APMAM         -12.77     708.89  -0.018 0.985725
APSAB        -664.41    1522.89  -0.436 0.665237
APSLAKE      2270.68    1341.29   1.693 0.099112 .
OPBPC          69.70     461.69   0.151 0.880839
OPRC         1916.45     641.36   2.988 0.005031 **
OPSLAKE      2211.58     752.69   2.938 0.005729 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7557 on 36 degrees of freedom
Multiple R-squared:  0.9248, Adjusted R-squared:  0.9123
F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16
```

# Output Explained

# Residuals

The residuals are the differences between the observed values of the dependent variable (Y) and the values predicted by the model. The summary gives their five-number breakdown: minimum, first quartile, median, third quartile and maximum. This section is normally just a quick sanity check (residuals roughly symmetric and centred near zero are a good sign) rather than the focus of the analysis.
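As a small illustration (with made-up residual values, not the actual residuals of the water model), the five-number summary that R prints can be reproduced with Python's standard library:

```python
import statistics

# Hypothetical residuals (observed y minus fitted y), for illustration only
residuals = [-12.6, -4.9, -1.4, 4.1, 18.5, -3.0, 2.2, 0.7]

# Quartile cut points: first quartile, median, third quartile
q1, median, q3 = statistics.quantiles(residuals, n=4)

print(min(residuals), q1, median, q3, max(residuals))
```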

# Coefficients-Intercept

Alongside all the predictor variables there is one extra row, 'Intercept'. The intercept is the average value of Y when every predictor is 0, i.e. the model's baseline before any variable contributes. In many practical cases it is not of much direct interest.

```
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 15944.67    4099.80   3.889 0.000416 ***
```

# Coefficient-Estimate

The estimate is the expected change in Y for a one-unit increase in that X, holding the other predictors fixed. In this case, a one-unit increase in OPSLAKE corresponds to an expected 2211.58-unit increase in BSAAM.
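A minimal sketch of this interpretation, using the OPSLAKE coefficient from the summary above (the helper function is just for illustration):

```python
# Fitted coefficient for OPSLAKE, taken from the R summary above
coef_opslake = 2211.58

def predicted_change(delta_x, coef):
    """Expected change in the response when one predictor changes by delta_x,
    holding the other predictors fixed."""
    return coef * delta_x

print(predicted_change(1.0, coef_opslake))   # one-unit increase in OPSLAKE
print(predicted_change(-2.0, coef_opslake))  # two-unit decrease
```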

# Coefficient-Std. Error

The standard deviation of an estimate is called the standard error. The standard error of the coefficient measures how precisely the model estimates the coefficient’s unknown value. The standard error of the coefficient is always positive.

A low standard error is helpful for our analysis; the standard error is also used to construct confidence intervals for the coefficient.

# Coefficient-t value

t value = estimate/std error

A high t value (in absolute terms) is helpful for our analysis, as it indicates we can reject the null hypothesis that the coefficient is zero; the t value is used to calculate the p value.
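For example, the t value R reports for OPRC can be reproduced from the estimate and standard error in the summary above:

```python
# OPRC row from the R summary: estimate 1916.45, standard error 641.36
estimate = 1916.45
std_error = 641.36

t_value = estimate / std_error
print(round(t_value, 3))  # 2.988, matching the R output
```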

# Coefficient Pr(>|t|)

This is the individual p value for each parameter, used to accept or reject the null hypothesis that the parameter's true coefficient is zero. The lower the p value, the stronger the case for rejecting the null hypothesis. All the types of error (true/false positives and negatives) come into the picture if we analyse p values wrongly.

The asterisks beside each p value mark its significance level: the lower the p value, the higher the significance.

`# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1`
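As a rough sketch of how a t value maps to a two-sided p value: with 36 degrees of freedom the t distribution is close to the normal, so a normal approximation built on `math.erfc` gives a ballpark figure without SciPy. R uses the exact t distribution, so its p values are slightly larger than this approximation.

```python
import math

def approx_two_sided_p(t):
    """Two-sided p value for a test statistic, via a normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))

# t values from the summary above; R reports 0.005031 and 0.985725
print(approx_two_sided_p(2.988))   # well below 0.05: reject the null
print(approx_two_sided_p(-0.018))  # close to 1: no evidence against the null
```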

# Residual standard error

`Residual standard error: 7557 on 36 degrees of freedom`

Roughly speaking, this is the average error of the model: how far the observations fall from the fitted values on average.

The degrees of freedom are the number of data points left for estimation after accounting for the estimated parameters. Here we have 43 data points and 7 estimated coefficients (the intercept plus 6 predictors), giving 43 - 7 = 36 degrees of freedom.
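A quick check of the arithmetic. The residual sum of squares (RSS) below is back-solved from the reported RSE of 7557, so the numbers are illustrative rather than recomputed from the raw data:

```python
import math

n = 43  # observations in the water data
p = 7   # estimated coefficients: the intercept plus 6 predictors
df = n - p
print(df)  # 36, as in the R output

# RSE = sqrt(RSS / df); here RSS is back-solved from the reported RSE
rss = 7557 ** 2 * df
print(round(math.sqrt(rss / df)))  # 7557
```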

# Multiple R-squared and Adjusted R-squared

`Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123`

It is always between 0 and 1, and higher values are better. It is the percentage of variation in the response variable that is explained by variation in the explanatory variables, i.e. a measure of how well the model explains the data. Multiple R-squared always increases when we add more variables, even useless ones, so on its own it gives no natural stopping point.

Adjusted R-squared corrects for this by penalizing the number of predictors: it increases only when a new variable improves the model more than would be expected by chance, which makes it the better figure for comparing models with different numbers of variables.
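The adjustment itself is simple arithmetic; using the figures from the summary above (n = 43 observations, k = 6 predictors):

```python
# adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
r2 = 0.9248   # Multiple R-squared from the R summary
n, k = 43, 6  # observations and predictors

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))  # 0.9123, matching the R output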

# F-statistic

`F-statistic: 73.82 on 6 and 36 DF`

This shows the relationship between the predictors and the response taken together: the higher the value, the more reason we have to reject the null hypothesis that no predictor matters. It measures the significance of the overall model, not of any specific parameter.

DF: degrees of freedom
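The F-statistic can also be recovered from R-squared, as explained variance versus unexplained variance, each per degree of freedom. Because the rounded R-squared (0.9248) is used here, the result only approximately matches R's 73.82:

```python
# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
r2, k, n = 0.9248, 6, 43

f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(f_stat, 1))  # 73.8, close to R's 73.82
```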

# p-value

`p-value: < 2.2e-16`

This is the overall p value, based on the F-statistic. A p value below 0.05 normally indicates that the overall model is significant.

# So the Python

I am using the OLS (ordinary least squares) approach from statsmodels; similar results can also be produced using SciPy.

```python
import pandas as pd
import scipy.stats as stats
from statsmodels.formula.api import ols

# df is assumed to be a pandas DataFrame that holds the water dataset
mlr = ols("BSAAM ~ OPSLAKE + OPRC + OPBPC + APSLAKE + APSAB + APMAM", df).fit()
print(mlr.summary())
```

Most of the parameters match the R output, and the remaining ones can be used for the next piece of research work :)

All of the descriptions above are based on general understanding. Please let me know if anything is wrong; your feedback is highly welcome.