Interpret R Linear/Multiple Regression output

(lm output point by point), also with Python

Vineet Jaiswal
Analytics Vidhya
5 min read · Feb 17, 2018

Linear regression is a simple, basic, yet very powerful approach to supervised learning. It works well for predictive analysis and gives you a baseline approach to any data set before you move on to more complex machine learning algorithms.

Linear regression has already been discussed a lot. Almost every book that teaches analysis describes it, and much more material is available on the internet, so I am leaving out most of the detail and keeping only the basics: it is all about predicting a quantitative response Y from a single predictor X (or, in multiple regression, from several predictors), on the assumption that the relationship between them is linear. Of course the coefficients and the intercept play a deciding role, and don't forget the random error, which makes everything more real and earthly, almost everywhere! More detail is available at https://en.wikipedia.org/wiki/Linear_regression

Assuming we know enough about the concept, let's try our hand at the real thing, i.e. writing code in R/Python. We start with R and take an already clean, standard data set. Why? Because getting and cleaning the data, followed by data wrangling, is almost 60–70% of any data science or machine learning assignment.

Know your data

LM magic begins, thanks to R

The model looks like $y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} + e_i$ for $i = 1, 2, \dots, n$. Here y = BSAAM and $x_1, \dots, x_p$ are all the other variables.
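
To make the "least squares" idea behind lm concrete, here is a minimal sketch for the one-predictor case, using only the Python standard library. The data is a made-up toy set, not the data used in this article, and the function name fit_simple_ols is my own; lm handles many predictors at once, which this sketch does not.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (b0, b1) that minimise the sum of squared residuals of y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)                        # spread of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # co-variation of x and y
    b1 = sxy / sxx               # slope
    b0 = y_bar - b1 * x_bar      # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Made-up toy data (an assumption for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

With more than one predictor the same principle applies, but the solution is computed with matrix algebra rather than these two closed-form lines.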

Output Explained

Residuals

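A residual is the difference between the observed value of the dependent variable (Y) and the value the model predicts for it (the fitted value, not X itself). This block gives a five-number summary of the residuals: minimum, first quartile, median, third quartile and maximum. It is not used much in routine analysis, but a median close to 0 and roughly symmetric quartiles are a quick sign that the errors are well behaved.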
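
As a rough illustration of what the Residuals block summarises, the sketch below computes the same five numbers in plain Python. The observed and fitted values are made up, and residual_summary is a hypothetical helper name. I use the "inclusive" quantile method because it interpolates the same way as R's default quantile rule (type 7).

```python
from statistics import quantiles

def residual_summary(observed, fitted):
    """Five-number summary of residuals (observed minus fitted)."""
    residuals = [o - f for o, f in zip(observed, fitted)]
    q1, median, q3 = quantiles(residuals, n=4, method="inclusive")
    return {
        "Min": min(residuals),
        "1Q": q1,
        "Median": median,
        "3Q": q3,
        "Max": max(residuals),
    }

# Made-up toy values (assumptions for illustration only).
observed = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [2.04, 4.03, 6.02, 8.01, 10.0]

summary = residual_summary(observed, fitted)
for label, value in summary.items():
    print(f"{label:>6}: {value: .2f}")
```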

Coefficients-Intercept

Along with all the variables, the table has one more row, 'Intercept'. The intercept is the expected value of Y when all the variables are 0, i.e. what the model gives before any variable is taken into account. It is not used much in normal cases, especially when x = 0 is not a realistic value in the data.

Coefficient-Estimate

This is the expected change in Y for a one-unit increase in X, holding all the other variables constant. In this case, a one-unit change in OPS LAKE goes with a 2211.58-unit change in BSAAM.
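
A small sketch may help show what the intercept and an estimate mean in practice. The coefficients and variable names below (x1, x2) are hypothetical, not taken from the lm output in this article.

```python
def predict(intercept, coefs, row):
    """Linear prediction: intercept plus the sum of coefficient * value."""
    return intercept + sum(coefs[name] * row[name] for name in coefs)

# Hypothetical fitted model (assumed values for illustration only).
intercept = 10.0
coefs = {"x1": 2.5, "x2": -1.0}

base = {"x1": 4.0, "x2": 3.0}
bumped = {**base, "x1": base["x1"] + 1.0}   # raise x1 by one unit, keep x2 fixed

change = predict(intercept, coefs, bumped) - predict(intercept, coefs, base)
at_zero = predict(intercept, coefs, {"x1": 0.0, "x2": 0.0})

print(change)    # 2.5, the coefficient of x1
print(at_zero)   # 10.0, the intercept
```

The prediction moves by exactly the coefficient of x1 only because x2 is held fixed, which is the "all other variables constant" condition.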

Coefficient-Std. Error

The standard deviation of an estimate is called the standard error. The standard error of the coefficient measures how precisely the model estimates the coefficient’s unknown value. The standard error of the coefficient is always positive.

A low standard error is helpful for our analysis, since it means the estimate is precise. It is also used to build the confidence interval for the coefficient.

Coefficient-t value

t value = estimate/std error

A t value that is large in absolute terms is helpful for our analysis, because it indicates that we can reject the null hypothesis that the coefficient is 0. It is also used to calculate the p value.
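
To show how the estimate, the standard error and the t value fit together, here is a minimal sketch for a one-predictor model on made-up data. The function name slope_t_value is my own. Turning the t value into a p value needs the t distribution, which the standard library does not provide, so I stop at the t value.

```python
from math import sqrt
from statistics import mean

def slope_t_value(x, y):
    """Return (estimate, std_error, t_value) for the slope of y = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma = sqrt(rss / (n - 2))      # residual standard error; 2 coefficients were estimated
    std_error = sigma / sqrt(sxx)    # standard error of the slope
    return b1, std_error, b1 / std_error

# Made-up toy data (an assumption for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

estimate, std_error, t_value = slope_t_value(x, y)
print(f"estimate = {estimate:.3f}, std error = {std_error:.4f}, t value = {t_value:.2f}")
```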

Coefficient Pr(>|t|)

This is the individual p value for each parameter, used to decide whether to reject the null hypothesis that the coefficient is 0, i.e. that there is no relationship between that x and y. The lower the p value, the stronger the evidence against the null hypothesis. If we misread the p value, both kinds of error come into the picture: false positives (Type I) and false negatives (Type II).

The asterisks beside the p value mark its significance level: the lower the p value, the higher the significance.

Residual standard error

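In plain words, this is the average error of the model: how far off, on average, its predictions are from the data. It is the square root of the residual sum of squares divided by the degrees of freedom.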

The degrees of freedom are the number of data points left for estimating the error once the estimated parameters are taken into account: the number of observations minus the number of estimated coefficients (the intercept counts as one). In this case we have 43 data points in total and 7 estimated coefficients, so 43 - 7 = 36 degrees of freedom.
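
The arithmetic above can be sketched in a few lines. The residuals are made up and residual_standard_error is a hypothetical helper name. The last line simply repeats the 43 - 7 calculation from this article.

```python
from math import sqrt

def residual_standard_error(residuals, n_coefficients):
    """Return (rse, df), where df = observations - estimated coefficients."""
    df = len(residuals) - n_coefficients
    rss = sum(r * r for r in residuals)   # residual sum of squares
    return sqrt(rss / df), df

# Made-up residuals from a model with an intercept and one slope (assumed values).
residuals = [0.06, -0.13, 0.18, -0.21, 0.10]
rse, df = residual_standard_error(residuals, n_coefficients=2)
print(f"Residual standard error: {rse:.4f} on {df} degrees of freedom")

# Degrees of freedom for the model in this article: 43 observations, 7 coefficients.
article_df = 43 - 7
print(article_df)   # 36
```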

Multiple R-squared and Adjusted R-squared

Multiple R-squared is always between 0 and 1, and higher values are better. It is the share of the variation in the response variable that is explained by variation in the explanatory variables, so it tells us how well the model explains the data. When we increase the number of variables it never goes down, even if the new variables are useless, so on its own it cannot tell us when to stop adding variables.

That is why we also look at the adjusted value. Adjusted R-squared penalises the number of variables in the model, so it goes up only when a new variable improves the fit by more than would be expected by chance.
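
Here is a minimal sketch of the two formulas, as I understand them from the standard definitions. The toy data is made up, and the 0.9 in the last call is a hypothetical R-squared, not the value from this article's output.

```python
def r_squared(observed, fitted):
    """Share of the variation in observed values explained by the fitted values."""
    y_bar = sum(observed) / len(observed)
    rss = sum((o - f) ** 2 for o, f in zip(observed, fitted))   # unexplained variation
    tss = sum((o - y_bar) ** 2 for o in observed)               # total variation
    return 1 - rss / tss

def adjusted_r_squared(r2, n_obs, n_predictors):
    """R-squared with a penalty for the number of predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Made-up toy values (assumptions for illustration only).
observed = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [2.04, 4.03, 6.02, 8.01, 10.0]

r2 = r_squared(observed, fitted)
adj = adjusted_r_squared(r2, n_obs=len(observed), n_predictors=1)
print(f"Multiple R-squared: {r2:.4f}, Adjusted R-squared: {adj:.4f}")

# Same hypothetical R-squared of 0.9, 43 observations, 6 predictors.
print(f"{adjusted_r_squared(0.9, 43, 6):.4f}")
```

The adjusted value is always a little below the plain one, and the gap grows as more predictors are added for the same number of observations.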

F-statistic

This tests the relationship between the predictors and the response as a whole. The higher the value, the more reason we have to reject the null hypothesis that all the coefficients are 0. It measures the significance of the overall model, not of any specific parameter.

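DF: degrees of freedom. The F-statistic is reported with two of them, the number of predictors and the residual degrees of freedom.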
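
The F-statistic can be computed from R-squared and the two degrees of freedom. The sketch below uses a hypothetical R-squared of 0.9 with 43 observations and 6 predictors. The 6 is my assumption, chosen to be consistent with the 36 residual degrees of freedom discussed earlier. The overall p value needs the F distribution, which is not in the standard library, so it is left out.

```python
def f_statistic(r2, n_obs, n_predictors):
    """Return (F, df_model, df_residual) for the overall significance test."""
    df_model = n_predictors                 # numerator degrees of freedom
    df_residual = n_obs - n_predictors - 1  # denominator degrees of freedom
    f_value = (r2 / df_model) / ((1 - r2) / df_residual)
    return f_value, df_model, df_residual

# Hypothetical inputs (assumed values for illustration only).
f_value, df_model, df_residual = f_statistic(0.9, n_obs=43, n_predictors=6)
print(f"F-statistic: {f_value:.2f} on {df_model} and {df_residual} DF")
```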

p-value

This is the overall p value based on the F-statistic. Normally, a p value of less than 0.05 indicates that the overall model is significant.

So the Python

I am using the OLS (ordinary least squares) approach, but the same can be produced using SciPy, which gives a more standard result.

Most of the parameters match the R output, and the rest can be used for further research work :)

All of the descriptions are based on general understanding. Please let me know if something is wrong; your feedback is highly welcome.
