Simple Linear Regression — 7
If you have not read part 6 of the R data analysis series, kindly go through the following article, where we discussed Advanced Statistics Using R — 6.
The contents of this article are a gist of a couple of books that I was introduced to during my IIM-B days.
R for Everyone — Jared P. Lander
Practical Data Science with R — Nina Zumel & John Mount
All the code blocks discussed in the article are available as R markdown at the GitHub link.
To see all the articles written by me kindly use the link, Vivek Srinivasan.
The workhorse of statistical analysis is the linear model, particularly regression. Originally invented by Francis Galton to study the relationships between parents and children, which he described as regressing to the mean, it has become one of the most widely used modeling techniques and has spawned other models such as generalized linear models, regression trees, penalized regression, and many others.
In this article, our discussion mostly revolves around the fundamentals of regression and its association with ANOVA in general. Later we will see an example to demonstrate how regression is applied practically using R. In the following article, we will cover multiple linear regression with a case study and also discuss important model diagnostics approaches.
Simple Linear Regression
In its simplest form, regression is used to determine the relationship between two variables. That is, given one variable, it tells us what we can expect of the other. This powerful tool, which is frequently taught and can accomplish a great deal of analysis with minimal effort, is called simple linear regression.
Before we go any further, we clarify some terminology. The outcome variable (what we are trying to predict) is called the response, and the input variable (what we are using to predict) is the predictor. Fields outside of statistics use other terms, such as measured variable, outcome variable and experimental variable for response, and covariate, feature and explanatory variable for predictor. Worst of all are the terms dependent (response) and independent (predictor) variables. These names are misnomers: according to probability theory, if variable y is dependent on variable x, then variable x cannot be independent of variable y. So we stick with the terms response and predictor exclusively.
The general idea behind simple linear regression is using the predictor to come up with some average value of the response. The relationship is defined as

y = a + b*x + error

The equation essentially describes a straight line that goes through the data, where a is the y-intercept and b is the slope. In this case, we are using the fathers' heights as the predictor and the sons' heights as the response. The blue line running through the points is the regression line and the gray band around it represents the uncertainty in the fit. The error term in the equation says that the errors around the line are normally distributed.
data(father.son, package='UsingR')
library(ggplot2)
head(father.son)
Now let us plot these data to understand the relationship between the variables.
ggplot(father.son, aes(x=fheight, y=sheight)) + geom_point() + geom_smooth(method="lm") + labs(x="Fathers", y="Sons")
While that code generated a nice graph showing the results of the regression (produced by geom_smooth(method="lm")), it did not actually make those results available to us. To actually calculate a regression, use the lm function.
heightsLM <- lm(sheight ~ fheight, data=father.son)
heightsLM
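Once the model object exists, it can be used to make predictions for new fathers. The sketch below is a small illustration (assuming heightsLM has been fit as above); predict() returns the expected son's height for any father's height, and can optionally include a confidence band around the fit.

```r
# Expected sons' heights for fathers who are 65 and 72 inches tall
# (assumes heightsLM has been fit as above)
newFathers <- data.frame(fheight = c(65, 72))
predict(heightsLM, newdata = newFathers)

# The same predictions with a confidence interval around the fitted line
predict(heightsLM, newdata = newFathers, interval = "confidence")
```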
Here we once again see the formula notation that specifies to regress sheight (the response) on fheight (the predictor), using the father.son data; the intercept term is added automatically. The results show coefficients for (Intercept) and for fheight, which is the slope for the fheight predictor. The interpretation is that, for every extra inch of height in a father, we expect an extra half-inch in height for his son. The intercept, in this case, does not make much sense because it represents the height of a son whose father was zero inches tall, which obviously cannot exist in reality.
While the point estimates for the coefficients are nice, they are not very helpful without the standard errors, which give a sense of the uncertainty about the estimate and are similar to standard deviations. To quickly see a full report on the model, use summary.
summary(heightsLM)
This prints out a lot more information about the model, including the standard errors, t-test values and p-values for the coefficients, the degrees of freedom, residual summary statistics and the results of an F-test. This is all diagnostic information to check the fit of the model; the details of these statistics are covered extensively when we discuss multiple linear regression in the next article. For now, let us take a step back and discuss how linear regression can be utilized in place of ANOVA.
ANOVA Alternative
An alternative to running an ANOVA test (discussed in the previous article) is to fit a regression with just one categorical variable and no intercept term. To see this, we use the tips data in the reshape2 package, on which we will fit a regression.
data(tips, package="reshape2")
head(tips)
We will first fit an ANOVA model similar to how we did in our previous article. Setting aside all the other statistical measures displayed below, concentrating on the p-value alone shows us that there is a significant difference between days in the tips data.
# putting -1 in the formula indicates that the intercept should not be
# included in the model;
# the categorical variable day is automatically setup to have a
# coefficient for each level
tipsAnova <- aov(tip ~ day - 1, data=tips)
summary(tipsAnova)
Now let us fit the regression model for the same data and compare the results.
tipsLM <- lm(tip ~ day - 1, data=tips)
summary(tipsLM)
Let us understand the different statistical measures displayed above and their importance in model building.

R-squared measures the strength of the relationship between your model and the dependent variable. However, it is not a formal test for the relationship. We have a really good R-squared value for the model, which indicates our model's variables explain most of the variance compared to the mean estimate.
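R-squared can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses the father.son model from earlier, which includes an intercept; note that for intercept-free models like tipsLM, R reports an R-squared based on uncentered totals, so the formula differs.

```r
# R-squared = 1 - SS_residual / SS_total (centered), shown for heightsLM,
# which includes an intercept
ssRes <- sum(residuals(heightsLM)^2)
ssTot <- sum((father.son$sheight - mean(father.son$sheight))^2)
1 - ssRes / ssTot  # matches summary(heightsLM)$r.squared
```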
The F-test of overall significance is the hypothesis test for this relationship. If the overall F-test is significant, you can conclude that R-squared does not equal zero and that the correlation between the model and the dependent variable is statistically significant.
Compare the p-value for the F-test to your significance level. If the p-value is less than the significance level, your sample data provide sufficient evidence to conclude that your regression model fits the data better than the model with no independent variables. As we see from the results, our p-value is less than the significance level, so we can safely conclude that the overall significance of the model is good.
While the F-test measures the overall significance of the model, the t-test, on the other hand, measures the significance of each variable in the model. Similar to the F-test, the p-value of each variable is compared to the significance level. If the p-value is less than the significance level, then you can conclude that the individual predictor variable is statistically significant.
Generally speaking, if none of your independent variables are statistically significant, the overall F-test is also not statistically significant. Occasionally, the tests can produce conflicting results. This disagreement can occur because the F-test of overall significance assesses all of the coefficients jointly, whereas the t-tests examine each coefficient individually. For example, the overall F-test can find that the coefficients are significant jointly while the t-tests fail to find significance individually.
These conflicting test results can be hard to understand, but think about it this way. The F-test sums the predictive power of all independent variables and determines that it is unlikely that all of the coefficients equal zero. However, it is possible that each variable is not predictive enough on its own to be statistically significant. In other words, your sample provides sufficient evidence to conclude that your model is significant, but not enough to conclude that any individual variable is significant.
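Both the overall F-statistic and its p-value can be pulled out of the summary object programmatically. A sketch, using the component names returned by summary.lm and assuming tipsLM has been fit as above:

```r
# The fstatistic component holds the F value and its two degrees of freedom
fit <- summary(tipsLM)
fstat <- fit$fstatistic   # named vector: value, numdf, dendf
fstat

# The p-value for the overall F-test, from the F distribution
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
```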
It is fabulous if your regression model is statistically significant! However, check your residual plots to determine whether the results are trustworthy. We will see more on residual plots when we discuss model diagnostics.
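As a quick preview, base R can draw the standard diagnostic plots for any fitted lm object with a single call; interpreting them is left for the model diagnostics discussion.

```r
# Four standard diagnostic plots for the fitted model:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
# (assumes tipsLM has been fit as above)
par(mfrow = c(2, 2))
plot(tipsLM)
par(mfrow = c(1, 1))
```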
Comparing Model Statistics
Notice that the F-value or F-statistic is the same for both, as is the degrees of freedom. This is because ANOVA and regression were derived along the same lines and can accomplish the same analysis. Visualizing the coefficients and standard errors should show the same results as computing them using the ANOVA formula. The point estimates for the mean are identical and the confidence intervals are similar, the difference being due to slightly different calculations.
# first calculate the means and CI manually
library(dplyr)
tipsByDay <- tips %>%
  group_by(day) %>%
  dplyr::summarize(
    tip.mean=mean(tip), tip.sd=sd(tip),
    Length=NROW(tip),
    tfrac=qt(p=.90, df=Length-1),
    Lower=tip.mean - tfrac*tip.sd/sqrt(Length),
    Upper=tip.mean + tfrac*tip.sd/sqrt(Length)
  )
tipsByDay
I have used the dplyr library by Hadley Wickham to calculate the mean and confidence intervals (CI) of the data. We have not discussed the dplyr library so far in our data science using R series; it is a series by itself, and we will cover those topics extensively in the advanced R programming series. For now, the above code splits the data by day and calculates the mean and CI for each.
Now we calculate similar measures using the linear model summary that we fit on the tips data.
tipsInfo <- summary(tipsLM)
tipsCoef <- as.data.frame(tipsInfo$coefficients[, 1:2])
tipsCoef <- within(tipsCoef, {
  Lower <- Estimate - qt(p=0.90, df=tipsInfo$df[2]) * `Std. Error`
  Upper <- Estimate + qt(p=0.90, df=tipsInfo$df[2]) * `Std. Error`
  day <- rownames(tipsCoef)
})
tipsCoef
We can compare the two tables and see that the manually calculated measures and the measures calculated from the regression model are similar to each other. It is easier to understand this visually than by comparing tables of values.
ggplot(tipsByDay, aes(x=tip.mean, y=day)) + geom_point() +
  geom_errorbarh(aes(xmin=Lower, xmax=Upper), height=.3) +
  ggtitle("Tips by day calculated manually")

ggplot(tipsCoef, aes(x=Estimate, y=day)) + geom_point() +
  geom_errorbarh(aes(xmin=Lower, xmax=Upper), height=.3) +
  ggtitle("Tips by day calculated from regression model")
A new function and a new feature were used here. First, we introduced within, which is similar to with in that it lets us refer to columns in a data.frame by name, but different in that we can create new columns within that data.frame, hence the name. This function has largely been superseded by mutate in dplyr but is still good to know.
Second, one of the columns was named Std. Error, with a space. In order to refer to a variable with spaces in its name, even as a column in a data.frame, we must enclose the name in backticks (`).
Multiple Linear Regression Using R — 8
Do share your thoughts and support by commenting and sharing the article among your peer groups.