Simple Linear Regression — 7

Vivekanandan Srinivasan
Published in Analytics Vidhya
9 min read · Dec 7, 2019

If you have not read part 6 of the R data analysis series, kindly go through the following article, where we discussed Advanced Statistics Using R — 6.

The contents of this article are a gist of a couple of books I was introduced to during my IIM-B days.

R for Everyone — Jared P. Lander

Practical Data Science with R — Nina Zumel & John Mount

All the code blocks discussed in the article are available as R Markdown in the GitHub link.

To see all the articles written by me, kindly use the link Vivek Srinivasan.

The workhorse of statistical analysis is the linear model, particularly regression. Originally invented by Francis Galton to study the relationships between parents and children, which he described as regressing to the mean, it has become one of the most widely used modeling techniques and has spawned other models such as generalized linear models, regression trees, penalized regression, and many others.

In this article, our discussion mostly revolves around the fundamentals of regression and its association with ANOVA in general. Later we will see an example to demonstrate how regression is applied practically using R. In the following article, we will cover multiple linear regression with a case study and also discuss important model diagnostics approaches.

Simple Linear Regression

In its simplest form, regression is used to determine the relationship between two variables. That is, given one variable, it tells us what we can expect from the other variable. This powerful tool, which is frequently taught and can accomplish a great deal of analysis with minimal effort, is called simple linear regression.

Before we go any further, we clarify some terminology. The outcome variable (what we are trying to predict) is called the response, and the input variable (what we are using to predict) is the predictor. Fields outside of statistics use other terms, such as measured variable, outcome variable and experimental variable for response, and covariate, feature and explanatory variable for predictor. Worst of all are the terms dependent (response) and independent (predictor) variables.

These very names are misnomers. According to probability theory, if variable y is dependent on variable x, then variable x cannot be independent of variable y. So we stick with the terms response and predictor exclusively.

The general idea behind simple linear regression is to use the predictor to come up with some average value of the response. The relationship is defined as

y = a + bx + ε

The equation essentially describes a straight line that goes through the data, where a is the y-intercept and b is the slope. The error term ε says that there are normally distributed errors around that line. In this case, we are using the fathers' heights as the predictor and the sons' heights as the response. The blue line running through the points in the plot below is the regression line and the gray band around it represents the uncertainty in the fit.

# load the father.son heights data from the UsingR package
data(father.son, package='UsingR')
library(ggplot2)
head(father.son)

Now let us plot these data to understand the relationship between the variables.

ggplot(father.son, aes(x=fheight, y=sheight)) +
  geom_point() +
  geom_smooth(method="lm") +
  labs(x="Fathers", y="Sons")
Using fathers’ heights to predict sons’ heights using simple linear regression. The fathers’ heights are the predictors and the sons’ heights are the responses. The blue line running through the points is the regression line and the gray band around it represents the uncertainty in the fit.

While that code generated a nice graph showing the results of the regression (produced by geom_smooth(method="lm")), it did not actually make those results available to us. To actually calculate a regression, use the lm function.

heightsLM <- lm(sheight ~ fheight, data=father.son)
heightsLM

Here we once again see the formula notation, which specifies to regress sheight (the response) on fheight (the predictor), using the father.son data, and adds the intercept term automatically. The results show coefficients for (Intercept) and for fheight, which is the slope of the fheight predictor. The interpretation is that, for every extra inch of height in a father, we expect an extra half-inch in height for his son. The intercept, in this case, does not make much sense because it represents the height of a son whose father had zero height, which obviously cannot exist in reality.
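To put that interpretation to work, we can pull the coefficients out of the fitted object and predict a son's height for a few hypothetical father heights (a minimal sketch; the father heights below are made up purely for illustration).

# extract the fitted intercept and slope
coef(heightsLM)

# predicted son's height for hypothetical father heights of 65, 66 and 70 inches
predict(heightsLM, newdata=data.frame(fheight=c(65, 66, 70)))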

While the point estimates for the coefficients are nice, they are not very helpful without the standard errors, which give a sense of uncertainty about the estimates and are similar to standard deviations. To quickly see a full report on the model, use summary.

summary(heightsLM)

This prints out a lot more information about the model, including the standard errors, t-test values and p-values for the coefficients, the degrees of freedom, residual summary statistics and the results of an F-test. This is all diagnostic information for checking the fit of the model. Details of these statistics are covered extensively when we discuss multiple linear regression in the next article. At present, let us take a step back and discuss how linear regression can be utilized in place of ANOVA.
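Before we move on, note that these pieces can also be pulled out of the summary object programmatically rather than read off the printed report (a minimal sketch using base R accessors):

heightsSummary <- summary(heightsLM)

# coefficient table: estimates, standard errors, t-values and p-values
heightsSummary$coefficients

# residual degrees of freedom and the F-statistic
heightsSummary$df
heightsSummary$fstatistic

# confidence intervals for the coefficients
confint(heightsLM)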

ANOVA Alternative

An alternative to running an ANOVA test (discussed in the previous article) is to fit a regression with just one categorical variable and no intercept term. To see this, we use the tips data in the reshape2 package, on which we will fit a regression.

data(tips, package="reshape2")
head(tips)

We will first fit an ANOVA model, similar to how we did in our previous article. Setting aside all the other statistical measures displayed below, concentrating on the p-value alone shows us that there is a significant difference in tips between days.

# putting -1 in the formula indicates that the intercept should not be
# included in the model;
# the categorical variable day is automatically set up to have a
# coefficient for each level
tipsAnova <- aov(tip ~ day - 1, data=tips)
summary(tipsAnova)
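If you only want the p-value referred to above, you can extract it from the ANOVA table returned by summary (a small sketch; the column name is the one printed by summary.aov):

# pull the p-value out of the ANOVA table
anovaTable <- summary(tipsAnova)[[1]]
anovaTable[["Pr(>F)"]]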

Now let us fit the regression model for the same data and compare the results.

tipsLM <- lm(tip ~ day - 1, data=tips)
summary(tipsLM)

Let us understand the different statistical measures displayed above and their importance in model building. R-squared measures the strength of the relationship between your model and the dependent variable. However, it is not a formal test for the relationship. We have a really good R-squared value for the model, which indicates that our model variables explain most of the variance compared to the mean estimate.
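As a small aside, the R-squared (and its adjusted counterpart) can be read straight off the fitted model if you need them as numbers rather than as part of the printed summary:

# R-squared and adjusted R-squared for the fitted model
summary(tipsLM)$r.squared
summary(tipsLM)$adj.r.squared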

The F-test of overall significance is the hypothesis test for this relationship. If the overall F-test is significant, you can conclude that R-squared does not equal zero and that the correlation between the model and the dependent variable is statistically significant.

Compare the p-value for the F-test to your significance level. If the p-value is less than the significance level, your sample data provide sufficient evidence to conclude that your regression model fits the data better than the model with no independent variables. As we see from the results, the p-value is less than the significance level, so we can safely conclude that the overall model is significant.
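For reference, the overall F-test p-value is not stored directly in the summary, but it can be recomputed from the stored F-statistic (a minimal sketch using the fstatistic element of the summary object):

# recompute the overall F-test p-value from the F-statistic and its degrees of freedom
f <- summary(tipsLM)$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail=FALSE)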

While the F-test measures the overall significance of the model, the t-test, on the other hand, measures the significance of each variable in the model. Similar to the F-test, the p-value of each variable is compared to the significance level. If the p-value is less than the significance level, then you can conclude that the individual predictor variable is statistically significant.
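The per-coefficient t-statistics and p-values live in the coefficient table of the summary, so this comparison can also be done in code (a minimal sketch):

# estimates, standard errors, t-values and p-values for each coefficient
summary(tipsLM)$coefficients

# just the p-values, to compare against a chosen significance level such as 0.05
summary(tipsLM)$coefficients[, "Pr(>|t|)"]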

Generally speaking, if none of your independent variables are statistically significant, the overall F-test is also not statistically significant. Occasionally, the tests can produce conflicting results. This disagreement can occur because the F-test of overall significance assesses all of the coefficients jointly, whereas the t-tests for each coefficient examine them individually. For example, the overall F-test can find that the coefficients are significant jointly while the t-tests fail to find significance individually.

These conflicting test results can be hard to understand, but think about it this way. The F-test sums the predictive power of all independent variables and determines that it is unlikely that all of the coefficients equal zero. However, it is possible that each variable is not predictive enough on its own to be statistically significant. In other words, your sample provides sufficient evidence to conclude that your model is significant, but not enough to conclude that any individual variable is significant.

It’s fabulous if your regression model is statistically significant! However, check your residual plots to determine whether the results are trustworthy. We will see more on residual plots when we discuss model diagnostics.
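As a quick preview of what is coming in the diagnostics discussion, base R will draw the standard residual plots straight from a fitted lm object (shown here for the heights model fitted earlier):

# residuals vs fitted values and a normal Q-Q plot of the residuals
plot(heightsLM, which=1:2)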

Comparing Model Statistics

Notice that the F-value or F-statistic is the same for both, as are the degrees of freedom. This is because ANOVA and regression were derived along the same lines and can accomplish the same analysis. Visualizing the coefficients and standard errors should show the same results as computing them using the ANOVA formula. The point estimates for the mean are identical and the confidence intervals are similar, with the difference due to slightly different calculations.
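You can verify the match directly by pulling the F-statistic and degrees of freedom out of both fitted objects (a small sketch; the values should agree up to rounding):

# F-statistic and degrees of freedom from the ANOVA fit
summary(tipsAnova)[[1]][1, c("Df", "F value")]

# F-statistic, numerator and denominator degrees of freedom from the regression fit
summary(tipsLM)$fstatistic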

# first calculate the means and CI manually
library(dplyr)
tipsByDay <- tips %>%
  group_by(day) %>%
  dplyr::summarize(
    tip.mean=mean(tip), tip.sd=sd(tip),
    Length=NROW(tip),
    tfrac=qt(p=.90, df=Length-1),
    Lower=tip.mean - tfrac*tip.sd/sqrt(Length),
    Upper=tip.mean + tfrac*tip.sd/sqrt(Length)
  )
tipsByDay

I have used the dplyr library by Hadley Wickham to calculate the mean and confidence intervals (CI) of the data. We have not discussed the dplyr library so far in our data science using R series. It is a series by itself, and we will cover those topics extensively in the advanced R programming series. For now, the above code splits the data by day and calculates the mean and CI for each group.

Now we calculate similar measures using the linear model summary which we fit on tips data.

tipsInfo <- summary(tipsLM)
tipsCoef <- as.data.frame(tipsInfo$coefficients[, 1:2])
tipsCoef <- within(tipsCoef, {
  Lower <- Estimate - qt(p=0.90, df=tipsInfo$df[2]) * `Std. Error`
  Upper <- Estimate + qt(p=0.90, df=tipsInfo$df[2]) * `Std. Error`
  day <- rownames(tipsCoef)
})
tipsCoef

Comparing the two tables shows that the measures calculated manually and those calculated from the regression model are similar to each other. It is easier to see this visually than by comparing tables of values.

ggplot(tipsByDay, aes(x=tip.mean, y=day)) + geom_point() +
  geom_errorbarh(aes(xmin=Lower, xmax=Upper), height=.3) +
  ggtitle("Tips by day calculated manually")

ggplot(tipsCoef, aes(x=Estimate, y=day)) + geom_point() +
  geom_errorbarh(aes(xmin=Lower, xmax=Upper), height=.3) +
  ggtitle("Tips by day calculated from regression model")

A new function and a new feature were used here. First, we introduced within, which is similar to with in that it lets us refer to columns in a data.frame by name but different in that we can create new columns within that data.frame, hence the name. This function has largely been superseded by mutate in dplyr but is still good to know.

Second, one of the columns was named Std. Error, with a space. In order to refer to a variable with spaces in its name, even as a column in a data.frame, we must enclose the name in backticks (`).
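For comparison, the same calculation written with dplyr's mutate would look roughly like this (a sketch of an equivalent approach, not the code used above; note that mutate also needs the backticks around Std. Error):

# hypothetical mutate-based equivalent of the within() block above
tipsCoefMutate <- as.data.frame(tipsInfo$coefficients[, 1:2])
tipsCoefMutate <- dplyr::mutate(
  tipsCoefMutate,
  day=rownames(tipsCoefMutate),
  Lower=Estimate - qt(p=0.90, df=tipsInfo$df[2]) * `Std. Error`,
  Upper=Estimate + qt(p=0.90, df=tipsInfo$df[2]) * `Std. Error`
)
tipsCoefMutate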

Multiple Linear Regression Using R — 8

Do share your thoughts and support by commenting and sharing the article among your peer groups.
