# Simple Linear Regression — 7

If you have not read part 6 of the R data analysis series kindly go through the following article where we discussedAdvanced Statistics Using R — 6.

The contents in the article are gist from a couple of books that I got introduced during my IIM-B days.

R for Everyone — Jared P. Lander

Practical Data Science with R — Nina Zumel & John Mount

All the code blocks discussed in the article are present in the form of R markdown in theGithub link.

To see all the articles written by me kindly use the link,Vivek Srinivasan.

The workhorse of statistical analysis is the linear model, particularly `regression`

. Originally invented by `Francis Galton`

to study the relationships between parents and children, which he described as regressing to the mean, it has become one of the most widely used modeling techniques and has spawned other models such as generalized `linear models`

, `regression trees`

, `penalized regression`

, and many others.

In this article, our discussion mostly revolves around the fundamentals of `regression `

and its association with `ANOVA `

in general. Later we will see an example to demonstrate how regression is applied practically using `R`

. In the following article, we will cover `multiple linear regression `

with a case study and also discuss important `model diagnostics`

approaches.

## Simple Linear Regression

In its simplest form `regression `

is used to determine the relationship between two variables. That is, given one variable, it tells us what we can expect from the other variable. This powerful tool, which is frequently taught and can accomplish a great deal of analysis with minimal effort, is called `simple linear regression`

.

Before we go any further, we clarify some terminology. The outcome variable (what we are trying to predict) is called the `response`

, and the input variable (what we are using to predict) is the `predictor`

. Fields outside of statistics use other terms, such as `measured `

variable, `outcome `

variable and `experimental `

variable for `response`

, and `covariate`

, `feature `

and `explanatory `

variable for predictor. Worst of all are the terms `dependent `

(response) and `independent `

(predictor) variables.

These very names are misnomers. According to probability theory, if variable y is dependent on variable x, then variable x cannot be independent of variable y. So we stick with the terms response and predictor exclusively.

The general idea behind simple `linear regression`

is using the predictor to come up with some average value of the response. The relationship is defined as

The equation is essentially describing a straight line that goes through the data where `a`

is the `y-intercept `

and `b `

is the `slope`

.. In this case, we are using the fathers’ heights as the predictor and the sons’ heights as the response. The blue line running through the points is the regression line and the gray band around it represents the uncertainty in the fit. Error term in the equation is to say that there are normally distributed errors.

`data(father.son, package='UsingR')`

library(ggplot2)

head(father.son)

Now let us plot these data to understand the relationship between the variables.

`ggplot(father.son, aes(x=fheight, y=sheight)) + geom_point() + geom_smooth(method="lm") + labs(x="Fathers", y="Sons")`

While that code generated a nice graph showing the results of the `regression (generated with geom_smooth(method=“lm”))`

, it did not actually make those results available to us. To actually calculate a regression, use the `lm `

function.

`heightsLM <- lm(sheight ~ fheight, data=father.son)`

heightsLM

Here we once again see the formula notation that specifies to regress `sheight `

(the response) on `fheight `

(the predictor), using the `father.son`

data, and adds the intercept term automatically. The results show coefficients for (`Intercept`

) and `fheight `

which is the `slope `

for the `fheight`

, predictor. The interpretation of this is that, for every extra inch of height in a father, we expect an extra half-inch in height for his son. The `intercept`

, in this case, does not make much sense because it represents the height of a son whose father had zero height, which obviously cannot exist in reality.

While the point estimates for the coefficients are nice, they are not very helpful without the standard errors, which give the sense of uncertainty about the estimate and are similar to standard deviations. To quickly see a full report on the model, use `summary`

.

`summary(heightsLM)`

This prints out a lot more information about the model, including the `standard errors`

, `t-test`

values and `p-values`

for the `coefficients`

, the `degrees of freedom`

, `residual summary`

statistics and the results of an `F-test`

. This is all diagnostic information to check the fit of the model. Details of the statistics are covered extensively when we discuss `multiple linear regression`

in the next article. At present let us take a step back and discuss the about how `linear regression`

can be utilized in place of `ANOVA`

.

## ANOVA Alternative

An alternative to running an `ANOVA `

test (discussed in previous article) is to fit a `regression `

with just one categorical variable and no `intercept `

term. To see this we use the tips data in the `reshape2 `

package on which we will fit a regression.

`data(tips, package="reshape2")`

head(tips)

We will first fit an ANOVA model similar to how we did in our previous article. Barring all other statistical measures displayed below just concentrating on `p-value`

alone shows us that there is a significant difference between days in the tips data.

`# putting -1 in the formula indicates that the intercept should not be`

# included in the model;

# the categorical variable day is automatically setup to have a

# coefficient for each level

tipsAnova <- aov(tip ~ day - 1, data=tips)

summary(tipsAnova)

Now let us fit the `regression `

model for the same data and compare the results.

`tipsLM <- lm(tip ~ day - 1, data=tips)`

summary(tipsLM)

Let us understand the different statistical measures displayed above and the importance of them in the model building.`R-squared`

measures the strength of the relationship between your model and the dependent variable. However, it is not a formal test for the relationship. We have a really good `R-Square`

value for the model which indicates our model variables explain most of the variance when compared to the mean estimate.

The `F-test`

of overall significance is the `hypothesis test`

for this relationship. If the overall `F-test`

is significant, you can conclude that `R-squared`

does not equal zero, and then `correlation`

between the model and the dependent variable is statistically significant.

Compare the `p-value`

for the `F-test`

to your `significance level`

. If the `p-value`

is less than the `significance level`

, your sample data provide sufficient evidence to conclude that your regression model fits the data better than the model with no independent variables. As we see from the results we have `p-value`

less than the significance level so we can safely conclude overall significance of the model is good.

While you `F-test`

measure the overall significance of the model, `t-test`

on the other hand, measures the significance of each variable in the model. Similar to `F-test`

the `p-value`

of each variable is compared to the significance level. If the p-value is less than the significance level then you can conclude that the individual predictor variables are statistically significant.

Generally speaking, if none of your independent variables are statistically significant, the overall `F-test `

is also not statistically significant. Occasionally, the tests can produce conflicting results. This disagreement can occur because the `F-test`

of overall significance assesses all of the coefficients jointly whereas the `t-test`

for each coefficient examine them individually. For example, the overall `F-test`

can find that the coefficients are significant *jointly *while the `t-tests`

can fail to find significance individually.

These conflicting test results can be hard to understand but think about it this way. The `F-test`

sums the predictive power of all independent variables and determines that it is unlikely that all of the coefficients equal to zero. However, it’s possible that each variable isn’t predictive enough on its own to be statistically significant. In other words, your sample provides sufficient evidence to conclude that your model is significant, but not enough to conclude that any individual variable is significant.

It’s fabulous if your regression model is statistically significant! However, check your `residual plots`

to determine whether the results are trustworthy. We will see more on residual plots when we discuss model diagnostics.

## Comparing Model Statistics

Notice that the `F-value`

or `F-statistic`

is the same for both, as is the `degrees of freedom`

. This is because of the `ANOVA `

and `regression `

were derived along the same lines and can accomplish the same analysis. Visualizing the coefficients and standard errors should show the same results as computing them using the `ANOVA `

formula. The point estimates for the mean are identical and the confidence intervals are similar, the difference due to slightly different calculations.

`# first calculate the means and CI manually`

library(dplyr)

tipsByDay <- tips %>%

group_by(day) %>%

dplyr::summarize(

tip.mean=mean(tip), tip.sd=sd(tip),

Length=NROW(tip),

tfrac=qt(p=.90, df=Length-1),

Lower=tip.mean - tfrac*tip.sd/sqrt(Length),

Upper=tip.mean + tfrac*tip.sd/sqrt(Length)

)

tipsByDay

I have used `dplyr `

library by Hadley Wickham to calculate the `mean `

and `Confidence Intervals (CI)`

of the data. We have not discussed `dplyr `

library so far in our data science using `R`

series. It is series by itself and we will be covering those topics extensively in

series. As of now the above code splits the data by each day and calculates the *advanced R programming*`mean `

and `CI`

for the data.

Now we calculate similar measures using the linear model summary which we fit on `tips `

data.

tipsInfo <- summary(tipsLM)

tipsCoef <- as.data.frame(tipsInfo$coefficients[, 1:2])

tipsCoef <- within(tipsCoef, {

Lower <- Estimate - qt(p=0.90, df=tipsInfo$df[2]) * `Std. Error`

Upper <- Estimate + qt(p=0.90, df=tipsInfo$df[2]) * `Std. Error`

day <- rownames(tipsCoef)

})tipsCoef

We can compare two tables and understand that the manually calculated and measures calculated from the regression model are similar to each other. It is easier to visually understand then comparing the table of values.

ggplot(tipsByDay, aes(x=tip.mean, y=day)) + geom_point() +

geom_errorbarh(aes(xmin=Lower, xmax=Upper), height=.3) +

ggtitle("Tips by day calculated manually")ggplot(tipsCoef, aes(x=Estimate, y=day)) + geom_point() +

geom_errorbarh(aes(xmin=Lower, xmax=Upper), height=.3) +

ggtitle("Tips by day calculated from regression model")

A new function and a new feature were used here. First, we introduced `within`

, which is similar to `with`

in that it lets us refer to columns in a `data.frame `

by name but different in that we can create new columns within that `data.frame`

, hence the name. This function has largely been superseded by `mutate `

in `dplyr `

but is still good to know.

Second, one of the columns was named `Std. Error `

with space. In order to refer to a variable with spaces in its name, even as a column in a `data.frame`

, we must enclose the name in `back ticks (`)`

.

Multiple Linear Regression Using R — 7

Do share your thoughts and support by commenting and sharing the article among your peer groups.