Applied Question 9 — ISLR Series: Chapter 3 — Linear Regression

Taraqur Rahman
The Biased Outliers
6 min read · Mar 18, 2021

In this blog we will walk through one of the applied questions in R from Chapter 3 (Linear Regression) of ISLR. It complements the blogs we wrote for this chapter (Part I, Part II).

ISLR Q3.9 — Multiple Linear Regression/Auto

This question involves the use of multiple linear regression on the Auto data set.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

i. Is there a relationship between the predictors and the response?

ii. Which predictors appear to have a statistically significant relationship to the response?

iii. What does the coefficient for the year variable suggest?

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

(f) Try a few different transformations of the variables, such as log(X), sqrt(X), X². Comment on your findings.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

library(ISLR)
library(GGally)

# drop the qualitative name variable, which ggpairs cannot plot sensibly
ggpairs(subset(Auto, select = -name))

The graphs in the bottom-left triangle visually show how each pair of predictors relates (or correlates), while the values in the top-right triangle give the corresponding correlation coefficients. Based on this plot, some predictors are highly correlated and a few are not.

Correlated:

  • Horsepower and weight are highly correlated.
  • Displacement and weight are highly correlated.

Not Correlated:

  • MPG and acceleration show only a weak, seemingly nonlinear relationship.

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

cor(subset(Auto, select = -name))

##             mpg cylinders displ    hp weight accel  year origin
## mpg        1.00     -0.78 -0.81 -0.78  -0.83  0.42  0.58   0.57
## cylinders -0.78      1.00  0.95  0.84   0.90 -0.50 -0.35  -0.57
## displ     -0.81      0.95  1.00  0.90   0.93 -0.54 -0.37  -0.61
## hp        -0.78      0.84  0.90  1.00   0.86 -0.69 -0.42  -0.46
## weight    -0.83      0.90  0.93  0.86   1.00 -0.42 -0.31  -0.59
## accel      0.42     -0.50 -0.54 -0.69  -0.42  1.00  0.29   0.21
## year       0.58     -0.35 -0.37 -0.42  -0.31  0.29  1.00   0.18
## origin     0.57     -0.57 -0.61 -0.46  -0.59  0.21  0.18   1.00

This matrix gives the numerical values of the correlations. It presents the same information as the pair plot, just in numerical form.

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

auto.mlr = lm(mpg ~ . -name, data=Auto)
summary(auto.mlr)

i. Is there a relationship between the predictors and the response?

There are multiple predictors that have a relationship with the response, since their associated p-values are significant. A p-value is the probability of observing a coefficient estimate at least as extreme as the one we got if the true coefficient were 0 (i.e., if there were no relationship). The typical threshold is 0.05: a p-value below 0.05 means the chance of seeing such an estimate when there is no real relationship is very slim. The overall F-statistic in the summary makes the same point for the model as a whole: its tiny p-value says at least one predictor is related to mpg.
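These quantities can also be pulled out of the summary object directly rather than read off the printout. A minimal sketch, assuming the auto.mlr fit from above:

```r
library(ISLR)

auto.mlr <- lm(mpg ~ . - name, data = Auto)
s <- summary(auto.mlr)

s$fstatistic                 # overall F-test: is any predictor related to mpg?
coef(s)[, "Pr(>|t|)"]        # per-predictor p-values
names(which(coef(s)[, "Pr(>|t|)"] < 0.05))  # predictors significant at 0.05
```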

ii. Which predictors appear to have a statistically significant relationship to the response?

The predictors: displacement, weight, year, and origin have a statistically significant relationship.

iii. What does the coefficient for the year variable suggest?

The coefficient of year is 0.7507, which is about 3/4. This tells us the relationship between year and MPG: holding the other predictors fixed, mpg improves by about 0.75 per model year, or equivalently about 3 mpg every 4 years.
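As a quick sanity check on that arithmetic (again assuming the auto.mlr fit from part (c)):

```r
library(ISLR)

auto.mlr <- lm(mpg ~ . - name, data = Auto)

coef(auto.mlr)["year"]       # about 0.75 mpg gained per model year
coef(auto.mlr)["year"] * 4   # about 3 mpg over 4 years, other predictors fixed
```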

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2, 2))  # arrange all four diagnostic plots in one view
plot(auto.mlr)

Non-Linearity: The Residuals vs Fitted plot shows a U-shaped pattern in the residuals, which suggests that the relationship may be non-linear.

Non-Constant Variance: The same plot also shows that the variance of the residuals is not constant: there is a funnel shape, with the spread of the residuals starting small and then increasing. This is an example of heteroscedasticity (non-constant variance).

Normally Distributed Residuals: The Normal Q-Q plot lets us check whether the residuals are normally distributed; they are if the observations line up on the dashed line. In this case a handful of observations do not lie on the line, especially 323, 327, and 326.

Outliers: The Scale-Location plot helps flag outliers. An observation is typically considered an outlier if its standardized residual falls outside [-3, 3]; since this plot shows the square root of the absolute standardized residuals, that cutoff corresponds to about 1.73 on the vertical axis. Most points sit below that line, but a few climb toward 2, so there may be a handful of mild outliers.

High Leverage Points: The Residuals vs Leverage plot highlights observations with high leverage. Cook's distance is shown with the dashed red lines; points falling outside those contours are influential. Based on this graph, no observation crosses the Cook's distance contours, although a few points have noticeably higher leverage than the rest.
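These diagnostics can also be checked numerically instead of by eye. A sketch, assuming the auto.mlr fit from part (c):

```r
library(ISLR)

auto.mlr <- lm(mpg ~ . - name, data = Auto)

# outliers: standardized residuals outside [-3, 3]
r <- rstandard(auto.mlr)
which(abs(r) > 3)

# leverage: hat values well above the average (p + 1) / n deserve a look
h <- hatvalues(auto.mlr)
mean(h)
which.max(h)
```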

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Linear regression assumes that the predictors are additive: a change in one predictor shifts the response by a fixed amount regardless of the values of the other predictors. This is not always the case. If the effect of one predictor depends on another predictor, we have an interaction effect. To implement it in our model, we multiply the two predictors to create a new term, the interaction term.

interact.fit = lm(mpg ~ . - name + horsepower*displacement, data=Auto)
summary(interact.fit)

origin.hp = lm(mpg ~ . - name + horsepower*origin, data=Auto)
summary(origin.hp)

In the R code, we remove the name column from the model fit and add an interaction term with *. Because the main effects are already included via the . shorthand, horsepower*displacement and horsepower:displacement produce the same fit here.

We applied other interaction terms as well, and some of the statistically significant ones are:

  • displacement and horsepower
  • horsepower and origin

(f) Try a few different transformations of the variables, such as log(X), sqrt(X), X². Comment on your findings.

Log Transformation for Acceleration

summary(lm(mpg ~ . -name + log(acceleration), data=Auto))

log(acceleration) is still highly significant, though less significant than acceleration itself.

Squared Transformation for Horsepower

summary(lm(mpg ~ . -name + I(horsepower^2), data=Auto))

Squaring horsepower doesn’t change the significance.
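Beyond eyeballing coefficient p-values, nested models with and without a transformed term can be compared formally with an F-test via anova(). A sketch, assuming the ISLR Auto data:

```r
library(ISLR)

base.fit <- lm(mpg ~ . - name, data = Auto)
sq.fit   <- lm(mpg ~ . - name + I(horsepower^2), data = Auto)

# F-test: does adding the squared term significantly improve the fit?
anova(base.fit, sq.fit)
```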

That is the end of the question. This is just an excerpt of the full answer; to view the answer in detail, check out the website.

Collaborators: Michael Mellinger
