# ML20: Stepwise Linear Regression with R

## With higher-degree terms & interactions

**Keywords**: stepwise linear regression, higher-degree terms, interactions, AIC, BIC, correlation heat map, scatter plot

**Complete R code on Colab**: https://bit.ly/3oRq3cR

- Linear regression (LR) is an essential yet often underrated model in ML.
- LR offers a quick walk-through **in preparation for** implementing more sophisticated ML modeling and more complex analysis.
- Furthermore, LR can serve as the **baseline model** to evaluate the performance of more sophisticated ML models.

Building on ML19, we now take a closer look at linear regression, using R to handle the toy dataset “fuel2001” (fuel data) from “*Applied Linear Regression (4th ed.)*” [3]. In Chapters 1 and 3, this notable textbook uses this toy dataset to concisely demonstrate the **scatter plot matrix**, **summary statistics**, **correlation matrix** and **multiple linear regression** via R.

Outline

(1)Using R in Colab

(2)Data Source: Fuel Consumption

(3)Data Preprocessing

3–1 Deleting a column

3–2 Transformation

3–3 Retaining needed columns

(4)Data Exploration

4–1 Descriptive statistics

4–2 Correlation matrix

4–3 Scatter plot matrix

4–4 Scatter plot

(5)Linear Regression

5–1 Information criteria: AIC vs. BIC

5–2 LR

5–3 LR with interactions

5–4 LR with interactions & higher-degree terms

5–5 Stepwise LR using lm()

5–6 Stepwise LR using lm() & BIC

5–7 Stepwise LR using glm() & BIC

(6)Summary

(7)References

# (1) Using R in Colab [4]

To create a new R notebook in Colab, open colab.to/r (a shorthand link).

You can learn from IRkernel demos, e.g., demo.ipynb.

# (2) Data Source: Fuel Consumption [3][6]

The reader can download the “fuel2001.csv” dataset from the link below.

http://users.stat.umn.edu/~sandy/alr4ed/data/

Fuel Consumption

The goal of this example is to understand how fuel consumption varies over the 50 United States and the District of Columbia (Federal Highway Administration, 2001). Table 1.1 describes the variables to be used in this example; the data are given in the file fuel2001. The data were collected by the U.S. Federal Highway Administration.

- 1. Shape: 51 observations of 7 variables
- 2. Target: “FuelC”
- 3. Features: “Drivers”, “Income”, “Miles”, “MPC”, “Pop”, “Tax”

# (3) Data Preprocessing

## 3–1 Deleting a column

fuel2001 = read.table("fuel2001.csv",sep=",",header = T)

head(fuel2001, n = 4)

fuel2001_2 = fuel2001[,c(-1)]

names(fuel2001_2)

The first column is odd and not useful for modeling, so we remove it.

## 3–2 Transformation

`fuel2001_3 <- transform(fuel2001_2, Dlic = 1000 * Drivers/Pop, Fuel = 1000 * FuelC/Pop, Income_2 = Income/1000, log_Miles = log(Miles))`

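We do feature engineering, adding new variables through *transform()*: Dlic (Drivers per 1,000 population), Fuel (FuelC per 1,000 population), Income_2 (Income divided by 1,000) and log_Miles (natural log of Miles).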
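To see what *transform()* does in isolation, here is a minimal sketch on a tiny made-up data frame. The name `toy` and all of its values are hypothetical and are not taken from fuel2001.

```r
# Toy data frame with hypothetical values (NOT the real fuel2001 figures)
toy <- data.frame(Drivers = c(3500, 400),
                  FuelC   = c(2400, 250),
                  Pop     = c(3400, 450))

# transform() evaluates each expression using the columns of `toy`
# and appends the results as new columns; the originals are kept
toy2 <- transform(toy,
                  Dlic = 1000 * Drivers / Pop,
                  Fuel = 1000 * FuelC / Pop)

names(toy2)
```

The original columns stay in place and the new ones are appended on the right, which is why the next step selects columns by position.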

## 3–3 Retaining needed columns

fuel2001_4 = fuel2001_3[, c(7,8,9,10,11)] # keep columns 7 to 11

names(fuel2001_4)

fuel2001_4 = fuel2001_3[, -c(1:6)] # equivalent: drop columns 1 to 6

names(fuel2001_4)

The resulting data frame “fuel2001_4” looks as follows:

- 1. Shape: 51 observations of 5 variables
- 2. Target: “Fuel”
- 3. Features: “Tax”, “Dlic”, “Income_2”, “log_Miles”

# (4) Data Exploration [1]

## 4–1 Descriptive statistics

str(fuel2001_4)

attributes(fuel2001_4)

mode(fuel2001_4$Tax)

class(fuel2001_4$Tax)

typeof(fuel2001_4$Tax)

summary(fuel2001_4)

colnames(fuel2001_4)

names(fuel2001_4)

rownames(fuel2001_4)

nrow(fuel2001_4)

head(fuel2001_4, n = 3)

tail(fuel2001_4, n = 3)

dim(fuel2001_4)

length(fuel2001_4)

round(cor(fuel2001_4), 4)

round(var(fuel2001_4), 4) # OR cov(fuel2001_4)

## 4–2 Correlation matrix

`library(corrgram)` (run `install.packages("corrgram")` first if the package is missing)

`corrgram(round(cor(fuel2001_4), 4), order=TRUE, upper.panel=panel.cor)`

## 4–3 Scatter plot matrix
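A scatter plot matrix can be drawn with base R's *pairs()*. The sketch below is only an illustration: it uses the built-in `mtcars` data as a stand-in for `fuel2001_4`, and the column choice is arbitrary.

```r
# mtcars ships with R and is used here as a stand-in for fuel2001_4
vars <- mtcars[, c("mpg", "wt", "hp", "disp")]

# One panel per pair of variables; the diagonal shows the variable names
pairs(vars, main = "Scatter plot matrix")
```

With the fuel data loaded, the analogous call would be `pairs(fuel2001_4)`.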

## 4–4 Scatter plot
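A single scatter plot of the target against one feature, with a fitted line on top, can be sketched as follows. Again `mtcars` is a stand-in (an assumption for illustration); `mpg` plays the role of the target and `wt` the role of a feature.

```r
# Scatter plot of a response against one predictor (mtcars as stand-in data)
plot(mpg ~ wt, data = mtcars,
     xlab = "wt", ylab = "mpg", main = "Scatter plot")

# Overlay the simple linear regression line
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red")
```

For the fuel data, something like `plot(Fuel ~ Dlic, data = fuel2001_4)` would follow the same pattern.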

# (5) Linear Regression

## 5– 1 Information criteria: AIC vs. BIC [3][8]

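- $\mathrm{AIC} = n \log(\mathrm{RSS}_{p_C}/n) + 2\,p_C$
- $\mathrm{BIC} = n \log(\mathrm{RSS}_{p_C}/n) + \log(n)\,p_C$
- Here $n$ is the sample size, $p_C$ is the number of coefficients in the candidate model, and $\mathrm{RSS}_{p_C}$ is its residual sum of squares.
- AIC is given by Sakamoto et al. (1986); BIC is given by Schwarz (1978).
- Smaller values are preferred for AIC & BIC.
- Information criteria balance *lack of fit* against *complexity*: the complexity terms are essentially penalties that prevent the model from overfitting.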

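The AIC penalty per coefficient is fixed at 2, whereas the BIC penalty is log(n), which exceeds 2 once the **sample size n** reaches 8 and keeps growing with n. As n increases, the AIC penalty therefore becomes relatively small and does less to limit model complexity; in this scenario, BIC works better. For the fuel data, n = 51 and log(51) ≈ 3.93.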
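The two formulas can be checked numerically. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption for illustration) and compares the hand-computed values with *extractAIC()*, the function *step()* relies on. Note that R's *AIC()* and *BIC()* are based on the full log-likelihood, so they differ from these formulas by a constant that depends only on n; the ranking of models fitted to the same data is unchanged.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
p   <- length(coef(fit))   # number of coefficients, including the intercept

# Criteria computed directly from the formulas
aic_formula <- n * log(rss / n) + 2 * p
bic_formula <- n * log(rss / n) + log(n) * p

# extractAIC() returns c(edf, criterion); k sets the penalty per coefficient
aic_step <- extractAIC(fit, k = 2)[2]
bic_step <- extractAIC(fit, k = log(n))[2]

# AIC() adds a constant: n * (1 + log(2 * pi)) from the Gaussian likelihood,
# plus 2 because the error variance counts as one more parameter
offset <- n * (1 + log(2 * pi)) + 2

c(formula = aic_formula, step = aic_step, AIC_shifted = AIC(fit) - offset)
```

This is also why the AIC values printed in the *step()* trace do not match the values returned by *AIC()* for the same model.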

## 5–2 LR

`LR_fuel = lm(Fuel ~ Tax + Dlic + Income_2 + log_Miles, data=fuel2001_4)`

summary(LR_fuel)

cat("AIC = ", AIC(LR_fuel), "\n", sep = "")

cat("BIC = ", BIC(LR_fuel), "\n", sep = "")

- AIC = 577.086
- BIC = 588.6769

## 5–3 LR with interactions

`LR_fuel = lm(Fuel ~ Tax + Dlic + Tax:Dlic + Income_2 + log_Miles, data=fuel2001_4)`

summary(LR_fuel)

cat("AIC = ", AIC(LR_fuel), "\n", sep = "")

cat("BIC = ", BIC(LR_fuel), "\n", sep = "")

- AIC = 575.6349
- BIC = 589.1577
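In an R formula, `a:b` denotes only the interaction term, while `a*b` is shorthand for `a + b + a:b`. A minimal sketch on the built-in `mtcars` data (a stand-in, used only for illustration) shows that both spellings describe the same model:

```r
f1 <- mpg ~ wt + hp + wt:hp   # main effects plus explicit interaction
f2 <- mpg ~ wt * hp           # shorthand for the same terms

# Both formulas expand to the same set of model terms
attr(terms(f1), "term.labels")
attr(terms(f2), "term.labels")

fit1 <- lm(f1, data = mtcars)
fit2 <- lm(f2, data = mtcars)

# The fitted coefficients are identical
all.equal(coef(fit1), coef(fit2))
```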

## 5–4 LR with interactions & higher-degree terms

`LR_fuel = lm(formula = Fuel ~ Tax + Dlic + Tax:Dlic + Income_2 + log_Miles + I(log_Miles^2), data = fuel2001_4)`

summary(LR_fuel)

cat("AIC = ", AIC(LR_fuel), "\n", sep = "")

cat("BIC = ", BIC(LR_fuel), "\n", sep = "")

- AIC = 574.9257
- BIC = 592.3121
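Higher-degree terms need to be wrapped in *I()*, because inside a formula `^` is a crossing operator rather than arithmetic. A minimal sketch on the built-in `mtcars` data (a stand-in, used only for illustration):

```r
# Without I(), wt^2 is read as "wt crossed with itself", which is just wt
attr(terms(mpg ~ wt + wt^2), "term.labels")

# With I(), the square is computed arithmetically and becomes its own term
attr(terms(mpg ~ wt + I(wt^2)), "term.labels")

# The quadratic model therefore has three coefficients
fit_quad <- lm(mpg ~ wt + I(wt^2), data = mtcars)
names(coef(fit_quad))
```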

## 5–5 Stepwise LR using lm() [7]

null_model = lm(Fuel ~ 1, data=fuel2001_4)

full_model = lm(Fuel ~ Tax + Dlic + Income_2 + log_Miles + Tax:Dlic + Tax:Income_2 + Tax:log_Miles + Dlic:Income_2 + Dlic:log_Miles + Income_2:log_Miles + I(Tax^2) + I(Tax^3) + I(Tax^4) + I(Dlic^2) + I(Dlic^3) + I(Dlic^4) + I(Income_2^2) + I(Income_2^3) + I(Income_2^4) + I(log_Miles^2) + I(log_Miles^3) + I(log_Miles^4), data=fuel2001_4)

model_step = step(null_model,

scope = list(lower=null_model, upper=full_model),

direction = "both", # alternatives: "backward", "forward"

k = 2) # k = 2 for AIC; k = log(nrow(fuel2001_4)) for BIC

- The default information criterion of *step()* is AIC.
- k is the multiple of the number of degrees of freedom used for the penalty. Only k = 2 gives the genuine AIC; k = log(n) is sometimes referred to as BIC or SBC.
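The effect of `k` can be tried on a small problem. The sketch below uses the built-in `mtcars` data and an arbitrary candidate set (both are assumptions for illustration, not the fuel analysis), and runs the same search once with the AIC penalty and once with the BIC penalty.

```r
null_fit <- lm(mpg ~ 1, data = mtcars)
full_fit <- lm(mpg ~ wt + hp + disp + qsec + wt:hp + I(wt^2), data = mtcars)
n <- nrow(mtcars)

# lower/upper give the smallest and largest models the search may visit
search_scope <- list(lower = formula(null_fit), upper = formula(full_fit))

# trace = 0 suppresses the step-by-step log
step_aic <- step(null_fit, scope = search_scope,
                 direction = "both", k = 2, trace = 0)
step_bic <- step(null_fit, scope = search_scope,
                 direction = "both", k = log(n), trace = 0)

formula(step_aic)   # model chosen under the AIC penalty
formula(step_bic)   # model chosen under the BIC penalty
```

Setting `trace = 1` (the default) prints every candidate move with its criterion value, which helps when inspecting how the search reached its final model.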

After trial and error, the stepwise linear regression gives us the model as follows:

lm(formula = Fuel ~ I(Income_2^3) + Dlic + log_Miles + I(Tax^3) + I(Dlic^2), data = fuel2001_4)

- AIC = 573.7994
- BIC = 587.3221

## 5–6 Stepwise LR using lm() & BIC

null_model = lm(Fuel ~ 1, data=fuel2001_4)

full_model = lm(Fuel ~ Tax + Dlic + Income_2 + log_Miles + Tax:Dlic + Tax:Income_2 + Tax:log_Miles + Dlic:Income_2 + Dlic:log_Miles + Income_2:log_Miles + I(Tax^2) + I(Tax^3) + I(Tax^4) + I(Dlic^2) + I(Dlic^3) + I(Dlic^4) + I(Income_2^2) + I(Income_2^3) + I(Income_2^4) + I(log_Miles^2) + I(log_Miles^3) + I(log_Miles^4), data=fuel2001_4)

model_step = step(null_model,

scope = list(lower=null_model, upper=full_model),

direction = "both", # alternatives: "backward", "forward"

k = log(nrow(fuel2001_4))) # k = 2 for AIC; k = log(n) for BIC

After trial and error, the stepwise linear regression gives us the model as follows:

lm(formula = Fuel ~ I(Income_2^3) + Dlic + I(Tax^3) + I(Dlic^2), data = fuel2001_4)

- AIC = 575.0315
- BIC = 586.6225

## 5–7 Stepwise LR using glm() & BIC

null_model = glm(Fuel ~ 1, data=fuel2001_4, family = gaussian(link = "identity"))

full_model = glm(Fuel ~ Tax + Dlic + Income_2 + log_Miles + Tax:Dlic + Tax:Income_2 + Tax:log_Miles + Dlic:Income_2 + Dlic:log_Miles + Income_2:log_Miles + I(Tax^2) + I(Tax^3) + I(Tax^4) + I(Dlic^2) + I(Dlic^3) + I(Dlic^4) + I(Income_2^2) + I(Income_2^3) + I(Income_2^4) + I(log_Miles^2) + I(log_Miles^3) + I(log_Miles^4), data=fuel2001_4, family = gaussian(link = "identity"))

model_step = step(null_model,

scope = list(lower=null_model, upper=full_model),

direction = "both", # alternatives: "backward", "forward"

k = log(nrow(fuel2001_4))) # k = 2 for AIC; k = log(n) for BIC

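*glm()* is a generalization of *lm()*: with family = gaussian(link = "identity") it fits the same linear model as *lm()*, while other families enable, for example, logistic regression.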
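The relationship between the two functions can be checked directly. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption for illustration): the Gaussian *glm()* reproduces the *lm()* fit, and switching the family turns the same call into a logistic regression.

```r
# Same linear model fitted two ways
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(mpg ~ wt + hp, data = mtcars,
               family = gaussian(link = "identity"))

# Coefficients and AIC agree (up to numerical tolerance)
all.equal(coef(fit_lm), coef(fit_glm))
all.equal(AIC(fit_lm), AIC(fit_glm))

# A different family gives logistic regression for a 0/1 response
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
class(fit_logit)
```

This agreement is consistent with sections 5–6 and 5–7 arriving at the same selected model and the same AIC and BIC values.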

After trial and error, the stepwise linear regression gives us the model as follows:

glm(formula = Fuel ~ I(Income_2^3) + Dlic + I(Tax^3) + I(Dlic^2), family = gaussian(link = "identity"), data = fuel2001_4)

- AIC = 575.0315
- BIC = 586.6225

# (6) Summary

1. We probe into a toy dataset “fuel2001” given by “*Applied Linear Regression (4th ed.)*” to implement linear regression using R. We not only reproduce the outcomes in Chapter 1 & Chapter 3 of “*Applied Linear Regression (4th ed.)*”, but also extend this toy example to a complete data analysis process.

2. In general, the data analysis process [2] comprises the following steps:

- Ideation
- Retrieval
- Preparation
- Exploration
- Modeling
- Presentation
- Reproduction

In this article, we carry out preprocessing, exploration and stepwise linear regression. In other words, we cover retrieval, preparation, exploration, modeling and presentation.

3. We experience the power of **stepwise regression with interactions and higher-degree terms (degree>1)**, which is rarely mentioned in ML/DS books or articles on the Internet.

This is exactly where R prevails over Python. We may **quickly select desired features** for explanation or prediction through this stepwise linear regression with interactions and higher-degree terms (degree>1). Then, these selected features can be fed into more complex models such as SVM, random forest or XGBoost, saving a great amount of time.

The reader may check the corresponding article using Python conducting the same analysis in ML21.

# (7) References

[1] Lander J.P. (2017). R for Everyone: Advanced Analytics and Graphics (2nd ed.). Massachusetts, MA: Addison-Wesley Professional.

[2] Heydt, M. (2017). Learning pandas: High-performance data manipulation and analysis in Python (2nd ed.). Birmingham, UK: Packt Publishing.

[3] Weisberg, S. (2014). Applied Linear Regression (4th ed.). New Jersey, NJ: John Wiley & Sons.

[4] korakot (2019). How to use R with Google Colaboratory? Retrieved from

[5] Unidentified (Unidentified). Scatter Plot Matrices — R Base Graphs. Retrieved from

[6] RDocumentation (Unidentified). fuel2001: Fuel Consumption. Retrieved from

## (Chinese)

[7] 陳景祥 (2018). R軟體：應用統計方法 [R Software: Applied Statistical Methods] (2nd ed.). Taipei: 東華書局.

[8] 钱魏Way (2020). 最优模型选择准则：AIC和BIC [Optimal model selection criteria: AIC and BIC]. Retrieved from

https://www.biaodianfu.com/aic-bic.html