ML20: Stepwise Linear Regression with R

With higher-degree terms & interactions

Published in Analytics Vidhya · 7 min read · Jan 17, 2021

Keywords: stepwise linear regression, higher-degree terms, interactions, AIC, BIC, correlation heat map, scatter plot

Complete R code on Colab: https://bit.ly/3oRq3cR

Linear regression (LR) is an essential yet often underrated model in ML.
LR offers a quick walk-through that prepares us for more sophisticated ML modeling and more complex analysis.
Furthermore, LR can serve as the baseline model against which the performance of more sophisticated ML models is evaluated.

Building on ML19, we now take a closer look at linear regression, using R to handle the toy dataset “fuel2001” (fuel data) from “Applied Linear Regression (4th ed.)” [3]. In Chapter 1 & Chapter 3, this notable textbook uses the dataset to concisely demonstrate the scatter plot matrix, summary statistics, correlation matrix and multiple linear regression in R.

ML19: The “Linear” in Linear Regression

Does this “linear” represent linear function or linear map?


Outline
(1) Using R in Colab
(2) Data Source: Fuel Consumption
(3) Data Preprocessing
3–1 Deleting a column
3–2 Transformation
3–3 Retaining needed columns
(4) Data Exploration
4–1 Descriptive statistics
4–2 Correlation matrix
4–3 Scatter plot matrix
4–4 Scatter plot
(5) Linear Regression
5–1 Information criteria: AIC vs. BIC
5–2 LR
5–3 LR with interactions
5–4 LR with interactions & higher-degree terms
5–5 Stepwise LR using lm()
5–6 Stepwise LR using lm() & BIC
5–7 Stepwise LR using glm() & BIC
(6) Summary
(7) References

(1) Using R in Colab [4]

To open a new R notebook in Colab, use the shorthand link colab.to/r.

You can learn from IRkernel demos, e.g., demo.ipynb.

(2) Data Source: Fuel Consumption [3][6]

The reader can download the “fuel2001.csv” dataset from the following page:
http://users.stat.umn.edu/~sandy/alr4ed/data/

Fuel Consumption
The goal of this example is to understand how fuel consumption varies over the 50 United States and the District of Columbia (Federal Highway Administration, 2001). Table 1.1 describes the variables to be used in this example; the data are given in the file fuel2001. The data were collected by the U.S. Federal Highway Administration.

1. Shape: 51 observations of 7 variables
2. Target: “FuelC”
3. Features: “Drivers”, “Income”, “Miles”, “MPC”, “Pop”, “Tax”

(3) Data Preprocessing

3–1 Deleting a column

# Read the CSV file; its first row holds the column names
fuel2001 = read.table("fuel2001.csv", sep = ",", header = TRUE)
head(fuel2001, n = 4)
# Drop the first column
fuel2001_2 = fuel2001[, -1]
names(fuel2001_2)

The first column is odd and useless for the analysis, so we remove it.

3–2 Transformation

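fuel2001_3 <- transform(fuel2001_2,
    Dlic = 1000 * Drivers / Pop,   # drivers per 1,000 people in Pop
    Fuel = 1000 * FuelC / Pop,     # fuel consumption relative to Pop, scaled by 1,000
    Income_2 = Income / 1000,      # income in thousands
    log_Miles = log(Miles))        # natural log of Miles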

This is our feature engineering step: transform() appends the new variables to the data frame while keeping the original columns.

3–3 Retaining needed columns

# Option 1: keep columns 7 to 11 by position
fuel2001_4 = fuel2001_3[, c(7,8,9,10,11)]
names(fuel2001_4)
# Option 2 (equivalent): drop the first six columns
fuel2001_4 = fuel2001_3[, -c(1:6)]
names(fuel2001_4)

Now the information of “fuel2001_4” is as follows:

1. Shape: 51 observations of 5 variables
2. Target: “Fuel”
3. Features: “Tax”, “Dlic”, “Income_2”, “log_Miles”

(4) Data Exploration [1]

4–1 Descriptive statistics

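# Structure and attributes of the data frame
str(fuel2001_4)
attributes(fuel2001_4)
# Storage mode, class and type of a single column
mode(fuel2001_4$Tax)
class(fuel2001_4$Tax)
typeof(fuel2001_4$Tax)
# Summary statistics, names and dimensions
summary(fuel2001_4)
colnames(fuel2001_4)
names(fuel2001_4)
rownames(fuel2001_4)
nrow(fuel2001_4)
head(fuel2001_4, n = 3)
tail(fuel2001_4, n = 3)
dim(fuel2001_4)
length(fuel2001_4)
# Correlation and covariance matrices
round(cor(fuel2001_4), 4)
round(var(fuel2001_4), 4) # OR cov(fuel2001_4)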

4–2 Correlation matrix

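# corrgram() comes from the corrgram package; install it first if needed
# install.packages("corrgram")
library(corrgram)
corrgram(round(cor(fuel2001_4), 4), order=TRUE, upper.panel=panel.cor)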

4–3 Scatter plot matrix
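
As a minimal sketch of this step, base R's pairs() draws a scatter plot matrix for every pair of columns in a data frame. The built-in mtcars data set is used here only as a stand-in so that the snippet runs without downloading anything; the name demo_df and the chosen columns are assumptions for illustration. With the article's data, pass fuel2001_4 instead.

```r
# Stand-in data (assumption): five numeric columns from the built-in mtcars set.
# Replace demo_df with fuel2001_4 to plot the article's variables.
demo_df <- mtcars[, c("mpg", "wt", "hp", "qsec", "drat")]

# One panel per pair of variables; pch and cex only control the point style
pairs(demo_df, pch = 19, cex = 0.6,
      main = "Scatter plot matrix (stand-in data)")
```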

4–4 Scatter plot
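
A single scatter plot of the target against one feature can be sketched with base R's plot() and a fitted line from abline(). Again, mtcars is only a stand-in (an assumption for illustration); with the article's data, the analogous call would use a formula such as Fuel ~ Tax and data = fuel2001_4.

```r
# Stand-in data (assumption): response mpg and one feature wt from mtcars
demo_xy <- mtcars[, c("mpg", "wt")]

# Scatter plot of the response against a single feature
plot(mpg ~ wt, data = demo_xy,
     xlab = "wt", ylab = "mpg",
     main = "Scatter plot with a least-squares line (stand-in data)")

# Overlay the simple linear regression line
simple_fit <- lm(mpg ~ wt, data = demo_xy)
abline(simple_fit, lty = 2)
```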

(5) Linear Regression

5–1 Information criteria: AIC vs. BIC [3][8]

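AIC = n log(RSS_{p_C} / n) + 2 p_C
BIC = n log(RSS_{p_C} / n) + log(n) p_C
Here n is the sample size, p_C is the number of parameters in the candidate model, and RSS_{p_C} is its residual sum of squares.
AIC is given by Sakamoto et al. (1986); BIC is given by Schwarz (1978).
Smaller values are preferred for both AIC & BIC.
Both criteria balance lack of fit against complexity: the first term shrinks as the fit improves, while the second term is a penalty that grows with the number of parameters and guards against overfitting.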

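The two criteria differ only in the penalty per parameter: 2 for AIC versus log(n) for BIC. Since log(n) exceeds 2 once n ≥ 8, BIC penalizes complexity more heavily for all but the smallest samples (here n = 51, so log(n) ≈ 3.93). As the sample size n increases, the AIC penalty becomes relatively small and may fail to rein in model complexity. In this scenario, BIC functions better.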
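
The two formulas can be checked by hand. The sketch below, which uses the built-in mtcars data as a stand-in (an assumption for illustration, as are the variable names), computes both criteria from the residual sum of squares and compares them with extractAIC(), the function step() relies on. It also shows why AIC() and BIC() report different numbers: they use the full Gaussian log-likelihood and count the error variance as an extra parameter, which shifts the value by a constant that depends only on n.

```r
# Stand-in model (assumption): any lm() fit works the same way
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)            # sample size
p   <- length(coef(fit))       # number of coefficients, intercept included
rss <- sum(residuals(fit)^2)   # residual sum of squares

# Criteria as written in the formulas above
aic_book <- n * log(rss / n) + 2 * p
bic_book <- n * log(rss / n) + log(n) * p

# extractAIC() returns c(edf, criterion); k sets the penalty per parameter
aic_step <- extractAIC(fit, k = 2)[2]
bic_step <- extractAIC(fit, k = log(n))[2]

# AIC() and BIC() differ from the values above by constants in n only
offset_aic <- n * (1 + log(2 * pi)) + 2
offset_bic <- n * (1 + log(2 * pi)) + log(n)

print(c(book = aic_book, extractAIC = aic_step, AIC = AIC(fit) - offset_aic))
print(c(book = bic_book, extractAIC = bic_step, BIC = BIC(fit) - offset_bic))
```

Because the offsets do not depend on the model, ranking models fitted to the same data gives the same order under either convention.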

5–2 LR

LR_fuel = lm(Fuel ~ Tax + Dlic + Income_2 + log_Miles, data=fuel2001_4)
summary(LR_fuel)
cat("AIC = ", AIC(LR_fuel), "\n", sep = "")
cat("BIC = ", BIC(LR_fuel), "\n", sep = "")

AIC = 577.086
BIC = 588.6769

5–3 LR with interactions

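# Tax:Dlic adds the interaction between Tax and Dlic
LR_fuel = lm(Fuel ~ Tax + Dlic + Tax:Dlic + Income_2 + log_Miles, data=fuel2001_4)
summary(LR_fuel)
cat("AIC = ", AIC(LR_fuel), "\n", sep = "")
cat("BIC = ", BIC(LR_fuel), "\n", sep = "")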

AIC = 575.6349
BIC = 589.1577
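
As a side note on the formula syntax, a:b denotes only the interaction term, whereas a*b is shorthand for a + b + a:b. A minimal sketch on the built-in mtcars data (a stand-in chosen for illustration, not the article's dataset) shows that the two spellings fit the same model:

```r
# Explicit main effects plus interaction
fit_colon <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# Shorthand: wt*hp expands to wt + hp + wt:hp
fit_star <- lm(mpg ~ wt * hp, data = mtcars)

# Both fits have the same coefficient names and values
print(names(coef(fit_colon)))
print(all.equal(coef(fit_colon), coef(fit_star)))
```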

5–4 LR with interactions & higher-degree terms

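# I(log_Miles^2) adds a squared term; I() protects ^ from formula interpretation
LR_fuel = lm(formula = Fuel ~ Tax + Dlic + Tax:Dlic + Income_2 + log_Miles + I(log_Miles^2), data = fuel2001_4)
summary(LR_fuel)
cat("AIC = ", AIC(LR_fuel), "\n", sep = "")
cat("BIC = ", BIC(LR_fuel), "\n", sep = "")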

AIC = 574.9257
BIC = 592.3121
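
The I() wrapper matters because ^ has a special meaning inside a formula (crossing of terms), so a bare x^2 does not add a squared term. The sketch below illustrates this on the built-in mtcars data, used as a stand-in for illustration only; poly(x, 2, raw = TRUE) is shown as an alternative spelling of the same quadratic fit.

```r
# Without I(): wt^2 is read as formula crossing and collapses to wt
fit_plain <- lm(mpg ~ wt + wt^2, data = mtcars)

# With I(): the arithmetic square of wt enters as its own regressor
fit_I <- lm(mpg ~ wt + I(wt^2), data = mtcars)

# Alternative: raw polynomial basis of degree 2
fit_poly <- lm(mpg ~ poly(wt, 2, raw = TRUE), data = mtcars)

print(names(coef(fit_plain)))   # no squared term appears
print(names(coef(fit_I)))       # includes I(wt^2)
print(all.equal(fitted(fit_I), fitted(fit_poly)))
```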

5–5 Stepwise LR using lm() [7]

null_model = lm(Fuel ~ 1, data=fuel2001_4)
full_model = lm(Fuel ~ Tax + Dlic + Income_2 + log_Miles + Tax:Dlic + Tax:Income_2 + Tax:log_Miles + Dlic:Income_2 + Dlic:log_Miles + Income_2:log_Miles + I(Tax^2) + I(Tax^3) + I(Tax^4) + I(Dlic^2) + I(Dlic^3) + I(Dlic^4) + I(Income_2^2) + I(Income_2^3) + I(Income_2^4) + I(log_Miles^2) + I(log_Miles^3) + I(log_Miles^4), data=fuel2001_4)
# direction = c("both", "backward", "forward")
# k=2 for AIC; k=log(nrow(fuel2001_4)) for BIC
model_step = step(null_model,
                  scope = list(lower=null_model, upper=full_model),
                  direction = "both", k= 2)

The default information criterion of step() is AIC.
k is the multiple of the number of degrees of freedom used for the penalty. Only k = 2 gives the genuine AIC; k = log(n) is sometimes referred to as BIC or SBC.
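
To see the effect of k side by side, the sketch below runs step() twice on a small stand-in problem built from the built-in mtcars data (the candidate terms and all names are assumptions for illustration). Here the scope is given as one-sided formulas, and trace = 0 suppresses the step-by-step log.

```r
# Stand-in problem (assumption): start from the intercept-only model
null_model <- lm(mpg ~ 1, data = mtcars)
n <- nrow(mtcars)

# Candidate terms: main effects, one interaction, two squared terms
upper_formula <- ~ wt + hp + qsec + wt:hp + I(wt^2) + I(hp^2)

# k = 2 gives AIC-based selection
step_aic <- step(null_model,
                 scope = list(lower = ~ 1, upper = upper_formula),
                 direction = "both", k = 2, trace = 0)

# k = log(n) gives BIC-based selection
step_bic <- step(null_model,
                 scope = list(lower = ~ 1, upper = upper_formula),
                 direction = "both", k = log(n), trace = 0)

# Compare the selected formulas
print(formula(step_aic))
print(formula(step_bic))
```

Since each parameter costs more under k = log(n), the BIC run often, though not necessarily, ends with fewer terms than the AIC run.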

After trial and error, the stepwise linear regression gives us the model as follows:

lm(formula = Fuel ~ I(Income_2^3) + Dlic + log_Miles + I(Tax^3) + I(Dlic^2), data = fuel2001_4)

AIC = 573.7994
BIC = 587.3221

5–6 Stepwise LR using lm() & BIC

null_model = lm(Fuel ~ 1, data=fuel2001_4)
full_model = lm(Fuel ~ Tax + Dlic + Income_2 + log_Miles + Tax:Dlic + Tax:Income_2 + Tax:log_Miles + Dlic:Income_2 + Dlic:log_Miles + Income_2:log_Miles + I(Tax^2) + I(Tax^3) + I(Tax^4) + I(Dlic^2) + I(Dlic^3) + I(Dlic^4) + I(Income_2^2) + I(Income_2^3) + I(Income_2^4) + I(log_Miles^2) + I(log_Miles^3) + I(log_Miles^4), data=fuel2001_4)
# direction = c("both", "backward", "forward")
# k=2 for AIC; k=log(nrow(fuel2001_4)) for BIC
model_step = step(null_model,
                  scope = list(lower=null_model, upper=full_model),
                  direction = "both", k= log(nrow(fuel2001_4)))

After trial and error, the stepwise linear regression gives us the model as follows:

lm(formula = Fuel ~ I(Income_2^3) + Dlic + I(Tax^3) + I(Dlic^2), data = fuel2001_4)

AIC = 575.0315
BIC = 586.6225

5–7 Stepwise LR using glm() & BIC

null_model = glm(Fuel ~ 1, data=fuel2001_4, family = gaussian(link = "identity"))
full_model = glm(Fuel ~ Tax + Dlic + Income_2 + log_Miles + Tax:Dlic + Tax:Income_2 + Tax:log_Miles + Dlic:Income_2 + Dlic:log_Miles + Income_2:log_Miles + I(Tax^2) + I(Tax^3) + I(Tax^4) + I(Dlic^2) + I(Dlic^3) + I(Dlic^4) + I(Income_2^2) + I(Income_2^3) + I(Income_2^4) + I(log_Miles^2) + I(log_Miles^3) + I(log_Miles^4), data=fuel2001_4, family = gaussian(link = "identity"))
# direction = c("both", "backward", "forward")
# k=2 for AIC; k=log(nrow(fuel2001_4)) for BIC
model_step = step(null_model,
                  scope = list(lower=null_model, upper=full_model),
                  direction = "both", k= log(nrow(fuel2001_4)))

glm() generalizes lm(): with family = gaussian(link = "identity") it fits the same model as lm(), whereas other families, such as binomial, enable logistic regression.
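
The equivalence between lm() and a Gaussian glm() with identity link can be checked directly. The following minimal sketch uses the built-in mtcars data as a stand-in (an assumption for illustration) and compares coefficients and information criteria of the two fits:

```r
# Same formula fitted with lm() and with a Gaussian, identity-link glm()
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(mpg ~ wt + hp, data = mtcars,
               family = gaussian(link = "identity"))

# Coefficients and information criteria agree up to numerical tolerance
print(all.equal(coef(fit_lm), coef(fit_glm)))
print(all.equal(AIC(fit_lm), AIC(fit_glm)))
print(all.equal(BIC(fit_lm), BIC(fit_glm)))
```

This is consistent with sections 5–6 and 5–7 reporting identical AIC and BIC values for the lm() and glm() versions of the selected model.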

After trial and error, the stepwise linear regression gives us the model as follows:

glm(formula = Fuel ~ I(Income_2^3) + Dlic + I(Tax^3) + I(Dlic^2), family = gaussian(link = "identity"), data = fuel2001_4)

AIC = 575.0315
BIC = 586.6225

(6) Summary

1. We probe into the toy dataset “fuel2001” from “Applied Linear Regression (4th ed.)” to implement linear regression using R. We not only reproduce the outcomes in Chapter 1 & Chapter 3 of the book, but also extend this toy example into a complete data analysis process.
2. In general, the data analysis process [2] comprises the following steps:

Ideation
Retrieval
Preparation
Exploration
Modeling
Presentation
Reproduction

In this article we carry out preprocessing, exploration and stepwise linear regression; in other words, we cover retrieval, preparation, exploration, modeling and presentation.

3. We experience the power of stepwise regression with interactions and higher-degree terms (degree>1), which is rarely mentioned in ML/DS books or articles on the Internet.

This is exactly where R prevails over Python. Stepwise linear regression with interactions and higher-degree terms (degree>1) lets us quickly select the desired features for explanation or prediction. These selected features can then be fed into more complex models such as SVM, random forest or XGBoost, saving a great amount of time.

The reader may check ML21, the companion article that conducts the same analysis in Python.

ML21: Linear Regression with Python

With higher-degree terms & interactions


(7) References

[1] Lander J.P. (2017). R for Everyone: Advanced Analytics and Graphics (2nd ed.). Massachusetts, MA: Addison-Wesley Professional.

[2] Heydt, M. (2017). Learning pandas: High-performance data manipulation and analysis in Python (2nd ed.). Birmingham, UK: Packt Publishing.

[3] Weisberg, S. (2014). Applied Linear Regression (4th ed.). New Jersey, NJ: John Wiley & Sons.

[4] korakot (2019). How to use R with Google Colaboratory? Retrieved from stackoverflow.com

[5] Author unidentified (n.d.). Scatter Plot Matrices - R Base Graphs. Retrieved from www.sthda.com

[6] RDocumentation (n.d.). fuel2001: Fuel Consumption. Retrieved from www.rdocumentation.org

(Chinese)

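[7] 陳景祥 (2018). R軟體：應用統計方法 [R Software: Applied Statistical Methods] (2nd ed.). Taipei: 東華書局.

[8] 钱魏Way (2020). 最优模型选择准则：AIC和BIC [Optimal model selection criteria: AIC and BIC]. Retrieved from
https://www.biaodianfu.com/aic-bic.html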


Written by Yu-Cheng (Morton) Kuo