Published in Analytics Vidhya

ML20: Stepwise Linear Regression with R

With higher-degree terms & interactions

Keywords: stepwise linear regression, higher-degree terms, interactions, AIC, BIC, correlation heat map, scatter plot

Complete R code on Colab: https://bit.ly/3oRq3cR

• Linear regression is an essential yet often underrated model in ML.
• LR offers a quick walk-through in preparation for implementing more sophisticated ML modeling and more complex analysis.
• Furthermore, LR can serve as the baseline model against which the performance of more sophisticated ML models is evaluated.

Building on ML19, we now take a closer look at linear regression, using R to handle the toy dataset “fuel2001” (fuel data) from “Applied Linear Regression (4th ed.)” [3]. In Chapter 1 and Chapter 3, this notable textbook uses the dataset to concisely demonstrate the scatter plot matrix, summary statistics, the correlation matrix, and multiple linear regression in R.

Outline
(1) Using R in Colab
(2) Data Source: Fuel Consumption

(3) Data Preprocessing
3–1 Deleting a column
3–2 Transformation
3–3 Retaining needed columns

(4) Data Exploration
4–1 Descriptive statistics
4–2 Correlation matrix
4–3 Scatter plot matrix
4–4 Scatter plot

(5) Linear Regression
5–1 Information criteria: AIC vs. BIC
5–2 LR
5–3 LR with interactions
5–4 LR with interactions & higher-degree terms
5–5 Stepwise LR using lm()
5–6 Stepwise LR using lm() & BIC
5–7 Stepwise LR using glm() & BIC

(6) Summary
(7) References

(1) Using R in Colab [4]

To start a new R notebook, use this link (shorthand: colab.to/r).

You can learn from IRkernel demos, e.g., demo.ipynb.

(2) Data Source: Fuel Consumption [3][6]

http://users.stat.umn.edu/~sandy/alr4ed/data/

Fuel Consumption
The goal of this example is to understand how fuel consumption varies over the 50 United States and the District of Columbia (Federal Highway Administration, 2001). Table 1.1 describes the variables to be used in this example; the data are given in the file fuel2001. The data were collected by the U.S. Federal Highway Administration.

• 1. Shape: 51 observations of 7 variables
• 2. Target: “FuelC”
• 3. Features: “Drivers”, “Income”, “Miles”, “MPC”, “Pop”, “Tax”

(3) Data Preprocessing

3–1 Deleting a column

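```r
# Read the CSV file and preview the first four rows
fuel2001 = read.table("fuel2001.csv", sep = ",", header = TRUE)
head(fuel2001, n = 4)

# Drop the first column and check the remaining column names
fuel2001_2 = fuel2001[, -1]
names(fuel2001_2)
```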

The first column is odd and of no use for the analysis, so we remove it.

3–2 Transformation

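```r
# Add rescaled and log-transformed variables
fuel2001_3 <- transform(fuel2001_2,
                        Dlic = 1000 * Drivers / Pop,  # Drivers per 1,000 Pop
                        Fuel = 1000 * FuelC / Pop,    # FuelC per 1,000 Pop
                        Income_2 = Income / 1000,     # Income divided by 1,000
                        log_Miles = log(Miles))       # natural log of Miles
```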

We do feature engineering, adding new variables through transform().

3–3 Retaining needed columns

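```r
# Option 1: keep columns 7 to 11 by position
fuel2001_4 = fuel2001_3[, c(7, 8, 9, 10, 11)]
names(fuel2001_4)

# Option 2 (equivalent): drop columns 1 to 6
fuel2001_4 = fuel2001_3[, -c(1:6)]
names(fuel2001_4)
```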

“fuel2001_4” now has the following shape:

• 1. Shape: 51 observations of 5 variables
• 2. Target: “Fuel”
• 3. Features: “Tax”, “Dlic”, “Income_2”, “log_Miles”
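
Selecting columns by position breaks silently if the column order ever changes. As a minimal sketch, the same preprocessing can select by name instead. The data frame `toy` below is a hypothetical three-row stand-in with made-up values, not the real fuel2001 data, and `keep` is an illustrative helper name.

```r
# Hypothetical stand-in for fuel2001_2 (all values are made up)
toy <- data.frame(
  Drivers = c(3000, 5000, 8000),
  FuelC   = c(2000, 2500, 6000),
  Income  = c(23000, 31000, 28000),
  Miles   = c(90000, 13000, 55000),
  MPC     = c(10000, 9000, 11000),
  Pop     = c(4000, 6000, 10000),
  Tax     = c(18, 8, 20)
)

# Same transformations as in section 3-2
toy_t <- transform(toy,
                   Dlic = 1000 * Drivers / Pop,
                   Fuel = 1000 * FuelC / Pop,
                   Income_2 = Income / 1000,
                   log_Miles = log(Miles))

# Select the modeling columns by name rather than by position
keep <- c("Tax", "Dlic", "Fuel", "Income_2", "log_Miles")
toy_4 <- toy_t[, keep]

print(names(toy_4))
print(dim(toy_4))
```

With this column order, name-based selection returns the same data frame as `toy_t[, 7:11]`, but it keeps working if columns are later added or reordered.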

(4) Data Exploration [1]

4–1 Descriptive statistics

```r
# Structure, attributes, and storage type
str(fuel2001_4)
attributes(fuel2001_4)
mode(fuel2001_4$Tax)
class(fuel2001_4$Tax)
typeof(fuel2001_4$Tax)

# Summary statistics and labels
summary(fuel2001_4)
colnames(fuel2001_4)
names(fuel2001_4)
rownames(fuel2001_4)

# Size and first/last rows
nrow(fuel2001_4)
head(fuel2001_4, n = 3)
tail(fuel2001_4, n = 3)
dim(fuel2001_4)
length(fuel2001_4)

# Correlation and covariance matrices
round(cor(fuel2001_4), 4)
round(var(fuel2001_4), 4)  # OR cov(fuel2001_4)
```

4–2 Correlation matrix

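```r
# corrgram() comes from the corrgram package
# (run install.packages("corrgram") once if it is not installed)
library(corrgram)
corrgram(round(cor(fuel2001_4), 4), order = TRUE, upper.panel = panel.cor)
```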

(5) Linear Regression

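5–1 Information criteria: AIC vs. BIC [3][8]

• $\mathrm{AIC} = n \log(\mathrm{RSS}_{p_C}/n) + 2\,p_C$
• $\mathrm{BIC} = n \log(\mathrm{RSS}_{p_C}/n) + \log(n)\,p_C$
• Here $n$ is the sample size, $\mathrm{RSS}_{p_C}$ is the residual sum of squares of the candidate model, and $p_C$ is its number of estimated coefficients.
• AIC is given by Sakamoto et al. (1986); BIC is given by Schwarz (1978).
• Smaller values are preferred for AIC & BIC.
• Information criteria balance lack of fit against model complexity: the second term is a penalty on the number of parameters, which discourages overfitting.

AIC's penalty per parameter is fixed at 2, whereas BIC's is log(n), which exceeds 2 once n ≥ 8 and keeps growing with the sample size. As n increases, AIC's penalty therefore becomes relatively small and may fail to curb model complexity. In that scenario, BIC tends to select simpler models.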
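
The formulas above can be checked numerically. The sketch below uses R's built-in `mtcars` data as a stand-in (an assumption, so that the block runs without downloading fuel2001). It shows that `extractAIC()`, which `step()` relies on, matches the textbook formulas, while `AIC()` and `BIC()` use the full Gaussian log-likelihood and differ from them by a constant that depends only on n. Model rankings are therefore the same, even though the printed numbers differ.

```r
# Stand-in data: built-in mtcars (assumption; not the fuel2001 data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)   # residual sum of squares
p   <- length(coef(fit))       # number of estimated coefficients

# Textbook formulas
aic_book <- n * log(rss / n) + 2 * p
bic_book <- n * log(rss / n) + log(n) * p

# extractAIC() returns c(edf, criterion); k sets the penalty per parameter
aic_step <- extractAIC(fit, k = 2)[2]
bic_step <- extractAIC(fit, k = log(n))[2]

# AIC()/BIC() add a constant n * (log(2 * pi) + 1) and count the
# error variance as one extra parameter
const <- n * (log(2 * pi) + 1)

print(c(book = aic_book, extractAIC = aic_step, AIC = AIC(fit)))
print(c(book = bic_book, extractAIC = bic_step, BIC = BIC(fit)))
print(all.equal(AIC(fit), aic_book + const + 2))
print(all.equal(BIC(fit), bic_book + const + log(n)))
```

This also explains why the criterion values printed in a `step()` trace do not match the values returned by `AIC()` or `BIC()` for the same model.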

5–2 LR

```r
# Main effects only
LR_fuel = lm(Fuel ~ Tax + Dlic + Income_2 + log_Miles, data = fuel2001_4)
summary(LR_fuel)
cat("AIC = ", AIC(LR_fuel), "\n", sep = "")
cat("BIC = ", BIC(LR_fuel), "\n", sep = "")
```
• AIC = 577.086
• BIC = 588.6769

5–3 LR with interactions

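```r
# Main effects plus the Tax-by-Dlic interaction
LR_fuel = lm(Fuel ~ Tax + Dlic + Tax:Dlic + Income_2 + log_Miles, data = fuel2001_4)
summary(LR_fuel)
cat("AIC = ", AIC(LR_fuel), "\n", sep = "")
cat("BIC = ", BIC(LR_fuel), "\n", sep = "")
```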
• AIC = 575.6349
• BIC = 589.1577

5–4 LR with interactions & higher-degree terms

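```r
# Interaction plus a squared term; I() protects ^ from formula interpretation
LR_fuel = lm(formula = Fuel ~ Tax + Dlic + Tax:Dlic + Income_2 + log_Miles +
               I(log_Miles^2), data = fuel2001_4)
summary(LR_fuel)
cat("AIC = ", AIC(LR_fuel), "\n", sep = "")
cat("BIC = ", BIC(LR_fuel), "\n", sep = "")
```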
• AIC = 574.9257
• BIC = 592.3121
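
How `a:b` and `I(a^2)` are translated into regressors can be inspected with `model.matrix()`. The following is a small sketch on a hypothetical data frame `toy` (made-up values and column names). It also shows why `I()` is needed: inside a formula, a bare `a^2` is read as formula crossing and collapses to `a`.

```r
# Hypothetical data (values are made up)
toy <- data.frame(y = c(1, 3, 2, 5, 4),
                  a = c(1, 2, 3, 4, 5),
                  b = c(2, 1, 4, 3, 5))

# Design matrix for main effects, an interaction, and a squared term
X <- model.matrix(y ~ a + b + a:b + I(a^2), data = toy)
print(colnames(X))

# The interaction column is the elementwise product of a and b
print(all.equal(unname(X[, "a:b"]), toy$a * toy$b))

# The I(a^2) column is the arithmetic square of a
print(all.equal(unname(X[, "I(a^2)"]), toy$a^2))

# Without I(), a^2 is formula crossing and adds no new column
X_no_I <- model.matrix(y ~ a + a^2, data = toy)
print(colnames(X_no_I))
```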

5–5 Stepwise LR using lm() [7]

```r
# Smallest and largest models that the search may visit
null_model = lm(Fuel ~ 1, data = fuel2001_4)
full_model = lm(Fuel ~ Tax + Dlic + Income_2 + log_Miles +
                  Tax:Dlic + Tax:Income_2 + Tax:log_Miles +
                  Dlic:Income_2 + Dlic:log_Miles + Income_2:log_Miles +
                  I(Tax^2) + I(Tax^3) + I(Tax^4) +
                  I(Dlic^2) + I(Dlic^3) + I(Dlic^4) +
                  I(Income_2^2) + I(Income_2^3) + I(Income_2^4) +
                  I(log_Miles^2) + I(log_Miles^3) + I(log_Miles^4),
                data = fuel2001_4)

# direction = c("both", "backward", "forward")
# k = 2 for AIC; k = log(n) for BIC, where n = nrow(fuel2001_4)
model_step = step(null_model,
                  scope = list(lower = null_model, upper = full_model),
                  direction = "both", k = 2)
```
• The default information criterion of step() is AIC.
• k is the multiple of the number of degrees of freedom used for the penalty. Only k = 2 gives the genuine AIC; k = log(n) is sometimes referred to as BIC or SBC.
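
To see the effect of `k` in isolation, the sketch below runs the same search twice on the built-in `mtcars` data (a stand-in chosen so the block is self-contained; the scope terms are illustrative and not taken from the article). `trace = 0` suppresses the step-by-step log.

```r
# Stand-in data: built-in mtcars (assumption; not the fuel2001 data)
null_model <- lm(mpg ~ 1, data = mtcars)
upper      <- mpg ~ wt + hp + qsec + wt:hp + I(wt^2)
scope      <- list(lower = ~ 1, upper = upper)
n          <- nrow(mtcars)

# Same search, two penalties: k = 2 (AIC) and k = log(n) (BIC)
step_aic <- step(null_model, scope = scope, direction = "both",
                 k = 2, trace = 0)
step_bic <- step(null_model, scope = scope, direction = "both",
                 k = log(n), trace = 0)

print(formula(step_aic))
print(formula(step_bic))

# Number of selected terms under each criterion
print(c(AIC = length(attr(terms(step_aic), "term.labels")),
        BIC = length(attr(terms(step_bic), "term.labels"))))
```

Because `step()` is a greedy search, the result is a local optimum for the chosen criterion and not necessarily the best model within the scope.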

After trial and error, the stepwise linear regression gives us the model as follows:

lm(formula = Fuel ~ I(Income_2^3) + Dlic + log_Miles + I(Tax^3) + I(Dlic^2), data = fuel2001_4)

• AIC = 573.7994
• BIC = 587.3221

5–6 Stepwise LR using lm() & BIC

```r
# Smallest and largest models that the search may visit
null_model = lm(Fuel ~ 1, data = fuel2001_4)
full_model = lm(Fuel ~ Tax + Dlic + Income_2 + log_Miles +
                  Tax:Dlic + Tax:Income_2 + Tax:log_Miles +
                  Dlic:Income_2 + Dlic:log_Miles + Income_2:log_Miles +
                  I(Tax^2) + I(Tax^3) + I(Tax^4) +
                  I(Dlic^2) + I(Dlic^3) + I(Dlic^4) +
                  I(Income_2^2) + I(Income_2^3) + I(Income_2^4) +
                  I(log_Miles^2) + I(log_Miles^3) + I(log_Miles^4),
                data = fuel2001_4)

# direction = c("both", "backward", "forward")
# k = 2 for AIC; k = log(n) for BIC, where n = nrow(fuel2001_4)
model_step = step(null_model,
                  scope = list(lower = null_model, upper = full_model),
                  direction = "both", k = log(nrow(fuel2001_4)))
```

After trial and error, the stepwise linear regression gives us the model as follows:

lm(formula = Fuel ~ I(Income_2^3) + Dlic + I(Tax^3) + I(Dlic^2), data = fuel2001_4)

• AIC = 575.0315
• BIC = 586.6225

5–7 Stepwise LR using glm() & BIC

```r
# Same scope as before, fitted with glm() and a Gaussian family
null_model = glm(Fuel ~ 1, data = fuel2001_4,
                 family = gaussian(link = "identity"))
full_model = glm(Fuel ~ Tax + Dlic + Income_2 + log_Miles +
                   Tax:Dlic + Tax:Income_2 + Tax:log_Miles +
                   Dlic:Income_2 + Dlic:log_Miles + Income_2:log_Miles +
                   I(Tax^2) + I(Tax^3) + I(Tax^4) +
                   I(Dlic^2) + I(Dlic^3) + I(Dlic^4) +
                   I(Income_2^2) + I(Income_2^3) + I(Income_2^4) +
                   I(log_Miles^2) + I(log_Miles^3) + I(log_Miles^4),
                 data = fuel2001_4, family = gaussian(link = "identity"))

# direction = c("both", "backward", "forward")
# k = 2 for AIC; k = log(n) for BIC, where n = nrow(fuel2001_4)
model_step = step(null_model,
                  scope = list(lower = null_model, upper = full_model),
                  direction = "both", k = log(nrow(fuel2001_4)))
```

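glm() generalizes lm(): with family = gaussian(link = "identity") it fits the same model as lm(), while other families extend it to further model types, e.g., the binomial family gives logistic regression.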
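
The relationship between the two functions can be verified directly. This is a minimal sketch on the built-in `mtcars` data (a stand-in, not the fuel data): a Gaussian `glm()` with identity link reproduces the `lm()` coefficients and AIC, and switching the family to binomial turns the same call into a logistic regression.

```r
# Stand-in data: built-in mtcars (assumption; not the fuel2001 data)
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(mpg ~ wt + hp, data = mtcars,
               family = gaussian(link = "identity"))

# Same coefficients and the same AIC from both fits
print(all.equal(coef(fit_lm), coef(fit_glm)))
print(all.equal(AIC(fit_lm), AIC(fit_glm)))

# Changing the family gives logistic regression for a 0/1 response
# (am is the 0/1 transmission indicator in mtcars)
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
print(coef(fit_logit))
```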

After trial and error, the stepwise linear regression gives us the model as follows:

glm(formula = Fuel ~ I(Income_2^3) + Dlic + I(Tax^3) + I(Dlic^2), family = gaussian(link = "identity"), data = fuel2001_4)

• AIC = 575.0315
• BIC = 586.6225

(6) Summary

1. We examine the toy dataset “fuel2001” from “Applied Linear Regression (4th ed.)” to implement linear regression in R. We not only reproduce the outcomes in Chapter 1 and Chapter 3 of the book, but also extend this toy example into a complete data analysis process.
2. In general, the data analysis process [2] comprises the following steps:
• Ideation
• Retrieval
• Preparation
• Exploration
• Modeling
• Presentation
• Reproduction

In this article we go from preprocessing and exploration to stepwise linear regression. In other words, we cover retrieval, preparation, exploration, modeling, and presentation.

3. We experience the power of stepwise regression with interactions and higher-degree terms (degree>1), which is rarely mentioned in ML/DS books or articles on the Internet.

This is exactly where R prevails over Python. Stepwise linear regression with interactions and higher-degree terms (degree>1) lets us quickly select features for explanation or prediction. These selected features can then be fed into more complex models such as SVM, random forest, or XGBoost, saving a great amount of time.
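
As a sketch of that hand-off, the selected terms and their numeric design matrix can be pulled out of the fitted stepwise model. The example again uses the built-in `mtcars` data and an illustrative scope (both assumptions), and the names `selected`, `X`, and `y` are hypothetical.

```r
# Stand-in data and scope (assumptions; not the fuel2001 analysis)
null_model <- lm(mpg ~ 1, data = mtcars)
upper      <- mpg ~ wt + hp + qsec + wt:hp + I(wt^2)
model_step <- step(null_model, scope = list(lower = ~ 1, upper = upper),
                   direction = "both", trace = 0)

# Names of the terms kept by the stepwise search
selected <- attr(terms(model_step), "term.labels")
print(selected)

# Numeric feature matrix for the selected terms (intercept column dropped)
X <- model.matrix(model_step)[, -1, drop = FALSE]
y <- mtcars$mpg

print(dim(X))
print(head(X, n = 3))
```

`X` already contains the interaction and higher-degree columns, so it can be passed as a plain numeric matrix to any learner that accepts one.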

The reader may check the corresponding article using Python conducting the same analysis in ML21.

(7) References

[1] Lander J.P. (2017). R for Everyone: Advanced Analytics and Graphics (2nd ed.). Massachusetts, MA: Addison-Wesley Professional.

[2] Heydt, M. (2017). Learning pandas: High-performance data manipulation and analysis in Python (2nd ed.). Birmingham, UK: Packt Publishing.

[3] Weisberg, S. (2014). Applied Linear Regression (4th ed.). New Jersey, NJ: John Wiley & Sons.

[5] Unidentified (Unidentified). Scatter Plot Matrices — R Base Graphs. Retrieved from

(Chinese)

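[7] 陳景祥 (2018). R軟體：應用統計方法 [R Software: Applied Statistical Methods] (2nd ed.). Taipei: 東華書局.

[8] 钱魏Way (2020). 最优模型选择准则：AIC和BIC [Criteria for Optimal Model Selection: AIC and BIC]. Retrieved from
https://www.biaodianfu.com/aic-bic.html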

Yu-Cheng Kuo

ML/DS using Python & R. A Taiwanese earned MBA from NCCU and BS from NTHU with MATH major & ECON minor. Email: yc.kuo.28@gmail.com