# ML21: Linear Regression with Python

## With higher-degree terms & interactions

**Keywords**: linear regression, higher-degree terms, interactions, AIC, BIC, correlation heat map, scatter plot

**Complete Python code on Colab**: https://bit.ly/39CEuve

- Linear regression is an essential yet often underrated model in ML.
- LR offers a quick walk-through **in preparation for** implementing more sophisticated ML modeling and more complex analysis.
- Furthermore, LR can serve as the **baseline model** to evaluate the performance of more sophisticated ML models.

Based on ML20, which uses R to run a chain of analyses ending in stepwise linear regression, we try to reproduce the outcomes of ML20 in Python. The reader may also check ML19 for more background knowledge.

We can find most of the corresponding functions in Python; however, **we can’t find a handy stepwise LR function in Python** [5][8][10][11][12][13][14][15][17][20].

We use Python to handle the toy dataset “fuel2001” (fuel data) given by “*Applied Linear Regression (4th ed.)*” [4]. In Chapter 1 & Chapter 3, this notable textbook leverages this toy dataset to concisely demonstrate the *scatter plot matrix*, *summary statistics*, *correlation matrix* and *multiple linear regression* via R.

Outline

(1)Data Source: Fuel Consumption

(2)Data Preprocessing

2–1 Deleting a column

2–2 Transformation

2–3 Retain needed columns

(3)Data Exploration

3–1 Descriptive statistics

3–2 Scatter plot matrix

3–3 Correlation matrix & heat map

3–4 Scatter plot

3–5 Box plot & subplot

(4)Linear Regression

4–1 Information criteria: AIC vs. BIC

4–2 LR

4–3 LR with interactions

4–4 LR with interactions & higher-degree terms

4–5 Run the best model given by R’s step( ) in ML20

(5)Summary

(6)References

# (1) Data Source: Fuel Consumption [4]

The reader could download the “fuel2001.csv” dataset below.

http://users.stat.umn.edu/~sandy/alr4ed/data/

Fuel Consumption

The goal of this example is to understand how fuel consumption varies over the 50 United States and the District of Columbia (Federal Highway Administration, 2001). Table 1.1 describes the variables to be used in this example; the data are given in the file fuel2001. The data were collected by the U.S. Federal Highway Administration.

- 1. Shape: 51 observations of 7 variables
- 2. Target: “FuelC”
- 3. Features: “Drivers”, “Income”, “Miles”, “MPC”, “Pop”, “Tax”

# (2) Data Preprocessing [3]

# 2–1 Deleting a column

import pandas as pd

fuel2001 = pd.read_csv("fuel2001.csv")

fuel2001_2 = fuel2001.drop(['Unnamed: 0'], axis = 1)

fuel2001_2.head(n=3)

We find an odd and useless column, “Unnamed: 0”, so we remove it.

# 2–2 Transformation

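import numpy as np

Dlic = 1000 * fuel2001_2.Drivers / fuel2001_2.Pop

Fuel = 1000 * fuel2001_2.FuelC / fuel2001_2.Pop

Income_2 = fuel2001_2.Income / 1000

log_Miles = np.log(fuel2001_2.Miles)

fuel2001_2['Dlic'] = Dlic

fuel2001_2['Fuel'] = Fuel

fuel2001_2['Income_2'] = Income_2

fuel2001_2['log_Miles'] = log_Miles

We do feature engineering by adding four new variables: “Dlic” and “Fuel” rescale “Drivers” and “FuelC” by population (times 1000), “Income_2” expresses “Income” in thousands, and “log_Miles” is the natural logarithm of “Miles”.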

# 2–3 Retaining needed columns

`fuel2001_3 = fuel2001_2.iloc[ : , [6,7,8,9,10] ] # new data frame`

Now the information of “fuel2001_3” is as follows:

- 1. Shape: 51 observations of 5 variables
- 2. Target: “Fuel”
- 3. Features: “Tax”, “Dlic”, “Income_2”, “log_Miles”

# (3) Data Exploration

# 3–1 Descriptive statistics [9][19]

print(fuel2001_3.shape)

print(len(fuel2001_3))

print(fuel2001_3.index)

print(fuel2001_3.columns)

print(fuel2001_3.head(n=3))

print(fuel2001_3.tail(n=3))

fuel2001_3.info()  # info() prints its report directly and returns None

summary = fuel2001_3.describe()

print(summary)

pd.set_option('display.max_columns', None)  # display all columns

pd.set_option('display.max_rows', None)  # display all rows

summary_2 = fuel2001_3.describe().transpose()

print(summary_2)

type(summary_2)  # a pandas DataFrame

# 3–2 Scatter plot matrix [1][7]

import matplotlib.pyplot as plt

import seaborn as sns

sns.set(font_scale=1.2)

sns.pairplot(fuel2001_3, kind= 'scatter', height= 2)

plt.show();

sns.set(font_scale=1.2)

sns.pairplot(fuel2001_3, kind= 'reg', height= 2) # sns.pairplot(df, hue='class')

plt.show();

sns.set(font_scale=1.2)

sns.pairplot(fuel2001_3, diag_kind= 'kde', height= 2)

plt.show();

sns.set(font_scale=1.2)

sns.pairplot(fuel2001_3, diag_kind= 'kde', plot_kws={'alpha':0.2}, height= 2)

plt.show();

# 3–3 Correlation matrix & heat map [6]

plt.subplots(figsize=(10,8))

sns.set(font_scale=1.9)

matrix = np.triu(fuel2001_3.corr())

sns.heatmap(fuel2001_3.corr(), annot=True, mask=matrix, cmap= 'rocket');

plt.subplots(figsize=(10,8))

sns.set(font_scale=1.9)

mask = np.tril(fuel2001_3.corr())

sns.heatmap(fuel2001_3.corr(), annot=True, mask=mask, cmap= 'viridis');
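The `mask` argument in the heat maps above hides every cell whose mask entry is truthy. Passing `np.triu(corr)` works because the upper triangle of a correlation matrix is usually non-zero, but a correlation of exactly 0 there would stay visible. The following is a minimal sketch, using a small made-up correlation matrix (not the fuel data), that compares this value-based mask with an explicit boolean mask; the variable names are illustrative only.

```python
import numpy as np

# Made-up 3x3 correlation matrix with one exact zero, for illustration only.
corr = np.array([
    [1.0, 0.0, -0.3],
    [0.0, 1.0, 0.5],
    [-0.3, 0.5, 1.0],
])

# Value-based mask, as in the article: upper-triangle values, zeros elsewhere.
mask_from_values = np.triu(corr)

# Explicit boolean mask: True on and above the diagonal, whatever the values are.
mask_bool = np.triu(np.ones_like(corr, dtype=bool))

print(mask_from_values.astype(bool))  # the (0, 1) cell is False: it would not be hidden
print(mask_bool)                      # the (0, 1) cell is True: it is hidden
```

Either mask can be passed to `sns.heatmap(..., mask=...)`; using `np.triu(..., k=1)` instead keeps the diagonal visible.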

# 3–4 Scatter plot

`plt.subplots(figsize=(6,5)) `

sns.regplot(x='Dlic', y='Fuel', data = fuel2001_3);

plt.title('Scatter plot & regression');

plt.show();

# 3–5 Box plot & subplot

plt.figure(figsize=(2.5,5))

sns.boxplot(data= fuel2001_3[['Dlic','Fuel']])

plt.show();

plt.subplots(figsize=(4,5))

sns.boxplot(data= fuel2001_3[['Tax','Income_2','log_Miles']])

plt.show();

plt.subplots(figsize=(8,5))

plt.subplot(121)

sns.boxplot(data= fuel2001_3[['Dlic','Fuel']])

plt.subplot(122)

sns.boxplot(data= fuel2001_3[['Tax','Income_2','log_Miles']])

plt.show();

# (4) Linear Regression

# 4–1 Information criteria: AIC vs. BIC

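- $\mathrm{AIC} = n \log(\mathrm{RSS}_{p_C}/n) + 2\,p_C$
- $\mathrm{BIC} = n \log(\mathrm{RSS}_{p_C}/n) + \log(n)\,p_C$
- Here $n$ is the sample size, $p_C$ is the number of parameters in the candidate model, and $\mathrm{RSS}_{p_C}$ is its residual sum of squares.
- AIC is given by Sakamoto et al. (1986); BIC is given by Schwarz (1978).
- Smaller values are preferred for AIC & BIC.
- Information criteria balance *lack of fit* against *complexity*: the complexity terms are essentially penalties that prevent the model from overfitting.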

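When the **sample size n** increases, AIC’s penalty stays at 2 per parameter while BIC’s penalty of log(n) per parameter keeps growing (log(n) > 2 once n ≥ 8). The penalty of AIC thus becomes relatively small and may fail to reduce the complexity of the model. In this scenario, BIC tends to function better.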
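To see the two penalties at work, here is a minimal sketch that evaluates the two formulas above for two hypothetical candidate models. The RSS values are made up for illustration (they are not results from the fuel data); only n = 51 matches this article.

```python
import math

def aic(rss, n, p):
    """AIC = n * log(RSS / n) + 2 * p."""
    return n * math.log(rss / n) + 2 * p

def bic(rss, n, p):
    """BIC = n * log(RSS / n) + log(n) * p."""
    return n * math.log(rss / n) + math.log(n) * p

n = 51  # sample size of the fuel data

# Hypothetical candidates: the larger model fits better (lower RSS)
# but uses twice as many parameters.
small = {"p": 4, "rss": 250000.0}
large = {"p": 8, "rss": 200000.0}

for name, m in (("small", small), ("large", large)):
    print(name,
          "AIC =", round(aic(m["rss"], n, m["p"]), 2),
          "BIC =", round(bic(m["rss"], n, m["p"]), 2))
```

With these made-up numbers AIC prefers the larger model while BIC prefers the smaller one, because BIC charges log(51) ≈ 3.93 per parameter instead of 2.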

# 4–2 LR [2]

import pandas as pd

import statsmodels.formula.api as sm

LR_fuel2001 = sm.ols(formula = " Fuel ~ Tax + Dlic + Income_2 + log_Miles", data = fuel2001_3)

print(LR_fuel2001.fit().summary())

AIC = np.round(LR_fuel2001.fit().aic, decimals=4)

BIC = np.round(LR_fuel2001.fit().bic, decimals=4)

print("AIC = {}".format(AIC))

print("BIC = {}".format(BIC))

- AIC = 575.086
- BIC = 584.7451

# 4–3 LR with interactions

LR_fuel2001 = sm.ols(formula = " Fuel ~ Tax + Dlic + Tax:Dlic + Income_2 + log_Miles", data = fuel2001_3)

print(LR_fuel2001.fit().summary())

AIC = np.round(LR_fuel2001.fit().aic, decimals=4)

BIC = np.round(LR_fuel2001.fit().bic, decimals=4)

print("AIC = {}".format(AIC))

print("BIC = {}".format(BIC))

- AIC = 573.6349
- BIC = 585.2259

# 4–4 LR with interactions & higher-degree terms

LR_fuel2001 = sm.ols(formula = " Fuel ~ Tax + Dlic + Tax:Dlic + Income_2 + log_Miles + I(log_Miles**2) + I(log_Miles**3)", data = fuel2001_3)

print(LR_fuel2001.fit().summary())

AIC = np.round(LR_fuel2001.fit().aic, decimals=4)

BIC = np.round(LR_fuel2001.fit().bic, decimals=4)

print("AIC = {}".format(AIC))

print("BIC = {}".format(BIC))

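- AIC = 573.6349
- BIC = 585.2259
- Note: these figures are identical to those in 4–3. The gap BIC − AIC = 11.591 equals 6 × (log 51 − 2), which corresponds to a 6-parameter model, whereas this model has 8 parameters. They therefore appear to have been carried over from 4–3; rerun the Colab code for the values of this model.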

# 4–5 Run the best model given by R’s step( ) in ML20

LR_fuel2001 = sm.ols(formula = "Fuel ~ I(Income_2**3) + Dlic + log_Miles + I(Tax**3) + I(Dlic**2)", data = fuel2001_3)

print(LR_fuel2001.fit().summary())

AIC = np.round(LR_fuel2001.fit().aic, decimals=4)

BIC = np.round(LR_fuel2001.fit().bic, decimals=4)

print("AIC = {}".format(AIC))

print("BIC = {}".format(BIC))

- AIC = 571.7994
- BIC = 583.3903
- **Python doesn’t have a handy stepwise LR function as R does**, so personally I would like to turn to R for help [5][8][10][11][12][13][14][15][17][20].
- Here we run the best model given by R’s step( ) in ML20: https://merscliche.medium.com/ml20-abb54a435b3
- This best model of R’s step( ) is indeed better than the models we fitted previously, although the AIC values computed by Python and by R are not directly comparable.
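Why the two AIC values differ can be sketched in a few lines. To my understanding, R’s step( ) relies on the RSS-based form n log(RSS/n) + 2p shown in 4–1, whereas the `.aic` attribute of statsmodels is computed from the Gaussian log-likelihood as −2 log L + 2p. Under that assumption, the two differ by the constant n(1 + log 2π), which does not depend on the model, so the ranking of models on the same data is unaffected. The toy data and helper names below are made up for illustration.

```python
import math

def ols_rss(x, y):
    """Residual sum of squares of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

def aic_rss(rss, n, p):
    """RSS-based AIC: n * log(RSS / n) + 2 * p."""
    return n * math.log(rss / n) + 2 * p

def aic_loglik(rss, n, p):
    """Log-likelihood-based AIC: -2 * log L + 2 * p, with Gaussian errors."""
    loglik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return -2 * loglik + 2 * p

# Made-up toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.3]
n, p = len(x), 2  # intercept and slope

rss = ols_rss(x, y)
offset = aic_loglik(rss, n, p) - aic_rss(rss, n, p)
print(offset, n * (1 + math.log(2 * math.pi)))  # the two numbers agree
```

For the fuel data (n = 51) this constant is roughly 144.7, so comparisons should be made within one convention rather than across the two.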

# (5) Summary

1. Overall, we reproduce in Python the same results as ML20 does in R.

2. In general, the data analysis process [2] comprises the following steps:

- Ideation
- Retrieval
- Preparation
- Exploration
- Modeling
- Presentation
- Reproduction

In this article we go from preprocessing and exploration to linear regression. In other words, we cover retrieval, preparation, exploration, modeling and presentation.

3. We didn’t experience the power of **stepwise regression with interactions and higher-degree terms (degree>1)**, which is rarely mentioned in ML/DS books or articles on the Internet, since **Python doesn’t have a handy stepwise LR function as R does**, so personally I would like to turn to R for help when conducting linear regression [5][8][10][11][12][13][14][15][17][20].
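Although no ready-made function exists, a basic forward selection by AIC can be written by hand. The following is only a minimal sketch under stated assumptions: it uses NumPy least squares and the RSS-based AIC from 4–1, runs on synthetic data rather than the fuel data, and only adds terms, so it is simpler than R’s step( ), which can also drop terms. The function names `aic_ols` and `forward_select` are hypothetical.

```python
import numpy as np

def aic_ols(X, y):
    """RSS-based AIC, n * log(RSS / n) + 2 * p, for a least-squares fit of y on X."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * p

def forward_select(features, y):
    """Greedy forward selection: repeatedly add the term that lowers AIC the most."""
    X = np.ones((len(y), 1))  # start from the intercept-only model
    best = aic_ols(X, y)
    selected = []
    improved = True
    while improved:
        improved = False
        # AIC of every model obtained by adding one not-yet-selected term.
        scores = {name: aic_ols(np.column_stack([X, col]), y)
                  for name, col in features.items() if name not in selected}
        if scores:
            name = min(scores, key=scores.get)
            if scores[name] < best:
                best = scores[name]
                selected.append(name)
                X = np.column_stack([X, features[name]])
                improved = True
    return selected, best

# Synthetic data: y depends on x1 and x2 only.
rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 + 3.0 * x1 - 2.0 * x2 + rng.normal(scale=0.5, size=n)

# Candidate terms, including an interaction and a higher-degree term.
features = {"x1": x1, "x2": x2, "x3": x3, "x1:x2": x1 * x2, "x1^2": x1 ** 2}

selected, best = forward_select(features, y)
print(selected, round(best, 2))
```

On real data the candidate dictionary could hold the columns of “fuel2001_3” together with their products and powers; whether the result matches the model chosen in ML20 would need to be checked.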

# (6) References

[1] McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd ed.). California, CA: O’Reilly Media.

[2] Lander J.P. (2017). R for Everyone: Advanced Analytics and Graphics (2nd ed.). Massachusetts, MA: Addison-Wesley Professional.

[3] Heydt, M. (2017). Learning pandas: High-performance data manipulation and analysis in Python (2nd ed.). Birmingham, UK: Packt Publishing.

[4] Weisberg, S. (2014). Applied Linear Regression (4th ed.). New Jersey, NJ: John Wiley & Sons.

[5] Ramkumar, A. (2020). A Beginner’s Guide to Stepwise Multiple Linear Regression. Retrieved from

[6] Anita, O. (2019). Seaborn Heatmaps: 13 Ways to Customize Correlation Matrix Visualizations. Retrieved from

[7] Koehrsen, W. (2018). Visualizing Data with Pairs Plots in Python. Retrieved from

[8] 1313e (2016). Selecting the best combination of variables for regression model based on reg score. Retrieved from

[9] Hepner, T. (2016). R summary() equivalent in numpy. Retrieved from

[10] Schumacher, A. (2015). Stepwise Regression in Python. Retrieved from

[11] [Unidentified] (2015). Forward Selection with statsmodels. Retrieved from

[12] Prettenhofer, P. (2014). Multiple Regression Using Statsmodels. Retrieved from

[13] PyPI (Unidentified). stepwise-regression 1.0.3. Retrieved from

[14] DataSklr (Unidentified). Feature Selection with Python. Retrieved from

[15] jcrouser (Unidentified). Subset Selection in Python. Retrieved from

[16] RDocumentation (Unidentified). fuel2001: Fuel Consumption. Retrieved from

[17] [Unidentified] (Unidentified). Stepwise regression in Python. Retrieved from

## (Chinese)

[18] 钱魏Way (2020). Optimal model selection criteria: AIC and BIC. Retrieved from https://www.biaodianfu.com/aic-bic.html

[19] shangyj17 (2018). Python: how to print long data in full. Retrieved from

[20] pku_xfy (2017). Can Python do stepwise regression? Retrieved from https://bit.ly/3oX92Os