# ML21: Linear Regression with Python

## With higher-degree terms & interactions

Keywords: linear regression, higher-degree terms, interactions, AIC, BIC, correlation heat map, scatter plot

Complete Python code on Colab: https://bit.ly/39CEuve

• Linear regression is an essential yet often underrated model in ML.
• LR offers a quick walk-through in preparation for implementing more sophisticated ML modeling and more complex analysis.
• Furthermore, LR can serve as the baseline model for evaluating the performance of more sophisticated ML models.

Based on ML20, which uses R to run a chain of analyses ending in stepwise linear regression, we try to reproduce the outcomes of ML20 in Python. Readers may also check ML19 for background knowledge.

We can find most of the corresponding functions in Python; however, we can’t find a handy stepwise LR function in Python.

We use Python to handle the toy dataset “fuel2001” (fuel data) from “Applied Linear Regression (4th ed.)”. In Chapters 1 and 3, this notable textbook uses the dataset to concisely demonstrate the scatter plot matrix, summary statistics, the correlation matrix and multiple linear regression in R.

Outline

(1) Data Source

(2) Data Preprocessing
2–1 Deleting a column
2–2 Transformation
2–3 Retaining needed columns

(3) Data Exploration
3–1 Descriptive statistics
3–2 Scatter plot matrix
3–3 Correlation matrix & heat map
3–4 Scatter plot
3–5 Box plot & subplot

(4) Linear Regression
4–1 Information criteria: AIC vs. BIC
4–2 LR
4–3 LR with interactions
4–4 LR with interactions & higher-degree terms
4–5 Run the best model given by R’s step( ) in ML20

(5) Summary
(6) References

# (1) Data Source: Fuel Consumption 

http://users.stat.umn.edu/~sandy/alr4ed/data/

Fuel Consumption
The goal of this example is to understand how fuel consumption varies over the 50 United States and the District of Columbia (Federal Highway Administration, 2001). Table 1.1 describes the variables to be used in this example; the data are given in the file fuel2001. The data were collected by the U.S. Federal Highway Administration.

• 1. Shape: 51 observations of 7 variables
• 2. Target: “FuelC”
• 3. Features: “Drivers”, “Income”, “Miles”, “MPC”, “Pop”, “Tax”

# 2–1 Deleting a column

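```python
import pandas as pd

fuel2001 = pd.read_csv("fuel2001.csv")
# The first column of the CSV has no name, so pandas calls it 'Unnamed: 0'
fuel2001_2 = fuel2001.drop(['Unnamed: 0'], axis=1)
fuel2001_2.head(n=3)
```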

We find an odd and useless column (“Unnamed: 0”), so we remove it.
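A column named “Unnamed: 0” usually appears when the CSV was saved together with its row labels, so the header starts with an empty field. The sketch below is only an illustration under that assumption: the two-row CSV text is made up (it is not the real fuel2001 data), and it shows that `index_col=0` is an alternative to dropping the column afterwards.

```python
import io
import pandas as pd

# Made-up two-row stand-in for fuel2001.csv; the header starts with an empty
# field, which is what makes pandas name the first column 'Unnamed: 0'.
csv_text = ',Drivers,FuelC\nAL,100,200\nAK,300,400\n'

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# Option 1 (as in the article): drop the column after reading.
dropped = raw.drop(columns=['Unnamed: 0'])

# Option 2: treat the first column as the row index while reading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))
print(list(indexed.index))
```

Option 2 keeps the row labels as the index, which can be convenient when the labels identify the observations.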

# 2–2 Transformation

```python
import numpy as np

# Derived variables
Dlic = 1000 * fuel2001_2.Drivers / fuel2001_2.Pop
Fuel = 1000 * fuel2001_2.FuelC / fuel2001_2.Pop
Income_2 = fuel2001_2.Income / 1000
log_Miles = np.log(fuel2001_2.Miles)

# Append them to the data frame as new columns
fuel2001_2['Dlic'] = Dlic
fuel2001_2['Fuel'] = Fuel
fuel2001_2['Income_2'] = Income_2
fuel2001_2['log_Miles'] = log_Miles
```

We do feature engineering by adding four new variables: “Dlic”, “Fuel”, “Income_2” and “log_Miles”.
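The same transformation can also be written in one step with `DataFrame.assign`, which returns a new frame instead of modifying the existing one. This is a minimal sketch on a made-up two-row frame (the numbers are hypothetical, chosen only so the results are easy to check by hand), not the article's actual data.

```python
import numpy as np
import pandas as pd

# Made-up numbers for two hypothetical rows (not the real fuel2001 values).
toy = pd.DataFrame({
    'Drivers': [500.0, 800.0],
    'FuelC': [400.0, 900.0],
    'Income': [25000.0, 30000.0],
    'Miles': [1000.0, 100000.0],
    'Pop': [1000.0, 2000.0],
})

# assign() returns a new frame and leaves `toy` untouched.
toy_2 = toy.assign(
    Dlic=lambda d: 1000 * d.Drivers / d.Pop,   # drivers per 1000 of Pop
    Fuel=lambda d: 1000 * d.FuelC / d.Pop,     # fuel consumption scaled by Pop
    Income_2=lambda d: d.Income / 1000,        # income in thousands
    log_Miles=lambda d: np.log(d.Miles),       # natural log of Miles
)
print(toy_2[['Dlic', 'Fuel', 'Income_2', 'log_Miles']])
```

Because the original frame is not changed, rerunning a notebook cell cannot accidentally stack the same transformation twice.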

# 2–3 Retaining needed columns

```python
# Keep Tax, Dlic, Fuel, Income_2 and log_Miles (column positions 6 to 10)
fuel2001_3 = fuel2001_2.iloc[:, [6, 7, 8, 9, 10]]  # new data frame
```

Now the information of “fuel2001_3” is as follows:

• 1. Shape: 51 observations of 5 variables
• 2. Target: “Fuel”
• 3. Features: “Tax”, “Dlic”, “Income_2”, “log_Miles”
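Selecting by position works only as long as the column order never changes, so selecting by name is a slightly more robust alternative. The sketch below assumes the column order Drivers, FuelC, Income, Miles, MPC, Pop, Tax followed by the four derived columns from 2–2; the single row of values is a made-up placeholder.

```python
import pandas as pd

# One placeholder row with the assumed column order of fuel2001_2 after 2–2.
cols = ['Drivers', 'FuelC', 'Income', 'Miles', 'MPC', 'Pop',
        'Tax', 'Dlic', 'Fuel', 'Income_2', 'log_Miles']
toy = pd.DataFrame([list(range(len(cols)))], columns=cols)

by_position = toy.iloc[:, [6, 7, 8, 9, 10]]
by_name = toy[['Tax', 'Dlic', 'Fuel', 'Income_2', 'log_Miles']]

# Both selections pick the same five columns while the order is unchanged.
print(by_position.equals(by_name))
```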

# 3–1 Descriptive statistics 

```python
print(fuel2001_3.shape)
print(len(fuel2001_3))
print(fuel2001_3.index)
print(fuel2001_3.columns)
print(fuel2001_3.head(n=3))
print(fuel2001_3.tail(n=3))
fuel2001_3.info()  # info() prints its report itself and returns None

summary = fuel2001_3.describe()
print(summary)

# Display all columns
pd.set_option('display.max_columns', None)
# Display all rows
pd.set_option('display.max_rows', None)

summary_2 = fuel2001_3.describe().transpose()
print(summary_2)
type(summary_2)  # type()
```

# 3–2 Scatter plot matrix 

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Plain scatter plot matrix
sns.set(font_scale=1.2)
sns.pairplot(fuel2001_3, kind='scatter', height=2)
plt.show()

# Scatter plot matrix with fitted regression lines
sns.set(font_scale=1.2)
sns.pairplot(fuel2001_3, kind='reg', height=2)  # sns.pairplot(df, hue='class')
plt.show()

# Kernel density estimates on the diagonal
sns.set(font_scale=1.2)
sns.pairplot(fuel2001_3, diag_kind='kde', height=2)
plt.show()

# Kernel density estimates on the diagonal, semi-transparent points
sns.set(font_scale=1.2)
sns.pairplot(fuel2001_3, diag_kind='kde', plot_kws={'alpha': 0.2}, height=2)
plt.show()
```

# 3–3 Correlation matrix & heat map 

```python
# Lower-triangle heat map: mask the upper triangle and the diagonal
plt.subplots(figsize=(10, 8))
sns.set(font_scale=1.9)
matrix = np.triu(fuel2001_3.corr())
sns.heatmap(fuel2001_3.corr(), annot=True, mask=matrix, cmap='rocket')

# Upper-triangle heat map: mask the lower triangle and the diagonal
plt.subplots(figsize=(10, 8))
sns.set(font_scale=1.9)
mask = np.tril(fuel2001_3.corr())
sns.heatmap(fuel2001_3.corr(), annot=True, mask=mask, cmap='viridis')
```

# 3–4 Scatter plot

```python
plt.subplots(figsize=(6, 5))
sns.regplot(x='Dlic', y='Fuel', data=fuel2001_3)
plt.title('Scatter plot & regression')
plt.show()
```

# 3–5 Box plot & subplot

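```python
# Box plots of Dlic and Fuel
plt.figure(figsize=(2.5, 5))
sns.boxplot(data=fuel2001_3[['Dlic', 'Fuel']])
plt.show()

# Box plots of Tax, Income_2 and log_Miles
plt.subplots(figsize=(4, 5))
sns.boxplot(data=fuel2001_3[['Tax', 'Income_2', 'log_Miles']])
plt.show()

# Both groups side by side; plt.figure avoids leaving an extra full-size axes
# underneath the two subplots
plt.figure(figsize=(8, 5))
plt.subplot(121)
sns.boxplot(data=fuel2001_3[['Dlic', 'Fuel']])
plt.subplot(122)
sns.boxplot(data=fuel2001_3[['Tax', 'Income_2', 'log_Miles']])
plt.show()
```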

# 4–1 Information criteria: AIC vs. BIC

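• $\mathrm{AIC} = n \log(\mathrm{RSS}_{p_C}/n) + 2\,p_C$
• $\mathrm{BIC} = n \log(\mathrm{RSS}_{p_C}/n) + \log(n)\,p_C$
• Here $n$ is the sample size, $p_C$ is the number of estimated coefficients in the candidate model, and $\mathrm{RSS}_{p_C}$ is that model's residual sum of squares.
• AIC in this form is given by Sakamoto et al. (1986); BIC is given by Schwarz (1978).
• Smaller values are preferred for both AIC and BIC.
• Information criteria balance lack of fit against complexity: the first term measures lack of fit, and the second term is a complexity penalty that keeps the model from overfitting.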

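The AIC penalty is 2 per coefficient regardless of the sample size, while the BIC penalty is log(n) per coefficient, which exceeds 2 as soon as n ≥ 8. As n increases, the AIC penalty therefore becomes relatively small and does less to hold down the complexity of the model. In this scenario, BIC functions better.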
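To make the difference between the two penalties concrete, here is a small worked sketch of the two formulas above. The RSS values are hypothetical, chosen only so that the two criteria disagree; `aic_bic` is an illustrative helper, not a library function.

```python
import numpy as np

def aic_bic(rss, n, p):
    """Textbook-form criteria: n*log(RSS/n) plus a complexity penalty."""
    fit_term = n * np.log(rss / n)
    return fit_term + 2 * p, fit_term + np.log(n) * p

n = 51  # sample size of the fuel data

# Hypothetical RSS values for a small (p=5) and a larger (p=8) model.
aic_small, bic_small = aic_bic(rss=200000.0, n=n, p=5)
aic_large, bic_large = aic_bic(rss=170000.0, n=n, p=8)

print("penalty per coefficient: AIC = 2, BIC = log(n) =", round(np.log(n), 4))
print("AIC prefers:", "large" if aic_large < aic_small else "small")
print("BIC prefers:", "large" if bic_large < bic_small else "small")
```

With these made-up numbers the drop in RSS is enough to pay for three extra coefficients under AIC but not under BIC, so BIC keeps the smaller model.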

# 4–2 LR 

```python
import pandas as pd
import statsmodels.formula.api as sm

LR_fuel2001 = sm.ols(formula="Fuel ~ Tax + Dlic + Income_2 + log_Miles", data=fuel2001_3)
result = LR_fuel2001.fit()  # fit once and reuse the result
print(result.summary())

AIC = np.round(result.aic, decimals=4)
BIC = np.round(result.bic, decimals=4)
print("AIC = {}".format(AIC))
print("BIC = {}".format(BIC))
```
• AIC = 575.086
• BIC = 584.7451

# 4–3 LR with interactions

```python
# Tax:Dlic adds the interaction between Tax and Dlic
LR_fuel2001 = sm.ols(formula="Fuel ~ Tax + Dlic + Tax:Dlic + Income_2 + log_Miles", data=fuel2001_3)
result = LR_fuel2001.fit()  # fit once and reuse the result
print(result.summary())

AIC = np.round(result.aic, decimals=4)
BIC = np.round(result.bic, decimals=4)
print("AIC = {}".format(AIC))
print("BIC = {}".format(BIC))
```
• AIC = 573.6349
• BIC = 585.2259
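For two numeric variables, the formula term `Tax:Dlic` contributes one extra regressor, the element-wise product of the two columns, so the model stays linear in its coefficients. The sketch below illustrates this with plain NumPy on synthetic, noise-free data; `x1` and `x2` are stand-ins for Tax and Dlic, and the coefficient values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 51
x1 = rng.uniform(0, 1, n)   # synthetic stand-in for Tax
x2 = rng.uniform(0, 1, n)   # synthetic stand-in for Dlic

# Design matrix: intercept, x1, x2 and the product column (the role of Tax:Dlic).
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])

# Noise-free response with known coefficients, so the fit can be checked.
true_beta = np.array([100.0, -2.0, 0.5, 3.0])
y = X @ true_beta

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 6))
```

In the formula language, `Tax*Dlic` is shorthand for `Tax + Dlic + Tax:Dlic`, so both main effects and the interaction are included at once.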

# 4–4 LR with interactions & higher-degree terms

```python
# I(...) makes ** mean arithmetic power inside the formula
LR_fuel2001 = sm.ols(formula="Fuel ~ Tax + Dlic + Tax:Dlic + Income_2 + log_Miles + I(log_Miles**2) + I(log_Miles**3)", data=fuel2001_3)
result = LR_fuel2001.fit()  # fit once and reuse the result
print(result.summary())

AIC = np.round(result.aic, decimals=4)
BIC = np.round(result.bic, decimals=4)
print("AIC = {}".format(AIC))
print("BIC = {}".format(BIC))
```
• AIC = 573.6349
• BIC = 585.2259
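A model with higher-degree terms is still a linear regression, because it is linear in the coefficients; the powers are simply extra columns of the design matrix. The NumPy sketch below shows this on synthetic data. Its second part uses a hypothetical value range (7 to 13, roughly the scale of a log-transformed count) to show that raw powers of such a variable are strongly collinear and that centering it first gives a much better-conditioned design matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 51)   # synthetic predictor

# Columns 1, x, x^2, x^3: cubic in x but linear in the coefficients.
X = np.column_stack([x**k for k in range(4)])
true_beta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_beta            # noise-free so the recovery can be checked

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 6))

# Raw powers of a variable far from zero are nearly collinear.
z = rng.uniform(7, 13, 51)   # hypothetical range, not the real log_Miles values
raw = np.column_stack([z**k for k in range(4)])
centered = np.column_stack([(z - z.mean())**k for k in range(4)])
print("condition number, raw powers:     ", np.linalg.cond(raw))
print("condition number, centered powers:", np.linalg.cond(centered))
```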

# 4–5 Run the best model given by R’s step( ) in ML20

```python
# Best model reported by R's step( ) in ML20
LR_fuel2001 = sm.ols(formula="Fuel ~ I(Income_2**3) + Dlic + log_Miles + I(Tax**3) + I(Dlic**2)", data=fuel2001_3)
result = LR_fuel2001.fit()  # fit once and reuse the result
print(result.summary())

AIC = np.round(result.aic, decimals=4)
BIC = np.round(result.bic, decimals=4)
print("AIC = {}".format(AIC))
print("BIC = {}".format(BIC))
```
• AIC = 571.7994
• BIC = 583.3903
• Python doesn’t have a handy stepwise LR function as R does, so personally I would turn to R for help.
• Here we run the best model given by R’s step( ) in ML20: https://merscliche.medium.com/ml20-abb54a435b3
• This best model from R’s step( ) is indeed better (lower AIC and BIC) than the models we fitted above, although Python and R compute AIC differently, so the values cannot be compared directly across the two languages.
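One likely reason the AIC values differ between tools is that several conventions are in use. As far as I know, statsmodels reports the likelihood-based form with the regression coefficients counted, R's `AIC( )` additionally counts the error variance, and R's `step( )` works with the textbook form from 4–1; treat this attribution as an assumption to verify against each tool's documentation. The sketch below, with hypothetical RSS values, shows that for a fixed n the three forms differ only by constants, so they rank models identically.

```python
import numpy as np

def aic_variants(rss, n, k):
    """Three AIC conventions for a Gaussian linear model with k coefficients."""
    fit_term = n * np.log(rss / n)
    # Maximized Gaussian log-likelihood of the fitted model
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return {
        'n*log(RSS/n) + 2k': fit_term + 2 * k,             # textbook form of 4–1
        '-2*loglik + 2k': -2 * loglik + 2 * k,             # likelihood-based
        '-2*loglik + 2(k+1)': -2 * loglik + 2 * (k + 1),   # also counts the variance
    }

n = 51
a = aic_variants(rss=200000.0, n=n, k=5)   # hypothetical model A
b = aic_variants(rss=170000.0, n=n, k=8)   # hypothetical model B

for name in a:
    print(f"{name:20s} A = {a[name]:9.4f}  B = {b[name]:9.4f}  B - A = {b[name] - a[name]:8.4f}")
```

The column `B - A` is the same in every row, which is why comparing models within one tool is safe even though the raw numbers from different tools do not match.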

# (5) Summary

1. Overall, we reproduce in Python the same results as ML20 obtains in R.
2. In general, the data analysis process comprises the following steps:
• Ideation
• Retrieval
• Preparation
• Exploration
• Modeling
• Presentation
• Reproduction

In this article, we go from preprocessing and exploration to linear regression, ending with the model chosen by stepwise regression in ML20. In other words, we cover retrieval, preparation, exploration, modeling and presentation.

3. We didn’t get to experience the power of stepwise regression with interactions and higher-degree terms (degree > 1), a topic rarely mentioned in ML/DS books or online articles, because Python doesn’t have a handy stepwise LR function as R does. Personally, I would turn to R for help when conducting linear regression.
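For readers who still want to experiment in Python, the idea behind a stepwise search can be sketched in a few lines of NumPy. This is only a rough illustration, not a replacement for R's step( ): it searches forward only, uses the textbook AIC from 4–1, runs on synthetic data, and the helper names `aic` and `forward_select` are made up for this example.

```python
import numpy as np

def aic(X, y):
    """n*log(RSS/n) + 2p for an OLS fit of y on the columns of X."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * p

def forward_select(candidates, y):
    """Greedy forward selection: add the term that lowers AIC most.

    candidates: dict mapping a term name to a 1-D array (illustrative format).
    Stops when no remaining term lowers the AIC.
    """
    X = np.ones((len(y), 1))          # start from the intercept-only model
    best = aic(X, y)
    remaining = dict(candidates)
    selected = []
    while remaining:
        scores = {name: aic(np.column_stack([X, col]), y)
                  for name, col in remaining.items()}
        name = min(scores, key=scores.get)
        if scores[name] >= best:      # no candidate improves the AIC
            break
        best = scores[name]
        X = np.column_stack([X, remaining.pop(name)])
        selected.append(name)
    return selected, best

# Synthetic data: y depends on x1 and on the x1-by-x2 interaction.
rng = np.random.default_rng(42)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + 1.5 * x1 * x2 + rng.normal(scale=0.5, size=n)

# Candidate terms include an interaction and a squared term.
terms = {'x1': x1, 'x2': x2, 'x3': x3, 'x1:x2': x1 * x2, 'x1^2': x1**2}
selected, best_aic = forward_select(terms, y)
print(selected, round(best_aic, 2))
```

Interactions and higher-degree terms enter simply as extra candidate columns, which is why the candidate set grows quickly and an automated search becomes attractive.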

# (6) References

 McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd ed.). California, CA: O’Reilly Media.

 Lander, J. P. (2017). R for Everyone: Advanced Analytics and Graphics (2nd ed.). Massachusetts, MA: Addison-Wesley Professional.

 Heydt, M. (2017). Learning pandas: High-performance data manipulation and analysis in Python (2nd ed.). Birmingham, UK: Packt Publishing.

 Weisberg, S. (2014). Applied Linear Regression (4th ed.). New Jersey, NJ: John Wiley & Sons.

 Ramkumar, A. (2020). A Beginner’s Guide to Stepwise Multiple Linear Regression. Retrieved from

 Anita, O. (2019). Seaborn Heatmaps: 13 Ways to Customize Correlation Matrix Visualizations. Retrieved from

 Koehrsen, W. (2018). Visualizing Data with Pairs Plots in Python. Retrieved from

 1313e (2016). Selecting the best combination of variables for regression model based on reg score. Retrieved from

 Hepner, T. (2016). R summary() equivalent in numpy. Retrieved from

 Schumacher, A. (2015). Stepwise Regression in Python. Retrieved from

 [Unidentified] (2015). Forward Selection with statsmodels. Retrieved from

 Prettenhofer, P. (2014). Multiple Regression Using Statsmodels. Retrieved from

 PyPI (Unidentified). stepwise-regression 1.0.3. Retrieved from

 jcrouser (Unidentified). Subset Selection in Python. Retrieved from

 RDocumentation (Unidentified). fuel2001: Fuel Consumption. Retrieved from

 [Unidentified] (Unidentified). Stepwise regression in Python. Retrieved from

## (Chinese)

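 钱魏Way (2020). 最优模型选择准则：AIC和BIC [Criteria for selecting the optimal model: AIC and BIC]. Retrieved from
https://www.biaodianfu.com/aic-bic.html

 shangyj17 (2018). python-长数据完整打印方法 [Python: how to print long data in full]. Retrieved from

 pku_xfy (2017). 请问python可以做逐步回归（stepwise regression）吗？ [Can Python do stepwise regression?]. Retrieved from https://bit.ly/3oX92Os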
