Published in Analytics Vidhya

ML21: Linear Regression with Python

With higher-degree terms & interactions

Keywords: linear regression, higher-degree terms, interactions, AIC, BIC, correlation heat map, scatter plot

Complete Python code on Colab: https://bit.ly/39CEuve

  • Linear regression is an essential yet often underrated model in ML.
  • LR offers a quick walk-through in preparation for implementing more sophisticated ML modeling and more complex analysis.
  • Furthermore, LR can serve as a baseline model for evaluating the performance of more sophisticated ML models.

Based on ML20, which uses R to carry out a chain of analyses ending in stepwise linear regression, we try to reproduce the outcomes of ML20 in Python. The reader may also check ML19 for more background.

We can find most of the corresponding functions in Python; however, we can’t find a handy stepwise LR function in Python [5][8][10][11][12][13][14][15][17][20].

We use Python to handle the toy dataset “fuel2001” (fuel data) from “Applied Linear Regression (4th ed.)” [4]. In Chapters 1 and 3, this notable textbook uses the dataset to concisely demonstrate the scatter plot matrix, summary statistics, correlation matrix, and multiple linear regression in R.

Outline

(1) Data Source: Fuel Consumption

(2) Data Preprocessing
2–1 Deleting a column
2–2 Transformation
2–3 Retaining needed columns

(3) Data Exploration
3–1 Descriptive statistics
3–2 Scatter plot matrix
3–3 Correlation matrix & heat map
3–4 Scatter plot
3–5 Box plot & subplot

(4) Linear Regression
4–1 Information criteria: AIC vs. BIC
4–2 LR
4–3 LR with interactions
4–4 LR with interactions & higher-degree terms
4–5 Run the best model given by R’s step( ) in ML20

(5) Summary
(6) References

(1) Data Source: Fuel Consumption [4]

The reader can download the “fuel2001.csv” dataset from the link below.
http://users.stat.umn.edu/~sandy/alr4ed/data/

Fuel Consumption
The goal of this example is to understand how fuel consumption varies over the 50 United States and the District of Columbia (Federal Highway Administration, 2001). Table 1.1 describes the variables to be used in this example; the data are given in the file fuel2001. The data were collected by the U.S. Federal Highway Administration.

  • 1. Shape: 51 observations of 7 variables
  • 2. Target: “FuelC”
  • 3. Features: “Drivers”, “Income”, “Miles”, “MPC”, “Pop”, “Tax”

(2) Data Preprocessing [3]

2–1 Deleting a column

import pandas as pd

fuel2001 = pd.read_csv("fuel2001.csv")
# Drop the unnamed column that was read in along with the data
fuel2001_2 = fuel2001.drop(['Unnamed: 0'], axis = 1)
fuel2001_2.head(n=3)

We find an odd, useless column (“Unnamed: 0”), so we remove it.
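An “Unnamed: 0” column typically appears when a CSV file was saved with its row index as an unlabeled first column. As a minimal sketch, assuming that is what happened here, the column can also be absorbed at read time with `index_col=0`. The tiny in-memory CSV below is made up for illustration and is not the real fuel2001 file.

```python
import io
import pandas as pd

# Made-up CSV text whose first column has an empty header,
# mimicking a file that was saved together with its row index.
csv_text = ",Drivers,Pop\nAL,10,20\nAK,30,40\n"

# Reading it naively yields an extra 'Unnamed: 0' column ...
naive = pd.read_csv(io.StringIO(csv_text))
dropped = naive.drop(columns=["Unnamed: 0"])

# ... whereas index_col=0 turns that column into the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(dropped.columns))
print(list(indexed.columns))
```

Either route leaves the same data columns; `index_col=0` simply keeps the labels as the index instead of discarding them.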

2–2 Transformation

import numpy as np

# Derived variables: rates per 1,000 of Pop, income in thousands, log of Miles
Dlic = 1000 * fuel2001_2.Drivers / fuel2001_2.Pop
Fuel = 1000 * fuel2001_2.FuelC / fuel2001_2.Pop
Income_2 = fuel2001_2.Income / 1000
log_Miles = np.log(fuel2001_2.Miles)
# Attach the derived variables as new columns
fuel2001_2['Dlic'] = Dlic
fuel2001_2['Fuel'] = Fuel
fuel2001_2['Income_2'] = Income_2
fuel2001_2['log_Miles'] = log_Miles

We do some feature engineering by adding four new variables.
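The same four derived variables can also be created in one chained step with `DataFrame.assign`. This is only a sketch: the three-row frame is made up, and only the column names are taken from the article.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names follow fuel2001.
toy = pd.DataFrame({
    "Drivers": [500.0, 800.0, 1200.0],
    "FuelC": [300.0, 450.0, 900.0],
    "Income": [25000.0, 30000.0, 28000.0],
    "Miles": [1000.0, 5000.0, 20000.0],
    "Pop": [1000.0, 1000.0, 2000.0],
})

toy_2 = toy.assign(
    Dlic=lambda d: 1000 * d.Drivers / d.Pop,   # Drivers per 1,000 of Pop
    Fuel=lambda d: 1000 * d.FuelC / d.Pop,     # FuelC per 1,000 of Pop
    Income_2=lambda d: d.Income / 1000,        # Income in thousands
    log_Miles=lambda d: np.log(d.Miles),       # natural log of Miles
)
print(toy_2[["Dlic", "Fuel", "Income_2", "log_Miles"]])
```

`assign` returns a new data frame and leaves the original untouched, which makes it easier to rerun a notebook cell without accumulating side effects.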

2–3 Retaining needed columns

fuel2001_3 =  fuel2001_2.iloc[ : , [6,7,8,9,10] ] # new data frame

The data frame “fuel2001_3” now looks as follows:

  • 1. Shape: 51 observations of 5 variables
  • 2. Target: “Fuel”
  • 3. Features: “Tax”, “Dlic”, “Income_2”, “log_Miles”
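Selecting by position with `iloc` depends on the column order staying the same. A hedged alternative is to select by name. The frame below is made up, and its column order is an assumption about what fuel2001_2 looks like after section 2–2.

```python
import pandas as pd

# Made-up frame; the column order is assumed to match fuel2001_2.
cols = ["Drivers", "FuelC", "Income", "Miles", "MPC", "Pop",
        "Tax", "Dlic", "Fuel", "Income_2", "log_Miles"]
rows = [list(range(len(cols))), list(range(10, 10 + len(cols)))]
toy = pd.DataFrame(rows, columns=cols)

by_position = toy.iloc[:, [6, 7, 8, 9, 10]]
by_name = toy[["Tax", "Dlic", "Fuel", "Income_2", "log_Miles"]]

# Both selections pick the same five columns in the same order.
print(by_position.equals(by_name))
```

Selecting by name raises a `KeyError` if a column is missing, whereas a positional selection would silently return whatever happens to sit at those positions.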

(3) Data Exploration

3–1 Descriptive statistics [9][19]

print(fuel2001_3.shape)
print(len(fuel2001_3))
print(fuel2001_3.index)
print(fuel2001_3.columns)
print(fuel2001_3.head(n=3))
print(fuel2001_3.tail(n=3))
fuel2001_3.info()  # info() prints its report directly and returns None

summary = fuel2001_3.describe()
print(summary)

# Display all columns
pd.set_option('display.max_columns', None)
# Display all rows
pd.set_option('display.max_rows', None)
# Transposed summary: one row per variable
summary_2 = fuel2001_3.describe().transpose()
print(summary_2)
print(type(summary_2))  # a pandas DataFrame

3–2 Scatter plot matrix [1][7]

import matplotlib.pyplot as plt
import seaborn as sns

# Plain scatter plots
sns.set(font_scale=1.2)
sns.pairplot(fuel2001_3, kind= 'scatter', height= 2)
plt.show();
# Scatter plots with fitted regression lines
sns.set(font_scale=1.2)
sns.pairplot(fuel2001_3, kind= 'reg', height= 2) # sns.pairplot(df, hue='class')
plt.show();
# Kernel density estimates on the diagonal
sns.set(font_scale=1.2)
sns.pairplot(fuel2001_3, diag_kind= 'kde', height= 2)
plt.show();
# Same as above, with semi-transparent points
sns.set(font_scale=1.2)
sns.pairplot(fuel2001_3, diag_kind= 'kde', plot_kws={'alpha':0.2}, height= 2)
plt.show();

3–3 Correlation matrix & heat map [6]

# Heat map of the lower triangle (upper triangle and diagonal are masked)
plt.subplots(figsize=(10,8))
sns.set(font_scale=1.9)
mask = np.triu(fuel2001_3.corr())
sns.heatmap(fuel2001_3.corr(), annot=True, mask=mask, cmap= 'rocket');
# Heat map of the upper triangle (lower triangle and diagonal are masked)
plt.subplots(figsize=(10,8))
sns.set(font_scale=1.9)
mask = np.tril(fuel2001_3.corr())
sns.heatmap(fuel2001_3.corr(), annot=True, mask=mask, cmap= 'viridis');
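In `sns.heatmap`, cells where the mask is true are hidden. The code above passes the triangular part of the correlation matrix itself as the mask, so it relies on every correlation being nonzero. A more explicit variant builds a boolean mask from the matrix shape alone. The sketch below uses a made-up correlation matrix with one exact zero to show the difference; it does not draw anything.

```python
import numpy as np
import pandas as pd

# Made-up correlation matrix in which 'a' and 'b' are exactly uncorrelated.
names = ["a", "b", "c"]
corr = pd.DataFrame(
    [[1.0, 0.0, 0.3],
     [0.0, 1.0, -0.5],
     [0.3, -0.5, 1.0]],
    index=names, columns=names,
)

# Mask built from the correlation values: a zero entry is falsy,
# so that cell would NOT be hidden.
value_mask = np.triu(corr.to_numpy()).astype(bool)

# Mask built from the shape only: every upper-triangle cell is hidden.
shape_mask = np.triu(np.ones(corr.shape, dtype=bool))

print(value_mask)
print(shape_mask)
```

With real data an exact zero correlation is unlikely, so both masks usually give the same picture; the shape-based one just removes the dependence on the values.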

3–4 Scatter plot

plt.subplots(figsize=(6,5))   
sns.regplot(x='Dlic', y='Fuel', data = fuel2001_3);
plt.title('Scatter plot & regression');
plt.show();

3–5 Box plot & subplot

plt.figure(figsize=(2.5,5)) 
sns.boxplot(data= fuel2001_3[['Dlic','Fuel']])
plt.show();
plt.subplots(figsize=(4,5))
sns.boxplot(data= fuel2001_3[['Tax','Income_2','log_Miles']])
plt.show();
# Two box plots side by side in one figure
plt.figure(figsize=(8,5))
plt.subplot(121)
sns.boxplot(data= fuel2001_3[['Dlic','Fuel']])
plt.subplot(122)
sns.boxplot(data= fuel2001_3[['Tax','Income_2','log_Miles']])
plt.show();

(4) Linear Regression

4–1 Information criteria: AIC vs. BIC

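  • $AIC = n \log(RSS_{p_C}/n) + 2 p_C$
  • $BIC = n \log(RSS_{p_C}/n) + \log(n)\, p_C$
  • Here $n$ is the number of observations, $RSS_{p_C}$ is the residual sum of squares of the candidate model, and $p_C$ is the number of regression coefficients in that model.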
  • AIC is given by Sakamoto et al. (1986); BIC is given by Schwarz (1978).
  • Smaller values are preferred for AIC & BIC.
  • Both criteria balance lack of fit (the first term) against model complexity (the second term); the complexity term is essentially a penalty that keeps the model from overfitting.

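AIC charges a fixed penalty of 2 per coefficient regardless of the sample size n, whereas BIC charges log(n) per coefficient, which exceeds 2 once n ≥ 8 and keeps growing with n. As n increases, the AIC penalty therefore becomes relatively small and may fail to rein in model complexity. In this scenario, BIC functions better.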
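The two formulas are easy to evaluate by hand. The sketch below uses hypothetical RSS values (not results from fuel2001) for a smaller and a larger candidate model on n = 51 observations, chosen so that the two criteria disagree.

```python
import math

def aic(rss, n, p):
    """AIC = n log(RSS/n) + 2 p."""
    return n * math.log(rss / n) + 2 * p

def bic(rss, n, p):
    """BIC = n log(RSS/n) + log(n) p."""
    return n * math.log(rss / n) + math.log(n) * p

# Hypothetical candidates: the larger model fits better but has 3 more coefficients.
n = 51
small = {"rss": 200000.0, "p": 5}
large = {"rss": 170000.0, "p": 8}

for name, m in [("small", small), ("large", large)]:
    print(name,
          "AIC =", round(aic(m["rss"], n, m["p"]), 2),
          "BIC =", round(bic(m["rss"], n, m["p"]), 2))

# Penalty per coefficient: 2 for AIC versus log(n) for BIC.
print("log(n) =", round(math.log(n), 3))
```

In this made-up case the improvement in fit is worth more than 2 per extra coefficient but less than log(51) per extra coefficient, so AIC prefers the larger model while BIC prefers the smaller one.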

4–2 LR [2]

import pandas as pd
import statsmodels.formula.api as sm
LR_fuel2001 = sm.ols(formula = " Fuel ~ Tax + Dlic + Income_2 + log_Miles", data = fuel2001_3)
print(LR_fuel2001.fit().summary())
AIC = np.round(LR_fuel2001.fit().aic, decimals=4)
BIC = np.round(LR_fuel2001.fit().bic, decimals=4)
print("AIC = {}".format(AIC))
print("BIC = {}".format(BIC))
  • AIC = 575.086
  • BIC = 584.7451

4–3 LR with interactions

LR_fuel2001 = sm.ols(formula = " Fuel ~  Tax + Dlic + Tax:Dlic + Income_2 + log_Miles", data = fuel2001_3)
print(LR_fuel2001.fit().summary())
AIC = np.round(LR_fuel2001.fit().aic, decimals=4)
BIC = np.round(LR_fuel2001.fit().bic, decimals=4)
print("AIC = {}".format(AIC))
print("BIC = {}".format(BIC))
  • AIC = 573.6349
  • BIC = 585.2259
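In the formula above, `Tax:Dlic` adds the elementwise product of the two columns as one extra regressor. A numpy-only sketch on made-up, noise-free data shows the idea: when the product column is appended to the design matrix, least squares recovers the interaction coefficient that generated the data. The variable names `x1` and `x2` are placeholders, not columns of fuel2001.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=30)
x2 = rng.uniform(0, 10, size=30)
# Noise-free response built with a known interaction effect of 0.5.
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 0.5 * x1 * x2

# Design matrix: intercept, main effects, and the product column
# that plays the role of an "x1:x2" term in a model formula.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 6))
```

A higher-degree term such as `I(log_Miles**2)` works the same way: the squared column is appended to the design matrix as one more regressor.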

4–4 LR with interactions & higher-degree terms

LR_fuel2001 = sm.ols(formula = " Fuel ~  Tax + Dlic + Tax:Dlic + Income_2 + log_Miles + I(log_Miles**2) + I(log_Miles**3)", data = fuel2001_3)
print(LR_fuel2001.fit().summary())
AIC = np.round(LR_fuel2001.fit().aic, decimals=4)
BIC = np.round(LR_fuel2001.fit().bic, decimals=4)
print("AIC = {}".format(AIC))
print("BIC = {}".format(BIC))
  • AIC = 573.6349
  • BIC = 585.2259

4–5 Run the best model given by R’s step( ) in ML20

LR_fuel2001 = sm.ols(formula = "Fuel ~ I(Income_2**3) + Dlic + log_Miles + I(Tax**3) + I(Dlic**2)", data = fuel2001_3)
print(LR_fuel2001.fit().summary())
AIC = np.round(LR_fuel2001.fit().aic, decimals=4)
BIC = np.round(LR_fuel2001.fit().bic, decimals=4)
print("AIC = {}".format(AIC))
print("BIC = {}".format(BIC))
  • AIC = 571.7994
  • BIC = 583.3903
  • Python doesn’t have a handy stepwise LR function as R does, so personally I would turn to R for help [5][8][10][11][12][13][14][15][17][20].
  • Here we run the best model given by R’s step( ) in ML20:
    https://merscliche.medium.com/ml20-abb54a435b3
  • This best model from R’s step( ) indeed has lower AIC and BIC than the models we fitted above, although Python and R calculate AIC differently, so their AIC values are not directly comparable.
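One likely source of the mismatch, stated here as an assumption rather than something verified in this article: statsmodels reports AIC from the full Gaussian log-likelihood, -2 log L + 2k, while R’s step( ) ranks models by n log(RSS/n) + 2k. If so, the two differ by the constant n(1 + log 2π), which is the same for every model fitted to the same data, so the ranking of models is unaffected. The sketch below checks this arithmetic with numpy on made-up data; the function names are hypothetical.

```python
import numpy as np

def fit_rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def aic_loglik(rss, n, k):
    # -2 * Gaussian log-likelihood at the ML variance estimate RSS/n, plus 2k
    # (assumed here to be the convention behind statsmodels' .aic).
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return -2 * loglik + 2 * k

def aic_rss(rss, n, k):
    # n log(RSS/n) + 2k (assumed here to be the convention used by R's step()).
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(1)
n = 51
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x])            # intercept + x
X2 = np.column_stack([np.ones(n), x, x ** 2])    # adds a squared term

for X in (X1, X2):
    rss, k = fit_rss(X, y), X.shape[1]
    print(k, round(aic_loglik(rss, n, k), 4), round(aic_rss(rss, n, k), 4))

offset = n * (1 + np.log(2 * np.pi))             # the constant gap between the two
print(round(offset, 4))
```

Under these assumptions the gap depends only on n, so AIC values should be compared within one tool rather than across Python and R.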

(5) Summary

  1. Overall, we reproduce in Python the same results as ML20 does in R.
  2. In general, the data analysis process [2] consists of the following steps:
  • Ideation
  • Retrieval
  • Preparation
  • Exploration
  • Modeling
  • Presentation
  • Reproduction

In this article, we go from preprocessing and exploration to fitting the model selected by stepwise linear regression. In other words, we cover retrieval, preparation, exploration, modeling, and presentation.

3. We didn’t experience the power of stepwise regression with interactions and higher-degree terms (degree > 1), which is rarely mentioned in ML/DS books or online articles, because Python doesn’t have a handy stepwise LR function as R does. Personally, I would therefore turn to R for help when conducting linear regression [5][8][10][11][12][13][14][15][17][20].
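For readers who still want something step( )-like in Python, a greedy forward selection by AIC can be sketched in a few lines of numpy. This is a hypothetical helper (`forward_select` is not a library function). It only adds terms and never drops them, and it uses the n log(RSS/n) + 2p form of AIC from section 4–1, so it is not a reimplementation of R’s step( ). The data are made up.

```python
import numpy as np

def aic_rss(X, y):
    """AIC = n log(RSS/n) + 2p for an OLS fit of y on the columns of X."""
    n, p = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    return n * np.log(rss / n) + 2 * p

def forward_select(candidates, y):
    """Greedy forward selection; `candidates` maps term name -> 1-D array."""
    n = len(y)
    chosen = []
    X = np.ones((n, 1))                      # start from the intercept-only model
    best = aic_rss(X, y)
    while True:
        trials = {
            name: aic_rss(np.column_stack([X, col]), y)
            for name, col in candidates.items() if name not in chosen
        }
        if not trials:
            break
        name = min(trials, key=trials.get)
        if trials[name] >= best:             # no remaining term lowers AIC
            break
        chosen.append(name)
        X = np.column_stack([X, candidates[name]])
        best = trials[name]
    return chosen, best

# Made-up data: y depends on x1 and x2**2 only.
rng = np.random.default_rng(42)
x1, x2, x3 = rng.normal(size=(3, 200))
y = 1.0 + 3.0 * x1 + 2.0 * x2 ** 2 + rng.normal(scale=0.5, size=200)

# Candidate terms include an interaction and a higher-degree term.
terms = {"x1": x1, "x2": x2, "x3": x3, "x2^2": x2 ** 2, "x1:x2": x1 * x2}
selected, score = forward_select(terms, y)
print(selected, round(score, 2))
```

Because the search is greedy, it can keep a spurious term or miss a better combination; it is meant to illustrate the mechanics, not to replace a careful model comparison.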

(6) References

[1] McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd ed.). California, CA: O’Reilly Media.

[2] Lander, J. P. (2017). R for Everyone: Advanced Analytics and Graphics (2nd ed.). Massachusetts, MA: Addison-Wesley Professional.

[3] Heydt, M. (2017). Learning pandas: High-performance data manipulation and analysis in Python (2nd ed.). Birmingham, UK: Packt Publishing.

[4] Weisberg, S. (2014). Applied Linear Regression (4th ed.). New Jersey, NJ: John Wiley & Sons.

[5] Ramkumar, A. (2020). A Beginner’s Guide to Stepwise Multiple Linear Regression. Retrieved from

[6] Anita, O. (2019). Seaborn Heatmaps: 13 Ways to Customize Correlation Matrix Visualizations. Retrieved from

[7] Koehrsen, W. (2018). Visualizing Data with Pairs Plots in Python. Retrieved from

[8] 1313e (2016). Selecting the best combination of variables for regression model based on reg score. Retrieved from

[9] Hepner, T. (2016). R summary() equivalent in numpy. Retrieved from

[10] Schumacher, A. (2015). Stepwise Regression in Python. Retrieved from

[11] [Unidentified] (2015). Forward Selection with statsmodels. Retrieved from

[12] Prettenhofer, P. (2014). Multiple Regression Using Statsmodels. Retrieved from

[13] PyPI (Unidentified). stepwise-regression 1.0.3. Retrieved from

[14] DataSklr (Unidentified). Feature Selection with Python. Retrieved from

[15] jcrouser (Unidentified). Subset Selection in Python. Retrieved from

[16] RDocumentation (Unidentified). fuel2001: Fuel Consumption. Retrieved from

[17] [Unidentified] (Unidentified). Stepwise regression in Python. Retrieved from

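(Chinese)

[18] 钱魏Way (2020). 最优模型选择准则:AIC和BIC [Optimal model selection criteria: AIC and BIC]. Retrieved from
https://www.biaodianfu.com/aic-bic.html

[19] shangyj17 (2018). python-长数据完整打印方法 [Python: how to print long data in full]. Retrieved from

[20] pku_xfy (2017). 请问python可以做逐步回归(stepwise regression)吗? [Can Python do stepwise regression?]. Retrieved from https://bit.ly/3oX92Os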

Yu-Cheng (Morton) Kuo

ML/DS using Python & R. A Taiwanese earned MBA from NCCU and BS from NTHU with MATH major & ECON minor. Email: morton.kuo.28@gmail.com