Exploring Airline Passenger Satisfaction

Abstract


The dataset contains information collected from airline passengers, including their level of satisfaction with various aspects of the flight experience such as seat comfort, in-flight entertainment, and food service. This information can be used to train a machine learning model that predicts passenger satisfaction from these factors. The airline can then use the model to improve its service and increase passenger satisfaction. The dataset can also be used to identify patterns and trends in passenger satisfaction, which can inform decisions on where to allocate resources and make improvements.

Data Preprocessing Tools

Data preprocessing is a critical step in the data analysis process that involves preparing raw data for analysis by transforming, cleaning, and organizing it into a usable format. The quality of the preprocessing step can significantly impact the accuracy and reliability of the results obtained from the data analysis.

The goal of data preprocessing is to make the data suitable for machine learning algorithms, statistical analysis, or other data mining techniques.

a) Importing the libraries

!pip install shap
!pip install lime
!pip install h2o
import shap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import KNNImputer
from lime.lime_tabular import LimeTabularExplainer
import h2o
from h2o.automl import H2OAutoML
h2o.init()
import warnings
warnings.filterwarnings('ignore')
seed=1234
# getting the train data from github
# the dataset consists of two parts train dataset and test dataset.
# dimensions 103904 rows * 23 columns
print("")
print("*"*100,"Train Data","*"*100)
print("")
df_train = pd.read_csv('https://raw.githubusercontent.com/mananrg/DataScience/main/train.csv',index_col=0)
df_train=df_train.drop('id',axis=1)
print(df_train.head())
print(f"Length / No of rows in train set: {df_train.shape[0]}")
print(f"Width / No of columns in train set: {df_train.shape[1]}")
print("")
print("*"*100,"Test Data","*"*100)
print("")
# getting the test data from github
# dimensions 25976 rows * 23 columns
df_test = pd.read_csv('https://raw.githubusercontent.com/mananrg/DataScience/main/test.csv',index_col=0)
df_test=df_test.drop('id',axis=1)
print(df_test.head())
print(f"Length / No of rows in test set: {df_test.shape[0]}")
print(f"Width / No of columns in test set: {df_test.shape[1]}")
df_train.info()  # info() prints its summary directly
print("-"*200)
numCols = df_train.select_dtypes("number").columns
catCols = df_train.select_dtypes("object").columns
numCols= list(set(numCols))
catCols= list(set(catCols))
print(f"Numerical Columns: {numCols}")
print(f"Categorical Columns: {catCols}")

b) Taking care of missing data

df_train=df_train.dropna(axis=0, how='any', subset=None, inplace=False)
df_test=df_test.dropna(axis=0, how='any', subset=None, inplace=False)
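# Note: an alternative to dropping rows would be to impute the missing values.
# KNNImputer was imported above but is not used in this notebook; a sketch (an assumption,
# not the author's pipeline) of how it could be applied to the numeric columns:
#   imputer = KNNImputer(n_neighbors=5)
#   df_train[numCols] = imputer.fit_transform(df_train[numCols])
#   df_test[numCols] = imputer.transform(df_test[numCols])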
#checking the distribution of independent variables
from statsmodels.graphics.gofplots import qqplot
data_norm=df_train[['Flight Distance', 'Inflight wifi service', 'Online boarding', 'Arrival Delay in Minutes', 'Departure/Arrival time convenient', 'Gate location', 'Departure Delay in Minutes', 'Food and drink', 'Checkin service', 'Age', 'Seat comfort', 'Cleanliness', 'On-board service', 'Leg room service', 'Baggage handling', 'Ease of Online booking', 'Inflight service', 'Inflight entertainment']]
for c in data_norm.columns:
    plt.figure(figsize=(8,5))
    fig = qqplot(data_norm[c], line='45', fit=True)
    plt.xticks(fontsize=13)
    plt.yticks(fontsize=13)
    plt.xlabel("Theoretical quantiles", fontsize=15)
    plt.ylabel("Sample quantiles", fontsize=15)
    plt.title("Q-Q plot of {}".format(c), fontsize=16)
    plt.grid(True)
    plt.show()
#the heat map of the correlation
import seaborn as sns
plt.figure(figsize=(20,7))
sns.heatmap(df_train[numCols].corr(), annot=True, cmap='RdYlGn')  # correlation of the numeric columns
sns.pairplot(df_train)

c) Feature Selection

# Separating the features (X) from the target (y)
# The id column was already dropped above; an ID carries no predictive information
# satisfaction is the outcome we want to predict, so it goes into y
# df.iloc[rows, cols], where ':' means all rows
X_train = df_train.iloc[:, 0:-1]  # every column except the last (satisfaction)
y_train = df_train.iloc[:, -1]    # -1 indicates the last column (satisfaction)
X_test = df_test.iloc[:, 0:-1]
y_test = df_test.iloc[:, -1]

# Printing the first few rows of the feature set
print(X_test.head().to_markdown())
# Printing the corresponding target values
print(y_test.head().to_markdown())

d) Encoding categorical data

  1. Encoding the Independent Variable

One-hot encoding and dummy encoding are two methods used to represent categorical variables in a dataset for machine learning.

“df_dummies” refers to a data frame that has undergone one-hot (dummy) encoding, a process of converting categorical data into numerical data. In this encoding, each unique category value is represented as a binary vector in which exactly one element is 1 (hot) and the rest are 0 (cold).
In this notebook I use pandas dummy encoding to create these columns; a sketch of the step is shown below.
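The encoding call itself does not appear in the listing, only the before/after prints. A minimal sketch of what it likely looks like with pd.get_dummies (an assumption inferred from the dummy column names, such as Gender_Female and Class_Eco Plus, used later in the code):

X_train = pd.get_dummies(X_train)  # one-hot encode the object columns, keep numeric ones as-is
X_test = pd.get_dummies(X_test)
# Align the test columns with the training columns in case a category is missing from one split
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)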

print(X_train.head())
print(X_test.head())

2) Encoding the Dependent Variable

# Here y is a string ("neutral or dissatisfied" / "satisfied") rather than an integer (0/1), so we need to encode it.
# This step can be skipped if your target is already numeric.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)  # reuse the encoder fitted on the training labels
print(y_train[0])
print(y_test[0])

e) Feature Scaling

MinMaxScaler is a preprocessing method that transforms a feature to a given range, typically between 0 and 1. The scaling subtracts the minimum value of the feature from each data point and divides the result by the range (maximum minus minimum). The purpose of this normalization is to bring all features to the same scale, which is beneficial for algorithms that weight inputs, such as neural networks. It ensures that no single feature dominates the others in magnitude, which can improve model performance.
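As a quick illustration of the transform (a toy example, not part of the original notebook):

x = np.array([2.0, 5.0, 10.0])
x_scaled = (x - x.min()) / (x.max() - x.min())  # gives [0.0, 0.375, 1.0]
print(x_scaled)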

print(X_train.head())
plt.figure(figsize=(20,7))
sns.boxplot(data=X_train)
from sklearn import preprocessing

# Columns to scale (all features, including the dummy-encoded ones)
feature_cols = ['Age', 'Flight Distance', 'Inflight wifi service',
                'Departure/Arrival time convenient', 'Ease of Online booking',
                'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
                'Inflight entertainment', 'On-board service', 'Leg room service',
                'Baggage handling', 'Checkin service', 'Inflight service',
                'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes',
                'Gender_Female', 'Gender_Male', 'Customer Type_Loyal Customer',
                'Customer Type_disloyal Customer', 'Type of Travel_Business travel',
                'Type of Travel_Personal Travel', 'Class_Business', 'Class_Eco',
                'Class_Eco Plus']

# Fit the scaler on the training data and apply the same transform to the test data
min_max_scaler = preprocessing.MinMaxScaler()
X_train[feature_cols] = min_max_scaler.fit_transform(X_train[feature_cols])
X_test[feature_cols] = min_max_scaler.transform(X_test[feature_cols])

plt.figure(figsize=(20,7))
sns.boxplot(data=X_train)

Logistic Regression

Logistic regression is a type of binary classification algorithm used to predict the probability of a categorical outcome based on one or more predictor variables. It is a popular machine learning algorithm used for various applications, such as medical diagnosis, credit scoring, and marketing.

The goal of logistic regression is to find a function that can map the input variables to the probability of a binary outcome (usually 0 or 1). The output of logistic regression is a probability value between 0 and 1, which can be converted into a binary outcome using a decision threshold.

Logistic regression works by estimating the coefficients of a linear equation on the logit scale. The logit is the natural logarithm of the odds of the binary outcome, and the coefficients are estimated by maximum likelihood estimation or other optimization techniques.

The logistic regression algorithm assumes that the relationship between the input variables and the log-odds of the outcome is linear. Unlike linear regression, it does not assume normally distributed errors or constant variance, but it does assume independent observations and little multicollinearity among the predictors.

Logistic regression can handle both categorical and continuous predictor variables and can be extended to multi-class problems using techniques such as one-vs-rest or softmax (multinomial) regression. It is a simple, interpretable algorithm that is easy to implement in most programming languages and statistical packages. However, it cannot capture nonlinear relationships between the inputs and the output unless the features are transformed or interaction terms are added.
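To make the mapping concrete, here is a small sketch (not from the original notebook) of how a fitted model turns a linear combination of inputs into a probability via the logistic (sigmoid) function:

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# z = intercept + sum(coefficient_i * x_i); the numbers here are purely illustrative
z = 0.5 + 1.2 * 0.8 - 0.7 * 0.3
p = sigmoid(z)           # probability of the positive class, about 0.78
label = int(p >= 0.5)    # binary outcome after applying a 0.5 decision threshold
print(p, label)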

from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)
y_test_pred=classifier.predict(X_test)
from sklearn.metrics import confusion_matrix,mean_squared_error
cm = confusion_matrix(y_test, y_test_pred)
print(cm)
mse = mean_squared_error(y_test, y_test_pred)
print(f"Mean Squared Error: {mse}")
fig, ax = plt.subplots(figsize=(7.5, 7.5))
ax.matshow(cm, cmap=plt.cm.gist_heat, alpha=0.3)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(x=j, y=i, s=cm[i, j], va='center', ha='center', size='xx-large')

plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

#y_test and y_pred are the true labels and predicted labels respectively

acc = accuracy_score(y_test, y_test_pred)
prec = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)

print("Accuracy: ", acc)
print("Precision: ", prec)
print("Recall: ", recall)
print("F1: ",f1)
# Select only the numeric variables
numeric_vars = df_train.select_dtypes(include=['float64', 'int64'])

# Plot the histograms
numeric_vars.hist(bins=50, figsize=(20,15))
plt.show()
from scipy import stats
from pprint import pprint

# Select only the numeric variables
numeric_vars = df_train.select_dtypes(include=['float64', 'int64'])
alpha = 0.05
di={}
# Run the Anderson-Darling test for each variable
for var in numeric_vars.columns:
    result = stats.anderson(numeric_vars[var], dist='norm')
    # The Anderson-Darling test returns critical values (not p-values);
    # the further the statistic exceeds them, the stronger the evidence against normality.
    print(f"{var}: Test Statistic: {result.statistic}, critical values: {result.critical_values}")
    if result.statistic < result.critical_values[0]:
        di[var] = "normal distribution"
    elif result.statistic < result.critical_values[1]:
        di[var] = "normal distribution"
    elif result.statistic < result.critical_values[2]:
        di[var] = "skewed distribution"
    elif result.statistic < result.critical_values[3]:
        di[var] = "very skewed distribution"
    else:
        di[var] = "extremely skewed distribution"
    print("**********************************************************************")
pprint(di)

Tree Classifier

A decision tree classifier is a type of machine learning algorithm that is used to classify data points based on a set of rules. It works by building a tree-like model of decisions and their possible consequences. Each decision node in the tree represents a test on one or more input features, while each leaf node represents a class label or decision.

The algorithm starts by analyzing the training data and selecting the most important feature to split the data into two or more subsets. This process is repeated recursively on each subset until a stopping criterion is met, such as reaching a maximum depth or a minimum number of samples in a leaf node. The final result is a tree-like model that can be used to classify new data points by following the path from the root to a leaf node that matches the input features.

One advantage of decision tree classifiers is that they are easy to interpret and visualize, which can help in understanding the underlying decision-making process. They can also handle both categorical and numerical features and can be used for both classification and regression tasks.

However, decision trees are prone to overfitting, especially when the tree is too complex or the training data is noisy. To overcome this, various techniques such as pruning, regularization, and ensemble methods like Random Forests can be applied.
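As a small illustration of those overfitting controls (a sketch with illustrative parameter values, not the settings used in this notebook):

from sklearn.tree import DecisionTreeClassifier
# Limit depth and leaf size, and apply cost-complexity pruning to curb overfitting
pruned_dtc = DecisionTreeClassifier(max_depth=6, min_samples_leaf=50, ccp_alpha=0.001, random_state=0)
pruned_dtc.fit(X_train, y_train)
print(pruned_dtc.score(X_test, y_test))  # accuracy on the held-out test set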

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text

# X and y are your feature matrix and binary target variable, respectively
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
r = export_text(dtc, feature_names=X_train.columns.tolist())
print(r)
y_train=pd.DataFrame(y_train,columns=['satisfaction'])
y_test=pd.DataFrame(y_test,columns=['satisfaction'])
dependent_vars_train = y_train
independent_vars_train = X_train[feature_cols].reset_index(drop=True)

dependent_vars_test = y_test
independent_vars_test = X_test[feature_cols].reset_index(drop=True)
# Fitting the OLS model (statsmodels) to inspect the p-values of the features
import statsmodels.api as sm
results = sm.OLS(dependent_vars_train, independent_vars_train).fit()
p_values = results.summary2().tables[1]['P>|t|']
print(p_values.round(4))
print(results.summary())
df_train_concat=pd.concat([independent_vars_train,dependent_vars_train],axis=1)
print("*"*300)
df_test_concat=pd.concat([independent_vars_test,dependent_vars_test],axis=1)
df_train_h2o = h2o.H2OFrame(df_train_concat)
df_test_h2o = h2o.H2OFrame(df_test_concat)
df_train_h2o.describe()
y = 'satisfaction'
X_train_h2o = df_train_h2o.drop(y)
X=X_train_h2o.col_names
X_test_h2o = df_test_h2o.drop(y)

H2O AutoML

H2OAutoML is configured with a few hyperparameters and then used to train models on the data:

max_models=10: The run will build up to 10 different models and select the best one based on a predefined metric. I chose 10 to speed up training and save computing power.
seed=seed: This fixes the random seed (1234, defined earlier) so that the results of the training run are reproducible. Omit the seed if you do not need reproducibility.
verbosity="debug": The available options are "debug", "info", and "warn" (the default is None); "debug" prints the most detailed information.
nfolds=0: This specifies the number of cross-validation folds. A value of 0 means no cross-validation is performed and the models are trained on the full training frame. Skipping cross-validation can increase the risk of overfitting, but the accuracy on the held-out test data later turns out to be good, and cross-validation would add training time and compute. A cross-validated variant is sketched after the training call below.

aml = H2OAutoML(max_models=10,seed=seed,verbosity="debug",nfolds=0)
aml.train(x=X, y=y, training_frame=df_train_h2o)
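For comparison, a variant that enables 5-fold cross-validation (a sketch only; it was not run in this notebook and takes noticeably longer):

# Cross-validated variant (illustrative; nfolds=5 adds training time)
aml_cv = H2OAutoML(max_models=10, seed=seed, nfolds=5)
aml_cv.train(x=X, y=y, training_frame=df_train_h2o)
print(aml_cv.leaderboard.head())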

RMSE (Root Mean Squared Error) is a measure of the difference between predicted values and observed values in a regression analysis. It is calculated as the square root of the average of the squared differences between the predicted and actual values.

LogLoss (Logarithmic Loss) measures how well a probabilistic classification model predicts the probability of each possible outcome. It compares the predicted probabilities with the actual outcomes; lower values indicate better-calibrated, more accurate probabilities.

Mean Per-Class Error is a classification error metric that measures the average error rate across all classes. It is calculated by taking the average of the per-class error rates, where the per-class error rate is the proportion of misclassified samples in a given class.

AUC (Area Under the Curve) is a metric that measures the performance of a binary classification model based on the area under the receiver operating characteristic (ROC) curve. The ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) for different classification thresholds.

AUCPR (Area Under the Precision-Recall Curve) is a metric used to evaluate the performance of classification models. It measures the ability of the model to predict the positive class correctly while minimizing the false positives. The AUCPR measures the area under the precision-recall curve, which is a plot of precision against recall for different classification thresholds. The higher the AUCPR, the better the model’s performance.

Gini is a measure of the inequality of a distribution, commonly used in economics, where 0 represents perfect equality and 1 perfect inequality. In decision trees, Gini impurity measures the quality of a split; as a model metric reported alongside AUC, the Gini coefficient is typically computed as 2·AUC − 1, so 0 corresponds to random ranking and 1 to perfect ranking.
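As a quick illustration of these metrics (a sketch, not part of the original notebook, computed with sklearn on the logistic regression predictions from earlier rather than on the H2O leader):

from sklearn.metrics import log_loss, roc_auc_score, average_precision_score, mean_squared_error
proba = classifier.predict_proba(X_test)[:, 1]     # predicted probability of the positive class
y_true = np.ravel(y_test)                          # y_test was converted to a DataFrame above
rmse = np.sqrt(mean_squared_error(y_true, proba))  # RMSE on the predicted probabilities
ll = log_loss(y_true, proba)                       # LogLoss
auc = roc_auc_score(y_true, proba)                 # AUC
aucpr = average_precision_score(y_true, proba)     # AUCPR (average precision)
gini = 2 * auc - 1                                 # Gini coefficient as reported next to AUC
print(rmse, ll, auc, aucpr, gini)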

# Get the best model
X_test_h2o = df_test_h2o.drop(y)
best_model = aml.leader
lb = aml.leaderboard
best_model_id = lb[0, "model_id"]

prediction = aml.leader.predict(X_test_h2o)
prediction=prediction.as_data_frame() 
prediction=prediction['predict'].tolist()
print(prediction)
# Taking 2000 samples for the SHAP analysis, as it is computationally expensive
X_train_2000 = shap.utils.sample(X_train, 2000)
X_test_2000 = shap.utils.sample(X_test, 2000)

Interpreting AutoML

# aml.explain() generates a set of explanation plots (leaderboard summary, variable importance, SHAP, etc.)
explanation = aml.explain(df_train_h2o)
print(explanation)

Interpreting SHAP Feature Importance Plots for the Linear and Tree-Based Models

1. Shap Explainer

The concept underlying SHAP feature importance is easy to understand: features with high absolute Shapley values are significant. Because we want global importance, we average the absolute Shapley values for each feature across the data, then sort and plot the features in decreasing order of importance.

The plots listed below are

1. SHAP feature importance for the linear (logistic regression) model
2. SHAP feature importance for the tree-based (decision tree) model

As the charts below show, the two models rank the features differently; a short sketch of how this global importance is computed follows.
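A minimal sketch of how the global importance can be computed from a shap Explanation object (assuming shap_values is the output of one of the shap.Explainer calls below):

mean_abs_shap = np.abs(shap_values.values).mean(axis=0)  # average |SHAP| per feature
order = np.argsort(mean_abs_shap)[::-1]                  # features sorted by importance
for idx in order[:10]:
    print(shap_values.feature_names[idx], mean_abs_shap[idx])
# shap.plots.bar(shap_values) draws the same ranking as a bar chart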

explainer = shap.Explainer(classifier.predict, X_train_2000, seed=seed)
linear_shap_values = explainer(X_train_2000)
explainer1 = shap.Explainer(dtc.predict, X_train_2000, seed=seed)
tree_shap_values = explainer1(X_train_2000)
for i in X_train.columns:
    # make a standard partial dependence plot for the linear model
    sample_ind = 18
    shap.partial_dependence_plot(
        i,
        classifier.predict,
        X_train,
        model_expected_value=True,
        feature_expected_value=True,
        ice=False,
        shap_values=linear_shap_values[sample_ind : sample_ind + 1, :],
    )
for i in X_train.columns:
    # make a standard partial dependence plot for the tree-based model
    sample_ind = 18
    shap.partial_dependence_plot(
        i,
        dtc.predict,
        X_train,
        model_expected_value=True,
        feature_expected_value=True,
        ice=False,
        shap_values=tree_shap_values[sample_ind : sample_ind + 1, :],
    )

2. Shap Beeswarm

shap.plots.beeswarm() function creates a swarm plot that shows the distribution of SHAP values for each feature in your dataset. The x-axis represents the SHAP value for each feature, while the y-axis represents the different instances or samples in your dataset.

The range you see on the x-axis is simply the range of SHAP values observed in the data. Negative SHAP values indicate that, for those instances, the feature pushes the model output toward the negative class, while positive SHAP values push it toward the positive class. SHAP values close to zero indicate that the feature has little effect on the prediction.

The red and blue colors encode the feature's value for each instance: red points correspond to high feature values and blue points to low feature values. Reading color together with horizontal position reveals the direction of the effect; for example, if the red points sit at positive SHAP values, high values of that feature push predictions toward the positive class.

The further a point lies from zero, the larger that feature's impact on that particular prediction.

Linear values
For the linear model, Type of Travel_Personal Travel is the most important feature, as most people prefer personal travel. It is followed by Customer Type_disloyal Customer, since customers flying with a new airline are often impressed by the services it offers.

Tree values
For the tree-based model, Type of Travel_Personal Travel is again the most important feature, followed by Inflight wifi service, as most passengers want some kind of activity during the flight.

shap.plots.beeswarm(linear_shap_values)  # beeswarm plot for the linear model
shap.plots.beeswarm(tree_shap_values)    # beeswarm plot for the tree-based model

3. Shap Heatmap

A SHAP heatmap shows the SHAP values for each feature of a model for a set of instances. Each row of the heatmap represents an instance, and each column represents a feature. The color of each cell represents the SHAP value for that feature and instance. Typically, red colors indicate positive SHAP values, while blue colors indicate negative SHAP values.

By visualizing the SHAP values in a heatmap, it is possible to identify which features have the greatest impact on the model’s predictions and how they contribute to individual predictions. This information can be useful for understanding the behavior of a model, identifying biases or errors, and improving its performance.

shap.plots.heatmap(linear_shap_values)  # SHAP heatmap of the linear model
shap.plots.heatmap(tree_shap_values)    # SHAP heatmap of the tree-based model

Conclusion

1. The combination of logistic regression and a decision tree classifier, together with H2O AutoML, hyperparameter tuning, and SHAP analysis, provides powerful tools for predictive modeling and data analysis.

2. Logistic regression and decision tree classifiers offer distinct advantages and suit different scenarios, depending on the nature of the data and the research questions at hand.

3. H2O AutoML is a useful tool for building accurate predictive models without manually tuning and optimizing hyperparameters; it supports a wide range of supervised learning tasks, including classification and regression, and handles both numerical and categorical data.

4. Variable importance is a key concept for understanding the factors that drive model predictions and helps identify the features that matter most for performance.

5. By leveraging these techniques, researchers and practitioners can gain valuable insights into complex data and make informed decisions with real-world impact. Here, for instance, we built a predictive model with H2O AutoML to predict airline passenger satisfaction from service and travel attributes.

References

1. https://towardsdatascience.com/a-deep-dive-into-h2os-automl-4b1fe51d3f3e
2. https://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html
3. Pandas Documentation
4. Sklearn Documentation
5. A Complete Guide to Dealing with Missing values in Python
6. “Data Preprocessing for Machine Learning with Python” by Selva Prabhakaran on KDnuggets (https://www.kdnuggets.com/2020/03/data-preprocessing-machine-learning-python.html)
7. Shap official documentation
8. “Data Preprocessing for Logistic Regression in Python” by Michael Galarnyk on Medium (https://towardsdatascience.com/data-preprocessing-for-logistic-regression-in-python-c632418098b0)

Code Link:

https://www.kaggle.com/code/mananrg/combine-data-cleaning-feature-selection-modeling

Authors:

Manan R. Gandhi [https://www.linkedin.com/in/mananrg]
Prof @NikBearBrown

Mentions:

“What We Have Done?” page https://github.com/aiskunks/aiskunks/tree/main/AISkunks/What_We_Have_Done
