Using SHAP for Feature Analysis — Predicting Smokers using Bio Signals — Part 1 Gradient Boosting Libraries

Jason Sebastian
8 min readAug 6, 2024

--

Yeah first article here, so I wanted to write what I did sometime before. What started as a comparison of just a few libraries, ended up being a feature analysis with new explorations (for me at least) so I'm writing it up.

I started from just trying to compare Gradient Boosting libraries, XGBoost, LightGBM, and CatBoost.

Gradient Boosting Libraries (XGBoost, LightGBM and CatBoost)

On this dataset, about smokers, to classify smokers using bio-signals.

Binary Prediction of Smoker Status using Bio-Signals Dataset on Kaggle

The bio-signals act as features, consisting of

  1. Age
  2. Height(cm)
  3. Weight(kg)
  4. Waist(cm)
  5. Eyesight(left)
  6. Eyesight(right)
  7. Hearing(left)
  8. Hearing(right)
  9. Systolic
  10. Relaxation
  11. Fasting Blood Sugar
  12. Cholesterol
  13. Triglyceride
  14. HDL (High-Density Lipoprotein)
  15. LDL (Low-Density Lipoprotein)
  16. Hemoglobin
  17. Urine Protein
  18. Serum Creatinine
  19. AST (Aspartate Aminotransferase)
  20. ALT (Alanine Aminotransferase)
  21. GTP (Glutamik Piruvik Transaminase)
  22. Dental Caries
  23. Smoking

Then, I ended up trying out SHAP for feature analysis

SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model.

So here’s what I did

Exploratory Data Analysis

Data distribution of target classes

f, ax = plt.subplots(len(['smoking']), 2, figsize=(12, 5))
plt.subplots_adjust(wspace=0.3)

for col in ['smoking']:
s1 = df[col].value_counts()
N = len(s1)

ax[0].pie(
s1, colors=sns.color_palette('rocket', N),
labels=s1.index.tolist(),
startangle=90, frame=True, radius=1.2,
textprops={'fontsize': 16, 'weight': 'bold'},
autopct='%1.f%%',
)

sns.barplot(
x=s1, y=s1.index, ax=ax[1],
orient='horizontal',
palette='rocket'
)

ax[1].spines['top'].set_visible(False)
ax[1].spines['right'].set_visible(False)
ax[1].tick_params(axis='x', which='both', bottom=False, labelbottom=False)
ax[1].set_ylabel('') # Remove y label

for i, v in enumerate(s1):
ax[1].text(v, i+0.1, str(v), color='black', fontweight='bold', fontsize=14)

plt.setp(ax[1].get_yticklabels(), fontweight="bold")
plt.setp(ax[1].get_xticklabels(), fontweight="bold")
ax[1].set_xlabel(col, fontweight="bold", color='black', fontsize=14)

f.suptitle(f'Dataset Distribution of {col}', fontsize=20, fontweight='bold', y=1.05)

plt.tight_layout()
plt.show()
Data Distribution of Target Classes

The distribution of smokers and non-smokers is mostly balanced, with a small gap in numbers between the two.

Missing data, duplicates, and more information about the dataset

# Check for missing values
df.isnull().sum()

# Create summary table
summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
summary['missing_data'] = df.isnull().sum()
summary['unique_values'] = df.nunique()
summary['count'] = df.count()
summary['duplicate'] = df.duplicated().sum()
summary['min'] = df.min()
summary['max'] = df.max()
summary['mean'] = df.mean()
summary['std_dev'] = df.std()
summary
Dataset Summary

Turns out the data was clean, with no missing values, and no duplicates and we can see the spread of the data from the image as well

Checked for outliers with two approaches (just to be sure)

# using z-score
z_scores = stats.zscore(target)
outliers = (z_scores > 3) | (z_scores < -3)
outliers.value_counts()p
# using IQR
Q1 = np.percentile(target, 25)
Q3 = np.percentile(target, 75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = (target < lower_bound) | (target > upper_bound)
outliers.value_counts()

Both seem to say that there were no outliers, so that's good.

Data Preprocessing

Data seems to be good, let's do some stuff to it first

Dropping the id, we don’t need it

df = df.drop(['id'], axis=1)
df.head()

Formatting the feature names to use underscores, instead of whitespace

# format feature names to use underscore instead of whitespace
df.columns = df.columns.str.replace(' ', '_')
df.columns

From the figure up above, seems like the data can be standardized more, in this experiment let's use StandardScaler

# Doing the actual scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)
X
Standardized Data

Looks good, let’s split the dataset now

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Fine Tuning the Models

The data’s ready, time to tune the models, here’s what I’ll be tuning

Fine Tune Hyperparameters

Tuning XGBoost

import xgboost as xgb
# # grid search
xgb_model = xgb.XGBClassifier()
parameters = {
'learning_rate': [0.1],
'max_depth': [6, 8, 10],
'n_estimators': [256, 500, 1000, 2048],
}

xgb_grid = GridSearchCV(xgb_model,parameters,cv=2,n_jobs=5,verbose=True)
xgb_grid.fit(X_train,y_train)

print("Best Score: ", xgb_grid.best_score_)

print("Best Params")
print(xgb_grid.best_params_)

Producing

Hyperparameter Tuning Results for XGBoost

Tuning LightGBM

# grid search
lgb_model = lgb.LGBMClassifier()
parameters = {
'learning_rate': [0.1],
'max_depth': [6, 8, 10],
'n_estimators': [256, 500, 1000, 2048],
}

lgb_grid = GridSearchCV(lgb_model,parameters,cv=2,n_jobs=5)
lgb_grid.fit(X_train,y_train)

print("Best Score: ", lgb_grid.best_score_)

print("Best Params")
print(lgb_grid.best_params_)

Producing

Hyperparameter Tuning Results for LightGBM

Tuning CatBoost

# grid search
cb_model = cb.CatBoostClassifier()
parameters = {
'learning_rate': [0.1],
'max_depth': [6, 8, 10],
'n_estimators': [256, 500, 1000, 2048],
}

cb_grid = GridSearchCV(cb_model,parameters,cv=2,n_jobs=5,verbose=True)
cb_grid.fit(X_train,y_train)

print("Best Score: ", cb_grid.best_score_)

print("Best Params")
print(cb_grid.best_params_)

Producing

Hyperparameter Tuning Results for CatBoost

That done, let’s continue to the real training and experimenting

Training, Prediction and Evaluation

For this part, it’s just mostly on training the model, using it to predict the data, and measuring how good the model is in doing its job by measuring its:

  • Accuracy Score
  • ROC AUC Score

Also measuring by:

  • CrossValidation (StratifiedKFold and CrossValScore)
  • Classification Report

Leggo

XGBoost

Training

xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42, booster="gbtree", learning_rate=0.1, max_depth=6, n_estimators=500)
xgb_model.fit(X_train, y_train)

Prediction and Evaluation

Accuracy Score

xgb_score = xgb_model.score(X_test, y_test)
xgb_score

Which produced 0.7820 accuracy score

ROC AUC Score

y_pred_proba = xgb_model.predict_proba(X_test)[::,1]
xgb_roc_auc = roc_auc_score(y_test, y_pred_proba)
print('XGBoost ROC AUC Score:')
print(xgb_roc_auc)

Produced 0.8655 ROC AUC score, with its curve

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Assuming you've already calculated y_pred_proba and xgb_roc_auc as in your code

# Calculate the ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {xgb_roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('XGBoost (Full Features) ROC Curve')
plt.legend(loc="lower right")
plt.grid(True)

# Show the plot
plt.show()

# Print the AUC score
print(f'XGBoost ROC AUC Score: {xgb_roc_auc:.3f}')
XGBoost (Full Features) ROC Curve

CrossValidation

num_folds = 5
stratified_kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)
xgb_cv_results = cross_val_score(xgb_model, X_train, y_train, cv=stratified_kfold, scoring='roc_auc', n_jobs=-1)

print('XGBoost Cross Validation Results:')
for i, cv_score in enumerate(xgb_cv_results):
print(f'Fold {i+1}: {cv_score}')

print(f'Average score: {xgb_cv_results.mean()}')
print(f'Score standard deviation: {xgb_cv_results.std()}')
XGBoost Cross Validation Results

Classification Report

xgb_report = classification_report(y_test, xgb_pred)
print('XGBoost Classification Report:')
print(xgb_report)
XGBoost Classification Report

LightGBM

Training

lgbc_model = lgb.LGBMClassifier(random_state=42, boosting_type='gbdt', learning_rate=0.1, max_depth=6, n_estimators=256, verbose=-1)
lgbc_model.fit(X_train, y_train)

Prediction and Evaluation

Accuracy Score

lgbc_score = lgbc_model.score(X_test, y_test)
lgbc_score

Which produced 0.7805 accuracy score

ROC AUC Score

y_pred_proba = lgbc_model.predict_proba(X_test)[::,1]
lgb_roc_auc = roc_auc_score(y_test, y_pred_proba)
print("LightGBM ROC AUC Score:")
print(lgb_roc_auc)

Produced 0.8640 ROC AUC score, with its curve

# Calculate the ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {lgb_roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('LightGBM (Full Features) ROC Curve')
plt.legend(loc="lower right")
plt.grid(True)

# Show the plot
plt.show()

# Print the AUC score
print(f'LightGBM ROC AUC Score: {lgb_roc_auc:.3f}')
# Assuming you've already calculated y_pred_proba and xgb_roc_auc as in your code
LightGBM (Full Features) ROC Curve

CrossValidation

num_folds = 5
stratified_kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)
lgbc_cv_results = cross_val_score(lgbc_model, X_train, y_train, cv=stratified_kfold, scoring='roc_auc', n_jobs=-1)

print('LightGBM Cross Validation Results:')
for i, cv_score in enumerate(lgbc_cv_results):
print(f'Fold {i+1}: {cv_score}')

print(f'Average score: {lgbc_cv_results.mean()}')
print(f'Score standard deviation: {lgbc_cv_results.std()}')

Classification Report

lgb_report = classification_report(y_test, lgbc_pred)
print("LightGBM Classification Report:")
print(lgb_report)

CatBoost

Training

cb_model = cb.CatBoostClassifier(random_state=42, boosting_type='Plain', learning_rate=0.1, max_depth=6, n_estimators=500)
cb_model.fit(X_train, y_train)

Prediction and Evaluation

Accuracy Score

cb_score = cb_model.score(X_test, y_test)
cb_score

Which produced 0.7794 accuracy score

ROC AUC Score

y_pred_proba = cb_model.predict_proba(X_test)[::,1]
cb_roc_auc = roc_auc_score(y_test, y_pred_proba)
print("CatBoost ROC AUC Score:")
print(cb_roc_auc)

Produced 0.8644 ROC AUC score, with its curve

# Calculate the ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {cb_roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('CatBoost (Full Features) ROC Curve')
plt.legend(loc="lower right")
plt.grid(True)

# Show the plot
plt.show()

# Print the AUC score
print(f'CatBoost ROC AUC Score: {cb_roc_auc:.3f}')

CrossValidation

num_folds = 5
stratified_kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)
cb_cv_results = cross_val_score(cb_model, X_train, y_train, cv=stratified_kfold, scoring='roc_auc', n_jobs=-1)

print('CatBoost Cross Validation Results:')
for i, cv_score in enumerate(cb_cv_results):
print(f'Fold {i+1}: {cv_score}')

print(f'Average score: {cb_cv_results.mean()}')
print(f'Score standard deviation: {cb_cv_results.std()}')

Classification Report

cb_report = classification_report(y_test, cb_pred)
print("CatBoost Classification Report:")
print(cb_report)

Looks like it’s gonna be a long one, let’s wrap it up after this

Analysis of Results

Let’s summarize what we have

Table Results of Model Training with Complete Features

It can be seen that overall, the values do not differ much, though with XGBoost having the overall highest scores, both by accuracy, ROC AUC and mean CV score values, followed by LightGBM then by CatBoost.

Even visualized they look like this

Model Comparison by Accuracy score

What’s next, well I want to do feature selection and re-train the model with the selected features, here’s what I have now

Feature Importances for All Models

It seems that the one with LightGBM is abit weird, though let’s continue on for now by finding the mean of the feature importances

Mean Feature Importances from All Models

From the image above, it can be seen that the following features are the most important

  1. Triglyceride
  2. GTP
  3. Hemoglobin
  4. LDL
  5. Fasting Blood Sugar
  6. Age
  7. ALT
  8. Waist(cm)
  9. Cholesterol
  10. HDL

But is it, imma write the continued analysis and what happened when I retrain the models, so look forward to it!

Thanks everyone!

--

--