Using SHAP for Feature Analysis — Predicting Smokers using Bio Signals — Part 1 Gradient Boosting Libraries

8 min readAug 6, 2024

Yeah first article here, so I wanted to write what I did sometime before. What started as a comparison of just a few libraries, ended up being a feature analysis with new explorations (for me at least) so I'm writing it up.

I started from just trying to compare Gradient Boosting libraries, XGBoost, LightGBM, and CatBoost.

On this dataset, about smokers, to classify smokers using bio-signals.

Binary Prediction of Smoker Status using Bio-Signals Dataset on Kaggle

The bio-signals act as features, consisting of

Age
Height(cm)
Weight(kg)
Waist(cm)
Eyesight(left)
Eyesight(right)
Hearing(left)
Hearing(right)
Systolic
Relaxation
Fasting Blood Sugar
Cholesterol
Triglyceride
HDL (High-Density Lipoprotein)
LDL (Low-Density Lipoprotein)
Hemoglobin
Urine Protein
Serum Creatinine
AST (Aspartate Aminotransferase)
ALT (Alanine Aminotransferase)
GTP (Glutamik Piruvik Transaminase)
Dental Caries
Smoking

Then, I ended up trying out SHAP for feature analysis

**SHAP (SHapley Additive exPlanations)** is a game theoretic approach to explain the output of any machine learning model.

So here’s what I did

Exploratory Data Analysis

Data distribution of target classes

f, ax = plt.subplots(len(['smoking']), 2, figsize=(12, 5))
plt.subplots_adjust(wspace=0.3)

for col in ['smoking']:
    s1 = df[col].value_counts()
    N = len(s1)

    ax[0].pie(
        s1, colors=sns.color_palette('rocket', N),
        labels=s1.index.tolist(), 
        startangle=90, frame=True, radius=1.2, 
        textprops={'fontsize': 16, 'weight': 'bold'},
        autopct='%1.f%%',
    )
    
    sns.barplot(
        x=s1, y=s1.index, ax=ax[1],
        orient='horizontal',
        palette='rocket'
    )

    ax[1].spines['top'].set_visible(False)
    ax[1].spines['right'].set_visible(False)
    ax[1].tick_params(axis='x', which='both', bottom=False, labelbottom=False)
    ax[1].set_ylabel('')  # Remove y label

    for i, v in enumerate(s1):
        ax[1].text(v, i+0.1, str(v), color='black', fontweight='bold', fontsize=14)

    plt.setp(ax[1].get_yticklabels(), fontweight="bold")
    plt.setp(ax[1].get_xticklabels(), fontweight="bold")
    ax[1].set_xlabel(col, fontweight="bold", color='black', fontsize=14)

f.suptitle(f'Dataset Distribution of {col}', fontsize=20, fontweight='bold', y=1.05)

plt.tight_layout()    
plt.show()

The distribution of smokers and non-smokers is mostly balanced, with a small gap in numbers between the two.

Missing data, duplicates, and more information about the dataset

# Check for missing values
df.isnull().sum()

# Create summary table
summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
summary['missing_data'] = df.isnull().sum()
summary['unique_values'] = df.nunique()
summary['count'] = df.count()
summary['duplicate'] = df.duplicated().sum()
summary['min'] = df.min()
summary['max'] = df.max()
summary['mean'] = df.mean()
summary['std_dev'] = df.std()
summary

Turns out the data was clean, with no missing values, and no duplicates and we can see the spread of the data from the image as well

Checked for outliers with two approaches (just to be sure)

# using z-score
z_scores = stats.zscore(target)
outliers = (z_scores > 3) | (z_scores < -3)
outliers.value_counts()p

# using IQR
Q1 = np.percentile(target, 25)
Q3 = np.percentile(target, 75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = (target < lower_bound) | (target > upper_bound)
outliers.value_counts()

Both seem to say that there were no outliers, so that's good.

Data Preprocessing

Data seems to be good, let's do some stuff to it first

Dropping the id, we don’t need it

df = df.drop(['id'], axis=1)
df.head()

Formatting the feature names to use underscores, instead of whitespace

# format feature names to use underscore instead of whitespace
df.columns = df.columns.str.replace(' ', '_')
df.columns

From the figure up above, seems like the data can be standardized more, in this experiment let's use StandardScaler

# Doing the actual scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)
X

Looks good, let’s split the dataset now

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Fine Tuning the Models

The data’s ready, time to tune the models, here’s what I’ll be tuning

Tuning XGBoost

import xgboost as xgb
# # grid search
xgb_model = xgb.XGBClassifier()
parameters = {
              'learning_rate': [0.1],
              'max_depth': [6, 8, 10],
              'n_estimators': [256, 500, 1000, 2048],
              }

xgb_grid = GridSearchCV(xgb_model,parameters,cv=2,n_jobs=5,verbose=True)
xgb_grid.fit(X_train,y_train)

print("Best Score: ", xgb_grid.best_score_)

print("Best Params")
print(xgb_grid.best_params_)

Producing

Hyperparameter Tuning Results for XGBoost

Tuning LightGBM

# grid search
lgb_model = lgb.LGBMClassifier()
parameters = {
    'learning_rate': [0.1],
    'max_depth': [6, 8, 10],
    'n_estimators': [256, 500, 1000, 2048],
}

lgb_grid = GridSearchCV(lgb_model,parameters,cv=2,n_jobs=5)
lgb_grid.fit(X_train,y_train)

print("Best Score: ", lgb_grid.best_score_)

print("Best Params")
print(lgb_grid.best_params_)

Producing

Hyperparameter Tuning Results for LightGBM

Tuning CatBoost

# grid search
cb_model = cb.CatBoostClassifier()
parameters = {
    'learning_rate': [0.1],
    'max_depth': [6, 8, 10],
    'n_estimators': [256, 500, 1000, 2048],
}

cb_grid = GridSearchCV(cb_model,parameters,cv=2,n_jobs=5,verbose=True)
cb_grid.fit(X_train,y_train)

print("Best Score: ", cb_grid.best_score_)

print("Best Params")
print(cb_grid.best_params_)

Producing

Hyperparameter Tuning Results for CatBoost

That done, let’s continue to the real training and experimenting

Training, Prediction and Evaluation

For this part, it’s just mostly on training the model, using it to predict the data, and measuring how good the model is in doing its job by measuring its:

Accuracy Score
ROC AUC Score

Also measuring by:

CrossValidation (StratifiedKFold and CrossValScore)
Classification Report

Leggo

XGBoost

Training

xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42, booster="gbtree", learning_rate=0.1, max_depth=6, n_estimators=500)
xgb_model.fit(X_train, y_train)

Prediction and Evaluation

Accuracy Score

xgb_score = xgb_model.score(X_test, y_test)
xgb_score

Which produced 0.7820 accuracy score

ROC AUC Score

y_pred_proba = xgb_model.predict_proba(X_test)[::,1]
xgb_roc_auc = roc_auc_score(y_test, y_pred_proba)
print('XGBoost ROC AUC Score:')
print(xgb_roc_auc)

Produced 0.8655 ROC AUC score, with its curve

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Assuming you've already calculated y_pred_proba and xgb_roc_auc as in your code

# Calculate the ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {xgb_roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('XGBoost (Full Features) ROC Curve')
plt.legend(loc="lower right")
plt.grid(True)

# Show the plot
plt.show()

# Print the AUC score
print(f'XGBoost ROC AUC Score: {xgb_roc_auc:.3f}')

CrossValidation

num_folds = 5
stratified_kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)
xgb_cv_results = cross_val_score(xgb_model, X_train, y_train, cv=stratified_kfold, scoring='roc_auc', n_jobs=-1)

print('XGBoost Cross Validation Results:')
for i, cv_score in enumerate(xgb_cv_results):
    print(f'Fold {i+1}: {cv_score}')
    
print(f'Average score: {xgb_cv_results.mean()}')
print(f'Score standard deviation: {xgb_cv_results.std()}')

Classification Report

xgb_report = classification_report(y_test, xgb_pred)
print('XGBoost Classification Report:')
print(xgb_report)

LightGBM

Training

lgbc_model = lgb.LGBMClassifier(random_state=42, boosting_type='gbdt', learning_rate=0.1, max_depth=6, n_estimators=256, verbose=-1)
lgbc_model.fit(X_train, y_train)

Prediction and Evaluation

Accuracy Score

lgbc_score = lgbc_model.score(X_test, y_test)
lgbc_score

Which produced 0.7805 accuracy score

ROC AUC Score

y_pred_proba = lgbc_model.predict_proba(X_test)[::,1]
lgb_roc_auc = roc_auc_score(y_test, y_pred_proba)
print("LightGBM ROC AUC Score:")
print(lgb_roc_auc)

Produced 0.8640 ROC AUC score, with its curve

# Calculate the ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {lgb_roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('LightGBM (Full Features) ROC Curve')
plt.legend(loc="lower right")
plt.grid(True)

# Show the plot
plt.show()

# Print the AUC score
print(f'LightGBM ROC AUC Score: {lgb_roc_auc:.3f}')
# Assuming you've already calculated y_pred_proba and xgb_roc_auc as in your code

CrossValidation

num_folds = 5
stratified_kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)
lgbc_cv_results = cross_val_score(lgbc_model, X_train, y_train, cv=stratified_kfold, scoring='roc_auc', n_jobs=-1)

print('LightGBM Cross Validation Results:')
for i, cv_score in enumerate(lgbc_cv_results):
    print(f'Fold {i+1}: {cv_score}')
    
print(f'Average score: {lgbc_cv_results.mean()}')
print(f'Score standard deviation: {lgbc_cv_results.std()}')

Classification Report

lgb_report = classification_report(y_test, lgbc_pred)
print("LightGBM Classification Report:")
print(lgb_report)

CatBoost

Training

cb_model = cb.CatBoostClassifier(random_state=42, boosting_type='Plain', learning_rate=0.1, max_depth=6, n_estimators=500)
cb_model.fit(X_train, y_train)

Prediction and Evaluation

Accuracy Score

cb_score = cb_model.score(X_test, y_test)
cb_score

Which produced 0.7794 accuracy score

ROC AUC Score

y_pred_proba = cb_model.predict_proba(X_test)[::,1]
cb_roc_auc = roc_auc_score(y_test, y_pred_proba)
print("CatBoost ROC AUC Score:")
print(cb_roc_auc)

Produced 0.8644 ROC AUC score, with its curve

# Calculate the ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {cb_roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('CatBoost (Full Features) ROC Curve')
plt.legend(loc="lower right")
plt.grid(True)

# Show the plot
plt.show()

# Print the AUC score
print(f'CatBoost ROC AUC Score: {cb_roc_auc:.3f}')

CrossValidation

num_folds = 5
stratified_kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)
cb_cv_results = cross_val_score(cb_model, X_train, y_train, cv=stratified_kfold, scoring='roc_auc', n_jobs=-1)

print('CatBoost Cross Validation Results:')
for i, cv_score in enumerate(cb_cv_results):
    print(f'Fold {i+1}: {cv_score}')
    
print(f'Average score: {cb_cv_results.mean()}')
print(f'Score standard deviation: {cb_cv_results.std()}')

Classification Report

cb_report = classification_report(y_test, cb_pred)
print("CatBoost Classification Report:")
print(cb_report)

Looks like it’s gonna be a long one, let’s wrap it up after this

Analysis of Results

Let’s summarize what we have

Table Results of Model Training with Complete Features

It can be seen that overall, the values do not differ much, though with XGBoost having the overall highest scores, both by accuracy, ROC AUC and mean CV score values, followed by LightGBM then by CatBoost.

Even visualized they look like this