Machine Learning Credit Risk Modelling: A Supervised Learning. Part 6

Wibowo Tangara
Jan 24, 2024 · 8 min read


Part 6: Advanced Evaluation

Previous part: Part 5: Modeling — Train, Test and Evaluate (on Medium)

In this final part we will further evaluate the model chosen in the previous chapter (Random Forest) using several evaluation metrics:

  • ROC, AUC and the KS test, since these are the most common metrics for evaluating credit risk models.
  • K-fold cross-validation, to make sure there is no data leakage or overfitting.
  • Lastly, we will check feature importance.

Before we run all the tests above, we need to convert the target values to binary, because the ROC AUC (Receiver Operating Characteristic Area Under the Curve) and KS (Kolmogorov-Smirnov) tests are evaluation metrics commonly used for binary classification models. They are typically applied when the target variable is binary, indicating two possible classes (e.g., 0 and 1, or negative and positive).

Y_train = Y_train.map({'good': 1, 'bad': 0})
Y_train = Y_train.astype(int)

Y_test = Y_test.map({'good': 1, 'bad': 0})
Y_test = Y_test.astype(int)


After that, we reinitialize the Random Forest model, generate the predicted probabilities on the test set, and collect the actual and predicted values in a single DataFrame.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, Y_train)

# Probability of the positive class (good = 1) for each test observation
y_pred_proba = rfc.predict_proba(X_test)[:, 1]

# Collect the actual labels and predicted probabilities in one DataFrame
df_actual_predicted = pd.concat([pd.DataFrame(np.array(Y_test), columns=['y_actual']),
                                 pd.DataFrame(y_pred_proba, columns=['y_pred_proba'])], axis=1)
df_actual_predicted.index = Y_test.index


ROC (Receiver Operating Characteristic) AUC (Area Under the ROC Curve) Test

Then we run the code below to perform the ROC AUC test.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Compute the ROC curve points and the area under the curve
fpr, tpr, thresholds = roc_curve(df_actual_predicted['y_actual'], df_actual_predicted['y_pred_proba'])
auc = roc_auc_score(df_actual_predicted['y_actual'], df_actual_predicted['y_pred_proba'])

plt.plot(fpr, tpr, label='AUC = %0.4f' % auc)
plt.plot(fpr, fpr, linestyle='--', color='k')  # random-guess diagonal
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()


The output is shown below.

ROC (Receiver Operating Characteristic) Curve:

The ROC curve is a graphical representation of the performance of a binary classification model at various classification thresholds. It plots the True Positive Rate (Sensitivity or Recall) against the False Positive Rate (1 − Specificity) for different threshold values. The ROC curve provides a visual way to assess the trade-off between sensitivity and specificity at different decision thresholds.

A diagonal line (the “random guess” line) in the ROC space represents a model that performs no better than random chance. The goal is for the ROC curve to be as close as possible to the top-left corner, indicating high sensitivity and low false positive rate across different thresholds.
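To make the threshold trade-off concrete, here is a minimal sketch that computes the TPR and FPR at a single cut-off from the predicted probabilities built earlier; the 0.5 threshold is only an illustrative choice, and the ROC curve is produced by sweeping this threshold over all possible values.

from sklearn.metrics import confusion_matrix

# Classify as good (1) when the predicted probability exceeds an example threshold of 0.5
threshold = 0.5
y_pred_label = (df_actual_predicted['y_pred_proba'] >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(df_actual_predicted['y_actual'], y_pred_label).ravel()

tpr = tp / (tp + fn)  # True Positive Rate (sensitivity / recall)
fpr = fp / (fp + tn)  # False Positive Rate (1 - specificity)
print(f'At threshold {threshold}: TPR = {tpr:.4f}, FPR = {fpr:.4f}')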

AUC (Area Under the ROC Curve):

The AUC is a scalar value that represents the area under the ROC curve. It provides a single numerical measure of a model’s ability to discriminate between positive and negative instances across various threshold values. AUC ranges from 0 to 1, where a higher AUC indicates better model performance.

  • AUC = 1: Perfect classifier (the ROC curve covers the entire area under the curve).
  • AUC = 0.5: Random classifier (the ROC curve is along the diagonal).
  • AUC < 0.5: Worse than random classifier (the ROC curve is below the diagonal).

Interpretation of AUC:

  • 0.9–1.0: Excellent discrimination.
  • 0.8–0.9: Good discrimination.
  • 0.7–0.8: Acceptable discrimination.
  • 0.6–0.7: Poor discrimination.
  • 0.5–0.6: Fails to discriminate.

AUC is a useful metric for evaluating the overall performance of a binary classification model without specifying a particular decision threshold. It is widely used in machine learning and is especially valuable when dealing with imbalanced datasets or when different costs are associated with false positives and false negatives.

Kolmogorov-Smirnov (KS) Test

Also known as the Kolmogorov-Smirnov goodness-of-fit test, this is a non-parametric statistical test used to assess whether a sample distribution differs from a reference probability distribution. It is particularly useful for comparing a sample distribution to a theoretical distribution or another sample distribution.

The KS test is sensitive to differences in both location and shape of the cumulative distribution functions (CDFs) and is applicable to continuous and discrete distributions.

The test statistic in the KS test is the maximum absolute difference between the empirical cumulative distribution function (ECDF) of the sample and the theoretical cumulative distribution function (CDF) of the reference distribution.

The KS test is often used in various fields, including finance, biology, and engineering, to assess whether a dataset follows a particular distribution.

For example, in the context of credit scoring, the KS statistic is sometimes used to evaluate the discriminatory power of a credit scoring model by comparing the cumulative distribution of credit scores for good and bad credit applicants.
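The next code block builds these cumulative distributions by hand, and the KS statistic is simply the largest vertical gap between the two empirical CDFs of the model scores for good and bad applicants. As an optional sanity check that is not part of the original walkthrough, the same statistic can also be obtained with SciPy's two-sample KS test applied to the predicted probabilities of the two actual classes (this assumes SciPy is available, which the article does not otherwise use).

from scipy.stats import ks_2samp

# Model scores (predicted probability of being good) split by actual class
proba_good = df_actual_predicted.loc[df_actual_predicted['y_actual'] == 1, 'y_pred_proba']
proba_bad = df_actual_predicted.loc[df_actual_predicted['y_actual'] == 0, 'y_pred_proba']

# The statistic is the largest vertical gap between the two empirical CDFs
result = ks_2samp(proba_good, proba_bad)
print(f'KS statistic: {result.statistic:.4f}, p-value: {result.pvalue:.4g}')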

Before we run the KS test, we need to define the variables for it by running the code below.

# Sort observations by predicted probability of being good (ascending)
df_actual_predicted = df_actual_predicted.sort_values('y_pred_proba')
df_actual_predicted = df_actual_predicted.reset_index()

# Since good = 1 and bad = 0, the cumulative sum of y_actual counts the goods
df_actual_predicted['Cumulative N Population'] = df_actual_predicted.index + 1
df_actual_predicted['Cumulative N Good'] = df_actual_predicted['y_actual'].cumsum()
df_actual_predicted['Cumulative N Bad'] = df_actual_predicted['Cumulative N Population'] - df_actual_predicted['Cumulative N Good']
df_actual_predicted['Cumulative Perc Population'] = df_actual_predicted['Cumulative N Population'] / df_actual_predicted.shape[0]
df_actual_predicted['Cumulative Perc Good'] = df_actual_predicted['Cumulative N Good'] / df_actual_predicted['y_actual'].sum()
df_actual_predicted['Cumulative Perc Bad'] = df_actual_predicted['Cumulative N Bad'] / (df_actual_predicted.shape[0] - df_actual_predicted['y_actual'].sum())


Then we run the code below to generate the test output.

# KS is the maximum vertical distance between the two cumulative distributions
KS = max(df_actual_predicted['Cumulative Perc Bad'] - df_actual_predicted['Cumulative Perc Good'])

plt.plot(df_actual_predicted['y_pred_proba'], df_actual_predicted['Cumulative Perc Bad'], color='r', label='Bad')
plt.plot(df_actual_predicted['y_pred_proba'], df_actual_predicted['Cumulative Perc Good'], color='b', label='Good')
plt.xlabel('Estimated Probability of Being Good')
plt.ylabel('Cumulative %')
plt.title('Kolmogorov-Smirnov: %0.4f' % KS)
plt.legend()
plt.show()


The output is shown below.

Interpretation of KS:

  • KS = 0: the distributions of positive and negative samples are identical
  • 0 < KS < 0.2: very small difference between distributions
  • 0.2 ≤ KS < 0.5: moderate difference
  • KS ≥ 0.5: considerable difference

Both the ROC AUC and the KS statistic indicate good performance for this model.

K-Fold Cross-Validation

K-Fold Cross-Validation is a model evaluation technique that helps assess the performance and generalization ability of a machine learning model. The dataset is partitioned into K equally sized folds (or subsets), and the model is trained and evaluated K times, each time using a different fold as the test set and the remaining folds as the training set. The process results in K performance metrics, and the average or other aggregation of these metrics provides an overall assessment of the model.

K-Fold Cross-Validation can highlight issues with overfitting or underfitting. If a model performs well on the training set but poorly on the test sets, it may indicate overfitting, and cross-validation can help detect such issues.

To run K-Fold Cross-Validation, we use the code below.

from sklearn.model_selection import KFold, cross_val_score

model = RandomForestClassifier()

# Split the training data into 5 folds, shuffling before the split
k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X_train, Y_train, cv=k_fold, scoring='accuracy')

for i, score in enumerate(scores, 1):
    print(f'Fold {i}: Accuracy = {score:.4f}')

print(f'Mean Accuracy: {np.mean(scores):.4f}')
print(f'Standard Deviation: {np.std(scores):.4f}')


The output is shown below.

The K-Fold Cross-Validation results indicate a model with high accuracy (around 98.5%) and a low standard deviation, suggesting that the model performs consistently well across different subsets of the data.
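Accuracy can be optimistic when the classes are imbalanced, as is typical in credit data. As an optional complementary check that is not part of the original notebook, the sketch below cross-validates on ROC AUC with stratified folds so each fold keeps the same good/bad ratio as Y_train; the stratified splitter, the roc_auc scoring choice, and the random_state are assumptions made here for illustration.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np

# Stratified folds preserve the class ratio of Y_train in every split
stratified_k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

auc_scores = cross_val_score(RandomForestClassifier(random_state=42),
                             X_train, Y_train,
                             cv=stratified_k_fold, scoring='roc_auc')

for i, score in enumerate(auc_scores, 1):
    print(f'Fold {i}: ROC AUC = {score:.4f}')

print(f'Mean ROC AUC: {np.mean(auc_scores):.4f}')
print(f'Standard Deviation: {np.std(auc_scores):.4f}')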

Feature Importance

Feature importance refers to a measure of the impact or contribution of each feature (input variable) to the performance of a machine learning model. It quantifies the degree to which a feature influences the model’s predictions. Understanding feature importance is valuable for gaining insights into the relationships between input variables and the target variable, as well as for identifying which features are most influential in making accurate predictions.

There are various methods for calculating feature importance, and different machine learning algorithms may provide different ways to derive these importance scores. Some common methods include:

  • Decision Tree-Based Methods: Decision tree-based algorithms, such as Random Forest and Gradient Boosting, provide feature importance scores based on how often a feature is used for splitting nodes in the trees and the improvement in prediction accuracy achieved by each split.
  • Permutation Importance: Permutation importance involves shuffling the values of a single feature and measuring the change in model performance. A greater decrease in performance indicates higher importance for that feature (see the sketch after this list).
  • Linear Model Coefficients: In linear models (e.g., linear regression, logistic regression), the coefficients associated with each feature indicate their contribution to the predicted output. Larger coefficients suggest higher importance.
  • Recursive Feature Elimination: Recursive Feature Elimination (RFE) is an iterative method that removes the least important features one by one, retraining the model at each step. The ranking of features based on the order of elimination can indicate their importance.
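As an illustration of the permutation approach mentioned above, here is a minimal sketch using scikit-learn's permutation_importance on the already fitted rfc model and the test set. It is an optional cross-check alongside the tree-based importances computed later in this section, not part of the original walkthrough; the choice of n_repeats=5 is an assumption for illustration.

import pandas as pd
from sklearn.inspection import permutation_importance

# Shuffle each feature a few times and measure the drop in test-set score
perm_result = permutation_importance(rfc, X_test, Y_test, n_repeats=5, random_state=42)

df_perm_importance = pd.DataFrame({
    'feature': X_test.columns,
    'importance': perm_result.importances_mean
}).sort_values(by='importance', ascending=False)

print(df_perm_importance.head(10))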

Why Feature Importance is Needed:

  • Interpretability: Feature importance provides interpretable insights into the factors influencing a model’s predictions. Understanding which features matter most can help stakeholders make informed decisions.
  • Feature Selection: Feature importance aids in feature selection by identifying the most relevant features. Reducing the number of features can lead to simpler and more interpretable models, as well as potentially improved performance.
  • Identifying Key Drivers: For certain applications, identifying the key drivers of a target variable is crucial. Feature importance helps prioritize features based on their impact on predictions.
  • Model Diagnostics: Feature importance can be used for model diagnostics. If unexpected or counterintuitive feature importance is observed, it might indicate issues with the model or dataset that require further investigation.
  • Improving Model Understanding: Feature importance enhances the overall understanding of the relationship between input variables and the target variable. It helps answer questions like “What factors contribute the most to the predicted outcomes?”
  • Avoiding Overfitting: Feature importance can guide the avoidance of overfitting by identifying features that might be causing the model to memorize the training data rather than generalize well to new data.

In summary, feature importance is a valuable tool for understanding, interpreting, and improving machine learning models. It provides actionable insights that can inform decisions related to model complexity, feature selection, and overall model performance.

arr_feature_importances = rfc.feature_importances_
arr_feature_names = X_train.columns.values

# Pair each feature with its impurity-based importance and sort descending
df_feature_importance = pd.DataFrame({'feature': arr_feature_names,
                                      'importance': arr_feature_importances})
df_all_features = df_feature_importance.sort_values(by='importance', ascending=False)
df_all_features


The code above shows the importance of all features, sorted in descending order. To make the result easier to read, we generate a bar chart showing only the top 10 features by running the code below.

df_top_features = df_all_features.head(10).sort_values(by='importance', ascending=True)

plt.figure(figsize=(10, 6))
plt.barh(df_top_features['feature'], df_top_features['importance'], color='skyblue')
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances')

# Annotate each bar with its importance value
for index, value in enumerate(df_top_features['importance']):
    plt.text(value, index, f'{value:.4f}', va='center')

plt.show()


The result is shown below.

This concludes the project.

You can also visit my public GitHub repository for this project below.

github repository

This article was first published at https://www.xennialtechguy.id/posts/credit-risk-modelling-part-6/
