Machine Learning Credit Risk Modelling: A Supervised Learning Approach. Part 5

Wibowo Tangara
10 min read · Jan 24, 2024


Part 5: Modeling — Train, Test and Evaluate

Previous part: Part 4: Feature Scaling and Encoding

These are the steps we will conduct in this part of the project:

  1. Divide the dataset into a training set and a testing set.
  2. Conduct imbalance resampling only on the training set.
  3. Develop several models.
  4. Evaluate the models.

Train — test split (80% — 20%)

As we know, the df_model DataFrame has 242,059 rows; based on this size we will do an 80% train — 20% test split. These are the thresholds used in this project for deciding the split ratio:

  • n < 10,000 : not advised
  • 10,000 < n < 100,000 : 70% train — 30% test
  • 100,000 < n < 1,000,000 : 80% train — 20% test
  • n > 1,000,000 : 99% train — 1% test
X = df_model
Y = df_model['loan_label']

feature_names = X.columns.tolist()


After running this code, X will contain the features (note that at this point it still includes the target column loan_label, which we drop later, just before training), Y will contain the target variable, and feature_names will be a list of the names of all the columns. These are essential components when setting up data for training a machine learning model: the features (X) and target variable (Y) are used for training the model, and feature_names can be helpful for understanding or visualizing the features during analysis or model interpretation.
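As a quick sanity check (a small sketch, not part of the original notebook), we can confirm the shapes and the raw target distribution before splitting:

print(X.shape)            # expected: (242059, number_of_columns)
print(Y.value_counts())   # raw counts of each target value
print(feature_names[:5])  # first few column names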

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)


After running this code, we will have separate sets of data (X_train, X_test, Y_train, Y_test) that can be used for training and evaluating machine learning models. The training set is used to train the model, and the testing set is used to assess the model’s performance on unseen data.
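To verify the split we can inspect the resulting shapes; a small sketch (the stratify variant shown in the comment is an optional alternative, not what the project uses):

print(X_train.shape, X_test.shape)  # roughly 80% and 20% of the 242,059 rows

# optional variant: a stratified split keeps the class ratio identical in both sets
# X_train, X_test, Y_train, Y_test = train_test_split(
#     X, Y, test_size=0.2, random_state=42, stratify=Y
# )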

Conducting imbalance resampling only on the train set

Imbalance resampling is a technique used to address class imbalance in a dataset, where one class has significantly fewer instances than another. Class imbalance is common in many real-world machine learning problems, such as fraud detection, medical diagnosis, and rare event prediction. Imbalance resampling involves adjusting the class distribution in the dataset to give equal or more balanced representation to each class.

The two most common forms of imbalance resampling are oversampling and undersampling:

Oversampling:

  • Objective: Increase the number of instances of the minority class.
  • Methods: Duplicate random instances from the minority class or generate synthetic samples using techniques like SMOTE (Synthetic Minority Over-sampling Technique).

Undersampling:

  • Objective: Decrease the number of instances of the majority class.
  • Methods: Randomly remove instances from the majority class or select a subset of instances from the majority class.

These techniques aim to mitigate the impact of class imbalance on model training, as imbalanced datasets can lead to models that are biased toward the majority class.

The reason imbalance resampling is typically applied only to the training set is rooted in the goal of ensuring that the model generalizes well to real-world, imbalanced scenarios. When a model is deployed, it is likely to encounter imbalanced data in the same way as it did during training. Therefore, the testing set should reflect the original class distribution to evaluate the model’s performance under realistic conditions.
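Before resampling, it is worth confirming that the split left a similar (still imbalanced) class distribution in both sets; a minimal check, not part of the original notebook:

# class proportions should look similar in the two sets, and both should still be imbalanced
print(Y_train.value_counts(normalize=True))
print(Y_test.value_counts(normalize=True))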

import matplotlib.pyplot as plt

loan_category_counts = X_train['loan_label'].value_counts()

colors = ['green', 'red']

plt.bar(loan_category_counts.index, loan_category_counts.values, color=colors)
plt.xlabel('Loan Label')
plt.ylabel('Count')
plt.title('Distribution of Loan Label')

# annotate each bar with its count
for i, count in enumerate(loan_category_counts.values):
    plt.text(i, count, str(count), ha='center', va='bottom', fontsize=10)

plt.show()


Run the code above to generate a bar plot showing the distribution of each value of the target.

As we can see, there is an imbalance between the target values, so we need to choose between oversampling and undersampling to solve this problem. The choice depends on the specific characteristics of the dataset and the nature of the problem we are trying to solve. Both oversampling and undersampling have their advantages and disadvantages, and the decision should be based on a thoughtful analysis of the data and the goals of our machine learning task. Here are some considerations for choosing between oversampling and undersampling:

Nature of the Problem:

Oversampling:

  • Effective when the minority class is underrepresented but contains valuable information.
  • Suitable for scenarios where increasing the number of instances in the minority class is feasible and beneficial.

Undersampling:

  • Appropriate when the majority class has a large number of instances that may not contribute significantly to the learning process.
  • Can be effective when the majority class instances are somewhat redundant or less informative.

Data Size:

Oversampling:

  • Can be applied even when the minority class is significantly smaller than the majority class.
  • Generates synthetic samples or duplicates existing ones to balance class distribution.

Undersampling:

  • Suitable when the dataset is large, and removing instances from the majority class would still leave a sufficient number of samples for training.

Computational Resources:

Oversampling:

  • May increase the size of the dataset, potentially leading to increased computational requirements.

Undersampling:

  • Reduces the size of the dataset, potentially improving computational efficiency.

Quality of Information:

Oversampling:

  • Useful when each instance of the minority class is valuable and contributes unique information to the model.

Undersampling:

  • Suitable when instances of the majority class are somewhat redundant or do not add significant value to the learning process.

Impact on Model Generalization:

Oversampling:

  • May result in a more generalized model as it learns from an increased number of instances in the minority class.

Undersampling:

  • Risk of losing valuable information if instances from the majority class are removed indiscriminately.

Combination Approaches:

Consider hybrid approaches that combine oversampling and undersampling techniques to achieve a balanced representation.

Model Performance:

  • Experiment with both oversampling and undersampling and evaluate model performance using appropriate metrics (precision, recall, F1-score, etc.) on a validation set or through cross-validation.

It’s important to note that there is no one-size-fits-all solution, and the choice between oversampling and undersampling should be guided by a thorough understanding of the dataset and the specific challenges posed by class imbalance in your machine learning task. Experimentation and validation with different techniques are crucial for making an informed decision.
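For reference, here is a minimal sketch of the two alternatives discussed above using the imbalanced-learn library; neither is used in this project, which opts for plain random oversampling in the next snippet. The label column is dropped in these calls because SMOTE synthesizes samples from numeric feature values only (an assumption about the encoded features from Part 4):

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# synthetic oversampling of the minority class (numeric features only)
smote = SMOTE(random_state=42)
X_smote, Y_smote = smote.fit_resample(X_train.drop('loan_label', axis=1), Y_train)

# random undersampling of the majority class
undersample = RandomUnderSampler(random_state=42)
X_under, Y_under = undersample.fit_resample(X_train.drop('loan_label', axis=1), Y_train)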

from imblearn.over_sampling import RandomOverSampler

# duplicate randomly chosen minority-class rows until every class matches the majority class
oversample = RandomOverSampler(sampling_strategy='not majority')
X_train, Y_train = oversample.fit_resample(X_train, Y_train)


This oversampling approach addresses class imbalance by duplicating randomly chosen instances of the minority class, making the training set more balanced. RandomOverSampler is simple, easy to implement, and less sensitive to noisy data than synthetic approaches such as SMOTE. However, duplicating minority-class instances can encourage overfitting, so the impact on model performance should be carefully evaluated; this will be tested later.

We can see that the target variable is balanced after the oversampling and is ready to be processed in the next step.
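A quick way to confirm the balance numerically (a minimal check rather than part of the original notebook):

print(Y_train.value_counts())  # both classes should now have the same count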

Train Test and Evaluate

Before we train and test the models, we need to make sure that the independent features (X) in both the train set and the test set do not contain the target variable. This is done so that the model never sees the answer it is supposed to predict, and it serves a few important purposes:

  • Preventing Data Leakage
  • Simulating Real-world Scenarios
  • Conforming to Model Input Requirements
X_train = X_train.drop('loan_label', axis=1)
X_test = X_test.drop('loan_label', axis=1)


After we run the code above, we are going to train and test several different algorithms, described below (the code also includes a plain Decision Tree as a baseline), and choose the best one:

Random Forest:

  • Type: Ensemble Learning (Bagging)
  • Use Case: Classification and Regression
  • How It Works: Random Forest builds multiple decision trees during training and merges them together to get a more accurate and stable prediction. It introduces randomness by training each tree on a random subset of the data and using a random subset of features for each split.
  • Advantages: Robust to overfitting, handles a large number of features well, provides feature importance.
  • Limitations: May require more computational resources.

Gradient Boosting:

  • Type: Ensemble Learning (Boosting)
  • Use Case: Classification and Regression
  • How It Works: Gradient Boosting builds a series of weak learners (usually decision trees) sequentially; each tree corrects the errors of the previous one, and these weak learners are combined to form a strong predictive model.
  • Advantages: Often provides higher accuracy than Random Forest, handles imbalanced datasets well.
  • Limitations: Sensitive to outliers, may require more tuning.

Logistic Regression:

  • Type: Linear Model
  • Use Case: Binary Classification (can be extended to multi-class)
  • How It Works: Logistic Regression models the probability that an instance belongs to a particular class; it uses the logistic function to transform a linear combination of input features into a probability score.
  • Advantages: Simple and interpretable, efficient for linearly separable data.
  • Limitations: Assumes a linear relationship between features and log-odds, may not perform well on complex data.

k-Nearest Neighbors (kNN):

  • Type: Instance-Based (Lazy Learning)
  • Use Case: Classification and Regression
  • How It Works: kNN classifies or predicts based on the majority class or average value of the k-nearest neighbors in the feature space; a distance metric (usually Euclidean distance) is used to determine proximity.
  • Advantages: Simple and easy to understand, no training phase (lazy learning).
  • Limitations: Sensitive to irrelevant features, computationally expensive for large datasets.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, classification_report

results = {}
models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
}

classification_reports = {}
model_names = []
accuracies = []

for model_name, model in models.items():
    print(f"Training {model_name}...")
    model.fit(X_train, Y_train)

    print(f"Evaluating {model_name}...")
    Y_pred = model.predict(X_test)

    confusion = confusion_matrix(Y_test, Y_pred)
    classification_rep = classification_report(
        Y_test, Y_pred, target_names=['Good', 'Bad'], zero_division=1
    )

    classification_reports[model_name] = classification_rep

    accuracy = accuracy_score(Y_test, Y_pred)

    model_names.append(model_name)
    accuracies.append(accuracy)

    print("\nClassification Report:")
    print(classification_rep)
    print(f"{model_name} Accuracy: {accuracy:.4f}")
    print("=" * 50)


The output for each model can be seen below:

Precision, Recall, and F1 Score are metrics commonly used to evaluate the performance of classification models. They provide insights into the model’s ability to make correct predictions and handle different aspects of the confusion matrix.

  • Precision: Precision, also known as Positive Predictive Value, is a measure of the accuracy of the positive predictions made by the model. It is defined as the ratio of true positive predictions to the total number of positive predictions made by the model (sum of true positives and false positives). Precision is particularly relevant in situations where false positives are costly or have significant consequences.
  • Recall: Recall, also known as Sensitivity or True Positive Rate, measures the ability of the model to capture or recall all the relevant instances of the positive class. It is defined as the ratio of true positive predictions to the total number of actual positive instances in the dataset (sum of true positives and false negatives). Recall is important when the cost of missing positive instances (false negatives) is high.
  • F1 Score: The F1 Score is the harmonic mean of precision and recall. It provides a balanced measure that takes into account both false positives and false negatives. The F1 Score is especially useful when there is an uneven class distribution. The F1 Score ranges between 0 and 1, with 1 indicating perfect precision and recall.

These metrics are commonly used together, as they provide a comprehensive evaluation of the model’s performance, considering both the positive and negative classes. The choice of which metric to prioritize depends on the specific goals and requirements of the task. For example, in a medical diagnosis scenario, where false positives or false negatives may have different consequences, the choice between precision and recall may be crucial. The F1 Score is often used when a balance between precision and recall is desired.
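For a single model, these metrics can also be computed directly with scikit-learn; a small sketch using the fitted Random Forest kept in the models dict above (pos_label=1 is an assumption about how the 'Bad' class is encoded and should be adjusted to the actual encoding):

from sklearn.metrics import precision_score, recall_score, f1_score

# re-predict with one of the fitted models from the training loop
Y_pred_rf = models['Random Forest'].predict(X_test)

precision = precision_score(Y_test, Y_pred_rf, pos_label=1)
recall = recall_score(Y_test, Y_pred_rf, pos_label=1)
f1 = f1_score(Y_test, Y_pred_rf, pos_label=1)
print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")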

To make it simpler, we are going to generate a bar plot to visualize the accuracy of each model.

plt.figure(figsize=(10, 6))
bars = plt.bar(model_names, accuracies, color='skyblue')

for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval, f'{yval:.2f}', ha='center', va='bottom')

plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.title('Accuracy of Different Models')

plt.ylim(0.7, 1)

plt.xticks(rotation=45)
plt.tight_layout()

plt.axhline(0.9, color='black', linewidth=0.8)

plt.show()


The output after we run the code above can be seen below.

We can see that the Random Forest model gives the highest accuracy (0.92). As a rough guide for interpreting such scores:

  • F1 ≥ 0.9: excellent
  • 0.8 ≤ F1 < 0.9: very good
  • 0.7 ≤ F1 < 0.8: good
  • 0.6 ≤ F1 < 0.7: fair
  • F1 < 0.6: poor

Based on that, we will choose the Random Forest model for further evaluation.
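The same choice can be made programmatically from the lists built in the training loop; a small convenience sketch:

# pick the model with the highest test accuracy
best_index = accuracies.index(max(accuracies))
best_model_name = model_names[best_index]
best_model = models[best_model_name]
print(f"Best model: {best_model_name} (accuracy {accuracies[best_index]:.4f})")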

This concludes the fifth part; we will continue this project in the final part: Advanced Evaluation.


You can also visit my public GitHub repository for the project below:

GitHub repository

This article was first published at https://www.xennialtechguy.id/posts/credit-risk-modelling-part-5/
