Credit Default Prediction — Neural Network Approach
An illustrative modeling guide using TensorFlow
In our previous posts, we explored feature engineering and set up a solid baseline for credit default prediction with a LightGBM model. In this post, we turn to neural networks and share our firsthand experience of using them to predict credit defaults.
Data preparation
For this post, we reuse the features we engineered in our initial post and load them as follows.
import pandas as pd
import numpy as np
import pickle
import gc
DATA_PATH = './'  # directory in which the training data is stored
train = pd.read_parquet(DATA_PATH + 'train.parquet')  # name your training data train.parquet
target = pd.read_csv('/content/train_labels.csv').target.values  # load the training labels
print('shape of training data is:',train.shape)
display(train.head())
To speed up loading with pandas, we stored our data in the Parquet format, which also reduces its size on disk.
We use the same metric as in our previous post: the average of the normalized Gini coefficient (2*AUC-1) and the recall rate at a threshold of 4%. It is coded below:
from sklearn.metrics import roc_curve, roc_auc_score
# code ref: https://www.kaggle.com/code/inversion/amex-competition-metric-python
def evaluation_metric(y_true, y_pred, return_components=False) -> float:
    """evaluation metric for ndarrays"""
    def top_four_percent_captured(df) -> float:
        """Corresponds to the recall for a threshold of 4 %"""
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        four_pct_cutoff = int(0.04 * df['weight'].sum())
        df['weight_cumsum'] = df['weight'].cumsum()
        df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
        return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()
    def weighted_gini(df) -> float:
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
        df['lorentz'] = df['cum_pos_found'] / total_pos
        df['gini'] = (df['lorentz'] - df['random']) * df['weight']
        return df['gini'].sum()
    def normalized_weighted_gini(df) -> float:
        """Corresponds to 2 * AUC - 1"""
        df2 = pd.DataFrame({'target': df.target, 'prediction': df.target})
        df2.sort_values('prediction', ascending=False, inplace=True)
        return weighted_gini(df) / weighted_gini(df2)
    df = pd.DataFrame({'target': y_true.ravel(), 'prediction': y_pred.ravel()})
    df.sort_values('prediction', ascending=False, inplace=True)
    g = normalized_weighted_gini(df)
    d = top_four_percent_captured(df)
    if return_components: return g, d, 0.5 * (g + d)
    return 0.5 * (g + d)
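To make the 4% capture rate more concrete, here is a minimal pure-Python sketch of the same weighting logic on a toy dataset. The helper name `top_four_percent_captured_toy` and the toy labels and scores are assumptions made for illustration; they are not part of the original metric code.

```python
def top_four_percent_captured_toy(y_true, y_pred):
    """Share of all defaults found within the top 4% of weighted rows.

    Mirrors the weighting of the metric: non-defaults (label 0) count
    20 times, defaults (label 1) count once.
    """
    # Rank customers from highest to lowest predicted default probability.
    ranked = sorted(zip(y_pred, y_true), key=lambda pair: pair[0], reverse=True)
    weights = [20 if label == 0 else 1 for _, label in ranked]
    cutoff = int(0.04 * sum(weights))

    captured, running_weight = 0, 0
    for (_, label), weight in zip(ranked, weights):
        running_weight += weight
        if running_weight > cutoff:
            break  # this row and everything below it fall outside the top 4%
        captured += label
    return captured / sum(y_true)


# Toy data (hypothetical): 5 defaults and 20 non-defaults.
labels = [1, 1, 1, 0, 1, 1] + [0] * 19
scores = [0.99, 0.95, 0.90, 0.85, 0.80, 0.75] + [0.10] * 19
capture_rate = top_four_percent_captured_toy(labels, scores)
print(capture_rate)
```

In this toy case the total weight is 5 + 20 * 20 = 405, so the cutoff is 16. The three top-ranked defaults fit within that budget, the first non-default (weight 20) exceeds it, and 3 of the 5 defaults are captured.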
Model Architecture
The architecture of the model we used in training is shown below:
- The Input layer is created with the shape (n_inputs,), where n_inputs is the number of features in the input data.
- BatchNormalization normalizes the activations of the preceding layer during training. It can speed up convergence, lessen overfitting, and stabilize training.
- The Dense layers transform their input by applying a set of learned weights to the previous layer’s output, followed by an activation function.
- The Concatenate layer creates a skip connection: it joins the output of the third Dense layer with the output of the first Dense layer (after dropout) and feeds the result to the fourth Dense layer. The output of the first Dense layer therefore reaches the fourth Dense layer directly, without passing through the second and third Dense layers. This can help to preserve information from the earlier layers and improve the flow of gradients during backpropagation.
The neural network can be implemented with the code below.
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ReduceLROnPlateau, LearningRateScheduler, EarlyStopping
from tensorflow.keras.layers import Dense, Input, InputLayer, Add, Concatenate, Dropout, BatchNormalization
LR_START = 0.01  # initial learning rate
# `features` (the list of feature column names) must be defined before this point
def my_model(n_inputs=len(features)):
    """Neural network with a skip connection.
    Returns a compiled instance of tensorflow.keras.models.Model."""
    activation = 'swish'
    l1 = 1e-7
    l2 = 4e-4
    inputs = Input(shape=(n_inputs, ))
    x0 = BatchNormalization()(inputs)
    x0 = Dense(1024,
               kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1,l2=l2),
               activation=activation,
               )(x0)
    x0 = Dropout(0.1)(x0)
    x = Dense(64,
              kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1,l2=l2),
              activation=activation,
              )(x0)
    x = Dense(64,
              kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1,l2=l2),
              activation=activation,
              )(x)
    x = Concatenate()([x, x0])  # skip connection: third Dense output + first Dense output
    x = Dropout(0.1)(x)
    x = Dense(16,
              kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1,l2=l2),
              activation=activation,
              )(x)
    x = Dense(1,
              activation='sigmoid',
              )(x)
    model = Model(inputs, x)
    # Note: newer Keras releases may accept only one of clipvalue/clipnorm; keep one of them if compile fails
    model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=LR_START, clipvalue=0.5, clipnorm=1.0),
                  loss=tf.keras.losses.BinaryCrossentropy())
    return model
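The layer widths in the code above determine how wide the input to each layer is, and the skip connection changes that width for the fourth Dense layer. The following sketch counts the parameters per layer in plain Python. The function names and the feature count of 100 are assumptions made for illustration only; the widths are taken from `my_model`.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_summary(n_inputs):
    """Return (name, output_width, parameter_count) for each layer with parameters."""
    first_width, hidden_width = 1024, 64
    # Concatenate joins the third Dense output (64) with the first Dense output (1024).
    skip_width = hidden_width + first_width
    return [
        # BatchNormalization keeps gamma, beta, moving mean and moving variance per feature.
        ("batch_norm", n_inputs, 4 * n_inputs),
        ("dense_1", first_width, dense_params(n_inputs, first_width)),
        ("dense_2", hidden_width, dense_params(first_width, hidden_width)),
        ("dense_3", hidden_width, dense_params(hidden_width, hidden_width)),
        ("dense_4", 16, dense_params(skip_width, 16)),
        ("output", 1, dense_params(16, 1)),
    ]


# n_inputs=100 is a hypothetical feature count used only for illustration.
summary = layer_summary(n_inputs=100)
for name, width, params in summary:
    print(f"{name:10s} width={width:5d} params={params:7d}")
```

Because of the concatenation, the fourth Dense layer sees 64 + 1024 = 1088 inputs rather than 64, which is where most of its weights come from.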
- The Dense layers in the code have different numbers of neurons. The first Dense layer has 1024 neurons, the second and third Dense layers have 64 neurons each, and the fourth Dense layer has 16 neurons; all four use the swish activation function. The final single-neuron Dense layer uses a sigmoid activation to output the default probability. The swish function is formulated as: swish(x) = x * sigmoid(x) = x / (1 + e^(-x)).
- Dropout layers apply regularisation by randomly dropping a fraction of the activations of the previous layer during training. In this code, the fraction of activations that drop out is set to 0.1. Dropout reduces the co-adaptation of neurons in the same layer, which helps to prevent overfitting.
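As a rough illustration of the two building blocks just described, the sketch below implements swish and training-time dropout in plain Python. It is a simplified stand-in and not Keras's implementation; the function names, the input values and the seed are assumptions.

```python
import math
import random


def swish(x):
    """swish(x) = x * sigmoid(x) = x / (1 + e^(-x))."""
    return x / (1.0 + math.exp(-x))


def dropout(activations, rate=0.1, rng=None):
    """Training-time inverted dropout.

    Each value is zeroed with probability `rate`; the survivors are scaled
    by 1 / (1 - rate) so the expected sum of the layer stays the same.
    """
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]


pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0]
activated = [swish(x) for x in pre_activations]
dropped = dropout(activated, rate=0.1, rng=random.Random(1))
print(activated)
print(dropped)
```

Unlike ReLU, swish lets small negative values through, which is visible in the first two outputs. At inference time dropout is switched off and activations pass through unchanged.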
We define some parameters we will need for our model training:
from matplotlib import pyplot as plt
import random
import datetime
import math
from matplotlib.ticker import MaxNLocator
from colorama import Fore, Back, Style
# Cross-validation of the classifier
ONLY_FIRST_FOLD = True
EPOCHS_EXPONENTIALDECAY = 100
VERBOSE = 0 # set to 0 for less output, or to 2 for more output
LR_END = 1e-5 # minimum learning rate possible
CYCLES = 1 #define how many cycles are there in our learning rate
EPOCHS = 200 #total epochs
DIAGRAMS = True
USE_PLATEAU = False # set to True for early stopping, or to False for exponential learning rate decay
BATCH_SIZE = 2048
# Setting seeds for results reproducibility
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)
Model Training
We define a function that trains a model given training and validation data, saves the model weights, and plots the corresponding learning curve.
from sklearn.model_selection import StratifiedKFold, StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler, QuantileTransformer, OneHotEncoder
''' function for scaling the data, training the model and validating it
code ref: https://www.kaggle.com/code/ambrosm/amex-keras-quickstart-1-training
'''
def fit_model(X_tr, y_tr, X_va=None, y_va=None, fold=0, run=0):
    global y_va_pred
    gc.collect()
    start_time = datetime.datetime.now()
    scaler = StandardScaler()  # fitted on the training data only
    X_tr = scaler.fit_transform(X_tr)
    if X_va is not None:
        X_va = scaler.transform(X_va)  # scale the validation data with the training statistics
        validation_data = (X_va, y_va)
    else:
        validation_data = None
    # Define the learning rate schedule and EarlyStopping
    if USE_PLATEAU and X_va is not None:  # use early stopping
        epochs = EPOCHS
        lr = ReduceLROnPlateau(monitor="val_loss", factor=0.7,  # scheduler
                               patience=4, verbose=VERBOSE)
        es = EarlyStopping(monitor="val_loss",  # stop training if val_loss does not improve for 12 epochs
                           patience=12,
                           verbose=1,
                           mode="min",
                           restore_best_weights=True)
        callbacks = [lr, es, tf.keras.callbacks.TerminateOnNaN()]
    else:  # use exponential learning rate decay rather than early stopping
        epochs = EPOCHS_EXPONENTIALDECAY
        def exponential_decay(epoch):
            # v decays from e^a to 1 in every cycle
            # w decays from 1 to 0 in every cycle
            # epoch == 0 -> w = 1 (first epoch of cycle)
            # epoch == epochs_per_cycle-1 -> w = 0 (last epoch of cycle)
            # higher a -> decay starts with a steeper decline
            # ref:
            a = 3
            epochs_per_cycle = epochs // CYCLES
            epoch_in_cycle = epoch % epochs_per_cycle
            if epochs_per_cycle > 1:
                v = math.exp(a * (1 - epoch_in_cycle / (epochs_per_cycle-1)))
                w = (v - 1) / (math.exp(a) - 1)
            else:
                w = 1
            return w * LR_START + (1 - w) * LR_END
        lr = LearningRateScheduler(exponential_decay, verbose=0)
        callbacks = [lr, tf.keras.callbacks.TerminateOnNaN()]
    # Construct and compile the model
    model = my_model(X_tr.shape[1])
    # Train the model
    history = model.fit(X_tr, y_tr,
                        validation_data=validation_data,
                        epochs=epochs,
                        verbose=VERBOSE,
                        batch_size=BATCH_SIZE,
                        shuffle=True,
                        callbacks=callbacks)
    del X_tr, y_tr
    with open(f"scaler_{fold}.pickle", 'wb') as f: pickle.dump(scaler, f)  # save the fitted scaler for real-time inference
    model.save(f"model_{fold}")  # save the trained model
    history_list.append(history.history)
    callbacks, es, lr, history = None, None, None, None
    if X_va is not None:  # validation metrics are only available with validation data
        lastloss = f"Training loss: {history_list[-1]['loss'][-1]:.4f} | Val loss: {history_list[-1]['val_loss'][-1]:.4f}"
        # Inference for validation
        y_va_pred = model.predict(X_va, batch_size=len(X_va), verbose=0).ravel()
        # Evaluation: Execution time, loss and metrics
        score = evaluation_metric(y_va, y_va_pred)
        print(f"{Fore.GREEN}{Style.BRIGHT}Fold {run}.{fold} | {str(datetime.datetime.now() - start_time)[-12:-7]}"
              f" | {len(history_list[-1]['loss']):3} ep"
              f" | {lastloss} | Score: {score:.5f}{Style.RESET_ALL}")
        score_list.append(score)
        if DIAGRAMS and fold == 0 and run == 0:
            # Plot training history (plot_history is a plotting helper that must be defined beforehand)
            plot_history(history_list[-1],
                         title=f"Learning curve",
                         plot_lr=True)
    # Scale and predict (`test` is the test-set feature frame, loaded beforehand)
    y_pred_list.append(model.predict(scaler.transform(test), batch_size=128*1024, verbose=0).ravel())
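The scaling step in `fit_model` learns its statistics from the training fold only and reuses them for validation and test data. The sketch below shows that idea on a single feature with the standard library. It is a simplified stand-in for `StandardScaler`, and the helper names and sample values are assumptions.

```python
import statistics


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    if std == 0:
        std = 1.0  # leave constant features unscaled instead of dividing by zero
    return mean, std


def transform(column, mean, std):
    """Standardise values with previously learned statistics."""
    return [(value - mean) / std for value in column]


# Hypothetical values of one feature in the training and validation folds.
train_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
valid_feature = [3.0, 6.0]

mean, std = fit_scaler(train_feature)                # statistics come from training data only
train_scaled = transform(train_feature, mean, std)
valid_scaled = transform(valid_feature, mean, std)   # reused for the validation fold
print(train_scaled)
print(valid_scaled)
```

Fitting the scaler on the training fold alone keeps information about the validation fold out of training, which is also why the fitted scaler is pickled alongside the model.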
The exponential_decay scheduler defined in the code lowers the learning rate from LR_START to LR_END over each cycle, dropping steeply at first and flattening toward the end. In the figure below, the green line shows this exponential decay of the learning rate.
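To inspect the schedule without training a model, the same formula can be evaluated on its own. The sketch below copies the constants used in this post; the name `N_EPOCHS` and the epochs chosen for printing are assumptions.

```python
import math

LR_START = 0.01   # initial learning rate, as in the post
LR_END = 1e-5     # final learning rate, as in the post
N_EPOCHS = 100    # same value as EPOCHS_EXPONENTIALDECAY
CYCLES = 1


def exponential_decay(epoch, a=3):
    """Learning rate for a given epoch; a larger `a` gives a steeper initial drop."""
    epochs_per_cycle = N_EPOCHS // CYCLES
    epoch_in_cycle = epoch % epochs_per_cycle
    if epochs_per_cycle > 1:
        v = math.exp(a * (1 - epoch_in_cycle / (epochs_per_cycle - 1)))
        w = (v - 1) / (math.exp(a) - 1)  # goes from 1 to 0 within a cycle
    else:
        w = 1
    return w * LR_START + (1 - w) * LR_END


schedule = [exponential_decay(epoch) for epoch in range(N_EPOCHS)]
for epoch in (0, 25, 50, 75, 99):
    print(f"epoch {epoch:3d}: lr = {schedule[epoch]:.6f}")
```

The first epoch uses exactly LR_START and the last epoch of the cycle uses exactly LR_END, with a monotonic decrease in between.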
Choosing the right validation strategy is crucial. To handle the class imbalance in our credit default prediction problem, we use stratified 10-fold cross-validation, as mentioned in our previous post. The data is split into 10 folds of roughly equal size, and stratification keeps the share of defaults in each fold close to that of the full dataset. In each round, 90% of the data is used for training and the remaining 10% is reserved for validation.
history_list = []
score_list = []
y_pred_list = []
kf = StratifiedKFold(n_splits=10)  # stratified k-fold
for fold, (idx_tr, idx_va) in enumerate(kf.split(train, target)):
    y_va = target[idx_va]
    tf.keras.backend.clear_session()
    gc.collect()  # free memory
    fit_model(train.iloc[idx_tr][features], target[idx_tr],
              train.iloc[idx_va][features], y_va, fold=fold)
    if ONLY_FIRST_FOLD: break  # we only need the first fold
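What stratification buys us can be seen on a toy example. The sketch below assigns samples to folds with a simple round-robin per class. This is a simplified illustration and not the allocation logic scikit-learn's `StratifiedKFold` uses; the function name and the toy labels are assumptions.

```python
def stratified_fold_ids(labels, n_splits=10):
    """Assign each sample a fold id so every fold gets a near-equal share of each class."""
    fold_ids = [None] * len(labels)
    next_fold = {}  # class label -> fold that receives the next sample of that class
    for index, label in enumerate(labels):
        fold = next_fold.get(label, 0)
        fold_ids[index] = fold
        next_fold[label] = (fold + 1) % n_splits
    return fold_ids


# Hypothetical imbalanced labels: 50 defaults (1) and 150 non-defaults (0).
labels = [1] * 50 + [0] * 150
fold_ids = stratified_fold_ids(labels, n_splits=10)

for fold in range(10):
    fold_labels = [y for y, f in zip(labels, fold_ids) if f == fold]
    default_rate = sum(fold_labels) / len(fold_labels)
    print(f"fold {fold}: size={len(fold_labels)}, default rate={default_rate:.2f}")
```

Every fold ends up with the same default rate as the full toy dataset, so a validation score computed on one fold is not distorted by an unusually high or low share of defaults.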
Shown here are the results and the training plot for our first fold: the model reaches a score of 0.78543 on the metric defined in our first post.
Now, let's look at what this metric score tells us.
Understanding the Metric Score
The metric we’re using is a combination of the Gini coefficient and a 4% recall rate, calculated as an average value.
- The normalized Gini coefficient is simply a scaled AUC: it is equal to 2*AUC-1 and is always between -1 and 1. The larger the light red area, the better the score. In our case, we receive a score of G=0.92012.
- The recall rate captured at a threshold of 4% corresponds to the y coordinate of the intersection between the green line and the red ROC curve (marked with a green dot) and is always between 0 and 1. The higher the intersection point, the better the default detection capability of the model. We receive a score of R=0.65073.
- We average the two scores: evaluation_metric=(R+G)/2=0.78543, which is the score of our first fold.
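The relationship between these numbers can be checked with a few lines of arithmetic. The sketch below only uses the two rounded component scores reported above; the variable names are assumptions.

```python
G = 0.92012  # normalized Gini reported for the first fold
R = 0.65073  # recall at the 4% threshold reported for the first fold

implied_auc = (G + 1) / 2  # invert G = 2*AUC - 1
score = (G + R) / 2        # evaluation_metric

print(f"implied AUC = {implied_auc:.5f}")
print(f"score       = {score:.6f}")
```

A normalized Gini of 0.92012 corresponds to an AUC of about 0.96006. The average of the two rounded components is about 0.785425, which agrees with the reported 0.78543 up to rounding of the components.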
Conclusion
In this post, we discussed how neural networks can be used for credit default detection. The proposed model reaches a score of 0.78543 on a single fold. The results can be improved further by:
- Tuning model parameters: adding or changing the units in our neural network.
- Adjusting the learning rate.
- Adding more features we discussed in our previous blog.
- Training on all 10 folds and averaging the predictions.
As we continue to explore the potential of machine learning, we encourage you to stay tuned for more exciting developments in this field. Thank you for reading, and we hope this article has inspired you to leverage Neural Networks in your own credit default detection tasks.