Machine learning

Credit Default Prediction — Neural Network Approach

An illustrative modeling guide using TensorFlow

Priyanshu Chaudhary
CueNex


Artificial Neural Network structural diagram

In our previous posts, we explored machine learning and feature engineering and set up a solid baseline for credit default prediction using a LightGBM model. Today, we dive deeper into neural networks, a powerful form of artificial intelligence that is transforming how financial institutions detect and manage credit defaults. In this post, we share our firsthand experience of how neural networks can be harnessed to predict credit defaults.

Data preparation

For this post, we reuse the features we engineered in our first post, loading the data we prepared there and putting it to work.

import pandas as pd
import numpy as np
import pickle
import gc

DATA_PATH = './'  # path to the directory in which the training data is stored

train = pd.read_parquet(DATA_PATH + 'train.parquet')  # name your training data train.parquet
target = pd.read_csv('/content/train_labels.csv').target.values  # load the training labels


print('shape of training data is:',train.shape)
display(train.head())
Sample of the training data

To speed up loading with pandas and reduce the size of the data on disk, we store it in the Parquet format.
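For example, if the engineered features were sitting in a CSV file, converting them to Parquet is a one-liner with pandas (the file names here are hypothetical):

import pandas as pd

features_df = pd.read_csv('engineered_features.csv')  # hypothetical source file
features_df.to_parquet('train.parquet', index=False)   # columnar format: smaller on disk, faster to load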

We use the same metric we defined in our previous blog, a combination of the normalized Gini coefficient (a scaled AUC) and the recall rate at a 4% threshold, coded below:

from sklearn.metrics import roc_curve, roc_auc_score

# code ref: https://www.kaggle.com/code/inversion/amex-competition-metric-python
def evaluation_metric(y_true, y_pred, return_components=False) -> float:
    """Evaluation metric for ndarrays."""

    def top_four_percent_captured(df) -> float:
        """Corresponds to the recall for a threshold of 4 %."""
        df['weight'] = df['target'].apply(lambda x: 20 if x == 0 else 1)
        four_pct_cutoff = int(0.04 * df['weight'].sum())
        df['weight_cumsum'] = df['weight'].cumsum()
        df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
        return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()

    def weighted_gini(df) -> float:
        df['weight'] = df['target'].apply(lambda x: 20 if x == 0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
        df['lorentz'] = df['cum_pos_found'] / total_pos
        df['gini'] = (df['lorentz'] - df['random']) * df['weight']
        return df['gini'].sum()

    def normalized_weighted_gini(df) -> float:
        """Corresponds to 2 * AUC - 1."""
        df2 = pd.DataFrame({'target': df.target, 'prediction': df.target})  # perfect ranking for normalization
        df2.sort_values('prediction', ascending=False, inplace=True)
        return weighted_gini(df) / weighted_gini(df2)

    df = pd.DataFrame({'target': y_true.ravel(), 'prediction': y_pred.ravel()})
    df.sort_values('prediction', ascending=False, inplace=True)
    g = normalized_weighted_gini(df)
    d = top_four_percent_captured(df)

    if return_components:
        return g, d, 0.5 * (g + d)
    return 0.5 * (g + d)
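As a quick sanity check, the metric can be called directly on small NumPy arrays (the numbers below are purely illustrative):

import numpy as np

# Toy example only: 8 customers, 2 of them defaulters
y_true_toy = np.array([0, 0, 1, 0, 0, 1, 0, 0])
y_pred_toy = np.array([0.1, 0.2, 0.9, 0.3, 0.2, 0.7, 0.1, 0.4])

print(evaluation_metric(y_true_toy, y_pred_toy))                          # combined score
print(evaluation_metric(y_true_toy, y_pred_toy, return_components=True))  # (gini, recall@4%, score)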

Model Architecture

The architecture of the model we used in training is shown below:

Neural Network architecture for credit default detection
  • The Input layer is created with the shape (n_inputs,), where n_inputs is the number of features in the input data.
  • The BatchNormalization layer normalizes the activations of the previous layer during training. It can speed up convergence, reduce overfitting, and stabilize training. Reference: https://arxiv.org/pdf/1502.03167.pdf
  • The Dense layers are used in the given code to transform the input data by applying a set of learned weights to the previous layer’s inputs, followed by an activation function.
  • The Concatenate layer creates a skip connection: it joins the output of the first (1024-unit) Dense layer with the output of the second 64-unit Dense layer before the 16-unit Dense layer. The skip connection lets the output of the first Dense layer bypass the two intermediate Dense layers, which helps preserve information from earlier layers and improves the flow of gradients during backpropagation.

The neural network can be implemented using the below code.

import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ReduceLROnPlateau, LearningRateScheduler, EarlyStopping
from tensorflow.keras.layers import Dense, Input, InputLayer, Add, Concatenate, Dropout, BatchNormalization

LR_START = 0.01  # initial learning rate

def my_model(n_inputs=len(features)):  # 'features' is the list of feature columns from our previous post
    """Neural network with a skip connection.

    Returns a compiled instance of tensorflow.keras.models.Model.
    """
    activation = 'swish'
    l1 = 1e-7
    l2 = 4e-4
    inputs = Input(shape=(n_inputs,))
    x0 = BatchNormalization()(inputs)
    x0 = Dense(1024,
               kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1, l2=l2),
               activation=activation,
               )(x0)
    x0 = Dropout(0.1)(x0)
    x = Dense(64,
              kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1, l2=l2),
              activation=activation,
              )(x0)
    x = Dense(64,
              kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1, l2=l2),
              activation=activation,
              )(x)
    x = Concatenate()([x, x0])  # skip connection from the first Dense layer
    x = Dropout(0.1)(x)
    x = Dense(16,
              kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1, l2=l2),
              activation=activation,
              )(x)
    x = Dense(1, activation='sigmoid')(x)
    model = Model(inputs, x)
    model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=LR_START,
                                                      clipvalue=0.5,
                                                      clipnorm=1.0),
                  loss=tf.keras.losses.BinaryCrossentropy())
    return model
  • The Dense layers have different numbers of neurons: the first Dense layer has 1024 neurons, the second and third have 64 neurons each, and the fourth has 16 neurons. All of them use the swish activation function, formulated as swish(x) = x · sigmoid(x) (a quick numerical check follows this list).
  • Dropout layers apply regularization by randomly dropping a fraction of the previous layer's activations during training; here the dropout rate is set to 0.1. Dropout reduces the co-adaptation of neurons in the same layer, which helps prevent overfitting.
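For reference, a quick numerical check that Keras' built-in 'swish' activation matches the formula x · sigmoid(x) (the input values are illustrative only):

import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])
manual = x * tf.sigmoid(x)               # swish(x) = x * sigmoid(x)
builtin = tf.keras.activations.swish(x)  # what activation='swish' resolves to
print(manual.numpy())
print(builtin.numpy())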

We define some parameters we will need for model training:

from matplotlib import pyplot as plt
import random
import datetime
import math
from matplotlib.ticker import MaxNLocator
from colorama import Fore, Back, Style

# Cross-validation settings
ONLY_FIRST_FOLD = True         # train only the first fold (quick experiment)
EPOCHS_EXPONENTIALDECAY = 100  # epochs when using exponential learning rate decay
VERBOSE = 0                    # set to 0 for less output, or to 2 for more output
LR_END = 1e-5                  # minimum possible learning rate
CYCLES = 1                     # number of cycles in the learning rate schedule
EPOCHS = 200                   # total epochs when using early stopping
DIAGRAMS = True
USE_PLATEAU = False            # True: early stopping; False: exponential learning rate decay
BATCH_SIZE = 2048

# Setting seeds for results reproducibility
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)

Model Training

We define a function that trains a model given training and validation data, saves the model weights, and plots the corresponding learning curve.

from sklearn.model_selection import StratifiedKFold, StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler, QuantileTransformer, OneHotEncoder

'''
Function for scaling the data, training the model, and validating it.
code ref: https://www.kaggle.com/code/ambrosm/amex-keras-quickstart-1-training
'''
def fit_model(X_tr, y_tr, X_va=None, y_va=None, fold=0, run=0):

    global y_va_pred
    gc.collect()
    start_time = datetime.datetime.now()

    scaler = StandardScaler()          # scales the training data
    X_tr = scaler.fit_transform(X_tr)

    if X_va is not None:
        X_va = scaler.transform(X_va)  # scales the validation data
        validation_data = (X_va, y_va)
    else:
        validation_data = None

    # Define the learning rate schedule and EarlyStopping
    if USE_PLATEAU and X_va is not None:  # use early stopping
        epochs = EPOCHS
        lr = ReduceLROnPlateau(monitor="val_loss", factor=0.7,  # scheduler
                               patience=4, verbose=VERBOSE)
        es = EarlyStopping(monitor="val_loss",  # stop training if the result does not improve for 12 epochs
                           patience=12,
                           verbose=1,
                           mode="min",
                           restore_best_weights=True)
        callbacks = [lr, es, tf.keras.callbacks.TerminateOnNaN()]

    else:  # use exponential learning rate decay rather than early stopping
        epochs = EPOCHS_EXPONENTIALDECAY

        def exponential_decay(epoch):
            # v decays from e^a to 1 in every cycle
            # w decays from 1 to 0 in every cycle
            # epoch == 0                  -> w = 1 (first epoch of cycle)
            # epoch == epochs_per_cycle-1 -> w = 0 (last epoch of cycle)
            # higher a -> decay starts with a steeper decline
            a = 3
            epochs_per_cycle = epochs // CYCLES
            epoch_in_cycle = epoch % epochs_per_cycle
            if epochs_per_cycle > 1:
                v = math.exp(a * (1 - epoch_in_cycle / (epochs_per_cycle - 1)))
                w = (v - 1) / (math.exp(a) - 1)
            else:
                w = 1
            return w * LR_START + (1 - w) * LR_END

        lr = LearningRateScheduler(exponential_decay, verbose=0)
        callbacks = [lr, tf.keras.callbacks.TerminateOnNaN()]

    # Construct, compile and train the model
    model = my_model(X_tr.shape[1])
    history = model.fit(X_tr, y_tr,
                        validation_data=validation_data,
                        epochs=epochs,
                        verbose=VERBOSE,
                        batch_size=BATCH_SIZE,
                        shuffle=True,
                        callbacks=callbacks)
    del X_tr, y_tr
    with open(f"scaler_{fold}.pickle", 'wb') as f:
        pickle.dump(scaler, f)   # save the fitted scaler for reuse at inference time
    model.save(f"model_{fold}")  # save the model weights
    history_list.append(history.history)
    callbacks, es, lr, history = None, None, None, None

    lastloss = (f"Training loss: {history_list[-1]['loss'][-1]:.4f} | "
                f"Val loss: {history_list[-1]['val_loss'][-1]:.4f}")

    # Inference for validation
    y_va_pred = model.predict(X_va, batch_size=len(X_va), verbose=0).ravel()

    # Evaluation: execution time, loss and metric
    score = evaluation_metric(y_va, y_va_pred)
    print(f"{Fore.GREEN}{Style.BRIGHT}Fold {run}.{fold} | {str(datetime.datetime.now() - start_time)[-12:-7]}"
          f" | {len(history_list[-1]['loss']):3} ep"
          f" | {lastloss} | Score: {score:.5f}{Style.RESET_ALL}")
    score_list.append(score)

    if DIAGRAMS and fold == 0 and run == 0:
        # Plot the training history (plot_history comes from the referenced Kaggle notebook)
        plot_history(history_list[-1],
                     title="Learning curve",
                     plot_lr=True)

    # Scale the test data and predict ('test' is the test dataframe, loaded the same way as 'train')
    y_pred_list.append(model.predict(scaler.transform(test), batch_size=128*1024, verbose=0).ravel())

The exponential decay scheduler defined in the code reduces the model's learning rate exponentially over the course of training; in the learning-curve plot, the green line traces this decay.
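To see what the scheduler produces, the decay function can be evaluated on its own. A standalone sketch (reusing the LR_START, LR_END, CYCLES and EPOCHS_EXPONENTIALDECAY constants defined above) that plots the learning rate per epoch:

import math
import matplotlib.pyplot as plt

def exponential_decay_standalone(epoch, epochs=EPOCHS_EXPONENTIALDECAY, cycles=CYCLES,
                                 lr_start=LR_START, lr_end=LR_END, a=3):
    # Same formula as in fit_model: w decays from 1 to 0 within each cycle
    epochs_per_cycle = epochs // cycles
    epoch_in_cycle = epoch % epochs_per_cycle
    if epochs_per_cycle > 1:
        v = math.exp(a * (1 - epoch_in_cycle / (epochs_per_cycle - 1)))
        w = (v - 1) / (math.exp(a) - 1)
    else:
        w = 1
    return w * lr_start + (1 - w) * lr_end

lrs = [exponential_decay_standalone(e) for e in range(EPOCHS_EXPONENTIALDECAY)]
plt.plot(lrs)
plt.xlabel('epoch')
plt.ylabel('learning rate')
plt.show()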

When it comes to validating a machine learning model, choosing the right strategy is crucial. To tackle the imbalanced data in our credit default prediction problem, we use the stratified 10-fold cross-validation strategy mentioned in our previous blog. The data is divided into 10 equally sized folds, each preserving the overall class distribution, so the model is validated on a representative range of samples. In every split, 90% of the data is used for training and the remaining 10% for validation.

history_list = []
score_list = []
y_pred_list = []

kf = StratifiedKFold(n_splits=10)  # stratified k-fold
for fold, (idx_tr, idx_va) in enumerate(kf.split(train, target)):
    y_va = target[idx_va]
    tf.keras.backend.clear_session()
    gc.collect()  # free memory
    fit_model(train.iloc[idx_tr][features], target[idx_tr],
              train.iloc[idx_va][features], y_va, fold=fold)
    if ONLY_FIRST_FOLD: break  # we only train the first fold

Here we show the results of the first fold together with its training plot: we obtain a score of 0.78543 on the metric defined in our first blog.

Now, let's understand what this metric score suggests.

Understanding the Metric Score

The metric we are using is the average of the normalized Gini coefficient and the recall rate captured at a 4% threshold.

Reference: https://www.kaggle.com/code/ambrosm/amex-keras-quickstart-1-training
Evaluation metric illustrated on the ROC curve
  • The normalized Gini coefficient is simply a scaled AUC: it equals 2·AUC − 1 and always lies between −1 and 1. The larger the light red area under the curve, the better the score. In our case we obtain G = 0.92012.
  • The recall rate captured at a threshold of 4% corresponds to the y-coordinate of the intersection between the green line and the red ROC curve (marked with a green dot) and always lies between 0 and 1. The higher the intersection point, the better the model's default-detection capability. We obtain R = 0.65073.
  • We average the two components: evaluation_metric = (G + R) / 2 = 0.78543, which is the score of our first fold.
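As a quick check, plugging the two components into the formula reproduces the fold score:

G = 0.92012  # normalized Gini on the first validation fold
R = 0.65073  # recall rate captured at the 4 % threshold
print(0.5 * (G + R))  # 0.785425, reported as 0.78543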

Conclusion

In this blog, we discussed how to leverage neural networks for credit default detection. The proposed model achieves a score of 0.78543 on a single fold. The results can be improved further by:

  • Tuning model parameters: adding or changing the units in our neural network.
  • Adjusting the learning rate.
  • Adding more features we discussed in our previous blog.
  • Training on all 10 folds and averaging the per-fold predictions (a minimal sketch follows this list).
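As an illustration of the last point, if all ten folds are trained (by setting ONLY_FIRST_FOLD = False), the per-fold test predictions collected in y_pred_list can be averaged into a single prediction. A minimal sketch, assuming fit_model has appended one prediction array per fold:

import numpy as np

# Average the fold-level test predictions into one ensemble prediction
ensemble_pred = np.mean(np.stack(y_pred_list), axis=0)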

As we continue to explore the potential of machine learning, we encourage you to stay tuned for more exciting developments in this field. Thank you for reading, and we hope this article has inspired you to leverage Neural Networks in your own credit default detection tasks.
