MACHINE LEARNING

GRU for Credit Default Prediction

Discover the Benefits of Recurrent Neural Networks for Credit Risk Assessment

Priyanshu Chaudhary
CueNex


Gated Recurrent Unit. Author

In our previous blogs, we looked at how to process raw data containing customers’ credit histories, create mean, last, and diff features, and implement models such as LGBM and neural networks for default prediction based on those features. We also looked at some feature-engineering techniques that greatly improve model performance.

In this blog, we will look at:

  1. Transforming tabular data to sequential data.
  2. How to train a GRU model using this dataset for a binary classification problem.

The main benefit of using different models is that it diversifies our prediction pipeline, which plays a key role in performance for fraud-detection-style problems (in our case, credit default prediction). We also want to leverage the sequential aspect of our data more efficiently. Recurrent neural networks such as GRUs and LSTMs are built to be trained on sequential (time-series) data and can therefore learn features automatically that are relevant to our pipeline.

For data pre-processing, we will use the RAPIDS cuDF and CuPy libraries. NVIDIA’s RAPIDS is an open-source suite of libraries that provides a range of GPU-accelerated data-processing and machine-learning capabilities. RAPIDS libraries are built on top of CUDA, NVIDIA’s parallel computing platform, and offer a faster, more efficient approach to working with large-scale datasets.
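For instance, reading and aggregating the statement data with cuDF looks almost identical to pandas (a minimal sketch; it only assumes the train.csv file and customer_ID column used later in this post):

import cudf  # GPU-accelerated DataFrame library from RAPIDS

# Load the statements directly into GPU memory and count statements per
# customer; the syntax mirrors pandas, only the import changes.
train = cudf.read_csv('./train.csv')
statement_counts = train.groupby('customer_ID').customer_ID.agg('count')
print(statement_counts.head())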

Tabular to Sequential Conversion

Some key points to be noted before conversion:

  1. Our dataset has close to 500k customers.
  2. Each customer has 1–12 credit card statements generated.
  3. Why only 1–12 statements? Because a customer may have joined later and may not have generated a credit card statement every month, so there can be fewer than 12 statements for a customer.
  4. We have a total of 147 features per statement, i.e., 147 features for every statement of every customer.

The tabular data has shape [S, F], where S is the total number of statements across the ~500k customers and F is the number of features (147 in our case).

Sequential models like GRUs need data in a 3D format, hence we need to convert the provided tabular representation into the sequential format depicted below.

Conversion from Tabular to Sequential data. Author

Note:

Since there can be a variable number of bank statements per customer, it is important to make the sequences equal in length by padding them up to the maximum (12 in our case). This can be done with the code below.

# Reference: https://www.kaggle.com/code/cdeotte/tensorflow-gru-starter-0-790
import cupy, cudf                 # GPU LIBRARIES
import numpy as np, pandas as pd  # CPU LIBRARIES

train = cudf.read_csv('./train.csv')
targets = cudf.read_csv('./targets.csv')
CATS = ['B_1', 'B_3', 'B_11', 'D_61', 'P_17', 'P_10', 'P_6', 'P_8', 'P_11']  # categorical columns

# COUNT STATEMENTS PER CUSTOMER
tmp = train[['customer_ID']].groupby('customer_ID').customer_ID.agg('count')

# BUILD THE LIST OF CUSTOMER IDs THAT NEED PADDING ROWS
more = cupy.array([], dtype='int64')
for j in range(1, 12):
    i = tmp.loc[tmp == j].index.values                        # customers with exactly j statements
    more = cupy.concatenate([more, cupy.repeat(i, 12 - j)])   # each needs 12-j padding rows

# CREATE THE PADDING ROWS
df = train.iloc[:len(more)].copy().fillna(0)    # template rows, null values filled
df = df * 0 - 1                                 # pad numerical columns with -1 (df*0 zeroes every value)
df[CATS] = (df[CATS] * 0).astype('int8')        # pad categorical columns with 0
df['customer_ID'] = more
train = cudf.concat([train, df], axis=0, ignore_index=True)   # concat original and padding rows

# ADD TARGETS (and reduce to 1 byte)
train = train.merge(targets, on='customer_ID', how='left')
train.target = train.target.astype('int8')

# FILL NAN
train = train.fillna(-0.5)   # this applies to the numerical columns

# SORT BY CUSTOMER THEN DATE
train = train.sort_values(['customer_ID', 'year', 'month', 'day']).reset_index(drop=True)
train = train.drop(['year', 'month', 'day'], axis=1)

# REARRANGE COLUMNS WITH THE CATEGORICAL COLUMNS FIRST
COLS = list(train.columns[1:])
COLS = ['customer_ID'] + CATS + [c for c in COLS if c not in CATS]
train = train[COLS]

# RESHAPE INTO SEQUENTIAL DATA [customer, statement, features]
train = train.iloc[:, 1:-1].values.reshape((-1, 12, 147))

Note: we must pad the numerical and categorical columns with values that are not present in those features, otherwise we introduce noise into our dataset.
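A quick sanity check for this (a minimal sketch, meant to be run on the raw table before the padding rows are appended; NUM_COLS is a hypothetical helper for the list of numerical feature columns) is to count how often the padding values already occur in the real data:

# Hypothetical check, run before the padding step: the numerical padding
# value (-1) and the NaN fill value (-0.5) should not occur in the real data.
NUM_COLS = [c for c in train.columns if c not in CATS + ['customer_ID']]
for pad_value in (-1, -0.5):
    collisions = int((train[NUM_COLS] == pad_value).sum().sum())
    print(f'cells already equal to {pad_value}:', collisions)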

Model

Here we define a simple GRU model (shown below) that:

  1. Takes input with shape [Batch_size, Statements, Features] ([Batch_size, 12, 147] in our case).
  2. Converts the categorical features (placed first along the feature axis) into embeddings.
  3. Uses a GRU layer with 64 units that outputs a [Batch_size, 64] vector. The GRU layer accounts for the sequential aspect of the data.
  4. Cascades two dense layers: the first takes the output from the GRU and passes it to the second; their outputs have shapes [Batch_size, 64] and [Batch_size, 32] respectively.
  5. Has a final dense layer that outputs the probability of customer default.
import tensorflow as tf

def gru_model():
    # INPUT: [batch_size, statements, features]
    inp = tf.keras.Input(shape=(12, 147))

    # EMBED THE CATEGORICAL FEATURES (they occupy the first len(CATS) positions)
    embeds = []
    for k in range(len(CATS)):
        emb = tf.keras.layers.Embedding(11, 5)   # embedding of size 5 per categorical feature
        embeds.append(emb(inp[:, :, k]))
    # the remaining numerical features are passed through as they are
    x = tf.keras.layers.Concatenate()([inp[:, :, len(CATS):]] + embeds)

    # GRU layer outputs [batch_size, 64]
    x = tf.keras.layers.GRU(units=64, return_sequences=False)(x)

    # Dense layer outputs [batch_size, 64]
    x = tf.keras.layers.Dense(64, activation='relu')(x)
    # Dense layer outputs [batch_size, 32]
    x = tf.keras.layers.Dense(32, activation='relu')(x)

    # OUTPUT: probability of default
    x = tf.keras.layers.Dense(1, activation='sigmoid')(x)

    # COMPILE MODEL
    model = tf.keras.Model(inputs=inp, outputs=x)
    opt = tf.keras.optimizers.Adam(learning_rate=0.004)
    loss = tf.keras.losses.BinaryCrossentropy()
    model.compile(loss=loss, optimizer=opt)
    return model


# Define a learning rate scheduler that reduces the learning rate on plateaus
LR = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.1,
    patience=1,
    verbose=0,
    mode="auto",
    min_delta=0.001,
    cooldown=0,
    min_lr=0,
)

We will be using the same metric as was used in our previous blogs.

# Reference: https://www.kaggle.com/competitions/amex-default-prediction/discussion/327534
def default_metric(y_true, y_pred):

    # default capture rate within the top 4% (weighted) of predictions
    labels = np.transpose(np.array([y_true, y_pred]))
    labels = labels[labels[:, 1].argsort()[::-1]]      # sort by prediction, descending
    weights = np.where(labels[:, 0] == 0, 20, 1)       # non-defaults get a weight of 20
    cut_vals = labels[np.cumsum(weights) <= int(0.04 * np.sum(weights))]
    top_four = np.sum(cut_vals[:, 0]) / np.sum(labels[:, 0])

    # normalized weighted Gini coefficient
    gini = [0, 0]
    for i in [1, 0]:
        labels = np.transpose(np.array([y_true, y_pred]))
        labels = labels[labels[:, i].argsort()[::-1]]  # i=1: sort by prediction, i=0: ideal sort by label
        weight = np.where(labels[:, 0] == 0, 20, 1)
        weight_random = np.cumsum(weight / np.sum(weight))
        total_pos = np.sum(labels[:, 0] * weight)
        cum_pos_found = np.cumsum(labels[:, 0] * weight)
        lorentz = cum_pos_found / total_pos
        gini[i] = np.sum((lorentz - weight_random) * weight)

    return 0.5 * (gini[1] / gini[0] + top_four)
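To get a feel for the metric’s range (a minimal sketch with synthetic labels and scores, not real model output), we can compare random predictions with a perfect ranking:

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.25).astype(float)   # ~25% default rate, roughly like the dataset
y_rand = rng.random(100_000)                          # uninformative predictions

print('random ranking :', default_metric(y_true, y_rand))   # near-chance score
print('perfect ranking:', default_metric(y_true, y_true))   # best achievable score for these labels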

Training

Here we define our training module:

  • Takes the training data in the form of a 3D NumPy array as input.
  • Uses stratified K-fold cross-validation (5 folds) to handle the data imbalance.
  • Trains a simple GRU model on each fold and validates it on the held-out fold.
import os, gc
from sklearn.model_selection import StratifiedKFold

PATH_TO_MODEL = './Models/'

# X is the [customers, 12, 147] array built above; y is the per-customer
# target array, aligned with X in the same customer order.

# SAVE TRUE AND OOF
true = np.array([])    # true labels of the validation folds
oof = np.array([])     # out-of-fold (validation) predictions
VERBOSE = 2            # use 1 for interactive output

USE_FIRST_FOLD = True  # train only the first fold to save time
kf = StratifiedKFold(n_splits=5)
for fold, (trn_idx, val_idx) in enumerate(kf.split(X, y)):
    print('#' * 25)
    print(f'### Fold {fold+1}')
    X_train, X_valid = X[trn_idx], X[val_idx]
    y_train, y_valid = y[trn_idx], y[val_idx]

    # BUILD AND TRAIN MODEL (batch size 512, 8 epochs only)
    model = gru_model()
    h = model.fit(X_train, y_train,
                  validation_data=(X_valid, y_valid),
                  batch_size=512, epochs=8, verbose=VERBOSE,
                  callbacks=[LR])
    if not os.path.exists(PATH_TO_MODEL):
        os.makedirs(PATH_TO_MODEL)
    model.save_weights(f'{PATH_TO_MODEL}gru_fold_{fold+1}.h5')

    # INFER VALID DATA
    print('#' * 25)
    print('Inferring validation data...')
    p = model.predict(X_valid, batch_size=512, verbose=VERBOSE).flatten()

    print('#' * 100)
    print(f'Fold {fold+1} CV=', default_metric(y_valid, p))
    print('#' * 100)
    true = np.concatenate([true, y_valid])
    oof = np.concatenate([oof, p])

    # CLEAN MEMORY
    del model, X_train, y_train, X_valid, y_valid, p
    gc.collect()

    if USE_FIRST_FOLD:
        break

# PRINT OVERALL RESULTS
if not USE_FIRST_FOLD:
    print('#' * 25)
    print('Overall CV =', default_metric(true, oof))
#########################
### Fold 1
#########################
Epoch 1/8
765/765 - 16s - loss: 0.2383 - val_loss: 0.2307
Epoch 2/8
765/765 - 9s - loss: 0.2258 - val_loss: 0.2299
Epoch 3/8
765/765 - 9s - loss: 0.2230 - val_loss: 0.2266
Epoch 4/8
765/765 - 9s - loss: 0.2210 - val_loss: 0.2294
Epoch 5/8
765/765 - 9s - loss: 0.2191 - val_loss: 0.2295
Epoch 6/8
765/765 - 9s - loss: 0.2176 - val_loss: 0.2242
Epoch 7/8
765/765 - 9s - loss: 0.2156 - val_loss: 0.2237
Epoch 8/8
765/765 - 9s - loss: 0.2142 - val_loss: 0.2254
Inferring validation data...
180/180 - 1s

Fold 1 CV= 0.7866987109852964

We obtain a score of 0.78669 on our defined metric for the first fold. The score can be further improved by:

  • Training on all folds.
  • Adding more layers to the network.
  • More preprocessing of the data.
  • Changing the learning rate or the learning-rate scheduler (a sketch of an alternative schedule follows this list).
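For example, one alternative to the ReduceLROnPlateau callback used above is a cosine-decay schedule passed directly to the optimizer (a minimal sketch, not the configuration behind the reported score):

# Hypothetical alternative: decay the learning rate smoothly from 4e-3
# towards zero over the whole run instead of reducing it on plateaus.
steps_per_epoch = 765                                # batches per epoch, from the training log above
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=4e-3,
    decay_steps=steps_per_epoch * 8,                 # 8 epochs, as in the training loop
)
opt = tf.keras.optimizers.Adam(learning_rate=schedule)
# model.compile(loss=tf.keras.losses.BinaryCrossentropy(), optimizer=opt)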

Things that didn’t work for our dataset but are worth trying:

  • Using a bidirectional GRU/LSTM (see the sketch after this list).
  • Using a 1D CNN block.
  • Using all hidden states, or some transformation of the states output by the GRU layers.
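As an illustration of the first and last ideas (a minimal sketch, untested on this dataset), the single GRU layer inside gru_model could be replaced by a bidirectional GRU that returns the hidden state of every statement, pooled over the time dimension:

# Hypothetical replacement for the recurrent block inside gru_model():
# a bidirectional GRU keeping all 12 hidden states, then average pooling.
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(units=64, return_sequences=True)
)(x)                                              # [batch_size, 12, 128]
x = tf.keras.layers.GlobalAveragePooling1D()(x)   # [batch_size, 128]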

Conclusion

To sum up, this blog has highlighted the importance of incorporating the sequential aspects of data while modeling, and how a tabular dataset can be transformed into a 3D sequential representation. Although LGBM-based models may outperform recurrent models on this data, including GRUs still benefits ensembling and diversifies the pipeline. Overall, it is essential to consider the sequential nature of data when building models, as it can significantly impact performance.
