MACHINE LEARNING

Credit Default Prediction — Practical Tips for Successful Execution

Improve existing prediction baselines with advanced ML

Priyanshu Chaudhary
Published in CueNex · 8 min read · Feb 20, 2023

According to TransUnion credit card and personal loan data, delinquencies in the United States are expected to rise to levels not seen since 2010.

TransUnion forecasted severe credit card delinquencies to rise to 2.6% at the end of 2023 from 2.1% at the close of 2022. Unsecured personal loan delinquency rates will increase to 4.3% from 4.1% in the same timeframe.

Credit default prediction has therefore always been central to managing risk in consumer lending. It allows lenders to optimize lending decisions and minimize risk and exposure, which leads to a better customer experience and sound business economics: by predicting which customers are at the highest risk of defaulting on their credit card accounts, issuers can take proactive steps before losses materialize.

Background

We implemented this solution for one of the largest payment card issuers in the world, and here we share practical tips for the successful execution of credit default prediction. This article is aimed primarily at machine learning teams and business transformation leaders across startups and mid-sized to large corporations.

This article encapsulates our journey of successfully executing credit default prediction, and we will share many innovative, out-of-the-box approaches that worked well during the end-to-end execution of this project. We believe the solutions in this series will improve existing ML baselines for default prediction.

Project Description

The approach involves analyzing a dataset containing credit card transactions and payment records for a group of customers over a 12-month period spanning 2021–2022.

Specifically, the data includes customer information from their last 1–13 statements. The analysis focuses on features associated with credit card defaults, with the goal of accurately predicting whether a customer is likely to default in the next 6 months. The aim is to identify patterns and trends in these features that can be used to assess the risk of a customer defaulting on their credit card payments.

Dataset description

The dataset depicted in this article is anonymized and masked to maintain the confidentiality of the customer data. The features can be classified as follows:

  • D_* = Delinquency variables
  • S_* = Spend variables
  • P_* = Payment variables
  • B_* = Balance variables
  • R_* = Risk variables

There are a total of 100 integer features and 100 floating-point features representing a customer’s status over the past 12 months. The number of statements available per customer varies from 1 to 13, and there can be a gap of 30 to 180 days between consecutive statements (i.e., credit card statements can be missing for a customer). Each customer is identified by a customer_ID. The sample data for the first 5 statements of the customer with customer_ID=0 is shown below:
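Since the published sample table is not reproduced here, below is a minimal pandas sketch for inspecting those rows; the file name train_data.csv is a placeholder for wherever the statement-level data lives.

import pandas as pd

# Load the statement-level data: one row per customer statement (hypothetical file name)
df = pd.read_csv("train_data.csv")

# First 5 statements of the customer with customer_ID=0
print(df[df["customer_ID"] == 0].head(5))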

Of the 7 million customer_IDs, 98% have a label of "0" (good customer, no default), and 2% have a label of "1" (bad customer, default).

Types of approach

The data includes the monthly credit card statements of each customer, so many approaches are possible, either aggregating the statements away or exploiting their time-series structure:

  1. Decision tree models/neural networks using aggregated features of each customer’s data.
  2. Recurrent neural networks/Transformers using the raw sequence of each customer’s statements.

Feature Engineering

We start by creating aggregated features for all customers. Since the data contains both categorical and numerical features, we process the two groups separately.

Since we want to predict a binary value of whether a customer will default or not, we can create aggregated features such as:

  1. Average values for the last 13 statements
  2. Last values of the features, i.e., the most recent status of the customer
  3. The minimum value in the feature
  4. The maximum value in the feature
  5. For categorical variables, we can count the number of unique values

For numerical features, we created aggregates such as mean, standard deviation, minimum, maximum, median, and last value, which can be calculated using the code snippet below.

Num_feats  = df.groupby("customer_ID")[numerical_features].agg(['mean', 'std', 'min', 'max', 'median', 'last'])
Num_feats.columns = ['_'.join(x) for x in Num_feats.columns]

For categorical features, we create the count, the last value, and the number of unique categories for each customer.

Cat_feats  = df.groupby("customer_ID")[categorical_features].agg(['count','last','nunique'])
Cat_feats.columns = ['_'.join(x) for x in Cat_feats.columns]

The dataset includes null values; we either fill them with a sentinel value (-128 in this case) or leave them as NaN, since gradient-boosted tree models can handle missing values natively.
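For example, applied to the aggregated numerical features (the sentinel -128 lies outside the range of the anonymized features):

# Option 1: replace nulls with a sentinel value outside the feature range
Num_feats = Num_feats.fillna(-128)
# Option 2: leave NaNs as-is; LightGBM learns a default split direction
# for missing values during training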

A sample of data after feature engineering looks like this
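Before training, the two aggregate frames can be combined into a single training table. A minimal sketch, assuming a labels frame that maps each customer_ID to a binary target (how the labels are stored is an assumption here):

# Join the numerical and categorical aggregates on customer_ID (the groupby index)
train = Num_feats.join(Cat_feats)

# Attach the default labels; `labels` is assumed to hold one target per customer_ID
train = train.join(labels.set_index("customer_ID")["target"]).reset_index()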

Evaluation Metric

Because the classes are highly imbalanced, accuracy would not be an ideal metric for evaluating a decision tree classifier. The possible metrics that can be used are:

  1. AUC-ROC: The Area Under the Receiver Operating Characteristic curve, which measures the model’s ability to distinguish between Positive and Negative classes
  2. Recall: The proportion of True Positives among all Actual Positive cases
  3. Normalized Gini coefficient: equal to 2*AUC - 1, bounded between -1 and 1
  4. The Default Rate Captured at the p% threshold: True Positive rate (Recall) for a threshold set at p% of the total (weighted) sample count

For the above figure, the red area represents the AUC-ROC score, and the intersection of the green line with the curve represents the default rate captured at p% (the higher the intersection, the better).

We evaluate our approach using the average of the normalized Gini coefficient and the default rate captured at 4%.


import numpy as np

def evaluation_metric(y_true, y_pred):
    # Default rate captured at 4%: sort by predicted score (descending) and
    # measure the fraction of all defaults found in the top 4% of the
    # weighted sample
    labels = np.transpose(np.array([y_true, y_pred]))
    labels = labels[labels[:, 1].argsort()[::-1]]
    weights = np.where(labels[:, 0] == 0, 20, 1)  # label-0 (non-default) rows are re-weighted by 20 to compensate for their subsampling
    cut_vals = labels[np.cumsum(weights) <= int(0.04 * np.sum(weights))]
    top_four = np.sum(cut_vals[:, 0]) / np.sum(labels[:, 0])

    # Normalized Gini coefficient (gini = 2*AUC - 1): the model's Gini
    # divided by the Gini of a perfect ranking
    gini = [0, 0]
    for i in [1, 0]:
        labels = np.transpose(np.array([y_true, y_pred]))
        labels = labels[labels[:, i].argsort()[::-1]]
        weight = np.where(labels[:, 0] == 0, 20, 1)  # weighing labels
        weight_random = np.cumsum(weight / np.sum(weight))
        total_pos = np.sum(labels[:, 0] * weight)
        cum_pos_found = np.cumsum(labels[:, 0] * weight)
        lorentz = cum_pos_found / total_pos
        gini[i] = np.sum((lorentz - weight_random) * weight)

    return 0.5 * (gini[1] / gini[0] + top_four)
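A quick sanity check on toy data (the scores below are made up for illustration; the ranking is perfect, so the metric returns its maximum of 1.0):

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_scores = np.array([0.1, 0.4, 0.9, 0.2, 0.8, 0.3, 0.1, 0.7])
print(evaluation_metric(y_true, y_scores))  # 1.0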

Model Evaluation Strategy

Since the data has an imbalanced class distribution, stratified K-folds is a good cross-validation strategy. By ensuring that each fold has a similar proportion of the positive class as the overall dataset, you can better evaluate the performance of your model on the minority class. A pictorial representation of stratified K-folds is shown below.
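As a quick check of that property (using the train frame built above), the positive rate in every validation fold stays close to the global ~2%:

from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=5)
for fold, (trn_ind, val_ind) in enumerate(kf.split(train, train["target"])):
    # Each validation fold preserves the overall class proportions
    print(f"Fold {fold}: positive rate = {train['target'].iloc[val_ind].mean():.4f}")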

Baseline Model

LightGBM (LGBM) is a good choice for classification tasks, especially on large datasets, as it frequently achieves state-of-the-art results on tabular data.

LGBM uses gradient boosting to build an ensemble of decision trees that can capture complex interactions between features. A single tree of the LGBM after training on our dataset is shown below.

import datetime
import warnings

import numpy as np
from lightgbm import LGBMClassifier, log_evaluation
from sklearn.model_selection import StratifiedKFold

ONLY_FIRST_FOLD = True  # set to False to train on all folds

features = [f for f in train.columns if f != 'customer_ID' and f != 'target']

# For storing validation results
score_list = []
y_pred_list = []

kf = StratifiedKFold(n_splits=5)  # generating 5 stratified folds
target = train["target"]
for fold, (trn_ind, test_ind) in enumerate(kf.split(train, target)):
    start_time = datetime.datetime.now()

    # Creating data for training
    X_train = train.iloc[trn_ind][features]
    y_train = target[trn_ind]

    # Creating data for validation
    X_valid = train.iloc[test_ind][features]
    y_valid = target[test_ind]

    # Define the LGBM classifier with its parameters
    model = LGBMClassifier(n_estimators=1000,
                           learning_rate=0.03, reg_lambda=50,
                           min_child_samples=2400,
                           num_leaves=95,
                           colsample_bytree=0.19,
                           max_bins=511, random_state=101)

    # Suppress the warnings raised by the LGBM classifier
    with warnings.catch_warnings():
        warnings.filterwarnings('ignore', category=UserWarning)
        model.fit(X_train, y_train,
                  eval_set=[(X_valid, y_valid)],
                  # LightGBM expects (name, value, is_higher_better) from a custom metric
                  eval_metric=[lambda y_t, y_p: ('', evaluation_metric(y_t, y_p), True)],
                  callbacks=[log_evaluation(100)])

    y_valid_pred = model.predict_proba(X_valid, raw_score=True)
    score = evaluation_metric(y_valid, y_valid_pred)

    # Save the number of trees at the best evaluation score
    n_trees = model.best_iteration_
    if n_trees is None:
        n_trees = model.n_estimators

    print(f"Fold {fold} | {str(datetime.datetime.now() - start_time)[-12:-7]} |"
          f" {n_trees:5} trees | Score = {score:.5f}")

    score_list.append(score)

    # Delete the training and validation data to save memory
    del X_train, y_train, X_valid, y_valid

    if ONLY_FIRST_FOLD:
        break  # train on the first fold only

print(f"OOF Score: {np.mean(score_list):.5f}")
469 features
[100] valid_0's binary_logloss: 0.247296 valid_0's : 0.764518
[200] valid_0's binary_logloss: 0.22843 valid_0's : 0.779298
[300] valid_0's binary_logloss: 0.223237 valid_0's : 0.786733
[400] valid_0's binary_logloss: 0.220893 valid_0's : 0.790104
[500] valid_0's binary_logloss: 0.219559 valid_0's : 0.791775
[600] valid_0's binary_logloss: 0.218766 valid_0's : 0.792098
[700] valid_0's binary_logloss: 0.218199 valid_0's : 0.793434
[800] valid_0's binary_logloss: 0.217769 valid_0's : 0.79412
[900] valid_0's binary_logloss: 0.21744 valid_0's : 0.794347
[1000] valid_0's binary_logloss: 0.217244 valid_0's : 0.794655
Fold 0 | 04:34 | 1000 trees | Score = 0.794655

Compared to a maximum score of 1, we obtain a score of 0.794655 on our evaluation metric, the average of the normalized Gini coefficient and the default rate captured at 4%, for the first fold (training on 80% of the data and validating on the remaining 20%).

We will create a bar chart that displays the 15 most significant features and their corresponding importance scores.

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import plotly.express as px

def plotImp(model, X, num=20):
    # Pair every feature with the importance score learned by the model
    feature_imp = pd.DataFrame({'Value': model.feature_importances_,
                                'Feature': X.columns})
    # Plot the `num` most important features as a horizontal bar chart
    fig = px.bar(feature_imp.sort_values(by="Value", ascending=False)[0:num],
                 x="Value", y="Feature", orientation='h', color='Feature')
    fig.show()

plotImp(model, train[features], 15)

This indicates that the "last" aggregated features, which capture the customer’s most recent statement status, are the strongest indicators of default risk. In addition, the customer’s balance (B_*) and delinquency (D_*) variables are among the most important features for accurate default prediction.

Python's lightgbm package also provides a plot_tree() function to plot a single decision tree from the learned ensemble. This can be useful to visualize the internal workings of the model and understand how it makes predictions.
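A minimal sketch (plot_tree requires the graphviz package to be installed, and `model` is the classifier trained above):

import lightgbm
import matplotlib.pyplot as plt

# Visualize the first tree of the trained ensemble
fig, ax = plt.subplots(figsize=(20, 12))
lightgbm.plot_tree(model, tree_index=0, ax=ax)
plt.show()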

Now that we have the baseline model ready, in our next blog we will discuss more approaches in detail, such as:

  1. Introducing more feature engineering techniques.
  2. Leveraging the time-series characteristics of the data to train Transformers and recurrent neural networks (such as GRUs).
  3. Using a combination of Tree-based models along with Neural Networks.

We look forward to sharing more practical tips from this successful project, which we hope will serve as a ready-reckoner guide for credit default prediction.
