Credit Card Default Prediction

Stephen Oriyomi
7 min read · Jun 9, 2020

PROJECT DEFINITION

Project Overview

A Taiwan-based credit card issuer wants to better predict the likelihood of default for its customers, as well as identify the key drivers of that likelihood. This would inform the issuer's decisions on whom to give a credit card to and what credit limit to provide. It would also help the issuer better understand its current and potential customers, which would inform its future strategy, including plans to offer targeted credit products to its customers.

Business Problem

A credit card is a flexible tool that lets you use the bank's money for a short period of time. When you accept a credit card, you agree to pay your bill by the due date listed on your credit card statement; otherwise, the account goes into default. When a customer cannot pay back the balance by the due date and the bank is certain it cannot collect the payment, the bank will usually try to sell the loan. If the bank then finds it cannot sell the debt, it writes it off. This is called a charge-off. It results in a significant financial loss to the bank, on top of the damage to the customer's credit rating, which makes this an important problem to tackle. In this project, I will build a machine learning model that predicts which individuals will default on their credit card payment.

Evaluation Metric

The evaluation metric used is the accuracy score. After training on the training data, we evaluate the model's performance on held-out test data. For this, we use the confusion matrix.

Accuracy of the model = (TP + TN) / Total

Here, TP stands for true positives, the cases where we predicted yes and the actual value was yes. TN stands for true negatives, the cases where we predicted no and the actual value was no. FP stands for false positives, the cases where we predicted yes but the actual value was no. FN stands for false negatives, the cases where we predicted no but the actual value was yes.
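
As a quick illustration, the sketch below computes the confusion matrix and accuracy with scikit-learn on a handful of made-up labels (not the project's data):

from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up labels, only to illustrate the metric (1 = default, 0 = no default)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print((tp + tn) / (tp + tn + fp + fn))   # 0.75
print(accuracy_score(y_true, y_pred))    # same value computed directly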

ALGORITHM USED

GRADIENTBOOSTINGCLASSIFIER

Gradient boosting is a popular machine learning algorithm that combines multiple weak learners, such as shallow trees, into one strong ensemble model. It starts by fitting a first model to the data. That first model is unlikely to fit the data points perfectly, so we are left with residuals. We then fit another tree to those residuals in order to minimize a loss function; this can be squared error, but gradient boosting allows the use of any differentiable loss. Repeating this for many iterations leads to a stronger model, and with proper regularization overfitting can be avoided.
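
To make the residual-fitting idea concrete, here is a minimal sketch of boosting shallow trees by hand. It uses a toy regression problem with squared-error loss purely because the residuals are easy to see; sklearn's GradientBoostingClassifier applies the same idea to classification losses.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant model
trees = []

for _ in range(100):
    residuals = y - prediction                     # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # nudge predictions toward the target
    trees.append(tree)

print(np.mean((y - prediction) ** 2))  # training error shrinks as more trees are added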

ANALYSIS & METHODOLOGY

The Data

(Data source: https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset. We acknowledge the following: Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.)

The credit card issuer has gathered information on 30000 customers. The dataset contains information on 24 variables, including demographic factors, credit data, history of payment, and bill statements of credit card customers from April 2005 to September 2005, as well as information on the outcome: did the customer default or not?

The dataset uses a binary variable, default on payment (yes = 1, no = 0) in column 24, as the response variable. There are 23 features in this set:

  • ID: ID of each client
  • LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
  • SEX: Gender (1=male, 2=female)
  • EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
  • MARRIAGE: Marital status (1=married, 2=single, 3=others)
  • AGE: Age in years
  • PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
  • PAY_2: Repayment status in August, 2005 (scale same as above)
  • PAY_3: Repayment status in July, 2005 (scale same as above)
  • PAY_4: Repayment status in June, 2005 (scale same as above)
  • PAY_5: Repayment status in May, 2005 (scale same as above)
  • PAY_6: Repayment status in April, 2005 (scale same as above)
  • BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
  • BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
  • BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
  • BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
  • BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
  • BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
  • PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
  • PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
  • PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
  • PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
  • PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
  • PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
  • default.payment.next.month: Default payment (1=yes, 0=no)

After loading the dataset, we will look at our data structure, data types, and null values. Then we will try to find relationships between our features. The final step will be to plot some of our data from different angles.
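
A minimal sketch of that first pass is shown below. The CSV file name and the rename of the target column to defaulted are assumptions based on how the data is referenced later in the post.

import pandas as pd

# Load the Kaggle/UCI dataset (file name assumed from the Kaggle download)
df = pd.read_csv('UCI_Credit_Card.csv')

# Rename the target to the shorter name used later in the post (assumption)
df = df.rename(columns={'default.payment.next.month': 'defaulted'})

df.info()                   # data structure and data types
print(df.isnull().sum())    # null values per column
print(df.describe())        # basic summary statistics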

Procedure:

  1. Data cleaning: the dataset is very neat, so little modification is needed.
  2. EDA: Looking at the column names, I noticed several columns with very similar names, which suggests a potential multicollinearity problem. I made some plots of features with similar names, and the plots showed strong correlations between them, indicating that feature selection is needed since the model I intended to use is regression. Here is one plot I created using Seaborn:
import seaborn as sns
import matplotlib.pyplot as plt

# Plot columns with similar names to check the correlations
sns.pairplot(df, vars=df.columns[11:17], kind='scatter')
sns.pairplot(df, vars=df.columns[17:23])
# Distribution of the target value to check for class imbalance
df['defaulted'].hist()
plt.xlabel('DEFAULT_PAY')
plt.ylabel('COUNT')
plt.title('Default Credit Card Clients - target value - data unbalance\n (Not Default = 0, Default = 1)')

From the figure above, we see that we have imbalanced data: the percentage of people who did not default is much higher than the percentage of people who did.
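
To quantify that imbalance, the class proportions can be checked directly; this is a small sketch assuming the target column is named defaulted as above. In this dataset roughly 78% of clients did not default and about 22% did.

# Share of each class in the target column
print(df['defaulted'].value_counts(normalize=True))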

# Checking the counts of defaulters and non-defaulters by sex
sns.countplot(x='SEX', data=df, hue='defaulted')
1=Male, 2=Female

After exploring the data, it was discovered that females had the highest tendency to default. This could be due to several reasons, which I was not able to determine from the data.

1=Married, 2=Single, 3=Others
sns.countplot(x="MARRIAGE", data=df,hue="defaulted", palette="muted")

People who were single tended to default more often than people who were married.

RESULT

Modeling

This step consists of trying to find the best model to predict our target value (default on the credit payment the following month).
As we saw during our EDA, we still need to do some data engineering to deal with the categorical features and rescale the numerical features. Once the data engineering is done, we will take the following pragmatic approach:

  • Select a list of known classification models
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
  • Run a baseline model for each of our pre-selected models
  • Pick the top models based on a specific score (accuracy_score)

So below is the list of pre-selected classification models:

  • Logistic Regression
  • Decision Tree Classifier
  • KNeighborsClassifier
  • GradientBoostingClassifier
# Putting feature variable to X
X = df.drop('defaulted',axis=1)
# Putting response variable to y
y = df['defaulted']
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

The data was divided into train and test sets. The test size is 30% of the total data, which implies that the train size is 70%.

num_folds = 10
seed = 7
scoring = 'roc_auc'

# Spot-check a few algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('GB', GradientBoostingClassifier()))

results = []
names = []
for name, model in models:
    # shuffle=True is needed when passing a random_state to KFold
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print('{}: {:.4f} ({:.4f})'.format(name, cv_results.mean(), cv_results.std()))

We use ROC AUC to spot-check the algorithms because we have an imbalanced dataset. K-fold cross-validation is used because it generally results in a less biased estimate compared to a single train/test split: it ensures that every observation from the original dataset has a chance of appearing in both the training and test sets, which makes it one of the best approaches when the input data is limited. The procedure follows the steps shown in the code above.
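
To compare the spot-checked models visually, the collected cross-validation scores can be plotted side by side; this is a small sketch building on the results and names lists from the code above.

import matplotlib.pyplot as plt

# Box plot of the cross-validated ROC AUC scores for each algorithm
fig, ax = plt.subplots(figsize=(8, 5))
ax.boxplot(results, labels=names)
ax.set_title('Algorithm comparison (10-fold cross-validation)')
ax.set_ylabel('ROC AUC')
plt.show()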

Conclusion

The data was cleaned and explored. I tried different algorithms when building the model, and GradientBoostingClassifier performed best.

As seen in the GitHub repository (link below), our best-performing model for predicting the target value (default credit payment status) is gradient boosting. The model can be improved if the features used for prediction are scaled and the hyperparameters are tuned properly.

Hence, I used the StandardScaler from sklearn to standardize the features to zero mean and unit variance.

# Scaling the features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
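
One way to follow up on the hyperparameter tuning suggestion is a small grid search over a few GradientBoostingClassifier settings. The parameter grid below is an illustrative assumption, not the values used in the repository.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative grid; the values worth searching depend on the time budget
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [2, 3],
}

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=seed),
    param_grid,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train_scaled, y_train)

print(grid.best_params_)   # best hyperparameter combination found
print(grid.best_score_)    # corresponding cross-validated ROC AUC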

github code: https://github.com/Sty-ven/Credit-Card-Default


Stephen Oriyomi

Data Science Fellow @ IDAF I pretty much love anything that involves data. I don't have it all figured out.. but who does?