Predicting Default Payments of Credit Card Clients

MariceMane
5 min read · Sep 8, 2022


This project belongs to the financial domain, where banks in Taiwan aim to identify clients with a high risk of default. Such defaults negatively affected consumer confidence and, in turn, aggravated the financial crisis. Even though clients with a payment default are a small fraction of all users, it remains necessary to identify this category of customers and understand the reasons behind their non-payment.

In some cases the cause is a sudden change in a person's income level; in other cases it is a deliberate act, where the client knows he will not be able to cover his financial charges but keeps using the card until the bank blocks it. This last situation can be considered a type of fraud and is very difficult to predict.

The data we will use in the project is available on Kaggle.

This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

There are 25 variables:

X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.

X2: Gender (1 = male; 2 = female).

X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

X4: Marital status (1 = married; 2 = single; 3 = others).

X5: Age (year).

X6–X11: History of past monthly payment records, from April to September 2005 (-1 = paid duly; 1 = payment delay for one month; 2 = payment delay for two months; …; 9 = payment delay for nine months and above).

X12–X17: Amount of bill statement (NT dollar). (from April to September 2005).

X18–X23: Amount of previous payment (NT dollar). (from April to September 2005).

Y: Default payment next month (1 = yes; 0 = no). This is the target variable we want to predict.

Data analysis:

We will start by analyzing the data to better understand it and see what issues we are dealing with.

First, we noticed that the number of customers with Default = 0 is much greater than the number with Default = 1, so the classes are imbalanced.

We then counted the customers in each education category, and we can clearly see that the education level is predominantly 'University' or 'Graduate School'.

Based on this histogram, we notice that most customers without a payment default (Default = 0) are single, while married customers show a greater tendency to default (Default = 1) than the other categories.

There are considerably more women than men in the dataset, but men are more likely to default.

Regarding outliers, customers over 60 years old are considered outliers because they lie far from the bulk of the other values (centered on [25, 40]), as shown below.
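The counts behind these observations can be reproduced with a few pandas one-liners. A minimal sketch on a small made-up frame standing in for the Kaggle CSV (the column names `SEX`, `EDUCATION`, `MARRIAGE`, `AGE`, `default` are assumed; they vary slightly between versions of the dataset):

```python
import pandas as pd

# Toy stand-in for the Kaggle file: same column names, made-up rows
df = pd.DataFrame({
    "SEX":       [1, 2, 2, 2, 1, 2],   # 1 = male, 2 = female
    "EDUCATION": [1, 2, 2, 1, 3, 2],
    "MARRIAGE":  [1, 2, 2, 1, 1, 2],   # 1 = married, 2 = single
    "AGE":       [35, 28, 67, 41, 30, 26],
    "default":   [1, 0, 0, 1, 1, 0],
})

# Class balance: non-defaulters (0) vs defaulters (1)
class_counts = df["default"].value_counts()

# Education levels, most frequent first
edu_counts = df["EDUCATION"].value_counts()

# Default rate per marital-status category
default_by_marriage = df.groupby("MARRIAGE")["default"].mean()

# Flag age outliers beyond 60, far from the [25, 40] bulk
age_outliers = df[df["AGE"] > 60]
```

On the real data, plotting these series (`class_counts.plot(kind="bar")`, etc.) yields the histograms discussed above.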

Data Preparation:

First, we will drop the ID column, then regroup values 0, 4, 5, and 6 of the "EDUCATION" feature into category 4 ("others"), since those categories contain few observations.

For the outliers, we will remove observations where age is greater than 60 or LIMIT_BAL exceeds 600,000, as well as the observations where PAY_AMT1–6 > 3, because they are negligible compared to the other values.

As for illogical values, we will delete two kinds of observations: customers who paid their bills in full every month (PAY_1–6 = -1) yet are labeled with a payment default (= 1), and customers whose 'BILL_AMT' columns are all negative (meaning they overpaid their invoices) yet are likewise labeled with a payment default (= 1).

Observations of customers who made payments while their history shows that no transaction was made in any month will also be removed.

We will also consider removing the observations where PAY_AMT and BILL_AMT are equal to 0.
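The filtering rules above can be sketched as boolean masks in pandas. This is a sketch on toy rows, assuming the column names `ID`, `AGE`, `LIMIT_BAL`, `EDUCATION`, `BILL_AMT1`, `BILL_AMT2`, `default` (only two of the six BILL_AMT columns are shown to keep the example short):

```python
import pandas as pd

# Toy rows covering the cases the cleaning rules touch
df = pd.DataFrame({
    "ID":        [1, 2, 3, 4],
    "AGE":       [30, 65, 45, 33],
    "LIMIT_BAL": [50000, 200000, 700000, 120000],
    "EDUCATION": [0, 2, 5, 1],
    "BILL_AMT1": [1000, -50, 2000, -10],
    "BILL_AMT2": [500, -20, 1500, -5],
    "default":   [0, 0, 1, 1],
})

# Drop the identifier column
df = df.drop(columns="ID")

# Regroup the rare EDUCATION codes 0, 5, 6 into 4 ("others")
df["EDUCATION"] = df["EDUCATION"].replace({0: 4, 5: 4, 6: 4})

# Remove age and credit-limit outliers
df = df[(df["AGE"] <= 60) & (df["LIMIT_BAL"] <= 600000)]

# Remove illogical rows: every bill negative (overpaid) yet labeled default
bill_cols = ["BILL_AMT1", "BILL_AMT2"]
all_negative = (df[bill_cols] < 0).all(axis=1)
df = df[~(all_negative & (df["default"] == 1))]
```

On the real dataset, the same pattern extends to the remaining rules (PAY_1–6 = -1 with default = 1, all-zero PAY_AMT/BILL_AMT rows, etc.).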

The next step is one-hot encoding of the categorical variables: the features "EDUCATION" and "MARRIAGE" will each be split into 4 binary columns.
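One-hot encoding is a one-liner with `pandas.get_dummies`; a small sketch (the category values shown are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "EDUCATION": [1, 2, 3, 4],
    "MARRIAGE":  [1, 2, 3, 1],
})

# One binary column per category value of each categorical feature
encoded = pd.get_dummies(df, columns=["EDUCATION", "MARRIAGE"])
```

Here EDUCATION (4 distinct values) becomes 4 columns (`EDUCATION_1` … `EDUCATION_4`) and MARRIAGE (3 distinct values here) becomes 3, so each row has exactly one 1 per original feature.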

Data modeling:

Since our dataset contains large discrepancies between the scales of its features (especially LIMIT_BAL), we opted for scaling to standardize the values and obtain a good visualization.

For the modeling step, we will be using SVM to predict the default of the customers.

Support Vector Machine (SVM) is a machine learning algorithm that separates the data into classes with a "maximum margin": the "hyperplane" chosen as the separation boundary is the one that maximizes the distance to the nearest points of each class.
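A sketch of the training step with scikit-learn's `SVC`. Since the real cleaned dataset is not reproduced here, `make_classification` generates a synthetic imbalanced stand-in; the kernel and other hyperparameters are illustrative assumptions, not the article's exact settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced stand-in for the cleaned, scaled credit-card features
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# RBF-kernel SVM; the fitted hyperplane maximizes the margin between classes
clf = SVC(kernel="rbf", random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

`stratify=y` keeps the default/non-default ratio identical in both splits, which matters with imbalanced classes like these.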

The model has an 81.9% accuracy.

The confusion matrix is as follows.

ROC (Receiver Operating Characteristic) analysis assesses a classifier by plotting the sensitivity (true positive rate) against the false positive rate across decision thresholds. To compare ROC curves properly, we rely on the AUC (Area Under the Curve), which here is 0.65.
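Both metrics come straight from `sklearn.metrics`. A sketch on hypothetical labels and scores standing in for the SVM's test-set output (the numbers are illustrative, not the article's results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical true labels and classifier scores for ten test points
y_true   = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_scores = np.array([0.1, 0.3, 0.2, 0.8, 0.7, 0.4, 0.9, 0.2, 0.6, 0.1])
y_pred   = (y_scores >= 0.5).astype(int)  # threshold at 0.5

cm  = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted
auc = roc_auc_score(y_true, y_scores)    # area under the ROC curve
```

Note that the confusion matrix depends on the chosen threshold, whereas the AUC summarizes ranking quality over all thresholds, which is why it is the better basis for comparing models on imbalanced data.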

According to the performance measures detailed in the previous parts, SVM scores well. In fact, the same experiment was run with several other algorithms (KNN, logistic regression, decision trees, random forest, AdaBoost, LightGBM, discriminant analysis, Naive Bayes, and an ANN), and SVM scored best among all of them.

