The default of Credit Card Clients Dataset

Manish Kumar
9 min read · Jul 16, 2020


Dataset Information

This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

Dataset source:- https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset

Table of Contents:-

Exploratory Data analysis

Hypothesis Testing

Model Implementation

Model Evaluation

Step 1:-

Exploratory Data Analysis:- Exploratory data analysis is a way to understand the data and explore its main characteristics, often with the help of summary statistics and visualizations.

Approach for Exploratory Data analysis

  1. maximize insight into a data set;
  2. uncover underlying structure;
  3. extract important variables;
  4. detect outliers and anomalies;
  5. test underlying assumptions;
  6. develop parsimonious models; and
  7. determine optimal factor settings.

Load the data and import useful libraries
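
A minimal sketch of this step, assuming the Kaggle CSV has been downloaded locally as UCI_Credit_Card.csv (the file name is an assumption):

```python
# Minimal sketch: import the libraries used below and load the dataset.
# Assumption: the Kaggle CSV is saved locally as "UCI_Credit_Card.csv".
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("UCI_Credit_Card.csv")
print(df.shape)  # expected: (30000, 25)
df.head()
```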

Dataset description

There are 25 variables:

  • ID: ID of each client
  • LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
  • SEX: Gender (1=male, 2=female)
  • EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
  • MARRIAGE: Marital status (1=married, 2=single, 3=others)
  • AGE: Age in years
  • PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
  • PAY_2: Repayment status in August, 2005 (scale same as above)
  • PAY_3: Repayment status in July, 2005 (scale same as above)
  • PAY_4: Repayment status in June, 2005 (scale same as above)
  • PAY_5: Repayment status in May, 2005 (scale same as above)
  • PAY_6: Repayment status in April, 2005 (scale same as above)
  • BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
  • BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
  • BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
  • BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
  • BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
  • BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
  • PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
  • PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
  • PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
  • PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
  • PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
  • PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
  • default.payment.next.month: Default payment (1=yes, 0=no)

Check for null values in the dataset

Null values in dataset

This shows that our dataset has no null values.
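
A minimal sketch of this check, assuming the DataFrame is named df as above:

```python
# Count missing values per column; all zeros means the dataset has no nulls.
print(df.isnull().sum())
```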

Data visualization:- Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

Check the imbalance

The above graph shows that the dataset is not hugely imbalanced, though it is not perfectly balanced either; techniques for handling this imbalance are discussed later.

Basically, this is a binary classification problem. The percentage of the “Default” class is about 22%, so the data imbalance is not significant.
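
A minimal sketch of the imbalance check, assuming df from above and the Kaggle column name default.payment.next.month:

```python
# Class counts and default rate.
target = df["default.payment.next.month"]
print(target.value_counts())
print("Default rate: {:.1%}".format(target.mean()))

# Bar chart of the two classes.
sns.countplot(x=target)
plt.title("Default (1) vs non-default (0)")
plt.show()
```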

Then, let’s take a look at how different predictors affect our target.

education with the target variable

Distribution of Education
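
A minimal sketch of this plot, assuming df and seaborn as above:

```python
# Number of clients per education level, split by default status.
sns.countplot(x="EDUCATION", hue="default.payment.next.month", data=df)
plt.title("Education level vs default")
plt.show()
```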

From the above plot, we can see that most of the defaulters hold a graduate-school, university, or high-school degree. Among them, clients with a university degree are more likely to default than the others.

sex distribution

In absolute counts, more females default than males, though females also make up the larger share of clients in the dataset.

Age distribution

As age increases toward 30, the probability of default increases. Meanwhile, for clients over 30, the probability decreases with age.

Credit Amount Distribution

Clients with lower credit amounts tend to default more often; in particular, those with a credit amount of around NT$50,000 default the most.

Features correlation

For the numeric values, let’s represent the correlation of the features.

Let’s check the correlation of the Amount of bill statement in April — September 2005.
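
A minimal sketch of the correlation matrix below, using a seaborn heatmap; the same snippet with the PAY_AMT or PAY columns reproduces the other matrices:

```python
# Correlation between the six bill-statement amounts (September back to April 2005).
bill_cols = ["BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6"]
sns.heatmap(df[bill_cols].corr(), annot=True, fmt=".2f", cmap="Blues")
plt.title("Correlation of bill amounts")
plt.show()
```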

correlation matrix

The correlation decreases as the distance between months grows; the lowest correlations are between September and April.

Let’s check the correlation of Amount of previous payment in April — September 2005.

correlation matrix

There is little to no correlation between the amounts of previous payments for April-September 2005.

Let’s check the correlation between Repayment status in April — September 2005.

Hypothesis Testing for variable selection

Hypothesis testing is a statistical method used to make statistical decisions from experimental data. A hypothesis is basically an assumption that we make about a population parameter.

Hypothesis testing is an essential procedure in statistics. A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data. When we say that a finding is statistically significant, it’s thanks to a hypothesis test.

Two-tailed T-test :

A two-tailed test is a statistical test in which the critical area of a distribution is two-sided and tests whether a sample is greater than or less than a certain range of values. If the sample being tested falls into either of the critical areas, the alternative hypothesis is accepted instead of the null hypothesis.

P-value :

The P-value, or calculated probability, is the probability of finding the observed, or more extreme, results when the null hypothesis (H0) of a study question is true; the definition of ‘extreme’ depends on how the hypothesis is being tested.

If your P-value is less than the chosen significance level then you reject the null hypothesis i.e. accept that your sample gives reasonable evidence to support the alternative hypothesis. It does NOT imply a “meaningful” or “important” difference; that is for you to decide when considering the real-world relevance of your result.

Degree of freedom :

Suppose you have a data set with 10 values. If you are not estimating anything, each value can take on any number; each value is completely free to vary. But suppose you want to test the population mean with a sample of 10 values, using a 1-sample t-test. You now have a constraint: the estimate of the mean. By definition of the mean, the sum of all values in the data must equal n × mean, where n is the number of values in the data set. Once the mean is fixed, only n - 1 values are free to vary, so the degrees of freedom are n - 1 = 9.

Null Hypothesis:- The variable has no relationship with the customer’s next-month payment status (its mean is the same for defaulters and non-defaulters).

Alternate Hypothesis:- The variable does depend on the customer’s next-month payment status (its mean differs between the two groups).

Two-sample t-test to check whether each variable is significant or not

Two sample T-Test
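
A minimal sketch of the test using scipy, assuming df from above; the handful of columns here is just for illustration, and in practice the loop would run over all candidate variables:

```python
from scipy import stats

# Split each variable by default status and compare the group means.
defaulted = df[df["default.payment.next.month"] == 1]
paid = df[df["default.payment.next.month"] == 0]

for col in ["LIMIT_BAL", "AGE", "PAY_AMT1", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6"]:
    t_stat, p_value = stats.ttest_ind(defaulted[col], paid[col], equal_var=False)
    print(f"{col}: t = {t_stat:.2f}, p-value = {p_value:.4f}")
```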

Variables and their P-value

Here our level of significance (alpha) is 0.05. The variables BILL_AMT4, BILL_AMT5, and BILL_AMT6 have p-values above 0.05, so for them we fail to reject the null hypothesis; they are not significant and can be dropped. For the remaining variables, the p-values are below 0.05, so we reject the null hypothesis and keep them.

Train-Test split

Train-test split
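
A minimal sketch of the split; the 80/20 ratio and the stratification are assumptions, not necessarily the exact settings used above:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["ID", "default.payment.next.month"])
y = df["default.payment.next.month"]

# Hold out 20% for testing; stratify so both sets keep the ~22% default rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```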

Model implementation

  • Logistic regression
  • Random Forest
  • XGBoost

Logistic regression:- Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).

Logistic regression
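
A minimal sketch of the model, assuming the X_train/y_train split above; the feature-scaling step is an assumption added here because logistic regression converges more reliably on scaled inputs:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features, then fit a logistic regression with balanced class weights.
log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
log_reg.fit(X_train, y_train)
y_pred_lr = log_reg.predict(X_test)
```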

How does logistic regression’s class_weight work?

  • With class_weight="balanced" you capture more true events (higher recall on the positive class), but you are also more likely to raise false alarms (lower precision on the positive class).
  • As a result, the total percentage predicted as positive can be higher than the actual rate because of all the false positives.
  • AUC might misguide you here if the false alarms are an issue.
  • There is no need to shift the decision threshold to the imbalance percentage; even for strong imbalance it is fine to keep 0.5 (or somewhere around that, depending on what you need).
Confusion matrix for logistic regression

How does the confusion matrix work?

  • true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
  • true negatives (TN): We predicted no, and they don’t have the disease.
  • false positives (FP): We predicted yes, but they don’t actually have the disease. (Also known as a “Type I error.”)
  • false negatives (FN): We predicted no, but they actually do have the disease. (Also known as a “Type II error.”)
  • F1 score: the harmonic mean of precision and recall, giving a single score that balances the two.
  • ROC curve: the ROC curve plots the true positive rate against the false positive rate at various cut-off points, and illustrates the trade-off between sensitivity (recall) and specificity (the true negative rate).
  • Precision: the precision metric shows the accuracy of the positive class; it measures how likely a positive prediction is to be correct.
Classification report
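
A minimal sketch of producing the confusion matrix and classification report above, assuming the fitted log_reg model and the test set from the earlier split:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Rows are the actual classes, columns are the predicted classes.
print(confusion_matrix(y_test, y_pred_lr))

# Precision, recall and F1 score for each class.
print(classification_report(y_test, y_pred_lr))
```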

Random Forest:- The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.

Random forest model
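
A minimal sketch of the model; the hyperparameters shown here are assumptions, not necessarily the exact settings used above:

```python
from sklearn.ensemble import RandomForestClassifier

# Bagged ensemble of decision trees; class_weight="balanced" compensates for the ~22% default rate.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
```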

Random forest confusion matrix output

Code for the ROC curve
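
A minimal sketch of that code, assuming the fitted rf model and the test set from above:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Use predicted probabilities of the positive (default) class for the ROC curve.
y_score = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)

plt.plot(fpr, tpr, label="Random forest (AUC = {:.3f})".format(roc_auc_score(y_test, y_score)))
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```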

Roc curve for random forest

XGBoost:- XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

XGboost model
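
A minimal sketch of the classifier, assuming the xgboost package is installed; the hyperparameters, including scale_pos_weight for the roughly 1:3 class ratio, are assumptions rather than the exact settings used above:

```python
from xgboost import XGBClassifier

# Gradient-boosted trees; scale_pos_weight up-weights the minority (default) class.
xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=4,
    scale_pos_weight=3,
    eval_metric="logloss",
)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
```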

Classification report

Roc curve

Future work

  • You can perform hyperparameter tuning on the XGBoost classifier.
  • You can use other models such as AdaBoost, SVM, etc. for better accuracy.
  • The ratio of “default” to “no default” is about 1:3 in the dataset, which may affect the accuracy of each model. There are several ways to address this, for instance SMOTE (see the sketch after this list).
  • Feature engineering is not applied in this kernel. For a real business case, it is better to communicate with different teams to figure out the best approach; after all, “there is no free lunch”. For example, one-hot encoding of the categorical variables could be used.
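
A minimal sketch of the SMOTE idea mentioned above, assuming the imbalanced-learn package is installed; oversampling is applied to the training set only, never to the test set:

```python
from imblearn.over_sampling import SMOTE

# Synthesize new minority-class (default) samples so the training classes are balanced.
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print(y_train_res.value_counts())
```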

Happy reading!! Keep coding!!!!

Reference:-

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
