Exploratory Data Analysis (EDA) and Customer Segmentation of Credit Score Classification Dataset

Oct 28, 2023

Introduction

In today’s data-driven world, especially in finance and banking, understanding how customers behave is like discovering a valuable treasure. This knowledge is essential for dividing customers into groups. Segmentation doesn’t just boost financial success; it also helps businesses tailor their services to each customer group and meet that group’s needs.

In this project, my goal is to conduct exploratory data analysis to extract insights from the data and group customers based on their information. This segmentation will help us delve deeper into their behavior and guide our future investigations in marketing and strategy planning.

Tools

Orange Data Mining

Dataset

The dataset comes from Kaggle [1]. It contains basic bank details and credit-related information collected by a finance company.

Overview of the Dataset

The dataset consists of 28 variables and includes some minor data quality issues, such as missing values, outliers, and incorrectly formatted values.

Because of these data quality issues, I cleaned the data to address missing values, outliers, and incorrectly formatted values. I then kept only the variables relevant to the topic I wanted to investigate, which left 16 variables.
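The cleaning itself was done in Orange, and the exact rules are not listed in this article. Purely as an illustration, the sketch below shows one common way to handle the three issue types in plain Python: strip stray characters from numeric strings, fill missing values with the median, and clip outliers to the 1.5 × IQR fences. The sample values, the stray-underscore format, and the helper names `parse_number` and `clean_column` are assumptions made for this example, not the procedure actually used.

```python
import re
import statistics

def parse_number(raw):
    """Return a float from a messy string, or None if nothing numeric is left."""
    if raw is None:
        return None
    cleaned = re.sub(r"[^0-9.\-]", "", str(raw))  # drop stray characters such as "_"
    try:
        return float(cleaned)
    except ValueError:
        return None  # empty or still malformed -> treat as missing

def clean_column(raw_values):
    """Parse values, impute missing ones with the median, clip outliers to IQR fences."""
    parsed = [parse_number(v) for v in raw_values]
    present = [v for v in parsed if v is not None]
    median = statistics.median(present)
    q1, _, q3 = statistics.quantiles(present, n=4)  # quartiles of the observed values
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    filled = [median if v is None else v for v in parsed]
    return [min(max(v, low), high) for v in filled]

# Hypothetical raw Age values: a stray underscore, a blank, and an implausible outlier.
raw_age = ["23", "28_", "", "35", "41", "30", "8000", "27"]
print(clean_column(raw_age))
```

Median imputation and IQR clipping are only one reasonable choice. Dropping the affected rows, or using domain limits such as a maximum plausible age, would be equally valid depending on the analysis.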

Variable descriptions (as provided by the dataset author [1])

Customer_ID : Represents a unique identification of a person
Age : Represents the age of the person
Occupation : Represents the occupation of the person
Annual_Income : Represents the annual income of the person
Monthly_Inhand_Salary : Represents the monthly base salary of a person
Delay_from_due_date : Represents the average number of days delayed from the payment date
Num_of_Delayed_Payment : Represents the average number of payments delayed by a person
Credit_Mix : Represents the classification of the mix of credits
Outstanding_Debt : Represents the remaining debt to be paid (in USD)
Credit_Utilization_Ratio : Represents the utilization ratio of credit card
Credit_History_Age : Represents the age of credit history of the person
Payment_of_Min_Amount : Represents whether only the minimum amount was paid by the person
Total_EMI_per_month : Represents the monthly EMI payments (in USD)
Payment_Behaviour : Represents the payment behavior of the customer (in USD)
Monthly_Balance : Represents the monthly balance amount of the customer (in USD)
Credit_Score : Represents the bracket of credit score (Poor, Standard, Good)

Exploratory Data Analysis

1) Correlation between numeric variables

To examine how the numeric variables relate to one another, we can run a correlation analysis. I did this in Orange and then exported the results to Excel to make them easier to read.

The correlation analysis revealed one interesting finding:

Outstanding_Debt and Delay_from_due_date have a positive correlation of 0.472. This is a moderate association: customers with higher outstanding debt tend to pay more days after the due date, although correlation alone does not show that one causes the other.
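For readers who want to compute a coefficient like this outside Orange, the sketch below implements the Pearson correlation directly from its definition. It assumes Pearson is the measure in question (Orange's Correlations widget also offers Spearman, and the article does not say which was used). The two short lists are placeholder numbers, not rows from the dataset, so the printed value is unrelated to the 0.472 reported above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder values for illustration only (not taken from the dataset).
outstanding_debt = [500.0, 800.0, 1200.0, 1500.0, 2400.0]
delay_from_due_date = [5, 12, 9, 20, 30]
r = pearson(outstanding_debt, delay_from_due_date)
print(f"r = {r:.3f}")
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or none.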

2) Customer demographics and financial information

Customer credit score distribution: The distribution graph reveals that Standard is the most common credit score among customers, followed by Poor, with Good the least common.

Customer payment behavior distribution: The distribution graph shows that High_spent_Medium_value_payments is the most common payment behavior, followed in order by Low_spent_Small_value_payments, High_spent_Large_value_payments, Low_spent_Medium_value_payments, and High_spent_Small_value_payments. Low_spent_Large_value_payments is the least common.
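A ranking like the two above comes down to counting how often each category appears. As a minimal sketch, assuming the category labels are available as a plain Python list, the count and share of each category could be computed as follows. The helper name `category_shares` and the six sample labels are made up for the example and do not reflect the dataset's real proportions.

```python
from collections import Counter

def category_shares(values):
    """Return (category, count, share) tuples sorted from most to least common."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, n, n / total) for cat, n in counts.most_common()]

# Placeholder labels for illustration only (not the dataset's real proportions).
credit_scores = ["Standard", "Poor", "Standard", "Good", "Standard", "Poor"]
for cat, n, share in category_shares(credit_scores):
    print(f"{cat:<10} {n:>3} {share:.0%}")
```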

3) Customer demographics and financial information grouped by credit score

Age and credit score: The graph shows the average age of customers grouped by credit score. Customers with a Good credit score average around 37 years, those with a Poor credit score around 33 years, and those with a Standard credit score around 35 years.

Annual income and credit score: The graph shows the average annual income of customers grouped by credit score. Customers with a Good credit score average around 61,000 USD, those with a Poor credit score around 55,000 USD, and those with a Standard credit score around 60,000 USD.

Outstanding debt and credit score: The graph shows the average outstanding debt of customers grouped by credit score. Customers with a Good credit score average around 750 USD, those with a Poor credit score around 1,595 USD, and those with a Standard credit score around 960 USD. The right graph shows the distribution of outstanding debt for each credit score and points to the same conclusion as the left graph.

Number of days delayed from the payment date and credit score: The graph shows the average number of days delayed from the payment date, grouped by credit score. Customers with a Good credit score average around 10 days, those with a Poor credit score around 24 days, and those with a Standard credit score around 17 days.

In summary, customers with a Good credit score have an average age of around 37 years, an average annual income of approximately 61,000 USD, an average outstanding debt of about 750 USD, and an average delay of around 10 days from the payment date.

Customers with a Poor credit score have an average age of around 33 years, an average annual income of approximately 55,000 USD, an average outstanding debt of about 1,595 USD, and an average delay of around 24 days from the payment date.

Customers with a Standard credit score have an average age of around 35 years, an average annual income of approximately 60,000 USD, an average outstanding debt of about 960 USD, and an average delay of around 17 days from the payment date.
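Each profile above is a set of per-group averages. The sketch below shows one way such a table could be built in plain Python, assuming the cleaned data is held as a list of dictionaries keyed by column name. The helper `group_means` and the four sample rows are hypothetical and only illustrate the mechanics; they are not real customers from the dataset.

```python
from collections import defaultdict
from statistics import mean

def group_means(rows, group_key, value_keys):
    """Average each value column within each group of `group_key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    return {
        group: {key: mean(r[key] for r in members) for key in value_keys}
        for group, members in groups.items()
    }

# Placeholder rows for illustration only (not real customers from the dataset).
rows = [
    {"Credit_Score": "Good", "Age": 40, "Outstanding_Debt": 700.0},
    {"Credit_Score": "Good", "Age": 34, "Outstanding_Debt": 800.0},
    {"Credit_Score": "Poor", "Age": 30, "Outstanding_Debt": 1600.0},
    {"Credit_Score": "Poor", "Age": 36, "Outstanding_Debt": 1500.0},
]
profile = group_means(rows, "Credit_Score", ["Age", "Outstanding_Debt"])
for score, averages in profile.items():
    print(score, averages)
```

The same helper would work with any other categorical column as the grouping key, such as Payment_Behaviour.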

4) Customer financial information grouped by payment behavior

Payment Behavior and Annual Income: The graph shows the average annual income of customers categorized by their payment behavior. Customers labeled Low_spent_Medium_value_payments have the highest average annual income. This is interesting because we might expect customers labeled High_spent_Large_value_payments to have the highest average annual income, since we often think of people with high incomes as having a “go big or go home” attitude.

Clustering

To gain a deeper understanding of customer behavior, I cluster the customers into segments using K-means, an unsupervised machine learning technique. I leave Credit_Score out of this process because it is a classification supplied by the dataset author, and including it could duplicate information already carried by the other variables.

The picture above shows the result of K-means clustering with K = 3. I chose K = 3 for two reasons. First, Orange cannot compute the silhouette score for datasets with more than 5,000 samples, so I could not use it to determine K. Second, the Credit_Score and Credit_Mix variables both divide customers into three categories, so I used three clusters as well.
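One way around the 5,000-sample limit is to estimate the silhouette score on a random subsample. The sketch below is a from-scratch illustration in plain Python, not a reproduction of Orange's implementation. It standardizes the features, runs Lloyd's K-means with a deterministic farthest-first seeding (chosen here so the result is reproducible), and averages the silhouette width over a subsample. Standardizing, the seeding rule, all function names, and the synthetic two-feature points are assumptions made for this example.

```python
import math
import random

def standardize(points):
    """Rescale each feature to zero mean and unit variance so no column dominates."""
    columns = list(zip(*points))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(columns, means)]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds)) for p in points]

def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add the
    point whose nearest already-chosen centroid is farthest away."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing -> converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

def silhouette(points, labels, sample_size=500, seed=0):
    """Mean silhouette width, estimated on a random subsample to limit O(n^2) cost."""
    idx = list(range(len(points)))
    if len(idx) > sample_size:
        idx = random.Random(seed).sample(idx, sample_size)
    scores = []
    for i in idx:
        dists = {}  # cluster label -> distances from point i to sampled members
        for j in idx:
            if j != i:
                dists.setdefault(labels[j], []).append(math.dist(points[i], points[j]))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            scores.append(0.0)  # singleton cluster or only one cluster in the sample
            continue
        a = sum(own) / len(own)                            # mean distance within own cluster
        b = min(sum(d) / len(d) for d in dists.values())   # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Synthetic two-feature points around three made-up centres (not dataset values).
rng = random.Random(42)
centres = [(0, 0), (10, 0), (5, 10)]
raw = [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
       for cx, cy in centres for _ in range(50)]
X = standardize(raw)
for k in (2, 3, 4):
    labels, _ = kmeans(X, k)
    print(k, round(silhouette(X, labels, sample_size=100), 3))
```

A higher silhouette score suggests better-separated clusters, so comparing the score across several values of K gives a data-driven check on the choice. If scikit-learn is available, its KMeans class and silhouette_score function (which accepts a sample_size argument) cover the same ground with far less code.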

Exploring the cluster groups

Customer cluster group distribution: The distribution graph reveals that Cluster 3 (C3) is the largest cluster, followed by Cluster 1 (C1), with Cluster 2 (C2) the smallest.

Age and cluster group: The graph shows the average age of customers in each cluster. The average is around 37 years for Cluster 1 (C1), around 32 years for Cluster 2 (C2), and around 34 years for Cluster 3 (C3).

Annual income and cluster group: The graph shows the average annual income of customers in each cluster. The average is around 62,000 USD for Cluster 1 (C1), around 49,000 USD for Cluster 2 (C2), and around 59,000 USD for Cluster 3 (C3).

Outstanding debt and cluster group: The graph shows the average outstanding debt of customers in each cluster. The average is around 730 USD for Cluster 1 (C1), around 2,450 USD for Cluster 2 (C2), and around 1,000 USD for Cluster 3 (C3).

Number of days delayed from the payment date and cluster group: The graph shows the average number of days delayed from the payment date for each cluster. The average is around 10 days for Cluster 1 (C1), around 35 days for Cluster 2 (C2), and around 19 days for Cluster 3 (C3).

Cluster 1 (C1) is the second-largest group, with an average age of around 37 years, an average annual income of approximately 62,000 USD, an average outstanding debt of about 730 USD, and an average of around 10 days delayed from the payment date.

Cluster 2 (C2) is the smallest group, with an average age of around 32 years, an average annual income of approximately 49,000 USD, an average outstanding debt of about 2,450 USD, and an average of around 35 days delayed from the payment date.

Cluster 3 (C3) is the largest group, with an average age of around 34 years, an average annual income of approximately 59,000 USD, an average outstanding debt of about 1,000 USD, and an average of around 19 days delayed from the payment date.

Conclusion

My journey through Exploratory Data Analysis (EDA) and Customer Segmentation with the Credit Score Classification Dataset has been a valuable learning experience. I’ve gained insights into handling large datasets and improving my understanding of basic bank details and credit-related information. During this project, I uncovered interesting findings about the dataset and how to group customers.

All in all, I’m excited about my journey into data-driven analysis of financial and banking data, and I’m looking forward to discovering more in this field.

Presentation Slide

https://drive.google.com/file/d/1BdG9dbxY6eTpVphEMUVT8s9XQOiTW28-/view?usp=sharing

Reference

[1] https://www.kaggle.com/datasets/parisrohan/credit-score-classification?select=train.csv