Clustering based on bank account

Michael Campbell
INST414: Data Science Techniques
4 min read · May 10, 2022

Have you ever wondered what your bank account says about you? Using the Credit Card Dataset from Kaggle, I hope to find out what clustering can tell us about a group of customers. To start off, I looked for columns of interest among these options:

CUSTID : Identification of Credit Card holder (Categorical)
BALANCE : Balance amount left in their account to make purchases
BALANCEFREQUENCY : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
PURCHASES : Amount of purchases made from account
ONEOFFPURCHASES : Maximum purchase amount done in one-go
INSTALLMENTSPURCHASES : Amount of purchase done in installment
CASHADVANCE : Cash in advance given by the user
PURCHASESFREQUENCY : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
ONEOFFPURCHASESFREQUENCY : How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
PURCHASESINSTALLMENTSFREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
CASHADVANCEFREQUENCY : How frequently the cash in advance being paid
CASHADVANCETRX : Number of Transactions made with "Cash in Advanced"
PURCHASESTRX : Number of purchase transactions made
CREDITLIMIT : Limit of Credit Card for user
PAYMENTS : Amount of Payment done by user
MINIMUM_PAYMENTS : Minimum amount of payments made by user
PRCFULLPAYMENT : Percent of full payment paid by user
TENURE : Tenure of credit card service for user

To better understand the dataset and see which columns would be of interest, I plotted every column except the ID as a histogram.

num_cols = df.select_dtypes(exclude=['object']).columns.tolist()
df[num_cols].hist(bins=15, figsize=(20, 15), layout=(5, 4));

After looking at these, I decided on balance and purchases, which would hopefully tell me about different people's purchase behavior based on their balance. These are the distributions of balance and purchases.

From here, I started looking for the number of clusters to use in my K-means clustering. To do this, I ran scikit-learn's KMeans in a loop, increasing the number of clusters on each iteration. The inertia_ attribute gives the sum of squared distances from each point to its nearest cluster center (the sum of squared error), which can be used to see how well the clustering fits.

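from sklearn.cluster import KMeans
k_rng = range(1, 10)  # candidate cluster counts: 1 through 9
sse = []
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['BALANCE', 'PURCHASES']])
    sse.append(km.inertia_)  # sum of squared error for this k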
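To make the inertia_ number a bit more concrete, here is a small sketch that computes the same kind of quantity by hand. The points and centers are made-up toy values, not rows from the Kaggle dataset, and the helper names are just for illustration.

```python
# Toy 2-D points (hypothetical values, not from the Kaggle dataset).
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
# Hypothetical cluster centers, standing in for fitted centers.
centers = [(1.0, 2.0), (10.0, 9.0)]


def squared_distance(p, c):
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def inertia(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


print(inertia(points, centers))       # two centers
print(inertia(points, [(5.5, 5.5)]))  # one center at the overall mean
```

With these toy values the two-center result is much smaller than the one-center result, which is the drop the loop above records as k grows.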

Once we have that list, we can plot it as a line graph and use the elbow method to pick a reasonable number of clusters.
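I picked the elbow by eye, but if you want a rough numeric check, one possible heuristic is to find the point on the curve that sits farthest below the straight line joining its first and last points. This is only a sketch of that idea: the function name elbow_k and the SSE values are hypothetical, and it assumes the SSE list has at least two values and ends lower than it starts.

```python
def elbow_k(ks, sse):
    """Return the k whose (k, SSE) point lies farthest below the straight
    line joining the first and last points of the curve."""
    x_span = ks[-1] - ks[0]
    y_span = sse[0] - sse[-1]
    best_k, best_gap = ks[0], 0.0
    for k, s in zip(ks, sse):
        x = (k - ks[0]) / x_span    # rescale k to [0, 1]
        y = (s - sse[-1]) / y_span  # rescale SSE to [0, 1]
        gap = 1.0 - x - y           # vertical gap between the line and the curve
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Hypothetical SSE values for k = 1..9 (not the values from my run).
example_ks = list(range(1, 10))
example_sse = [1000, 400, 150, 120, 100, 85, 75, 68, 62]
print(elbow_k(example_ks, example_sse))
```

A heuristic like this is a sanity check rather than a replacement for looking at the graph, since real curves often bend gradually.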

I decided to go with three clusters, which gave me a scatter plot that looks like this:

Now we can see three general clusters in yellow, green, and purple. The purple cluster seems to be people who have low balances and low purchases. The green cluster is people who have about the same amount of purchases, but with a larger range of balances. The yellow cluster is people who have a higher amount of purchases regardless of balance. I thought it was interesting that most of the yellow cluster sits in the lower balance range, meaning they used most of their money on purchases.
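Reading clusters off colors can be backed up with a quick per-cluster summary. The sketch below is not the code from my notebook: it uses synthetic blobs as a stand-in for the BALANCE and PURCHASES columns, and the n_init and random_state arguments are my own additions to keep the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for df[['BALANCE', 'PURCHASES']]: three separated blobs.
X = np.vstack([
    rng.normal(loc=(500, 300), scale=100, size=(50, 2)),
    rng.normal(loc=(6000, 500), scale=100, size=(50, 2)),
    rng.normal(loc=(2000, 8000), scale=100, size=(50, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)  # one cluster label per row

# Size and mean (balance, purchases) of each cluster.
for c in range(3):
    members = X[labels == c]
    print(c, len(members), members.mean(axis=0).round(0))
```

On the real data, the same loop would show how many customers fall in each cluster and what their typical balance and purchases look like.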

Next, I wanted to look at balance and one-off purchases to see if I would get similar results. One-off purchases are the maximum purchase amount done in one go. This would tell us whether most people are spending their money in one lump sum or over longer periods. I thought it would be important to get some descriptive statistics on that column.

count     8636.000000
mean       604.901438
std       1684.307803
min          0.000000
25%          0.000000
50%         44.995000
75%        599.100000
max      40761.250000

This gives a good understanding of the general distribution of the one-off purchases data. From there, I found my k value using the sum of squared error again.

This graph led me to pick 2 as my number of clusters. K-means with two clusters then gives me a scatter plot that looks something like this.

This clustering seems somewhat similar to the original one in that there is a small group of people who have small one-off purchases and small balances. It is also interesting to see that right after this cluster ends, there are some pretty large purchases, meaning those people spent most of their balance on that one-off purchase. That being said, most people seem to spend smaller amounts at one time.

One limitation of the dataset is that it doesn’t seem to be linked to a specific bank or country, so you don’t really know the context behind the results. Aside from this, there also seem to be a lot of zeros in the dataset, which may be skewing the results.
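A quick way to see how much zeros can pull things around is to compare the mean and median with and without them. The amounts below are made up for illustration and are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical purchase amounts with many zeros (not from the dataset).
amounts = [0.0, 0.0, 0.0, 0.0, 50.0, 100.0, 150.0, 1700.0]

# Fraction of values that are exactly zero.
zero_share = sum(1 for a in amounts if a == 0) / len(amounts)
# The same data with the zeros left out.
nonzero = [a for a in amounts if a != 0]

print(zero_share)
print(mean(amounts), median(amounts))
print(mean(nonzero), median(nonzero))
```

In this toy example half the values are zero, and both the mean and the median shift a lot once they are dropped, so it is worth checking the share of zeros in a column before reading too much into its clusters.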

Link to code repository: Code
