Credit Card Clustering (K-Means)

Andrew Dziedzic
Web Mining [IS688, Spring 2022]
5 min read · Apr 2, 2022

The non-obvious insight I would like to extract from the raw customer credit card data is a marketing one: how would a marketing department create campaigns for specific groups of people, based on how many clusters the data contains and on the specific characteristics of each cluster? The goal is to understand the differences between customer clusters and to properly create or modify marketing campaigns for the individuals within each one. The motivation is to segment customers based on their credit card data and run targeted campaigns for each customer within each cluster.

The dataset comes from a very popular Kaggle dataset consisting of just under 9,000 customer records. Each record has a specific customer ID attached to it, followed by 17 metrics; among the most notable are balance, purchases, credit limit, payments, and tenure. The dataset is very clean and requires minimal cleansing once the raw data is downloaded directly from the Kaggle website.
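The loading step is not shown in the notebook excerpts below; a minimal sketch, assuming the standard Kaggle file name and its CUST_ID column:

import pandas as pd

# Hypothetical path; adjust to wherever the Kaggle CSV was saved.
df = pd.read_csv('CC GENERAL.csv')

# Drop the non-numeric customer ID and fill the few missing values
# with column medians, the only cleansing the raw file really needs.
df = df.drop(columns=['CUST_ID'])
df = df.fillna(df.median())
print(df.shape)  # just under 9,000 rows, 17 metrics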

After performing Exploratory Data Analysis (EDA), we see that there appear to be roughly equal numbers of people who purchase often with their credit cards and people who rarely do. This is a fascinating finding, because the balance distribution examined earlier shows far more people with zero or low balances than with high balances; intuitively, if frequent and infrequent purchasers are equally common, we would expect the balance distribution to contain both low and high values as well. All of the metrics in the dataset are right-skewed, with a clear peak at x = 0, except for two (2) metrics: purchases and balance. Once the “elbow” method determined the appropriate k value, the same cluster behavior clearly appears within purchases and balance, giving a clear indication that these two (2) metrics are the key drivers of the four (4) clusters found.
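A quick sketch of that EDA step, assuming the column names from the Kaggle file and the imports shown further down:

import matplotlib.pyplot as plt

# Histograms of every metric: most pile up at x = 0 with a long right
# tail, while PURCHASES and BALANCE show noticeably more spread.
df.hist(bins=40, figsize=(16, 12))
plt.tight_layout()
plt.show()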

The specific k value I selected for the k-means algorithm comes from the elbow method for finding the optimal number of clusters. The elbow method plots the within-cluster sum of squares (WCSS) against the number of clusters (the k value); the “elbow” point of the resulting curve marks the optimal k to use. I experimented with cluster counts from 1 to 10 and graphed inertia (WCSS) against the cluster number. Inertia measures how close the data points in each cluster are to their cluster center: the lower the inertia, the better the points fit their respective clusters. The goal is to find the point where WCSS is as low as possible while also keeping the number of clusters as low as possible. The optimal number of clusters is four (4), since that is where the graph starts to flatten out; a higher number of clusters would not yield a better result. I believe this is the most efficient and mathematically sound approach to finding the optimal k value for any k-means run.
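In symbols, the quantity plotted on the y-axis (scikit-learn's inertia_) is the within-cluster sum of squared distances from each point to its assigned centroid:

\mathrm{WCSS}(K) \;=\; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2}

where C_k is the set of points assigned to cluster k and \mu_k is that cluster's centroid.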

Each cluster in my data represents a specific group of individual credit card holders who share very similar quantitative characteristics. From a quick analysis, one can see that the four clusters divide individual credit card users based on purchases and balance, specifically how high their purchases and balances are. Looking at balance, cluster #2 has the highest balance, followed by cluster #0, with clusters #3 and #1 mixed together at the bottom. The same holds for purchases: cluster #2 has the highest purchases, followed by clusters #0, #3, and #1 in order (a per-cluster summary, sketched after the list below, makes this ranking explicit).

Cluster #2 = highest spending individual cardholder

Cluster #0 = medium spending individual cardholder

Cluster #3 = low spending individual cardholder

Cluster #1 = zero balance individual cardholder
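That ranking can be verified numerically with a per-cluster average; a minimal sketch, assuming the label column added to df after fitting (shown in the code further down) and the BALANCE and PURCHASES column names from the Kaggle file:

# Mean balance and purchases per cluster, highest spenders first.
summary = (df.groupby('label')[['BALANCE', 'PURCHASES']]
             .mean()
             .sort_values('PURCHASES', ascending=False))
print(summary)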

In conclusion, with all of this information, including the exploratory data analysis performed before finding the optimal k value and running the k-means clustering algorithm, any credit card company can use these clusters to target the individuals within them with a differentiated marketing campaign per cluster. Each campaign would vary, and the clusters provide the insight needed to brainstorm the marketing material for each group of individual credit card holders. Individuals in clusters #2 and #0 clearly have the capability to spend large amounts; their spending habits could be used to optimize strategies that encourage them to spend even more than their current rate. The analysis also brings to light the potential of clusters #3 and #1: the individuals in these clusters carry balances but purchase very little. With the right motivation, the right tactics, and the right marketing plan, these individuals may begin to use their cards more and increase their spending, which in turn becomes a significant source of revenue for a credit card company.

The use of visualization libraries was a key feature of the Python code used in the Jupyter notebook:

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.style.use('fivethirtyeight')
from sklearn.cluster import KMeans
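One step the snippets below rely on but do not show is building scaled_df, the standardized feature matrix passed to KMeans. K-means is distance-based, so the 17 metrics should be put on a common scale first; a minimal sketch, assuming scikit-learn's StandardScaler:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardize every metric so large-dollar columns (BALANCE, PAYMENTS)
# do not dominate the Euclidean distance computation.
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)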

The “elbow” method code to find the optimal k value:

from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=300, random_state=0)
    km.fit(scaled_df)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker='x')
plt.title('Elbow Method [Finding Optimal K Value]', fontsize=25)
plt.xlabel('# of Clusters [K Value]')
plt.ylabel('WCSS [Inertia]')
plt.show()
# Build the model at 4 clusters and add the labels to the dataframe
km = KMeans(n_clusters=4, init='k-means++', n_init=10, max_iter=300, random_state=0)
label = km.fit_predict(scaled_df)
df['label'] = label

Utilizing the seaborn library for the orange heatmap:

sns.heatmap(df.corr(), cmap='Oranges', annot=True)
plt.title('Heat Map', fontsize=20)

Utilizing the seaborn library for the scatterplot of Balance vs. Purchases:

# A closer look at balance vs. purchases with cluster centers (overlaid below)
plt.rcParams['figure.figsize'] = (16, 12)
sns.scatterplot(data=df, x='BALANCE', y='PURCHASES', hue='label',
                palette=['black', 'red', 'green', 'yellow'])
plt.title('Clusters [Balance vs Purchases]')
plt.xlabel('BALANCE')
plt.ylabel('PURCHASES')
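The centroids themselves live in the standardized space, so plotting them on the raw axes requires inverting the scaling; a sketch, assuming the scaler and fitted km from the earlier snippets:

import pandas as pd

# Map the 4 centroids back to original units and overlay them as stars.
centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                       columns=scaled_df.columns)
plt.scatter(centers['BALANCE'], centers['PURCHASES'],
            s=400, c='blue', marker='*', label='centroids')
plt.legend()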

The four (4) clusters below, as explained above:

The ‘elbow’ method as explained above:

Optimal k value found at # of Clusters [K Value] = 4; this is where the last significant bend in the line occurs.

Heat map showing correlation of all variables:
