Clustering: Extracting Patterns From Data & Concept + Feature Scaling + PCA [Pt.1]

Customer segmentation based on credit card usage behavior

Vinicius Nala
10 min readApr 2, 2023

This is the first article in a series in which I will walk through an end-to-end clustering project. We will begin with some initial concepts, move on to exploratory analysis of the dataset, pass through data preprocessing, group the customers into clusters, and, at the end, analyze the clusters and offer marketing suggestions.
There is a lot to do, so let's start!

To help you follow along, the notebook is available on my GitHub and Kaggle.

Source Code

Notebook on Kaggle:

Notebook on GitHub:

Table of contents

  • Clustering Definition
  • Supervised Learning vs. Unsupervised Learning
  • K-Means Clustering
  • Dataset Dictionary
  • Project Aim
  • EDA (Exploratory Data Analysis)
  • Feature Scaling
  • PCA (Principal Component Analysis)
  • Cluster Visualization with Plotly
  • Conclusion

Clustering Definition

Clustering, or data grouping analysis, is a set of data mining techniques that aim to group data automatically according to its degree of similarity. The similarity criterion depends on the problem and on the algorithm. The result of this process is the division of a data set into a certain number of groups (clusters).

This is a very important definition, and everything we do from here on relates to it in some way.

Another very important concept is the difference between the two approaches below.

Supervised Learning vs. Unsupervised Learning

Supervised learning is an approach in which an algorithm is trained on input data that has been labeled with a particular output. The model fits the data, capturing the underlying patterns and relationships between the input data and the output labels, which enables it to predict outcomes accurately.

In my two previous data science projects, I built supervised machine learning models, which I used to predict a "target variable" (label).

For example, in the first project, I built a model to predict whether a person would have survived the Titanic. To do that, I provided data containing general characteristics of each passenger along with the information of whether that person survived (the target variable); the model captured the underlying patterns and relationships between the passengers' characteristics and their survival. Then I presented it with never-before-seen data containing the general characteristics of other passengers, but without the survival information, and the model was responsible for guessing who would survive.

Unlike supervised learning, unsupervised learning uses unlabeled data, i.e., we don't provide the target variable (label) when training the model. These algorithms aren't used to predict something but to discover hidden patterns in the data without the need for human intervention (hence, they are "unsupervised").

If the difference is still not clear, you will understand it better as we go through the project. Let's continue!

K-Means Clustering

The goal of clustering is to identify meaningful groups of data. These groups can be analyzed in depth, or passed as a feature or as an outcome to a classification or regression model. K-Means was the first clustering method to be developed; it has a very simple algorithm and is still widely used.

K-Means divides the data into K clusters, each with a centroid. The main aim of the algorithm is to minimize the sum of the squared distances of each point to the centroid of its assigned cluster. K-Means does not ensure the clusters will have the same size, but it finds the clusters that are best separated.

Let's see a simple example. Imagine that we have N records and two variables, X and Y. Suppose we want to split the data into K = 3 clusters, which means assigning each record (Xi, Yi) to one of the K clusters.

The algorithm first selects random coordinates for the centroids and, using the Euclidean distance formula, measures the distance between each data point and each cluster centroid. Then each point is assigned to the closest centroid.

Euclidean distance between a point (Xi, Yi) and a centroid (Xk, Yk): d = √((Xi − Xk)² + (Yi − Yk)²)

After that, the coordinates of the centroids are re-computed by taking the mean of all data points contained in that cluster. Given an assignment of Nk records to cluster k, the center of the cluster (Xk, Yk) is calculated through the equation:

Xk = (1/Nk) Σ Xi   and   Yk = (1/Nk) Σ Yi, where the sums run over the Nk records assigned to cluster k.

In simple terms, we are just summing the X and Y coordinates of the points in the cluster and dividing by the number of points in that cluster.

After this process, the algorithm computes the sum of squares within each cluster, which is given by the formula:

SSk = Σ ((Xi − Xk)² + (Yi − Yk)²), where the sum runs over the points assigned to cluster k.

K-Means keeps repeating the process of measuring the distance between the points and the centroids, assigning each data point to the closest centroid, and re-computing the new coordinates of the centroids until the sum of squares across all three clusters is minimized.

Repeating the same process many times can have a high computational cost, especially when the amount of data is large.
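To make this loop concrete, here is a minimal from-scratch sketch in NumPy. It is purely illustrative (later in the project we will use scikit-learn's ready-made KMeans), and the function name and defaults are my own:

```python
import numpy as np

def kmeans(points, k=3, n_iters=100, seed=42):
    """Minimal K-Means on an (N, 2) array of (x, y) records (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # 1. Pick k of the points at random as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Euclidean distance from every point to every centroid -> shape (N, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # 3. Assign each point to its closest centroid
        labels = distances.argmin(axis=1)
        # 4. Re-compute each centroid as the mean of the points assigned to it
        #    (assumes no cluster ends up empty, which a real implementation must handle)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # 5. Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```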

If you want to see for yourself how it works, I highly recommend this simulator:

Dataset Dictionary

The dataset that will be used consists of the credit card usage behavior of 8,950 customers over 6 months, described by 18 behavioral features. The dataset is available on Kaggle.

  • CUST_ID: Identification of Credit Card holder (Categorical)
  • BALANCE: Balance amount left in their account to make purchases
  • BALANCE_FREQUENCY: How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
  • PURCHASES: Amount of purchases made from account
  • ONEOFF_PURCHASES: Maximum purchase amount done in one-go
  • INSTALLMENTS_PURCHASES: Amount of purchase done in installment
  • CASH_ADVANCE: Cash in advance given by the user
  • PURCHASES_FREQUENCY: How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
  • ONEOFF_PURCHASES_FREQUENCY: How frequently purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
  • PURCHASES_INSTALLMENTS_FREQUENCY: How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
  • CASH_ADVANCE_FREQUENCY: How frequently the cash advance is being paid
  • CASH_ADVANCE_TRX: Number of transactions made with “Cash in Advance”
  • PURCHASES_TRX: Number of purchase transactions made
  • CREDIT_LIMIT: Limit of Credit Card for user
  • PAYMENTS: Amount of Payment done by user
  • MINIMUM_PAYMENTS: Minimum amount of payments made by user
  • PRC_FULL_PAYMENT: Percent of full payment paid by user
  • TENURE: Tenure of credit card service for user

Project Aim

Initially, our focus will be to segment customers according to their similarities. After that, we will analyze this segmentation and define an effective credit card marketing strategy.

EDA (Exploratory Data Analysis)

Let’s begin by importing the libraries and the dataset:
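A minimal version of this step might look like the following (the CSV file name is an assumption based on the Kaggle download; adjust the path to wherever you saved it):

```python
import numpy as np
import pandas as pd

# Load the credit card dataset (file name assumed from the Kaggle download)
df = pd.read_csv("CC GENERAL.csv")
print(df.shape)  # expected: (8950, 18)
```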

Visualizing some characteristics of the dataset:
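For instance, a quick look with pandas:

```python
df.head()        # first rows
df.info()        # column types and non-null counts
df.describe().T  # summary statistics for each numeric column
```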

Through Pandas Profiling we can see a very detailed report about the general characteristics of the data set.
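A sketch of how the report can be generated (at the time the article was written the package was called pandas-profiling; it has since been renamed to ydata-profiling):

```python
from ydata_profiling import ProfileReport  # formerly: from pandas_profiling import ProfileReport

# Build an interactive HTML report with distributions, correlations, and missing-value stats
profile = ProfileReport(df, title="Credit Card Customers - Profiling Report")
profile.to_notebook_iframe()  # or: profile.to_file("report.html")
```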

After analyzing the report, I concluded that the features CUST_ID and TENURE are not going to contribute to the segmentation, so we can exclude them from the dataset.
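For example:

```python
# The identifier and TENURE do not help the segmentation, so drop them
df = df.drop(columns=["CUST_ID", "TENURE"])
df.shape  # (8950, 16)
```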

Now our dataset has 16 columns.

Let’s see how many missing values there are in the dataset.
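For example:

```python
# Count missing values per column
df.isnull().sum().sort_values(ascending=False)
```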

Just a few. To deal with them, I will use KNNImputer: each missing value will be imputed using the mean value from the n_neighbors nearest neighbors; in this case, the mean value of the 5 nearest neighbors.

This way, instead of filling all the missing values with the overall mean or median, which is a simpler but cruder approach, we reduce the risk of biasing the clustering results.
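A sketch of this imputation step, using scikit-learn's KNNImputer with its default of 5 neighbors:

```python
from sklearn.impute import KNNImputer

# Each missing value is replaced by the mean of that feature
# across the 5 nearest neighbors (default n_neighbors=5)
imputer = KNNImputer(n_neighbors=5)
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
df.isnull().sum().sum()  # should now be 0
```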

Almost every variable is strongly right-skewed or left-skewed, which indicates that there may be some outliers. This is not very surprising, since we are working with credit card data: there will certainly be a small portion of people with a very high amount of money and a high credit limit, while the majority have more or less similar amounts of money and credit limits.
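One quick way to check this skewness is a histogram grid, for instance:

```python
import matplotlib.pyplot as plt

# One histogram per feature to inspect skewness and possible outliers
df.hist(bins=50, figsize=(16, 12))
plt.tight_layout()
plt.show()
```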

Feature Scaling

If you look at the frequency columns, such as BALANCE_FREQUENCY and PURCHASES_FREQUENCY, they vary within an interval of 0 to 1, where 0 means 0% frequency and 1 means 100% frequency. Comparing them with the columns BALANCE and PURCHASES, you will see that those have no upper limit; we only know that their minimum can be 0.

However, if you feed these columns into the clustering as they are right now, it will not yield high-quality clusters, because the algorithm will treat a difference of 1.00 dollar in BALANCE as being just as significant as a difference of 1.00 in BALANCE_FREQUENCY, which is the entire 0-to-100% range.

Because of that, we need to put all the columns on the same scale; otherwise, it would be the same as clustering people by their weight in kilograms and their height in meters: is a 1 kg difference in weight as significant as a 1 m difference in height?

This is why scaling the dataset is a vital step when using clustering techniques. Here we will use Normalizer to scale the dataset, but there are many ways to scale a dataset, and each one suits a specific situation.

Normalizer, unlike the other methods, works on the rows, not the columns. This seems very unintuitive, but it means that each value is scaled according to the other values in its row, not according to the values in its column.

By default, L2 normalization is applied to each observation so that the values in a row have unit norm. Unit norm with L2 means that if each element of the row were squared and summed, the total would equal 1. As a consequence, Normalizer transforms all the features into values between -1 and 1.
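A tiny sketch of what that means in practice:

```python
from sklearn.preprocessing import Normalizer

# L2 normalization is applied row by row: the row [3, 4] has norm 5, so it becomes [0.6, 0.8]
Normalizer(norm="l2").fit_transform([[3.0, 4.0], [1.0, 0.0]])
# array([[0.6, 0.8],
#        [1. , 0. ]])
```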

PCA (Principal Component Analysis)

PCA is a method used in unsupervised machine learning (such as clustering) that reduces high-dimensional data to a smaller number of dimensions while preserving as much information as possible. Using PCA before applying a clustering algorithm reduces the number of dimensions, removes some of the noise in the data, and decreases the computational cost.
In this article, the number of features will be reduced to 2 dimensions so that the clustering results can be visualized.

To better organize these two preprocessing steps, we will combine them into a single step with a Pipeline:
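A sketch of such a pipeline (the step names and the variable name preprocess are mine, not necessarily the notebook's):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA

# Chain the two preprocessing steps: row-wise scaling, then reduction to 2 dimensions
preprocess = Pipeline(steps=[
    ("normalizer", Normalizer()),
    ("pca", PCA(n_components=2, random_state=42)),
])
```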

After the preprocessing, the dataset will look like this:
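For example:

```python
# Apply the pipeline and keep the two resulting components as "x" and "y"
df_transformed = pd.DataFrame(preprocess.fit_transform(df), columns=["x", "y"])
df_transformed.head()
```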

Cluster Visualization with Plotly

With the preprocessing done, all that remains is the modeling and the visualization. To do this, I will create a function so that we can reuse it whenever we want.

First, we import the KMeans class for the clustering and the Plotly library for the cluster visualization. Then we write a function that takes as parameters the dataset used to train the KMeans model, the preprocessing pipeline, and the number of clusters for KMeans.

Internally, the function transforms the dataset according to the steps in the pipeline, assuming that this transformation returns only two columns, which will be named “x” and “y”. It then creates the KMeans model, passing the number of clusters specified in the function’s parameters as an argument, so we can visualize as many clusters as we want. Finally, it plots a scatter graph of the clustering, where each color is a cluster.
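A sketch of what such a function might look like (the names plot_clusters, data, pipeline, and n_clusters are illustrative, not the exact code from the notebook):

```python
import pandas as pd
import plotly.express as px
from sklearn.cluster import KMeans

def plot_clusters(data, pipeline, n_clusters):
    """Preprocess the data, fit KMeans, and plot the resulting clusters."""
    # Apply the preprocessing pipeline; after PCA we are left with exactly two columns
    reduced = pd.DataFrame(pipeline.fit_transform(data), columns=["x", "y"])
    # Fit KMeans with the requested number of clusters (all other parameters at their defaults)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    reduced["cluster"] = kmeans.fit_predict(reduced[["x", "y"]]).astype(str)
    # Scatter plot of the two components, one color per cluster
    fig = px.scatter(reduced, x="x", y="y", color="cluster",
                     title=f"K-Means clustering with {n_clusters} clusters")
    fig.show()
```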

Now, let’s visualize our clustering with 5 clusters.
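Assuming the pipeline and the function sketched above, this is a single call:

```python
plot_clusters(df, preprocess, n_clusters=5)
```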

Remember: almost all parameters of the KMeans model inside the function are left at their default values; the only one we change is the number of clusters. Therefore, we could make the clustering even better by tuning the other parameters of the KMeans model.

Let’s see how our clustering would look with 10 clusters.

Conclusion

In this first part, we started the project by learning some concepts and understanding how clustering works. In part two of the series, we will continue the project by learning the most common metrics used to validate a clustering and how to use these metrics to find the ideal number of clusters for this dataset.

Since the content will be about metrics, we will have to get into some math equations, so I will explain many things in mathematical terms (which people usually dislike, but the subject requires it), and I will assume that you have at least basic algebra knowledge.
