Note: All the code related to this post can be found here: https://github.com/arexdevson/Analises-Python/blob/main/KPrototype_PowerB_I.ipynb. The Power BI report can be viewed here: https://app.powerbi.com/view?r=eyJrIjoiZjZmMDA0ZTAtNzBhZC00ZDQ2LTlmMzUtOTY2ZjllMmIxZGYzIiwidCI6IjVjYmQwMzc0LTFjZWMtNDVkOS05ZDc4LTIzNTMxMGIzNjY5MCJ9
Cluster analysis, or clustering, is a technique used to group similar records based on shared characteristics. These records can be anything from customers and equipment behavior to vehicle routes and temperature distributions in a given environment.
Why is clustering important in these scenarios? Let’s break it down:
- Customers: For a service-providing company, clustering allows it to identify specific characteristics of each group and determine targeted actions for each, ultimately helping the company achieve its goals.
- Equipment: By clustering equipment based on factors like location, temperature, number of operations performed, and failure rates, a company can determine if certain conditions lead to more frequent failures than others.
And so on…
In our case, we’ll focus on an example involving customers, specifically addressing the topic of CHURN (Customer Cancellation/Attrition).
Business Case
In our example, we act as data analysts for a bank, tasked with identifying patterns behind customer churn and recommending strategies the commercial team can use to improve customer relationships.
Context:
We analyzed data with 10,000 records (source: Kaggle Bank Customer Churn Dataset). The company operates in three countries (Germany, France, and Spain), offering a maximum of four products from its portfolio. The dataset includes male and female customers, with general information about their usage of banking services.
Explanation: (The following code was executed in Google Colab but can be run in any Python interpreter: https://github.com/arexdevson/Analises-Python/blob/main/KPrototype_PowerB_I.ipynb)
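To ground the steps that follow, here is a minimal sketch of how the dataset could be loaded and inspected. The file name Churn_Modelling.csv follows the usual layout of this Kaggle dataset and is an assumption; adjust it to match your download.

```python
import pandas as pd

# Load the Kaggle bank-churn dataset (file name assumed; adjust to your download)
df = pd.read_csv("Churn_Modelling.csv")

# Quick sanity checks: shape, data types, and missing values
print(df.shape)          # expected: (10000, n_columns)
print(df.dtypes)
print(df.isna().sum())   # check for null fields before clustering
```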
For this clustering example, I used the unsupervised machine learning technique K-Prototype (which combines the concepts of K-Means and K-Modes).
- Supervised Learning: This approach is used when we feed our predictive model with independent variables to predict a dependent variable. For example, we might input the amount spent on marketing (independent variable) to predict the number of customers acquired (dependent variable). The model then creates a logic to forecast the number of customers based on marketing expenditure.
- Unsupervised Learning: In this case, we do not have a final prediction objective based on the features. Instead, we want the model to identify patterns or characteristics without any prior input from us. In simpler terms, the model analyzes the dataset and outputs the grouping it finds most appropriate based on similarities between records.
We used this technique because our dataset includes both categorical data (e.g., “Country”) and numerical data (e.g., “Bank Balance”).
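As a concrete illustration, a minimal K-Prototype fit with the kmodes library might look like the sketch below. The column names and the choice of categorical indices are assumptions based on the dataset description above, not the exact code from the notebook.

```python
from kmodes.kprototypes import KPrototypes

# Mixed feature matrix: categorical columns first, numeric columns after
# (column names assumed from the dataset description)
cols = ["Geography", "Gender", "CreditScore", "Age", "Tenure",
        "Balance", "NumOfProducts", "EstimatedSalary"]
X = df[cols].to_numpy()

# Positions of the categorical columns within X ("Geography", "Gender")
categorical_idx = [0, 1]

# Fit K-Prototype with 4 clusters (the number justified later via the elbow curve)
kproto = KPrototypes(n_clusters=4, init="Huang", random_state=42)
labels = kproto.fit_predict(X, categorical=categorical_idx)
df["Cluster"] = labels
```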
Example of the Dataset:
Client Information:
- Country: The country where the client lives.
- Age: The age of the client.
- Annual Salary: The estimated annual salary of the client.
- Balance: The balance the client had before churning.
- Active Member: Whether the client was actively using the bank’s app or not.
- Credit Card: Whether the client had a credit card or not.
- Gender: The gender of the client.
- Products Number: The number of products used by the client.
- Score: The client’s credit score.
- Tenure (Years): The number of years the client has been with the bank.
Data Processing: After gaining a basic understanding of the data (checking for missing records, null fields, correct data types, etc.), scaling was applied to the Balance, Credit Score, and Estimated Salary variables so that they contribute on a comparable scale during clustering.
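A minimal sketch of that scaling step, assuming scikit-learn’s StandardScaler and the column names used above:

```python
from sklearn.preprocessing import StandardScaler

# Put Balance, Credit Score, and Estimated Salary on the same scale
num_cols = ["Balance", "CreditScore", "EstimatedSalary"]
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```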
Validation:
Beyond just running the code (found here: https://github.com/arexdevson/Analises-Python/blob/main/KPrototype_PowerB_I.ipynb), an interesting approach to understanding the clusters before actually performing the final clustering is to build what’s called the Elbow Curve. In simple terms, this curve shows how much each additional cluster contributes (or doesn’t) to the model’s ability to capture the variance and similarities in the data.
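A sketch of how the Elbow Curve could be built for K-Prototype, fitting one model per candidate cluster count and recording the model’s cost_ attribute (the total dissimilarity). It reuses X and categorical_idx from the fitting sketch above; the range of candidate counts is an assumption.

```python
import matplotlib.pyplot as plt
from kmodes.kprototypes import KPrototypes

# Fit one model per candidate number of clusters and record its cost
costs = []
k_range = range(2, 9)
for k in k_range:
    kp = KPrototypes(n_clusters=k, init="Huang", random_state=42)
    kp.fit_predict(X, categorical=categorical_idx)
    costs.append(kp.cost_)

# The "elbow" is where adding clusters stops reducing the cost meaningfully
plt.plot(list(k_range), costs, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Cost (sum of dissimilarities)")
plt.title("Elbow Curve for K-Prototype")
plt.show()
```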
Interpreting the Graph: Looking at the graph above, it becomes evident that the clusters start to show less “distinction” between their characteristics from the 5th cluster onwards: adding new clusters no longer contributes significantly to the separation of the groups. This evaluation step is typical for unsupervised models and precedes the final creation of the clusters. In this case, I decided to proceed with 4 clusters.
Additional Validation: Another validation performed was using the concept of PCA (Principal Component Analysis) to reduce the dimensionality of the data to 2D. This helps to visualize and analyze the identified clusters, making it easier to understand how the data is distributed across the different groups.
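One way to reproduce that 2D view, assuming the categorical columns are one-hot encoded before PCA (PCA itself only handles numeric data) and reusing the cols list from the fitting sketch:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# One-hot encode categoricals so PCA receives a fully numeric matrix
X_num = pd.get_dummies(df[cols], columns=["Geography", "Gender"])

# Project the data onto its two principal components
pca = PCA(n_components=2)
coords = pca.fit_transform(X_num)

# Scatter the projection, colored by the K-Prototype cluster labels
plt.scatter(coords[:, 0], coords[:, 1], c=df["Cluster"], cmap="viridis", s=10)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Clusters projected onto 2 principal components")
plt.show()
```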
There is a higher density between Cluster 0 and Cluster 3 (reinforcing the use of these two as focal points), and their points are well separated along a clear, almost linear progression.
Clusters 1 and 2 exhibit greater variance in their data, suggesting possible subgroups among records that sit closer to the two main clusters.
Thus, the separation into four clusters resulted in the following proportions:
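The proportions themselves can be read straight from the labels; a one-line sketch, assuming the Cluster column created in the fitting sketch above:

```python
# Share of records assigned to each of the four clusters
print(df["Cluster"].value_counts(normalize=True).sort_index())
```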
EDA (Data Exploration)
Great!
We have successfully segmented our dataset into groups in which Clusters 0 and 3 hold a larger proportion of records than the others, indicating that their characteristics are more prominent in the dataset. To explore these characteristics further, I developed a report in Power BI (https://app.powerbi.com/view?r=eyJrIjoiZjZmMDA0ZTAtNzBhZC00ZDQ2LTlmMzUtOTY2ZjllMmIxZGYzIiwidCI6IjVjYmQwMzc0LTFjZWMtNDVkOS05ZDc4LTIzNTMxMGIzNjY5MCJ9).
If you perform the same analysis, share it with me! I’d be happy to see your take on it and the indicators you choose!
See ya!