K-means Analysis on Bank Customer Data
Introduction: Clustering is one of the most common ways to analyze data. It works best with numerical data, because distances between numerical values make it easier to find trends and group similar records into clusters. The goal is to discover the distinct groupings in a dataset. When I began looking for data to use, I found a dataset on bank customers and decided it would be a cool topic to look into. I do not have much prior knowledge of bank data beyond my own personal account, so I thought it would be interesting to see how k-means clusters people who use banks.
1. Question and Stakeholder
The question I came up with is: What are the underlying patterns in the customer groups that provide business to this bank? The main stakeholders are the bank’s marketing and customer retention teams, who are interested in understanding what defines each customer group so they can implement retention strategies and reduce revenue loss. Other people could be stakeholders too, such as the customers themselves or people like me who are simply interested in studying trends at banks. However, the stakeholders who would get the most use out of this analysis are the bank and its employees, as described above.
2. Data Description
The dataset contains customer information from a bank and includes fields such as customer ID, gender, age, geography, credit score, and balance. This data is relevant to the question as it provides insights into customer demographics, financial behavior, and interactions with the bank’s services.
3. Data Collection
The data was obtained from Kaggle. Here is the link to the dataset: https://www.kaggle.com/datasets/divu2001/customer-churn-rate
I downloaded the CSV file “Churn_Modelling.csv”, which contained all the data I used in my code and analysis. Here is a snippet of the data when it was first loaded into my Python file:
4. Data Cleaning
The data cleaning involved dropping columns irrelevant to the analysis, encoding a few categorical variables, and handling missing values, if any. I removed the columns RowNumber, Surname, and Exited, and set the index to CustomerId. For the encoding, I mapped Gender so that Male is 0 and Female is 1, and Geography so that France is 0, Spain is 1, and Germany is 2. This cleaning ensures that all the data used in the k-means cluster analysis is numerical. Here is a snippet of the data after the cleaning:
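For readers who want to reproduce the cleaning, here is a minimal sketch of the steps described above using pandas. It is not the exact code from my repository: the toy rows are made up for illustration and only cover some of the columns, the `clean` function name is hypothetical, and dropping rows with missing values is an assumption about how missing data could be handled.

```python
import pandas as pd

# Toy rows standing in for Churn_Modelling.csv (values are illustrative only).
raw = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [101, 102, 103],
    "Surname": ["A", "B", "C"],
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Male", "Female"],
    "Age": [42, 41, 42],
    "Balance": [0.0, 83807.86, 159660.80],
    "Exited": [1, 0, 1],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described in the text (hypothetical helper)."""
    # Drop the columns that are not used in the clustering.
    out = df.drop(columns=["RowNumber", "Surname", "Exited"])
    # Use the customer ID as the row index instead of a feature.
    out = out.set_index("CustomerId")
    # Encode the categorical columns as numbers.
    out["Gender"] = out["Gender"].map({"Male": 0, "Female": 1})
    out["Geography"] = out["Geography"].map({"France": 0, "Spain": 1, "Germany": 2})
    # Assumption: rows with missing values are simply dropped.
    return out.dropna()


cleaned = clean(raw)
print(cleaned)
```

To run this on the real file, `raw` would be replaced with `pd.read_csv("Churn_Modelling.csv")`.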
5. Similarity Measurement
In this analysis, Euclidean distance is used to measure how similar two customers are. The distance is calculated over the features Gender, Age, Geography, CreditScore, Balance, Tenure, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary.
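To make the similarity measure concrete, here is a small sketch that computes the Euclidean distance between two customers by hand. The `euclidean_distance` helper and the two example customers are hypothetical, and only five of the ten features are used to keep the example short.

```python
import math

# Feature order for the example vectors (a subset of the real features).
FEATURES = ["Gender", "Age", "Geography", "CreditScore", "Balance"]


def euclidean_distance(a, b):
    """Square root of the summed squared differences between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two made-up customers, already encoded as numbers.
customer_a = [1, 42, 0, 619, 0.0]
customer_b = [0, 41, 1, 608, 83807.86]

d = euclidean_distance(customer_a, customer_b)
print(f"Distance between the two customers: {d:.2f}")
```

In this example almost all of the distance comes from the gap in Balance, because that feature has much larger raw values than the others.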
6. Selection of K
To choose k, the number of clusters used in the analysis, I used the elbow method. The elbow graph plots the values of k on the x-axis against the corresponding within-cluster sum of squares (WCSS) on the y-axis. The optimal k is the point where the curve bends into an elbow, that is, where adding more clusters stops reducing the WCSS by much. In this case, k=10 was chosen based on the elbow method. To do this in Python, I calculated the WCSS for each k and then plotted those values against k. Here is the graph that was created:
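The WCSS calculation described above can be sketched with scikit-learn, where a fitted `KMeans` model exposes the WCSS as its `inertia_` attribute. This is an illustration rather than my exact code: the synthetic data, the `elbow_wcss` helper name, and the `n_init` and `random_state` settings are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three tight blobs, standing in for the cleaned customer table.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])


def elbow_wcss(data, k_values):
    """Fit k-means once per k and collect the WCSS (inertia) for each fit."""
    wcss = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        wcss.append(model.inertia_)
    return wcss


k_values = list(range(1, 11))
wcss = elbow_wcss(X, k_values)

for k, value in zip(k_values, wcss):
    print(f"k={k:2d}  WCSS={value:.1f}")
```

Plotting `k_values` against `wcss` (for example with matplotlib's `plt.plot(k_values, wcss, marker="o")`) produces the elbow graph. On this synthetic data the bend appears at k=3 because the data was generated with three blobs; on the bank data the bend has to be read off the real curve.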
7. Cluster Interpretation
Following the elbow method’s recommendation, I used k=10 for the number of clusters. Here is the number of customers that ended up in each of the 10 clusters:
Each cluster seems to represent a group of customers with similar characteristics. First, here is the head of the data printed out, with the added column cluster:
Below are some takeaways from looking through the data. When reading each one, keep in mind that not every person in a cluster follows the exact pattern I list. For example, a cluster that tends to be female still contains some males.
Cluster Zero: Cluster zero tends to be people with the following characteristics: Females between the ages of 35–45 with medium credit scores, a high balance and a medium estimated salary.
Cluster One: Cluster one tends to be people with the following characteristics: Females between the ages of 35–45, from either France or Spain, with high credit scores, a high balance and a high estimated salary.
Cluster Five: Cluster Five tends to be people with the following characteristics: Males with low balances, low credit scores and a medium estimated salary. Age did not seem to be a factor in this cluster, as it was all over the place, unlike in clusters zero and one.
Cluster Six: Cluster Six tends to be people with the following characteristics: Males between the ages of 45–60, from Germany, a high balance, a medium credit score and a medium estimated salary.
These are just a few examples of the features that connect customers within a cluster. If you would like to see the patterns for every cluster, look at the output of the Python code and change the value passed to the head function to print out as many customers as you want to see.
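As an alternative to reading individual rows, each cluster can also be summarized by its size and its average feature values. The sketch below shows one way to do that with pandas. The toy table is made up, and the column name `Cluster` is an assumption about how the cluster labels are stored.

```python
import pandas as pd

# Toy table standing in for the cleaned data with cluster labels attached.
clustered = pd.DataFrame({
    "Gender": [1, 1, 0, 0, 1, 0],
    "Age": [38, 44, 52, 57, 41, 49],
    "Balance": [120000.0, 135000.0, 98000.0, 110000.0, 128000.0, 101000.0],
    "Cluster": [0, 0, 1, 1, 0, 1],
})

# Number of customers in each cluster.
sizes = clustered["Cluster"].value_counts().sort_index()

# Average value of every feature within each cluster.
profile = clustered.groupby("Cluster").mean()

print(sizes)
print(profile)
```

Because Gender is encoded as 0 and 1, its mean within a cluster is the share of female customers in that cluster, which makes statements like "this cluster tends to be female" easy to check.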
8. Answer to the Question
The analysis reveals that each cluster has distinct features and attributes that tie its customers together. However, even though the elbow method suggested 10 clusters, there is some overlap between groups. For example, clusters zero and one both tend to be females between the ages of 35–45. No two groups are exactly the same, though, because the other characteristics of clusters zero and one separate them from each other. By understanding these differences, the bank can tailor marketing strategies and customer retention programs to the needs of different customer groups and thereby reduce churn rates.
9. Limitations
Here is a list of some limitations of the analysis.
- The dataset from Kaggle may be biased, as it only includes data from three countries.
- No specific bank is named on the Kaggle page, so it is difficult to determine which bank is being studied here.
- The analysis is based on static customer data and does not consider dynamic factors influencing churn, such as recent interactions or feedback.
- Data on customer diversity is limited. Age and gender are included, but other diversity factors are not.
10. GitHub Repository
Here is the link to the GitHub repository with the code for this assignment: https://github.com/DrossTheBoss/INST414-ModuleAssignment4