Customer Segmentation

Brandon Fung
INST414: Data Science Techniques
5 min read · Apr 1, 2024

Introduction

Customer segmentation is an analytical strategy that groups a company’s customers by shared characteristics so that marketing, product development, and customer service efforts can be targeted more precisely. One question it can help answer is, “Which customer segments are most likely to be interested in our new product line?” Product managers and marketing teams have a direct stake in this question, because the answer can inform targeted marketing campaigns, product positioning strategies, and inventory planning. The data needed to address it would typically include customer demographics (age, gender, income level), past purchase history (products bought, purchase frequency), and engagement metrics (website visits, email open rates). This information makes it possible to identify patterns and preferences within customer segments, which in turn guides personalized marketing messages and offers that match potential buyers’ specific needs and interests.

Data Collection and Cleaning

I sourced a subset of suitable data from Kaggle, which provided the key features needed for my customer segmentation project.

Here is a snippet of the data collected:

The data mixes categorical variables (gender, marital status, college graduation status, profession, and spending score) with numerical variables (age, years of work experience, and family size), so the first task was to encode the categorical features. I used one-hot encoding for the nominal variables and ordinal encoding for those with a natural order, converting them into a numeric format that the clustering algorithm could process.

Here is what the data looks like after encoding:

Note: There are more columns for all the different types of professions that are not shown
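The encoding step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names, the toy rows, and the Low/Average/High ordering of the spending score are assumptions made for the example.

```python
import pandas as pd

# Toy rows standing in for the Kaggle data; column names and values are assumptions.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Profession": ["Healthcare", "Engineer", "Homemaker"],
    "Spending_Score": ["Low", "Average", "High"],
    "Age": [22, 38, 63],
})

# Ordinal encoding: spending score has a natural order, so map it to integers.
# The exact order used here is an assumption.
spending_order = {"Low": 0, "Average": 1, "High": 2}
df["Spending_Score"] = df["Spending_Score"].map(spending_order)

# One-hot encoding: each nominal column becomes one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["Gender", "Profession"], dtype=int)

print(encoded)
```

Each profession becomes its own column (for example Profession_Healthcare), which is why the encoded table is much wider than the raw one.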

K-means clustering depends on a measure of similarity between data points, and I computed it over a blend of these encoded features. For the numerical features, similarity was calculated using the Euclidean distance, the straight-line distance between points in the multidimensional space of the data. For the encoded categorical features, similarity came down to whether two customers shared the same categories after encoding: under one-hot encoding, matching categories add nothing to the distance, while mismatched categories push the points apart. This way every attribute, whether it described demographics, lifestyle, or spending habits, contributed to measuring how similar or different two customers were, which laid the groundwork for defining distinct customer segments.
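To make the distance calculation concrete, here is a small sketch with two hypothetical customers. The feature layout (age, work experience, then two one-hot profession columns) and the values are assumptions chosen only to show how numeric and one-hot features each contribute to the Euclidean distance.

```python
import numpy as np

# Hypothetical encoded rows: [age, work_experience, prof_Healthcare, prof_Engineer]
a = np.array([22.0, 1.0, 1.0, 0.0])
b = np.array([25.0, 5.0, 0.0, 1.0])

def euclidean(u, v):
    """Straight-line distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Distance over the numeric features only.
dist_numeric = euclidean(a[:2], b[:2])

# Squared distance over the one-hot columns only:
# 0 when the professions match, 2 when they differ.
cat_sq = float(np.sum((a[2:] - b[2:]) ** 2))

# Distance over all features combined.
dist_all = euclidean(a, b)

print(dist_numeric, cat_sq, dist_all)
```

In this toy case the numeric columns contribute far more to the distance than the profession mismatch, because their values span a wider range. Whether the original analysis rescaled the features is not stated, so that is worth keeping in mind when reading the clusters.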

Choosing the Optimal Number of Clusters

To determine the optimal number of clusters for my customer segmentation, I used the elbow method, a widely used technique that plots the within-cluster sum of squares (WCSS) against the number of clusters. Starting with a broad range of candidate cluster counts, I ran the k-means algorithm once for each count and recorded the resulting WCSS, which measures how compact the clusters are: the lower the value, the tighter the clusters. Because WCSS always falls as clusters are added, I looked for the point on the plot where adding more clusters stopped producing a substantial decrease. This point, which visually resembles an “elbow” on the graph, indicated that four clusters was the most appropriate choice. The method helped balance the granularity of the segmentation against the practicality of running a distinct marketing strategy for each segment.
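The elbow procedure described above might look something like the sketch below. It runs on synthetic data with four well-separated groups rather than the Kaggle dataset, and the range of 1 to 8 clusters, the random seeds, and the variable names are assumptions for illustration. In scikit-learn, the WCSS of a fitted model is exposed as the inertia_ attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: four well-separated blobs in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [8, 0], [0, 8], [8, 8]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for each candidate cluster count and record the WCSS.
wcss = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = model.inertia_  # within-cluster sum of squares

for k, value in wcss.items():
    print(k, round(value, 1))
```

Plotting these values against k gives the elbow curve. On this synthetic data the drop from three to four clusters is large and the drop from four to five is small, which is the pattern the elbow method looks for.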

Below is a graph of the WCSS:

Analysis

After analyzing the characteristics and patterns within each of the four clusters derived from my dataset, it became apparent that each cluster represented a distinct customer profile based on behaviors and attributes. For instance, cluster 0 seemed to capture young professionals (average age of 27) who generally have not graduated from college and have a low spending score. One individual in this cluster is a 22-year-old male who is unmarried, has not graduated from college, has about 1 year of work experience, works in the healthcare industry, and has a low spending score.

On the other hand, cluster 1 appeared to consist of older (average age of 54), more educated individuals who have a low to average spending score. One person in this cluster is a 63-year-old male who is married, graduated from college, has about 8 years of work experience, lists homemaker as his profession, and has an average spending score.

These distinctions led me to infer that cluster 0 likely represents a segment that values growth, learning, and tech-savviness, making its members good targets for the latest technological products and professional development opportunities. Conversely, cluster 1 likely represents a more mature segment that prioritizes stability and long-term investments, and is therefore more receptive to products and services geared towards home and family life.
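Cluster profiles like the ones above can be produced by grouping the labeled data by cluster and summarizing each feature. The sketch below shows one way to do that with pandas. The rows are made-up toy values, not records from the dataset, and the column names are assumptions.

```python
import pandas as pd

# Made-up rows with a cluster label attached; not taken from the real dataset.
df = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "Age": [20, 25, 30, 50, 55, 60],
    "Graduated": ["No", "No", "Yes", "Yes", "Yes", "No"],
    "Spending_Score": ["Low", "Low", "Average", "Average", "Average", "Low"],
})

def most_common(series):
    """Return the most frequent value in a column."""
    return series.mode().iloc[0]

# One row per cluster: average age, most common categories, and cluster size.
profile = df.groupby("cluster").agg(
    avg_age=("Age", "mean"),
    top_graduated=("Graduated", most_common),
    top_spending=("Spending_Score", most_common),
    size=("Age", "size"),
)

print(profile)
```

A table like this supports statements such as "average age of 27" or "low to average spending score". Picking out an individual row from each cluster, as done above, then gives a concrete example to go with each summary.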

Limitations

While the segmentation analysis provided valuable insights into customer behavior and preferences, it’s important to acknowledge its limitations and potential biases. Firstly, the reliance on pre-existing data from Kaggle means the analysis might not fully capture the diversity and nuances of the entire customer base; certain groups could be underrepresented depending on the dataset’s original collection methodology. Additionally, the decision to encode categorical variables and the choice of using the elbow method for determining the number of clusters may introduce biases. The encoding process can sometimes oversimplify complex human attributes, while the subjective nature of identifying the “elbow” point could lead to different interpretations of the optimal number of clusters. Moreover, the analysis does not account for temporal changes in customer behavior — what was true at the time of data collection might not hold in the future, limiting the analysis’s long-term applicability. To mitigate these limitations, future work could include gathering more comprehensive, real-time data directly from a broader customer base and exploring more dynamic clustering techniques that adapt over time, ensuring that the segmentation remains relevant and accurately reflects evolving customer profiles.

A link to the entire code and dataset can be found here.
