Customer Characterization and Profiling using Agglomerative Hierarchical Clustering

Navigating consumer diversity with precision insights from advanced clustering techniques

Ujwal Kandi
8 min read · Jan 7, 2024

Team: Abhinav Sharma, Harini Ala, Nirjari Mehta, Shivam Bhardwaj, Srushti Nandal, Ujwal Kandi

Photo by Jezael Melgoza on Unsplash

Customer Characterization and Profiling (CCP) is an in-depth approach to identifying and understanding the distinctive traits of an enterprise’s ideal client groups. It involves a thorough analysis of client habits, needs, and concerns, offering businesses key insights into their clientele. The project uses clustering techniques, specifically KMeans and hierarchical clustering, to identify distinct customer segments. The dataset provides detailed insights into ideal customers, including demographic information, education, marital status, income, and a history of purchases and responses to marketing campaigns.

Data Set Overview

We began our analysis by loading the “marketing_campaign.csv” dataset, which describes a business’s ideal customers. Its attributes cover customer demographics (e.g., birth year, education), household details (e.g., marital status, income), and detailed information about product purchases and responses to promotional campaigns.

marketing_campaign.csv
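
A minimal sketch of the loading step, under stated assumptions: the column names, the tab separator, the day-first date format, and the dropping of rows with missing income are all assumptions that the article does not confirm. The rows below are made up so the example runs on its own.

```python
import io

import pandas as pd

# Made-up rows for illustration only; they are NOT taken from the real file.
# Assumption: the file is tab-separated and has columns like these.
sample = io.StringIO(
    "ID\tYear_Birth\tEducation\tMarital_Status\tIncome\tDt_Customer\n"
    "1\t1975\tGraduation\tMarried\t58000\t04-09-2012\n"
    "2\t1988\tPhD\tSingle\t\t21-03-2014\n"
    "3\t1962\tMaster\tTogether\t71500\t10-11-2013\n"
)

# With the real file this might be:
#   pd.read_csv("marketing_campaign.csv", sep="\t")
df = pd.read_csv(sample, sep="\t")

# Parse the enrollment date (the day-first format is an assumption).
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y")

# A common preprocessing step (assumed here): drop rows with missing income.
df = df.dropna(subset=["Income"]).reset_index(drop=True)

print(df.shape)
print(df.dtypes)
```

If the real file turns out to be comma-separated, the `sep` argument can simply be dropped.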

Data Exploration and Preprocessing

We began by importing the libraries needed for data analysis and visualization. Exploring the data let us identify and analyze patterns such as the age distribution of our customer base, the relationship between educational background and parenting styles, and the average time required to convert prospects into new customers.

Age Distribution | How long does it take to acquire a new customer? | Educational Profile

This thorough exploration not only provided us with a clearer understanding of the dataset but also laid the groundwork for more advanced analytical techniques in subsequent stages of the project. For visualization, we utilized the powerful graphical tools offered by Seaborn and Matplotlib, enabling us to transform our data into insightful visual representations.

Customer Retention Duration Distribution | Which month is favorable for customer acquisition?

Dimensionality Reduction with PCA

In our approach to simplify and streamline the complexity of our dataset, we implemented Principal Component Analysis (PCA). This powerful technique reduced the dimensionality of our data while preserving its essential characteristics, thus enabling us to represent it in a more manageable three-dimensional space. This reduction not only facilitated easier visualization and interpretation but also enhanced the efficiency of subsequent analytical processes.

The graph generated by PCA visualizes the reduced-dimensional representation of the original dataset in a three-dimensional space.
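
The reduction to three components can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the project's code: the feature matrix is random, and standardizing before PCA is an assumption the article does not state.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric customer features
# (200 customers, 8 features); not the real dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))

# Assumption: standardize first so that large-scale features
# such as income do not dominate the components.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first three principal components.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)
# Fraction of the total variance retained by the three components.
print(pca.explained_variance_ratio_.sum())
```

Checking the summed `explained_variance_ratio_` is a quick way to see how much information the three-dimensional view retains.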

KMeans Clustering

Next, we leveraged the KMeans clustering algorithm to delineate and identify unique customer segments within our dataset. To ascertain the most effective number of clusters, we utilized the elbow method, a technique that helps in determining the point beyond which increasing the number of clusters leads to diminishing returns in terms of variance explained.

The optimal number of clusters for KMeans clustering is based on the “elbow” point in the distortion score plot, aiding in the selection of an appropriate number of clusters for subsequent analysis of the reduced-dimensional data obtained through PCA.
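
The elbow search can be sketched with plain scikit-learn. The "distortion score" wording suggests a plotting helper was used in the project; the sketch below instead records KMeans inertia directly, on synthetic points standing in for the PCA output. The range of k and the random seeds are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 3-D points standing in for the PCA-reduced data.
X_pca, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=0)

# Fit KMeans for a range of k and record the inertia, i.e. the sum of
# squared distances from each point to its nearest centroid.
ks = list(range(2, 11))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca)
    inertias.append(model.inertia_)

# The "elbow" is the k after which the inertia stops dropping sharply.
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` gives the familiar elbow curve; the choice of k is still a visual judgment call.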

The clusters thus identified were then represented through a vivid 3D scatter plot, providing a clear and intuitive visual depiction of the different customer groups and their characteristics.

The distinct clusters are differentiated by color; 5 clusters were identified, matching the optimal number determined by the elbow method.

Agglomerative Hierarchical Clustering

To delve deeper into the layered structure of our customer data, we employed agglomerative hierarchical clustering. This method offered a nuanced exploration of the data’s hierarchical organization. We used a dendrogram, a tree-like diagram, to effectively determine the most suitable number of clusters.

The dendrogram showcases the hierarchical relationships and the optimal number of clusters, aiding in the identification of distinct clusters based on the chosen truncation level.
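
A hedged sketch of building and cutting such a tree with SciPy. Ward linkage, the truncation level of 12, and the cut into 4 clusters are illustrative assumptions; the data is synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic 3-D points standing in for the PCA-reduced data.
X_pca, _ = make_blobs(n_samples=150, centers=4, n_features=3, random_state=1)

# Build the merge tree bottom-up. Ward linkage is an assumption here.
Z = linkage(X_pca, method="ward")

# Compute a truncated dendrogram layout (last 12 merged groups)
# without drawing it; drop no_plot=True to render with Matplotlib.
tree = dendrogram(Z, truncate_mode="lastp", p=12, no_plot=True)

# Cut the tree into at most 4 flat clusters (illustrative choice).
labels = fcluster(Z, t=4, criterion="maxclust")

print(Z.shape)
print(np.unique(labels))
```

Large vertical gaps between successive merges in the dendrogram are what usually guide the choice of where to cut.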

The insights gleaned from this method were again presented in the form of a 3D scatter plot, offering a different perspective and deeper understanding of customer segmentation, reflective of the inherent relationships and patterns within the dataset.

The hierarchical approach in cluster identification yields a linkage-driven arrangement of data points, offering a more interconnected view of the data distribution within the reduced feature space, in contrast to the more isolated clusters generated by KMeans.
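
To illustrate how the two partitions can be compared, the sketch below fits both models on the same synthetic points and measures their agreement with the adjusted Rand index. The comparison metric, the cluster count, and the Ward linkage are assumptions added for illustration, not steps reported in the project.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic 3-D points standing in for the PCA-reduced data.
X_pca, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=0)

# Agglomerative clustering: repeatedly merges the closest groups.
agg_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X_pca)

# KMeans: assigns points to the nearest of k centroids.
km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_pca)

# Adjusted Rand index: 1.0 means identical partitions,
# values near 0 mean agreement no better than chance.
ari = adjusted_rand_score(km_labels, agg_labels)

print(np.bincount(agg_labels))
print(round(ari, 3))
```

A high score suggests both methods see the same segments; a low score is a cue to inspect where the assignments diverge.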

Recommendations for Final Model

As we reached the culmination of our project, we presented tailored recommendations for selecting the most suitable clustering model. Our analysis suggested that businesses could opt for either KMeans or Agglomerative Hierarchical Clustering, depending on their unique needs and the specific characteristics of their data.

For businesses seeking a straightforward, efficient approach to segmenting large datasets, KMeans clustering could be the ideal choice. It’s particularly effective in scenarios where the number of clusters can be predetermined or estimated. This model is renowned for its simplicity and speed, making it a practical choice for quick segmentation tasks.

On the other hand, Agglomerative Hierarchical Clustering would be a more fitting choice for businesses that require a more nuanced understanding of their customer base. This method is particularly beneficial when the dataset contains complex, layered relationships that a simpler clustering method like KMeans might not fully capture. It’s also advantageous in situations where the number of clusters is not known in advance, as it allows for a more organic development of customer segments.

Ultimately, the decision between these two models should be guided by the specific requirements of the business, the nature of the data at hand, and the desired depth of customer segmentation. Each method has its strengths and is best suited to different types of clustering challenges.

Reasons to Select Agglomerative over KMeans

The decision to select Agglomerative Hierarchical Clustering over KMeans was a carefully considered choice, grounded in the following reasons and their extended implications:

1. Optimal Fit for Complex Data Structures

Agglomerative Hierarchical Clustering excels at capturing the intricate structures within complex datasets. Its ability to map out varied patterns and relationships makes it especially suited to datasets that are not straightforward and contain multiple layers of information.

2. Flexibility in Cluster Determination

This method stands out for its dynamic approach to determining the number of clusters, in contrast to KMeans, which requires a predetermined number. This flexibility is crucial when dealing with datasets where the optimal number of clusters isn’t clear, allowing for a more organic and accurate segmentation process.

3. Enhanced Resilience to Data Anomalies

The progressive linkage strategy of Agglomerative Hierarchical Clustering can make it more tolerant of outliers, although how tolerant depends on the linkage criterion chosen. With a suitable linkage, anomalous data points are less likely to skew the overall clustering results, leading to more reliable and representative segmentation.

4. Stability in the Face of Outliers and Noise

Agglomerative clustering merges similar data points rather than relying on centroid calculations as KMeans does, which can make it less susceptible to the disruptive effects of outliers and noisy data. This helps keep the clustering results stable, so that they better reflect the true nature of the dataset.

Ultimately, the adoption of Agglomerative Hierarchical Clustering is a strategic fit for the project, aligning well with the dataset’s unique characteristics and analytical goals. Its handling of unknown cluster numbers, robustness against outliers, and resistance to noise make it a strong tool for the intricate task of customer profiling, setting a precedent for future data-driven business strategies.

Insights and Observations

Our project yielded a wealth of invaluable insights into the dynamics of customer behavior, preferences, and spending habits. These observations are instrumental for businesses looking to enhance customer engagement and drive sales. Here are some expanded insights:

1. Comprehensive Customer Segmentation

We were able to categorize customers into distinct groups based on a combination of factors including their income levels, spending habits, and preferences for certain products. This segmentation is crucial for businesses to understand the diverse needs and expectations of their customers.

Plots for the specified attributes, showcasing the distribution of educational levels and living arrangements within each identified cluster, providing insights into the demographic composition of the customer segments

2. Identification of Premium Customer Groups

A significant finding was the recognition of a segment of high-value customers. These individuals are characterized by their higher income brackets and their tendency to spend more on specific product categories. Targeting these customers can be particularly beneficial for businesses focusing on high-end products or services.

The average spending distribution across clusters highlights the variations in spending behavior among the identified customer segments, with clusters 1 and 3 showing the most spending power.
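
A per-cluster spending summary of this kind can be sketched with a pandas group-by. The values below are made up purely so the example runs, and the column names `Cluster`, `Income`, and `Spent` are hypothetical; none of the numbers are results from the project.

```python
import pandas as pd

# Made-up values for illustration only; NOT the project's data.
# Column names are hypothetical.
customers = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 2, 2, 3, 3],
    "Income":  [30000, 34000, 72000, 68000, 41000, 39000, 85000, 90000],
    "Spent":   [120, 90, 1400, 1250, 300, 260, 1800, 1950],
})

# Average income and spending per cluster, highest spenders first.
profile = (
    customers.groupby("Cluster")[["Income", "Spent"]]
    .mean()
    .sort_values("Spent", ascending=False)
)

print(profile)
```

The same pattern extends to any numeric attribute, which makes it a convenient first step before plotting per-cluster bar charts.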

3. Demographic Insights and Behavioral Patterns

Our analysis brought to light how various demographic factors such as age, family size, and possibly educational background influence customer behavior.

The scatter plot comparing income and spending for each customer offers insights into the relationship between income levels and spending patterns, showing that cluster 3, followed by cluster 1, earns and spends the most.

For instance, younger customers might have different spending habits compared to older customers, and families might prioritize different products compared to single individuals.

Cluster 1, characterized by the highest proportion of graduates/postgraduates, exhibits elevated spending on wine and meat, suggesting a potential inclination towards socializing. In contrast, Cluster 2, despite having lower income, demonstrates a notable preference for gold purchases, hinting at a possible investment motive.

4. Response to Marketing Initiatives

Another key observation was understanding how different customer segments react to promotional campaigns. This insight is vital for businesses to design effective marketing strategies that resonate with each customer group, thereby maximizing the impact of their promotional efforts.

These insights collectively empower businesses to make informed decisions about product development, marketing strategies, and customer engagement tactics. Understanding these diverse customer dynamics is key to fostering stronger customer relationships and driving sustainable business growth.

The bar plot reveals distinct preferences in purchasing channels within each customer cluster and highlights that Cluster 3, characterized by the highest earnings/spending, makes the fewest Deals purchases among all segments.

Significance for Businesses

The Customer Characterization & Profiling (CCP) project underscored the crucial role of strategic customer analysis for business success.

The bar plots illustrate the mean values of numerical features across customer clusters, providing a visual comparison of key attributes beyond commodities and purchase places within each segment. For example, the life_time_value subplot reveals that Cluster 3 is the most valuable customer group, and NumWebVisitsMonth indicates that Cluster 2 has the highest level of online engagement and interaction.

The data reveals distinct customer segments based on their income, spending patterns, and purchasing behavior.

By employing advanced clustering techniques, the project facilitated several key business strategies:

1. Targeted Product and Marketing Customization. The insights gained from customer segmentation allow for the tailored development of products and marketing strategies. By understanding the unique needs and preferences of each customer group, businesses can create more relevant and appealing offerings, leading to increased customer satisfaction and loyalty.

2. Resource Optimization in Innovation and Promotion. The project’s findings aid in the strategic allocation of resources, particularly in areas of product development and marketing. By identifying which customer segments are most lucrative or responsive, businesses can focus their innovation and promotional efforts more efficiently, ensuring a better return on investment.

3. Focused Marketing Efforts. Understanding the responsiveness of different customer segments to various marketing strategies enables businesses to prioritize their efforts effectively. This targeted approach ensures that marketing resources are not wasted on unresponsive segments, but rather concentrated on those that yield the highest engagement and conversion rates.

4. Assessment of Campaign Effectiveness. The ability to evaluate the success of past marketing campaigns within each identified customer cluster is another critical advantage. This retrospective analysis helps businesses understand what worked and what didn’t, allowing them to refine their strategies for future campaigns.

Conclusion

Overall, the CCP project provides businesses with a more nuanced and data-driven approach to customer engagement. By leveraging the insights from customer characterization and profiling, businesses can enhance their product offerings, streamline marketing strategies, and ultimately achieve greater market success.

The fusion of meticulous data exploration, effective dimensionality reduction, and the application of sophisticated clustering algorithms can equip businesses with a critical understanding of their varied customer base. The insights and recommendations drawn from this study provide a solid foundation for businesses to apply these findings in practical scenarios, enhancing customer engagement and strategic decision-making.

Ujwal Kandi

Graduate student @ McCombs School of Business - UT Austin