Customer Segmentation Analysis Using K-Means: A Practical Guide

Oriol Gilabert López
12 min read · Feb 6, 2024


Explore the precision of the K-means algorithm in segmenting complex datasets into coherent clusters. This concise guide highlights its ability to reveal critical insights and hidden patterns, simplifying data analysis for strategic customer segmentation.


Customer segmentation stands as a pivotal technique for enterprises aiming to gain deeper insights into their customer base and refine their marketing and sales approaches. In this discourse, we shall delve into the utilization of the K-Means clustering algorithm for segmenting customers, employing Python as our tool of choice. To illustrate this methodology, a dataset originating from a shopping mall will be utilized, serving as a practical example to guide our exploration.

1. “Mall Customers” Dataset Overview

The “Mall Customers” dataset is frequently employed in machine learning endeavors, particularly for exercises focused on clustering and customer segmentation analysis. This dataset comprises a curated collection of features designed to facilitate the exploration of consumer behavior patterns within a retail context. The attributes included in the “Mall Customers” dataset are as follows:

  1. Customer ID: Serves as a distinctive identifier for each individual customer, ensuring anonymity while allowing for precise tracking and analysis.
  2. Gender: Categorizes customers into male and female groups, allowing for gender-based segmentation and analysis.
  3. Age: Represents the customer’s age, offering insights into demographic distributions and preferences.
  4. Annual income (K$): Indicates the customer’s estimated annual income in thousands of dollars, providing a quantitative measure of their economic status.
  5. Spending Score (1-100): A metric assigned by the mall to reflect the customer’s purchasing behavior and expenditure pattern, on a scale from 1 to 100. This score is derived from a combination of factors including transaction frequency, amount spent, and shopping preferences, offering a nuanced view of consumer engagement.

This dataset is instrumental in developing predictive models and strategies for targeted marketing, personalized shopping experiences, and enhanced customer relationship management.

import pandas as pd
dataset = pd.read_csv("Mall_Customers.csv")
dataset.head()  # preview the first rows (Table 1)
Table 1: Output of Mall_Customers.csv.

In this analysis, our attention will be centered on two key variables: Annual Income (K$) and Spending Score (1-100). To proceed, we will extract these specific columns from the dataset and transform them into a NumPy ndarray. Subsequently, this transformed data will be designated as X, which represents our feature matrix. This strategic approach facilitates the utilization of these variables as the primary inputs for our subsequent modeling and analysis stages.

import numpy as np

# Columns 3 and 4: Annual Income (k$) and Spending Score (1-100)
X = dataset.iloc[:, [3, 4]].values
Output 1: Result of the feature matrix X (ndarray).

Through the visualization of these two variables via a scatter plot, we gain an initial insight into the distribution of customers:

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.scatter(X[:, 0], X[:, 1], color='blue', marker='o')
plt.title('Scatter Plot')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.grid(True)
plt.show()
Chart 1: Scatter plot of the feature matrix X showing the relationship between the variables Annual Income (K$) and Spending Score (1–100).

Upon examining the initial scatter plot, it is incumbent upon us, as data scientists, to pose the following critical inquiry:

Do discernible patterns or inherent clusters emerge from the data?

To tackle this question, it is imperative to determine whether customers can be segmented based on shared attributes, such as income and spending scores. This analysis is crucial for elucidating the distinct characteristics of various clusters within the dataset.

In the pursuit of methodically categorizing the data into uniform groups, the application of clustering algorithms becomes essential. This approach not only facilitates a structured organization of data but also unveils the underlying structure, enabling more informed decision-making and strategy development.

2. Clustering Techniques in Data Science

Clustering algorithms serve as a pivotal method in data science for partitioning a dataset into distinct groups or clusters, where members of the same cluster exhibit higher similarity amongst themselves compared to those in different clusters. These techniques are particularly valuable in unsupervised learning contexts, where data lacks predefined labels or categories. Rather than fitting data into pre-established categories, clustering algorithms endeavor to uncover inherent structures or patterns within the data, organizing them into clusters based on these identified similarities. Prominent clustering algorithms include:

  • K-Means: This algorithm segregates the data into distinct groups, ensuring homogeneity within each cluster.
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise): It identifies clusters in dense regions of data points, disregarding outliers or less dense areas.
  • Hierarchical Clustering: It constructs nested clusters in a hierarchical fashion, either by progressively merging or splitting data points.

K-means, in particular, stands out for its utility in the business domain, attributed to its proficiency in deciphering data and extracting actionable insights. It empowers organizations to unearth latent trends within customer data, refine marketing and sales approaches, and make decisions that are both more enlightened and precise. Essentially, K-means simplifies intricate datasets into coherent and actionable insights, thereby facilitating strategic business advancements.

The selection of the K-means algorithm for this example is predicated on its straightforwardness and efficiency, rendering it an exemplary introduction for individuals embarking on the exploration of clustering techniques.

3. Delving Deeper into K-means

Prior to initiating the K-means algorithm, it is imperative to determine the optimal number of clusters (K) for our analysis. This decision poses a significant challenge, as often there is no clear indication of the ideal number of clusters that will most accurately represent the underlying patterns in our data.

For instance, in the scenario outlined, we might set K=1 (which generally lacks practical relevance), or explore a range of values such as K=2, K=3, up to K=n, where n could be, say, 6. Implementing the algorithm across various values of K allows us to evaluate and determine the most appropriate clustering solution.

Chart 2: Outcome of Iterating K-Means six times on the Feature Matrix X
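As a rough illustration of this iteration, the sketch below fits K-Means for K = 1 through 6 and plots each result side by side. It is only a sketch of how a grid like Chart 2 could be produced: the subplot layout, marker sizes, and color map are arbitrary choices for illustration.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-Means for K = 1..6 and draw each clustering in its own panel
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for k, ax in zip(range(1, 7), axes.ravel()):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="tab10", s=30)
    ax.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
               c="black", marker="^", s=80)
    ax.set_title(f"K = {k}")
plt.tight_layout()
plt.show()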

But what is the ideal number of clusters? Let me give it away up front: in this case it is K = 5.

The Elbow Method and other techniques are employed to estimate the optimal number of clusters (K) in the K-means algorithm. These techniques, particularly the Elbow Method, assess how the within-cluster variability, measured as WCSS (Within-Cluster Sum of Squares), shifts as the number of clusters increases. The goal is to identify a balance point where an increase in the number of clusters does not result in a significant improvement in variability (WCSS).

For further details on what WCSS is and how it’s calculated, I encourage you to check out my post on Medium: Entendiendo el Within Cluster Sum of Squares (WCSS).
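For intuition, WCSS is the sum of squared Euclidean distances between every point and the centroid of the cluster it is assigned to, which is what scikit-learn exposes through the inertia_ attribute. A minimal, hand-rolled sketch of the computation could look like this (the helper name wcss_manual is just for illustration):

import numpy as np

def wcss_manual(X, labels, centroids):
    # Sum of squared distances of each point to its assigned centroid
    total = 0.0
    for k, centroid in enumerate(centroids):
        cluster_points = X[labels == k]
        total += np.sum((cluster_points - centroid) ** 2)
    return total

# Example usage (assuming a fitted KMeans object named kmeans):
# wcss_manual(X, kmeans.labels_, kmeans.cluster_centers_)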

Once a value for K has been selected, the next step is to initialize the centroids in the feature space. Centroids are the central points of each cluster and can be initialized randomly or through more sophisticated methods such as K-means++, which strategically spreads out the initial centroids to improve convergence efficiency. (Note that K-means++ governs where the initial centroids are placed; it does not choose the value of K itself.)

The K-means algorithm operates iteratively. First, it assigns each data point to the cluster whose centroid is nearest. Then, it recalculates the centroids of each cluster based on the points assigned to them. This process is repeated until the assignment of points to clusters does not significantly change, indicating that the algorithm has converged to a solution.

In other words, these are the steps to follow (a minimal sketch of these steps in code appears after the list):

  1. Estimation of the Number of Clusters (K): We must choose the number of K clusters using any valid method.
  2. Initialization of Centroids: We randomly place the K centroids in the feature space. Note: they do not necessarily have to be points from our original dataset; in Chart 2, many centroids do not coincide with any point in the dataset.
  3. Assignment of Points to Clusters: K-means assigns each dataset point to the nearest centroid. Here we will have the first formation of the K-clusters.
  4. Recalculation of Centroids: Here, the new centroid of each cluster will be calculated and assigned. That is, given the groups that form a cluster, we calculate its centroid or barycenter, so the initial barycenters are recalculated at this step.
  5. Iteration Until Convergence: The steps of assignment and recalculation must be repeated until the assignment of points to clusters stabilizes and does not change significantly.
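To make these steps concrete, here is a minimal from-scratch sketch in NumPy. It is purely illustrative: it uses plain random initialization rather than K-means++, the helper name kmeans_from_scratch is invented for this example, and the scikit-learn implementation used in the next section is what you would rely on in practice.

import numpy as np

def kmeans_from_scratch(X, k, max_iter=300, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: place K centroids at random positions within the data range
    mins, maxs = X.min(axis=0), X.max(axis=0)
    centroids = rng.uniform(mins, maxs, size=(k, X.shape[1]))
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the barycenter of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids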

Now, let’s walk through the full process step by step on our dataset.

4. End-to-End K-Means Algorithm

We shall now delineate and elucidate each phase of the K-Means algorithm in a stepwise fashion, ensuring a thorough comprehension of its components.

4.1 Determination of the Optimal Number of Clusters Using the Elbow Method

The pivotal task of ascertaining the optimal number of clusters (K) forms the foundation of our analytical endeavor. Given the absence of a predefined value for K, our initial step involves the derivation of an estimative range, judiciously informed by the business context and analytical objectives at hand. To this end, we establish both lower and upper bounds for K, thereby framing our exploration of the data’s intrinsic clustering tendencies. For illustrative purposes, we might cap K at 10, setting the stage for a methodical investigation into the most efficacious cluster configuration.

The ensuing procedure entails:

  • Iteratively executing the K-Means algorithm for K values ranging from 1 to 10.
  • Computing the Within-Cluster Sum of Squares (WCSS) for each iteration.
  • Employing an elbow plot to discern the inflection point, indicative of a significant change in trend, thereby illuminating the most judicious choice for K.

Here is how we implement this analysis in code:

from sklearn.cluster import KMeans

wcss = []
n_clu = 10

for i in range(1, n_clu + 1):
    kmeans = KMeans(n_clusters=i,
                    init="k-means++",
                    max_iter=300,
                    n_init=10,
                    random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

In the iterative process of determining the optimal number of clusters for K-Means, the following steps are meticulously executed for each iteration:

  1. Initialization of the K-Means Algorithm: Employ the KMeans() class from sklearn.cluster to instantiate the algorithm, specifying n_clusters for the current iteration.
  2. Fitting the Algorithm: Apply the .fit() method to train the K-Means algorithm on the dataset X, allowing it to identify cluster centers.
  3. Computation of Within-Cluster Sum of Squares (WCSS): Read the .inertia_ attribute to obtain the WCSS, which quantifies the compactness of the clusters.

For each iteration, the configuration parameters of the K-Means algorithm are carefully adjusted as follows:

  • n_clusters = i: This crucial parameter dictates the number of clusters to be formed in each iteration, with i being incremented on each pass of the for loop.
  • init = "k-means++": Opting for the k-means++ method for centroid initialization enhances the algorithm’s convergence efficiency by strategically placing initial centroids, leading to more effective clustering outcomes compared to random initialization.
  • max_iter = 300: Setting a cap of 300 iterations for each run of the algorithm, which is the library default, restricts the number of centroid updates. This limitation prevents endless or excessively long iterations that do not yield substantial improvements in cluster refinement.
  • n_init = 10: This default setting specifies the algorithm to be executed 10 times with distinct centroid seeds, enhancing the robustness of the clustering outcome by potentially offering varied solutions based on the initial centroid positions.
  • random_state = 42: A fixed seed value ensures that the algorithm’s outcomes are reproducible, facilitating consistency in results across different runs.

Following the execution, the wcss list holds one WCSS value per candidate number of clusters, enabling the evaluation of cluster compactness and the determination of the most appropriate K.

Output 2: WCSS values for the different numbers of clusters

Observing the list, we note that the Within-Cluster Sum of Squares (WCSS) values exhibit a progressive decline with the increment in the number of clusters. This diminution is characteristically steep at the outset, decelerating as it nears the optimal cluster count. Similar to the earlier analysis, pinpointing the precise value of K remains challenging; thus, we shall proceed to visualize these data through plotting.

plt.plot(range(1, n_clu + 1), wcss, marker='o')
plt.title("Elbow Plot")
plt.xlabel("Number of Clusters")
plt.ylabel("WCSS(k)")
plt.grid(True)
plt.show()
Chart 3: Elbow Method. Representation of the Within-Cluster Sum of Squares (WCSS) by the number of clusters K.

Opting for K = 5 as the optimal cluster count for our dataset strikes an effective balance between computational efficiency and clustering precision. This selection results in a noteworthy decrement in the Within-Cluster Sum of Squares (WCSS), illustrating a more coherent and compact grouping compared to smaller values of K (1 through 4). It highlights the algorithm’s increased capacity to minimize variance within clusters, enhancing the data’s structural clarity.

Conversely, elevating the cluster count beyond five yields only incremental reductions in WCSS. This observation suggests that further segmentation scarcely augments the data’s discriminability. Moreover, it may induce over-segmentation, where the slight accuracy enhancements are outweighed by a heightened complexity in the model’s structure. Thus, K = 5 emerges as a judicious choice, marrying efficiency with meaningful data partitioning, avoiding the pitfalls of unnecessary complexity.

4.2 Implementation of the K-Means Algorithm

With the optimal number of clusters, K, determined, our next step involves segmenting our dataset, X, into 5 distinct clusters.

The strategy entails executing the K-means algorithm with K set to 5, enabling the algorithm to identify the most coherent grouping of data points based on their features.

The implementation process is as follows:

  • Initialize the K-Means algorithm specifying n_clusters as 5.
  • Employ the .fit_predict() method to both fit the K-means model to the dataset, X, and concurrently predict the cluster assignment for each data point. This method undertakes two key operations:
  • fit: This operation trains the K-means model on the dataset, identifying the central points (centroids) of the clusters.
  • predict: This assigns each data point in X to one of the 5 clusters based on proximity to the centroids.
kmeans = KMeans(n_clusters=5,
                init="k-means++",
                max_iter=100,
                n_init=10,
                random_state=42)

y_kmeans = kmeans.fit_predict(X)
y_kmeans

Upon examining the cluster assignments for the initial 10 observations, we observe:

y_kmeans[0:10]
Output 3: Cluster assignment predictions for the dataset’s first 10 entries

4.3 Incorporating Clustering Outcomes into the Analytical Framework

Upon the successful application of the K-Means algorithm to our dataset and the subsequent allocation of each data point to a designated cluster, it is imperative to assimilate these clustering outcomes back into the original dataset. This assimilation is pivotal for enhancing the depth of our analysis and delivering substantial business insights.

The procedural step involves the transformation of the cluster identifiers contained within y_kmeans into a Pandas Series. Following this, we merge this series into our original DataFrame, dataset, incorporating the cluster identifiers as a new column titled Cluster Pred.

This augmented DataFrame, denominated as X_clustered, offers the dual advantage of allowing an investigation into the intrinsic attributes of each data point, alongside their respective cluster affiliations. Such a comprehensive dataset lays the groundwork for an in-depth, nuanced analysis of the clusters that have been discerned, thereby furnishing a robust platform for data-driven decision-making and strategic insights.
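A minimal sketch of this step could look like the following; the column name Cluster Pred and the variable X_clustered follow the naming used above, while the income and spending column headers are assumed to match those in Mall_Customers.csv.

# Attach the predicted cluster labels to the original DataFrame
X_clustered = dataset.copy()
X_clustered["Cluster Pred"] = pd.Series(y_kmeans, index=dataset.index)

# Quick profile of each segment, e.g. average income and spending score per cluster
# (column names assumed: "Annual Income (k$)" and "Spending Score (1-100)")
print(X_clustered.groupby("Cluster Pred")[["Annual Income (k$)", "Spending Score (1-100)"]].mean())

Grouping by Cluster Pred in this way makes it straightforward to profile each segment, for example by its average income and spending score.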

4.4 Visualization of Clusters and Centroids

Upon incorporating the cluster assignments (Cluster Pred) into our dataset, the subsequent phase involves visualizing the results to acquire an intuitive comprehension of the cluster distribution relative to the chosen features.

Selecting an appropriate visualization chart is pivotal for uncovering patterns and trends that might remain obscured through numerical analysis exclusively.

For this purpose, we employ a scatter plot to depict each of the five clusters delineated by the K-Means algorithm. It’s crucial to acknowledge that this visualization strategy is feasible due to our analysis being confined to two variables.

# Plot each of the five clusters with a distinct color
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1],
            s=100, c="red", label="Cluster 1")
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1],
            s=100, c="blue", label="Cluster 2")
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1],
            s=100, c="green", label="Cluster 3")
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1],
            s=100, c="cyan", label="Cluster 4")
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1],
            s=100, c="magenta", label="Cluster 5")

# Plot the final centroids as black triangles
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=100, c="black", label="Centroids", marker='^')

plt.title('Mall customer clusters (k = ' + str(kmeans.n_clusters) + ')')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
Chart 4: Scatter plot of the dataset showing the K-Means segmentation into five clusters.

Each point depicted on the visualization corresponds to an individual sample from our dataset, X, with its color designation reflective of the cluster affiliation. The distinct hues — red, blue, green, cyan, and magenta — serve to delineate clusters 1 through 5, respectively, enhancing the visual distinction among the groups. In addition to these, the centroids of each cluster are marked with black, triangle-shaped indicators. Centroids are pivotal to the functionality of the K-Means algorithm, acting as the focal points around which samples coalesce. The terminal positions of these centroids denote the gravitational center of their respective clusters.

This graphical representation is instrumental in assessing cluster cohesion and separation.

Optimal outcomes are characterized by clusters that are both well-delineated and compact, signifying precise and significant classifications within the defined feature space: Annual Income (K$) and Spending Score (1-100). This visualization serves as an empirical validation of our parameter selection and the chosen number of clusters for the K-Means algorithm, bolstering our confidence as we proceed with the analysis of the delineated customer segments.
