Customer Segmentation using Python

Nehla Shajahan

Published in

Nerd For Tech

13 min read · Jul 3, 2021

Learn how to deploy a K Means Clustering algorithm step by step in Python for Customer Segmentation.

“We are surrounded by data, but starved for insights.” -Jay Baer, marketing and customer experience expert

Introduction

We know that Machine Learning is predominantly classified into supervised, unsupervised, and reinforcement learning.

Today, we will be looking at one of the most widely used unsupervised learning techniques, K Means Clustering.

“Cluster Analysis or Clustering is a multivariate statistical technique that groups observations based on some of the features or variables they are described by.”

Now, you may ask what exactly the difference is between classification (supervised learning) and clustering (unsupervised learning). The former revolves around predicting an output category from input data: we train the model on labeled training data and then use it to predict future outcomes. Clustering, however, groups data points together based on their similarity to one another and their differences from the rest. Because there are no labels, the resulting groups are something we must name ourselves, unlike supervised learning, where the labels are given.

The objective of this process is to maximize the similarity of observations within a cluster and maximize the dissimilarity between clusters. In other words, we group or ‘cluster’ all the available data into non-overlapping sub-groups that are distinct from each other.

Major applications of clustering are customer segmentation, image segmentation, etc.

Customer segmentation is the process of segregating a company’s potential customer base into discrete groups based on their needs, buying characteristics, etc. By doing so, businesses can better target individuals and augment sales by providing the customers with more tailored shopping experiences.

This article will focus on how we can carry out customer segmentation using K Means Clustering in Python.

We will be covering the following topics:

Data pre-processing for K-Means Clustering.
EDA and Visualizations of the dataset.
Building a K-Means Clustering model step by step with scikit-learn.
Visualizations, interpretations, and analysis of the clusters built.

Let’s get to coding now!

Importing the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline  
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

We load our required dataset into the Python environment using the following code. Here, we assign ‘CustomerID’ as the index as it is the unique identifier for each customer.

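data = pd.read_csv('Mall_Customers.csv', index_col='CustomerID')
data.columns = ['Gender', 'Age', 'Annual Income', 'Spending Score']  # assumed rename: the code below uses these shortened names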

To check what’s inside our loaded dataset:

data.head()

The dataset initially consisted of 5 columns: ‘CustomerID’, ‘Gender’, ‘Age’, ‘Annual Income(k$)’, and ‘Spending Score(1–100)’. ‘CustomerID’ does not help our clustering analysis, so it can either be dropped or, as in this case, used as the index or identifier.

The information about the other 4 features is as follows:

Gender: A categorical feature that must be pre-processed and converted into a numeric type before it is fed to the model. It has 2 categories, ‘Male’ and ‘Female’.
Age: A numerical feature containing the age of the customer (in years).
Annual Income: A numerical feature containing the annual income of the customer in thousands of dollars.
Spending Score: A numerical feature containing a score between 1 and 100 assigned to the customer based on their spending behavior.

Detecting and Handling Outliers

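Outliers are observations that differ from the overall pattern of the sample dataset. Wikipedia defines an outlier as ‘an observation point that is distant from other observations.’ Many ML algorithms do not work well in the presence of outliers; K-Means in particular is sensitive to them because each centroid is a mean, and a mean is pulled towards extreme values.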

Now let us see if our given dataset contains any outliers using a boxplot.

plt.boxplot(data['Annual Income'])
plt.title('Boxplot of Annual Income')

The figure shows that there are a few outliers in the ‘Annual Income’ column. We will handle these outliers using the IQR or the interquartile range method.

You can learn more about outlier detection and removal here.

# method='midpoint' needs NumPy 1.22 or later; older versions call this argument interpolation=
Q1 = np.percentile(data['Annual Income'], 25, method='midpoint')
Q2 = np.percentile(data['Annual Income'], 50, method='midpoint')
Q3 = np.percentile(data['Annual Income'], 75, method='midpoint')
IQR = Q3 - Q1
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR
# Collect every value that falls outside the two limits
outlier = []
for x in data['Annual Income']:
    if (x > upper_limit) or (x < lower_limit):
        outlier.append(x)

The outliers were found to be [137,137], both of which are above the upper limit. We will now find the respective indices and drop the entire two rows from the analysis.

PSA: It is okay to keep the outliers in the analysis if they are few in number.

outlier_index = data['Annual Income'] > upper_limit
data.loc[outlier_index].index

data.drop([199, 200], inplace=True)
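The IQR rule used above can be wrapped in a small reusable helper. The following is only a sketch: the function name iqr_outliers and the toy income values are made up for illustration, and it uses NumPy's default percentile interpolation rather than the midpoint variant.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return (lower_limit, upper_limit, outliers) using the k * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    # Keep only the values that fall outside the two limits
    mask = (values < lower_limit) | (values > upper_limit)
    return lower_limit, upper_limit, values[mask]

# Toy annual incomes in k$ (hypothetical values, not the mall dataset)
incomes = [15, 20, 40, 54, 60, 62, 70, 78, 137]
low, high, found = iqr_outliers(incomes)
print(low, high, found)
```

With these toy numbers, only the value 137 lies above the upper limit, which mirrors the situation in the article's dataset.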

Exploratory Data Analysis

Before we jump to machine learning or clustering, it is important to perform some EDA and visualizations. This helps us to identify patterns within the data and find interesting relationships among the variables.

You can learn more about EDA here.

1. Univariate Analysis

Univariate analysis is the analysis of a single variable. This helps us to describe the data and find patterns that exist within it.

For instance, we find the number of observations in each category of the feature ‘Gender’ using countplot()

sns.countplot(x='Gender', data=data)

There are more females in the dataset when compared to males.

Now, let us look at the univariate distribution of the feature ‘Age’ using histplot() with a KDE curve, the replacement for distplot(), which recent seaborn releases have deprecated. This lets us look at the shape of the distribution of our data.

sns.histplot(data['Age'], bins=30, kde=True)

The age column appears to be roughly normally distributed. The KDE curve on the plot looks like a bell curve, meaning most values in the ‘Age’ column lie close to the centre of the range.

You can learn more about distplot and KDE here.

2. Bivariate Analysis

Bivariate analysis, as the name suggests, is the analysis of two variables together to examine any relationship between them.

Now, we will check for any association between each pair of our features. However, one thing that must be noted is that bivariate analysis using a scatter plot can only be performed on numerical data, so we will have to convert our categorical feature ‘Gender’ to a numerical datatype.

Treating the categorical feature: We can assign or map the values of ‘Male’ and ‘Female’ entries in the given dataset as shown below.

gender= {'Male':0, 'Female':1}
data['Gender']= data['Gender'].map(gender)

Let’s get back to our bivariate analysis!

A scatter plot is an important visualization method used to observe the relationship between variables. It uses dots to represent values for two different numeric variables.

2.1. Scatter plot between ‘Age’ and ‘Spending Score’

plt.figure(figsize=(10,6))
plt.scatter(data['Age'],data['Spending Score'], marker='o');
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.title('Scatter plot between Age and Spending Score')

We can see that the lower the age, the higher the spending score tends to be.

2.2. Scatter plot between ‘Age’ and ‘Annual Income’

plt.figure(figsize=(10,6))
plt.scatter(data['Age'],data['Annual Income'], marker='o');
plt.xlabel('Age')
plt.ylabel('Annual Income')
plt.title('Scatter plot between Age and Annual Income')

People in the 30–50 age group have the highest annual incomes.

2.3. Scatter plot between ‘Annual Income’ and ‘Spending Score’

plt.figure(figsize=(10,6))
plt.scatter(data['Annual Income'],data['Spending Score'], marker='o');
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title('Scatter plot between Annual Income and Spending Score')

We can see that an annual income of roughly $40–70k corresponds to a spending score of about 40–60.

2.4. Scatter plot between ‘Gender’ and ‘Spending Score’

plt.figure(figsize=(6,6))
plt.scatter(data['Gender'],data['Spending Score'], marker='o');
plt.xlabel('Gender')
plt.ylabel('Spending Score')
plt.title('Scatter plot between Gender and Spending Score')

The spending score corresponding to females (mapped to 1) is slightly higher than the spending score of males. Females are marginally more likely to spend more.

2.5. Scatter plot between ‘Gender’ and ‘Annual Income’

plt.figure(figsize=(6,6))
plt.scatter(data['Gender'],data['Annual Income'], marker='o');
plt.xlabel('Gender')
plt.ylabel('Annual Income')
plt.title('Scatter plot between Gender and Annual Income')

There is hardly any difference in annual income between the genders after removing outliers.

2.6. Scatter plot between ‘Gender’ and ‘Age’

A few males in the dataset are slightly older than the females.

3. Multivariate Analysis

A pivotal part of exploratory data analysis is to look for correlations between variables, and the most widely used way to visualize a correlation matrix is a heatmap.

A heatmap is a graphical representation of data in which values are represented as colors.

fig_dims = (7, 7)
fig, ax = plt.subplots(figsize=fig_dims)
sns.heatmap(data.corr(), annot=True, cmap='viridis')

We see that ‘Age’ is negatively correlated with ‘Spending Score’.
‘Annual Income’ is only weakly correlated with ‘Age’.
‘Annual Income’ and ‘Spending Score’ are also only weakly correlated.
‘Gender’ is only weakly correlated with ‘Spending Score’, though more so than with ‘Annual Income’.
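The numbers shown in the heatmap are Pearson correlation coefficients, which is what data.corr() computes by default. As a rough sketch of what that coefficient means, it can be computed by hand and checked against NumPy. The helper name pearson_r and the toy age and score values below are assumptions made purely for illustration.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

# Hypothetical values in which the score falls as age rises
age = [20, 30, 40, 50, 60]
score = [80, 70, 55, 45, 30]
r = pearson_r(age, score)
print(r)                               # strongly negative
print(np.corrcoef(age, score)[0, 1])   # NumPy's value for comparison
```

A value near -1 means the two variables move in opposite directions, a value near 0 means there is little linear relationship, and a value near 1 means they rise together.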

Standardizing the variables

Before we deploy the model, we have to bring all variables in the dataset to around the same scale. This important pre-processing step is called standardization.

To learn more about standardization, you can read this linked article.

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data[['Age', 'Annual Income', 'Spending Score']])

Head of the dataset after standardization
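For intuition, standardization subtracts each column's mean and divides by its standard deviation. The sketch below does this by hand with NumPy on a few made-up rows; the function name standardize and the values are illustrative assumptions. It uses the population standard deviation, as StandardScaler does, but does not handle constant columns.

```python
import numpy as np

def standardize(X):
    """Column-wise z-score: (value - column mean) / column standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    return (X - mean) / std

# Hypothetical rows of [Age, Annual Income, Spending Score]
X = np.array([[19, 15, 39],
              [21, 15, 81],
              [35, 60, 50],
              [52, 88, 15]])
Z = standardize(X)
print(Z.mean(axis=0))  # approximately 0 for every column
print(Z.std(axis=0))   # approximately 1 for every column
```

After this step every column has a mean of about 0 and a standard deviation of about 1, so no single feature dominates the distance calculations just because of its units.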

Building the Clustering model

The algorithm used: K means Clustering

K means algorithm is an unsupervised learning technique used to solve clustering problems in machine learning. It groups unlabelled data into K different clusters.

“It is an iterative algorithm that tries to partition the dataset into K pre-defined distinct non-overlapping subgroups (clusters) where each data point belongs to only one group.”

In K-Means clustering, we have to specify the number of clusters we want the data to be grouped into. Initially we pick an arbitrary value for K and then use the ‘Elbow Method’ to find the optimal number of clusters.

The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then, the algorithm iterates through two steps:

1. Reassign data points to the cluster whose centroid is closest.

2. Calculate the new centroid of each cluster.

These two steps are repeated until the within-cluster sum of squares (WCSS), also called the sum of squared errors, cannot be reduced any further.
The WCSS is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.
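The two steps and the WCSS can be sketched in a few lines of NumPy. This is a minimal illustration and not scikit-learn's implementation: the function name kmeans_sketch is hypothetical, the starting centroids are simply k randomly chosen data points (rather than a random assignment or k-means++), and the toy points are made up.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain K-Means loop: assign to nearest centroid, then recompute centroids."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumed initialisation: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: squared distance of every point to every centroid, then nearest one
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: new centroid = mean of its points (keep the old one if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so WCSS cannot drop further
        centroids = new_centroids
    # Final assignment and WCSS (sum of squared distances to the assigned centroid)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    wcss = dists[np.arange(len(X)), labels].sum()
    return labels, centroids, wcss

# Two obviously separated toy groups
X = np.array([[1, 1], [1, 2], [2, 1], [9, 9], [9, 10], [10, 9]])
labels, centroids, wcss = kmeans_sketch(X, k=2)
print(labels, wcss)
```

On data this cleanly separated, the loop settles on the two visible groups within a few iterations; on real data the result can depend on the starting centroids, which is why scikit-learn repeats the procedure several times (n_init).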

# Create a copy of the data variable (note: this is the unscaled data, including 'Gender'; scaled_data is not used here)
x = data.copy()
# The number in the brackets is K, the number of clusters we are aiming for; here we pick 3 arbitrarily
kmeans = KMeans(3)
# Fit the data
kmeans.fit(x)
# Create a copy of the input data
clusters = x.copy()
# Take note of the predicted clusters
clusters['cluster_pred'] = kmeans.fit_predict(x)
# Plot the data using the Annual Income and the Spending Score
plt.scatter(clusters['Annual Income'], clusters['Spending Score'], c=clusters['cluster_pred'], cmap='rainbow')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')

The above scatter plot gives us a rough idea of the optimal number of clusters. But to find the most appropriate ‘K’, we use The Elbow Method.

You can take the help of this YouTube video to get a further understanding of the K means algorithm and the Elbow Method.

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++',
                    max_iter=300, n_init=10, random_state=42)
    kmeans.fit(x)
    wcss_iter = kmeans.inertia_
    wcss.append(wcss_iter)

# Plotting the results onto a line graph to help us observe 'The Elbow'
plt.figure(figsize=(10,5))
no_clusters = range(1, 11)
plt.plot(no_clusters, wcss, marker="o")
plt.title('The elbow method', fontweight="bold")
plt.xlabel('Number of clusters(K)')
plt.ylabel('within Clusters Sum of Squares(WCSS)')

We can see that the optimal number of clusters is 5, the point at which WCSS stops decreasing drastically.
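Reading the elbow off a plot is a judgment call. One simple heuristic, offered here only as an assumption and not as the method used in this article, is to draw a straight line from the first to the last point of the curve and pick the K where the curve dips furthest below that line. The function name elbow_by_chord and the WCSS values are made up for illustration.

```python
import numpy as np

def elbow_by_chord(ks, wcss):
    """Return the K whose WCSS lies furthest below the line joining the end points."""
    ks = np.asarray(ks, dtype=float)
    wcss = np.asarray(wcss, dtype=float)
    # Straight line from the first point of the curve to the last
    slope = (wcss[-1] - wcss[0]) / (ks[-1] - ks[0])
    chord = wcss[0] + slope * (ks - ks[0])
    # Vertical gap between the line and the curve at each K
    gap = chord - wcss
    return int(ks[np.argmax(gap)])

# Hypothetical WCSS values for K = 1..10 that flatten out after K = 5
ks = range(1, 11)
toy_wcss = [1000, 600, 400, 250, 100, 90, 82, 76, 71, 67]
best_k = elbow_by_chord(ks, toy_wcss)
print(best_k)
```

A heuristic like this is best treated as a cross-check on the plot rather than a replacement for looking at it.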

Now let's look at another metric we can use to evaluate the clustering.

Silhouette Score

A silhouette score is a metric used to measure how dense and well separated the clusters are. The metric ranges from -1 to 1.

The higher the silhouette score, the better the model: for each data point, it compares the mean distance to the other points in its own cluster with the mean distance to the points in the nearest neighbouring cluster.

Learn more about silhouette score here.

Let's calculate the score for our model:

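# Here kmeans is the last model from the elbow loop (n_clusters=10); to score K=5, refit with K=5 and pass x with that model's labels
print(silhouette_score(clusters, kmeans.labels_, metric='euclidean'))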

The silhouette score for the model is 0.37 which is a pretty decent value.
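To make the metric concrete: for each point, let a be the mean distance to the other points in its own cluster and b the mean distance to the points of the nearest other cluster; the point's silhouette is (b - a) / max(a, b), and the score is the average over all points. The sketch below computes this by hand on made-up points. The function name silhouette_sketch is hypothetical, and scoring single-point clusters as 0 is a convention assumed here.

```python
import numpy as np

def silhouette_sketch(X, labels):
    """Mean silhouette coefficient with Euclidean distances (needs at least 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            scores.append(0.0)  # assumed convention for a cluster with one point
            continue
        # a: mean distance to the other members of the same cluster (self-distance is 0)
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(D[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, far-apart toy clusters
X = np.array([[0, 0], [0, 1], [10, 0], [10, 1]])
labels = np.array([0, 0, 1, 1])
score = silhouette_sketch(X, labels)
print(score)
```

Tight clusters that sit far apart, as in this toy example, push the score towards 1, while overlapping clusters pull it towards 0 or below.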

Now we assign the optimal number of clusters as 5 and create a new data frame with the predicted clusters. Moreover, we map our categorical feature ‘Gender’ back to its initial categories.

kmeans_new = KMeans(5)
# Fit the data
kmeans_new.fit(x)
# Create a new data frame with the predicted clusters
clusters_new = x.copy()
clusters_new['cluster_pred'] = kmeans_new.fit_predict(x)
# Map the gender variable back to 'Male' and 'Female'
gender = {0: 'Male', 1: 'Female'}
clusters_new['Gender'] = clusters_new['Gender'].map(gender)

Head of the dataset with the predicted clusters

To get a more comprehensible understanding of the predicted clusters, let’s visualize:

plt.figure(figsize=(6,6))
plt.scatter(clusters_new['Annual Income'],clusters_new['Spending Score'],c=clusters_new['cluster_pred'],cmap='rainbow')
plt.title("Clustering customers based on Annual Income and Spending score", fontsize=15,fontweight="bold")
plt.xlabel("Annual Income")
plt.ylabel("Spending Score")

Cluster Analysis

From the final plot, we can perceive that the customers present in our dataset could be clustered into 5 distinct groups based on their annual income and spending score.

Red: low annual income, low spending score
Orange: low annual income, high spending score
Violet: intermediate annual income, intermediate spending score
Blue: high annual income, high spending score
Green: high annual income, low spending score

Let us delve further into these clusters for a better understanding.

To begin with, to study the attributes of each of the clusters, let's find the average of all features across each cluster.

avg_data = clusters_new.groupby(['cluster_pred'], as_index=False).mean(numeric_only=True)  # 'Gender' holds strings again, so average the numeric columns only
avg_data

A picture indeed speaks a thousand words. So, let’s visualize the above table using bar graphs.

# Draw one bar plot per feature, each on its own figure so they do not overlap
for column in ['Age', 'Annual Income', 'Spending Score']:
    plt.figure()
    sns.barplot(x='cluster_pred', y=column, palette="plasma", data=avg_data)

Gender breakdown

We also need to understand the gender divide of each cluster

data2 = pd.DataFrame(clusters_new.groupby(['cluster_pred','Gender'])['Gender'].count())
data2

Clusters 0,1,2,3 have a higher proportion of females than males and cluster 4 has an almost equal proportion of both.

PS: It is important to note that the dataset had a predominantly female population.
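Because the dataset has more females overall, raw counts can be hard to compare across clusters of different sizes. One option is to look at the share of each gender within each cluster using pd.crosstab with normalize='index'. The small table below is entirely made up for illustration and only reuses the column names from the article.

```python
import pandas as pd

# Hypothetical cluster assignments, not the article's results
toy = pd.DataFrame({
    'cluster_pred': [0, 0, 0, 1, 1, 1, 1, 2, 2],
    'Gender': ['Female', 'Female', 'Male', 'Male', 'Female',
               'Male', 'Female', 'Female', 'Male'],
})

# Each row sums to 1: the fraction of females and males inside that cluster
share = pd.crosstab(toy['cluster_pred'], toy['Gender'], normalize='index')
print(share)
```

Comparing these shares with the overall female share of the dataset shows whether a cluster is genuinely skewed or simply reflects the population.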

Major attributes of each predicted cluster

The following conclusions can be drawn about the built clusters:

1. Cluster 0: Blue- high annual income, high spending score

The average age is 32 years; predominantly female; Average Annual Income is 85k in dollars; Average Spending Score is 82

2. Cluster 1: Violet- intermediate annual income, intermediate spending score

The average age is 43 years; predominantly female; Average Annual Income is 55k in dollars; Average Spending Score is 49

3. Cluster 2: Orange- low annual income, high spending score

The average age is 25 years; predominantly female; Average Annual Income is 26k in dollars; Average Spending Score is 78

4. Cluster 3: Red- low annual income, low spending score

The average age is 45 years; predominantly female; Average Annual Income is 26k in dollars; Average Spending Score is 21

5. Cluster 4: Green- high annual income, low spending score

The average age is 41 years; there is an almost equal proportion of males and females; Average Annual Income is 86k in dollars; Average Spending Score is 17

Building consumer personas around each cluster

Now that we have created the customer clusters, we can build personas around them to make the customer experience gleeful.

Being able to tell stories around the developed analysis is a pivotal skill that helps the clients or stakeholders to understand our findings more efficiently.

Cluster 0: Highly well off customers

This segment consists of young adults (average age 32) who built up a significant amount of wealth in their early working years.

They also spend heavily and hence lead a very affluent lifestyle.

Suggestion: Most people in this age group might be in the initial phase of building a family. Hence, promoting real estate, property, car deals, etc. is most likely to catch their attention. Of all the clusters, these customers are the most likely to make serious financial commitments, owing to their high spending capacity.

Cluster 1: Middle-class customers

This cluster consists of middle-aged customers who spend and earn money at an intermediate level.

They are careful with their spending as their income levels are not excessive.

These people might also be the ones with higher financial responsibilities, for instance, the higher education of their children.

Suggestion: Discount offers, promo codes, loyalty cards, etc.

Cluster 2: Impetuous buyers

This segment consists of youngsters who are careless with their spending habits.

This group is likely to be comprised of first jobbers who tend to spend above their means in the pursuit of a good lifestyle.

Suggestions: Since a lot of youngsters love the idea of vacationing, providing this segment with hotel coupons and flight offers might be a good idea. This group is also expected to spend the most on clothing and related accessories when compared to the other clusters. Providing adequate discounts on such goods might also be a useful strategy.

Cluster 3: The almost pensioned ones

This segment consists of the older population who earn and spend less. They might be saving up for retirement.

Suggestions: Healthcare-related products can be promoted amongst this cluster. Usage of adequate discount coupons, promo codes, etc might also help.

Cluster 4: The cautious spenders

This cluster consists of middle-aged people who are frugal about their spending habits.

Although the income levels are high, they spend very little. This might be an indicator of their financial responsibilities.

Suggestion: Membership cards, discount coupons, offers, etc could have a drastic impact on this cluster.

Conclusion

To sum up, we have built a K-Means clustering model to segment customers based on their annual income and spending habits.

We performed exploratory data analysis on the dataset, pre-processed it, and built the required model.

Furthermore, we analyzed the developed clusters and gave some business recommendations which can be used for improving businesses.

The entire code for the above project can be accessed at my GitHub repository.