Customer Segmentation: Kmeans Clustering

Published in

Analytics Vidhya

8 min read · Aug 14, 2021

Using a public online retailer dataset to segment customers by purchase amount and purchase frequency with K-means Clustering.

Introduction

What is Customer Segmentation?

As the name suggests, customer segmentation means dividing customers into homogeneous groups based on their purchases, frequency of purchases and the type of products bought. In any business, it is important to analyse and retain existing customers as well as to explore and attract new customers.

To a certain extent, retaining existing customers is found to be worth more effort than acquiring new ones, as existing customers are more likely to spend more on the products. Satisfying these customers helps to build a large, strong and reliable customer base and also bolsters repeat purchases of your products.

What features are used to segregate customers?

How customers are segregated depends on the use-cases of the respective business. Many businesses and organisations use customer segmentation to optimise their ability to sell their products to a myriad of customer groups.

Online Advertising: A type of advertising where advertisers use information from social media profiles and usage, search engine use and habits (annoying cookies!) and web browsing habits to segment customers accordingly. Sometimes even socio-economic conditions, location and behaviour are used to serve ads.
Health care: Deloitte surveyed 10,000+ US consumers to analyse their attitudes, behaviour and priorities regarding health care, health insurance and well-being. As every consumer has different approaches and preferences regarding health care plans, Deloitte segmented the consumers into 4 groups. This approach helped health care stakeholders to understand and analyse different types of consumers and provide suitable recommendations to the desired consumers.
Car Companies: In this use-case, the cars are not segmented; instead the customers are segmented based on their needs, and the types of cars are the solutions designed to meet those needs. Some of the groups are off-road, environmentally aware, family needs, quality matters, stand out from the crowd, etc. Here is a good article where Porsche uses a segmentation technique to target customers and sell its unique cars.
There are various other industries such as banks and financial services, skin-care and beauty product manufacturers, television and mobile networks, etc.

As we have seen above, customer segmentation helps to develop the business as well as to serve customer needs better.

How are customers segmented in practice?

Technically, there are many unsupervised algorithms for segmenting customers. Many of these machine learning algorithms can help companies identify their user/customer base and create the required customer groups.

One such algorithm is the K-Means Clustering algorithm. It analyses unlabelled customer data and assigns each data point (customer) to a cluster.
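To make the idea concrete, here is a minimal sketch of the two steps K-means repeats: assign every point to its nearest centroid, then move each centroid to the mean of its points. The function name kmeans_sketch and the toy points are made up for illustration; the rest of this article uses scikit-learn's KMeans rather than this hand-written loop.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Toy K-means loop (hypothetical helper): returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated toy groups of "customers" (made-up values)
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
```

The first three points should end up sharing one label and the last three the other; which group is called 0 and which 1 depends on the random start.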

K-Means Clustering

We will be using online retail customer data obtained from a UK-based online retail store. We will be helping them to choose the best set of customers who will buy unique all-occasion gifts from the store. The data is available on Kaggle.

Required libraries

Make sure you have the following libraries installed before analysing the data: pandas, scikit-learn, matplotlib, seaborn. Once installed, let’s build the model!

Reading the data frame

Use the following code to import required libraries:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

Run the following lines to read the dataset, and then let’s look at the data frame:

data = pd.read_csv('OnlineRetail.csv',
                   encoding = 'unicode_escape')
data.head()

There are 8 variables in the dataset.

InvoiceNo: Unique identifier of the transaction done by the customer
StockCode: As it is a wholesale retail store, it has a unique identifier for each stock item
Description and Quantity: Self-explanatory (the item and the number of items)
InvoiceDate: Date and time when the payment was made by the customer
UnitPrice: Per-unit price of the item
CustomerID: Unique identifier of each customer in the dataset
Country: Country of the customer

Data Pre-processing

Let’s remove the missing values as they are not helpful for us. However, removing missing values should not always be done; depending on the application and use-case, you need to decide accordingly.

data.info()

As seen above, there are missing customer values. Without a CustomerID the other information is of no use to us, so we can drop those rows.

data.dropna(inplace = True)
data.isna().sum()
data.info()

Let’s create a Total_Amount column; you will understand why we are creating this column as you proceed with this project.

data['Total_Amount'] = data['Quantity']*data['UnitPrice']

Before going further, let us understand how we are going to analyse this dataset. There are many ways to analyse it, but we will be looking at RFM analysis. This analysis has been in practice for a long time and plays a vital role in marketing efforts. The three main variables in this analysis are:

R (Recency): The number of days between the customer’s last purchase and the last date in the dataset. It tells us how recently a particular customer purchased from the store.
F (Frequency): The number of times each customer has made a purchase. Here it is computed by counting the invoice rows recorded for each customer.
M (Monetary): The total amount spent by each customer.
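As a quick illustration of these three definitions, the sketch below computes R, F and M in one pass on a tiny made-up table. The toy values and the name rfm_toy are assumptions for the example only; the article itself builds the three values step by step on the real dataset.

```python
import pandas as pd

# Made-up toy transactions, not taken from the retail dataset
toy = pd.DataFrame({
    'CustomerID':   [1, 1, 2, 2, 2, 3],
    'InvoiceNo':    ['A1', 'A2', 'B1', 'B1', 'B2', 'C1'],
    'InvoiceDate':  pd.to_datetime(['2021-01-01', '2021-01-10', '2021-01-03',
                                    '2021-01-03', '2021-01-09', '2021-01-02']),
    'Total_Amount': [10.0, 5.0, 20.0, 2.5, 7.5, 4.0],
})

# Reference point: the last date seen anywhere in the data
last_day = toy['InvoiceDate'].max()

rfm_toy = toy.groupby('CustomerID').agg(
    # Days between the customer's latest purchase and the last date
    Recency=('InvoiceDate', lambda d: (last_day - d.max()).days),
    # Number of invoice rows recorded for the customer
    Frequency=('InvoiceNo', 'count'),
    # Total amount spent by the customer
    Monetary=('Total_Amount', 'sum'),
).reset_index()

print(rfm_toy)
```

Customer 1 bought on the last day of the toy data, so their recency is 0, while customer 3 last bought 8 days earlier.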

Let’s calculate RFM values.

The easiest is to calculate the M (Monetary) value. We will be using the Total_Amount column that we created before.

m = data.groupby('CustomerID')['Total_Amount'].sum()
m = pd.DataFrame(m).reset_index()
m.head()

Looking at the first few rows of the new dataframe, we can see that we calculated the monetary value for each customer!

Now, let’s calculate the number of times each customer purchased from the store. We will be using the CustomerID and InvoiceNo columns; count() tallies the invoice rows recorded for each customer.

f = data.groupby('CustomerID')['InvoiceNo'].count()
f =f.reset_index()
f.columns = ['CustomerID','Frequency']
f.head()

We were able to calculate the total number of times each customer purchased from the store.
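One thing worth knowing about this step: count() tallies rows, so an invoice with several line items is counted several times. If you would rather count distinct invoices, nunique() is a possible alternative. The small sketch below uses made-up data to show the difference; which definition suits your analysis is a judgment call.

```python
import pandas as pd

# Made-up example: customer 1 has two invoices, one of them with two line items
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2],
    'InvoiceNo':  ['A1', 'A1', 'A2', 'B1'],
})

# Rows per customer (what count() gives)
rows_per_customer = toy.groupby('CustomerID')['InvoiceNo'].count()
# Distinct invoices per customer (what nunique() gives)
invoices_per_customer = toy.groupby('CustomerID')['InvoiceNo'].nunique()

print(rows_per_customer.to_dict())      # {1: 3, 2: 1}
print(invoices_per_customer.to_dict())  # {1: 2, 2: 1}
```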

Finally, let’s calculate the R (Recency) value for each customer.

First, we need to find when the last purchase in the dataset was made.

data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'],
                                     format = '%d-%m-%Y %H:%M')
last_day = max(data['InvoiceDate'])

Initially, we convert InvoiceDate to a datetime using the specified format, and then find the last date of purchase.

Next, we compute how long before that last date each transaction took place:

data['difference'] = last_day - data['InvoiceDate']
data.head()

Now we need just the number of days as an integer, without the time attached to it, so that it is easier to group by customer later on.

So we can have a separate function that returns the integer number.

def get_days(x):
    # str(x) looks like '373 days 04:24:00'; keep only the leading day count
    y = str(x).split()[0]
    return int(y)

data['difference'] = data['difference'].apply(get_days)
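As a side note, pandas exposes the whole-day part of a time difference directly through the .dt.days accessor, which could replace the string-splitting helper. The sketch below shows it on two made-up timestamps; it is an alternative, not what the code above does.

```python
import pandas as pd

# Two made-up invoice timestamps
dates = pd.Series(pd.to_datetime(['2011-12-01 08:30', '2011-12-09 12:50']))
last_day = dates.max()

# Subtracting datetimes gives timedeltas such as '8 days 04:20:00'
difference = last_day - dates

# .dt.days keeps only the whole number of days as an integer
days = difference.dt.days
print(days.tolist())  # [8, 0]
```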

Now, we can group by customer using the CustomerID and difference columns. The minimum difference is the number of days since the customer’s latest purchase.

r = data.groupby('CustomerID')['difference'].min()
r = r.reset_index()
r.columns = ['CustomerID','Recency']
r.head()

Now we have created three separate dataframes for Recency (r), Frequency (f) and Monetary (m). Let’s merge these dataframes.

grouped_df = pd.merge(m, f, on = 'CustomerID',how = 'inner')
RFM_df = pd.merge(grouped_df, r, on ='CustomerID', how = 'inner')
RFM_df.columns = ['CustomerID','Monetary','Frequency','Recency']

Here we are doing an inner join to combine the 3 dataframes.

As K-means clustering uses every data point to form the clusters, outliers can distort the clusters it detects. So first let’s drop the outliers so that we get better clusters later on.

Let’s look at the box plot of each column.

plt.boxplot(RFM_df['Monetary'])
plt.xlabel('Monetary')
plt.show()

As each variable has outliers, let’s drop them.

outlier_vars = ['Monetary','Recency','Frequency']

for column in outlier_vars:

    # Anything more than 1.5 * IQR beyond the quartiles counts as an outlier
    lower_quartile = RFM_df[column].quantile(0.25)
    upper_quartile = RFM_df[column].quantile(0.75)
    iqr = upper_quartile - lower_quartile
    iqr_extended = iqr * 1.5
    min_border = lower_quartile - iqr_extended
    max_border = upper_quartile + iqr_extended

    outliers = RFM_df[(RFM_df[column] < min_border) | (RFM_df[column] > max_border)].index
    print(f"{len(outliers)} outliers detected in column {column}")

    RFM_df.drop(outliers, inplace = True)
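To see what the 1.5 × IQR rule does, here is a small worked example on made-up spend values. The helper name iqr_bounds is hypothetical and only wraps the same quartile arithmetic used in the loop above.

```python
import pandas as pd

def iqr_bounds(values, factor=1.5):
    """Return (lower, upper) borders at factor * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    spread = (q3 - q1) * factor
    return q1 - spread, q3 + spread

# Made-up spend values with one obvious outlier
spend = pd.Series([10, 12, 11, 13, 12, 11, 500])
low, high = iqr_bounds(spend)

# Keep only the values that fall inside the borders
kept = spend[(spend >= low) & (spend <= high)]
print(low, high)       # 8.75 14.75
print(kept.tolist())   # [10, 12, 11, 13, 12, 11]
```

Here the quartiles are 11 and 12.5, so the IQR is 1.5 and the value 500 falls far outside the upper border of 14.75.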

Standardisation

Now we need to standardise the data, as variables with larger values can dominate when defining clusters.

As the clustering algorithm is based on the distance between data points, we need to scale each variable to have a mean of 0 and a standard deviation of 1.

scaled_df = RFM_df[['Monetary','Frequency','Recency']]
scale_standardisation = StandardScaler()
rfm_df_scaled = scale_standardisation.fit_transform(scaled_df)
rfm_df_scaled = pd.DataFrame(rfm_df_scaled)
rfm_df_scaled.columns = ['monetary','frequency','recency']
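Standardisation is just z = (x - mean) / standard deviation applied to each column, and StandardScaler uses the population standard deviation (the same as NumPy's default, ddof=0). The sketch below works this through by hand on made-up numbers so you can see what the scaler produces.

```python
import numpy as np

# Made-up values with mean 5 and population standard deviation 2
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# np.std defaults to ddof=0, i.e. the population standard deviation
z = (values - values.mean()) / values.std()

print(z.tolist())          # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(z.mean(), z.std())   # 0.0 1.0
```

After scaling, a value of 9 sits 2 standard deviations above the mean, regardless of the original units.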

K-Means Clustering

First, let’s find the number of clusters with the elbow method. The elbow method plots the within-cluster sum of squares (WCSS), also known as the sum of squared errors (SSE), against the number of clusters k. We will use WCSS to find the optimal number of clusters.

k_values = list(range(1,10))
wcss_list = []

for k in k_values:
    kmeans = KMeans(n_clusters = k)
    kmeans.fit(rfm_df_scaled)
    # inertia_ is the within-cluster sum of squares for this k
    wcss_list.append(kmeans.inertia_)

plt.plot(k_values,wcss_list)
plt.xlabel("k")
plt.ylabel("WCSS Score")
plt.title("Within Cluster Sum of Squares - by k")
plt.tight_layout()
plt.show()

From the above graph, we can clearly see an elbow at k = 3 on the x-axis. So, we will choose the number of clusters, or customer groups, to be 3.
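If WCSS feels abstract, it is simply the sum of squared distances from each point to the centroid of its cluster. The sketch below computes it by hand for four made-up points with the cluster assignments fixed manually; the helper name wcss is hypothetical and no clustering is actually run here.

```python
import numpy as np

def wcss(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())

# Four made-up points forming two obvious pairs
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 1: one cluster whose centroid is the overall mean (5, 1)
labels_1 = np.zeros(4, dtype=int)
centroids_1 = points.mean(axis=0, keepdims=True)

# k = 2: one cluster per pair, with centroids (0, 1) and (10, 1)
labels_2 = np.array([0, 0, 1, 1])
centroids_2 = np.array([points[:2].mean(axis=0), points[2:].mean(axis=0)])

print(wcss(points, labels_1, centroids_1))  # 104.0
print(wcss(points, labels_2, centroids_2))  # 4.0
```

The sharp drop from 104 to 4 when moving from one cluster to two is the kind of bend the elbow plot makes visible.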

kmeans = KMeans(n_clusters = 3)
kmeans.fit(rfm_df_scaled)
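K-means starts from random centroids, so the cluster numbering can change from one run to the next. If you want repeatable results, one option is to fix random_state (and set n_init explicitly), as in this sketch on made-up data; the seed 42 and the toy blobs are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: three loose blobs in 3 dimensions
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 3))
                 for c in (-3.0, 0.0, 3.0)])

# Same seed and same settings: both fits should give the same labels
first = KMeans(n_clusters=3, n_init=10, random_state=42).fit(toy)
second = KMeans(n_clusters=3, n_init=10, random_state=42).fit(toy)

print((first.labels_ == second.labels_).all())
```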

Let’s visualise the clusters.

clusters = kmeans.labels_

# Work on a copy so the scaled features keep only their three columns
RFM = rfm_df_scaled.copy()
RFM['labels'] = clusters

fig = plt.figure(figsize=(21,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(RFM["monetary"][RFM.labels == 0], RFM["frequency"][RFM.labels == 0], RFM["recency"][RFM.labels == 0], c='blue', s=60)
ax.scatter(RFM["monetary"][RFM.labels == 1], RFM["frequency"][RFM.labels == 1], RFM["recency"][RFM.labels == 1], c='red', s=60)
ax.scatter(RFM["monetary"][RFM.labels == 2], RFM["frequency"][RFM.labels == 2], RFM["recency"][RFM.labels == 2], c='yellow', s=60)
ax.view_init(30, 185)
plt.show()

Let’s look at the analytics of the RFM dataframe.

# RFM_df and kmeans are the dataframe and the fitted model built above
RFM_df['Clusters'] = kmeans.labels_
analysis = RFM_df.groupby('Clusters').agg({
    'Recency':['mean','max','min'],
    'Frequency':['mean','max','min'],
    'Monetary':['mean','max','min','count']})
analysis

From the above analytics, we can interpret the following.

Thus, we can recommend the customers in cluster ‘0’ to the online store.

Further work ….

You can extend this analysis by using RFM scoring (1–5) and then grouping the customers. Do check out this article.
As K-means is sensitive to outliers, you can try the same data with DBSCAN clustering, which is robust to outliers.
You can take the analysis further by using the Country column to find out where most of the customers come from.
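As a starting point for the RFM scoring idea, here is one possible sketch using pandas quantile bins on a made-up table. The helper name score, the toy values and the choice to simply add the three scores are all assumptions; other scoring schemes exist.

```python
import pandas as pd

# Made-up RFM table for five customers
rfm_toy = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Recency':    [2, 40, 10, 300, 90],
    'Frequency':  [50, 5, 20, 1, 3],
    'Monetary':   [900.0, 80.0, 400.0, 15.0, 60.0],
})

def score(series, higher_is_better=True):
    """Score 1-5 by quintile; ranking first avoids duplicate bin edges."""
    ranks = series.rank(method='first', ascending=higher_is_better)
    return pd.qcut(ranks, q=5, labels=[1, 2, 3, 4, 5]).astype(int)

# A low recency (a recent purchase) is good, so the order is reversed
rfm_toy['R'] = score(rfm_toy['Recency'], higher_is_better=False)
rfm_toy['F'] = score(rfm_toy['Frequency'])
rfm_toy['M'] = score(rfm_toy['Monetary'])
rfm_toy['RFM_Score'] = rfm_toy[['R', 'F', 'M']].sum(axis=1)

print(rfm_toy[['CustomerID', 'R', 'F', 'M', 'RFM_Score']])
```

In this toy table customer 1 bought most recently, most often and spent the most, so they get the top score of 15, while customer 4 gets the lowest score of 3.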


Written by Sai Praneeth