Customer Segmentation Using RFM Analysis in Python

Ankur Tomar
8 min readAug 1, 2017

--

Introduction

Retail industry, an early adopter of data warehousing, has largely benefited from the capacity and capability of data warehouses such as Teradata, Oracle, etc. to host an exceptionally large amount of data. On the daily basis, colossal amount of data is generated across the stores.

The main purpose of storing such a huge amount of data is to analyse the behaviour of customers and make strategies accordingly in order to drive more profits. But the amount of data generated is so huge that it makes any analysis a very difficult task.

Hence it’s prudent that we filter out a chunk of significant customers out of the whole population called a sample or test data .This small chunk of customers will be a representative of the entire population and any analysis carried out on this group will be generalised on the entire population.

In this article, we will see how customers can be segmented in different segment along with the code in Python.

For the analysis, we will use a retail customer data of 4373 customers having a time period of 1 year.

Click here to download the data.

Data Visualization

Before proceeding with the analysis, let us give a detailed look at the data.

Let’s start by firing up the required libraries:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

color = sns.color_palette()

Loading the Dataset in Python:

data=pd.read_csv(“C:\Users\Ankur.Tomar\Desktop\Courses\RFM\\OnlineRetail.csv”)

data.head(10)

Columns Description :

· InvoiceNo : Unique ID for every time a consumer visits the store

· StockCode: Unique ID for a product

· Description : Description of the product

· Quantity : No of units bought

· InvoiceDate : Date and time of purchase

· Unit Price : Price for 1 unit of the product

· CustomerId : Unique ID for each customer

· Country : Country where transaction happened

First we will calculate the total sales for each product by multiplying Quantity with Unit Price. Also, let us convert Invoice Date in YYYYMM format.

Below code extract the year string after the last “/” and month before first “/” and fix the month length to two number i.e making “7” as “07”.

data[‘Total_Price’]=data[‘Quantity’]*data[‘UnitPrice’]

data[‘date’]=data[‘InvoiceDate’].str.extract(‘(.*)/’).str.extract(‘(.*)/’)

data[‘date’]=data.date.astype(str).str.zfill(2)

data[‘date’]=data[‘InvoiceDate’].str.extract(‘/(.*) ‘).str.extract(‘/(.*)’) + data[‘date’]

data.date = pd.to_numeric(data.date, errors=’coerce’)

Checking country wise customer distribution:

Cust_country=data[[‘Country’,’CustomerID’]].drop_duplicates()

#Calculating the distinct count of customer for each country

Cust_country_count=Cust_country.groupby([‘Country’])[‘CustomerID’].\

aggregate(‘count’).reset_index().sort(‘CustomerID’, ascending=False)

#Plotting the count of customers

country=list(Cust_country_count[‘Country’])

Cust_id=list(Cust_country_count[‘CustomerID’])

plt.figure(figsize=(12,8))

sns.barplot(country, Cust_id, alpha=0.8, color=color[2])

plt.xticks(rotation=’60')

plt.show()

It can be seen from the graph that around 90% of the customers exist in United Kingdom only followed by Germany and France. Hence, the company should focus on the United Kingdom for it marketing and other strategies.

Now we will start with the RFM analysis. We will tag each customer on a scale of 1 to 5 for each of the criterion where 5 will be the highest and 1 will be the lowest tagging. The customers on the top of all the three criterion will be the most significant customers.

RFM (Recency, Frequency, Monetary) analysis

RFM criterion is one of the oldest and most widely used technique for selecting the most significant customers. It supports the selection of customers that are most recent (R), frequent (F), and add a larger monetary value (M) in every transaction.

· Recency (R): Recency is the most important predictor of who is more likely to respond to an offer. Customers who have purchased recently are more likely to purchase again when compared to those who did not purchase recently.

To tag the customers on the basis of recency flag, we will first filter the customers of United Kingdom and take distinct dates of their purchase.

Cust_date_UK=data[data[‘Country’]==”United Kingdom”]

Cust_date_UK=Cust_date_UK[[‘CustomerID’,’date’]].drop_duplicates()

We will tag a customer from 1 to 5 in steps of 2 months i.e. 5 if customer bought in last 2 months else 4 if made a purchase in last 4 months and so on. Finally for a customer, maximum of the recency flag is taken as final recency flag.

def f(row):

if row[‘date’] > 201110:

val = 5

elif row[‘date’] <= 201110 and row[‘date’] > 201108:

val = 4

elif row[‘date’] <= 201108 and row[‘date’] > 201106:

val = 3

elif row[‘date’] <= 201106 and row[‘date’] > 201104:

val = 2

else:

val = 1

return val

Cust_date_UK[‘Recency_Flag’] = Cust_date_UK.apply(f, axis=1)

Cust_date_UK = Cust_date_UK.groupby(“CustomerID”,as_index=False)[“Recency_Flag”].max()

Let us check the distribution of Recency flags:

plt.figure(figsize=(12,8))

sns.countplot(x=”Recency_Flag”, data=Cust_date_UK, color=color[1])

plt.ylabel(‘Count’, fontsize=12)

plt.xlabel(‘Recency_Flag’, fontsize=12)

plt.xticks(rotation=’vertical’)

plt.title(“Frequency of Recency_Flag”, fontsize=15)

plt.show()

It can be seen that around 2200 customers out of 3900 customers have purchased in last 2 months. Note that some customers have not visited the store in last 4–8 months. To regain that lost customer base, business should look out for the reasons why these customers stop visiting the stores.

· Frequency (F): The second most important factor is how frequently these customers purchase. The higher the frequency, the higher is the chances of these responding to the offers.

To tag the customers on the basis of frequency flag, we will take country wise count of distinct invoice number for each customer and then filter out the customers of United Kingdom.

After that, invoice count will be split into 5 equal parts to rank the customers on a scale of 1 to 5 where 5 being the most frequent.

Cust_freq=data[[‘Country’,’InvoiceNo’,’CustomerID’]].drop_duplicates()

#Calculating the count of unique purchase for each customer

Cust_freq_count=Cust_freq.groupby([“Country”,”CustomerID”])[“InvoiceNo”].aggregate(“count”).\

reset_index().sort(‘InvoiceNo’, ascending=False)

Cust_freq_count_UK=Cust_freq_count[Cust_freq_count[‘Country’]==”United Kingdom”]

unique_invoice=Cust_freq_count_UK[[‘InvoiceNo’]].drop_duplicates()

# Dividing in 5 equal parts

unique_invoice[‘Freqency_Band’] = pd.qcut(unique_invoice[‘InvoiceNo’], 5)

unique_invoice=unique_invoice[[‘Freqency_Band’]].drop_duplicates()

unique_invoice

The “qcut” function will divide the entire range of unique invoice count in 5 equal parts (Click Here for more details on qcut function). Below is the output:

Tagging customers in the range of 1 to 5 based on the count of their unique invoice where 5 corresponds to those customers who visit the store most often:

def f(row):

if row[‘InvoiceNo’] <= 13:

val = 1

elif row[‘InvoiceNo’] > 13 and row[‘InvoiceNo’] <= 25:

val = 2

elif row[‘InvoiceNo’] > 25 and row[‘InvoiceNo’] <= 38:

val = 3

elif row[‘InvoiceNo’] > 38 and row[‘InvoiceNo’] <= 55:

val = 4

else:

val = 5

return val

Cust_freq_count_UK[‘Freq_Flag’] = Cust_freq_count_UK.apply(f, axis=1)

Let us check the distribution of Frequency flags:

plt.figure(figsize=(12,8))

sns.countplot(x=”Freq_Flag”, data=Cust_freq_count_UK, color=color[1])

plt.ylabel(‘Count’, fontsize=12)

plt.xlabel(‘Freq_Flag’, fontsize=12)

plt.xticks(rotation=’vertical’)

plt.title(“Frequency of Freq_Flag”, fontsize=15)

plt.show()

It can be seen that most of the customers are visiting the store less than 13 times a year. Therefore, now it will be interesting to see what the variation is in the Monetary value that these customers contribute.

· Monetary Value (M): The third factor is the amount of money these customers have spent on purchases. Customers who have spent higher contribute more value to the business as compared to those who have spent less.

To tag the customers on the basis of monetary flag, we will take country wise sum of Total price for each customer and then filter out the customers of United Kingdom.

#Calculating the Sum of total monetary purchase for each customer

Cust_monetary = data.groupby([“Country”,”CustomerID”])[“Total_Price”].aggregate(“sum”).\

reset_index().sort(‘Total_Price’, ascending=False)

Cust_monetary_UK=Cust_monetary[Cust_monetary[‘Country’]==”United Kingdom”]

Notice that there are some negative values in the total price column. This is the case because when a customer returns the product it purchased, it is stored as a negative value in the quantity column.

Before splitting Total price in 5 parts, we will remove these negative quantities.

unique_price=Cust_monetary_UK[[‘Total_Price’]].drop_duplicates()

unique_price=unique_price[unique_price[‘Total_Price’] > 0]

unique_price[‘monetary_Band’] = pd.qcut(unique_price[‘Total_Price’], 5)

unique_price=unique_price[[‘monetary_Band’]].drop_duplicates()

unique_price

Below is the output:

Tagging customers in the range of 1 to 5 based on their Total price value, where 5 corresponds the customers having highest monetary value:

def f(row):

if row[‘Total_Price’] <= 243:

val = 1

elif row[‘Total_Price’] > 243 and row[‘Total_Price’] <= 463:

val = 2

elif row[‘Total_Price’] > 463 and row[‘Total_Price’] <= 892:

val = 3

elif row[‘Total_Price’] > 892 and row[‘Total_Price’] <= 1932:

val = 4

else:

val = 5

return val

Cust_monetary_UK[‘Monetary_Flag’] = Cust_monetary_UK.apply(f, axis=1)

Let us check the distribution of Monetary flags:

plt.figure(figsize=(12,8))
sns.countplot(x=”Monetory_Flag”, data=Cust_monetory_UK, color=color[1])
plt.ylabel(‘Count’, fontsize=12)
plt.xlabel(‘Monetory_Flag’, fontsize=12)
plt.xticks(rotation=’vertical’)
plt.title(“Frequency of Monetory_Flag”, fontsize=15)
plt.show()

There is an almost equal distribution of customers as far as monetary value is concerned.

Combining all the three flags :

Cust_UK_All=pd.merge(Cust_date_UK,Cust_freq_count_UK[[‘CustomerID’,’Freq_Flag’]],\

on=[‘CustomerID’],how=’left’)

Cust_UK_All=pd.merge(Cust_UK_All,Cust_monetory_UK[[‘CustomerID’,’Monetory_Flag’]],\

on=[‘CustomerID’],how=’left’)

Cust_UK_All.head(10)

From the above table, It is very easy to filter out different type of customers based on their Flag value.

Based on business requirements, various cut-offs are imposed on each of the three parameters. After applying the cut-offs, customers can be classified mainly in three segments:

· Customers clearing all the three cut-offs are the best and the most reliable customers. Business should focus on making customised promotional strategies and loyalty schemes for these customers in order to retain this valuable customer base.

· Customers failing the recency criterion only are those customers who have stopped visiting the store. Business should focus on these customers and look out for the reason why they abandoned visiting the stores.

· Customers clearing the recency criterion but failing frequency criterion are the new customers. Business should provide more incentives and offers to these customers and try to retain these new customers.

Apart from segmenting customers, business can also use RFM criterion to filter out a reliable customer base and perform analysis like Market Basket Analysis to see customer buying pattern or assess the success of marketing strategies by analysing the response of these customers.

Please go to my GitHub repository to access all the codes.

--

--

Ankur Tomar

Data Science | Machine Learning | EXL Services | MS Business Analytics @ University of Minnesota