# Segmentation in Python to find your best Customers

Lets say that you want run a promotional campaign and give personalized offers and discounts to your top customers. But how will you find your top customers? Solution is RFM ( Recency, Frequency, and Monetary) analysis, which is a marketing technique used to determine quantitatively which customers are the best ones by examining how recently a customer has purchased (recency), how often they purchase (frequency), and how much the customer spends (monetary).

We will create a model based on this technique to find out the top ten customers. This blog is inspired from Susan Li’s blog which you can find here. But we will be this doing this a bit differently. You can find the data and full code over here.

Our top customers will be:

- Who have purchased recently (least recency)
- Who have purchased frequently ( high frequency)
- who have high purchase amount (high monetary value)

import pandas as pd

import warnings

warnings.filterwarnings('ignore')

df = pd.read_excel("Online Retail.xlsx")

df.head()

df1 = df

df.head()

Lets find out how many different countries are there in the data. pandas nunique method gives us all the distinct values.

df1.Country.nunique()

Running this we get 38 countries.

Analyzing the data further, we find that most of the data (more than 90%) is for transactions that happened in United Kingdom.

customer_country=df1[['Country','CustomerID']].drop_duplicates()

customer_country.head()

We will first group by country then aggregate over customerID to find out number of customers from each country.

customer_country.groupby(['Country'])['CustomerID'].aggregate('count').reset_index().sort_values('CustomerID', ascending=False)

As most of data belongs from UK we will consider only this data. Lets filter out rest of the it.

df1=df1.loc[df1['Country']=='United Kingdom']

we will check if there are any null values in each column

df1.isnull().sum(axis=0)

So there are 133660 null values in customerID. We will have to remove these nulls.

df1=df1[pd.notnull(df1['CustomerID'])]

df1.isnull().sum(axis=0)

Now our data does not have any null values. But we still need to check for any negative values that might be present.

df1.Quantity.min()

df1.UnitPrice.min()

we see that quantity has some negative values. So we remove them.

df1=df1[df['Quantity']>0]

Now lets see how our data is structured.

df1.info()

df1.shape

So we are dealing with 8 columns and 354345 entries.

Now lets add a total price column which will be UnitPrice * Quantity

df1['Total Price']=df1['UnitPrice']*df1['Quantity']

df1.head()

Find the last order date in our data:

df1.InvoiceDate.max()

Timestamp('2011-12-09 12:49:00')

So for recency we will take 2011–12–10 as our reference date.

import datetime as dt

NOW = dt.datetime(2011,12,10)

Lets build our model now. First we will have to create a data frame with RFM values.

df1['InvoiceDate']=pd.to_datetime(df['InvoiceDate'])

rfmTable=df1.groupby('CustomerID').agg({'InvoiceDate':lambda x:(NOW-x.max()).days,'InvoiceNo':lambda x:len(x),'Total Price':lambda x: x.sum()})

rfmTable.head()

‘Now’ is a date type object; so for data processing, we will convert the InvoiceDate column into pandas date object. Next we groupby CustomerID and aggregate using lambda functions.

- For InvoiceDate- we want the number days between the last date (Now) and last date of purchase for a particular customer.
- len(x) will give us how many distinct times the customer has purchased.
- Total Price is the sum of all the item’s prices in the invoice.

lets rename these columns as frequency, monetary and recency

rfmTable.rename(columns={'InvoiceDate':'Recency','InvoiceNo':'Frequency','Total Price':'monetary_value'},inplace=True)

rfmTable.head()

**Insight: **We can definitely say that 1st customer is not our top customer because he purchased only once 325 days ago.

lets put these customers into bins. We will categorize each customers into quartiles — [0.25,0.5,0,75]. So 0.25 is 25th percentile, 0.5 is 50th percentile and so on.

quantiles = rfmTable.quantile(q=[0.25,0.5,0.75])

quantiles = quantiles.to_dict()

quantiles

What this means is- lets take frequency: for 0.25 quantile we have 17. So this says that 25% of values are less than 17, 50% values are less than 41 and so on.

rfmTable.describe()

our aim is to find customers who have lowest recency, and high monetary value and frequency.

We will define a function that will assign scores to the customer based on their RFM values.

def RScore(x,p,quantiles):

if x<=quantiles[p][0.25]:

return 4

elif x<=quantiles[p][0.5]:

return 3

elif x<=quantiles[p][0.75]:

return 2

else:

return 1

def FMScore(x,p,quantiles):

if x<=quantiles[p][0.25]:

return 1

elif x<=quantiles[p][0.5]:

return 2

elif x<=quantiles[p][0.75]:

return 3

else:

return 4

So for recency, If the actual value of number of days of customer’s most recent purchase is less than 17, we will assign that customer a score of 4, If it is less than 50 a score of 3 and so on.

We will do exactly opposite for Frequency and monetary value as we want those values to be high.

segmented_rfm['r_quartile'] = segmented_rfm['Recency'].apply(RScore, args=('Recency',quantiles,))

segmented_rfm['f_quartile'] = segmented_rfm['Frequency'].apply(FMScore, args=('Frequency',quantiles,))

segmented_rfm['m_quartile'] = segmented_rfm['monetary_value'].apply(FMScore, args=('monetary_value',quantiles,))

the parameters x=> Recency series in segmented_rfm.

p=> 1 st argument in args. (Recency or Frequency or monetary_value)

segmented_rfm.head()

Now all we have to do is add these individual rfm scores, and sort out for the highest scores.

segmented_rfm['RFMScore']= segmented_rfm.r_quartile.map(str)+segmented_rfm.f_quartile.map(str)+segmented_rfm.m_quartile.map(str)

segmented_rfm[segmented_rfm['RFMScore']=='444'].sort_values('monetary_value',ascending=False).head(10)

That’s it! We have our top ten customers to whom we can send the promotional offers. Happy Marketing !