Discovering Customer Segments using Machine Learning — Part 1 (Data Exploration)

Sean Yonathan T · Published in Analytics Vidhya · Oct 11, 2021 · 5 min read

When we are about to launch a new product, service, or marketing strategy, we often wonder which customer segment would be our most effective target. In this article, I would like to show you an example of how statistics and machine learning can help you approximate customers’ demographic and behavioral patterns.

I tried two different approaches while working on this data. Both aim to, once again, approximate customers’ demographic and behavioral patterns, and both lead to more or less the same conclusion. The approaches are:
- Dimension reduction (PCA) followed by clustering.
- Clustering followed by checking variable importance through a classification model.
I will demonstrate the first approach since it rests on firmer theoretical grounding than the second.

This article covers the data exploration before we move on to the main topic. The data we use as the example is obtained from Kaggle (https://www.kaggle.com/imakash3011/customer-personality-analysis).

If this part doesn’t interest you and you want to skip directly to the machine learning part, follow this link:
https://sea-remus.medium.com/discovering-customer-segments-using-machine-learning-part-2-dimension-reduction-and-clustering-36c6108599f9

Data Description

The raw data obtained from Kaggle consists of 28 columns, excluding ID. I exclude ID since it contains no repeating records, meaning each row describes a unique customer.
We have 9 categorical variables and 19 numerical variables.
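As a minimal loading sketch (the file name and the tab separator here are assumptions based on the Kaggle download):

import pandas as pd

#Load the raw file and drop ID, which is unique per row
data = pd.read_csv('marketing_campaign.csv', sep='\t').drop(columns=['ID'])

#Inspect column types; note the binary 0/1 flags read as integers
#even though they are conceptually categorical
print(data.dtypes)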

We can see the definition of each variable on Kaggle, except for Z_CostContact and Z_Revenue. We should take a look at those two to see what they really mean.

Both variables turn out to hold a constant value. This means that even if we kept them, they would not contribute anything at all, so we can safely drop them.
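A quick way to confirm this, continuing from the loading sketch above:

#Both columns hold exactly one unique value each, so they carry no signal
print(data[['Z_CostContact', 'Z_Revenue']].nunique())
data = data.drop(columns=['Z_CostContact', 'Z_Revenue'])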

Interpreting Interesting Variables

We won’t visualize every variable. Instead, we will look at a handful of variables to get the big picture of the customers we are dealing with.

Year_Birth & Marital_Status

We can see that our data mostly consists of customers born between 1940 and 2000. We should drop customers with Year_Birth below 1935, since it doesn’t make sense to have a customer over 100 years old (e.g., a customer born in 1900 would already be 100 years old in 2000). We should also consider recoding “Alone”, “Absurd”, and “YOLO” in Marital_Status as “Single”, since all three categories seem to indicate a life without a partner.
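A minimal sketch of the checks behind these observations, reusing the data frame from above and assuming matplotlib is available:

import matplotlib.pyplot as plt

#Birth-year distribution: a few very early birth years stand out
data['Year_Birth'].hist(bins=40)
plt.xlabel('Year_Birth')
plt.show()

#Category counts expose the rare 'Alone', 'Absurd' and 'YOLO' labels
print(data['Marital_Status'].value_counts())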

Education

Education also looks reasonable, but if we pay attention to NumWebPurchases, values above 20 have now appeared as outliers twice (once on the Year_Birth plot and again here). We can consider removing those rows from the data as well.
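One way to inspect the affected rows before deciding to drop them:

#Rows with NumWebPurchases above 20, the value flagged on both plots
print(data.loc[data['NumWebPurchases'] > 20,
               ['Year_Birth', 'Education', 'NumWebPurchases']])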

Income

We need to remove the Income values above 400,000 since they appear to be outliers. They also need to go because they could distort the clustering we are about to do.
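A quick way to look at the tail of the Income distribution before committing to the cutoff:

#Summary statistics plus the rows above the 400,000 cutoff
print(data['Income'].describe())
print(data.loc[data['Income'] > 400000, 'Income'])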

Recency

We can see that Recency appears to follow a uniform distribution: every Recency value is roughly equally likely. We will not find a meaningful cutoff for segments within this variable, so we can drop it.
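A sketch of the histogram check behind this reading:

import matplotlib.pyplot as plt

#A roughly flat histogram supports the uniform-distribution reading
data['Recency'].hist(bins=20)
plt.xlabel('Recency')
plt.ylabel('count')
plt.show()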

Correlation of Spending

Correlation between the amounts purchased of each item

We can see that each plot shows the amount purchased of each specific item correlating with every other item. This suggests it is safe for us to reduce these variables into fewer dimensions later.
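A sketch of such a correlation check; the Mnt* spending column names are taken from the Kaggle data dictionary, and seaborn is assumed to be available:

import seaborn as sns
import matplotlib.pyplot as plt

#Pairwise correlations between the spending amounts per product category
spend_cols = ['MntWines', 'MntFruits', 'MntMeatProducts',
              'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
sns.heatmap(data[spend_cols].corr(), annot=True, cmap='coolwarm')
plt.show()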

Correlation of Purchase Location

Correlation also exists between the variables above, although the correlation between web purchases and catalog purchases is not that strong. We should include these variables in the dimension reduction process later as well.
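The same kind of check for the purchase channels (column names again from the Kaggle data dictionary):

#Pairwise correlations between the purchase-channel counts
purchase_cols = ['NumWebPurchases', 'NumCatalogPurchases',
                 'NumStorePurchases', 'NumDealsPurchases']
print(data[purchase_cols].corr().round(2))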

Dt_Customer (Date since join and Age)

Dt_Customer is actually stored as an “object”. We should convert this variable to a date type. Since we don’t know on which date the data was last observed, we can use the maximum value of Dt_Customer as the date we last observed a customer joining. From that maximum date we can also derive each customer’s age from Year_Birth, assuming the last observation date equals the maximum Dt_Customer.

Data Cleaning and (a bit of) Feature Engineering

Given what we have seen above, we can begin cleaning our data and transforming several variables into better ones.

Outliers

The outliers we saw live in Income, Year_Birth, and NumWebPurchases, so we drop rows according to those variables.

#Keep only plausible incomes, birth years, and web purchase counts
firstdrop = data[data.Income < 400000].copy()
seconddrop = firstdrop[firstdrop.Year_Birth > 1935].copy()
final = seconddrop[seconddrop.NumWebPurchases < 20].copy()

Feature Engineering and Missing Values

After removing outliers, we begin wrangling Dt_Customer, the AcceptedCmp columns, and Marital_Status.

import pandas as pd

#Dt_Customer for days since join and Age
#we use the maximum of Dt_Customer as our final date of observation
final.Dt_Customer = pd.to_datetime(final.Dt_Customer)
final['days_since_join'] = (max(final.Dt_Customer) - final.Dt_Customer).dt.days
final['Age'] = max(final.Dt_Customer.dt.year) - final.Year_Birth

#Handling marital status
final['Marital_Status'] = final['Marital_Status'].replace(
    {'Alone': 'Single', 'Absurd': 'Single', 'YOLO': 'Single'})

#Handling binary variables: recode the campaign flags, Complain and Response as No/Yes
binary_cols = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3',
               'AcceptedCmp4', 'AcceptedCmp5', 'Complain', 'Response']
final[binary_cols] = final[binary_cols].replace({0: 'No', 1: 'Yes'})

We also check for missing values and impute them where possible.

#Check for missing values
print(final.isnull().sum())
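One detail worth noting: the Income < 400000 filter above also silently removes rows where Income is missing, since a NaN comparison evaluates to False. If the check still reports missing values in any numeric column, median imputation is one reasonable option (a sketch, not the only choice):

#Median imputation for any numeric column that still shows gaps;
#the median is robust to skewed distributions such as Income
for col in final.select_dtypes('number').columns:
    if final[col].isnull().any():
        final[col] = final[col].fillna(final[col].median())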

We then drop Year_Birth, Dt_Customer, and Recency before exporting the data to our folder for the next step.

datatoclust = final.drop(['Dt_Customer','Recency','Year_Birth'],axis = 1,inplace =False)
datatoclust.to_csv('readytoclust.csv',index = False)

Our data is good to go for clustering!
The next steps, which form the main topic we’re going to discuss (Dimension Reduction & Clustering), are covered in Part 2 of this article.

PART 2:
https://sea-remus.medium.com/discovering-customer-segments-using-machine-learning-part-2-dimension-reduction-and-clustering-36c6108599f9

See you there!
