Customer Personality Analysis Segmentation (Clustering)

A data science use case: customer personality analysis with K-means and agglomerative clustering

Andhika Widyadwatmaja
11 min read · Nov 23, 2021

I got this dataset from Kaggle, so I believe anyone is allowed to use it.

What Is Customer Segmentation?

Customer segmentation is the practice of dividing customers into groups, and there are many ways to do it: customers can be grouped by demographics, behavior, lifestyle, psychographics, value, etc. Segmentation is mostly used for marketing, but there are other reasons to segment your customer base. Using customer segmentation in marketing means you can target the right people with the right messaging about your products, which increases the success of your marketing campaigns.

Understanding The Data

Context

Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business better understand its customers and makes it easier to modify products according to the specific needs, behaviors, and concerns of different types of customers.

Customer personality analysis helps a business tailor its product to target customers in different customer segments. For example, instead of spending money marketing a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that particular segment.

Content

People

  • ID: Customer’s unique identifier
  • Year_Birth: Customer’s birth year
  • Education: Customer’s education level
  • Marital_Status: Customer’s marital status
  • Income: Customer’s yearly household income
  • Kidhome: Number of children in customer’s household
  • Teenhome: Number of teenagers in customer’s household
  • Dt_Customer: Date of customer’s enrollment with the company
  • Recency: Number of days since customer’s last purchase
  • Complain: 1 if customer complained in the last 2 years, 0 otherwise

Products

  • MntWines: Amount spent on wine in last 2 years
  • MntFruits: Amount spent on fruits in last 2 years
  • MntMeatProducts: Amount spent on meat in last 2 years
  • MntFishProducts: Amount spent on fish in last 2 years
  • MntSweetProducts: Amount spent on sweets in last 2 years
  • MntGoldProds: Amount spent on gold in last 2 years

Promotion

  • NumDealsPurchases: Number of purchases made with a discount
  • AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
  • AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
  • AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
  • AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
  • AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
  • Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

Place

  • NumWebPurchases: Number of purchases made through the company’s web site
  • NumCatalogPurchases: Number of purchases made using a catalogue
  • NumStorePurchases: Number of purchases made directly in stores
  • NumWebVisitsMonth: Number of visits to company’s web site in the last month

Goal

The goal is to perform clustering to summarize customer segments.

Exploratory Data Analysis

Of course, the first step is to import the libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Alright, let’s see how scary the data is!

data = pd.read_csv('https://raw.githubusercontent.com/andhikaw789/Customer-Personality-Analysis/main/marketing_campaign.csv', sep='\t')
data.head()

Not so bad, is it? Now, because there’s a birth year and Dt_Customer, I’m going to convert the birth year to age (as of 2021, assuming this is a recent dataset), and convert Dt_Customer to datetime format and derive Years_customer, i.e., how many years someone has been a customer.

data['Age'] = 2021 - data['Year_Birth']
data['Dt_Customer'] = pd.to_datetime(data['Dt_Customer'], format='%d-%m-%Y')
# years as a customer, measured against 2021 to match the age assumption above
data['Years_customer'] = 2021 - data['Dt_Customer'].dt.year

I also want to sum up the total expenses and total accepted campaigns for each customer.

data['Total_Expenses'] = (data['MntWines'] + data['MntFruits'] + data['MntMeatProducts']
                          + data['MntFishProducts'] + data['MntSweetProducts'] + data['MntGoldProds'])
data['Total_Acc_Cmp'] = (data['AcceptedCmp1'] + data['AcceptedCmp2'] + data['AcceptedCmp3']
                         + data['AcceptedCmp4'] + data['AcceptedCmp5'] + data['Response'])

Now, let’s look at the data information:

data.info()

It seems there are some null values. Let’s check:

data.isna().sum()

OK, 24 null values in the Income column. We’ll handle those later. Let’s check for duplicated data by looking at the customer ID column.

data['ID'].nunique()

All 2240 IDs are unique, so there’s no duplicated data.

Data Visualization

plt.figure(figsize=(11,14), facecolor='lightyellow')
data['Age'].value_counts().sort_index(ascending=False).plot(kind='barh')
plt.title('Age')

As we can see from the age graph, most customers are in the 43–56 age range.

plt.figure(figsize=(10,10), facecolor='lightyellow')
sns.set(style='whitegrid')
ax = sns.histplot(data=data, x='Income', binwidth=10000, kde=True)
ax.set_title('Income')

As we can see from the income graph, most customers have an income in the range of 30,000–80,000.

plt.figure(figsize=(8, 9), facecolor='lightyellow')
sns.set(style='whitegrid')
ax = sns.countplot(data=data, x='Education', saturation=1, alpha=0.9, palette='rocket',
                   order=data['Education'].value_counts().index)
ax.set_title('Education')
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x()+0.4, p.get_height()), ha='center', va='top', color='white', size=11)
plt.show()

Based on the education graph, most customers have a ‘Graduation’ education background.

plt.figure(figsize=(9, 8), facecolor='lightyellow')
sns.set(style='whitegrid')
ax = sns.countplot(data=data, x='Marital_Status', saturation=1, alpha=0.9, palette='rocket',
                   order=data['Marital_Status'].value_counts().index)
ax.set_title('Marital_Status')
for p in ax.patches:
    number = '{}'.format(p.get_height().astype('int64'))
    ax.annotate(number, (p.get_x() + p.get_width()/2., p.get_height()), ha='center', va='center',
                xytext=(0,5), textcoords='offset points', color='black', fontweight='semibold', fontsize=9)

Based on the marital status graph, most customers are already married.

plt.figure(figsize=(8, 8), facecolor='lightyellow')
sns.set(style='whitegrid')
ax = sns.countplot(data=data, x='Kidhome', saturation=1, alpha=0.9, palette='rocket',
                   order=data['Kidhome'].value_counts().index)
ax.set_title('Kid home')
for p in ax.patches:
    number = '{}'.format(p.get_height().astype('int64'))
    ax.annotate(number, (p.get_x() + p.get_width()/2., p.get_height()), ha='center', va='center',
                xytext=(0,5), textcoords='offset points', color='black', fontweight='semibold', fontsize=10)

Based on the kid home graph, most customers don’t have any kids.

plt.figure(figsize=(8, 8), facecolor='lightyellow')
sns.set(style='whitegrid')
ax = sns.countplot(data=data, x='Teenhome', saturation=1, alpha=0.9, palette='rocket',
                   order=data['Teenhome'].value_counts().index)
ax.set_title('Teen home')
for p in ax.patches:
    number = '{}'.format(p.get_height().astype('int64'))
    ax.annotate(number, (p.get_x() + p.get_width()/2., p.get_height()), ha='center', va='center',
                xytext=(0,5), textcoords='offset points', color='black', fontweight='semibold', fontsize=10)

Based on the teen home graph, most customers don’t have any teens.

plt.figure(figsize=(12,7), facecolor='lightyellow')
ax = data[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].sum().sort_values(ascending=True).plot(kind='barh')
plt.title('Expenses', pad=15, fontsize=18, fontweight='semibold')
rects = ax.patches
for rect in rects:
    x_value = rect.get_width()
    y_value = rect.get_y() + rect.get_height() / 2
    plt.annotate('{}'.format(x_value), (x_value, y_value), xytext=(-49, 0),
                 textcoords='offset points', va='center', ha='left', color='white', fontsize=11, fontweight='semibold')

Based on the total expenses graph, wine has the highest sales amount.

plt.figure(figsize=(12,7), facecolor='lightyellow')
ax = data[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']].sum().sort_values(ascending=True).plot(kind='barh')
plt.title('Accepted Campaign', pad=15, fontsize=18, fontweight='semibold')
rects = ax.patches
for rect in rects:
    x_value = rect.get_width()
    y_value = rect.get_y() + rect.get_height() / 2
    plt.annotate('{}'.format(x_value), (x_value, y_value), xytext=(-50, 0),
                 textcoords='offset points', va='center', ha='left', color='white', fontsize=14, fontweight='semibold')

Based on the accepted campaign graph, each successive campaign the company ran was accepted by more customers. The highest value is Response, i.e., the customers who accepted the last campaign.

plt.figure(figsize=(12,7), facecolor='lightyellow')
ax = data[['NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']].sum().sort_values(ascending=True).plot(kind='barh')
plt.title('Purchases', pad=15, fontsize=18, fontweight='semibold')
rects = ax.patches
for rect in rects:
    x_value = rect.get_width()
    y_value = rect.get_y() + rect.get_height() / 2
    plt.annotate('{}'.format(x_value), (x_value, y_value), xytext=(-50, 0),
                 textcoords='offset points', va='center', ha='left', color='white', fontsize=14, fontweight='semibold')

Based on the purchases-by-channel graph, the number of purchases made directly in stores is the highest.

sns.set(style='whitegrid')
ax = data[['Education','Total_Expenses']].groupby('Education').sum().sort_values(by='Total_Expenses', ascending=False).plot(kind='bar', figsize=(10,8), legend=None, color='blue')
plt.xticks(rotation=360)
plt.title('Total Expenses by Education', pad=10, fontsize=15, fontweight='semibold')
for p in ax.patches:
    number = '{}'.format(p.get_height().astype('int64'))
    ax.annotate(number, (p.get_x() + p.get_width()/2., p.get_height()), ha='center', va='center',
                xytext=(0,9), textcoords='offset points', color='black', fontsize=13)

As we can see, customers at the ‘Graduation’ education level contribute the most expenses.

sns.set(style='whitegrid')
ax = data[['Marital_Status','Total_Expenses']].groupby('Marital_Status').sum().sort_values(by='Total_Expenses', ascending=False).plot(kind='bar', color='blue', figsize=(10,9), legend=None)
plt.xticks(rotation=360)
plt.title('Total Expenses by Marital Status', pad=10, fontsize=15, fontweight='semibold')
for p in ax.patches:
    number = '{}'.format(p.get_height().astype('int64'))
    ax.annotate(number, (p.get_x() + p.get_width()/2., p.get_height()), ha='center', va='center',
                xytext=(0,9), textcoords='offset points', color='black', fontsize=13)

As we can see, married customers contribute the most expenses.

sns.set(style='whitegrid')
ax = data[['Kidhome','Total_Expenses']].groupby('Kidhome').sum().sort_values(by='Total_Expenses', ascending=False).plot(kind='bar', color='blue', figsize=(9,9), legend=None)
plt.xticks(rotation=360)
plt.title('Total Expenses by Kid Home', pad=10, fontsize=15, fontweight='semibold')
for p in ax.patches:
    number = '{}'.format(p.get_height().astype('int64'))
    ax.annotate(number, (p.get_x() + p.get_width()/2., p.get_height()), ha='center', va='center',
                xytext=(0,9), textcoords='offset points', color='black', fontsize=15)

As we can see, customers without kids at home contribute the most expenses.

sns.set(style='whitegrid')
ax = data[['Teenhome','Total_Expenses']].groupby('Teenhome').sum().sort_values(by='Total_Expenses', ascending=False).plot(kind='bar', color='blue', figsize=(9,9), legend=None)
plt.xticks(rotation=360)
plt.title('Total Expenses by Teen Home', pad=10, fontsize=15, fontweight='semibold')
for p in ax.patches:
    number = '{}'.format(p.get_height().astype('int64'))
    ax.annotate(number, (p.get_x() + p.get_width()/2., p.get_height()), ha='center', va='center',
                xytext=(0,9), textcoords='offset points', color='black', fontsize=15)

As we can see, customers without teens at home contribute the most expenses.

sns.set(style='whitegrid')
ax = data[['Education','Total_Acc_Cmp']].groupby('Education').sum().sort_values(by='Total_Acc_Cmp', ascending=False).plot(kind='bar', figsize=(10,8), legend=None, color='blue')
plt.xticks(rotation=360)
plt.title('Total Acc Campaign by Education', pad=10, fontsize=15, fontweight='semibold')
for p in ax.patches:
    number = '{}'.format(p.get_height().astype('int64'))
    ax.annotate(number, (p.get_x() + p.get_width()/2., p.get_height()), ha='center', va='center',
                xytext=(0,9), textcoords='offset points', color='black', fontsize=13)

Customers with a ‘Graduation’ education background accepted the most campaigns.

sns.set(style='whitegrid')
ax = data[['Marital_Status','Total_Acc_Cmp']].groupby('Marital_Status').sum().sort_values(by='Total_Acc_Cmp', ascending=False).plot(kind='bar', figsize=(13,9), legend=None, color='blue')
plt.xticks(rotation=360)
plt.title('Total Acc Campaign by Marital Status', pad=10, fontsize=15, fontweight='semibold')
for p in ax.patches:
    number = '{}'.format(p.get_height().astype('int64'))
    ax.annotate(number, (p.get_x() + p.get_width()/2., p.get_height()), ha='center', va='center',
                xytext=(0,9), textcoords='offset points', color='black', fontsize=13)

Married customers accepted the most campaigns.

sns.set(style='whitegrid')
ax = data[['Kidhome','Total_Acc_Cmp']].groupby('Kidhome').sum().sort_values(by='Total_Acc_Cmp', ascending=False).plot(kind='bar', figsize=(9,8), legend=None, color='blue')
plt.xticks(rotation=360)
plt.title('Total Acc Campaign by Kid Home', pad=10, fontsize=15, fontweight='semibold')
for p in ax.patches:
    number = '{}'.format(p.get_height().astype('int64'))
    ax.annotate(number, (p.get_x() + p.get_width()/2., p.get_height()), ha='center', va='center',
                xytext=(0,9), textcoords='offset points', color='black', fontsize=13)

Customers without kids at home accepted the most campaigns.

sns.set(style='whitegrid')
ax = data[['Teenhome','Total_Acc_Cmp']].groupby('Teenhome').sum().sort_values(by='Total_Acc_Cmp', ascending=False).plot(kind='bar', figsize=(9,8), legend=None, color='blue')
plt.xticks(rotation=360)
plt.title('Total Acc Campaign by Teen Home', pad=10, fontsize=15, fontweight='semibold')
for p in ax.patches:
    number = '{}'.format(p.get_height().astype('int64'))
    ax.annotate(number, (p.get_x() + p.get_width()/2., p.get_height()), ha='center', va='center',
                xytext=(0,9), textcoords='offset points', color='black', fontsize=13)

Customers without teens at home accepted the most campaigns.

sns.heatmap(data[['Income', 'Total_Expenses','Age', 'Total_Acc_Cmp', 'Recency']].corr(), annot=True)

So, this is the correlation between income, total expenses, age, total accepted campaigns, and recency. The strongest correlation is between income and total expenses, followed by total expenses and total accepted campaigns.

Preprocessing

Handling Missing Values

First, let’s handle the missing values. In this case, only one column has missing values: Income. I’m going to impute those null values with the column’s mean.

data['Income'].fillna(data['Income'].mean(), inplace=True)
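
Income is typically right-skewed, so the median is a common, more outlier-robust alternative to the mean; a minimal sketch of that variant (not the approach used above):

# alternative: median imputation is less sensitive to extreme incomes
# (a no-op here once the mean imputation above has already filled the nulls)
data['Income'] = data['Income'].fillna(data['Income'].median())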

Sometimes I like to copy my data so I can still look through the original data.

data_prep = data.copy()

Now, time for encoding. There are two features that need to be encoded: Education and Marital_Status. I’m using label encoding for marital status and ordinal encoding for education.

from sklearn.preprocessing import LabelEncoder
lenc = LabelEncoder()
lenc.fit(data_prep['Marital_Status'])
data_prep['Marital_Status'] = lenc.transform(data_prep['Marital_Status'])

from sklearn.preprocessing import OrdinalEncoder
edu = ['Basic', 'Graduation', 'Master', '2n Cycle', 'PhD']
ore = OrdinalEncoder(categories=[edu])
ore.fit(data_prep[['Education']])
data_prep['Education'] = ore.transform(data_prep[['Education']])

Next, I’m dropping features that I don’t think I’ll need anymore.

data_prep = data_prep.drop(['ID', 'Year_Birth', 'Dt_Customer', 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3',
                            'AcceptedCmp4', 'AcceptedCmp5', 'Response', 'Complain', 'Z_CostContact', 'Z_Revenue'], axis=1)
data_proc = data_prep.copy()

Time for scaling. In this case, I’m using StandardScaler.

from sklearn.preprocessing import StandardScaler

num_cols = ['Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts',
            'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
            'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'Age', 'Years_customer',
            'Total_Expenses', 'Total_Acc_Cmp']
scaler = StandardScaler()
data_proc[num_cols] = scaler.fit_transform(data_proc[num_cols])

Clustering

Finally, down to clustering. In this case I’ll try K-means and agglomerative clustering.

K-means

from sklearn.cluster import KMeans

wcss = []
for i in range(1,5):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(data_proc)
    wcss.append(kmeans.inertia_)

Let’s see how many clusters is best using the elbow method.

plt.figure(figsize=(10,8))
plt.plot(range(1,5), wcss, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('K-means Clustering')
plt.show()

OK, 2 is the best number of clusters based on the elbow method.
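
Since the elbow can be hard to read by eye, a silhouette score check (higher is better) is a quick way to corroborate the choice; a minimal sketch, reusing data_proc:

from sklearn.metrics import silhouette_score

# silhouette score for each candidate k (k = 1 is undefined for this metric)
for k in range(2, 5):
    km = KMeans(n_clusters=k, init='k-means++', random_state=42).fit(data_proc)
    print('k =', k, 'silhouette score =', silhouette_score(data_proc, km.labels_))

With k = 2 settled, we can fit the final model: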

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(data_proc)
label = kmeans.predict(data_proc)
data_segment = data_prep.copy()
data_segment['Segments'] = label

Now, let’s look at the people’s characteristics in each cluster

data_segment.groupby(['Segments']).mean()
  • Segment 0 (fewer-opportunities) = lower income, more kids, more deal purchases, more web visits, low expenses, few accepted campaigns
  • Segment 1 (well-off) = higher income, fewer kids, high web, catalog, and store purchases, high expenses, more accepted campaigns

Next, we’ll label it

data_segment['Labels'] = data_segment['Segments'].map({1:'well-off',0:'fewer-opportunities'})

Let’s see what the segmentation looks like

plt.figure(figsize=(10,8))
sns.scatterplot(data=data_segment, x='Income', y='Total_Expenses', hue='Labels', palette=['g','r'])
plt.title('Segmentation K-Means')
plt.show()

Agglomerative

from sklearn.cluster import AgglomerativeClustering

agl = AgglomerativeClustering(n_clusters=2)
agl.fit(data_proc)
label = agl.labels_
data_segment_3 = data_prep.copy()
data_segment_3['cluster'] = label
data_segment_3.groupby(['cluster']).mean()

Pretty much the same as the K-means result.
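
To quantify how similar the two solutions are, we can compare the label sets directly; a minimal sketch using a crosstab and the adjusted Rand index (1.0 means identical partitions), assuming the fitted kmeans and agl objects from above are still in scope:

from sklearn.metrics import adjusted_rand_score

# overlap between the K-means and agglomerative assignments
print(pd.crosstab(kmeans.labels_, agl.labels_, rownames=['kmeans'], colnames=['agglomerative']))
print('ARI:', adjusted_rand_score(kmeans.labels_, agl.labels_))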

Let’s find the optimal number of clusters based on the silhouette score.

from sklearn.metrics import silhouette_score

for i in range(2,10):
    agl = AgglomerativeClustering(n_clusters=i)
    agl.fit(data_proc)
    label = agl.labels_
    score = silhouette_score(data_proc, label)
    print("For k =", i, "silhouette score =", score)

OK, the optimum is the same (2 clusters).

Let’s see what the segmentation looks like
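
A scatterplot analogous to the K-means one works here; a minimal sketch, where mapping cluster 1 to ‘well-off’ is an assumption that should be verified against the group means above:

# label the agglomerative clusters like the K-means segments
# (the 1 -> 'well-off' mapping is an assumption; check the cluster means first)
data_segment_3['Labels'] = data_segment_3['cluster'].map({1:'well-off', 0:'fewer-opportunities'})

plt.figure(figsize=(10,8))
sns.scatterplot(data=data_segment_3, x='Income', y='Total_Expenses', hue='Labels', palette=['g','r'])
plt.title('Segmentation Agglomerative')
plt.show()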

Conclusion

The clustering is mostly driven by income, expenses, number of purchases by category, and total accepted campaigns. Education level, marital status, and age did not affect the clustering. So, there are 2 segments, which is the best number of clusters according to both models: Segment 0, where customers have low income and low expenses, and Segment 1, where customers have high income and high expenses, which is the better segment to focus on.
