Unveiling Customer Insights: Unsupervised Learning for Dynamic Segmentation

Abdulraqib Omotosho
Jun 18, 2023

Customer Personality Analysis is a powerful tool that helps businesses gain a deep understanding of their ideal customers. By using techniques such as unsupervised learning, companies can uncover valuable insights about customer needs, behaviors, and concerns.

This analysis enables businesses to direct their products and services to different customer segments, rather than using a generic approach. For example, instead of marketing a new product to every customer in their database, a company can identify the most receptive customer segment and focus their efforts on that specific group.

In this article, we will explore how unsupervised learning can be used for customer segmentation. We’ll cover the process of analyzing customer data, extracting insights, and applying them to improve business strategies. We will then perform clustering to summarize customer segments.

The dataset used for this project can be found here.

Importing libraries and packages

# data analysis libraries
import pandas as pd
import numpy as np

# data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')
import plotly.express as px

# machine learning
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# remove warnings
import warnings
warnings.filterwarnings('ignore')

Loading the dataset and cleaning it

# read csv file into notebook
df = pd.read_csv('marketing_campaign.csv', sep='\t')
df.head()
# concise summary of a DataFrame.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 2240 non-null int64
1 Year_Birth 2240 non-null int64
2 Education 2240 non-null object
3 Marital_Status 2240 non-null object
4 Income 2216 non-null float64
5 Kidhome 2240 non-null int64
6 Teenhome 2240 non-null int64
7 Dt_Customer 2240 non-null object
8 Recency 2240 non-null int64
9 MntWines 2240 non-null int64
10 MntFruits 2240 non-null int64
11 MntMeatProducts 2240 non-null int64
12 MntFishProducts 2240 non-null int64
13 MntSweetProducts 2240 non-null int64
14 MntGoldProds 2240 non-null int64
15 NumDealsPurchases 2240 non-null int64
16 NumWebPurchases 2240 non-null int64
17 NumCatalogPurchases 2240 non-null int64
18 NumStorePurchases 2240 non-null int64
19 NumWebVisitsMonth 2240 non-null int64
20 AcceptedCmp3 2240 non-null int64
21 AcceptedCmp4 2240 non-null int64
22 AcceptedCmp5 2240 non-null int64
23 AcceptedCmp1 2240 non-null int64
24 AcceptedCmp2 2240 non-null int64
25 Complain 2240 non-null int64
26 Z_CostContact 2240 non-null int64
27 Z_Revenue 2240 non-null int64
28 Response 2240 non-null int64
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB
# checking for nulls
df.isnull().sum()
ID                      0
Year_Birth 0
Education 0
Marital_Status 0
Income 24
Kidhome 0
Teenhome 0
Dt_Customer 0
Recency 0
MntWines 0
MntFruits 0
MntMeatProducts 0
MntFishProducts 0
MntSweetProducts 0
MntGoldProds 0
NumDealsPurchases 0
NumWebPurchases 0
NumCatalogPurchases 0
NumStorePurchases 0
NumWebVisitsMonth 0
AcceptedCmp3 0
AcceptedCmp4 0
AcceptedCmp5 0
AcceptedCmp1 0
AcceptedCmp2 0
Complain 0
Z_CostContact 0
Z_Revenue 0
Response 0
dtype: int64
Null value counts: only the Income column has missing entries, 24 in total. (image by Author)
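Since Income is the only column with missing values, those 24 rows should be dropped (or imputed) before modeling, because scaling and clustering cannot handle NaNs. A minimal sketch on a toy stand-in frame (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# toy stand-in for the marketing dataframe (hypothetical values)
toy = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Income': [58138.0, np.nan, 71613.0, np.nan]})

# drop rows whose Income is missing (24 such rows in the real dataset)
toy = toy.dropna(subset=['Income']).reset_index(drop=True)
print(toy.shape)  # (2, 2)
```

Mean or median imputation would be a reasonable alternative if losing the 24 rows mattered.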
# checking for categorical features
cat_cols = df.select_dtypes(include='object').columns
for cat in cat_cols:
    print(f'\nValue counts for: {cat}\n\n {df[cat].value_counts()}')
    print('-' * 100)
Value counts for: Education

Graduation 1116
PhD 481
Master 365
2n Cycle 200
Basic 54
Name: Education, dtype: int64
----------------------------------------------------------------------------------------------------

Value counts for: Marital_Status

Married 857
Together 573
Single 471
Divorced 232
Widow 76
Alone 3
Absurd 2
YOLO 2
Name: Marital_Status, dtype: int64
----------------------------------------------------------------------------------------------------

Value counts for: Dt_Customer

31-08-2012 12
12-09-2012 11
14-02-2013 11
12-05-2014 11
20-08-2013 10
..
05-08-2012 1
18-11-2012 1
09-05-2014 1
26-06-2013 1
09-01-2014 1
Name: Dt_Customer, Length: 662, dtype: int64
----------------------------------------------------------------------------------------------------

Feature Engineering

Here, we would engineer new columns from the previously existing ones to enhance the performance of the model.

# customer age: since the customer registration with the company was between 2012 and 2014, we assume that the data was collected in January 2015 for simplicity.
df['Age'] = 2015 - df['Year_Birth']

# total spending
spending_cols = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
df['Total_Spending'] = df[spending_cols].sum(axis=1)

# total number of dependents in each customer's household
# df['Family_Size'] = df['Kidhome'] + df['Teenhome'] + 1 # marital status may be included.
df['No_of_Dependents'] = df['Kidhome'] + df['Teenhome']

# total number of campaigns accepted by each customer
campaign_cols = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5']
df['Total_Acceptance'] = df[campaign_cols].sum(axis=1)

# number of months since each customer's enrollment with the company until the current date
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], format='%d-%m-%Y')  # dates are day-first
df['Months_of_Enrollment'] = ((pd.Timestamp('2015-01-01') - df['Dt_Customer']) / np.timedelta64(1, 'M')).astype(int)

# most frequent purchasing channel
purchase_cols = ['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']
df['Purchasing_Channel'] = df[purchase_cols].idxmax(axis=1).str.replace('Num', '')

# total purchases
total_purchases_col = ['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumDealsPurchases']
df['Total_Purchases'] = df[total_purchases_col].sum(axis=1)

# website activity
website_cols = ['NumWebPurchases', 'NumWebVisitsMonth']
df['Website_Activity'] = df[website_cols].sum(axis=1)

# total number of campaigns responded to by each customer
response_cols = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']
df['Campaign_Response'] = df[response_cols].sum(axis=1)

# marital status new
df['Marital_Status_New'] = df['Marital_Status'].replace({
    'Married': 'Partner',
    'Together': 'Partner',
    'Single': 'No_Partner',
    'Divorced': 'No_Partner',
    'Widow': 'No_Partner',
    'Alone': 'No_Partner',
    'Absurd': 'No_Partner',
    'YOLO': 'No_Partner'
})

# education level
df['Education_Level'] = df['Education'].replace({
    'Graduation': 'Graduate',
    'PhD': 'Post_Graduate',
    'Master': 'Post_Graduate',
    '2n Cycle': 'Undergraduate',
    'Basic': 'Undergraduate'
})

# parental status
df['Parent_Status'] = ((df['Kidhome'] > 0) | (df['Teenhome'] > 0)).astype(int)

df['Income_Ratio'] = df['Total_Spending'] / df['Income']

# dropping redundant columns
df = df.drop([
'Year_Birth', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
'Kidhome', 'Teenhome', 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Dt_Customer',
'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'Response', 'Marital_Status', 'Education',
'Z_CostContact', 'Z_Revenue', 'NumWebVisitsMonth', 'NumDealsPurchases'
], axis=1)

An explanation of the new columns engineered from the code above:

1. Age: The “Age” column represents the age of each customer. It is calculated by subtracting the “Year_Birth” column from the year 2015, assuming the data was collected in January 2015 for simplicity.

2. Total_Spending: The “Total_Spending” column represents the total amount spent by each customer on various product categories. It is obtained by summing up the spending columns, including “MntWines,” “MntFruits,” “MntMeatProducts,” “MntFishProducts,” “MntSweetProducts,” and “MntGoldProds.”

3. No_of_Dependents: The “No_of_Dependents” column represents the total number of dependents in each customer’s household. It is calculated by adding the “Kidhome” and “Teenhome” columns.

4. Total_Acceptance: The “Total_Acceptance” column represents the total number of marketing campaigns accepted by each customer. It is obtained by summing up the acceptance columns, including “AcceptedCmp1,” “AcceptedCmp2,” “AcceptedCmp3,” “AcceptedCmp4,” and “AcceptedCmp5.”

5. Months_of_Enrollment: The “Months_of_Enrollment” column represents the number of months since each customer enrolled with the company until the current date. It is calculated by subtracting the customer’s enrollment date (stored in the “Dt_Customer” column) from January 1, 2015, and converting the resulting time difference to the number of months.

6. Purchasing_Channel: The “Purchasing_Channel” column represents the most frequent purchasing channel used by each customer. It is determined by identifying the column with the highest value among “NumWebPurchases,” “NumCatalogPurchases,” and “NumStorePurchases” and removing the “Num” prefix.

7. Total_Purchases: The “Total_Purchases” column represents the total number of purchases made by each customer. It is obtained by summing up the columns related to different purchase types, including “NumWebPurchases,” “NumCatalogPurchases,” “NumStorePurchases,” and “NumDealsPurchases.”

8. Website_Activity: The “Website_Activity” column represents the overall activity of customers on the company’s website. It is calculated by summing up the columns “NumWebPurchases” and “NumWebVisitsMonth.”

9. Campaign_Response: The “Campaign_Response” column represents the total number of campaigns responded to by each customer. It is obtained by summing up the response columns, including “AcceptedCmp1,” “AcceptedCmp2,” “AcceptedCmp3,” “AcceptedCmp4,” “AcceptedCmp5,” and “Response.”

10. Marital_Status_New: The “Marital_Status_New” column represents a new categorization of marital status. It replaces the original marital status values with new labels, such as “Partner” for “Married” and “Together,” and “No_Partner” for other statuses like “Single,” “Divorced,” “Widow,” “Alone,” “Absurd,” and “YOLO.”

11. Education_Level: The “Education_Level” column represents a new categorization of education level. It replaces the original education values with simplified labels, such as “Graduate” for “Graduation,” “Post_Graduate” for “PhD” and “Master,” and “Undergraduate” for “2n Cycle” and “Basic.”

12. Parent_Status: The “Parent_Status” column represents the parental status of each customer. It is calculated by checking if a customer has at least one child at home (indicated by a value greater than 0 in the “Kidhome” column) or at least one teenager at home (indicated by a value greater than 0 in the “Teenhome” column).

13. Income_Ratio: The “Income_Ratio” column represents the ratio of total spending to income for each customer. It is calculated by dividing the “Total_Spending” column by the “Income” column.

Finally, we drop several redundant columns that are no longer needed for analysis, such as columns related to birth year, spending amounts, household information, campaign acceptance, customer enrollment date, website visits, response, marital status, education, and others. These columns are dropped using the drop() function with the specified column names and the `axis=1` parameter to indicate that the columns should be dropped.
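A note on the Dt_Customer parsing in step 5: the dates are day-first (e.g. 31-08-2012), so `pd.to_datetime` should be given an explicit format (or `dayfirst=True`); otherwise ambiguous dates can silently flip day and month. A minimal demonstration:

```python
import pandas as pd

# '05-08-2012' is 5 August 2012 in the dataset's DD-MM-YYYY format
s = pd.Series(['05-08-2012'])

explicit = pd.to_datetime(s, format='%d-%m-%Y')
dayfirst = pd.to_datetime(s, dayfirst=True)

print(explicit.dt.month.iloc[0])  # 8 (August)
print(dayfirst.dt.month.iloc[0])  # 8 as well
```

Without either argument, pandas would read this value as May 8th, shifting Months_of_Enrollment for every ambiguous date.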

random samples of the new data (image by Author)
Descriptive statistics (image by Author)
concise summary of the data (image by Author)

Exploratory Data Analysis

# correlation matrix
corr = df.corr(numeric_only=True)  # object columns would make corr() raise in recent pandas
plt.figure(figsize=[20, 10])
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)

plt.title('Correlation Matrix', fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show();
correlation matrix (image by Author)
# income distribution
plt.figure(figsize=[7, 10])
sns.boxplot(y=df['Income'], color='lightblue', linewidth=1.5)

income_stats = df['Income'].describe()
q1 = income_stats['25%']
q3 = income_stats['75%']
median = income_stats['50%']
min_val = income_stats['min']
max_val = income_stats['max']
range_val = max_val - min_val

iqr = q3 - q1
outliers_income = df[(df['Income'] < q1 - 1.5 * iqr) | (df['Income'] > q3 + 1.5 * iqr)]['Income']

plt.text(0.9, q1, f'25th Percentile: {q1:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')
plt.text(0.9, median, f'Median: {median:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')
plt.text(0.9, q3, f'75th Percentile: {q3:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')
plt.text(0.9, min_val, f'Minimum: {min_val:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')
plt.text(0.9, max_val, f'Maximum: {max_val:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')
plt.text(0.9, min_val + (range_val / 2), f'Range: {range_val:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')

lower_fence = q1 - 1.5 * iqr
plt.text(0.9, lower_fence, f'Lower Fence: {lower_fence:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')
upper_fence = q3 + 1.5 * iqr
plt.text(0.9, upper_fence, f'Upper Fence: {upper_fence:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')

plt.title('Income Distribution', fontsize=16)
plt.xlabel('Income', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.grid(axis='x', linestyle='--')
sns.despine()
plt.show();
we can see there are outliers in the Income distribution (image by Author)
# Age distribution
plt.figure(figsize=[7, 10])
sns.boxplot(y=df['Age'], color='lightblue', linewidth=1.5)

age_stats = df['Age'].describe()
q1 = age_stats['25%']
q3 = age_stats['75%']
median = age_stats['50%']
min_val = age_stats['min']
max_val = age_stats['max']
range_val = max_val - min_val

iqr = q3 - q1
outliers_age = df[(df['Age'] < q1 - 1.5 * iqr) | (df['Age'] > q3 + 1.5 * iqr)]['Age']

plt.text(0.9, q1, f'25th Percentile: {q1:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')
plt.text(0.9, median, f'Median: {median:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')
plt.text(0.9, q3, f'75th Percentile: {q3:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')
plt.text(0.9, min_val, f'Minimum: {min_val:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')
plt.text(0.9, max_val, f'Maximum: {max_val:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')
plt.text(0.9, min_val + (range_val / 2), f'Range: {range_val:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')

lower_fence = q1 - 1.5 * iqr
plt.text(0.9, lower_fence, f'Lower Fence: {lower_fence:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')
upper_fence = q3 + 1.5 * iqr
plt.text(0.9, upper_fence, f'Upper Fence: {upper_fence:.2f}', va='center', ha='left', color='darkblue', fontweight='bold')

plt.title('Age Distribution', fontsize=16)
plt.xlabel('Age', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.grid(axis='x', linestyle='--')
sns.despine()
plt.show();
we can see there are outliers in the Age distribution (image by Author)
# Complain distribution
plt.figure(figsize=[10, 7])
counts = df['Complain'].value_counts()
labels = [f"{category}\n{count / len(df) * 100:.1f}%" for category, count in counts.items()]
colors = ['#8e0201', '#e8cccc']

fig, ax = plt.subplots()
ax.pie(counts, labels=labels, startangle=50, counterclock=False, pctdistance=0.8, labeldistance=1.2, colors=colors)
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig.gca().add_artist(centre_circle)
ax.set_title('Distribution of Complain', fontsize=16, loc='left', pad=30)
ax.axis('equal')
plt.show();
Majority of customers didn’t make a complaint in the last 2 years. (image by Author)
# total spending
plt.figure(figsize=[10, 7])
sns.histplot(data=df, x='Total_Spending', color='#8e0201')
plt.title('Total Spending Distribution', fontsize=16)
plt.show();
# no of dependents
plt.figure(figsize=[10, 7])
order = df['No_of_Dependents'].value_counts().index
sns.countplot(x=df['No_of_Dependents'], order=order, color='#8e0201')
plt.title('No of Dependents in Customers household', fontsize=16)
plt.show();
Majority of the customers have a single dependent. (image by Author)
# Purchasing_Channel distribution
plt.figure(figsize=[10, 7])
counts = df['Purchasing_Channel'].value_counts()
labels = [f"{category}\n{count / len(df) * 100:.1f}%" for category, count in counts.items()]
colors = ['#8e0201', '#c93432', '#e8cccc']

fig, ax = plt.subplots()
ax.pie(counts, labels=labels, startangle=50, counterclock=False, pctdistance=0.8, labeldistance=1.2, colors=colors)
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig.gca().add_artist(centre_circle)
ax.set_title('Distribution of Purchasing_Channel', fontsize=16, loc='left', pad=30)
ax.axis('equal')
plt.show();
Majority of the customers made purchases via the Store. (image by Author)
# total Total_Purchases
plt.figure(figsize=[10, 7])
sns.histplot(data=df, x='Total_Purchases', color='#8e0201')
plt.title('Total_Purchases Distribution', fontsize=16)
plt.show();
# Website_Activity dist
plt.figure(figsize=[10, 7])
sns.histplot(df['Website_Activity'], kde=True, color='#8e0201')
plt.title('Website Activity Distribution', fontsize=16)
plt.show();
The Website Activity distribution is positively skewed
# Marital_Status_New distribution
plt.figure(figsize=[10, 7])
counts = df['Marital_Status_New'].value_counts()
labels = [f"{category}\n{count / len(df) * 100:.1f}%" for category, count in counts.items()]
colors = ['#8e0201', '#e8cccc']

fig, ax = plt.subplots()
ax.pie(counts, labels=labels, startangle=50, counterclock=False, pctdistance=0.8, labeldistance=1.2, colors=colors)
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig.gca().add_artist(centre_circle)
ax.set_title('Distribution of Marital Status', fontsize=16, loc='left', pad=30)
ax.axis('equal')
plt.show();
Majority of the customers have partners. (image by Author)
# Months_of_Enrollment distribution
plt.figure(figsize=[10, 7])
sns.histplot(df['Months_of_Enrollment'], color='#8e0201')
plt.title('Months_of_Enrollment Distribution', fontsize=16)
plt.show();
Most customers enrolled between 20 and 25 months ago.
# Education_Level distribution
plt.figure(figsize=[10, 7])
counts = df['Education_Level'].value_counts()
labels = [f"{category}\n{count / len(df) * 100:.1f}%" for category, count in counts.items()]
colors = ['#8e0201', '#c93432', '#e8cccc']

fig, ax = plt.subplots()
ax.pie(counts, labels=labels, startangle=50, counterclock=False, pctdistance=0.8, labeldistance=1.2, colors=colors)
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig.gca().add_artist(centre_circle)
ax.set_title('Distribution of Education Level', fontsize=16, loc='left', pad=30)
ax.axis('equal')
plt.show();
Majority of the customers are Graduates. (image by Author)
# Parent_Status distribution
plt.figure(figsize=[10, 7])
counts = df['Parent_Status'].value_counts()
labels = [f"{category}\n{count / len(df) * 100:.1f}%" for category, count in counts.items()]
colors = ['#8e0201', '#e8cccc']

fig, ax = plt.subplots()
ax.pie(counts, labels=labels, startangle=50, counterclock=False, pctdistance=0.8, labeldistance=1.2, colors=colors)
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig.gca().add_artist(centre_circle)
ax.set_title('Distribution of Parental Status', fontsize=16, loc='left', pad=30)
ax.axis('equal')
plt.show();
Majority of the customers are parents. (image by Author)
# Total Spending by Income Distribution
plt.figure(figsize=[10, 7])
sns.scatterplot(data=df, x='Total_Spending', y='Income', color='#8e0201')
plt.title('Total Spending by Income Distribution', fontsize=16)
plt.show();
The Income of customers is positively correlated with the total amount they have spent. (image by Author)
# Days of Enrollment by Age distribution
plt.figure(figsize=[10, 7])
sns.lineplot(data=df, x='Months_of_Enrollment', y='Age', color='#8e0201')
plt.title('Number of Months of Enrollment by Age distribution', fontsize=16)
plt.show();
plt.figure(figsize=[10, 7])
sns.barplot(data=df, x='No_of_Dependents', y='Income', color='#8e0201')
plt.title('Income by Number of Dependents')
plt.xlabel('No of Dependents')
plt.ylabel('Income')
plt.show();
Customers with no dependents/children have the highest income. (image by Author)

Data Preprocessing and Scaling

Here we would check for outliers and then scale the dataset.

# checking for outliers
def print_outliers(column):
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = column[(column < lower_fence) | (column > upper_fence)]

    print(f"Outliers in {column.name}:")
    print(outliers)

print_outliers(df['Age'])
print_outliers(df['Income'])
Outliers in Age:
192 115
239 122
339 116
Name: Age, dtype: int64
Outliers in Income:
164 157243.0
617 162397.0
655 153924.0
687 160803.0
1300 157733.0
1653 157146.0
2132 156924.0
2233 666666.0
Name: Income, dtype: float64

Removing outliers in the Age and Income columns is necessary for several reasons. Firstly, outliers in the data, such as ages of 115 or 122, and unusually high incomes like 666,666, may indicate errors or data entry mistakes. Secondly, outliers can have a significant impact on the performance of statistical models, as they can disproportionately influence the results. The removal of outliers ensures the model can focus on the majority of the data and provide more accurate predictions.

# removing outliers
def remove_outliers(column):
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - (1.5 * iqr)
    upper_fence = q3 + (1.5 * iqr)

    no_outliers = column[(column >= lower_fence) & (column <= upper_fence)]
    return no_outliers

no_outliers_age = remove_outliers(df['Age'])
no_outliers_income = remove_outliers(df['Income'])
df = df[(df['Age'].isin(no_outliers_age)) & (df['Income'].isin(no_outliers_income))]
Correlation matrix after outliers have been removed (image by Author)
# apply label encoding and scale the dataset
le = LabelEncoder()
cat_cols = df.select_dtypes(include='object').columns
for col in cat_cols:
    df[col] = le.fit_transform(df[col])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
df_scaled = pd.DataFrame(scaled_data, columns=df.columns)
df_scaled.head()
dataset after it has been scaled and encoded (image by Author)
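One caveat worth noting: LabelEncoder assigns an arbitrary integer order to nominal categories such as Purchasing_Channel, and distance-based clustering will treat that order as meaningful. One-hot encoding is a common alternative; a minimal sketch with `pd.get_dummies` on a toy frame (hypothetical rows):

```python
import pandas as pd

# toy frame with a nominal feature (hypothetical rows)
toy = pd.DataFrame({'Purchasing_Channel': ['StorePurchases', 'WebPurchases',
                                           'CatalogPurchases', 'StorePurchases']})

# one binary column per category instead of a single ordered integer
encoded = pd.get_dummies(toy, columns=['Purchasing_Channel'])
print(sorted(encoded.columns))
```

Label encoding keeps the feature count low, which matters before PCA, but one-hot encoding avoids implying that, say, Store > Catalog.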

Dimensionality Reduction Using PCA

Dimensionality reduction is a technique used to reduce the number of features or variables in a dataset while preserving the essential information. One commonly used technique for dimensionality reduction is Principal Component Analysis (PCA), which transforms the original features into a new set of orthogonal variables called principal components. Here, I used PCA not only for dimensionality reduction but also to mitigate the impact of highly correlated features, resulting in a more robust and accurate clustering analysis.

# copy the dataset into a new variable X
X = df_scaled.copy()

# Apply PCA for dimensionality reduction. Keep 3 components
pca = PCA(n_components=3)

#Fit the model with X and apply the dimensionality reduction on X.
X_pca = pca.fit_transform(X)

# Plot the reduced dimension using PCA
fig = plt.figure(figsize=(10, 8))
x, y, z = X_pca[:, 0], X_pca[:, 1], X_pca[:, 2]
colors = '#8e0201'
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x,y, z, c=colors, marker="d")
ax.set_title('Data in Reduced Dimension')
plt.show()
The dataset plotted in the reduced three-dimensional space. (image by Author)
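It is also worth checking how much of the original variance the three components retain; PCA exposes this via `explained_variance_ratio_`. A quick sketch on synthetic stand-in data (random numbers, not the marketing dataframe):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))  # stand-in for the scaled dataframe

pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)        # per-component share of variance
print(pca.explained_variance_ratio_.sum())  # total variance the 3 components keep
```

If the retained share is low on the real data, increasing `n_components` before clustering may be worthwhile.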

Checking the optimal number of clusters

To determine the optimal number of clusters, we use the following evaluation techniques:

Elbow Curve method

Silhouette score Curve method

Calinski-Harabasz index curve

min_clusters = 2
max_clusters = 10
distortions = []
silhouette_scores = []
ch_scores = []

# Fit K-means and compute the sum of squared distances, silhouette scores, and CH scores
for k in range(min_clusters, max_clusters + 1):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # fixed seed for reproducibility
    kmeans.fit(df_scaled)
    distortions.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(df_scaled, kmeans.labels_))
    ch_scores.append(calinski_harabasz_score(df_scaled, kmeans.labels_))

# Plot the elbow curve
plt.figure(figsize=[10, 7])
plt.plot(range(min_clusters, max_clusters + 1), distortions, marker='d', label='Sum of Squared Distances', color='#8e0201')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Value')
plt.title('Elbow Curve')
plt.legend()
plt.show()

rate_of_change = np.diff(distortions)
rate_of_change_ratio = rate_of_change[1:] / rate_of_change[:-1]
optimal_k_elbow = np.argmax(rate_of_change_ratio) + min_clusters + 1

# Print the optimal number of clusters based on the elbow method
print(f"The optimal number of clusters (Elbow method) is: {optimal_k_elbow}")

# Plot the silhouette scores
plt.figure(figsize=[10, 7])
plt.plot(range(min_clusters, max_clusters + 1), silhouette_scores, marker='d', label='Silhouette Score', color='#8e0201')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Value')
plt.title('Silhouette Analysis')
plt.legend()
plt.show()

# Identify the optimal number of clusters using silhouette score
optimal_k_silhouette = np.argmax(silhouette_scores) + min_clusters

# Print the optimal number of clusters based on the silhouette method
print(f"The optimal number of clusters (Silhouette method) is: {optimal_k_silhouette}")

# Plot the Calinski-Harabasz scores
plt.figure(figsize=[10, 7])
plt.plot(range(min_clusters, max_clusters + 1), ch_scores, marker='d', label='Calinski-Harabasz Index', color='#8e0201')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Value')
plt.title('Calinski-Harabasz Index')
plt.legend()
plt.show()

# Identify the optimal number of clusters using Calinski-Harabasz index
optimal_k_ch = np.argmax(ch_scores) + min_clusters

# Print the optimal number of clusters based on the Calinski-Harabasz index
print(f"The optimal number of clusters (Calinski-Harabasz method) is: {optimal_k_ch}")
The optimal number of clusters (Elbow method) is: 5
The optimal number of clusters (Silhouette method) is: 2
The optimal number of clusters (Calinski-Harabasz method) is: 2

In this case, we would prioritize the Silhouette method and the Calinski-Harabasz method, as they both provide insights into the quality and separation of the clusters.
The Silhouette method suggests 2 clusters, indicating that the data points within each cluster are relatively well-separated from points in other clusters. This could imply a clear distinction between two major customer segments.
The Calinski-Harabasz method also suggests 2 clusters, indicating a high inter-cluster variance compared to the intra-cluster variance. This suggests that the clusters are well-separated and distinct from each other.
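To see why the silhouette score favors a well-matched k, here is a small sketch on synthetic data with two true centers (`make_blobs` is a stand-in, not the marketing data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# two well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=1.0, random_state=42)

scores = {}
for k in (2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
print(scores)  # k=2 scores higher than k=5 here
```

Forcing five clusters splits tight groups apart, so points end up close to neighboring clusters and the score drops, mirroring what we observed on the customer data.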

Agglomerative Clustering

Now that I have reduced the attributes to three dimensions, I will perform segmentation via Agglomerative Clustering. Agglomerative clustering is a hierarchical clustering method: it starts with each example in its own cluster and iteratively merges the closest clusters until the desired number is reached. Since 2 is the optimal number of clusters from the methods above, we will set n_clusters to 2.

# Fit Agglomerative Clustering model and obtain the cluster labels
clustering = AgglomerativeClustering(n_clusters=2)
cluster_labels = clustering.fit_predict(X_pca)

# Add the cluster labels to the original dataframe
df_clustered = df_scaled.copy()
df_clustered['Cluster'] = cluster_labels

# Analyze Clusters
cluster_counts = df_clustered['Cluster'].value_counts()
print('Cluster Counts:')
print(cluster_counts)

# Visualize the clusters
plt.figure(figsize=(8, 6))
colors = ['#8e0201', '#c93432', '#e8cccc']
palette = sns.color_palette(colors)  # keep the palette object; sns.set_palette() returns None
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=cluster_labels, palette=palette)
plt.title('Agglomerative Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

Cluster Counts:
1 1197
0 1008
Name: Cluster, dtype: int64
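Because agglomerative clustering builds a full merge hierarchy, SciPy's `linkage`/`fcluster` can reproduce the same bottom-up merging and expose the tree for a dendrogram. A sketch on synthetic stand-in data (two Gaussian groups standing in for `X_pca`):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# stand-in for X_pca: two groups in 3-D
X = np.vstack([rng.normal(0, 1, size=(50, 3)),
               rng.normal(6, 1, size=(50, 3))])

Z = linkage(X, method='ward')                    # Ward merging, as in sklearn's default
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(np.unique(labels))                         # [1 2]
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would visualize where the two-cluster cut falls in the merge tree.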
# Now lets have a visual distribution of the Clusters.
plt.figure(figsize=[6, 6])
order = df_clustered['Cluster'].value_counts().index
pl = sns.countplot(x=df_clustered["Cluster"], color='#8e0201', order=order)
pl.set_title("Distribution Of The Clusters")
plt.show()
We can see that the clusters are fairly evenly distributed. (Image by Author)

Cluster Profiling

This involves profiling the segments to see where customers belong by studying the patterns within each cluster.

# Add cluster labels to original dataset
clusters = cluster_labels
df['Clusters'] = clusters.astype(int)
plt.figure(figsize=[10, 7])
colors = ['#8e0201', '#c93432', '#e8cccc']
palette = sns.color_palette(colors)  # keep the palette object; sns.set_palette() returns None
sns.scatterplot(data=df, x='Income', y='Total_Spending', hue='Clusters', palette=palette)
plt.title('Total Spending and Income based on Clusters')
plt.show()

Insights:
Cluster 0: High Income and High Spending
Cluster 1: Low Income and Low Spending

plt.figure(figsize=[10, 7])
colors = ['#8e0201', '#c93432', '#e8cccc']
palette = sns.color_palette(colors)  # keep the palette object; sns.set_palette() returns None
sns.scatterplot(data=df, x='Total_Spending', y='Age', hue='Clusters', palette=palette)
plt.title('Total Spending and Age based on Clusters')
plt.show();

Insights:
There isn’t much insight to be drawn from the cluster patterns above: the relationship between customers’ age and their total spending does not clearly separate the clusters.

plt.figure(figsize=[10, 7])
colors = ['#8e0201', '#c93432', '#e8cccc']
palette = sns.color_palette(colors)  # keep the palette object; sns.set_palette() returns None
sns.violinplot(data=df, x='Clusters', y='Total_Purchases', palette=palette)
plt.title('Total number of purchases made')
plt.xlabel("Clusters")
plt.show();

Insights:
Cluster 0: The total number of purchases made averages about 21, with the highest at 43.
Cluster 1: The total number of purchases made averages about 10, with the highest at 26. This isn’t a surprise, as customers in Cluster 0 earn the most and also spend the most. Also, the lowest total purchases in Cluster 0 is 7, while some customers in Cluster 1 made no purchases at all.

plt.figure(figsize=[10, 7])
colors = ['#8e0201', '#c93432', '#e8cccc']
palette = sns.color_palette(colors)  # keep the palette object; sns.set_palette() returns None
sns.boxplot(data=df, x='Clusters', y='Age', palette=palette)
plt.title('Clusters by Age')
plt.show();

Insights:
Firstly, we can see no outlier points in the Age distribution for either set of customers. Customers in Cluster 0 are the oldest, with a median age of almost 50 and the oldest at 75, while those in Cluster 1 have a median age a little above 40 and the oldest at 70.

plt.figure(figsize=[10, 7])
colors = ['#8e0201', '#c93432', '#e8cccc']
palette = sns.color_palette(colors)  # keep the palette object; sns.set_palette() returns None
sns.countplot(data=df, x='Total_Acceptance', hue='Clusters', palette=palette)
plt.title('Total number of marketing campaigns accepted')
plt.xlabel("Number Of Total Accepted Promotions")
plt.show();

Insights:
As can be seen above, customers accepted few of the marketing campaigns, and no customer accepted all five.

plt.figure(figsize=[10, 7])
colors = ['#8e0201', '#c93432', '#e8cccc']
palette = sns.color_palette(colors)  # keep the palette object; sns.set_palette() returns None
sns.countplot(data=df, x='Months_of_Enrollment', hue='Clusters', palette=palette)
plt.xlabel('Number of Months since Enrollment')
plt.ylabel('Frequency')
plt.show();

Insights:
The customers that enrolled the earliest and spent the most belong to Cluster 0, while Cluster 1 holds the lowest spenders, even among its earliest enrollees.

plt.figure(figsize=[10, 7])
colors = ['#8e0201', '#c93432', '#e8cccc']
sns.countplot(data=df, x='No_of_Dependents', hue='Clusters', palette=colors)
plt.title('Clusters by Number of Dependents')
plt.xlabel('Number of Dependents in customers household')
plt.ylabel('Frequency')
plt.show();

Insights:
Customers in Cluster 0 mostly have no dependents in their household, while those in Cluster 1 mostly have a single dependent. Overall, customers with dependents largely belong to Cluster 1.

# Marital Status
plt.figure(figsize=[10, 7])
colors = ['#8e0201', '#c93432', '#e8cccc']
sns.countplot(data=df, x='Marital_Status_New', hue='Clusters', palette=colors)
plt.title('Clusters by Marital Status')
plt.xlabel('Marital Status')
plt.ylabel('Frequency')
plt.xticks(ticks=[0, 1], labels=['Single', 'Married'])
plt.show();

Insights:
Married customers are the majority in both clusters, with the larger share of them belonging to Cluster 1.

# no_of_dependents vs income
plt.figure(figsize=[10, 7])
colors = ['#8e0201', '#c93432', '#e8cccc']
sns.barplot(data=df, x='No_of_Dependents', y='Income', hue='Clusters', palette=colors, ci=None)
plt.title('Income by Number of Dependents per Cluster')
plt.xlabel('No of Dependents')
plt.ylabel('Income')
plt.show();

Insights:
Most customers in Cluster 0 have no dependents/children and earn the highest incomes, while within Cluster 1 the highest earners are those with three dependents/children.
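The bar heights in this chart correspond to a pivot of mean income over dependents and cluster. A minimal sketch on toy data (illustrative numbers, not the real dataset):

```python
import pandas as pd

# toy stand-in for the notebook's df with illustrative incomes
df = pd.DataFrame({
    'Clusters':         [0, 0, 1, 1, 1],
    'No_of_Dependents': [0, 0, 1, 3, 3],
    'Income':           [80000, 75000, 30000, 45000, 40000],
})

# mean income per (number of dependents, cluster) pair
pivot = df.pivot_table(index='No_of_Dependents', columns='Clusters',
                       values='Income', aggfunc='mean')
print(pivot)
```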

Conclusion

  • Most customers didn't make a complaint in the last two years.
  • Most customers have a single dependent/child in their household.
  • Most customers made purchases via the Store.
  • Most customers enrolled between 20 and 25 months ago.
  • Majority of the customers have partners.
  • Majority of the customers have been Graduates.
  • Majority of the customers are parents.
  • Customers that didn’t respond to campaigns are the majority.
  • The earliest enrollments in the business were by customers aged 48.
  • The total amount spent is highly correlated with customers' income and the total purchases they made.
  • The highest-spending customers are in the 40–60 age group.
  • Of the three metrics we used to evaluate the model, two (a majority) suggested 2 clusters.
  • Customers with no dependents/children have been the highest spenders.
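The model-evaluation finding above can be reproduced by scoring candidate cluster counts. A minimal sketch using synthetic blobs as a stand-in for the scaled customer features (the notebook applies silhouette and Calinski-Harabasz scores to the real data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic stand-in for the scaled customer feature matrix:
# two well-separated blobs, so the "true" cluster count is 2
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=1.0, random_state=42)

sil = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    sil[k] = silhouette_score(X, labels)

best_k = max(sil, key=sil.get)  # k with the highest silhouette score
print(best_k)
```

The same loop can record `calinski_harabasz_score(X, labels)` alongside the silhouette to compare how the metrics vote.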

Business Insights

  • To address customers' reluctance to accept the campaigns and encourage higher participation, tailor campaign offers to the specific preferences and needs of customers in Cluster 1.
  • To address the low income and low spending of customers in Cluster 1, develop affordable, cost-effective products that fit their budget constraints, and implement loyalty programs that reward repeat purchases, allowing them to keep engaging with the business despite their low income.
  • The preferences, priorities, and purchasing behaviors of the older customers in Cluster 0 may differ from those of the younger customers in Cluster 1, so consider age-related factors when developing new products or enhancing existing ones.
  • Since the majority of customers make purchases in stores, it is crucial to maintain a strong presence and optimize the in-store experience; at the same time, the smaller share of online purchases highlights growth potential in the e-commerce channel, so the business should also focus on improving website usability.

Thanks for reading🤓. Check out my GitHub repo where you can find the code and other resources related to the project. Your feedback and suggestions are valuable and will help me improve and deliver more meaningful content in the future. Cheers!
