Unlocking Customer Insights: A Data-Driven Approach to Mall Customer Segmentation

Albinlamichhane
10 min read · May 20, 2024

In the ever-evolving world of retail, understanding your customers is key to success. But how can mall owners and managers gain deeper insights into their diverse clientele? The answer lies in the power of data science and customer segmentation.

In this post, we’ll dive into a real-world case study of how we used Python and K-means clustering to uncover hidden patterns within mall customer data. We’ll walk you through the entire process, from data cleaning and exploration to building a model that reveals distinct customer segments. By the end, you’ll understand how to apply these techniques to your own retail data, enabling you to tailor marketing campaigns, enhance customer experiences, and ultimately drive sales.

Data Exploration: Unveiling the Customer Landscape

Our journey begins by loading the customer data from a CSV file.

import pandas as pd

file_path = "./data/Mall_Customers.csv"
customer_data = pd.read_csv(file_path)

After some initial cleaning, we get an overview of the data using descriptive statistics and visualizations.

The figure describes the contents of the dataset. Here’s a summary of the rows it displays:

  • CustomerID: Unique identifiers for customers, ranging from 190 to 200 in the rows shown.
  • Gender: Categorization of customers as ‘Female’ or ‘Male’.
  • Age: Age of customers, between 30 and 38 years in the rows shown.
  • Annual Income (k$): Income figures ranging from 103 to 137 thousand dollars in the rows shown.
  • Spending Score (1–100): A score representing spending habits, with values from 16 to 91 in the rows shown.

Then we generate descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset’s distribution, excluding NaN values.

 clean_data.describe()

Here’s a breakdown of the result:

  • count: The number of non-null entries. There are 200 entries for each variable.
  • mean: The average value. For Age, it’s approximately 38.85 years; for Annual Income, it’s 60.56k$; and for Spending Score, it’s 50.2.
  • std: Standard deviation, which measures the amount of variation or dispersion. The values are 13.97 for Age, 26.26 for Annual Income, and 25.82 for Spending Score.
  • min: The minimum value in the dataset. The youngest Age is 18, the lowest Annual Income is 15k$, and the lowest Spending Score is 1.
  • 25%: The 25th percentile, also known as the first quartile. It means that 25% of the data is below this value. For Age, it’s 28.75; for Annual Income, it’s 41.5k$; and for Spending Score, it’s 34.75.
  • 50%: The median or the 50th percentile. Half the data falls below this value. For Age, it’s 36; for Annual Income, it’s 61.5k$; and for Spending Score, it’s 50.
  • 75%: The 75th percentile, or third quartile. 75% of the data falls below this value. For Age, it’s 49; for Annual Income, it’s 78k$; and for Spending Score, it’s 73.
  • max: The maximum value in the dataset. The oldest Age is 70, the highest Annual Income is 137k$, and the highest Spending Score is 99.
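To make the quartile rows concrete, here is a minimal sketch that builds the same kind of summary with Python’s standard library. The `ages` list is made up for illustration and is not taken from the dataset. The sketch assumes pandas’ default linear interpolation for percentiles, which `statistics.quantiles(..., method="inclusive")` mirrors.

```python
import statistics

# Hypothetical ages, for illustration only (not from the mall dataset)
ages = [18, 22, 30, 36, 49, 70]

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "std": statistics.stdev(ages),  # sample standard deviation (divides by n - 1)
    "min": min(ages),
    "max": max(ages),
}

# Quartiles with linear interpolation between neighboring sorted values
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
summary.update({"25%": q1, "50%": median, "75%": q3})

for name, value in summary.items():
    print(f"{name:>5}: {value:.2f}")
```

With only six values, each quartile lands between two observations, which is why a percentile such as 28.75 can appear in the real output even though every recorded age is a whole number.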

We also created a box plot to visualize outliers in Annual Income.

import plotly.express as px

# Create a box plot to visualize outliers in annual income
fig = px.box(clean_data, y="Annual Income")
fig.show()

The box plot suggests a right skew in annual income. Because the plot is drawn on the y-axis, the box sits toward the lower end of the scale, which means most customers fall into the lower and middle income brackets. The tail extending toward higher incomes shows a smaller, but real, group of high earners.

This skew might reflect a general income distribution pattern, or it might be influenced by the type of stores the mall attracts. The data collection method could also play a role: surveying mall visitors might miss high-income individuals who rarely visit.

Understanding the skew is valuable for businesses. They can tailor marketing campaigns to the larger customer segment and optimize product offerings within the mall. Value may be the main focus, but the presence of high earners suggests a potential niche market for premium options as well.
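As a rough check on what the box plot flags as an outlier, the sketch below applies the common 1.5 × IQR rule to the Annual Income quartiles quoted in the describe() output above. This is an illustration under that assumption: Plotly computes quartiles with its own method, so the fences it draws may differ slightly.

```python
# Quartiles and extremes for Annual Income (k$), as quoted in the describe() output
q1, q3 = 41.5, 78.0
income_min, income_max = 15.0, 137.0

iqr = q3 - q1                 # interquartile range: the height of the box
lower_fence = q1 - 1.5 * iqr  # values below this count as outliers
upper_fence = q3 + 1.5 * iqr  # values above this count as outliers

print(f"IQR: {iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print("Any value above the upper fence:", income_max > upper_fence)
print("Any value below the lower fence:", income_min < lower_fence)
```

Under this rule the maximum income of 137k$ lies just above the upper fence, while the minimum of 15k$ is well inside the lower one, so any outliers sit only on the high-income side.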

Next, we use Seaborn’s displot function to visualize the age distribution for each gender. The following snippet generates a kernel density estimate (KDE) plot:

import seaborn as sns

sns.displot(clean_data, x='Age', hue='Gender', kind='kde')

KDEs are like smooth probability hills, showing the likelihood of customers falling within certain age ranges.

Here’s what the plot reveals:

  • The X-axis represents age. The recorded ages run from 18 to 70, although the smoothed curve extends a little beyond both ends.
  • The Y-axis reflects density rather than a count. Higher bumps on the curve indicate a greater concentration of customers in that age range, and the area under the curve over a range estimates the share of customers in it.
  • Setting the ‘hue’ parameter to ‘Gender’ color-codes the plot. This creates two curves, one for males (often shown in blue) and another for females (often shown in orange).
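To show what a KDE curve is made of, here is a small hand-rolled sketch using only the standard library. It places a Gaussian bump on each observation and averages them. The `ages` list, the fixed bandwidth of 5 years, and the helper name `make_kde` are all assumptions for illustration; Seaborn chooses its bandwidth automatically, so its curve will not match this one exactly.

```python
import math

def make_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian KDE of `samples` at a point x."""
    n = len(samples)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum of Gaussian bumps centered on each sample, scaled to integrate to 1
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density

# Hypothetical ages and bandwidth, for illustration only
ages = [19, 23, 31, 32, 35, 36, 40, 47, 52, 68]
density = make_kde(ages, bandwidth=5.0)

# Approximate areas under the curve with a simple Riemann sum from 0 to 100 years
step = 0.1
grid = [i * step for i in range(0, 1001)]
total_area = sum(density(x) for x in grid) * step
share_30_to_40 = sum(density(x) for x in grid if 30 <= x <= 40) * step

print(f"Total area under the curve: {total_area:.3f}")
print(f"Estimated share aged 30-40: {share_30_to_40:.3f}")
```

The total area comes out close to 1, which is why the height of the curve is a density and only the area over an age range reads as a proportion of customers.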

Visualizing the age distribution is key to understanding our customer base, so we also build a histogram with Plotly Express to show how customer ages are spread out.

Decoding Spending Habits with Distribution Plots

Next, we turn to Plotly Express histograms that are color-coded by gender. The first shows the age distribution and the second shows the distribution of spending scores, so we can see the spread of each variable and any gender-based differences in spending patterns.

# Average age, rounded to the nearest year
avg_age = clean_data['Age'].mean().round(0)
fig = px.histogram(
    clean_data,
    x='Age',
    color='Gender',
    nbins=10,
    title='Age Distribution',
    color_discrete_sequence=px.colors.sequential.Magenta_r,
)
# Mark the average age with a vertical line
fig.add_vline(
    x=avg_age,
    annotation_text=f'Average {avg_age}'
)
fig.show()

The X-axis represents age, and the Y-axis shows how many customers fall within each age bin. The bars are color-coded by gender: dark purple for male customers and light purple for female customers, which lets us compare the age distribution between genders. A vertical line marks the average age, highlighting the central tendency of the data.

Here we can see that there are more female customers than male customers, and that for both genders the largest group falls in the 30–40 age range.

In short, this histogram is a useful tool for visualizing the age distribution and gender breakdown in the mall customer dataset. Together with the other visualizations, it paves the way for data-driven customer segmentation and marketing decisions.

# Average spending score (a unitless 1-100 value), rounded to a whole number
avg_score = clean_data['Spending Score'].mean().round(0)
fig = px.histogram(
    clean_data,
    x='Spending Score',
    color='Gender',
    nbins=10,
    title='Spending Score Distribution',
    color_discrete_sequence=px.colors.sequential.Magenta_r,
)
# Mark the average spending score with a vertical line
fig.add_vline(
    x=avg_score,
    annotation_text=f'Average {avg_score}'
)
fig.show()

This code snippet examines customer spending habits using Plotly Express. It calculates the average spending score, stores it in a variable, and then creates a histogram of spending scores split by gender.

The X-axis represents the spending score, a unitless value from 1 to 100 rather than a dollar amount. The Y-axis shows the number of customers within each spending score bin. Color-coding by gender lets us see how spending patterns differ between genders, and a vertical line marks the average spending score.

This spending score distribution is a useful input for customer segmentation strategies. By analyzing the histogram, businesses can identify customer groups based on spending habits. For example, if there’s a concentration of customers in a specific range (such as scores between 40 and 60), they can tailor marketing campaigns or loyalty programs specifically for that segment. Additionally, the gender breakdown within each spending score bin can inform gender-specific strategies. This visualization, along with others like the age distribution plot, paves the way for data-driven customer segmentation, helping businesses target the right customers with the right offers.

K-Means Clustering: Grouping Customers for Strategic Insights

We then turn to K-Means clustering to segment customers based on their spending habits and annual income. First, we define the candidate numbers of clusters (customer groups) to explore, ranging from 2 to 15.

# Define the numbers of clusters to fit and evaluate (2 through 15)
num_clusters = list(range(2, 16))
num_clusters

Then, we focus on two key features: “Annual Income” and “Spending Score”. These are standardized to ensure they’re on a similar scale for the clustering algorithm.

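from sklearn.preprocessing import StandardScaler

# Standardize the data (X_2dim holds the Annual Income and Spending Score columns)
X_2dim_scaled = StandardScaler().fit_transform(X_2dim)
# Convert the scaled array back to a DataFrame with the original column names
X_2dim_scaled = pd.DataFrame(X_2dim_scaled, columns=X_2dim.columns)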
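The standardization step can be sketched by hand to show what the scaler does to each column. The `incomes` list below is hypothetical. One detail worth noting: StandardScaler divides by the population standard deviation (dividing by n), whereas describe() reports the sample standard deviation (dividing by n - 1), so the two figures differ slightly.

```python
import statistics

# Hypothetical annual incomes in k$, for illustration only
incomes = [15, 40, 60, 80, 105]

mean = statistics.fmean(incomes)
pop_std = statistics.pstdev(incomes)    # population std (divide by n), as StandardScaler uses
sample_std = statistics.stdev(incomes)  # sample std (divide by n - 1), as describe() reports

# z-score: how many standard deviations each value lies from the mean
z_scores = [(x - mean) / pop_std for x in incomes]

print("z-scores:", [round(z, 3) for z in z_scores])
print("mean of z-scores:", round(statistics.fmean(z_scores), 6))
print("population std of z-scores:", round(statistics.pstdev(z_scores), 6))
```

After scaling, each feature has mean 0 and unit variance, so neither income nor spending score dominates the Euclidean distances that K-Means relies on.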

An important step is finding the optimal number of clusters. The code calculates a metric called inertia for different cluster counts.

from sklearn.cluster import KMeans

# Function to fit KMeans and evaluate inertia for different numbers of clusters
def kmeans_inertia(num_clusters, x_vals):
    """
    Accepts as arguments a list of ints and a data array.
    Fits a KMeans model where k = each value in the list of ints.
    Parameters:
    - num_clusters: the numbers of clusters to fit and evaluate inertia for
    - x_vals: the dataframe to fit and evaluate inertia on
    Returns a list of inertia values, one for each k.
    """
    inertia = []
    for num in num_clusters:
        kms = KMeans(n_clusters=num, random_state=20)
        kms.fit(x_vals)
        inertia.append(kms.inertia_)
    return inertia

Inertia is the sum of squared distances from each data point to the center of its assigned cluster, so lower values mean tighter clusters. Because inertia generally keeps falling as more clusters are added, we plot it against the number of clusters and look for the “sweet spot” where the curve starts to level off, indicating a good balance between the number of clusters and how well they represent the data.
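To make the metric itself concrete, here is a minimal sketch that computes inertia by hand for a handful of points. The `points` and the centroids are hypothetical standardized values chosen to form two tight groups; the helper name `compute_inertia` is also just for illustration and is not part of scikit-learn.

```python
def compute_inertia(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for x, y in points:
        total += min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids)
    return total

# Hypothetical standardized (income, spending score) pairs forming two tight groups
points = [
    (-1.0, -1.0), (-1.2, -0.8), (-0.8, -1.2),
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
]

# One centroid at the overall mean versus one centroid per group
one_cluster = compute_inertia(points, [(0.0, 0.0)])
two_clusters = compute_inertia(points, [(-1.0, -1.0), (1.0, 1.0)])

print(f"k=1 inertia: {one_cluster:.2f}")
print(f"k=2 inertia: {two_clusters:.2f}")
```

Moving from one centroid to two cuts the inertia sharply because each group gets its own center. Adding further centroids to this toy data would reduce it only marginally, which is the leveling-off that the elbow plot is meant to reveal.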

Once we have chosen the number of clusters (5, based on the inertia plot, which is not reproduced here), we fit a K-Means model with that setting. This model assigns customers to clusters based on their income and spending score.

#Fit a 5-clusters model
kmeans5_2dim = KMeans(n_clusters=5, n_init='auto', random_state=50)
kmeans5_2dim.fit(X_2dim_scaled)

We then analyze these clusters in two ways. First, a heatmap visualizes the average income and spending score (cluster centers) for each customer group. Second, a scatterplot depicts individual customers colored by their assigned cluster. This allows us to see how customers are distributed across the different spending and income segments.

This heatmap is a powerful tool for unpacking customer segments identified through K-Means clustering. Each cluster (0–4) represents a distinct group of customers. The color intensity reveals key characteristics based on their annual income and spending score. Red indicates higher values, while blue signifies lower values.

From the heatmap, we observe the following characteristics of the 5 clusters:

  • Cluster 0: High income but low spending score
  • Cluster 1: Average income and average spending score
  • Cluster 2: Low income and high spending score
  • Cluster 3: High income and high spending score
  • Cluster 4: Low income and low spending score
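One way to turn standardized cluster centers into plain-language profiles like the ones above is sketched below. The `centers` values are hypothetical numbers chosen to mirror the five profiles; they are not the fitted model’s actual centers. The ±0.5 cutoff and the helper name `describe_level` are likewise arbitrary choices for illustration.

```python
def describe_level(z, threshold=0.5):
    """Map a standardized value to a coarse label; the threshold is an arbitrary choice."""
    if z > threshold:
        return "High"
    if z < -threshold:
        return "Low"
    return "Average"

# Hypothetical standardized cluster centers as (income, spending score) pairs
centers = {
    0: (1.0, -1.2),
    1: (-0.2, 0.0),
    2: (-1.3, 1.1),
    3: (1.0, 1.2),
    4: (-1.3, -1.1),
}

profiles = {
    cluster: f"{describe_level(income)} income, {describe_level(score)} spending score"
    for cluster, (income, score) in centers.items()
}

for cluster, profile in profiles.items():
    print(f"Cluster {cluster}: {profile}")
```

Because the features were standardized, a center near 0 means “about average” for the whole customer base, which is what makes a simple symmetric threshold a reasonable first pass.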

By understanding these distinct customer segments, businesses can develop targeted strategies. Imagine offering special discounts to budget-conscious clusters or designing loyalty programs to retain high spenders. Product development can also be informed by these segments, with features and pricing tailored to resonate with each group. This heatmap, a key part of the K-Means clustering process, empowers data-driven customer segmentation, allowing businesses to target the right customers with the right offerings.

The scatter plot shows the same segments at the level of individual customers. Each dot is one customer, positioned by annual income and spending score and colored by the cluster it was assigned to. Well-separated groups of dots indicate that the segments are distinct, while overlapping groups point to customers who sit between segments.

This visualization makes patterns easier to see than in the raw data. Businesses can identify customer segments from the clusters of data points and tailor marketing or product development accordingly, personalizing their approach for each group. Ultimately, the scatter plot supports data-driven customer segmentation and strategy development.

In today’s competitive retail landscape, understanding your customers is no longer a luxury; it’s a necessity. This case study has explored how data science, specifically K-means clustering, can be used to unlock valuable customer insights from mall customer data.

By following a data-driven approach that involves data exploration, visualization, and customer segmentation, mall owners and managers can gain a deeper understanding of their customer base. We’ve seen how descriptive statistics, histograms, KDE plots, and box plots provide a clear picture of customer demographics, spending habits, and potential outliers.

K-means clustering takes us a step further by grouping customers into distinct segments based on their annual income and spending score. Visualizing these segments through heatmaps and scatter plots helps businesses identify customer profiles such as high-income, low-spending individuals or low-income, high-spending customers.

This granular understanding of customer behavior allows for the development of highly targeted strategies. From crafting personalized marketing campaigns and loyalty programs to tailoring product offerings and pricing, the possibilities are vast. Ultimately, data-driven customer segmentation empowers businesses to make informed decisions that enhance customer experiences, drive sales, and ensure long-term success in the ever-evolving retail world.
