Mall Customers Clustering Analysis

using SAS Enterprise Miner

Mario Caesar
Data Science Indonesia
8 min read · Jan 14


Photo by Heidi Fin on Unsplash


Customer clustering is the process of grouping customers based on similar behavior or characteristics. Companies widely use customer segmentation to understand what defines each group so that they can provide better experiences, increase customer satisfaction, apply effective and efficient marketing strategies to existing and prospective customers, and target clients with promotions and rewards that match their interests, requirements, and preferences. In addition, marketers can identify which strategies are effective and which need improvement by objectively evaluating each campaign’s outcomes in terms of economic uplift. The end result is highly relevant marketing messages that reach every consumer and strengthen brand impression while maximizing customer value.

Unsupervised machine learning algorithms such as K-Means, hierarchical clustering, and DBSCAN can be used to identify groups of similar customers, so that customers within the same group differ from each other as little as possible.
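To make the idea concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) in plain Python. It is only an illustration of the general technique, not the procedure SAS Enterprise Miner runs internally. The function name `kmeans`, the toy points, and the starting centroids are all made up for this example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid, then update."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Made-up (annual income in k$, spending score) pairs, for illustration only.
toy_points = [(15, 80), (18, 75), (20, 85), (85, 15), (90, 20), (95, 10)]
labels, centroids = kmeans(toy_points, centroids=[(15, 80), (85, 15)])
print(labels)
print(centroids)
```

In this toy run the first three points end up in one group and the last three in another, which is the "fewest differences within a group" idea in its simplest form.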

This article will discuss how to perform clustering using SAS Enterprise Miner. But before that, we must first understand what SAS Enterprise Miner is.

What is SAS Enterprise Miner?

SAS Enterprise Miner is an advanced analytics data mining tool designed to assist users in developing descriptive and predictive models with a simplified data mining procedure. It allows you to improve the efficiency of data mining by creating efficient models, recognizing significant associations, and identifying the most important trends.

Enterprise Miner’s graphical user interface lets users quickly follow the steps of the SAS SEMMA approach (sampling, exploration, modification, modeling, and assessment). Users create process flows by choosing the appropriate tab on Enterprise Miner’s toolbar and then dragging and dropping step-specific nodes onto the diagram workspace.

Enterprise Miner supports various methods and methodologies, including decision trees, time series, neural networks, linear and logistic regression, market basket analysis, and clustering.

Now that we understand what SAS Enterprise Miner is, the next step is to use it to cluster a mall customer dataset. The dataset comes from Kaggle and contains dummy mall customer data, such as customers’ basic information and spending scores. The goal is to group mall shoppers by behavior in order to understand their shopping habits and design an effective and efficient marketing strategy.

Overall Flow

The following is the overall flow in SAS Enterprise Miner used to build the clustering model and analyze the clustering results.

Overall Flow for Mall Customer Clustering in SAS EM

The picture above shows the flow of five nodes for clustering analysis in SAS Enterprise Miner. The first step begins by importing the dataset using the “File Import” node. Next, univariate analysis (including frequency distribution, statistical summary, and total missing values) will be performed on each variable present in the dataset using the “MultiPlot” node.

After the dataset has been analyzed, the “Drop” node is used to remove variables that are not needed for clustering. The prepared dataset then enters the “Cluster” node to be clustered. Finally, the “Segment Profile” node profiles the clusters that have been formed. The following sections discuss the findings at each node.

Univariate Analysis Result

Univariate Analysis Histograms and Statistical Results

The imported dataset consists of 5 columns and 200 observations, with no missing values. The histogram on the left shows that more women than men have registered for membership and that most members are in their 20s and 30s, with an average age of 38.85 years. Across the 200 registered customers, the average annual income is 60.56k dollars, with a maximum of 137k dollars and a minimum of 15k dollars. Most customers earn between 60k and 75k dollars a year. Furthermore, the average spending score is 50.20, and most spending scores fall between 40 and 70.
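For readers who want to reproduce this kind of summary outside SAS Enterprise Miner, the sketch below computes the same basic statistics (count, missing values, mean, minimum, maximum) for one numeric column. The helper `summarize` and the toy values are hypothetical and are not taken from the dataset.

```python
import statistics

def summarize(values):
    """Return count, missing count, mean, min, and max for one numeric column."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "mean": round(statistics.mean(present), 2),
        "min": min(present),
        "max": max(present),
    }

# Made-up annual incomes in k$; None marks a missing value.
toy_income = [15, 40, 62, 75, None, 137]
summary = summarize(toy_income)
print(summary)
```

Running the same helper over every numeric column gives a rough equivalent of the statistical table shown by the “MultiPlot” node.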

“Drop” and “Cluster” Nodes Settings

“Drop” (Left) and “Cluster” (Right) Nodes Settings

On the “Drop” node, the variable “CustomerID” is dropped because it only contains each customer’s unique ID, which carries no information about customer behavior.

The “Cluster” node is then run with its default settings, using “Annual Income” and “Spending Score” as the input variables.
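Outside the GUI, the same two preparation steps (dropping the ID column, then picking the clustering inputs) could look roughly like the sketch below. The helper names and the toy records are made up, and the column names follow the wording of this article, so the exact names in the Kaggle file may differ.

```python
def drop_columns(rows, to_drop):
    """Return copies of the rows without the listed columns."""
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]

def select_inputs(rows, inputs):
    """Build the numeric feature vectors that the clustering step will see."""
    return [tuple(row[name] for name in inputs) for row in rows]

# Made-up records; column names are assumed, not copied from the dataset file.
toy_rows = [
    {"CustomerID": 1, "Gender": "Male", "Age": 19, "Annual Income": 15, "Spending Score": 39},
    {"CustomerID": 2, "Gender": "Female", "Age": 31, "Annual Income": 70, "Spending Score": 81},
]
kept = drop_columns(toy_rows, {"CustomerID"})
features = select_inputs(kept, ["Annual Income", "Spending Score"])
print(features)
```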

Clustering Results

Clustering Results
Clustering Results Distribution

The clustering results show that cluster 4 is the largest cluster, containing about 50% of all customers. Although cluster 5 has only 11 customers, it has the highest average annual income (108.1k dollars) and the highest average spending score (82.7) of all clusters. At the other end, cluster 2 has the lowest average annual income (26.3k dollars), and cluster 1 has the lowest average spending score (18.6).

The table above also shows other information, such as:

  • Distance to Nearest Cluster: distance between the cluster’s centroid and the nearest other cluster centroid,
  • Maximum Distance from Cluster Seed: maximum distance from the cluster seed to any observation in the cluster, and
  • Root-Mean-Square Standard Deviation: root-mean-square distance between observations in the cluster.
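The three statistics in the list above can be sketched in a few lines of Python. This is a hedged approximation, not the exact computation behind the SAS Enterprise Miner table. It treats the final centroid as the cluster seed, and it assumes the root-mean-square standard deviation is the per-variable variance (with an n - 1 denominator) averaged over the variables and then square-rooted. The function names and toy clusters are made up.

```python
import math

def centroid(members):
    """Mean of each coordinate across the cluster's members."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))

def cluster_statistics(clusters):
    """clusters maps a cluster id to a list of equal-length numeric tuples."""
    centers = {cid: centroid(members) for cid, members in clusters.items()}
    stats = {}
    for cid, members in clusters.items():
        center = centers[cid]
        # Distance from this centroid to the closest other centroid.
        nearest = min(
            math.dist(center, c) for other, c in centers.items() if other != cid
        )
        # Farthest member from the centroid (used here in place of the seed).
        max_dist = max(math.dist(center, p) for p in members)
        # Assumed definition: sum of squared deviations over all variables,
        # divided by (number of variables * (n - 1)), then square-rooted.
        n, dims = len(members), len(center)
        ss = sum((p[d] - center[d]) ** 2 for p in members for d in range(dims))
        rms_std = math.sqrt(ss / (dims * (n - 1)))  # requires n > 1
        stats[cid] = {"nearest": nearest, "max_dist": max_dist, "rms_std": rms_std}
    return stats

# Made-up clusters, for illustration only.
toy_clusters = {
    1: [(0, 0), (2, 0), (0, 2), (2, 2)],
    2: [(10, 10), (12, 10), (10, 12), (12, 12)],
}
stats = cluster_statistics(toy_clusters)
print(stats[1])
```

A small maximum distance and a small root-mean-square standard deviation both indicate a compact cluster, while a large distance to the nearest cluster indicates good separation.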

The table shows that the clusters are well separated, because the smallest distance to the nearest cluster is greater than 1. In addition, the data points in cluster 3 lie close to each other, because this cluster has the lowest maximum distance from the cluster seed and the lowest root-mean-square standard deviation. In contrast, the data points in cluster 4 are quite spread out, since its maximum distance and root-mean-square standard deviation are the highest.

Furthermore, the stacked bar charts display the binned value distribution of the two variables used for clustering. For example, many customers in cluster 2 have annual incomes between 15k and 30.25k dollars, while most customers in cluster 5 have annual incomes between 91.25k and 106.5k dollars. Similarly, many customers in cluster 3 have spending scores between 86.75 and 99, which indicates high spending habits.
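The bin boundaries quoted above are consistent with equal-width binning. As a worked example, splitting the annual income range reported earlier (15k to 137k dollars) into 8 equal bins gives a width of 15.25 and reproduces the 15 to 30.25 and 91.25 to 106.5 ranges. The choice of 8 bins is inferred from those numbers rather than read from the node settings, and the helper names are made up.

```python
def equal_width_edges(low, high, bins):
    """Return the bins + 1 edges that split [low, high] into equal-width bins."""
    width = (high - low) / bins
    return [low + i * width for i in range(bins + 1)]

def bin_index(value, edges):
    """Index of the bin containing value; the top edge belongs to the last bin."""
    if not edges[0] <= value <= edges[-1]:
        raise ValueError("value outside the binned range")
    for i in range(len(edges) - 2):
        if value < edges[i + 1]:
            return i
    return len(edges) - 2

# Income range from the univariate analysis; 8 bins is an inferred assumption.
income_edges = equal_width_edges(15, 137, 8)
print(income_edges)
# [15.0, 30.25, 45.5, 60.75, 76.0, 91.25, 106.5, 121.75, 137.0]
print(bin_index(100, income_edges))  # 5, the 91.25 to 106.5 bin
```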

“Segment Profile” Results

Cluster Histogram Distributions vs. Overall Distribution
Profiling Results (Left) & Profiling Histograms (Right)

This section discusses the profile, or characteristics, of each cluster produced by the “Segment Profile” node. The first image compares the overall distribution (histogram with a red line) with each cluster’s distribution (histogram with filled colors). The second and third images show the ranking of each cluster’s important variables and their worth values. Based on these three images, the following conclusions can be drawn:

  • Cluster 1: Customers in this cluster have the lowest spending score distribution, similar to cluster 2. However, most of them have medium to high annual incomes. Their age distribution is quite similar to the overall distribution and covers all age groups. Looking more closely, most customers in this cluster are adults to middle-aged adults.
  • Cluster 2: As mentioned above, this cluster’s spending score distribution is also among the lowest. In addition, this cluster has the lowest annual income of all clusters. As in cluster 1, the age distribution is quite close to the overall distribution, and most customers are adults to middle-aged adults. However, more than half of the elderly shoppers are in this cluster.
  • Cluster 3: Customers in this cluster have fairly high spending scores, similar to cluster 5. However, their average annual income is in the middle range. In addition, most customers in this cluster are young adults to adults.
  • Cluster 4: Customers in this cluster have a middle to lower annual income distribution. In contrast, their spending scores tend to be middle to high. The age distribution in this cluster is slightly skewed, with most customers around 18 years old.
  • Cluster 5: Customers in this cluster have an average annual income and an average spending score above the overall averages. Furthermore, this cluster is dominated by adult shoppers.
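The comparisons in the list above boil down to checking how each cluster differs from the overall population. A minimal sketch of that idea is to subtract the overall mean from each cluster’s mean for every variable. This is a simplified stand-in, not how the “Segment Profile” node computes its worth values, and the function name, column names, and toy rows are all made up.

```python
import statistics

def profile_segments(rows, cluster_key, variables):
    """Compare each cluster's mean to the overall mean for every variable."""
    overall = {v: statistics.mean(r[v] for r in rows) for v in variables}
    profile = {}
    for cid in sorted({r[cluster_key] for r in rows}):
        members = [r for r in rows if r[cluster_key] == cid]
        # Positive values mean the cluster sits above the overall average.
        profile[cid] = {
            v: round(statistics.mean(r[v] for r in members) - overall[v], 2)
            for v in variables
        }
    return overall, profile

# Made-up clustered records, for illustration only.
toy_rows = [
    {"cluster": 1, "Age": 45, "Annual Income": 80, "Spending Score": 20},
    {"cluster": 1, "Age": 40, "Annual Income": 90, "Spending Score": 10},
    {"cluster": 2, "Age": 25, "Annual Income": 30, "Spending Score": 80},
    {"cluster": 2, "Age": 30, "Annual Income": 40, "Spending Score": 70},
]
overall, profile = profile_segments(
    toy_rows, "cluster", ["Age", "Annual Income", "Spending Score"]
)
print(overall)
print(profile)
```

In the toy output, cluster 1 earns more but spends less than average, while cluster 2 shows the opposite pattern, which is the same style of reading applied to the five clusters above.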


This article clustered the mall customer dataset using SAS Enterprise Miner, walked through the process flow step by step along with the results obtained from each node, and profiled the characteristics of each resulting cluster.


Author Message

>> Check out my clustering notebook using the “clusteval” library on Kaggle here.


