Clustering Analysis of Mall Customers
Using Python, NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn
Clustering is the task of dividing a population of data points into groups such that points in the same group are more similar to one another than to points in other groups. In essence, it gathers objects on the basis of their similarity and dissimilarity.
Clustering is an unsupervised learning method, one in which we draw inferences from datasets consisting of input data without labeled responses. It is generally used to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.
Example: Let's understand the clustering technique with the real-world example of a mall. When we visit a shopping mall, we can observe that items with similar usage are grouped together: t-shirts are grouped in one section and trousers in another, and in the produce section apples, bananas, mangoes, etc. are kept in separate areas so that we can easily find what we need. The clustering technique works in the same way. Another example of clustering is grouping documents by topic.
The clustering technique is widely used across many tasks. Some of its most common applications are:
- Market Segmentation
- Statistical data analysis
- Social network analysis
- Image segmentation
- Anomaly detection, etc.
Dataset:
This dataset describes a mall's customers. There are a total of 200 rows and 5 columns, and the data analysis and machine learning project below is built entirely on it.
Here we use Google Colab to run the code and analyze the dataset, but you can use other platforms to run the code as well.
Now let's walk through what this project contains.
Importing the Libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Loading the Files:
import io
df2 = pd.read_csv('Mall_Customers.csv')
Displaying the Data:
df2.head()
Shape of the Dataset:
df2.shape
Information about all the columns in the Dataset:
df2.info()
Description of the Dataset:
df2.describe()
Checking for Null Values in the Dataset:
df2.isnull().values.any()
Male vs Female Ratio:
labels = ['Female', 'Male']
size = df2['Gender'].value_counts()
colors = ['lightgreen', 'orange']
explode = [0, 0.1]

plt.rcParams['figure.figsize'] = (9, 9)
plt.pie(size, colors = colors, explode = explode, labels = labels, shadow = True, autopct = '%.2f%%')
plt.title('Gender', fontsize = 20)
plt.axis('off')
plt.legend()
plt.show()
The pie chart above shows the distribution of gender among the mall's customers.
Interestingly, females are in the lead with a share of 56%, while males have a share of 44%. That is a sizeable gap, especially considering that males slightly outnumber females in the general population.
Age vs Annual Income:
plt.figure(figsize=(25, 10))
# Passing X axis and Y axis along with subplot position
plt.title('Age vs Annual Income', fontsize = 20)
plt.xticks(rotation=90)
sns.barplot(x = df2['Age'], y = df2['Annual Income (k$)'], palette='icefire');
The annual income peaks at the ages of 33 and 42.
Age Distribution:
plt.rcParams['figure.figsize'] = (25, 8)
sns.countplot(df2['Age'], palette = 'hsv')
plt.title('Distribution of Age', fontsize = 20)
plt.show()
This graph gives a more detailed view of the distribution of each age group among the mall's visitors.
Looking at the graph above, ages 27 to 39 appear very frequently, but there is no clear overall pattern; we can only pick out group-wise trends, such as the older age groups visiting less often. Interestingly, there are equal numbers of visitors aged 18 and 67. People aged 55, 56, 69, and 64 visit the mall least frequently, while people aged 32 are the most frequent visitors.
Pair plot of Data:
sns.pairplot(df2)
plt.rcParams['figure.figsize'] = (25, 8)
plt.title('Pairplot for the Data', fontsize = 20)
plt.show()
Heat Map of Data:
plt.rcParams['figure.figsize'] = (15, 8)
sns.heatmap(df2.corr(), cmap = 'Wistia', annot = True)
plt.title('Heatmap for the Data', fontsize = 20)
plt.show()
The heatmap above shows the correlation between the different attributes of the Mall Customer Segmentation dataset: the most correlated features appear in orange and the least correlated in yellow.
We can clearly see that these attributes are not strongly correlated with one another, which is why we proceed with all of the features.
Age vs Spending Score:
ax = sns.barplot(y = "Spending Score (1-100)", x = "Age", data = df2, palette = "Blues_d")
sns.set(rc = {'figure.figsize': (27.7, 6.30)})
sns.set_context("poster")
We can see that customers aged roughly 28 to 39 have the highest spending scores, which may be related to their relatively high annual incomes.
Distribution of Spending Score:
plt.rcParams['figure.figsize'] = (35, 14)
sns.countplot(df2['Spending Score (1-100)'], palette = 'magma')
plt.title('Distribution of Spending Score', fontsize = 20)
plt.show()
From the mall's perspective this is one of the most important charts, as it is essential to have some intuition about the spending scores of the customers visiting the mall.
At a general level, we may conclude that most customers have a spending score in the range of 35 to 60. Interestingly, there are customers with a spending score of 1 as well as customers with a score of 99, which shows that the mall caters to customers with a wide variety of needs and requirements.
Gender vs Spending Score:
plt.rcParams['figure.figsize'] = (18, 7)
sns.boxenplot(df2['Gender'], df2['Spending Score (1-100)'], palette = 'Accent_r')
plt.title('Gender vs Spending Score', fontsize = 20)
plt.show()
It is clearly visible that most males have a spending score of roughly 25 to 70, whereas most females score roughly 35 to 75 (the spending score is an index from 1 to 100, not a dollar amount), which again points to women being the shopping leaders here.
Gender vs Annual Income:
plt.rcParams['figure.figsize'] = (18, 7)
sns.violinplot(df2['Gender'], df2['Annual Income (k$)'], palette = 'gnuplot')
plt.title('Gender vs Annual Income', fontsize = 20)
plt.show()
From the graph above, we can see that more males than females fall into the higher annual income range, while the numbers of males and females are roughly equal at low annual incomes.
Distribution of Spending Score:
plt.rcParams['figure.figsize'] = (35, 14)
sns.countplot(df2['Spending Score (1-100)'], palette = 'gist_rainbow')
plt.title('Distribution of Spending Score', fontsize = 20)
plt.show()
From the graph above, we can see that the highest spending score for any single person is 99, while the most common spending score, shared by the largest number of customers, is 42.
Annual Income vs Spending Score:
ax = sns.barplot(y = "Spending Score (1-100)", x = "Annual Income (k$)", data = df2, palette = "Blues_d")
sns.set(rc = {'figure.figsize': (11.7, 8.27)})
sns.set_context("poster")
Annual Income vs Age and Spending Score:
x = df2['Annual Income (k$)']
y = df2['Age']
z = df2['Spending Score (1-100)']
sns.lineplot(x, y, color = 'blue', palette = 'Accent_r')
sns.lineplot(x, z, color = 'pink', palette = 'Accent_r')
plt.title('Annual Income vs Age and Spending Score', fontsize = 20)
plt.show()
In the plot above, annual income versus age is drawn as a blue line and annual income versus spending score as a pink line; together they show how age and spending vary with annual income.
K Means Clustering:
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
The unsupervised k-means algorithm has a loose relationship to the k-nearest neighbor classifier, a popular supervised machine learning technique for classification that is often confused with k-means due to the name. Applying the 1-nearest neighbor classifier to the cluster centers obtained by k-means classifies new data into the existing clusters. This is known as the nearest centroid classifier or Rocchio algorithm.
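As a minimal, self-contained sketch of the nearest-centroid assignment described above (the toy points and centroids here are made up for illustration and are not taken from the mall data):

import numpy as np

# Toy 2-D points and two hypothetical centroids (illustrative values only)
points = np.array([[1.0, 2.0], [8.0, 9.0], [1.5, 1.8], [9.0, 8.5]])
centroids = np.array([[1.0, 2.0], [9.0, 9.0]])

# Squared Euclidean distance from every point to every centroid
sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Each point is assigned to the cluster whose centroid is nearest
labels = sq_dist.argmin(axis=1)

# Within-cluster sum of squares: the quantity k-means minimizes
wcss_toy = sq_dist[np.arange(len(points)), labels].sum()
print(labels, wcss_toy)

Repeating this assignment step and then re-computing each centroid as the mean of its cluster is exactly what the KMeans estimator used below does internally.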
x = df2.iloc[:, [3, 4]].values
Here we take the 4th and 5th columns, Annual Income (k$) and Spending Score (1-100), for the clustering analysis.
print(x.shape)
Elbow Method to Find the Number of Optimal Clusters:
In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. The same method can be used to choose the number of parameters in other data-driven models, such as the number of principal components to describe a data set.
The method can be traced to speculation by Robert L. Thorndike in 1953.
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    km.fit(x)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
plt.show()
In the graph above, the point after which the WCSS stops dropping sharply and the curve flattens out is called the elbow point; here it falls at around 5 clusters, which is the value used below.
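One simple way to inspect this numerically, a rough heuristic sketch that reuses the wcss list computed above rather than any official method, is to look at how much the WCSS drops with each additional cluster; the drops shrink sharply after the elbow:

# How much does WCSS fall when one more cluster is added?
drops = [wcss[i - 1] - wcss[i] for i in range(1, len(wcss))]
for k, drop in zip(range(2, 11), drops):
    print(f"going from {k - 1} to {k} clusters reduces WCSS by {drop:.1f}")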
K-Means Model Training on the Dataset:
km = KMeans(n_clusters = 5, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)
From the above training, we get a total of 5 clusters over the dataset.
Visualizing Clusters:
plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'pink', label = 'miser')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'yellow', label = 'general')
plt.scatter(x[y_means == 2, 0], x[y_means == 2, 1], s = 100, c = 'cyan', label = 'target')
plt.scatter(x[y_means == 3, 0], x[y_means == 3, 1], s = 100, c = 'magenta', label = 'spendthrift')
plt.scatter(x[y_means == 4, 0], x[y_means == 4, 1], s = 100, c = 'orange', label = 'careful')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s = 50, c = 'black', label = 'centroid')

plt.style.use('fivethirtyeight')
plt.title('K Means Clustering', fontsize = 20)
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.grid()
plt.show()
The scatter plot above visualizes the different clusters over annual income and spending score. A total of 5 clusters are created, and the black points are the centroids of those 5 clusters.
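To sanity-check these segments, here is a short sketch (assuming x and y_means from the steps above are still in scope) that summarises the average annual income, average spending score, and size of each k-means cluster:

# Per-cluster profile of the two features used for clustering
profile = pd.DataFrame(x, columns = ['Annual Income (k$)', 'Spending Score (1-100)'])
profile['cluster'] = y_means
print(profile.groupby('cluster').agg(['mean', 'count']))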
Hierarchical Clustering
Hierarchical clustering, also called hierarchical cluster analysis (HCA), is an algorithm that groups similar objects into clusters. The endpoint is a set of clusters in which each cluster is distinct from the others and the objects within each cluster are broadly similar to each other. In data mining and statistics, it is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
Agglomerative:
This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive:
This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of O(n³) and requires O(n²) memory, which makes it too slow for even medium-sized data sets.
import scipy.cluster.hierarchy as sch

dendrogram = sch.dendrogram(sch.linkage(x, method = 'ward'))
plt.title('Dendrogram', fontsize = 20)
plt.xlabel('Customers')
plt.ylabel('Euclidean Distance')
plt.show()
A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters. The dendrogram above shows the hierarchical clustering of different observations shown on the scatterplot.
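The allocation can also be done programmatically by cutting the dendrogram at a chosen number of clusters; a minimal sketch using SciPy's fcluster, assuming we keep the same 5-cluster choice as before:

from scipy.cluster.hierarchy import linkage, fcluster

# Build the same Ward linkage used for the dendrogram and cut the tree into 5 flat clusters
Z = linkage(x, method = 'ward')
labels_from_tree = fcluster(Z, t = 5, criterion = 'maxclust')
print(labels_from_tree[:10])  # cluster ids here run from 1 to 5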
Hierarchical Clustering Model Training on the Dataset:
from sklearn.cluster import AgglomerativeClustering

hc = AgglomerativeClustering(n_clusters = 5, affinity = 'euclidean', linkage = 'ward')
y_hc = hc.fit_predict(x)
Visualizing Clusters:
plt.scatter(x[y_hc == 0, 0], x[y_hc == 0, 1], s = 100, c = 'pink', label = 'miser')
plt.scatter(x[y_hc == 1, 0], x[y_hc == 1, 1], s = 100, c = 'yellow', label = 'general')
plt.scatter(x[y_hc == 2, 0], x[y_hc == 2, 1], s = 100, c = 'cyan', label = 'target')
plt.scatter(x[y_hc == 3, 0], x[y_hc == 3, 1], s = 100, c = 'magenta', label = 'spendthrift')
plt.scatter(x[y_hc == 4, 0], x[y_hc == 4, 1], s = 100, c = 'orange', label = 'careful')
# centroids from the k-means model above, shown here for reference
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s = 50, c = 'blue', label = 'centroid')

plt.style.use('fivethirtyeight')
plt.title('Hierarchical Clustering', fontsize = 20)
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.grid()
plt.show()
The scatter plot above visualizes the different clusters over annual income and spending score. A total of 5 clusters are created, and the blue points are the centroids from the earlier K-means model, plotted for reference.
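Since both models produce 5 clusters on the same two features, a quick way to check how closely they agree, a small add-on sketch rather than part of the original analysis, is the adjusted Rand index from scikit-learn:

from sklearn.metrics import adjusted_rand_score

# 1.0 means the two partitions match exactly (up to relabelling); values near 0 mean little agreement
print(adjusted_rand_score(y_means, y_hc))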
Project Synopsis:
- This is a clustering project on a shopping mall customer dataset.
- First I implemented K-means clustering, and then hierarchical clustering.
- Both approaches finally yield 5 clusters in the scatter plot diagrams.
- For hierarchical clustering, I also plotted a dendrogram.
- In each clustering, we obtained centroids marking the center of each cluster.