Clustering (K-Means and Hierarchical) with Practical Implementation

Amir Ali
The Art of Data Science
11 min read · Feb 10, 2019

In this chapter, we will discuss Clustering Algorithms (k-Means and Hierarchical), which are unsupervised Machine Learning algorithms.

This chapter spans 5 parts:

  1. What is Clustering?
  2. How Does k-Means Clustering Work?
  3. How Does Hierarchical Clustering Work?
  4. Practical Implementation of k-Means Clustering.
  5. Practical Implementation of Hierarchical Clustering.

1. What is Clustering?

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters). It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields including Machine Learning, Pattern Recognition, Image Analysis, Information Retrieval, Bioinformatics, Data Compression, and Computer Graphics. Example: the way books in a library are grouped by subject.

We will be dealing with two main algorithms:

  1. K-Means Clustering
  2. Hierarchical Clustering

2. How does Clustering Algorithm work?

2.1: k-Means Clustering.

2.1.1: What is k-Means?

K-Means clustering aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

2.1.2: How does k-Means Clustering work?

The k-Means clustering algorithm attempts to split a given anonymous dataset (a set containing no information about class identity) into a fixed number (k) of clusters.

Initially, k centroids are chosen. A centroid is a data point (imaginary or real) at the center of a cluster.

Dataset Description:

This dataset has three attributes: the first is the item, which is our target for grouping similar items; the second and third attributes are the informative values of that item.

In the first step, pick any rows at random as initial centroids; let's suppose I take row 1 and row 3.

Now compare these centroid values with each row of our data using the Euclidean distance formula.

Now let’s solve one by one

Row 1 (A)

Row 2 (B)

Row 3 (C )

Row 4 (D)

Row 5 (E)
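The per-row distance calculations were shown as images in the original post; below is a runnable sketch of the same step. The two attribute values per item are an assumption, chosen so that they reproduce the centroid values computed later in this walkthrough.

```python
import numpy as np

# Toy data for items A..E. These coordinates are assumed for illustration,
# picked so the cluster means match the walkthrough's centroids.
points = {
    "A": np.array([1.0, 1.0]),
    "B": np.array([1.0, 0.0]),
    "C": np.array([0.0, 2.0]),
    "D": np.array([2.0, 4.0]),
    "E": np.array([3.0, 5.0]),
}

# Initial centroids: rows 1 and 3 (items A and C).
c1, c2 = points["A"], points["C"]

cluster_of = {}
for name, p in points.items():
    d1 = np.linalg.norm(p - c1)  # Euclidean distance to centroid 1
    d2 = np.linalg.norm(p - c2)  # Euclidean distance to centroid 2
    cluster_of[name] = 1 if d1 <= d2 else 2
    print(f"{name}: d1={d1:.2f}, d2={d2:.2f} -> cluster {cluster_of[name]}")
```

Running this assigns A and B to cluster 1 and C, D, and E to cluster 2, matching the table in the next step.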

Now

Let's say A and B belong to Cluster 1, and C, D, and E belong to Cluster 2.

As shown in the table below.

Now calculate the centroids of Cluster 1 and Cluster 2 respectively, reassign each point to the closest mean, and repeat until the centroids stop changing.

Now find the centroids of Cluster 1 and Cluster 2.

New Centroid

X1 = (1, 0.5)

X2 = (1.7,3.7)

Previous Centroid

X1 = (1, 1)

X2 = (0, 2)

If the new centroid values are equal to the previous centroid values, our clusters are final; otherwise, repeat the step until the new centroid values equal the previous ones.

In our case, the new centroid values are not equal to the previous centroids.

Now reassign each point to the cluster with the closest mean.

So

X1 = (1, 0.5)

X2 = (1.7,3.7)

Follow the same procedure as above.

So, based on the closest distance, A, B, and C belong to Cluster 1, and D and E to Cluster 2.

So the means of Cluster 1 and Cluster 2 are:

X1 ( Cluster 1 ) = (0.7, 1)

X2 ( Cluster 2 ) = (2.5, 4.5)

New Centroid

X1 = ( 0.7 , 1)

X2 = ( 2.5 , 4.5)

Previous Centroid

X1 = (1, 0.5)

X2 = (1.7,3.7)

If the new centroid values are equal to the previous centroid values, our clusters are final; otherwise, repeat the step until they are equal.

In our case, the new centroid values are not equal to the previous centroids.

Now reassign the clusters by the closest mean, following the same steps.

So, based on the closest distance, A, B, and C belong to Cluster 1, and D and E to Cluster 2.

So the means of Cluster 1 and Cluster 2 are:

X1 ( Cluster 1 ) = (0.7, 1)

X2 ( Cluster 2 ) = (2.5, 4.5)

New Centroid

X1 = (1, 0.5)

X2 = (1.7,3.7)

Previous Centroid

X1 = (1, 0.5)

X2 = (1.7,3.7)

Here the new centroid values are equal to the previous values, hence our clusters are final: A, B, and C belong to Cluster 1, and D and E belong to Cluster 2.

As shown in fig:
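The whole iterate-until-the-centroids-repeat procedure can be sketched in a few lines. The coordinates for A..E are again an assumption, chosen so the loop reproduces the centroid values from the walkthrough:

```python
import numpy as np

# Items A..E (assumed coordinates that match the walkthrough's centroids).
X = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]], dtype=float)
centroids = X[[0, 2]].copy()  # start from rows 1 and 3 (A and C)

while True:
    # Assignment step: each point goes to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its cluster.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    # Stop when the new centroids equal the previous ones.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)     # A, B, C end up in one cluster; D, E in the other
print(centroids)  # final centroids (0.67, 1) and (2.5, 4.5)
```

The stopping rule is exactly the one stated above: the loop exits when the new centroids repeat the previous ones.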

Note: If you want this article check out my academia.edu profile.

2.2: Hierarchical Cluster.

2.2.1: What is Hierarchical Clustering?

It works in a similar way to K-Means clustering, but the difference is that we build a tree structure. There are two approaches:

Bottom-up:

Initially, each point is a cluster.

Repeatedly combine the two “nearest” clusters into one.

Top-Down:

Start with one cluster and recursively split it.

It is used for clustering: similar items are joined step by step to build a hierarchy.

2.2.2: How does Hierarchical Clustering work?

Agglomerative with Dendrogram

Consider only one part (the lower triangular part of the distance matrix).

Now find the minimum distance

Minimum distance = 1 for E & A

Hence merge EA

So

Now find the minimum distance from EA to each other point:

C = min [ dist { (E, A), C} ]

= min [ dist(E, C) , dist(A, C) ]

= min [2, 2]

= 2

B = min [ dist { (E, A), B} ]

= min [ dist(E, B) , dist(A, B) ]

= min [2, 5]

= 2

D = min [ dist { (E, A), D} ]

= min [ dist(E, D) , dist(A, D) ]

= min [3, 3]

= 3

In this table, the minimum distance = 1, for C & B.

Hence merge CB.

Now again find the minimum distance from CB to the other points, as we did above. So

EA = min [ dist { (E, A), (C, B) } ]

= min [ dist { (E, C) , (E, B) , (A, C), (A, B) } ]

= min dist[2, 2, 2, 5]

= 2

D = min [ dist { (C, B), D} ]

= min [ dist{(C, D) , (B, D)} ]

= min dist[6, 3]

= 3

Now the minimum distance = 2, for EA and CB, from the above table.

D = min [ dist { (E, A, C, B), D } ]

= min dist [ (E, D), (A, D), (C, D), (B, D) ]

= min dist [3, 3, 6, 3]

= 3

So

So finally our dendrogram is complete.
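The merge sequence above can be reproduced with a small hand-rolled single-linkage loop. The pairwise distances are taken from the worked example (the full distance table is reconstructed from the values quoted above, so treat it as an assumption):

```python
# Pairwise distances between items, taken from the worked example above.
raw = {("A", "B"): 5, ("A", "C"): 2, ("A", "D"): 3, ("A", "E"): 1,
       ("B", "C"): 1, ("B", "D"): 3, ("B", "E"): 2,
       ("C", "D"): 6, ("C", "E"): 2, ("D", "E"): 3}
d = {frozenset(k): v for k, v in raw.items()}

clusters = [{"A"}, {"B"}, {"C"}, {"D"}, {"E"}]
merges = []
while len(clusters) > 1:
    # Find the pair of clusters with the smallest single-linkage distance
    # (the minimum distance over all cross-cluster point pairs).
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            dij = min(d[frozenset((p, q))]
                      for p in clusters[i] for q in clusters[j])
            if best is None or dij < best[0]:
                best = (dij, i, j)
    dij, i, j = best
    merged = clusters[i] | clusters[j]
    merges.append((sorted(merged), dij))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
    clusters.append(merged)

for members, height in merges:
    print(members, "merged at distance", height)
```

This prints EA at height 1, then CB at height 1, then EACB at height 2, and the final merge with D at height 3 — exactly the dendrogram built above.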


3. Practical Implementation of Clustering Algorithms.

3.1: Practical Implementation of k-Mean Cluster

Dataset Description:

This dataset holds complete information about mall customers' spending scores. It contains five attributes and 200 instances. The first attribute is Customer ID, which is unique for every customer; the second is Gender (male/female); the third is Age, which ranges from 19 to 70; the fourth is Annual Income in k$, which varies from very low through middle to very high across customers; and the last is the Spending Score the customer earns at the mall.

Part 1: Data Preprocessing:

1.1 Import the Libraries

In this step, we import three libraries for the data preprocessing part. A library is a tool you can use to perform a specific job. First we import the numpy library, used for multidimensional arrays; then the pandas library, used to import the dataset; and finally the matplotlib library, used for plotting graphs.

1.2 Import the dataset

After importing the libraries, use the pandas library to import the dataset. Note that we are working with unsupervised learning, so we use only X (there is no target y).
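The two steps above might look like the following sketch. The column names and the tiny inline stand-in for Mall_Customers.csv are assumptions so the snippet is self-contained; in practice you would load the real 200-row file with pd.read_csv:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Tiny stand-in for the real Mall_Customers.csv file;
# in practice: dataset = pd.read_csv("Mall_Customers.csv")
dataset = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 21, 20, 23],
    "Annual Income (k$)": [15, 15, 16, 16],
    "Spending Score (1-100)": [39, 81, 6, 77],
})

# Unsupervised learning: no target y. Keep only the feature matrix X —
# columns 3 and 4, annual income and spending score.
X = dataset.iloc[:, [3, 4]].values
print(X.shape)
```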

Part 2: Elbow Method to find the optimal number of clusters

In this second part, we import our model and apply the Elbow Method to choose the optimal number of clusters.

As we see in the diagram, the line bends at 5, so we decide to take 5 clusters.
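The Elbow Method fits k-Means for a range of k and plots the within-cluster sum of squares (WCSS, exposed by scikit-learn as inertia_). The five synthetic blobs below stand in for the real income/spending columns, so the bend appears at k = 5:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic stand-in for the income/spending data: five separated blobs.
rng = np.random.default_rng(42)
centers = np.array([[25, 25], [25, 80], [55, 50], [85, 20], [85, 85]])
X = np.vstack([c + rng.normal(0, 4, size=(40, 2)) for c in centers])

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares

plt.plot(range(1, 11), wcss)
plt.title("The Elbow Method")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
```

WCSS always decreases as k grows; the "elbow" is the k after which the decrease flattens out.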

Part 3: Applying k-Means to the mall dataset and Visualizing the Clusters.

3.1 Init the Model

In the final step, we initialize our model, the K-Means clusterer, set the number of clusters to 5, and group the customers into five clusters based on their spending score at the mall.

3.2 Fit and Predict the Dataset

In this step, we fit the dataset to our model and predict the cluster for each customer.

3.3 Visualizing the Result

In the visualizing step, we make a graph with Annual Income on the x-axis and Spending Score on the y-axis, so we can easily see the different customer groups. As we can see, different customers show different behaviors, which we label Careful, Standard, Target, Careless, and Sensible.
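Steps 3.1–3.3 together might look like this sketch. The blob data is a stand-in for the real dataset, and the segment names are illustrative only (which cluster index gets which name depends on your data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic stand-in for the income/spending columns: five blobs.
rng = np.random.default_rng(42)
centers = np.array([[25, 25], [25, 80], [55, 50], [85, 20], [85, 85]])
X = np.vstack([c + rng.normal(0, 4, size=(40, 2)) for c in centers])

# 3.1 Init the model with the 5 clusters chosen by the Elbow Method.
kmeans = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42)

# 3.2 Fit the data and predict a cluster index for every customer.
y_kmeans = kmeans.fit_predict(X)

# 3.3 Visualize: income on x, spending score on y, one color per segment.
names = ["Careful", "Standard", "Target", "Careless", "Sensible"]
for i, name in enumerate(names):
    plt.scatter(X[y_kmeans == i, 0], X[y_kmeans == i, 1], label=name)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=120, c="black", label="Centroids")
plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.legend()
```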

If you want the dataset and code, you can also check my Github profile.

3.2: Practical Implementation of Hierarchical Cluster.

Dataset Description:

We use the same Mall Customer dataset described in Section 3.1: five attributes (Customer ID, Gender, Age, Annual Income in k$, and Spending Score) and 200 instances.

Part 1: Data Preprocessing:

1.1 Import the Libraries

As before, we import three libraries for the data preprocessing part: numpy for multidimensional arrays, pandas for importing the dataset, and matplotlib for plotting graphs.

1.2 Import the dataset

As before, use the pandas library to import the dataset. Since we are working with unsupervised learning, we use only X.

Part 2: Use the dendrogram method to find the optimal number of clusters

In this second part, we apply the dendrogram method to choose the optimal number of clusters.

So, based on the dendrogram, we choose the number of clusters = 5.
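A sketch of the dendrogram step, using scipy's hierarchy module on the same kind of synthetic stand-in data; Ward linkage is an assumption here (it is a common choice for this dataset):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Synthetic stand-in for the income/spending columns: five blobs.
rng = np.random.default_rng(42)
centers = np.array([[25, 25], [25, 80], [55, 50], [85, 20], [85, 85]])
X = np.vstack([c + rng.normal(0, 4, size=(40, 2)) for c in centers])

# Build the merge tree with Ward linkage (minimizes within-cluster variance).
Z = linkage(X, method="ward")

dendrogram(Z)
plt.title("Dendrogram")
plt.xlabel("Customers")
plt.ylabel("Euclidean distance")
```

To pick the number of clusters, find the longest vertical stretch in the dendrogram that no horizontal merge line crosses, cut there, and count the branches below the cut.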

Part 3: Applying Hierarchical Clustering to the mall dataset and Visualizing the Clusters

3.1 Import and Init the model

In the final step, we import our model from the Scikit-Learn library and initialize it with five clusters, the number we decided on from the dendrogram method.

3.2 Fit and Predict the Dataset

In this step, we fit the dataset to our model and predict the cluster for each customer.

3.3 Visualizing the Result

In the visualizing step, we make a graph with Annual Income on the x-axis and Spending Score on the y-axis, so we can easily see the different customer groups. As we can see, different customers show different behaviors, which we label Careful, Standard, Target, Careless, and Sensible.
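Steps 3.1–3.3 of the hierarchical version, sketched on the same stand-in blobs. Note that scikit-learn's AgglomerativeClustering has no predict method for new data, so fit_predict is used; the segment names are again illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the income/spending columns: five blobs.
rng = np.random.default_rng(42)
centers = np.array([[25, 25], [25, 80], [55, 50], [85, 20], [85, 85]])
X = np.vstack([c + rng.normal(0, 4, size=(40, 2)) for c in centers])

# 3.1 Init the model with the 5 clusters read off the dendrogram.
hc = AgglomerativeClustering(n_clusters=5, linkage="ward")

# 3.2 Fit the data and get a cluster index for every customer.
y_hc = hc.fit_predict(X)

# 3.3 Visualize the five segments.
names = ["Careful", "Standard", "Target", "Careless", "Sensible"]
for i, name in enumerate(names):
    plt.scatter(X[y_hc == i, 0], X[y_hc == i, 1], label=name)
plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.legend()
```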

If you want the dataset and code, you can also check my Github profile.

End Notes:


If you liked this article, be sure to click ❤ below to recommend it and if you have any questions, leave a comment and I will do my best to answer.

For being more aware of the world of machine learning, follow me. It’s the best way to find out when I write more articles like this.

You can also follow me on Github for the code & dataset, follow me on Academia.edu for this article, reach me on Twitter, email me directly, or find me on LinkedIn. I'd love to hear from you.

That’s all folks, Have a nice day :)
