Clustering for beginners

Rajat Pal
Published in AlmaBetter
Apr 24, 2021

What do we understand by clustering?

Clustering can be defined as grouping together objects that are similar to each other; you can think of a cluster as a group of objects of the same kind.

It helps us break bigger problems into smaller, easier ones.

We have all seen Man vs. Wild, where the host explores forests; whenever he finds something new, whether it is a plant, an animal, or a fruit, he tries to recall which things it is most similar to.

Once he figures out which group, or cluster, the new thing belongs to, it becomes easy for him to decide whether to keep it or throw it away.

K-Means Clustering Algorithm

It is an unsupervised machine learning algorithm that helps us find the clusters in the dataset we provide.

K-means is a very popular clustering algorithm. The K in k-means is a free parameter whose value gives the number of clusters we want in our dataset.
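
As a quick, hedged illustration, here is how k-means is typically run with scikit-learn; the dataset below is randomly generated just for the sketch, and n_clusters is the k we are talking about:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset with 3 true groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters is the free parameter k: the number of clusters we ask for
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # cluster index for each data point
print(kmeans.cluster_centers_)   # the centers (centroids) of the 3 clusters
```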

How does k-means work?

Step 1

It starts by randomly selecting k points in the dataset, which we call the centers of the k clusters; we also call them centroids.

Step 2

Then we find the Euclidean distance of every data point from the centroid of each of these k clusters, and we assign each data point to the cluster whose centroid is at the minimum distance from it.
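
In code, the Euclidean distance used in this step is just the square root of the summed squared differences between the two points' coordinates; a tiny NumPy version (the function name is my own) looks like:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two points: sqrt(sum((a_i - b_i)^2))."""
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

print(euclidean([1, 2], [4, 6]))  # 5.0 (the classic 3-4-5 triangle)
```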

Step 3

Now we update the centroid of each cluster to the mean of all the values present in that cluster, and then we repeat step 2.

We keep repeating steps 2 and 3 until the centroid values of the clusters stop changing, or until we reach the number of iterations we want.
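
Putting steps 1 to 3 together, here is a minimal from-scratch sketch of the whole loop (my own illustrative implementation, assuming no cluster ever ends up empty; in practice you would use a library implementation):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans(X, 3)` on the toy data from the earlier snippet would return a cluster label for each point along with the three final centroids.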

One of the biggest problems with k-means clustering is that we need to provide the k value (the number of clusters), and it is hard to decide on a k value when we have no information about the data.

How do we find the optimal value of k for k-means clustering?

We can find the optimal value of k with the help of the elbow method.

In this method, we find the sum of squared errors (SSE) of each cluster separately and then add up all the SSE values.

SSE = sum of the squared distances of each element from its cluster's centroid
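
Given the labels and centroids from a k-means run (such as the sketch above), the total SSE can be computed directly; the helper name below is my own:

```python
import numpy as np

def total_sse(X, labels, centroids):
    """Sum over clusters of the squared distances of each point to its centroid."""
    return sum(
        np.sum((X[labels == j] - c) ** 2)
        for j, c in enumerate(centroids)
    )
```

scikit-learn exposes this same quantity as the `inertia_` attribute of a fitted KMeans model.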

We will plot a graph with the value of k on the x-axis and the total SSE on the y-axis.

In this elbow curve, the value of SSE keeps decreasing as we increase k, and an elbow-like shape forms; that is one of the reasons we call this an elbow curve.

Just as our elbow has a point from which the arm bends up and down, the curve has a bend point, which here can be seen at k = 3, so we choose 3 as the optimal value of k.
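
Here is a minimal sketch of the elbow method using scikit-learn, where `inertia_` is exactly the total SSE described above (the dataset and the range of k values are illustrative assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 true clusters (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

sse = []
ks = range(1, 10)
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(model.inertia_)  # total within-cluster SSE for this k

plt.plot(ks, sse, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("SSE")
plt.title("Elbow curve")
plt.show()  # look for the bend (the "elbow") to pick k
```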

So why don't we just choose the k value with the lowest SSE?

One thing we need to understand here is that as we increase the number of clusters, more data points end up close to some centroid, because points that were far away simply form clusters of their own; the SSE therefore keeps decreasing as k grows. In the extreme, with one cluster per data point, the SSE would be zero.

But by doing this we can end up splitting the data into far more groups than are actually meaningful, which is of no use at all.

The biggest drawback of k-means clustering is that we have to provide a k value when performing it. To overcome this, we can use hierarchical clustering.

There are two types of hierarchical clustering:

  • Agglomerative hierarchical clustering
  • Divisive hierarchical clustering

Agglomerative hierarchical clustering

This is also known as the bottom-up approach: we start with each data point as an individual cluster and then keep combining clusters on the basis of their distance (here we used the Euclidean distance to measure the distance between data points), and in the end we are left with only one cluster.

(Figure: each step depicts how the clusters are formed.)
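
As a hedged sketch, SciPy's `linkage` function performs exactly this bottom-up merging; the random data and the choice of "ward" linkage below are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))  # 10 random 2-D points, illustrative only

# Each row of Z records one merge step: the two clusters joined,
# the distance between them, and the size of the new cluster
Z = linkage(X, method="ward")
print(Z)
```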

Divisive hierarchical clustering

In this, we use the top-down approach, also known as divisive analysis.

Here we start with one cluster that consists of all the data points, then we divide that cluster into two clusters, and we repeat this until we are left with only one element in each cluster.
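
Divisive clustering is less commonly found in libraries; one practical approximation I can sketch is bisecting k-means, which also starts from a single all-encompassing cluster and keeps splitting it (this assumes scikit-learn 1.1+ for BisectingKMeans, and it stops at a requested number of clusters rather than going all the way down to single elements):

```python
from sklearn.cluster import BisectingKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Starts from one cluster holding all points and repeatedly bisects
# a cluster until the requested number of clusters is reached
labels = BisectingKMeans(n_clusters=3, random_state=42).fit_predict(X)
print(labels[:10])
```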

How should we choose the number of clusters in hierarchical clustering?

A dendrogram is a tree-like diagram that records the sequence of merges or splits. The longer the vertical lines in the dendrogram, the greater the distance between the clusters they join.

The value of k can be found by selecting a distance on the y-axis and drawing a horizontal line across the dendrogram at that distance; counting the number of vertical cluster lines it cuts gives us the value of k for our dataset.

So in the above example, we can see there will be 5 clusters in the dataset when we choose the distance as 135.
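
Here is a minimal sketch of both steps with SciPy, drawing the dendrogram and then cutting the tree at a chosen distance (the dataset and the cut height of 10 are illustrative assumptions, not the 135 from the figure above):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=4, random_state=42)
Z = linkage(X, method="ward")

# The dendrogram records the sequence of merges; taller vertical
# lines mean the merged clusters were further apart
dendrogram(Z)
plt.ylabel("distance")
plt.show()

# Cutting the tree at a chosen distance yields one cluster per branch
# the cut crosses; the number of distinct labels is our k
labels = fcluster(Z, t=10, criterion="distance")
print(len(set(labels)), "clusters at distance 10")
```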

This is one of the ways in which we can find the optimal value of k for our dataset. For an optimal clustering solution, we can then use this k value in the k-means clustering algorithm.

That's all in this blog. If you like my work, please follow me.
