k-Means Clustering

Siddhraj Maramwar
Published in Analytics Vidhya · 3 min read · Jun 22, 2020

Don’t confuse it with KNN (k-nearest neighbors), which is a supervised algorithm.

k-means is a clustering machine learning algorithm.

k-Means is an unsupervised algorithm.

k-means partitions (divides) the data into groups based on similarity.


It divides the data into K non-overlapping subsets, or clusters, without any internal structure or labels, which is why it is an unsupervised algorithm. Objects within the same cluster are similar to each other and different from the objects in other clusters.

While using k-means we come across a question:

How to find the similarities among the objects (data points)?

The objective of k-means is that similar objects go into the same cluster and dissimilar objects fall into different clusters.

For this purpose we can use a dissimilarity metric instead of a similarity metric. In other words, conventionally the distance between samples is used to form the clusters, so we can say k-means tries to minimize the intra-cluster distances and maximize the inter-cluster distances.

Inter- and intra-cluster distances

Intra-cluster distance: the distance between objects within the same cluster (we want this small).

Inter-cluster distance: the distance between objects in two different clusters (we want this large).

Now, how do we calculate the dissimilarity between two objects?

Euclidean distance helps us find the dissimilarity. There are other measures as well, such as cosine similarity and average distance, but the Euclidean distance is the most widely used for this purpose.

Euclidean distance

Consider the example below:

There are two people, and we want to find the dissimilarity between them using their demographic information.

Let person 1’s age be 54 and person 2’s age be 50. With age as the only feature, the Euclidean distance between them is √((54 − 50)²) = 4.
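As a sketch, the calculation above generalizes to any number of features. The helper below is a minimal illustration (the function name and the second "income" feature are my own additions, not from the article):

```python
import math

def euclidean_distance(a, b):
    # Square the per-feature differences, sum them, take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# One feature (age): person 1 is 54, person 2 is 50.
print(euclidean_distance((54,), (50,)))        # 4.0

# Adding more features (e.g. a hypothetical income column in $1000s)
# simply extends the sum under the square root.
print(euclidean_distance((54, 60), (50, 55)))  # ≈ 6.40
```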

A scatter plot is commonly used to visualize the distribution of the objects.

How the k-means algorithm works

  1. Initialize the number of clusters, K. (Determining the right number of clusters is essentially a hard problem in k-means.)
  2. Distance calculation: compute the distance between each data point and each cluster centroid, and store these distances in a matrix.
  3. Assign each point to the closest centroid.
  4. Evaluate the model using SSE (sum of squared errors).
  5. Compute the new centroid of each cluster as the mean of the data points assigned to it.
  6. Repeat steps 2 to 5 until the centroids no longer move.
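The steps above can be sketched in plain Python. This is a minimal illustration under my own naming, not an optimized implementation (real workloads would use a vectorized library):

```python
import random

def kmeans(points, k, max_iter=100):
    # Step 1: initialize by picking k random points as starting centroids.
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Steps 2-3: assign each point to its closest centroid
        # (squared Euclidean distance gives the same assignment).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 5: recompute each centroid as the mean of its cluster's points.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 6: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Step 4: evaluate with SSE, the sum of squared distances
    # from each point to its assigned centroid.
    sse = sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[i]))
        for i, cluster in enumerate(clusters)
        for p in cluster
    )
    return centroids, clusters, sse

# Small made-up 2-D dataset with two visually obvious groups.
points = [(1, 1), (1.5, 2), (8, 8), (9, 8.5), (1, 0.5)]
centroids, clusters, sse = kmeans(points, k=2)
```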

Since this is a heuristic algorithm, there is no guarantee it will converge to the global optimum, and the result may depend on the initial centroids. In other words, the algorithm is guaranteed to converge to a result, but that result may only be a local optimum.

To mitigate this, it is advised to run the whole process multiple times with different randomized starting centroids and keep the best result. As the algorithm is usually very fast, running it multiple times is not a problem.
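Assuming scikit-learn is available, its KMeans estimator follows this advice via the n_init parameter, which reruns the algorithm with different random centroids and keeps the run with the lowest SSE (exposed as inertia_). The sample data here is made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups in 2-D.
X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 8.5], [1, 0.5]])

# n_init=10 runs k-means 10 times with different random starting
# centroids and keeps the run with the lowest SSE.
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

print(model.inertia_)          # SSE of the best run
print(model.cluster_centers_)  # final centroids of the best run
print(model.labels_)           # cluster assignment of each point
```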
