Introduction to K-Means Clustering

Published in Hack A BIT · 4 min read · Oct 18, 2019

Required Knowledge: Unsupervised Machine Learning

K-Means Clustering is generally used when we have uncategorized data. It attempts to group individuals in a population together by similarity. This algorithm provides insight into what types of groupings exist in complex data sets. To put it simply, K-Means finds k centroids and then assigns each data point to the cluster with the closest centroid.

K-means is a simple yet powerful algorithm that has been adapted to solve many real-world problems. We will discuss these applications later in this article.

Let’s see how it works.

Algorithm

The algorithm alternates between two main steps.

1. Assign each data point to its nearest centroid. A data point X is assigned to the cluster argminᵢ dist( X , Cᵢ ), where dist( X , Cᵢ ) is the Euclidean distance between X and the centroid Cᵢ, and argmin returns the cluster whose centroid is nearest to X.

2. Reposition each centroid to the mean of its assigned data points: Cᵢ = ( x₁ + x₂ + … + xₙ ) / n.

Here, the mean is the updated position of the centroid and the xⱼ are the n data points currently assigned to that centroid.
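The two steps above can be sketched directly in numpy (a minimal illustration with made-up data, not the article's original code):

```python
import numpy as np

def kmeans_step(X, centroids):
    """One k-means iteration: assign points, then update centroids."""
    # Step 1: Euclidean distance from every point to every centroid,
    # then pick the nearest centroid for each point (the argmin)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 2: move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])  # random-ish starting guesses
labels, centroids = kmeans_step(X, centroids)
print(labels)  # [0 0 1 1]
```

Repeating `kmeans_step` until the centroids stop moving gives the full algorithm.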

Dry Run:

Consider the following data points with K = 2 (number of clusters). The initial positions of the centroids are chosen at random. Note that the final result can depend on this initialization; different starting positions may converge to different clusterings.

Step 1:

Points nearest to Red centroid are assigned Red colour.

Points nearest to Blue centroid are assigned Blue colour.

Step 2:

Reposition the centroids for red and blue points.

Step 3:

Reassign the data points to their nearest centroids.

Step 4:

Recompute the centroids of the new clusters.

This process continues until the centroids' positions no longer change.

Purity of a Cluster:

Purity is the fraction of the total number of objects (data points) that were classified correctly, and lies in the range [0, 1]:

Purity = (1/N) Σᵢ maxⱼ | cᵢ ∩ tⱼ |

where N is the number of objects (data points), k is the number of clusters, cᵢ is a cluster in C, and tⱼ is the classification which has the max count for cluster cᵢ.

When we say “correctly”, it implies that each cluster cᵢ has identified a group of objects as the same class that the ground truth has indicated. We use the ground-truth classification tⱼ of those objects as the measure of assignment correctness; to do so, however, we must know which cluster cᵢ maps to which ground-truth classification tⱼ.

If the clustering were 100% accurate then each cᵢ would map to exactly one tⱼ, but in reality each cᵢ contains some points whose ground truth assigned them to other classifications. Naturally, then, the highest clustering quality is obtained by using the cᵢ-to-tⱼ mapping with the greatest number of correct classifications, i.e. the largest | cᵢ ∩ tⱼ |. That is where the max in the equation comes from.
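The purity formula can be computed in a few lines of numpy (a sketch with illustrative labels, not data from the article):

```python
import numpy as np

def purity(labels, truth):
    """Purity = (1/N) * sum over clusters of the majority class count."""
    total = 0
    for c in np.unique(labels):
        members = truth[labels == c]
        # count of the most common ground-truth class within this cluster
        total += np.bincount(members).max()
    return total / len(labels)

labels = np.array([0, 0, 0, 1, 1, 1])  # cluster assignments
truth  = np.array([0, 0, 1, 1, 1, 1])  # ground-truth classes
print(purity(labels, truth))  # 0.8333... (5 of 6 points majority-matched)
```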

Code:

This is sample code that implements k-means. For this we first need to import two important libraries.

 numpy: to generate input data

 sklearn: to implement k-means in Python

In the code below, you can specify the number of clusters. For this example, assign 2 clusters as follows:

n_clusters: The number of clusters to form as well as the number of centroids to generate.

random_state: Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.

predict: Computes the nearest centroid for each of the given data points.

cluster_centers_: Outputs the coordinates of the k centroids formed.
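Putting these pieces together, a minimal scikit-learn example might look like the following (the data here is illustrative, not the article's original):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of two-dimensional points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# n_clusters=2 forms two clusters; random_state makes the
# centroid initialization deterministic
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

print(kmeans.labels_)                     # cluster assigned to each point
print(kmeans.predict([[0, 0], [12, 3]]))  # nearest centroid for new points
print(kmeans.cluster_centers_)            # coordinates of the 2 centroids
```

The two new points in `predict` land in different clusters, and the fitted centroids sit at the mean of each group.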

Applications of K-Means Clustering:

1. Segmenting customer data on the basis of sales.

2. Separating audio.

3. Rearranging inventory on the basis of purchase frequency.

4. Optimizing delivery store locations.

5. Identifying crime localities.

Conclusion:

K-Means is one of the simplest unsupervised learning algorithms. However, we need to specify the number of clusters in advance, and the final result depends on the initialization and often terminates at a local optimum. Unfortunately, there is no theoretical way to compute the number of clusters in advance; a practical approach is to run the algorithm for different values of k and choose the best one. Overfitting may occur for large values of k. Once the algorithm has successfully characterized our data, new data points can be assigned to the learned clusters using supervised learning techniques.
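The "try different values of k" approach can be sketched with the elbow method, comparing scikit-learn's `inertia_` (within-cluster sum of squared distances) across k (synthetic data for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs of 50 points each
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# Fit k-means for several k and record the inertia for each
inertias = {k: KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
            for k in range(1, 6)}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The inertia drops sharply from k = 1 to k = 2 and flattens afterwards; that "elbow" suggests k = 2 is the natural choice for this data.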

References:

https://stats.stackexchange.com/questions/95731/how-to-calculate-purity

https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.predict

Written by-

Aryan Dosaj

Birla Institute of Technology, Mesra
