K-Means Clustering in Data Mining

A Beginner’s Guide to K-Means Clustering

Dushanthi Madhushika
LinkIT
4 min read · Sep 3, 2021

Clustering is a machine learning approach for grouping data points into clusters based on shared characteristics. In a good clustering, the points in the same cluster are similar to one another (intra-cluster distance is minimized), while points in different clusters are dissimilar (inter-cluster distance is maximized).

K-means clustering is one of the most popular clustering approaches in data mining. It is unsupervised (it works on unlabeled data) and a hard clustering method (each data point belongs to exactly one cluster). The algorithm divides the data into K clusters, where K is chosen by the user.

The method assigns data points to clusters iteratively, based on the features supplied in the dataset. In each iteration, every data point is assigned to its nearest cluster center according to a distance measure.

First, initial cluster centers are chosen, one for each of the K clusters; they are often picked at random from the data points. The distance between each data point and each cluster center is then computed, and every data point is assigned to the cluster whose center is nearest. Next, the mean of the data points in each cluster is calculated and taken as that cluster's new center. These two steps are repeated until the cluster centers stop changing.

Let’s understand the algorithm through the following example.

Cluster the following data points into 3 clusters using 2 iterations. The distance between two points a = (x1, y1) and b = (x2, y2) is the Manhattan distance:

D(a, b) = |x2 - x1| + |y2 - y1|

Data points — Image by Author

Step 01: Pick a number (K) of cluster centers.

We have to select K initial cluster centers. Here K = 3, and we take the points A1, A4, and A7 as the initial cluster centers.

The following graph shows the distribution of the given dataset.

Data Distribution — Image by Author

Step 02: Assign every item to its nearest cluster center.

For this, we calculate the distance between each point and each of the three cluster centers.

For example, the distance between cluster center 1 (C1) and point A2:

C1 = A1 = (2, 10), as chosen above

D(C1, A2) = |2 - 2| + |5 - 10| = 5
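The distance formula and the worked example can be checked with a few lines of Python. This is only a minimal sketch: the function name `manhattan` is my own choice, and the coordinates A1 = (2, 10) and A2 = (2, 5) are read off from the worked example, since the original data table is an image.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance between two 2-D points given as (x, y) tuples."""
    (x1, y1), (x2, y2) = a, b
    return abs(x2 - x1) + abs(y2 - y1)


# Worked example from the text: C1 = A1 = (2, 10) and A2 = (2, 5).
c1 = (2, 10)
a2 = (2, 5)
print(manhattan(c1, a2))  # 5
```

The distance is symmetric, so swapping the two arguments gives the same value.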

Likewise, we calculate the distance from each point to each cluster center, then assign each point to the cluster whose center is closest.

Calculated distance differences (Iteration 01) — Image by Author
Identified Clusters (Iteration 01) — Image by Author
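The assignment step can be sketched in Python as follows. Note the assumptions: the data table is an image in the original, so the coordinates below are reconstructed from the arithmetic shown in this article, and the labels A3 to A8 are my best guess from the order in which the points appear in the sums. The names `points`, `manhattan`, and `assign` are illustrative.

```python
# Coordinates reconstructed from the article's arithmetic (an assumption);
# the labels A3..A8 are inferred, because the original table is an image.
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}


def manhattan(a, b):
    """Manhattan distance between two (x, y) tuples."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])


def assign(points, centers):
    """Map each point name to the index of its nearest center.

    Ties go to the center with the lowest index.
    """
    return {
        name: min(range(len(centers)), key=lambda i: manhattan(centers[i], p))
        for name, p in points.items()
    }


# Initial centers A1, A4, and A7, as chosen in Step 01.
centers = [points["A1"], points["A4"], points["A7"]]
labels = assign(points, centers)

for k in range(len(centers)):
    members = [name for name, c in labels.items() if c == k]
    print(f"Cluster {k + 1}: {members}")
```

Under these assumed coordinates, cluster 1 contains only A1, cluster 2 contains A3, A4, A5, A6, and A8, and cluster 3 contains A2 and A7.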

Step 03: Move each cluster center to the mean of its assigned items.

Now we calculate the mean of the points in each cluster and move each cluster center to that mean.

The following are the new cluster centers after the first iteration.

Cluster 01: (2, 10)

Cluster 02: ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6)

Cluster 03: ((2+1)/2, (5+2)/2) = (1.5, 3.5)

New Cluster Centers (Iteration 01) — Image by Author
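The update step is just a coordinate-wise mean. A small sketch, using the cluster members that appear in the sums for the first iteration (the helper name `mean_point` is hypothetical):

```python
def mean_point(members):
    """Coordinate-wise mean of a non-empty list of (x, y) tuples."""
    xs = [x for x, _ in members]
    ys = [y for _, y in members]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


# Cluster members after the first assignment step, taken from the sums in the text.
clusters = [
    [(2, 10)],
    [(8, 4), (5, 8), (7, 5), (6, 4), (4, 9)],
    [(2, 5), (1, 2)],
]

new_centers = [mean_point(c) for c in clusters]
print(new_centers)  # [(2.0, 10.0), (6.0, 6.0), (1.5, 3.5)]
```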

Step 04: Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold).

So we repeat steps 2 and 3 until the cluster centers are stable. As the example asks, let's stop after two iterations.

The following shows the distances computed in the second iteration.

Calculated distance differences (Iteration 02) — Image by Author
Identified Clusters (Iteration 02) — Image by Author

Then calculate the new cluster centers as the mean of each cluster. The new cluster centers after two iterations are:

Cluster 01: ((2+4)/2, (10+9)/2) = (3, 9.5)

Cluster 02: ((8+5+7+6)/4, (4+8+5+4)/4) = (6.5, 5.25)

Cluster 03: (1.5, 3.5), no change

To obtain stable clusters, continue iterating until the cluster centers no longer change.

New Cluster Centers (Iteration 02) — Image by Author
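Putting the steps together, the whole procedure can be sketched as a loop that alternates assignment and update until the centers stop changing. This is an illustrative sketch, not a reference implementation: the coordinates are reconstructed from the arithmetic in this article (the data table is an image), the function names are my own, and keeping the old center when a cluster becomes empty is an assumption the article does not discuss.

```python
# Points A1..A8, reconstructed from the article's arithmetic (an assumption).
points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]


def manhattan(a, b):
    """Manhattan distance between two (x, y) tuples."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])


def nearest(p, centers):
    """Index of the center closest to p (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda i: manhattan(centers[i], p))


def mean_point(members):
    """Coordinate-wise mean of a non-empty list of (x, y) tuples."""
    return (
        sum(x for x, _ in members) / len(members),
        sum(y for _, y in members) / len(members),
    )


def k_means(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop changing."""
    history = []
    for _ in range(max_iter):
        # Step 02: assign every point to its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Step 03: move each center to the mean of its assigned points.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, c in zip(points, labels) if c == k]
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(mean_point(members) if members else old)
        history.append(new_centers)
        # Step 04: stop once an update leaves every center unchanged.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers, history


# Initial centers A1, A4, and A7.
labels, centers, history = k_means(points, [(2, 10), (5, 8), (1, 2)])
for i, cs in enumerate(history, start=1):
    print(f"Iteration {i}:", [(round(x, 2), round(y, 2)) for x, y in cs])
```

With these assumed coordinates, the first two iterations give the centers (2, 10), (6, 6), (1.5, 3.5) and then (3, 9.5), (6.5, 5.25), (1.5, 3.5). A third iteration moves A4 into the first cluster, and a fourth pass confirms that nothing changes any more.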

This is how K-Means clustering works. However, the algorithm has some limitations:

  1. Results can vary significantly depending on the initial choice of seeds (the initial cluster centers).
  2. It can get trapped in a local minimum.
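Both limitations can be seen on a tiny made-up dataset. The four points below are hypothetical and not from this article's example; they sit at the corners of a wide rectangle and are clustered with K = 2, using the same Manhattan distance and mean update as above. The function names and the `cost` measure (the total distance from each point to its assigned center) are my own choices for illustration.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) tuples."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])


def mean_point(members):
    """Coordinate-wise mean of a non-empty list of (x, y) tuples."""
    return (
        sum(x for x, _ in members) / len(members),
        sum(y for _, y in members) / len(members),
    )


def k_means(points, centers, max_iter=100):
    """Run assignment/update steps until the centers stop changing."""
    for _ in range(max_iter):
        labels = [
            min(range(len(centers)), key=lambda i: manhattan(centers[i], p))
            for p in points
        ]
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, c in zip(points, labels) if c == k]
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(mean_point(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def cost(points, labels, centers):
    """Total distance from every point to its assigned center."""
    return sum(manhattan(p, centers[c]) for p, c in zip(points, labels))


# Hypothetical toy data: corners of a 4 x 1 rectangle.
toy = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Seeds on opposite short sides: left pair vs. right pair.
labels_a, centers_a = k_means(toy, [(0, 0), (4, 0)])
cost_a = cost(toy, labels_a, centers_a)

# Seeds on the same short side: the run settles on bottom pair vs. top pair.
labels_b, centers_b = k_means(toy, [(0, 0), (0, 1)])
cost_b = cost(toy, labels_b, centers_b)

print(centers_a, cost_a)  # [(0.0, 0.5), (4.0, 0.5)] 2.0
print(centers_b, cost_b)  # [(2.0, 0.0), (2.0, 1.0)] 8.0
```

Both runs are stable, in the sense that another iteration would change nothing, yet the second one has four times the total distance. A common remedy is to run the algorithm from several different seeds and keep the result with the lowest cost.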

Despite these limitations, K-Means clustering is widely used in data science.

Thank you for reading so far, and I hope you learned something.

About the author: Dushanthi Madhushika is a tech enthusiast and a graduate of the Faculty of Information Technology, University of Moratuwa.