K-Means Clustering
An in-depth explanation of the popular k-means algorithm, including an implementation in Python from scratch
Clustering is the task of grouping a set of objects such that objects in the same group (cluster) are more similar to each other than to objects in other groups.
K-means is a centroid-based clustering technique that partitions the dataset into k distinct clusters, where each data point belongs to the cluster with the nearest center. It is one of the simplest and most efficient methods for clustering. K-means has been successfully employed in numerous applications across diverse areas such as customer segmentation, image compression, document clustering, and anomaly detection.
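The nearest-center assignment described above can be sketched in a few lines of NumPy. This is only an illustration with made-up toy points and centers, not the full algorithm:

```python
import numpy as np

# Toy 2-D points and two made-up cluster centers (illustrative values only)
points = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.0, 2.0], [8.5, 8.5]])

# Distance from every point to every centroid (Euclidean),
# then each point is assigned to its nearest centroid
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)
print(labels)  # -> [0 0 1 1]: each entry is the index of the nearest centroid
```

The full k-means algorithm alternates this assignment step with recomputing the centers, as developed later in the article.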
In this article, we will explore the k-means algorithm, implement it from scratch in Python, and discuss its variants and limitations.
Centroid-Based Clustering
In centroid-based clustering, each cluster is represented by a central vector, called the cluster center or centroid, which is not necessarily a member of the dataset. The cluster centroid is typically defined as the mean of the points that belong to that cluster.
Our goal in centroid-based clustering is to divide the data points into k clusters in such a way that the points are as close as possible to the centroids of the clusters they belong to.
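The goal stated above amounts to minimizing the within-cluster sum of squared distances between points and their assigned centroids. A minimal sketch of that objective, using hypothetical toy data where each centroid is the mean of its cluster's points:

```python
import numpy as np

def within_cluster_ss(points, centroids, labels):
    """Within-cluster sum of squared distances: the quantity k-means minimizes."""
    return sum(np.sum((points[labels == j] - c) ** 2)
               for j, c in enumerate(centroids))

# Toy example (made-up values): two tight groups, labels assumed known
points = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])

# Each centroid is the mean of the points assigned to its cluster
centroids = np.array([points[labels == j].mean(axis=0) for j in range(2)])

print(within_cluster_ss(points, centroids, labels))
```

Using the cluster means as centroids is what makes this a centroid-based method: for a fixed assignment, the mean is exactly the center that minimizes this sum of squared distances.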