K-Means 101
This article provides a simple introduction to the K-means algorithm.
It is one of the clustering methods used to find patterns in a dataset that has no target variable. This type of pattern finding is called ‘unsupervised learning’ and can be used when the data carry no labels. Under the hood, clustering uses a defined distance measure to separate data points into groups whose features are similar.
How does it work?
The k-means method separates a dataset into k groups, or clusters, each represented by the average of its members, known as a centroid. Every data point belongs to the cluster whose centroid is nearest to it. A centroid is a collection of feature values that defines its group, so examining the centroids helps interpret what kind of group each cluster represents.
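To make the idea concrete, here is a minimal sketch of the usual iterative procedure: assign each point to its nearest centroid, then move each centroid to the average of its points, and repeat until the assignments stop changing. It is an illustration rather than a production implementation. The function names (`kmeans`, `squared_distance`, `mean_point`), the random choice of starting centroids, and the toy data are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Feature-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Assumption: start from k data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return centroids, labels


# Made-up toy data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids))
print(labels)
```

On this toy data the first three points should share one label and the last three the other. Each printed centroid is the feature-wise average of its group, which is what you would inspect to interpret the cluster.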
The goal is to create groups of data points that are highly similar within each group and clearly distinct from the other groups. The value of k, the number of clusters, is a hyperparameter that has to be tuned by the data analyst. For example, setting k to ‘5’ will split your data into five groups. There are some techniques for selecting k, but none of them is proven optimal. Most of those techniques require the analyst to make an “educated guess” by looking at some metrics or by examining cluster assignments visually. If you set k to the exact number of data points found in your dataset, each data point automatically becomes an independent cluster. If you set k to 1, then all data points will be regarded as homogeneous and produce only one cluster.
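One metric that can support such an educated guess is the within-cluster sum of squares (WCSS): the total squared distance from each point to its centroid. The sketch below computes it for every k on a small made-up 1-D dataset. The helper names (`kmeans_1d`, `wcss`), the deterministic starting centroids, and the data are assumptions for illustration only. Looking for the k where the score stops dropping sharply is a heuristic, not a guarantee.

```python
def kmeans_1d(values, k, max_iter=100):
    """Tiny 1-D K-means; returns (centroids, labels)."""
    ordered = sorted(values)
    # Assumption: deterministic start from k evenly spaced sorted values.
    centroids = [ordered[(i * len(ordered)) // k] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assign every value to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            for v in values
        ]
        if new_labels == labels:
            break  # no assignment changed, so stop
        labels = new_labels
        # Move each centroid to the mean of the values assigned to it.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, labels


def wcss(values, centroids, labels):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return sum((v - centroids[lab]) ** 2 for v, lab in zip(values, labels))


# Made-up data with three visible groups, around 1, 5 and 9.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.4, 8.8]
scores = {}
for k in range(1, len(values) + 1):
    centroids, labels = kmeans_1d(values, k)
    scores[k] = wcss(values, centroids, labels)
    print(k, round(scores[k], 3))
```

The two extremes described above show up directly in the scores: with k equal to 1 every point is measured against the overall average, so the score is at its largest, and with k equal to the number of data points each point is its own centroid, so the score is zero. On this data the score falls steeply up to k = 3 and changes little afterwards, which is the kind of pattern an analyst would look for.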
Assumptions of K-means
1. K-means clustering requires all variables to be continuous
2. K-means clustering also requires a specification of the number of clusters, k.
Strengths of K-Means
1. It is easy to understand and can uncover unknown groups in complex data sets
2. It easily adapts to new data
3. It can easily adjust to hyperparameter changes. When k changes, the centroids are recomputed and the clusters change accordingly.
4. Each iteration of K-means is linear in the number of data objects, so execution time grows only in proportion to the size of the dataset.
Weaknesses of K-means
1. K-means can only handle numerical data
2. The user needs to specify k.
3. The algorithm is sensitive to outliers and noisy data
4. K-means assumes that each cluster has roughly equal numbers of observations
5. K-means is unable to identify clusters that have arbitrary shapes.
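The sensitivity to outliers follows from the centroid being an average. A small sketch with made-up 1-D values shows how a single extreme point drags the centroid away from the rest of the cluster. The `centroid` helper and the numbers are assumptions chosen for illustration.

```python
def centroid(values):
    """A 1-D centroid is simply the mean of the cluster's values."""
    return sum(values) / len(values)


cluster = [10.0, 11.0, 12.0, 13.0, 14.0]
with_outlier = cluster + [100.0]  # one made-up extreme value

clean_center = centroid(cluster)
shifted_center = centroid(with_outlier)

print("centroid without outlier:", clean_center)
print("centroid with outlier:   ", shifted_center)
# True when the shifted centroid lies beyond every original point.
print(shifted_center > max(cluster))
```

With the outlier included, the centroid no longer sits among the points it is supposed to summarize, which in turn can change which points are assigned to the cluster.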
Application areas
K-means is a versatile algorithm that can be used for many kinds of grouping tasks. Some use cases are:
A. Segment users by their activity on an application, website, or platform
B. Group images
C. Identify groups in health monitoring
D. Fraud detection
E. Categorize academic performance into grades like A, B, or C
F. Recommendation systems that adapt to user preferences
For more information on the K-means algorithm and how it can be used in Python, check out this article from DataCamp.