K-means Clustering — Everything you need to know
When I was learning about K-means clustering, I had to go through several blogs and videos to gather all the information I wanted. This tutorial is an attempt to bring the concepts learned from those various sources together in a single blog post.
Table of Contents
- What is K-means Clustering?
- Our Goal?
- How does this algorithm work?
- How to randomly initialize centroids?
* Random Initialization
* K-means++
- How to optimize?
- Evaluation metrics for clustering?
- Stopping criteria for clustering?
- Assumptions of K-means
- Challenges of K-means
- Data issues to be handled?
- Applications of K-means
- References
What is K-means Clustering?
It is an algorithm that helps us group similar data points together. It is a partitioning problem: if we have m data points, we need to partition them into K clusters.
Our Goal?
We want to optimize our algorithm so that —
1. Data points inside a cluster are as similar to each other as possible.
2. Clusters are kept as far apart from each other (as dissimilar) as possible.
How does this algorithm work?
Input —
1. Training set: x1, x2, …, xm
2. Number of clusters you want for your data set: K
Pseudo Algo —
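Below is a minimal NumPy sketch of the standard iteration; the function and variable names are only illustrative, and it assumes no cluster ever ends up empty.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Plain K-means: assign each point to the nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick K data points as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Step 2: cluster assignment - index of the closest centroid for each point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```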
How to randomly initialize centroids?
In the above algorithm, the two most important considerations for K-means are —
- How to decide the number of centroids?
- How to initialize the centroids?
We will look into how to select the number of clusters later in this tutorial.
- Random Initialization -
- We can randomly pick K data points from our training set and initialize the centroids to these data points.
- Problem with random initialization — Every time we run our algorithm, our initial centroids will be different, so K-means can end up with different solutions depending on how the centroids were initialized. (LOCAL OPTIMA)
- K-means++
Instead of picking all the centroids randomly for initialization, we pick only one centroid at random here.
- Randomly pick 1 data point as the initial centroid.
- Calculate the distance of each data point from its nearest already-chosen centroid.
- Pick the next centroid with probability proportional to that squared distance, so points far away from the current centroids are the most likely to be chosen.
K-means++ is costlier than random initialization, but it usually converges faster and produces better clusters.
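A small sketch of the k-means++ seeding idea (an illustrative implementation, not scikit-learn's internal one):

```python
import numpy as np

def kmeanspp_init(X, K, seed=0):
    """Pick the first centroid at random, then pick each following centroid
    with probability proportional to its squared distance from the nearest
    centroid chosen so far."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]   # first centroid: a random data point
    for _ in range(K - 1):
        # squared distance of every point to its nearest already-chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

In scikit-learn this seeding is the default: KMeans(init='k-means++').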
How to avoid Local Optima? Or Optimize for better Local Optima?
There is a high possibility that we will run into a local optimum while running the K-means algorithm, so our objective is to find the best local optimum we can.
Multiple random initialization:
This method is used when K is small (roughly 2–10).
Algo:
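The usual recipe: run K-means several times from different random seeds, compute the cost (inertia) of each run, and keep the run with the lowest cost. A sketch reusing the kmeans function from above:

```python
def inertia(X, centroids, labels):
    """Sum of squared distances of points to their assigned centroid."""
    return ((X - centroids[labels]) ** 2).sum()

def kmeans_multi_init(X, K, n_init=50):
    best = None
    for seed in range(n_init):
        centroids, labels = kmeans(X, K, seed=seed)
        cost = inertia(X, centroids, labels)
        if best is None or cost < best[0]:
            best = (cost, centroids, labels)
    return best  # (lowest cost, its centroids, its labels)
```

scikit-learn does this automatically through the n_init parameter of KMeans.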
Evaluation metrics for clustering -
Inertia — Tries to make compact clusters
It is the sum of squared distances of all the points in a cluster from their centroid.
The lower the inertia, the better the clusters are.
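With scikit-learn, inertia is available directly after fitting (assuming X is the feature matrix):

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)   # sum of squared distances of points to their closest centroid
```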
Dunn Index — dissimilarity between clusters
It takes into account the distance between clusters, so that clusters are as different from each other as possible.
The higher the Dunn index, the better the clusters are.
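One common formulation is the smallest distance between clusters divided by the largest spread within a cluster; exact definitions vary between sources, so treat this as a rough sketch (it assumes every cluster has at least two points):

```python
import numpy as np
from scipy.spatial.distance import pdist

def dunn_index(X, labels, centroids):
    # separation: smallest distance between any two centroids
    inter = pdist(centroids).min()
    # spread: largest pairwise distance inside any single cluster
    intra = max(pdist(X[labels == k]).max() for k in np.unique(labels))
    return inter / intra
```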
Stopping criteria for K-means
1. Maximum number of iterations is reached.
2. Centroids of the newly formed clusters do not change much.
3. Points remain in the same cluster.
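In scikit-learn the first two criteria map onto the max_iter and tol parameters of KMeans:

```python
from sklearn.cluster import KMeans

# stop after 300 iterations, or earlier once the centroids move less than tol
km = KMeans(n_clusters=3, max_iter=300, tol=1e-4, n_init=10, random_state=0)
```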
Assumptions of K-means
- Limited to spherical-shaped clusters
If you want to know what clusters K-means will form, just imagine a roughly spherical shape. Because K-means only measures distance from the centroid, each cluster comes out roughly spherical, so it cannot capture clusters with complicated geometrical shapes.
Solution — KERNEL method
Transform the data into a higher-dimensional representation in which it becomes linearly separable.
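A readily available option in scikit-learn for such non-convex shapes is SpectralClustering, a graph/kernel-based method; a minimal usage sketch, assuming X is the feature matrix:

```python
from sklearn.cluster import SpectralClustering

# builds a similarity graph (an implicit kernel view of the data), so it can
# recover non-convex cluster shapes that plain K-means cannot
labels = SpectralClustering(n_clusters=2, affinity="rbf").fit_predict(X)
```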
- Size of clusters
This algorithm considers only distance, thus does not account for clusters with different sizes or densities. It assumes that features within a cluster have equal variance.
Challenges of K-means
- Local Optima —
As the number of clusters increases, the number of possible local optima increases as well. Thus our clusters depend on how the centroids were initialized.
- K —
K-means cannot learn the number of clusters from the data.
- Slow —
As the size of our data set grows, the algorithm becomes slow, because every iteration has to visit every point in the data set.
Solution — Mini Batch K-means (see the sketch after this list)
- Clusters uniform data —
Even if there are no logical clusters in the data set, this algorithm will cluster uniform data as well.
- Gives more weight to bigger clusters
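A minimal MiniBatchKMeans sketch (assuming X is the data matrix); each iteration fits on a small random batch instead of the whole data set:

```python
from sklearn.cluster import MiniBatchKMeans

# centroids are updated from random mini-batches, so each step touches only
# batch_size points instead of every point in the data set
mbk = MiniBatchKMeans(n_clusters=3, batch_size=1024, n_init=10, random_state=0)
labels = mbk.fit_predict(X)
```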
Data Issues to be handled for applying K-means
- Outliers — As we are using a distance-based approach, K-means is sensitive to outliers.
- Categorical Data — K-means cannot handle categorical data. This can be dealt with in 3 ways —
1. Convert the categorical variables to numerical → scale the data → apply K-means (see the pipeline sketch after this list).
2. Use Hamming distance instead of Euclidean distance. [If two categorical values are the same, make the distance 0, else 1.]
3. Compute the mode.
- Dimension reduction — If the data is high-dimensional, it is good to reduce its dimensionality.
- Missing Values — These should be treated.
- Multicollinearity — K-means is not badly affected, since it only calculates distances. However, if we delete some features based on collinearity, we might bring some samples closer together.
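A sketch of option 1 for categorical data (encode, scale, then cluster); the column names here are hypothetical and df stands for a pandas DataFrame containing them:

```python
from sklearn.cluster import KMeans
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# one-hot encode the categorical columns, scale the numeric ones, then cluster
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["color", "city"]),  # hypothetical columns
    (StandardScaler(), ["age", "income"]),                        # hypothetical columns
)
pipe = make_pipeline(preprocess, KMeans(n_clusters=3, n_init=10, random_state=0))
labels = pipe.fit_predict(df)
```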
Applications of K-means
- Data or image compression — With 1 byte, 256 colors can be addressed. For each pixel we have 3 bytes, one per RGB channel. If we want to decrease the number of colors, we can use K-means to identify which colors to keep, treating each pixel as a data point.
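A rough color-quantization sketch in this spirit (the file name is hypothetical and the image is assumed to be RGB):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("photo.jpg"))   # hypothetical file
pixels = img.reshape(-1, 3).astype(float)   # each pixel = one RGB data point

km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)  # keep only 16 colors
quantized = km.cluster_centers_[km.labels_]  # replace each pixel by its centroid color
compressed = quantized.reshape(img.shape).astype(np.uint8)
```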
Implementation — https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html
- Outlier detection —
Distance-based approach —
Find a threshold value for the data → if a point's distance is greater than the threshold, mark it as an outlier → remove the outliers.
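A sketch of the distance-based idea; the 95th-percentile threshold is an assumption, and X is the feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# distance of every point to its own cluster's centroid
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
threshold = np.percentile(dists, 95)   # assumed cut-off
outliers = X[dists > threshold]
```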
Cluster-based approach —
Find the smallest cluster → consider the points in the smallest cluster as outliers.
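And the cluster-based idea, reusing km and X from the snippet above:

```python
import numpy as np

counts = np.bincount(km.labels_)   # how many points each cluster holds
smallest = counts.argmin()         # cluster with the fewest members
outliers = X[km.labels_ == smallest]  # treat its points as outliers
```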