K-Means Clustering
Clustering or Cluster Analysis is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.
Types of Clustering:
- K-Means Clustering.
- Hierarchical Clustering.
In today’s blog we will be discussing K-Means Clustering.
K-Means Intuition: K-Means is an iterative algorithm that tries to partition the data into K distinct, non-overlapping clusters. Steps involved in K-Means Clustering:
- Choose the number K of clusters.
- Select K points at random as the initial centroids (not necessarily from the dataset).
- Assign each data point to the closest centroid, forming K clusters. (Choose an appropriate distance metric based on the business problem at hand.)
- Compute and place the new centroid of each cluster.
- Reassign each data point to its new closest centroid. If any reassignment took place, go back to the centroid-update step; otherwise STOP.
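The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation (for instance, it does not handle the rare case where a cluster ends up empty):

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Minimal K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K random points as the initial centroids
    # (here sampled from the dataset itself for simplicity)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (note: an empty cluster would need re-seeding in a robust version)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move (no reassignments)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

On two well-separated groups of points, the loop converges in a handful of iterations and labels each group as its own cluster.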
Examples:
- Market Segmentation
- Fraud detection
- To simply identify some clusters of your customers in your company or business.
Selecting K or The Appropriate Number of Clusters:
- Within-Cluster Sum of Squares (WCSS): WCSS is the sum of squared distances between each point and the centroid of its cluster. As the number of clusters increases, WCSS decreases, reaching 0 when every data point is its own cluster.
We choose the appropriate number of clusters at the point where adding another cluster stops reducing WCSS significantly. The popular method for finding this point is the Elbow Method.
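One way to apply the Elbow Method is to fit K-Means for a range of K values and compare the WCSS of each fit. The sketch below uses scikit-learn, whose `KMeans` exposes the fitted WCSS as the `inertia_` attribute; the three-blob synthetic data is an assumption made purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic data: three well-separated blobs of 50 points each
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (10, 0)]])

wcss = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the WCSS of this fit

# Elbow: WCSS drops steeply while k is below the true number of
# clusters (3 here), then flattens out — plot wcss against k to see it.
```

Plotting `wcss` against K and looking for the bend ("elbow") in the curve gives the suggested number of clusters.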
Advantage of K-Means algorithm:
- Simple and easy to implement.
- Scales well to large datasets.
- Convergence is guaranteed.
- Generalizes well: new data points can simply be assigned to the nearest learned centroid.
Disadvantage of K-Means algorithm:
- Random Initialization Trap: the choice of the initial random centroids matters, because a poor choice can lead the algorithm to converge to incorrect clusters. This can be overcome by using the K-Means++ initialization.
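The idea behind K-Means++ seeding is to spread the initial centroids out: the first is chosen uniformly at random, and each subsequent one is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. A minimal sketch of the seeding step (the rest of the algorithm is unchanged):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ seeding: pick initial centroids that are spread far apart."""
    rng = np.random.default_rng(seed)
    # First centroid: chosen uniformly at random from the data
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Next centroid: sampled with probability proportional to that distance,
        # so far-away points are much more likely to be picked
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```

Because a point identical to an already-chosen centroid has zero sampling probability, the seeds are guaranteed to be distinct, and distant regions of the data are likely to each receive a seed.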
- Choosing the optimal number of clusters K is not trivial.
- Struggles with clusters of varying sizes and densities.
- Sensitive to outliers, which pull centroids toward themselves.
- If the data contains a large number of features, it may suffer from the Curse of Dimensionality.
Implementation: Refer to the following link for Python and R implementations of K-Means Clustering: