Steps involved in K-means
K-means is one of the simplest clustering algorithms in use. It has some clear advantages, but it also comes with some disadvantages. Let's start with the steps involved in performing K-means and then move on to its advantages and disadvantages.
To perform K-means on a data set of n data points with the goal of forming 'k' clusters, you first need to define 'k' cluster centers, chosen either at random or from the data itself. There are different techniques for initializing these 'k' cluster centers: the K-means++ algorithm, which chooses initial centers that are far away from each other so that the clustering process converges faster and forms more stable clusters, or running multiple initializations and keeping the set of initial centers for which the sum of squared distances from each data point to its cluster center is minimal. By default, when performing K-means in Python (with scikit-learn), initialization happens through the K-means++ algorithm; a short sketch of these options appears after the list below. Once the cluster centers are initialized, two steps of K-means are performed repeatedly:
- Assignment
- Optimization
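A minimal sketch of the initialization options described above, assuming scikit-learn is installed; the toy data X, the choice of k = 3, and n_init = 10 are placeholder values for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data set: 200 points in 2 dimensions (placeholder values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# init="k-means++" is the scikit-learn default: initial centers are spread
# far apart, which usually speeds up convergence.
# n_init controls how many initializations are tried; the run with the
# lowest sum of squared distances (inertia) is kept.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # final cluster centers
print(kmeans.inertia_)           # sum of squared distances to closest center
```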
In the Assignment step, every data point is assigned to one of the 'k' cluster centers by computing the Euclidean distance between the data point and each of the cluster centers. The data point is assigned to the cluster center for which this distance is minimum. Since every data point has to be assigned to exactly one cluster center, K-means clustering is generally called Hard Clustering. The assignment rule is:
Zi = argminₖ ||Xi − μk||²
- Where Zi is the index of the cluster center to which data point Xi is assigned
- argminₖ returns the value of k for which the squared distance is smallest (the index of the nearest center), not the distance itself
- ||Xi − μk||² is the squared Euclidean distance between data point Xi and the k-th cluster center μk
This distance is computed between every data point and each of the 'k' cluster centers, and each data point is assigned to whichever cluster center is closest to it.
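As a rough sketch (not an exact implementation from the source), the assignment rule above can be written with NumPy broadcasting; assign_clusters is a hypothetical helper, and X and centers are placeholder arrays:

```python
import numpy as np

def assign_clusters(X, centers):
    """Assignment step: return, for each point, the index of its nearest center."""
    # Squared Euclidean distance between every point and every center:
    # result has shape (n_points, k).
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # argmin over the center axis gives Zi for every data point.
    return sq_dists.argmin(axis=1)
```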
Once every data point has been assigned to one of the 'k' clusters, we move on to the Optimization step. In the Optimization step, each cluster center is recalculated as the mean of the data points belonging to that cluster; a small sketch of this step follows the list below. The update rule for the Optimization step is:
μk = (1/nk) ∑_{i: Zi=k} Xi
- Where μk represents the new center of cluster k
- (1/nk) ∑_{i: Zi=k} Xi is the mean of the nk data points currently assigned to cluster k
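Continuing the sketch, the Optimization step can be written as a hypothetical update_centers helper that recomputes each center as the mean of its assigned points:

```python
def update_centers(X, labels, k):
    """Optimization step: recompute each center as the mean of its assigned points.

    Assumes every cluster has at least one assigned point.
    """
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])
```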
The Assignment and Optimization steps are performed repeatedly until the clustering converges, i.e. the cluster centers no longer change. In a good clustering the data points within a cluster are tight and cohesive, while the clusters themselves are separated from each other by a large distance.
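Putting the two hypothetical helpers above together, a bare-bones K-means loop with a convergence check might look like this (the simple random initialization here is a simplification; as noted earlier, K-means++ is the usual default):

```python
def kmeans(X, k, max_iter=100, seed=0):
    """Naive K-means: alternate assignment and optimization until centers stop moving."""
    rng = np.random.default_rng(seed)
    # Simple initialization: pick k distinct data points as the initial centers
    # (K-means++ would spread them out more deliberately).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        labels = assign_clusters(X, centers)          # Assignment step
        new_centers = update_centers(X, labels, k)    # Optimization step
        if np.allclose(new_centers, centers):         # converged: centers unchanged
            break
        centers = new_centers
    return assign_clusters(X, centers), centers
```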
Now let’s discuss some pros and cons of the K-means clustering algorithm. Pros:
- K-means is a simple clustering technique.
- It requires relatively little computational power.
Cons:
- When using K-means we need to pre-determine the optimal number of clusters, and that many cluster centers are initialized.
- Each initialization of the ‘k’ cluster centers can result in different final clusters. To tackle this problem we perform multiple initializations and keep the set of initial cluster centers for which the sum of squared error is minimized.
- To overcome the problem of pre-determining the number of clusters, we can move to another clustering algorithm called Hierarchical clustering, in which we do not need to pre-determine the number of clusters. Hierarchical clustering has its own limitations, though: it is expensive in terms of computational power, so it is usually not preferred for big data sets. If a suitable cloud platform is available, Hierarchical clustering can be performed even on big data sets.
- The K-means algorithm can be applied only to numerical data. For data sets with categorical features we proceed with K-modes, and for data containing both categorical and numerical features we use the K-prototype clustering technique; a rough sketch of both follows below.
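As a hedged illustration only: the snippet below assumes the third-party kmodes package is installed, and the toy data, cluster counts, and init choices ("Huang", "Cao") are placeholder assumptions, not taken from the source:

```python
import numpy as np
from kmodes.kmodes import KModes
from kmodes.kprototypes import KPrototypes

# Purely categorical toy data: use K-modes.
cat_data = np.array([["red", "small"], ["blue", "large"],
                     ["red", "large"], ["blue", "small"]])
km = KModes(n_clusters=2, init="Huang", n_init=5)
cat_labels = km.fit_predict(cat_data)

# Mixed numerical + categorical toy data: use K-prototypes,
# telling it which column indices are categorical.
mixed_data = np.array([[1.0, "red"], [5.0, "blue"],
                       [1.2, "red"], [4.8, "blue"]], dtype=object)
kp = KPrototypes(n_clusters=2, init="Cao", n_init=5)
mixed_labels = kp.fit_predict(mixed_data, categorical=[1])
```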
Generally, in industry, when performing clustering we start with Hierarchical clustering to determine the optimal number of clusters, and then run K-means with that value of ‘k’; a sketch of this workflow follows.
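A rough sketch of that workflow, assuming SciPy, Matplotlib, and scikit-learn are available; the toy data, the ward linkage method, and the final choice of k = 3 are assumptions made for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # placeholder data

# Step 1: hierarchical clustering; inspect the dendrogram to pick 'k'.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()

# Step 2: run K-means with the 'k' read off the dendrogram (say k = 3).
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
```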
To visualize the K-means clustering process interactively, there is a useful link: https://www.naftaliharris.com/blog/visualizing-k-means-clustering/