K-Means


K-Means is a clustering algorithm that aims to group similar entities into the same cluster; it works well with numerical data.

Steps

1. The number of clusters (k) is chosen.

2. The algorithm randomly initializes the cluster centers, i.e. it starts with a first group of randomly selected centroids by picking a few points and treating them as the centroids of the clusters.

3. For each data point, compute the Euclidean distance to all the centroids and assign the point to the cluster whose centroid is closest. A centroid is the imaginary or real location representing the center of a cluster.

4. This is the move-centroid step: K-Means moves the centroid of each cluster by taking the average of all the data points assigned to it. In other words, the algorithm recalculates each cluster's centroid as the mean of its points.

It halts creating and optimizing clusters when either:

  • The centroids have stabilized, i.e. there is no change in their values because the clustering has converged, or
  • The defined number of iterations has been reached.

A minimal from-scratch sketch of these steps follows.
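
This is a minimal illustration in NumPy, not a production implementation; it assumes a 2-D numeric array X and uses the two stopping rules above (empty clusters are not handled).

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):                       # stop after a fixed number of iterations
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stop once the centroids have stabilized
            break
        centroids = new_centroids
    return labels, centroids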

Methods for finding the optimal number of clusters:

Elbow Method:

It is the most popular method for determining the optimal number of clusters. The method is based on calculating the Within-Cluster-Sum of Squared Errors (WSS) for different numbers of clusters (k) and selecting the k at which the change in WSS first starts to diminish.

The idea behind the elbow method is that the explained variation changes rapidly for a small number of clusters and then it slows down leading to an elbow formation in the curve. The elbow point is the number of clusters we can use for our clustering algorithm. Further details on this method can be found in this paper by Chunhui Yuan and Haitao Yang.

We will be using the Yellowbrick library, which can implement the elbow method with a few lines of code. It is a wrapper around Scikit-Learn and has some cool machine learning visualizations!

pip install yellowbrick  # if not installed

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()
# k is the range of cluster counts to try
visualizer = KElbowVisualizer(model, k=(2,30), timings=True)
visualizer.fit(cluster_df)        # Fit data to visualizer
visualizer.show()                 # Finalize and render figure

The KElbowVisualizer fits the KMeans model for a range of cluster counts between 2 and 30. The elbow point is achieved with 8 clusters, which is highlighted by the function itself. The function also informs us, via the green line, how much time was needed to fit the model for each number of clusters.

Silhouette Coefficient:

The Silhouette Coefficient for a point i is defined as follows:

S(i) = (b(i) − a(i)) / max(a(i), b(i))

where b(i) is the smallest average distance of point i to all points in any other cluster and a(i) is the average distance of i from all other points in its own cluster. For example, if we have only three clusters A, B and C and i belongs to cluster C, then b(i) is calculated by measuring the average distance of i from every point in cluster A and the average distance of i from every point in cluster B, and taking the smaller of the two values. The Silhouette Coefficient for the dataset is the average of the Silhouette Coefficients of the individual points.
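
To make the per-point definition concrete, scikit-learn exposes these quantities directly; the sketch below is illustrative and assumes the same cluster_df used elsewhere in this post, with k=8 picked only as an example.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

labels = KMeans(n_clusters=8, random_state=0).fit_predict(cluster_df)
per_point_s = silhouette_samples(cluster_df, labels)   # S(i) for every individual point
overall_s = silhouette_score(cluster_df, labels)       # average over the whole dataset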

The Silhouette Coefficient tells us whether individual points are correctly assigned to their clusters. We can use the following rules of thumb when working with the Silhouette Coefficient:

  1. If S(i) is close to 0, the point lies between two clusters
  2. If S(i) is closer to -1, then we would be better off assigning the point to another cluster
  3. If S(i) is close to 1, the point belongs to the ‘correct’ cluster

The KElbowVisualizer used above can also score a range of k by the average Silhouette Coefficient by switching its metric:

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()
# k is the range of cluster counts to try
visualizer = KElbowVisualizer(model, k=(2,30), metric='silhouette', timings=True)
visualizer.fit(cluster_df)        # Fit the data to the visualizer
visualizer.show()                 # Finalize and render the figure

Gap Statistic:

The gap statistic was developed by Stanford researchers Tibshirani, Walther and Hastie in their 2001 paper. The idea behind their approach was to find a way to compare cluster compactness with a null reference distribution of the data, i.e. a distribution with no obvious clustering. Their estimate for the optimal number of clusters is the value for which cluster compactness on the original data falls the farthest below this reference curve. This information is contained in the following formula for the gap statistic:

Gap(k) = E*[log(Wk)] − log(Wk)

where E* denotes the expectation under the reference distribution (in practice, the average over the generated reference datasets) and Wk is a measure of the compactness of our clustering, based on the Within-Cluster-Sum of Squared Errors (WSS).

The Within-Cluster-Sum of Squared Errors is given by the inertia_ attribute of a fitted KMeans model and is computed as follows (a short sketch is shown after the list):

  • Take the square of the distance of each point from the center of its cluster (the Squared Errors)
  • The WSS score is the sum of these Squared Errors over all the points
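
As a sketch of this calculation (assuming the same cluster_df used above, with an illustrative range of k), inertia_ can be collected for a series of k values to reproduce the elbow curve by hand:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(2, 30)
wss = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=0).fit(cluster_df)
    wss.append(km.inertia_)   # sum of squared distances to the closest centroid
plt.plot(ks, wss, marker='o')
plt.xlabel('K')
plt.ylabel('WSS (inertia_)')
plt.show()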

Calculating the gap statistic in Python for k-means clustering involves the following steps:

  • Cluster the observed data for various numbers of clusters and compute the compactness of our clustering
  • Generate reference datasets and cluster each of them over the same range of numbers of clusters. The reference datasets are created from a “continuous uniform” distribution using the random_sample function.
  • Calculate the average compactness of our clustering on the reference datasets
  • Calculate the gap statistic as the difference in (log) compactness between the clustering on the reference data and on the original data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def optimalK(data, nrefs=3, maxClusters=15):
    """Calculates the optimal K for KMeans using the Gap Statistic.

    Params:
        data: ndarray of shape (n_samples, n_features)
        nrefs: number of sample reference datasets to create
        maxClusters: maximum number of clusters to test for
    Returns: (optimalK, resultsdf)
    """
    gaps = np.zeros((len(range(1, maxClusters)),))
    resultsdf = pd.DataFrame({'clusterCount': [], 'gap': []})
    for gap_index, k in enumerate(range(1, maxClusters)):
        # Holder for reference dispersion results
        refDisps = np.zeros(nrefs)
        # For nrefs references, generate a random sample, run kmeans on it
        # and keep the resulting dispersion of each loop
        for i in range(nrefs):
            # Create a new random reference set
            randomReference = np.random.random_sample(size=data.shape)
            # Fit to it
            km = KMeans(n_clusters=k)
            km.fit(randomReference)
            refDisps[i] = km.inertia_
        # Fit the clustering to the original data and record its dispersion
        km = KMeans(n_clusters=k)
        km.fit(data)
        origDisp = km.inertia_
        # Calculate the gap statistic
        gap = np.log(np.mean(refDisps)) - np.log(origDisp)
        # Assign this loop's gap statistic to gaps
        gaps[gap_index] = gap
        resultsdf = pd.concat(
            [resultsdf, pd.DataFrame({'clusterCount': [k], 'gap': [gap]})],
            ignore_index=True)
    return (gaps.argmax() + 1, resultsdf)

score_g, df = optimalK(cluster_df, nrefs=5, maxClusters=30)
plt.plot(df['clusterCount'], df['gap'], linestyle='--', marker='o', color='b')
plt.xlabel('K')
plt.ylabel('Gap Statistic')
plt.title('Gap Statistic vs. K')
plt.show()

Dendrogram:

This technique is specific to the agglomerative hierarchical method of clustering. Agglomerative hierarchical clustering starts by considering each point as a separate cluster and then joins points into clusters in a hierarchical fashion based on their distances. In a separate blog, we will focus on the details of this method. To get the optimal number of clusters for hierarchical clustering, we make use of a dendrogram, which is a tree-like chart that shows the sequence of merges or splits of clusters.

If two clusters are merged, the dendrogram joins them, and the height of the join is the distance between those clusters. We will plot the graph using the dendrogram function from the scipy library.

import scipy.cluster.hierarchy as shc
from matplotlib import pyplot

pyplot.figure(figsize=(10, 7))
pyplot.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(cluster_df, method='ward'))

We can choose the optimal number of clusters based on the hierarchical structure of the dendrogram. As highlighted by the other cluster validation metrics, 4 clusters can be considered for the agglomerative hierarchical method as well.
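
As an illustrative follow-up (the cut at 4 clusters and the variable names are assumptions for this sketch), the corresponding labels can be obtained with scikit-learn's AgglomerativeClustering using the same Ward linkage:

from sklearn.cluster import AgglomerativeClustering

# n_clusters=4 follows the cut suggested by the dendrogram above
agg = AgglomerativeClustering(n_clusters=4, linkage='ward')
hier_labels = agg.fit_predict(cluster_df)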

Bayesian information criterion:

The Bayesian information criterion (BIC) score is a method for scoring a model that uses the maximum likelihood estimation framework. The BIC statistic is calculated as follows:

BIC = k*ln(n) − 2*ln(L)

where L is the maximized value of the likelihood function of the model, k is the number of parameters, and n is the number of records.

The lower the BIC score, the better the model. We can use the BIC score with the Gaussian Mixture Modelling approach to clustering. We will discuss the details of this model in a separate blog, but the key thing to note here is that we need to select both the number of clusters and the type of covariance. We try out various combinations of these parameters and select the model with the lowest BIC score.

from sklearn.mixture import GaussianMixture

n_components = range(1, 30)
covariance_type = ['spherical', 'tied', 'diag', 'full']
score = []
for cov in covariance_type:
    for n_comp in n_components:
        gmm = GaussianMixture(n_components=n_comp, covariance_type=cov)
        gmm.fit(cluster_df)
        score.append((cov, n_comp, gmm.bic(cluster_df)))
score
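
As a small follow-up to the loop above, the lowest-BIC combination can then be picked directly from the score list (the variable names are the ones defined above):

# Each entry in score is (covariance_type, n_components, BIC)
best_cov, best_n, best_bic = min(score, key=lambda s: s[2])
print(best_cov, best_n, best_bic)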
