K-means Clustering in Machine Learning

Mukesh Chaudhary
May 29, 2020


Some examples with Python code

There are three types of machine learning: supervised, unsupervised, and reinforcement learning. Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean-based distance or correlation-based distance. Here, I am trying to explain one of the most popular unsupervised machine learning algorithms, k-means clustering, in a simple way with some Python code examples. Let’s start.
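For concreteness, here is a small sketch of the two similarity measures just mentioned (the feature vectors are made up purely for illustration):

import numpy as np

a = np.array([5.1, 3.5, 1.4, 0.2])  # hypothetical iris-style feature vector
b = np.array([6.7, 3.0, 5.2, 2.3])  # another hypothetical flower

# Euclidean-based distance: straight-line distance between the two points
euclidean_dist = np.linalg.norm(a - b)

# correlation-based distance: 1 minus the Pearson correlation of the vectors
correlation_dist = 1 - np.corrcoef(a, b)[0, 1]

print(euclidean_dist, correlation_dist)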

K-means Clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. In unsupervised machine learning, we just feed unlabeled data to the algorithm so that it can uncover the structure of the data, such as groups or clusters. Let’s try to understand it in mathematical terms and with a flow chart.

K-means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of greatest possible distinction. The best number of clusters k, the one leading to the greatest separation (distance), is not known a priori and must be computed from the data. The objective of k-means clustering is to minimize the objective function, i.e., the sum of squared error (SSE) function:
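SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^2

where C_i is the set of data points assigned to cluster i and m_i is the centroid (mean) of cluster i.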

The algorithm is composed of the following steps; a minimal code sketch follows the list.

  1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
  2. Assign each object to the group that has the closest centroid.
  3. When all objects have been assigned, recalculate the positions of the K centroids.
  4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
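To make these steps concrete, here is a minimal NumPy sketch of the algorithm (the function name kmeans, the random-sample initialization, and the iteration cap are my own choices, and the sketch assumes no cluster ever ends up empty; for real work, use scikit-learn’s KMeans as later in this post):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: place K points as initial centroids (here: k random samples)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each object to the group with the closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 3: recalculate the K centroids as the means of their members
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels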

Example:
Suppose that we have n sample feature vectors x1, x2, …, xn all from the same class, and we know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. If the clusters are well separated, we can use a minimum-distance classifier to separate them. That is, we can say that x is in cluster i if ||x − mi|| is the minimum of all the k distances. This suggests the following procedure for finding the k means:

  • Make initial guesses for the means m1, m2, …, mk
  • Until there are no changes in any mean:
      • Use the estimated means to classify the samples into clusters
      • For i from 1 to k:
          • Replace mi with the mean of all of the samples for cluster i
      • end_for
  • end_until

Here is an example showing how the means m1 and m2 move into the centers of two clusters.

Example with Python code

Let’s walk through an example with Python code. Here, I’ll use the scikit-learn library for the k-means clustering algorithm and its datasets module for the iris dataset. I am going to cluster petal length and width using the k-means clustering algorithm; let’s see what happens.

# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# load the iris dataset
data = datasets.load_iris()
dir(data)

# load into a DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
print(df.shape)
df.head()

# keep only the petal features
df1 = df.drop(['sepal length (cm)', 'sepal width (cm)'], axis='columns')
df1.head()

# scatter plot for petal length and width
plt.scatter(df1['petal length (cm)'], df1['petal width (cm)'])

# initialize the k-means model and assign each flower to a cluster
km = KMeans(n_clusters=3)
y_predict = km.fit_predict(df1[['petal length (cm)', 'petal width (cm)']])
df2 = df1.copy()
df2['cluster'] = y_predict

# cluster centroids
km.cluster_centers_

After fitting the k-means clustering algorithm, we can plot the result.

def plot_kmean(df_frame):
    # split the frame by the cluster label assigned above
    df3 = df_frame[df_frame.cluster == 0]
    df4 = df_frame[df_frame.cluster == 1]
    df5 = df_frame[df_frame.cluster == 2]
    # one color per cluster
    plt.scatter(df3['petal length (cm)'], df3['petal width (cm)'], color='green', label='cluster=0')
    plt.scatter(df4['petal length (cm)'], df4['petal width (cm)'], color='red', label='cluster=1')
    plt.scatter(df5['petal length (cm)'], df5['petal width (cm)'], color='black', label='cluster=2')
    # mark the centroids found by k-means
    plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
                color='purple', marker='*', label='Centroid', s=400)
    plt.xlabel('petal length (cm)')
    plt.ylabel('petal width (cm)')
    plt.legend(loc='best')


plot_kmean(df2)

Output:

In the above code and output, we chose 3 for the n_clusters parameter of the KMeans class, and the plot shows how the data were clustered into three groups (clusters).
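As a quick sanity check (possible here only because the iris data happens to ship with true species labels, which k-means never saw), we can compare the cluster assignments with the species using, for example, the adjusted Rand index from scikit-learn:

from sklearn.metrics import adjusted_rand_score

# 1.0 means the clustering matches the species exactly; ~0 means chance-level agreement
print(adjusted_rand_score(data.target, y_predict))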

Evaluation Methods

In supervised learning, we can easily evaluate a model’s performance with metrics such as a confusion matrix or mean squared error, but clustering analysis has no equally solid evaluation metric for comparing the outcomes of different clustering algorithms. Moreover, k-means requires k as an input and doesn’t learn it from the data, and there is no single right answer for the number of clusters in any given problem. Sometimes domain knowledge and data analysis experience may help, but usually that is not the case.

Generally, there are two methods that may give us some intuition about k:

  • Elbow method
  • Silhouette analysis

However, in this post I’ll cover only the elbow method; I will discuss the other metric in another blog post.
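(For reference until then: scikit-learn already exposes silhouette analysis as a one-liner. A sketch, reusing the df1 and y_predict variables from the code above:)

from sklearn.metrics import silhouette_score

# mean silhouette coefficient over all samples; values closer to 1 indicate better-separated clusters
print(silhouette_score(df1[['petal length (cm)', 'petal width (cm)']], y_predict))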

Elbow Method

The elbow method gives us an idea of what a good number of clusters k would be, based on the sum of squared errors (SSE) between data points and their assigned clusters’ centroids. Let’s look at a picture (downloaded from the internet) that clearly shows the elbow intuition.

Let’s implement the elbow method on top of the Python code above.

# elbow technique
# note: df_scaler was not defined earlier in the post; a MinMaxScaler step is assumed here
scaler = MinMaxScaler()
df_scaler = df1.copy()
df_scaler[['petal length (cm)', 'petal width (cm)']] = scaler.fit_transform(
    df_scaler[['petal length (cm)', 'petal width (cm)']])

k_rng = range(1, 10)
sse_scaler = []
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df_scaler[['petal length (cm)', 'petal width (cm)']])
    sse_scaler.append(km.inertia_)

# plot SSE against k and look for the "elbow"
plt.plot(k_rng, sse_scaler)
plt.xlabel("k")
plt.ylabel("Sum of squared error")

Output:
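Reading the elbow off the plot is a judgment call. If you prefer a programmatic pick, the third-party kneed package is one option (an assumption on my part: it is not part of scikit-learn and must be installed separately, e.g. with pip install kneed):

from kneed import KneeLocator

# SSE decreases as k grows and the curve is convex, hence these settings
kl = KneeLocator(list(k_rng), sse_scaler, curve="convex", direction="decreasing")
print(kl.elbow)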

Drawbacks

The k-means algorithm is good at capturing the structure of the data if the clusters have a spherical-like shape. It always tries to construct a nice spherical shape around each centroid, which means that the minute the clusters have complicated geometric shapes, k-means does a poor job of clustering the data. Another drawback is that we have to specify the number of clusters k in advance; by default, the KMeans class of sklearn uses k = 8. In addition, the k-means algorithm can be viewed as a greedy algorithm for partitioning the n samples into k clusters so as to minimize the sum of the squared distances to the cluster centers. Let’s look at some pictures that show how poor performance arises on such datasets.

Picture 1:

Picture 2:
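The original pictures are not reproduced here, but a similar example is easy to generate. The sketch below (the two-moons dataset and its noise setting are my choices for illustration) produces clusters that are clearly separated to the eye yet not spherical, so k-means splits them incorrectly:

from sklearn.datasets import make_moons

# two interleaving half-circles: well separated, but not spherical
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels_moons = KMeans(n_clusters=2).fit_predict(X_moons)

# k-means cuts straight across both moons instead of following their shape
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels_moons)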

Above, we can clearly see that the k-means clustering algorithm cannot handle data like these. In this post, I have only illustrated the drawbacks with a few pictures; I am not going into detail about how to handle these problems. That will be the subject of the next blog post.

Conclusion:

K-means clustering is one of the most popular clustering algorithms and is used to get an intuition about the structure of the data. The goal of k-means is to group data points into distinct, non-overlapping subgroups. It does a very good job when the clusters have roughly spherical shapes, but it performs poorly on data with other shapes. Below are some key points to keep in mind when using the k-means clustering algorithm:

  • Scale/standardize the data before applying the k-means algorithm.
  • Use the elbow method when selecting the number of clusters.
  • K-means gives more weight to bigger clusters.
  • K-means assumes spherical cluster shapes (with radius equal to the distance between the centroid and the furthest data point) and doesn’t work well when clusters have different shapes, such as elliptical clusters.

References:

  • https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
  • https://scikit-learn.org/stable/datasets/index.html#loading-other-datasets
  • https://en.wikipedia.org/wiki/K-means_clustering
