Choosing the Best K Value for K-means Clustering
There are many machine learning algorithms used for different applications. Some of them are "supervised" and some are "unsupervised". Today, we will be talking about an unsupervised algorithm called K-means clustering.
There are probably lots of posts out there about how K-means clustering works or how it can be implemented in Python. For this reason, we will focus solely on choosing the best K value in order to obtain better results. What we mean by "better results" is, of course, grouping the data points as effectively as possible.
It might be a smart idea to sweep through the K values within a range and cluster the data points into K different groups each time. After each clustering is completed, we can check some metrics in order to decide whether we should choose the current K or continue evaluating.
One of these metrics is the total distance (it is called "inertia" in the sklearn library). Inertia is the sum of squared distances from each data point to its closest cluster center. If the total distance is high, it means that the points are far from their centers and might be less similar to each other. In this case we can continue evaluating higher K values to see if we can reduce the total distance. However, I should emphasize a really important point here: it is not always the smartest idea to keep decreasing the distance. Let's assume that we have 100 data points. If we choose K to be 100, we will end up with an inertia of exactly 0, because every point becomes its own cluster center. But, obviously, that is not what we want. We want a small number of "good" clusters which contain sufficient information about the data points and are not dominated by noise or outliers.
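To make that definition concrete, here is a minimal sketch (using a hypothetical set of 2-D toy points, not the Iris data) showing that the inertia_ attribute reported by sklearn's KMeans is exactly the sum of squared distances from each point to its assigned cluster center:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two loose groups of 2-D points
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(points)

# Recompute inertia by hand: squared distance of each point to its assigned center
centers = model.cluster_centers_[model.labels_]
manual_inertia = np.sum((points - centers) ** 2)

print(model.inertia_, manual_inertia)  # the two values should match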
The point where we can comfortably stop is called the "elbow point". When you plot total distance (inertia) against the K value, you will observe that after a certain point the total distance starts to change insignificantly compared to the previous changes. At this point we can conclude that the data points are clustered sufficiently, and further splitting will not contribute much more information to our system. Thus, we can stop here and proceed with the K value corresponding to the elbow point. You can eyeball the elbow from the plot, or estimate it numerically as in the sketch below.
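One hedged way to turn "changes insignificantly" into a rule is to look at the relative drop in inertia between consecutive K values and stop once that drop falls below some threshold. The sketch below assumes you have already collected the inertia values for K = 1..n (as we will do in the code further down); the 10% threshold is an arbitrary assumption for illustration, not a standard rule.

# Assumes `inertias` is a list of inertia values for K = 1, 2, ..., n
def find_elbow(inertias, threshold=0.10):
    # Relative improvement when going from K to K+1
    drops = [(inertias[i] - inertias[i + 1]) / inertias[i]
             for i in range(len(inertias) - 1)]
    for k, drop in enumerate(drops, start=1):
        if drop < threshold:
            return k          # last K that still gave a meaningful improvement
    return len(inertias)      # no clear elbow within the tested range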
Now, let's get our hands dirty and write some Python code to better understand how we can find the best K.
We will use a pretty famous dataset called Iris. The Iris dataset contains three different types of flowers, and each sample has 4 features. We want our K-means algorithm to find the best grouping structure for these points.
# IMPORT LIBRARIES
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
import matplotlib.pyplot as plt

# LOAD IRIS DATASET
iris = datasets.load_iris()
x = iris.data
target = iris.target

# Create a function that computes the inertia for each K
# We will sweep through 1 to n to find the optimal cluster number
def cluster_variance(n):
    variances = []
    K = [i for i in range(1, n + 1)]
    for k in K:
        # Fit K-means with k clusters and record its inertia (total distance)
        model = KMeans(n_clusters=k, random_state=82).fit(x)
        variances.append(model.inertia_)
    return variances, K, n

variances, K, n = cluster_variance(10)

# Plot inertia against K to look for the elbow point
plt.plot(K, variances)
plt.ylabel("Inertia ( Total Distance )")
plt.xlabel("K Value")
plt.xticks([i for i in range(1, n + 1)])
plt.show()
As you can see in the figure above, after K=3 the reduction in total distance slows down and the slope of the line decreases significantly. From this figure, we can infer that going beyond this K value (3) will not contribute much to our clustering and will only make our clusters more complicated. Finally, note that this result is quite consistent with our expectations: since the Iris dataset contains three different types of flowers, it is not surprising that 3 clusters turn out to be the most effective grouping for this case.
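As an optional sanity check (not part of the elbow method itself), you can compare the K=3 cluster assignments with the true species labels that we stored in target earlier. The sketch below assumes pandas is available; the clusters will not match the species perfectly, since K-means never sees the labels, but the agreement is usually strong.

import pandas as pd

# Fit the final model with the K suggested by the elbow (K = 3 here)
best_model = KMeans(n_clusters=3, random_state=82).fit(x)

# Cross-tabulate cluster assignments against the true iris species
# (cluster indices are arbitrary, so read the table row by row)
print(pd.crosstab(best_model.labels_, target,
                  rownames=["cluster"], colnames=["species"]))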