Data Science (Python) :: K-Means Clustering

The intention of this post is to give a quick refresher on "K-Means Clustering" (it is assumed that you are already familiar with the topic). You can treat it as a set of FAQs or interview questions as well.

How does the K-Means algorithm work?
K-Means is used to identify clusters in a given dataset. For the sake of this walkthrough, assume the dataset has 2 features (plotted on the X & Y axes), so each observation is a point on the graph.
Step 1 :- Decide on the number of clusters you want. For this example, let's take K = 2.
Step 2 :- Based on the chosen K, pick K initial centre points. In this case, pick any 2 points on the graph and mark them as centre points (C1, C2).
Step 3 :- Now assign each data point to the 1st or 2nd cluster based on the closest centre. For example, the point (2,5) may be closer to centre C1 than to centre C2; in that case, (2,5) is assigned to the cluster whose centre point is C1.
Step 4 :- After assigning all data points to C1 or C2, some points belong to C1 and the rest to C2. Recompute each centre as the mean of the points assigned to it, so C1 moves to a new position, and likewise C2.
Step 5 :- Repeat Step 3 and Step 4 until a point is reached where C1 & C2 don't move any further!
That's K-Means for you!
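The steps above can be sketched in a few lines of NumPy (a minimal illustration, not production code; the function name and the empty-cluster guard are my own choices):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centre points
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to the cluster of its closest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centre to the mean of the points assigned to it
        new_centres = centres.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster goes empty
                new_centres[j] = members.mean(axis=0)
        # Step 5: stop once the centres no longer move
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels
```

On two well-separated blobs with k = 2, this loop recovers one centre per blob within a few iterations.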

******************************************

What is the random initialization trap?
It is quite possible to choose the initial K centre points in such a way that the algorithm converges to a poor, misleading clustering (a bad local optimum). This can be mitigated by using the k-means++ initialization.
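As a sketch, k-means++ seeding avoids the trap by spreading the initial centres out: each new centre is drawn with probability proportional to its squared distance from the nearest centre already chosen (the function name below is my own; scikit-learn's `KMeans(init='k-means++')` does this for you):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centre: a uniformly random data point
    centres = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen centre
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        # Points far from all current centres are more likely to be picked
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centres)
```

The selected centres are always actual data points, which is why distant outliers can still be picked, just with lower probability than a whole distant cluster.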

******************************************

What is WCSS?
Within-Cluster Sum of Squares. For example, suppose there are 3 clusters. That means we have 3 centre points (C1, C2, C3), and each data point falls into the zone of either C1, C2, or C3.
First we calculate the sum of squared distances of each data point in cluster 1 from its centre point C1. Let's say there are 3 points in cluster 1 (c1p1, c1p2, c1p3):
[dist(C1, c1p1)]² + [dist(C1, c1p2)]² + [dist(C1, c1p3)]². This is cluster 1's sum of squares.
Similarly, we do the same for C2 & C3. Adding all 3 clusters' sums of squares gives the WCSS.
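The same calculation takes a few lines of NumPy (a hypothetical helper; after fitting, scikit-learn exposes this value as the `inertia_` attribute):

```python
import numpy as np

def wcss(X, labels, centres):
    # For each cluster j: sum of squared distances from its points to centre j,
    # then add the per-cluster sums together
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centres))

# Four points in two clusters, each point at distance 1 from its centre:
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centres = np.array([[0.0, 1.0], [10.0, 1.0]])
# wcss(X, labels, centres) → 1 + 1 + 1 + 1 = 4.0
```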

******************************************

How to choose the right number of clusters? The Elbow method?
WCSS always decreases as the number of clusters increases. However, the rate of that decrease itself slows down as we add clusters, and that is our hint. We stop at the number of clusters beyond which the drop in WCSS is no longer substantial. On a plot of WCSS versus the number of clusters, this point looks like an elbow, hence the name Elbow method.

******************************************

Sample code for implementing the K-Means clustering algorithm?
# Using the elbow method to find the optimal number of clusters
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is assumed to be a 2-column NumPy array of features
# (e.g. annual income and spending score of each customer)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('The number of clusters')
plt.ylabel('WCSS')
plt.show()
# Applying K-Means to the dataset (based on the above elbow plot of WCSS)
# The elbow appears at 5 clusters, so we take K = 5 for our K-Means model
kmeans = KMeans(n_clusters = 5, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_kmeans = kmeans.fit_predict(X)
# Visualising the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of clients')
plt.xlabel('Annual income (k$)')
plt.ylabel('Spending score (1–100)')
plt.legend()
plt.show()

*****************************************

Next :: Data Science (Python) :: Hierarchical Clustering

Prev :: Data Science :: Performance of Each Classification Model

If you liked this article, please hit the ❤ icon below