Good Old K-Means With a Twist

Elli Tzini
Published in The Startup · 6 min read · Jan 27, 2021

Do we really know everything about K-Means?

Photo by Daniel Fazio on Unsplash

Clustering is not new to the Machine Learning neighborhood, but it will definitely never get old. Grouping data points with similar traits into clusters without knowing their corresponding labels, or any other prior information for that matter, sounds pretty cool to me. Let's dive deeper into one of the most fundamental and most used algorithms that falls under the umbrella of clustering: K-means.

Table of Contents

K-means
MiniBatch K-means
Fuzzy c-means
K-medoids

K-means

The algorithm aims to partition the data points into k sets (the clusters) by minimizing the within-cluster variance, which simply means minimizing the squared distance of every point from the centroid (center) of its cluster.
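In its standard formulation, K-means looks for the partition S = {S_1, …, S_k} of the data that solves

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2, \qquad \mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x

where \mu_i is simply the mean of the points assigned to cluster S_i.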

Oftentimes, it is the case that we have additional information about our data points. Perhaps each sample belongs to a certain group, e.g. male/female, long-term/new user, country, etc. We could therefore assign a weight to each sample given the group it belongs to. For this purpose, we use Weighted K-means.
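Under the hood, the only thing that changes is the centroid update: every point now contributes in proportion to its weight, so the centroid of a cluster becomes a weighted mean, which is what the sample_weight argument of scikit-learn's KMeans expresses:

\mu_j = \frac{\sum_{x_i \in S_j} w_i \, x_i}{\sum_{x_i \in S_j} w_i}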

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate data
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X, y_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)
# K-means
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
centers = kmeans.cluster_centers_
# Weighted K-means
weights = list(map(lambda x: x*10, y_true))
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(X, sample_weight=weights)
y_kmeans_w = kmeans.predict(X, sample_weight=weights)
centers_w = kmeans.cluster_centers_
# Plotting
fig, ax = plt.subplots(1, 3, figsize=(20,7))
colors = ['#d35400', '#34495e', '#2980b9']
versions = ['Original Data', 'K-means', 'Weighted K-means']
y = [y_true, y_kmeans, y_kmeans_w]
centers_ = [centers, centers_w]
for i in range(3):
    ax[i].scatter(X[:, 0], X[:, 1], s=50, c=list(map(lambda x: colors[x], y[i])), alpha=0.5)
    if i > 0:  # the original-data panel has no fitted centroids to draw
        ax[i].scatter(centers_[i-1][:, 0], centers_[i-1][:, 1], marker='+', c='black', s=200)
    ax[i].set_title(versions[i])
    ax[i].set_xticks([])
    ax[i].set_yticks([])
plt.show()

For the purpose of this section, we randomly generated a dataset of 3 clusters with a total of 3000 data points. The weights are assigned as w_i = true_cluster_index * 10, so the points of cluster 0 receive weight 0 and effectively stop pulling any centroid towards them. Evidently, the clusters and their corresponding centroids are heavily affected by the resulting weight vector. One should be very cautious about when and how to use Weighted K-means, as it can have a big impact on the final cluster centers.

MiniBatch K-means

K-means is very powerful, but sometimes it might be impossible to fit your dataset due to its size. In such cases, you may use MiniBatch K-means, which processes small random mini-batches and moves the centroids only slightly at each iteration. This results in faster convergence and, most importantly, it allows clustering of huge datasets that do not fit in memory.

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
import time
# Generate data
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X, y_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)
# K-means
kmeans = KMeans(n_clusters=n_clusters)
t0 = time.time()
kmeans.fit(X)
t_batch = time.time() - t0
y_kmeans = kmeans.predict(X)
centers = kmeans.cluster_centers_
# MiniBatch K-means
mbk = MiniBatchKMeans(n_clusters=n_clusters, batch_size=45,
                      max_no_improvement=10, verbose=0)
t0 = time.time()
mbk.fit(X)
t_mini_batch = time.time() - t0
y_kmeans_batch = mbk.predict(X)
centers_b = mbk.cluster_centers_
# Plotting
fig, ax = plt.subplots(1, 2, figsize=(15,7))
colors = ['#d35400', '#34495e', '#2980b9']
versions = ['K-means', 'MiniBatch K-means']
y = [y_kmeans, y_kmeans_batch]
t = [t_batch, t_mini_batch]
centers_ = [centers, centers_b]
for i in range(2):
    ax[i].scatter(X[:, 0], X[:, 1], s=50, c=list(map(lambda x: colors[x], y[i])), alpha=0.5)
    ax[i].scatter(centers_[i][:, 0], centers_[i][:, 1], marker='+', c='black', s=200)
    ax[i].set_title(versions[i] + f' | train time: {np.round(t[i], 2)}s')
    ax[i].set_xticks([])
    ax[i].set_yticks([])
plt.show()

Fuzzy c-means

K-means is originally a hard clustering algorithm (i.e. clusters do not overlap and each sample belongs to exactly one cluster). What if samples can belong to multiple clusters? Then we need algorithms that perform soft clustering (i.e. each sample has a degree of membership in, or probability of belonging to, each cluster).

The process is the same as in K-means, but the centroid of each cluster is calculated as the mean of all points, weighted by their degree of membership in that cluster.
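Concretely, with membership degrees u_ij and a fuzzifier m > 1 (m = 2 in the script below), the centroid of cluster j is updated as

c_j = \frac{\sum_{i} u_{ij}^m \, x_i}{\sum_{i} u_{ij}^m}

so every point pulls every centroid towards it, but only as strongly as its membership allows.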

import skfuzzy as fuzz
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
# Generate data
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X, y_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)
# Fuzzy c-means
cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(X.T, n_clusters, 2, error=0.005, maxiter=1000)
# Plotting
colors = ['#d35400', '#34495e', '#2980b9']
fig, ax = plt.subplots(1, 2, figsize=(15,7))
ax[0].scatter(X[:, 0], X[:, 1], s=50, c=list(map(lambda x: colors[x], y_true)), alpha=0.5)
ax[0].set_title('Original Data')
ax[0].set_xticks([])
ax[0].set_yticks([])
cluster_membership = np.argmax(u, axis=0)
for j in range(n_clusters):
    cX = X[cluster_membership == j]
    ax[1].scatter(cX[:, 0], cX[:, 1], s=50, c=colors[j], alpha=0.5)

for pt in cntr:
    ax[1].scatter(pt[0], pt[1], marker='+', c='black', s=200)
ax[1].set_title(f'Fuzzy c-means | FPC {np.round(fpc, 2)}')
ax[1].set_xticks([])
ax[1].set_yticks([])
plt.show()

FPC (fuzzy partition coefficient) is a performance metric that ranges from 0 to 1, with 1 being the best. In our case, we achieved an FPC of 0.68. In this problem each point is assigned to a single cluster, hence the np.argmax(u, axis=0); however, one can further investigate the membership probabilities for each cluster. For example, we could define a threshold and, after thresholding, find which samples belong to multiple clusters and which to a single one.
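As a minimal sketch of that idea (the 0.4 threshold is an arbitrary value chosen for illustration; u is the membership matrix returned by fuzz.cluster.cmeans above, with shape (n_clusters, n_samples)):

threshold = 0.4  # arbitrary cut-off, tune it for your application
strong = u > threshold                       # boolean mask of "strong" memberships
n_memberships = strong.sum(axis=0)           # number of clusters each sample strongly belongs to
ambiguous = np.where(n_memberships > 1)[0]   # samples shared by more than one cluster
single = np.where(n_memberships == 1)[0]     # samples confidently assigned to a single cluster
print(f'{len(ambiguous)} ambiguous samples, {len(single)} single-cluster samples')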

K-medoids

Say hello to K-means' cousin, K-medoids. You have probably already guessed that there are not too many differences between the two algorithms, and you are right. The main differences are the following:

  1. The centers (medoids) are now actual data points from our dataset — something which is not a necessity in K-means.
    > greater interpretability of the cluster centers than in K-means
  2. Instead of the squared Euclidean distance to the mean, K-medoids minimizes the sum of pairwise dissimilarities between each point and the medoid of its cluster, and the dissimilarity can be any metric.
    > more robust to outliers and noise

In the following script, we investigate the results of K-medoids using 3 pairwise dissimilarity measures — Manhattan, Cosine and Euclidean — and K-means.

import itertools
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids
import matplotlib.pyplot as plt
# Generate data
centers=[[1,1], [-1,-1], [1,-1]]
n_clusters = len(centers)
X, y_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)
# K-medoids models
selected_models = [(KMedoids(metric="manhattan", n_clusters=n_clusters), "KMedoids (manhattan)"),
                   (KMedoids(metric="euclidean", n_clusters=n_clusters), "KMedoids (euclidean)"),
                   (KMedoids(metric="cosine", n_clusters=n_clusters), "KMedoids (cosine)"),
                   (KMeans(n_clusters=n_clusters), 'KMeans')]
colors = ['#d35400', '#34495e', '#2980b9']
fig, ax = plt.subplots(2, 2, figsize=(15, 15))
positions = list(itertools.product([0, 1], repeat=2))  # (row, col) index of each subplot
for s in range(4):
    i, j = positions[s]
    model, name = selected_models[s]
    model.fit(X)
    y_hat = model.predict(X)
    centers_hat = model.cluster_centers_
    inertia = model.inertia_  # not used below, but handy if you want to compare the fits numerically
    ax[i, j].scatter(X[:, 0], X[:, 1], s=50, c=list(map(lambda x: colors[x], y_hat)), alpha=0.5)
    ax[i, j].scatter(centers_hat[:, 0], centers_hat[:, 1], marker='+', c='black', s=200)
    ax[i, j].scatter(np.array(centers)[:, 0], np.array(centers)[:, 1], marker='o', c='black', s=200, alpha=0.6)
    ax[i, j].set_title(name)
    ax[i, j].set_xticks([])
    ax[i, j].set_yticks([])
plt.show()

In each plot the fitted cluster centers are represented with a cross, while the true centers are shown with a circle. Comparing all four, we can see that the cosine metric performs the worst. K-medoids with the Euclidean distance gives results very similar to K-means, which was expected. Lastly, the Manhattan distance gives results very similar to the Euclidean distance.

K-means is a relatively simple algorithm and yet very powerful. Thanks to its easy and fast implementation, it has rightly become one of the most popular algorithms for clustering tasks.

On the other hand, it should be noted that K-means is only guaranteed to converge to a local minimum, not the global one, although it is guaranteed to converge. One of its most irritating assumptions is that all clusters are spherical and of similar size. This stems from the fact that a cluster's center is simply the mean of all the points in that cluster.
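A common way to soften the local-minimum issue is to restart the algorithm from several different initializations and keep the best run, which is what the n_init parameter of scikit-learn's KMeans does; a minimal sketch (20 restarts is an arbitrary choice):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=3000, centers=3, cluster_std=0.7)
# Restart K-means 20 times from different k-means++ initializations and keep
# the run with the lowest inertia (sum of squared distances to the closest centroid).
kmeans = KMeans(n_clusters=3, n_init=20)
kmeans.fit(X)
print(kmeans.inertia_)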
