Build K-Means from Scratch

Explore Underlying Concepts and Implementation of K-Means Algorithm

Ayo Akinkugbe
Technological Singularity
6 min read · Apr 27, 2024


Photo by Bryan Garces on Unsplash

Introduction

K-means is a machine learning algorithm that groups unlabelled data into clusters; this is referred to as unsupervised learning. Unlabelled data have no prior classification. The K-means algorithm detects patterns in the dataset and places each data point into one of k clusters, where k is the number of clusters specified by the user. It is a prime machine learning technique for use cases where data labels are not available. Real-world use cases include customer segmentation, fraud detection, network analysis and cybercrime identification.

This project explores how the algorithm works, highlighting the mathematical concepts used in its development. The project:

  • examines the mathematical concepts behind K-means clustering
  • builds and implements the K-means algorithm using NumPy
  • visualizes the clustering iteration process using Principal Component Analysis (PCA) and Matplotlib
  • evaluates accuracy using the original labels of the dataset
  • compares results to the scikit-learn K-means implementation

Dataset

This implementation uses the Iris flower dataset. The Iris dataset is a popular dataset containing sepal length, sepal width, petal length and petal width for 3 different types of irises (Setosa, Versicolour and Virginica), stored in a 150x4 array. Each type of iris has 50 instances in the dataset. The labels are removed for K-means clustering and later used to evaluate the accuracy of the algorithm after implementation.

Theory

The K-means algorithm is a distance-based algorithm. It uses the Euclidean distance from an assumed central point, called a centroid, to assign each data point of the dataset to its nearest cluster, repeating over several iterations until convergence. Below are the key steps of the algorithm:

  1. Choose k, the number of intended clusters.
  2. Randomly create k cluster centers (centroids).
  3. Assign each data point to the centroid it is closest to. This is determined by calculating the Euclidean distance between each data point and each centroid; the Euclidean distance between two points is the L2 norm of their difference (see the formulas after this list).
  4. Label each data point according to its nearest centroid based on the Euclidean distance.
  5. Find new centroids for each labelled cluster. This can be determined by calculating the geometric mean of each cluster.
  6. Repeat steps 3, 4 and 5 until convergence or until the maximum number of iterations is reached.
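
Written out, the two quantities computed in each iteration are the point-to-centroid distance and the centroid update. Below is a minimal LaTeX sketch of those formulas; note that this implementation updates centroids with the geometric mean rather than the more common arithmetic mean.

% Euclidean (L2) distance from point x_i to centroid mu_j over the 4 features
d(x_i, \mu_j) = \lVert x_i - \mu_j \rVert_2 = \sqrt{\sum_{m=1}^{4} (x_{im} - \mu_{jm})^2}

% Assignment: each point goes to its nearest centroid
c_i = \arg\min_{j} \, d(x_i, \mu_j)

% Centroid update used in this project: geometric mean of the points in cluster C_j
\mu_j = \exp\!\left( \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} \log x_i \right)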

Code Implementation

A copy of the code and data files for this project can be found here.

Step 1 — Import dataset

# Import Libraries

import pandas as pd
import numpy as np

#Import data

iris = pd.read_csv('IRIS.csv')

iris

Step 2 — Clean dataset

Since K-means is a clustering algorithm, the assumption is that no label exists for the dataset. The Iris dataset has a label column which will be useful later for evaluating the algorithm. This step drops the label.

iris_u = iris.drop(columns=['species'])

iris_u

Most clustering algorithms do not work well with missing values. This step drops any rows with missing values in the feature columns.

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]



iris_u = iris_u.dropna(subset = features)

iris_u

Step 3 — Scale dataset

This step min-max scales the dataset so that features with large ranges do not dominate the distance calculation. The values are then rescaled into the range 1 to 6 so that every value stays strictly positive, which matters later when the geometric mean (which takes logs) is computed.

iris_u  = ((iris_u - iris_u.min()) / (iris_u.max() - iris_u.min())) * 5 + 1
iris_u
data = iris_u.copy()
data

Step 4 — Create Random Centroids

This step samples centroids directly from the data using the pandas sample method and converts each sampled value to a float. Sampling from the data, rather than generating random values, keeps the centroids within the bounds of the dataset.

def random_centroids(dataset, k):
    centroids = []
    for i in range(k):
        # Sample one value per column to form a centroid within the bounds of the data
        centroid = dataset.apply(lambda x: float(x.sample()))
        centroids.append(centroid)
    return pd.concat(centroids, axis=1)
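
For example, centroids for k = 3 clusters can be generated as follows (a short illustrative call using the variables defined above):

# Sample 3 random centroids from the scaled data
centroids = random_centroids(data, 3)
centroids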

Step 5— Assign Cluster to Each Point in Dataset

To assign each data point to a cluster, this step calculates the Euclidean distance (or L2 norm) between every point and every centroid using np.linalg.norm, then uses idxmin to find, for each point, the centroid with the minimum distance.


def get_labels(dataset, centroids):
    # Distance from every point to each centroid, one column per centroid
    distances = centroids.apply(lambda x: np.linalg.norm(dataset - x, axis=1))
    # Index of the closest centroid for each point
    return distances.idxmin(axis=1)

labels = get_labels(data, centroids)

labels

Step 6— Create New Centroids

To create new centroids, this function:

  • Groups the data by cluster labels
  • Calculates the geometric mean of each group by taking the log of each value, averaging the logs, and exponentiating the result

def new_centroids(data, labels, k):
    # Geometric mean per cluster: exp(mean(log(x))), transposed so each column is a centroid
    centroids = data.groupby(labels).apply(lambda x: np.exp(np.log(x).mean())).T
    return centroids
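
As a quick check, the updated centroids can be recomputed from the labels produced in the previous step (variable names follow the earlier snippets):

# Recompute centroids from the current cluster assignments
centroids = new_centroids(data, labels, 3)
centroids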

Step 7 [Optional] — Create Iteration Visualization using PCA

This step uses the PCA algorithm to reduce the dataset to two dimensions and plots the result with Matplotlib to visualize the clustering iterations.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from IPython.display import clear_output

def plot_iterations(dataset, labels, centroids, iteration):
    pca = PCA(n_components=2)
    dataset_2d = pca.fit_transform(dataset)    # Project the data to 2 dimensions
    centroids_2d = pca.transform(centroids.T)  # Transpose the centroids and project to 2 dimensions
    clear_output(wait=True)                    # Clear the output after each iteration
    plt.title(f'Iteration {iteration}')
    plt.scatter(x=dataset_2d[:,0], y=dataset_2d[:,1], c=labels)  # Plot the 2D data, colored by cluster label
    plt.scatter(x=centroids_2d[:,0], y=centroids_2d[:,1])        # Plot the centroids
    plt.show()

Step 8—Create Function Incorporating Repeat Loop

This step aggregates all the previous functions into a single function. It also includes a print statement showing the number of points assigned to each cluster after the last iteration.

def constructed_K_means(max_iterations, centroid_count, iterations):

    centroids = random_centroids(data, centroid_count)
    old_centroids = pd.DataFrame()

    while iterations < max_iterations and not centroids.equals(old_centroids):
        old_centroids = centroids

        labels = get_labels(data, centroids)
        centroids = new_centroids(data, labels, centroid_count)
        plot_iterations(data, labels, centroids, iterations)
        iterations += 1

    print('Datapoints in each cluster: ')
    print(f'{labels.value_counts()}')


max_iterations = 70
centroid_count = 3
iterations = 50

constructed_K_means(max_iterations, centroid_count, iterations)

Evaluation

This section evaluates the results of the algorithm against the original data labels that were dropped before processing. First, the label categories are encoded as integers.

original_labels = iris['species']

# Define a dictionary to map categories to integers
category_map = {'Iris-setosa': 0, 'Iris-virginica': 1, 'Iris-versicolor': 2}

# Use the map function to replace categories with integers
original_labels = original_labels.map(category_map)

original_labels

Then the original label distribution is plotted using PCA and Matplotlib. The encoded categories are used to color the data points in the visualization.

def plot_data(dataset, labels):
    pca = PCA(n_components=2)
    dataset_2d = pca.fit_transform(dataset)
    plt.title('Real Classes')
    plt.scatter(x=dataset_2d[:,0], y=dataset_2d[:,1], c=labels)
    plt.show()

plot_data(iris_u, original_labels)

print('Datapoints in each Label: ')
print(f'{original_labels.value_counts()}')
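
Since constructed_K_means does not return the final labels, a quantitative comparison can be made by re-running the labelling loop and scoring agreement with the original labels. The snippet below is a minimal illustrative sketch assuming the earlier snippets have been run and scikit-learn is installed; adjusted_rand_score is used because it is invariant to how the cluster ids are numbered.

from sklearn.metrics import adjusted_rand_score

# Re-run the labelling loop to obtain final cluster labels (illustrative sketch)
centroids = random_centroids(data, 3)
old_centroids = pd.DataFrame()
for _ in range(100):  # cap iterations in case of slow convergence
    if centroids.equals(old_centroids):
        break
    old_centroids = centroids
    labels = get_labels(data, centroids)
    centroids = new_centroids(data, labels, 3)

# Agreement between constructed clusters and the true species labels (1.0 = perfect agreement)
print(f'Adjusted Rand Index: {adjusted_rand_score(original_labels, labels):.3f}')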

Below are the visuals of the constructed K-means algorithm clustering the same dataset.

Compare Results

Overall, it is hard to compare results with the scikit-learn output directly, since clustering is not a prediction and the cluster assignments can change on each run. However, the comparison below shows that the cluster sizes from scikit-learn are similar to those produced by the constructed algorithm.

from sklearn.cluster import KMeans

kmeans = KMeans(3)
kmeans.fit(data)

Print and compare centroids.

pd.DataFrame(kmeans.cluster_centers_, columns=features).T

Print and compare cluster sizes.

cluster_counts = np.bincount(kmeans.labels_)
for cluster_id, count in enumerate(cluster_counts):
    print(f"Cluster {cluster_id}: {count} points")

Conclusion

Based on its derivation, the K-means algorithm makes some fundamental assumptions that make it suitable for certain use cases. These assumptions include:

  • Clusters are spherical and isotropic (equal in all directions). This follows from the way the algorithm works, since each centroid is simply the mean of its cluster.
  • Clusters have the same variance
  • Clusters have similar sizes
  • Each cluster has roughly the same prior probability

Though useful for many applications, the algorithm might not work well in the following scenarios:

  • Large datasets, as it is computationally expensive: the time complexity of the K-means algorithm is O(NTK), where N is the total number of data points, K is the number of clusters, and T is the number of iterations in the clustering process.
  • Detecting whether any clustering exists at all: K-means will partition even uniformly distributed data into k clusters (see the sketch after this list).
  • Outliers: the algorithm does not handle outliers well, since they pull centroids away from the true cluster centers.
  • Density-based or non-convex clusters, which are better handled by density-based algorithms.
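
To illustrate the second point, here is a minimal sketch (assuming scikit-learn is installed) showing that K-means still returns k clusters of roughly equal size on data with no cluster structure at all:

import numpy as np
from sklearn.cluster import KMeans

# Uniform random data with no underlying cluster structure
rng = np.random.default_rng(0)
uniform_data = rng.uniform(0, 1, size=(300, 4))

# K-means still partitions it into 3 clusters
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(uniform_data)
print(np.bincount(km.labels_))  # roughly equal cluster sizes despite no real clusters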

A benefit of dissecting ML algorithms as this project does is understanding an algorithm's strengths and flaws, thereby allowing for customization when required.
