K-Means: Clustering on coke

_The K-Means algorithm_

Finally, we're starting with unsupervised learning! My excitement is at its peak, because after the series of algorithms in this section we will finally move on to deep learning and get to implement papers! How cool is that!?

Photo by Bouziane Zinou on Unsplash

But for the time being, let's focus on this algorithm.

So what is K-Means?

It is an unsupervised algorithm (the data is not labelled) that is used to partition a dataset into a predefined number of clusters. The main goal is to group all similar points together and find hidden structure in your dataset, and trust me, there always is some.

Now that we know about clusters, we know that they should only contain points that are similar to each other. So we can define the aim of the algorithm: to minimize the distance between the points within each cluster. K-Means is a centroid-based (distance-based) algorithm: each cluster is assigned a centroid, and distances are measured from that centroid to every point in the cluster. As we keep minimizing the sum of distances between the points and their centroids, we get closer and closer to a final model with clearly defined clusters.
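To make that aim precise, here is the objective K-Means minimizes, written in my own notation (the original post states it only in words); it is the same within-cluster sum of squares (WCSS) that the elbow method below plots:

$$
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
$$

where $C_k$ is the set of points in cluster $k$ and $\mu_k$ is its centroid.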

Working -

- Initialization - We start by selecting K random points from the dataset; these points will act as the centroids. (How to choose the value of K is another story, which we will soon delve into.)

- Assignment - For each point we calculate its distance to each of the K centroids, and then assign the point to the cluster whose centroid is closest to it. This builds K clusters over our dataset.

- Update centroids - We calculate the mean of the points in each cluster and assign that mean as the new centroid for the cluster.

- Repeat - Repeat steps 2 and 3 until convergence, that is, either when we reach the maximum number of iterations or when the change in the centroids is minimal. (A compact sketch of this loop follows just below.)

And that's it!
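Before we get to the full class-based implementation, here is a minimal vectorized sketch of that loop in NumPy; this is my own condensed version, assuming X is an (n_samples, n_features) array:

import numpy as np

def kmeans_sketch(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k distinct points as the starting centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iters):
        # step 2: distance of every point to every centroid, keep the nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: new centroid = mean of the points assigned to it
        # (no guard for empty clusters here; see the caveat later in the post)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # step 4: stop once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids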

Now, for finding the value of K, which can be done through trial and error (time-consuming), we will study a new method: the elbow method.

This method is a visual technique used to determine the best K value for a K-Means clustering algorithm. A graph known as the elbow graph plots the within-cluster sum of squares (WCSS) against various values of K. The optimal K is identified at the point where the graph bends like an elbow.

Elbow plot of WCSS against K (image not original)
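As a rough sketch of how you might produce such a plot yourself: this uses the KMeans class we implement below, wcss is my own helper, and the data is synthetic:

import numpy as np

np.random.seed(0)
X = np.random.randn(200, 2)  # synthetic data, two features

def wcss(X, labels, centroids):
    # within-cluster sum of squares: squared distance of every point
    # to the centroid of the cluster it was assigned to
    return sum(np.sum((X[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))

for k in range(1, 9):
    model = KMeans(k=k, max_iters=150)
    labels = model.predict(X)
    print(k, wcss(X, labels, model.centroids))  # plot these to spot the elbow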

There is another version of this algorithm, called K-Means++, that exists to solve a hidden problem. With plain random initialization, the starting centroids can land badly: two of them may fall inside the same natural cluster and split it, while another cluster gets no centroid at all, so the algorithm converges to distorted clusters. Because of this, K-Means++ initializes the centroids as far away from each other as possible, so that through the iterations each one settles into the center of its own cluster.
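Here is a minimal sketch of the K-Means++ seeding idea (my own illustration; the class below still uses plain random initialization): each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far, which pushes the picks apart.

import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # first centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # sample the next centroid proportionally to that squared distance
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)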

A pretty simple and fun algorithm. Let's dive into its implementation!

  • Libraries and distance definition - We only need the numpy library for this algorithm. The distance measure we define here is the Euclidean distance, the most basic one.
import numpy as np

def eu_dist(x1, x2):
    # Euclidean distance between two points
    return np.sqrt(np.sum((x1 - x2) ** 2))
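A quick sanity check (my own example), using the classic 3-4-5 triangle:

print(eu_dist(np.array([0, 0]), np.array([3, 4])))  # prints 5.0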
  • Initialization - The KMeans class starts here.
class KMeans:
    def __init__(self, k=5, max_iters=100, plot_steps=False):
        self.k = k
        self.max_iters = max_iters
        self.plot_steps = plot_steps

        # one (initially empty) list of sample indices per cluster
        self.clusters = [[] for _ in range(self.k)]
        # centroid positions, filled in once predict() is called
        self.centroids = []

Here the variables are:

k: Number of clusters.

max_iters: Maximum number of iterations to run the algorithm.

plot_steps: Whether to plot the steps (not used here).

clusters: List of per-cluster sample-index lists.

centroids: List holding the current centroid of each cluster.

  • Prediction - Since this is an unsupervised algorithm, there won't be a training loop over labelled samples; the algorithm makes its predictions directly as the data is fed to it.
def predict(self, X):
    self.X = X
    self.n_samples, self.n_features = X.shape

    # pick k distinct samples at random as the initial centroids
    random_sample = np.random.choice(self.n_samples, self.k, replace=False)
    self.centroids = [self.X[idx] for idx in random_sample]

    for _ in range(self.max_iters):
        # assign every sample to its nearest centroid
        self.clusters = self._create_clusters(self.centroids)

        # recompute each centroid as the mean of its cluster
        centroids_old = self.centroids
        self.centroids = self._get_centroid(self.clusters)

        if self._is_converged(centroids_old, self.centroids):
            break

    return self._get_cluster_labels(self.clusters)

X: Input data.

random_sample: Indices of k randomly chosen data points that serve as the initial centroids.

The algorithm creates the clusters and then iterates, updating the clusters and the centroids until convergence or until the maximum number of iterations is reached. The code is pretty much self-explanatory, but I will go through the custom helper methods below.

1. Get cluster labels - You can find this being used at the end of the predict function. The purpose of this method is to assign a label to each data point based on the cluster it belongs to.
def _get_cluster_labels(self, clusters):
    labels = np.empty(self.n_samples)
    for cluster_idx, cluster in enumerate(clusters):
        for sample_idx in cluster:
            labels[sample_idx] = cluster_idx
    return labels

"""
Initialization:

An empty array labels is created with the same length as the number of
samples in your dataset. This array will store the cluster labels for
each data point.

Assigning Labels:

The method iterates over each cluster. For each cluster, it gets the
indices of the data points that belong to that cluster. It then
assigns the cluster index (which represents the cluster number) to
each data point corresponding to those indices.

Return:

Finally, the array labels is returned, which contains the cluster
label for each data point.
"""

2. Create clusters - We use this to assign every data point to its nearest centroid, forming the clusters.

def _create_clusters(self, centroids):
    clusters = [[] for _ in range(self.k)]
    for idx, sample in enumerate(self.X):
        centroid_idx = self._closest_centroid(sample, centroids)
        clusters[centroid_idx].append(idx)
    return clusters

"""
Initialization:

The method starts by initializing an empty list of clusters,
where each cluster is initially an empty list. The number of
clusters equals the number of centroids (k).

Assigning Data Points to Clusters:

The method then iterates over each data point in the dataset. For
each data point, it calculates the distance to each centroid using
the _closest_centroid method, which identifies the nearest centroid.

The data point is assigned to the cluster corresponding to its
nearest centroid.

Return:

After all data points have been assigned, the method returns the
list of clusters. Each cluster contains the indices of the data
points that belong to that cluster.
"""

3. Finding the closest centroid - This method is used inside the "Create clusters" method; here we finally use the distance function we defined at the start of the code.

def _closest_centroid(self, sample, centroids):
    distances = [eu_dist(sample, point) for point in centroids]
    closest_idx = np.argmin(distances)
    return closest_idx

"""
Calculate Distances:

The method takes a single data point (sample) and a list of
centroids (centroids).
It calculates the distance between the sample and each centroid
using the eu_dist function. This function computes the Euclidean
distance between two points.

Create Distance List:

The calculated distances are stored in a list called distances.
Each entry in this list represents the distance from the sample
to a corresponding centroid.

Find the Closest Centroid:

The method uses np.argmin(distances) to find the index of the
smallest value in the distances list. This index corresponds to
the centroid that is closest to the sample.

Return the Index:

The index of the closest centroid is returned. This index
indicates which centroid the data point is nearest to and will
be used to assign the data point to the corresponding cluster.
"""

4. Get new centroids - This method calculates the mean of all the data points in a cluster to find its new centroid.

def _get_centroid(self, clusters):
    centroids = np.zeros((self.k, self.n_features))
    for cluster_idx, cluster in enumerate(clusters):
        cluster_mean = np.mean(self.X[cluster], axis=0)
        centroids[cluster_idx] = cluster_mean
    return centroids

"""
Initialize Centroids:

The method starts by creating an array centroids filled with zeros.
This array has dimensions (self.k, self.n_features), meaning it has
k rows (one for each cluster) and n_features columns (one for each
feature in the dataset). This array will store the new centroid positions.

Iterate Over Clusters:

The method then iterates over each cluster in the clusters list.
For each cluster, it calculates the mean of all data points belonging
to that cluster. The mean is computed across each feature (dimension)
separately, resulting in a single point (centroid) that represents the
center of that cluster.

Assign New Centroid:

The calculated mean (centroid) is then assigned to the corresponding
row in the centroids array. This row corresponds to the cluster index
(cluster_idx).

Return Centroids:

Once all clusters have been processed, the method returns the updated
centroids array, which now contains the new positions of the centroids
for the next iteration of the K-Means algorithm.
"""

5. Convergence check - Here we check whether the centroids have stopped changing, i.e. whether the distance between each old centroid and its updated version is zero. (Exact equality is fine for this small implementation; in practice a small tolerance is often used instead.)

def _is_converged(self, centroids_old, centroids):
    distances = [eu_dist(centroids_old[i], centroids[i]) for i in range(self.k)]
    return sum(distances) == 0

"""
Calculate Distances Between Old and New Centroids:

The method calculates the Euclidean distance between each pair of
old and new centroids. Each entry in this list represents the distance
between an old centroid and its corresponding new centroid.

Check for Convergence:

The method sums up all the distances in the distances list.
If the sum of these distances is zero, it means that all the centroids
have stopped moving. This indicates that the algorithm has converged,
as there are no further updates to the centroids.

Return Convergence Status:

The method returns True if the centroids have converged (sum of
distances is zero) and False otherwise.
"""

That was it, guys, for this implementation. I hope you understood everything and will implement this algorithm on your own too!

Happy Coding! See ya next time!
