K-Means: Clustering on coke

_The K-Means algorithm_

Finally, we're starting with unsupervised learning! My excitement is at its peak, because after the series of algorithms in this section we will finally move on to deep learning and get to implement papers! How cool is that!?

Photo by Bouziane Zinou on Unsplash

But for the time being, let's focus on this algorithm.

So what is K-Means?

It is an unsupervised algorithm (the data is not labelled) that is used to partition a dataset into a predefined number of clusters. The main goal is to group all similar points together and find hidden structure in your dataset, and trust me, there always is some.

Now that we know about clusters, we know that they should only contain points that are similar to each other. So we can define the aim of the algorithm: to minimize the distance between the points within each cluster. K-Means is a centroid-based (distance-based) algorithm: each cluster is assigned a centroid, and distances are measured from that centroid to every point in the cluster. As we keep minimizing the sum of distances between the points and their centroids, we get closer and closer to a final model with clearly defined clusters.
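To make that aim precise, here is the objective K-Means minimizes, written in my own notation (the original post states it only in words); it is the same within-cluster sum of squares (WCSS) that the elbow method below plots:

$$
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
$$

where $C_k$ is the set of points in cluster $k$ and $\mu_k$ is its centroid.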

Working -

- Initialization - We start by selecting K random points from the dataset; these points will act as the centroids. (How to choose the value of K is another story, which we will soon delve into.)

- Assignment - For each point we calculate its distance to each of the K centroids, and then assign the point to the cluster whose centroid is closest to it. This builds K clusters over our dataset.

- Update centroids - We calculate the mean of the points in each cluster and assign that mean as the new centroid for the cluster.

- Repeat - Repeat steps 2 and 3 until convergence, that is, either when we reach the maximum number of iterations or when the change in the centroids is minimal. (A compact sketch of this loop follows just below.)

And that's it!
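Before we get to the full class-based implementation, here is a minimal vectorized sketch of that loop in NumPy; this is my own condensed version, assuming X is an (n_samples, n_features) array:

import numpy as np

def kmeans_sketch(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k distinct points as the starting centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iters):
        # step 2: distance of every point to every centroid, keep the nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: new centroid = mean of the points assigned to it
        # (no guard for empty clusters here; see the caveat later in the post)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # step 4: stop once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids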

Now, for finding the value of K, which can be done through trial and error (time-consuming), we will study a new method: the elbow method.

This method is a visual technique used to determine the best K value for a K-Means clustering algorithm. A graph known as the elbow graph plots the within-cluster sum of squares (WCSS) against various values of K. The optimal K is identified at the point where the graph bends like an elbow.

Elbow plot of WCSS against K (image not original)
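As a rough sketch of how you might produce such a plot yourself: this uses the KMeans class we implement below, wcss is my own helper, and the data is synthetic:

import numpy as np

np.random.seed(0)
X = np.random.randn(200, 2)  # synthetic data, two features

def wcss(X, labels, centroids):
    # within-cluster sum of squares: squared distance of every point
    # to the centroid of the cluster it was assigned to
    return sum(np.sum((X[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))

for k in range(1, 9):
    model = KMeans(k=k, max_iters=150)
    labels = model.predict(X)
    print(k, wcss(X, labels, model.centroids))  # plot these to spot the elbow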

There is another version of this algorithm, called K-Means++, that exists to solve a hidden problem. With plain random initialization, the starting centroids can land badly: two of them may fall inside the same natural cluster and split it, while another cluster gets no centroid at all, so the algorithm converges to distorted clusters. Because of this, K-Means++ initializes the centroids as far away from each other as possible, so that through the iterations each one settles into the center of its own cluster.
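Here is a minimal sketch of the K-Means++ seeding idea (my own illustration; the class below still uses plain random initialization): each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far, which pushes the picks apart.

import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # first centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # sample the next centroid proportionally to that squared distance
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)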

A pretty simple and fun algorithm. Let's dive into its implementation!

  • Libraries and distance definition - We only need the numpy library for this algorithm. The distance measure we define here is the Euclidean distance, the most basic one.
import numpy as np

def eu_dist(x1, x2):
    # Euclidean distance between two points
    return np.sqrt(np.sum((x1 - x2) ** 2))
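A quick sanity check (my own example), using the classic 3-4-5 triangle:

print(eu_dist(np.array([0, 0]), np.array([3, 4])))  # prints 5.0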
  • Initialization - The KMeans class starts here.
class KMeans:
    def __init__(self, k=5, max_iters=100, plot_steps=False):
        self.k = k
        self.max_iters = max_iters
        self.plot_steps = plot_steps

        # one (initially empty) list of sample indices per cluster
        self.clusters = [[] for _ in range(self.k)]
        # centroid positions, filled in once predict() is called
        self.centroids = []

Here the variables are:

k: Number of clusters.

max_iters: Maximum number of iterations to run the algorithm.

plot_steps: Whether to plot the steps (not used here).

clusters: List of per-cluster sample-index lists.

centroids: List holding the current centroid of each cluster.

  • Prediction - Since this is an unsupervised algorithm, there won't be a training loop over labelled samples; the algorithm makes its predictions directly as the data is fed to it.
def predict(self, X):
    self.X = X
    self.n_samples, self.n_features = X.shape

    # pick k distinct samples at random as the initial centroids
    random_sample = np.random.choice(self.n_samples, self.k, replace=False)
    self.centroids = [self.X[idx] for idx in random_sample]

    for _ in range(self.max_iters):
        # assign every sample to its nearest centroid
        self.clusters = self._create_clusters(self.centroids)

        # recompute each centroid as the mean of its cluster
        centroids_old = self.centroids
        self.centroids = self._get_centroid(self.clusters)

        if self._is_converged(centroids_old, self.centroids):
            break

    return self._get_cluster_labels(self.clusters)

X: Input data.

random_sample: Indices of k randomly chosen data points that serve as the initial centroids.

The algorithm creates the clusters and then iterates, updating the clusters and the centroids until convergence or until the maximum number of iterations is reached. The code is pretty much self-explanatory, but I will go through the custom helper methods below.

1. Get cluster labels - You can find this being used at the end of the predict function. The purpose of this method is to assign a label to each data point based on the cluster it belongs to.
def _get_cluster_labels(self, clusters):
    labels = np.empty(self.n_samples)
    for cluster_idx, cluster in enumerate(clusters):
        for sample_idx in cluster:
            labels[sample_idx] = cluster_idx
    return labels

"""
Initialization:

An empty array labels is created with the same length as the number of
samples in your dataset. This array will store the cluster labels for
each data point.

Assigning Labels:

The method iterates over each cluster. For each cluster, it gets the
indices of the data points that belong to that cluster. It then
assigns the cluster index (which represents the cluster number) to
each data point corresponding to those indices.

Return:

Finally, the array labels is returned, which contains the cluster
label for each data point.
"""

2. Create clusters - We use this to assign every data point to its nearest centroid, forming the clusters.

def _create_clusters(self, centroids):
    clusters = [[] for _ in range(self.k)]
    for idx, sample in enumerate(self.X):
        centroid_idx = self._closest_centroid(sample, centroids)
        clusters[centroid_idx].append(idx)
    return clusters

"""
Initialization:

The method starts by initializing an empty list of clusters,
where each cluster is initially an empty list. The number of
clusters equals the number of centroids (k).

Assigning Data Points to Clusters:

The method then iterates over each data point in the dataset. For
each data point, it calculates the distance to each centroid using
the _closest_centroid method, which identifies the nearest centroid.

The data point is assigned to the cluster corresponding to its
nearest centroid.

Return:

After all data points have been assigned, the method returns the
list of clusters. Each cluster contains the indices of the data
points that belong to that cluster.
"""

3. Finding the closest centroid - This method is used inside the "Create clusters" method; here we finally use the distance function we defined at the start of the code.

def _closest_centroid(self, sample, centroids):
    distances = [eu_dist(sample, point) for point in centroids]
    closest_idx = np.argmin(distances)
    return closest_idx

"""
Calculate Distances:

The method takes a single data point (sample) and a list of
centroids (centroids).
It calculates the distance between the sample and each centroid
using the eu_dist function. This function computes the Euclidean
distance between two points.

Create Distance List:

The calculated distances are stored in a list called distances.
Each entry in this list represents the distance from the sample
to a corresponding centroid.

Find the Closest Centroid:

The method uses np.argmin(distances) to find the index of the
smallest value in the distances list. This index corresponds to
the centroid that is closest to the sample.

Return the Index:

The index of the closest centroid is returned. This index
indicates which centroid the data point is nearest to and will
be used to assign the data point to the corresponding cluster.
"""

4. Get new centroids - This method calculates the mean of all the data points in a cluster to find its new centroid.

def _get_centroid(self, clusters):
    centroids = np.zeros((self.k, self.n_features))
    for cluster_idx, cluster in enumerate(clusters):
        cluster_mean = np.mean(self.X[cluster], axis=0)
        centroids[cluster_idx] = cluster_mean
    return centroids

"""
Initialize Centroids:

The method starts by creating an array centroids filled with zeros.
This array has dimensions (self.k, self.n_features), meaning it has
k rows (one for each cluster) and n_features columns (one for each
feature in the dataset). This array will store the new centroid positions.

Iterate Over Clusters:

The method then iterates over each cluster in the clusters list.
For each cluster, it calculates the mean of all data points belonging
to that cluster. The mean is computed across each feature (dimension)
separately, resulting in a single point (centroid) that represents the
center of that cluster.

Assign New Centroid:

The calculated mean (centroid) is then assigned to the corresponding
row in the centroids array. This row corresponds to the cluster index
(cluster_idx).

Return Centroids:

Once all clusters have been processed, the method returns the updated
centroids array, which now contains the new positions of the centroids
for the next iteration of the K-Means algorithm.
"""

5. Convergence check - Here we check whether the centroids have stopped changing, i.e. whether the distance between each old centroid and its updated version is zero. (Exact equality is fine for this small implementation; in practice a small tolerance is often used instead.)

def _is_converged(self, centroids_old, centroids):
    distances = [eu_dist(centroids_old[i], centroids[i]) for i in range(self.k)]
    return sum(distances) == 0

"""
Calculate Distances Between Old and New Centroids:

The method calculates the Euclidean distance between each pair of
old and new centroids. Each entry in this list represents the distance
between an old centroid and its corresponding new centroid.

Check for Convergence:

The method sums up all the distances in the distances list.
If the sum of these distances is zero, it means that all the centroids
have stopped moving. This indicates that the algorithm has converged,
as there are no further updates to the centroids.

Return Convergence Status:

The method returns True if the centroids have converged (sum of
distances is zero) and False otherwise.
"""

That was it, guys, for this implementation. I hope you understood everything and will implement this algorithm on your own too!

Happy Coding! See ya next time!
