First Day of Deep Learning - Clustering: What Beginners Must Know About K-means Clustering and How to Implement It in Python

Han HELOIR, Ph.D. ☕️
Published in Predict · 3 min read · Jul 6, 2023

What is clustering?

Clustering is a fundamental data analysis technique that groups similar objects or data points together based on their characteristics. It helps uncover inherent structures, similarities, and relationships within datasets. Clustering has applications in diverse domains, including customer segmentation, image recognition, anomaly detection, document clustering, and recommendation systems.

What is the K-means algorithm?

K-means clustering has emerged as one of the most widely used and well-established techniques among the various clustering algorithms available. It is an unsupervised learning algorithm that partitions data points into distinct groups or clusters based on their similarity. K-means is relatively simple, efficient, and effective in various applications. Let’s delve deeper into the K-means algorithm and understand how it works.

Clustering is a process of organizing data points into groups or clusters based on their similarity. The goal is to create clusters such that objects within the same cluster are more similar to each other than to those in other clusters. Similarity is typically measured using distance metrics, such as Euclidean distance or cosine similarity.
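
For example, the Euclidean distance between two points is simply the square root of the sum of squared differences between their coordinates. Here is a minimal sketch with NumPy (the two points are made up purely for illustration):

import numpy as np

# Two example data points (hypothetical values, for illustration only)
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared coordinate differences
distance = np.linalg.norm(a - b)
print(distance)  # 5.0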

The K-means algorithm is an iterative, centroid-based clustering technique. It starts by randomly initializing K cluster centroids, where K represents the desired number of clusters. The algorithm then iteratively assigns each data point to the nearest centroid and recalculates the centroid’s position based on the mean of the data points assigned to it. This process continues until convergence, where the centroids stabilize, and data points no longer change clusters significantly.

Illustration of How K-means Clustering Works Step-by-step

Step 1: Initialization

  • Randomly select K initial centroids.

Step 2: Assignment

  • Calculate the distance between each data point and all centroids.
  • Assign each data point to the nearest centroid.

Step 3: Update

  • Recalculate the position of each centroid by computing the mean of the data points assigned to it.

Step 4: Iteration

  • Repeat Steps 2 and 3 until convergence.

Step 5: Convergence

  • The algorithm converges when the centroids no longer change significantly, or a predefined number of iterations is reached.
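
To make these five steps concrete, here is a minimal from-scratch sketch of the algorithm in plain NumPy. The function name kmeans and the parameters K, max_iters, and seed are illustrative choices, not part of any library; the scikit-learn implementation used later in this post is what you would normally use in practice.

import numpy as np

def kmeans(X, K, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: Initialization - pick K data points at random as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Step 2: Assignment - assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: Update - move each centroid to the mean of the points assigned to it
        # (assumes no cluster ends up empty; a production implementation would handle that case)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Steps 4 and 5: Iteration and convergence - stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids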

Import libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Load your dataset

dataset = pd.read_csv('yourdataset.csv')
# Use two feature columns (here, the columns at index 3 and 4) so the clusters can be plotted in 2D
X = dataset.iloc[:, [3, 4]].values

Using the elbow method to find the optimal number of clusters

The elbow method runs K-means for a range of cluster counts and plots the within-cluster sum of squares (WCSS) for each. The "elbow" of the curve, where adding more clusters stops reducing WCSS substantially, suggests a good value for K.

from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the within-cluster sum of squares (WCSS)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Training the K-Means model on the dataset

# Train the final model with the number of clusters suggested by the elbow plot (5 in this example)
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)  # cluster label assigned to each data point
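
As a quick sanity check, the resulting clusters and their centroids can be plotted. This sketch assumes X contains exactly two feature columns, as in the loading step above; the marker sizes and labels are arbitrary choices:

for cluster in range(5):
    # Plot the points assigned to this cluster
    plt.scatter(X[y_kmeans == cluster, 0], X[y_kmeans == cluster, 1], s=30, label=f'Cluster {cluster + 1}')
# Mark the final centroid of each cluster
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='black', marker='X', label='Centroids')
plt.title('Clusters and their centroids')
plt.legend()
plt.show()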
