
Exploring KMeans Clustering: Implementation, Advantages, and Challenges

Shrinath Suresh
4 min read · Feb 27, 2024

In this blog post, we will explore how the KMeans algorithm can be used to cluster articles.

I walk through the implementation in detail in the following video

Many of the fundamental steps are discussed in my previous article on keyword extraction, which can be found here

https://medium.com/@shrinath.suresh/keyword-extraction-strategies-for-large-document-clusters-b9d3d8c7c8d5

To evaluate the effectiveness of KMeans, we will select articles from three distinct categories: Geography, Sports, and Mathematics. An article is assigned to a category by a simple keyword match, and we randomly sample up to 1000 matching articles from each category.

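import pandas as pd

# Sports: keep articles mentioning any keyword, shuffle, then take up to 1000 rows.
# na=False skips missing articles; .copy() avoids SettingWithCopyWarning on the next line.
pattern = r'\b(?:player|team|score)\b'
sports_df = df[df['articles'].str.contains(pattern, case=False, na=False)].sample(frac=1).head(1000).copy()
sports_df["category"] = "sports"
sports_df.shape

# Mathematics: same procedure with a different keyword set.
pattern = r'\b(?:theorem|proof|equation|formula)\b'
maths_df = df[df['articles'].str.contains(pattern, case=False, na=False)].sample(frac=1).head(1000).copy()
maths_df["category"] = "mathematics"
maths_df.shape

# Geography: same procedure with a different keyword set.
pattern = r'\b(?:river|lake|ocean)\b'
geo_df = df[df['articles'].str.contains(pattern, case=False, na=False)].sample(frac=1).head(1000).copy()
geo_df["category"] = "geography"
geo_df.shape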

Let’s combine these articles into a unified dataframe.

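# An article that matched more than one keyword set appears once per category, so
# deduplicate on the text itself (the first label is kept), then shuffle the rows.
df = pd.concat([sports_df, maths_df, geo_df], axis=0).drop_duplicates(subset="articles").sample(frac=1)
df.shape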

KMeans Intuition

Let’s build a basic intuition for how KMeans operates. To do this, we’ll simplify the problem to nine articles (three on Sports, three on Mathematics, and three on Geography). Here’s how the articles appear in vector space.

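Suppose we want to segment these articles into three clusters. A human can easily draw decision boundaries that separate them visually. KMeans arrives at a similar grouping iteratively: it repeatedly assigns each article to its nearest cluster centre, then moves each centre to the mean of the articles assigned to it, until the assignments stop changing.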
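To make the iterations concrete, here is a minimal NumPy sketch of the assign-and-update loop on nine toy points. The 2D coordinates are made up for illustration (real article vectors would come from a vectorization step and have many more dimensions), and the `kmeans` helper and its `init_idx` parameter are hypothetical names, not part of any library. The starting centroids are deliberately poor (one Sports point and two Mathematics points) so that more than one iteration is needed.

```python
import numpy as np

# Hypothetical 2D stand-ins for nine article vectors (three per topic).
points = np.array([
    [1.0, 1.2], [1.3, 0.9], [0.8, 1.0],   # "sports"
    [5.0, 5.2], [5.3, 4.9], [4.8, 5.1],   # "mathematics"
    [9.0, 1.1], [9.2, 0.8], [8.7, 1.3],   # "geography"
])

def kmeans(X, init_idx, max_iters=20):
    """Plain KMeans loop; init_idx selects the rows of X used as starting centroids."""
    centroids = X[init_idx].astype(float)
    k = len(centroids)
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Start from one sports point and two mathematics points.
labels, centroids = kmeans(points, init_idx=[0, 3, 4])
print(labels)      # cluster index assigned to each of the nine points
print(centroids)   # final cluster centres
```

In the first pass the three geography points are pulled into a cluster with one mathematics point. The centroid update then shifts that centre toward the geography group, and the next assignment step separates the topics cleanly. A different starting choice can leave this simple loop stuck in a worse grouping, which is why library implementations such as scikit-learn's `KMeans` default to a smarter initialization (`k-means++`) and can be restarted several times via `n_init`.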
