Exploring KMeans Clustering: Implementation, Advantages, and Challenges

Shrinath Suresh
4 min read · Feb 27, 2024

In this blog post, we will explore how the KMeans algorithm can be used to cluster articles.

I take a deep dive into the implementation in the following video.

Many of the fundamental steps are discussed in my previous article on keyword extraction, which can be found here:

https://medium.com/@shrinath.suresh/keyword-extraction-strategies-for-large-document-clusters-b9d3d8c7c8d5

To evaluate the effectiveness of KMeans, we will select articles from three distinct categories: Geography, Sports, and Mathematics. Each category is identified by a case-insensitive, whole-word keyword match on the article text; we then shuffle the matches and keep up to 1000 articles per category.
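The snippets that follow pick each category with a whole-word, case-insensitive keyword pattern. As a small illustration of how such a pattern behaves, here is a sketch using only Python's built-in `re` module; the sample sentences are made up for this example and are not taken from the dataset.

```python
import re

# Same keyword pattern as the sports filter in this post
pattern = re.compile(r'\b(?:player|team|score)\b', re.IGNORECASE)

# Made-up sentences, purely for illustration
samples = [
    "The Team won the final.",         # whole word, different case
    "Both teams played well.",         # plural form only
    "He underscored the main point.",  # keyword inside a longer word
    "The river floods every spring.",  # no sports keyword at all
]

# True where the pattern is found anywhere in the sentence
matches = [bool(pattern.search(text)) for text in samples]

for text, hit in zip(samples, matches):
    print(f"{hit!s:5}  {text}")
```

Because of the `\b` word boundaries, only the first sentence matches: plural forms such as "teams" and embedded occurrences such as "underscored" are skipped. If plurals should count, one possible adjustment is to add an optional `s?` after the keyword group.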

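# Sports: articles containing "player", "team", or "score" as whole words
pattern = r'\b(?:player|team|score)\b'
# na=False skips missing text; shuffle, keep at most 1000, and copy before adding a column
sports_df = df[df['articles'].str.contains(pattern, case=False, na=False)].sample(frac=1)[:1000].copy()
sports_df["category"] = "sports"
sports_df.shape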

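# Mathematics: articles containing "theorem", "proof", "equation", or "formula" as whole words
pattern = r'\b(?:theorem|proof|equation|formula)\b'
# na=False skips missing text; shuffle, keep at most 1000, and copy before adding a column
maths_df = df[df['articles'].str.contains(pattern, case=False, na=False)].sample(frac=1)[:1000].copy()
maths_df["category"] = "mathematics"
maths_df.shape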

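# Geography: articles containing "river", "lake", or "ocean" as whole words
pattern = r'\b(?:river|lake|ocean)\b'
# na=False skips missing text; shuffle, keep at most 1000, and copy before adding a column
geo_df = df[df['articles'].str.contains(pattern, case=False, na=False)].sample(frac=1)[:1000].copy()
geo_df["category"] = "geography"
geo_df.shape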
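The three snippets above repeat the same steps, so they could be folded into one helper and the results stacked into a single labelled DataFrame. The sketch below is one possible way to do that, not necessarily how the original notebook proceeds: `sample_category`, the `seed` argument, the `pd.concat` step, and the toy `df` are all assumptions introduced here for illustration.

```python
import pandas as pd

def sample_category(df, pattern, label, n=1000, seed=None):
    """Return up to n shuffled rows whose 'articles' text matches pattern (hypothetical helper)."""
    mask = df['articles'].str.contains(pattern, case=False, na=False)
    out = df[mask].sample(frac=1, random_state=seed)[:n].copy()
    out['category'] = label
    return out

# Toy stand-in for the real article DataFrame (made-up rows)
df = pd.DataFrame({'articles': [
    "The team beat the top score.",
    "A proof of the theorem follows.",
    "The river flows into the lake.",
    "Each player crossed the river.",  # matches both sports and geography
    None,                              # missing text is skipped by na=False
]})

# Same keyword patterns as in the post
patterns = {
    'sports': r'\b(?:player|team|score)\b',
    'mathematics': r'\b(?:theorem|proof|equation|formula)\b',
    'geography': r'\b(?:river|lake|ocean)\b',
}

# Stack the per-category samples into one labelled DataFrame
combined = pd.concat(
    [sample_category(df, p, label, seed=0) for label, p in patterns.items()],
    ignore_index=True,
)
counts = combined['category'].value_counts().to_dict()
print(counts)
```

Note that the toy row mentioning both "player" and "river" ends up in the combined frame twice, once per label. With keyword-based categories the same can happen on real data, so it may be worth checking for duplicated articles before clustering.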
