Exploring KMeans Clustering: Implementation, Advantages, and Challenges
In this blog post, we will explore how the KMeans algorithm can be used to cluster articles.
I walk through the implementation in detail in the following video
Many of the fundamental steps are discussed in my previous article on keyword extraction, which can be found here
To evaluate the effectiveness of KMeans, we will select articles from three distinct categories: Geography, Sports, and Mathematics. For each category, we pick the articles that match a few category-specific keywords, shuffle them, and keep up to 1000.
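import pandas as pd

# df is assumed to already hold the full corpus, with the text in an 'articles' column.
# Sports: keep keyword matches, shuffle, and take up to 1000 rows.
# .copy() guards against a possible SettingWithCopyWarning when the new column is assigned.
pattern = r'\b(?:player|team|score)\b'
sports_df = df[df['articles'].str.contains(pattern, case=False)].sample(frac=1)[:1000].copy()
sports_df["category"] = "sports"
sports_df.shape
# Mathematics: same steps with mathematics keywords.
pattern = r'\b(?:theorem|proof|equation|formula)\b'
maths_df = df[df['articles'].str.contains(pattern, case=False)].sample(frac=1)[:1000].copy()
maths_df["category"] = "mathematics"
maths_df.shape
# Geography: same steps with geography keywords.
pattern = r'\b(?:river|lake|ocean)\b'
geo_df = df[df['articles'].str.contains(pattern, case=False)].sample(frac=1)[:1000].copy()
geo_df["category"] = "geography"
geo_df.shape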
Let’s combine these articles into a unified dataframe.
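# Stack the three samples, drop exact duplicate rows, and shuffle.
# An article that matched more than one pattern keeps one row per category,
# because those rows differ in the "category" column.
df = pd.concat([sports_df, maths_df, geo_df], axis=0).drop_duplicates().sample(frac=1)
df.shape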
KMeans Intuition
Let’s build a basic intuition for how KMeans operates. To do so, we’ll simplify the problem to nine articles (three each on Sports, Mathematics, and Geography). Here’s how the articles appear in vector space.

Suppose we want to split these articles into three clusters. A human can easily see a boundary that separates them, but KMeans has to arrive at it over multiple iterations.
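To make the idea of "multiple iterations" concrete, here is a minimal pure-Python sketch of the usual KMeans loop: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until the assignments stop changing. The nine 2-D coordinates and the three starting centroids are made up for illustration. Real article vectors would have many more dimensions, and a library implementation would choose its starting centroids differently.

```python
import math

# Nine hypothetical 2-D points standing in for article vectors (three per topic).
# The coordinates are invented for this sketch, not taken from the dataset.
points = [
    (1.0, 1.2), (1.3, 0.8), (0.8, 1.0),   # "sports"
    (5.0, 5.2), (5.3, 4.8), (4.8, 5.1),   # "mathematics"
    (9.0, 1.1), (9.2, 0.7), (8.7, 1.3),   # "geography"
]

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = None
    for iteration in range(1, max_iter + 1):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids, iteration

# Hypothetical starting centroids, chosen so the first pass misplaces one point.
start = [(1.0, 1.0), (7.5, 2.5), (10.0, 0.0)]
labels, centroids, n_iter = kmeans(points, start)
print(labels, n_iter)  # [0, 0, 0, 1, 1, 1, 2, 2, 2] 3
```

On the first pass, the last geography point lands in the mathematics cluster because the second starting centroid happens to be closer. Once the centroids move, the second pass corrects it, and the third pass confirms that nothing changes. A worse starting choice can leave the loop stuck on a poor grouping, which is why the initial centroids matter.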